S-Words Analysis Tool

Version 1. Created by Paul Baker 2026.

1) Load corpora

Reference corpus

Target corpus

Paragraph breaks count as document separators in a single file

Supported formats: TXT, CSV, TSV, JSON. CSV/TSV/JSON may include text, date, author, genre, source.

No corpora loaded yet. Choose files or folders for the reference and target corpora.

2) Divergence settings

Candidate words from
Similarity metric

Collocation measure
Window

Min pair freq
Min doc spread
Spread type

Min candidate collocates n (Target)
Min candidate collocates n (Reference)

Top N words

Top K for comparison
Top collocates shown

MI / LL smoothing

Filter stopwords

Editable stopword list (one word per line)

Add or remove words as needed. The list is used when “Filter stopwords” is ticked.

Corpus summary

Ref docs

Ref tokens

Tgt docs

Tgt tokens

Reference sources

Target sources

Divergence ranking

Load both corpora, choose settings, then compare all eligible words.

Reference collocates

No word selected.

Target collocates

No word selected.

KWIC

No query yet.

KWIC window
Show corpus
Sort KWIC
Highlight collocate

Full source text

Click a KWIC row to show the full file/document text here.

This table explains what each setting does and how it affects the analysis.

Setting	Options	Comments
Candidate words from	Target Corpus; Reference Corpus	Determines which corpus is used to generate the list of words to compare. If you choose Target, the tool finds words that behave unusually in the target corpus compared to the reference corpus. If you choose Reference, it finds words whose behaviour in the reference corpus differs from their behaviour in the target corpus.
Similarity metric	Jaccard on top-K; Weighted overlap on top-K	Controls how similarity between collocate sets is calculated. Jaccard compares how many collocates the two corpora share. Weighted overlap also considers how important those collocates are (their rank or score).
Collocation measure	Raw Frequency; LogDice; Log-likelihood; MI (Mutual Information)	Raw frequency finds common neighbours. LogDice is a balanced, reliable general-purpose collocation measure. Log-likelihood highlights statistically significant collocates. MI highlights very strong but often rare collocations.
Window	1-8 (Default 5)	How many words to the left and right are searched when looking for collocates. Smaller windows (1–3) are useful for grammatical or phrase-level patterns. Larger windows (4–8) can retrieve topical or semantic associations.
Minimum pair frequency	Any number (Default 5)	The minimum number of times a word and its collocate must occur together before being counted. Increasing this removes rare or accidental collocations, giving fewer but more reliable results.
Minimum document spread	Any number (Default 1)	The collocate pair must appear in at least this many different files. This helps remove collocations that only occur in one document and are thus idiosyncratic.
Spread type	Raw number of texts (Default); Percentage of texts	Determines how Minimum document spread is interpreted. Raw number of texts requires a collocate pair to occur in a specified number of files. Percentage of texts requires the collocate pair to occur in a specified percentage of files in the relevant corpus. Percentage values are converted separately for each corpus, making this option particularly useful when comparing corpora with different numbers of texts.
Minimum candidate collocates n (Target)	Any number (Default 10)	A word must have at least this many collocates before it is included in the comparison. This removes very rare words where there is not enough data for a reliable comparison.
Minimum candidate collocates n (Reference)	Any number (Default 10)	As above, applied to the reference corpus.
Top N words	Any number (Default 5,000)	The number of candidate words to analyse and rank.
Top K for comparison	Any number (Default 10)	The number of top collocates used when comparing a word across the two corpora. The similarity score is based only on these top collocates. Smaller values focus on the most important collocates; larger values include more peripheral ones.
Top collocates shown	Any number (Default 10)	How many collocates are displayed in the reference and target collocate panels. This does not affect the similarity calculation, only what is shown on screen.
MI / LL smoothing	Any number (Default 0.5)	A small value added to frequencies to prevent division by zero and extreme scores when a collocate appears in one corpus but not the other. Increasing this makes the comparison more conservative and less sensitive to very rare collocates.
Filter stopwords	Yes or no (Default yes)	Removes very common grammatical words (the, and, of, to, etc.) from collocate lists so that results focus on meaningful lexical patterns. You can add or remove words in the editable stopword list. If you are mainly interested in grammatical analysis, you may want to switch this option to “no”.
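
The two similarity metrics in the table can be illustrated with a short Python sketch. This is not the tool's own code. The function names and example scores are hypothetical, and "weighted overlap" is assumed here to be a weighted Jaccard (sum of the smaller scores divided by the sum of the larger scores), which may differ from the formula the tool actually uses. The sketch also assumes non-negative collocation scores.

```python
def top_k(scores: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k highest-scoring collocates (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:k])


def jaccard(ref: dict[str, float], tgt: dict[str, float]) -> float:
    """Shared collocates divided by all distinct collocates; scores are ignored."""
    ref_words, tgt_words = set(ref), set(tgt)
    union = ref_words | tgt_words
    return len(ref_words & tgt_words) / len(union) if union else 0.0


def weighted_overlap(ref: dict[str, float], tgt: dict[str, float]) -> float:
    """Assumed weighted Jaccard: sum of min scores over sum of max scores.

    A collocate missing from one corpus counts as score 0 there.
    Assumes all scores are non-negative.
    """
    words = set(ref) | set(tgt)
    shared = sum(min(ref.get(w, 0.0), tgt.get(w, 0.0)) for w in words)
    total = sum(max(ref.get(w, 0.0), tgt.get(w, 0.0)) for w in words)
    return shared / total if total else 0.0


# Hypothetical collocate scores for one word in each corpus.
ref_scores = {"bank": 9.0, "river": 8.0, "water": 7.0, "flow": 6.0}
tgt_scores = {"bank": 9.0, "money": 8.0, "loan": 7.0, "water": 2.0}

K = 3  # plays the role of "Top K for comparison"
ref_top = top_k(ref_scores, K)
tgt_top = top_k(tgt_scores, K)

jaccard_score = jaccard(ref_top, tgt_top)
weighted_score = weighted_overlap(ref_top, tgt_top)
print(f"Jaccard: {jaccard_score:.3f}, weighted overlap: {weighted_score:.3f}")
```

With K set to 3, only "bank" survives in both top-K lists, so the Jaccard score is 1 shared collocate out of 5 distinct ones. Raising K to 4 would bring "water" into both lists and raise the similarity, which is why Top K for comparison changes the ranking while Top collocates shown does not.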
