This table explains what each setting does and how it affects the analysis.
| Setting | Options | Comments |
|---|---|---|
| Candidate words from | Target Corpus, Reference Corpus | This determines which corpus is used to generate the list of words to compare. If you choose Target, the tool finds words that behave unusually in the target corpus compared to the reference corpus. If you choose Reference, it finds words whose behaviour in the reference corpus is different from the target. |
| Similarity metric | Jaccard on top-K, Weighted overlap on top-K | Controls how similarity between collocate sets is calculated. Jaccard measures the proportion of top-K collocates that the two corpora share. Weighted overlap also takes into account how important those collocates are (their rank or score). |
| Collocation measure | Raw Frequency, LogDice, Log-likelihood, MI (Mutual Information) | Raw frequency finds common neighbours. LogDice is a balanced, reliable general-purpose collocation measure. Log-likelihood highlights statistically significant collocates. MI highlights very strong but often rare collocations. |
| Window | 1-8 (Default 5) | How many words to the left and right are searched when looking for collocates. Smaller windows (1–3) are useful for grammatical or phrase-level patterns. Larger windows (4–8) can retrieve topical or semantic associations. |
| Minimum pair frequency | Any number (Default 5) | The minimum number of times a word and its collocate must occur together before being counted. Increasing this removes rare or accidental collocations, giving fewer but more reliable results. |
| Minimum document spread | Any number (Default 2) | The collocate pair must appear in at least this many different files. This helps remove collocations that only occur in one document and are thus idiosyncratic. |
| Spread Type | Raw number of texts (Default), Percentage of texts | Determines how Minimum document spread is interpreted. Raw number of texts requires a collocate pair to occur in a specified number of files. Percentage of texts requires the collocate pair to occur in a specified percentage of files in the relevant corpus. Percentage values are converted separately for each corpus, making this option particularly useful when comparing corpora with different numbers of texts. |
| Minimum candidate collocates n (Target) | Any number (Default 10) | A word must have at least this many collocates before it is included in the comparison. This removes very rare words where there is not enough data for a reliable comparison. |
| Minimum candidate collocates n (Reference) | Any number (Default 10) | As above. |
| Top N words | Any number (Default 5,000) | The number of candidate words to analyse and rank. |
| Top K for comparison | Any number (Default 20) | The number of top collocates used when comparing a word across the two corpora. The similarity score is based only on these top collocates. Smaller values focus on the most important collocates; larger values include more peripheral ones. |
| Top collocates shown | Any number (Default 20) | How many collocates are displayed in the reference and target collocate panels. This does not affect the similarity calculation, only what is shown on screen. |
| MM/LL smoothing | Any number (Default 0.5) | A small value added to frequencies to prevent division-by-zero and extreme scores when a collocate appears in one corpus but not the other. Increasing this makes the comparison more conservative and less sensitive to very rare collocates. |
| Filter stopwords | Yes or no (Default yes) | Removes very common grammatical words (the, and, of, to, etc.) from collocate lists so that results focus on meaningful lexical patterns. You can edit the stopword list in the window, adding or removing words as needed. If you are mainly interested in grammatical analyses, you may want to switch this option to “no”. |
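
To make the Similarity metric and Spread Type settings more concrete, here is a minimal Python sketch. It is not the tool's actual implementation, and every function name in it is hypothetical. Two assumptions are built in: weighted overlap is modelled as a rank-weighted Jaccard, where the top collocate gets weight K and the K-th gets weight 1, and percentage spread is rounded up to a whole number of texts. The tool may weight by score instead, or round differently.

```python
from math import ceil


def top_k(scores: dict[str, float], k: int) -> list[str]:
    """Return the k highest-scoring collocates, best first (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def jaccard_top_k(target: dict[str, float], reference: dict[str, float], k: int) -> float:
    """Shared top-K collocates divided by all distinct top-K collocates."""
    a, b = set(top_k(target, k)), set(top_k(reference, k))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_weights(scores: dict[str, float], k: int) -> dict[str, int]:
    """ASSUMPTION: weight = k for the top collocate, down to 1 for the k-th."""
    return {word: k - i for i, word in enumerate(top_k(scores, k))}


def weighted_overlap_top_k(target: dict[str, float], reference: dict[str, float], k: int) -> float:
    """Rank-weighted Jaccard: sum of minimum weights over sum of maximum weights."""
    wa, wb = rank_weights(target, k), rank_weights(reference, k)
    words = set(wa) | set(wb)
    denom = sum(max(wa.get(w, 0), wb.get(w, 0)) for w in words)
    if denom == 0:
        return 0.0
    return sum(min(wa.get(w, 0), wb.get(w, 0)) for w in words) / denom


def spread_threshold(percent: float, n_texts: int) -> int:
    """ASSUMPTION: a percentage spread is rounded up to at least one text."""
    return max(1, ceil(percent * n_texts / 100))


# Made-up collocate scores for one word in each corpus.
target = {"tea": 9.0, "coffee": 8.0, "cup": 7.0, "hot": 6.0}
reference = {"cup": 9.0, "iced": 8.0, "tea": 7.0, "sugar": 6.0}

print(jaccard_top_k(target, reference, 3))           # 0.5
print(weighted_overlap_top_k(target, reference, 3))  # 0.2
print(spread_threshold(5, 200), spread_threshold(5, 40))  # 10 2
```

In this toy example both corpora share two of their top three collocates, so Jaccard gives 0.5. The shared collocates sit at different ranks, so the rank-weighted score falls to 0.2. The last line shows why percentage spread helps with corpora of different sizes: 5% corresponds to 10 texts in a 200-text corpus but only 2 texts in a 40-text corpus.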