Other Corpora Resources
(Last updated on: 28 November 2018)
Online Corpus
PolyU Language Bank | Over 36 million words of multilingual, multi-genre corpora | free |
RCPCE Profession-specific Corpora | A large collection of texts used in different professions in Hong Kong | free |
A Query to Internet Corpora (Leeds U) | Up-to-date general-purpose online corpora in several languages | |
British National Corpus (1980-1993) | A standard English corpus often used as a reference corpus. | |
British Academic Written English corpus (BAWE) | A 6-million-word collection of student essays in different disciplines | |
Business Letter Corpus | A corpus of various English business letters | |
BYU Corpora | A collection of mega-corpora, including the BNC and NOW (News on the Web, from 2010 to yesterday) | |
The Corpus of Contemporary American English (COCA, 1990-present) | Representative of modern American English | |
Time Magazine (1923-2006) | A corpus for diachronic language study | free |
GloWbE (Global Web-Based English) | 1.9 billion words of English used in 20 countries | free |
MICASE | Transcripts of a wide range of spoken academic texts from the University of Michigan. | free |
WebCorp | Allows corpus-type searches of documents in English on the Internet. | free |
CQPweb at Beijing Foreign Studies University | CQPweb with multilingual corpora. | free |
Enron email corpus | Enron email data sets compiled at UC Berkeley | free |
Corpora maintained by Geoffrey Sampson | A collection of different texts |
Parallel Corpus
Bilingual Parallel Corpora of Chinese Classics | Parallel texts of Chinese classic novels and government documents |
Text Archive
Project Gutenberg | The pioneering project designed to make out-of-copyright texts available electronically | free |
Internet Archive | The Internet Archive Text Archive contains a wide range of fiction, popular books, children's books, historical texts and academic books. | free |
Internet Archive: Wayback Machine | The Wayback Machine is a digital archive of the World Wide Web and other information on the Internet. You can check the Wayback Machine for archives of a website. | free |
Word Cloud
Voyant Tools | Creates word clouds based on word frequency | free |
WordClouds.com | WordClouds.com generates a Wordle-style word cloud from text that you provide. | free |
Corpus Tools
AntConc | A freeware concordance program for Windows. Please visit Laurence Anthony's Website for the complete list of software. | free |
AntCorGen | A freeware discipline-specific corpus creation tool. | free |
ConcGram 1.0 | ConcGram 1.0 is a corpus linguistics software package which is specifically designed to find all the co-occurrences of words in a text or corpus irrespective of variation. | |
ConcGramCore | ConcGramCore is an open-source corpus linguistics software package for corpus linguists to find all the co-occurrences of words in a text or corpus irrespective of variation. The software is in continuous development. | free |
ParaConc | A bilingual or multilingual concordancer that can be used in contrastive analyses and translation studies | free trial |
WordSmith Tools | Concordancing, word lists, key words | |
Leximancer | Lexical analysis | free trial |
WMatrix | In addition to frequency lists and concordances, WMatrix extends the keywords method to key grammatical categories and key semantic domains. | free trial |
Sketch Engine | Sketch Engine can provide a one-page summary of the word’s grammatical and collocational behavior, showing the word’s collocates categorised by grammatical relations. | |
ATLAS.ti (7) | For qualitative data analysis and discourse analysis | free trial |
NVivo (10) | For qualitative data analysis and discourse analysis | |
kfNgram | kfNgram makes n-gram indices of any text(s) you give it, similar to the Cluster function in WordSmith Tools. | free |
The IMS Open Corpus Workbench | | free |
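
As a rough illustration of what an n-gram index (as described for kfNgram above) contains, the following minimal Python sketch counts every run of n consecutive word tokens in a text. It is not how any of the listed tools is implemented; the function name `ngram_index` and the simple regular-expression tokeniser are assumptions made for this example only.

```python
import re
from collections import Counter


def ngram_index(text, n=2):
    """Return a Counter mapping each n-gram (a tuple of words) to its frequency.

    Hypothetical helper for illustration: tokens are lower-cased runs of
    letters, digits and apostrophes, which is cruder than a real tokeniser.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of length n over the token list.
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


if __name__ == "__main__":
    sample = "The cat sat on the mat. The cat slept."
    for gram, freq in ngram_index(sample, n=2).most_common(3):
        print(" ".join(gram), freq)
```

Setting `n=1` gives a plain word-frequency list, which is also the kind of count a word cloud is built from.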
Lexical Analysers
The Ultimate Research Assistant | Lexical semantic thematic analysis of web documents | free |
Taggers
CLAWS | Word class (part-of-speech) tagger | free |
Stanford Log-linear Part-Of-Speech Tagger | Software for POS tagging | free |
Stanford CoreNLP online engine | Online interface of the Stanford CoreNLP software. Click here for more information about the package. | free |
GUM | The Georgetown University Multilayer Corpus | free |
BFSU Corpora annotation tools | Many useful taggers, including a Windows GUI version of the Stanford POS Tagger. Click here for more information about the package. | free |
Phonetic Analysis
Praat | Praat (the Dutch word for "talk") is a free scientific software program for the analysis of speech in phonetics. | free |
EMU (The Emu Speech Database System) | EMU is a collection of software tools for the creation, manipulation and analysis of speech databases. | free |
WaveSurfer | WaveSurfer is an open-source tool for sound visualization and manipulation. | free |
Speech Analyzer | Speech Analyzer is a computer program for acoustic analysis of speech sounds. | free |
Development Workbench
KPML | Workbench for developing grammatical descriptions and defining computational grammars | free |
TermBase | Database for developing and storing terminologies | free |
Descriptive Resources
WordNet | A lexical database organizing nouns, verbs, adjectives and adverbs into synonym sets, each representing one underlying lexical concept. | free |
FrameNet | A lexical database containing around 1,200 semantic frames, 13,000 lexical units and over 190,000 example sentences. | free |
Statistical Tools
SPSS | A widely used advanced statistical analysis package. | |
R Project | A free package for statistical computing and graphics | free |
GNU PSPP | A free program for statistical analysis. It is a free (as in freedom) replacement for the proprietary program SPSS, and appears very similar to it with a few exceptions. | free |
Sample Size Calculator | An online calculator to find out the sample size based on the set confidence level and confidence interval. Useful for quantitative research sampling. | free |
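
For readers who want to see the arithmetic behind such a sample size calculator, here is a minimal Python sketch. It assumes the commonly used Cochran formula for a proportion, n0 = z² · p(1 − p) / e², with an optional finite population correction; the linked calculator may use a different method. The function name `sample_size` and its parameters are hypothetical, and `margin_of_error` corresponds to what the calculator calls the confidence interval (for example 0.05 for ±5%).

```python
import math
from statistics import NormalDist


def sample_size(confidence_level, margin_of_error, population=None, proportion=0.5):
    """Estimate the sample size needed for a survey proportion.

    Assumes Cochran's formula; proportion=0.5 is the most conservative
    choice. If a population size is given, a finite population
    correction is applied.
    """
    # Two-sided z-score for the chosen confidence level, e.g. about 1.96 for 0.95.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n0 = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    # Round up, since a fraction of a respondent cannot be sampled.
    return math.ceil(n0)


if __name__ == "__main__":
    print(sample_size(0.95, 0.05))
    print(sample_size(0.95, 0.05, population=10000))
```

A higher confidence level or a narrower margin of error increases the required sample, while a small population reduces it.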