Corpus Research
What is a Corpus in Linguistics?
A corpus (plural: corpora) is a large collection of texts, whether written, transcribed from speech, or gathered from digital sources, that researchers use to study language. Think of it as a language database. A corpus might contain:
- Millions of words from newspapers, novels, or academic articles
- Transcripts of conversations or interviews
- Social media posts, blogs, or online forums
The key idea is that a corpus represents real-life language use, not just invented examples.

How is Corpus Research Conducted?
Corpus linguistics is about using this “language database” to find patterns. Researchers usually:
- Collect or choose a corpus
  - Ready-made corpora (e.g., the British National Corpus or the Corpus of Contemporary American English, COCA)
  - Custom-built corpora (e.g., a set of tweets gathered on a particular topic)
- Search the corpus with software tools
  - Tools such as AntConc, Sketch Engine, and #LancsBox let you search for words, phrases, and grammatical structures.
- Look for patterns
  - Frequency: Which words or expressions appear most often?
  - Collocations: Which words often appear together (e.g., “climate change”)?
  - Concordances: How is a word used in different contexts?
- Interpret the findings
  - Numbers and patterns are only the start. Researchers then ask: what do these patterns tell us about meaning, discourse, or society?
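To make the three kinds of pattern concrete, here is a minimal Python sketch, not a substitute for tools like AntConc or Sketch Engine. It assumes a tiny made-up corpus (`TEXTS`) and a very simple regex tokenizer; the function names are hypothetical and chosen only for illustration. Collocations are approximated here by raw counts of adjacent word pairs, whereas real studies usually rank pairs with a statistical association measure such as mutual information or log-likelihood.

```python
import re
from collections import Counter

# Hypothetical mini-corpus for illustration; a real study would load many texts.
TEXTS = [
    "Climate change is a global problem.",
    "Scientists say climate change affects weather patterns.",
    "The weather was mild, but the climate is shifting.",
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts):
    """Frequency: count how often each word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def bigram_counts(texts):
    """Collocations (simplified): count adjacent word pairs within each text."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(zip(tokens, tokens[1:]))
    return counts


def concordance(texts, keyword, width=3):
    """Concordances: show each hit of `keyword` with up to `width` words on each side."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append(f"{left} [{tok}] {right}".strip())
    return lines


freqs = word_frequencies(TEXTS)
bigrams = bigram_counts(TEXTS)
kwic = concordance(TEXTS, "climate")

print(freqs.most_common(5))    # most frequent words
print(bigrams.most_common(3))  # most frequent adjacent pairs
for line in kwic:              # keyword-in-context lines
    print(line)
```

On this toy data, the pair "climate change" comes out as the most frequent adjacent pair, and the concordance lines show "climate" in three different contexts. Interpreting why those patterns occur is still the researcher's job.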