Corpus Research

What is a Corpus in Linguistics?

A corpus (plural: corpora) is a large collection of texts, whether spoken, written, or digital, that researchers use to study language. Think of it as a language database. A corpus might contain:

Millions of words from newspapers, novels, or academic articles
Transcripts of conversations or interviews
Social media posts, blogs, or online forums

The key idea is that a corpus represents real-life language use, not just invented examples.

How is Corpus Research Conducted?

Corpus linguistics uses this “language database” to find patterns in how language is actually used. Researchers usually work through four steps:

Collect or choose a corpus
- Ready-made corpora (e.g., the British National Corpus or the Corpus of Contemporary American English, COCA)
- Custom-built corpora (e.g., gathering tweets about a certain topic)
Search the corpus with software tools
- Programs such as AntConc, Sketch Engine, or #LancsBox let you search for words, phrases, and grammatical structures.
Look for patterns
- Frequency: Which words or expressions appear most often?
- Collocations: Which words appear together more often than chance would predict (e.g., “climate change”)?
- Concordances: How is a word used in different contexts?
Interpret the findings
- Numbers and patterns are only the start. Researchers then ask: What do these patterns tell us about meaning, discourse, or society?
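
To make the "look for patterns" step concrete, here is a minimal Python sketch of the three searches described above: frequency, collocations, and concordances. It is an illustration under stated assumptions, not how AntConc, Sketch Engine, or any other tool works internally. The three sample sentences are invented, the function names are hypothetical, and the collocation step simply counts adjacent word pairs. Real tools usually rank collocations with statistical association measures rather than raw counts.

```python
import re
from collections import Counter

# Tiny invented sample standing in for a real corpus (assumption for illustration).
SAMPLE_TEXTS = [
    "Climate change is a global challenge. Climate change affects everyone.",
    "Scientists study climate change and its effects on weather.",
    "The weather will change tomorrow, but the climate shifts slowly.",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts):
    """Frequency: count how often each word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def bigram_frequencies(texts):
    """Collocation candidates: count adjacent word pairs within each text.

    Simplification: pairs may span a sentence boundary inside one text,
    and raw counts are used instead of an association measure.
    """
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(zip(tokens, tokens[1:]))
    return counts


def concordance(texts, keyword, window=3):
    """Concordance: list each hit with up to `window` words of context per side."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(f"{left} [{token}] {right}".strip())
    return lines


if __name__ == "__main__":
    print(word_frequencies(SAMPLE_TEXTS).most_common(5))
    print(bigram_frequencies(SAMPLE_TEXTS).most_common(3))
    for line in concordance(SAMPLE_TEXTS, "climate"):
        print(line)
```

In this sample, the pair ("climate", "change") occurs three times, more than any other adjacent pair, and the concordance lines show "climate" in each of its four contexts. The interpretation step still belongs to the researcher: the counts show that the pattern exists, not what it means.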