Corpus Research

What is a Corpus in Linguistics?

A corpus (plural: corpora) is a large collection of texts, whether written, transcribed from speech, or gathered online, that researchers use to study language. Think of it as a language database. A corpus might contain:

  • Millions of words from newspapers, novels, or academic articles

  • Transcripts of conversations or interviews

  • Social media posts, blogs, or online forums

The key idea is that a corpus represents real-life language use, not just invented examples.
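
To make the "language database" idea concrete, here is a minimal Python sketch of how a tiny corpus could be stored and measured. The `Document` class, the `tokenize` helper, and the three sample sentences are hypothetical and invented for illustration. A real corpus would hold authentic texts and would usually rely on a more careful tokenizer.

```python
import re
from dataclasses import dataclass


@dataclass
class Document:
    """One text in the corpus, labelled with where it came from."""
    source: str  # e.g. "newspaper", "conversation", "forum"
    text: str


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


# Invented placeholder texts, used only to show the structure.
corpus = [
    Document("newspaper", "The council approved the new budget on Monday."),
    Document("conversation", "Yeah, I mean, it's kind of expensive, isn't it?"),
    Document("forum", "Has anyone tried the new update? It's so slow."),
]

# Corpus size is usually reported in tokens (running words).
sizes = {doc.source: len(tokenize(doc.text)) for doc in corpus}
total_tokens = sum(sizes.values())
```

Keeping a source label next to each text is what later allows comparisons between, say, newspaper language and conversation.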

How is Corpus Research Conducted?

Corpus linguistics uses this “language database” to find patterns in how language is actually used. Researchers usually work through four steps:

  1. Collect or choose a corpus

    • Ready-made corpora (e.g., the British National Corpus or the Corpus of Contemporary American English, COCA)

    • Custom-built corpora (e.g., gathering tweets about a certain topic)

  2. Search the corpus with software tools

    • Programs such as AntConc, Sketch Engine, or #LancsBox let you search for words, phrases, and grammatical structures.

  3. Look for patterns

    • Frequency: Which words or expressions appear most often?

    • Collocations: Which words appear together more often than chance would predict (e.g., “climate change”)?

    • Concordances: How is a word used across different contexts, when each occurrence is listed with its surrounding words?

  4. Interpret the findings

    • Numbers and patterns are only the start. Researchers then ask what these patterns reveal about meaning, discourse, or society.
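
The three pattern types in step 3 can be sketched in a few lines of Python. This is a simplified illustration, not how AntConc, Sketch Engine, or #LancsBox work internally. The sample text is invented, the function names are hypothetical, and raw counts of adjacent word pairs stand in for collocation. Dedicated tools typically rank collocations with statistical association measures instead.

```python
import re
from collections import Counter

# Invented sample text; a real study would load authentic corpus files.
TEXT = (
    "Climate change is a global problem. Many governments discuss climate "
    "change at summits. Critics say the climate debate ignores local change."
)


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Frequency: how often each word occurs."""
    return Counter(tokens)


def bigram_counts(tokens):
    """Count adjacent word pairs as a rough proxy for collocation.

    Simplification: pairs that span a sentence boundary are counted too.
    """
    return Counter(zip(tokens, tokens[1:]))


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


tokens = tokenize(TEXT)
freq = word_frequencies(tokens)
bigrams = bigram_counts(tokens)
kwic = concordance(tokens, "climate")

for line in kwic:
    print(line)
```

Reading the concordance lines side by side is where step 4 begins: the counts show that “climate” and “change” co-occur, while the contexts show how the pairing is actually used.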