Web Activity 10.2 Digging for data in a corpus


A wealth of data and search tools is now available for examining how language is actually used in the real world, across a broad range of situations. Here are three especially large and easy-to-access corpora for English:

Corpus of Contemporary American English (COCA)

http://corpus.byu.edu/coca/

This is the largest current corpus of American English, consisting of more than 560 million words of text from 1990 to the present. It is publicly available for free on the internet.

It is evenly divided among five sections, or genres:

  • Spoken language (transcriptions of unscripted conversations on TV and radio programs)
  • Fiction
  • Popular magazines
  • Newspapers
  • Academic journals

Users can define searches by year and by genre. New texts are added to the corpus each year, in the same balance across the five genres, to allow for a direct year-to-year comparison of usage.

British National Corpus (BNC)

http://www.natcorp.ox.ac.uk

The BNC is a large collection (100 million words) of written and spoken British English from a broad range of sources sampled in the late 20th century. The written portion, which makes up 90% of the corpus, includes (among other sources) text from regional and national newspapers, specialist journals for various ages and interests, published and unpublished letters and memos, and school and university essays. The spoken portion consists of transcriptions of informal conversations involving volunteer participants of various ages, regions, and social classes, along with spoken language from radio programs, government and business meetings, and various public lectures.

The aim of the BNC is to include contemporary British English from a very broad variety of genres, contexts, regions, and social settings. It is publicly available for free on the internet.

Corpus of Historical American English (COHA)

http://corpus.byu.edu/coha/

This corpus contains more than 400 million words of American English text from 1810 to 2009, drawn from newspapers, magazines, and books of fiction and non-fiction. The corpus is balanced by genre across decades to allow for direct comparisons of usage over time. Works of fiction make up about half of the corpus. This corpus is best suited for tracking patterns of language change and shifting usage over the last two centuries.

What kinds of questions can you ask using a corpus?

The corpora listed above can be queried using a variety of different search terms and flexible tools. Here is just a small sampling of the general questions that can be addressed:

What words and structures make up the English language?

Prescriptive grammarians often make claims about what counts as “correct” English usage—for example, that it is ungrammatical to “split infinitives,” as in the sentence It’s wrong to carelessly insert an adverb before the verb. Scientific linguists are not preoccupied with “correctness,” but they do try to characterize the knowledge that members of a speech community share about their language. So, for example, they might claim that speakers of English would agree that it is strange to say a sentence like What did he buy without tasting? A corpus search can check whether such structures really are excluded from English usage, or whether they actually occur with some regularity. The corpus can also provide data about the specific contexts in which certain marginal structures occur most freely.

Here are a couple of examples:

http://languagelog.ldc.upenn.edu/nll/?p=11924

http://languagelog.ldc.upenn.edu/nll/?p=1876
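
If you would like to try something similar on a text file of your own, a first pass can be scripted. The Python sketch below is only a rough heuristic, and the file name sample_corpus.txt is a placeholder: it approximates a split infinitive as to followed by an -ly adverb and another word, whereas a real corpus query would rely on part-of-speech tags.

import re
from collections import Counter

# Rough heuristic: treat "to" + an -ly adverb + a following word as a
# likely split infinitive. A real corpus query would use part-of-speech
# tags rather than this crude pattern.
SPLIT_INFINITIVE = re.compile(r"\bto\s+(\w+ly)\s+(\w+)\b", re.IGNORECASE)

def find_split_infinitives(text):
    """Return a Counter of (adverb, verb) pairs that look like split infinitives."""
    return Counter(m.group(1, 2) for m in SPLIT_INFINITIVE.finditer(text))

if __name__ == "__main__":
    # "sample_corpus.txt" is a hypothetical plain-text file you supply.
    with open("sample_corpus.txt", encoding="utf-8") as f:
        hits = find_split_infinitives(f.read())
    for (adverb, verb), count in hits.most_common(10):
        print(f"to {adverb} {verb}: {count}")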

What are the typical language patterns that an English speaker has been exposed to?

You have seen throughout the textbook how closely we track statistical patterns as we learn language, and how important frequency distributions are for language comprehension and production. Using a corpus, you can get an approximate sense of the distribution patterns that make up the experience of a typical English speaker. This can be important for making experimental predictions that hinge on relating language learning or processing to previous experience with language.

You can also get a sense of how these patterns might change depending on the kind of language that a speaker encounters. For example, are there certain structures that tend to be encountered only in written language, or in academic written texts? If so, the language experience of someone who regularly reads scholarly journals could be very different from that of someone who rarely reads anything at all.
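
As a very rough illustration of what such a comparison involves, the Python sketch below computes per-million-word frequencies in two plain-text samples; the file names are hypothetical stand-ins for, say, an academic text and a transcript of casual conversation.

import re
from collections import Counter

def relative_frequencies(path):
    """Tokenize a plain-text file very crudely and return per-million-word frequencies."""
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"[a-z']+", f.read().lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total * 1_000_000 for w, c in counts.items()}

# Hypothetical genre samples; any two plain-text files will do.
academic = relative_frequencies("academic_sample.txt")
spoken = relative_frequencies("spoken_sample.txt")

# Words that are far more at home in academic prose than in conversation (or vice versa).
for word in ("thus", "moreover", "gonna", "stuff"):
    print(word, round(academic.get(word, 0), 1), round(spoken.get(word, 0), 1))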

How have language patterns changed over time? Are there changes that are taking place right now?

For example, the words lame, hot, and random have changed recently. When did these changes start, and what were the most important language contexts in which their new uses spread?

Are there certain structures that are becoming more frequent and that may end up replacing previous structures? For instance, is the “get passive” structure on the rise relative to the regular passive form (He got killed versus He was killed)?

Here is an example:

http://languagelog.ldc.upenn.edu/nll/?p=10089
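
Before running a full corpus query, you could get a quick-and-dirty count of the two passives in a text sample of your own with a heuristic like the Python sketch below; it simply looks for a form of get or be followed by something that resembles a past participle, so treat its output as approximate.

import re

# Rough patterns: a form of "get" or "be" followed by a word that looks like
# a past participle (-ed, plus a few common irregulars). Real corpus work
# would use part-of-speech tags and a verb lemma list instead.
PARTICIPLE = r"(?:\w+ed|taken|given|told|paid|caught|hurt)\b"
GET_PASSIVE = re.compile(r"\b(?:get|gets|got|getting)\s+" + PARTICIPLE, re.IGNORECASE)
BE_PASSIVE = re.compile(r"\b(?:is|are|was|were|been|being)\s+" + PARTICIPLE, re.IGNORECASE)

def passive_counts(text):
    """Return (get-passive hits, be-passive hits) for a text sample."""
    return len(GET_PASSIVE.findall(text)), len(BE_PASSIVE.findall(text))

sample = "He got killed in the crash. The report was finished late. She got promoted."
print(passive_counts(sample))   # (2, 1) for this toy sample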

How do language patterns reflect cultural norms?

What adjectives are most likely to co-occur with the words Democrat versus Republican, and what does this say about popular opinion? Are we saying different things about immigrants, equality, or teenagers than we did a decade or two ago? What societal changes does this reflect?

Here is an example:

http://languagelog.ldc.upenn.edu/nll/?p=9712
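
Questions like these are best answered with the corpora's own collocate tools, but the underlying idea can be illustrated in a few lines of Python. The sketch below (the input file name is hypothetical) simply tallies the word immediately preceding each occurrence of a target word.

import re
from collections import Counter

def preceding_words(text, target):
    """Tally the word immediately preceding each occurrence of `target`
    (a crude stand-in for a real collocate search with part-of-speech tags)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(prev for prev, word in zip(tokens, tokens[1:]) if word == target)

# "news_sample.txt" is a hypothetical plain-text file you supply.
with open("news_sample.txt", encoding="utf-8") as f:
    text = f.read()

print(preceding_words(text, "democrat").most_common(10))
print(preceding_words(text, "republican").most_common(10))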

Are the materials in my psycholinguistics experiment properly balanced?

Researchers commonly use corpus data to make sure that their experimental materials do not inadvertently introduce biases due to word frequency or to expectations about the relationships between particular words, phrases, or structures.
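
As a minimal sketch of such a check, the Python code below compares the mean word frequencies of two made-up stimulus lists; it assumes the third-party wordfreq package has been installed (pip install wordfreq).

# Compare mean Zipf frequencies of two hypothetical stimulus lists.
from statistics import mean
from wordfreq import zipf_frequency

condition_a = ["doctor", "table", "window", "garden"]
condition_b = ["sailor", "ledger", "awning", "trellis"]

zipf_a = [zipf_frequency(w, "en") for w in condition_a]
zipf_b = [zipf_frequency(w, "en") for w in condition_b]

print("Condition A mean Zipf frequency:", round(mean(zipf_a), 2))
print("Condition B mean Zipf frequency:", round(mean(zipf_b), 2))
# A large gap between the two means would signal that the conditions
# are not matched for frequency.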

Using corpus tools

Learning to use search tools for these corpora is not hard. Here is a brief instructional video that gives an overview of how to use COCA:

https://www.youtube.com/watch?v=sCLgRTlxG0Y

This video shows how to perform more advanced searches using parts-of-speech tags in COCA:

https://www.youtube.com/watch?v=KP-7thiUnLM

This video shows how to perform searches that reveal which words or phrases tend to co-occur:

https://www.youtube.com/watch?v=t_SxpfiPo_o

Try using these tools to investigate a linguistic question that has aroused your curiosity.
