Tuesday, September 20, 2016

Some online text corpora and interfaces: BYU corpora and others.


Brigham Young University (BYU) corpora.

Example: Wikipedia
This corpus contains the full text of Wikipedia as of 2014: 1.9 billion words in more than 4.4 million articles.
http://corpus.byu.edu/wiki/


List of BYU corpora:

These are the most widely used online corpora, with more than 130,000 distinct researchers, teachers, and students using them each month.
Corpus                                          # words       Language/dialect    Time period

English
NOW Corpus                                      2.8 billion+  20 countries / Web  2010-yesterday
Global Web-Based English (GloWbE)               1.9 billion   20 countries / Web  2012-13
Wikipedia Corpus                                1.9 billion   English             -2014
Hansard Corpus (British Parliament)             1.6 billion   British             1803-2005
Corpus of Contemporary American English (COCA)  520 million   American            1990-2015
Corpus of Historical American English (COHA)    400 million   American            1810-2009
TIME Magazine Corpus                            100 million   American            1923-2006
Corpus of American Soap Operas                  100 million   American            2001-2012
British National Corpus (BYU-BNC)               100 million   British             1980s-1993
Strathy Corpus (Canada)                         50 million    Canadian            1970s-2000s
CORE Corpus                                     50 million    Web registers       -2014

Other languages
Corpus del Español                              100 million   Spanish             1200s-1900s
Corpus do Português                             45 million    Portuguese          1300s-1900s

N-grams
Google Books: American English                  155 billion   American            1500s-2000s
Google Books: British English                   34 billion    British             1500s-2000s
Google Books: One Million Books                 89 billion    Am/Br               1500s-2000s
Google Books: Spanish                           45 billion    Spanish             1500s-2000s


-------------------
https://en.wikipedia.org/wiki/List_of_text_corpora



