English Corpora: most widely used online corpora. Billions of words of data: free online access
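Corpus | Size (words) | Countries | Time | Genre
IWEB | 13.9 billion |  | 2017 | Web
NOW | 16.2 billion | 20 | 2010-now | Web: News
CORONA | 1.58 billion | 20 | 2020-now | Web: News
GLOWBE | 1.9 billion | 20 | 2012-13 | Web/blogs
WIKI | 1.9 billion | (+) | 2014 | Wikipedia
COCA | 1.0 billion | Am | 1990-2019 | Balanced
COHA | 400 million | Am | 1810-2009 | Balanced
TV | 325 million |  | 1950-2018 | TV shows
MOVIES | 200 million |  | 1930-2018 | Movies
SOAP | 100 million | Am | 2001-2012 | TV shows
HANSARD | 1.6 billion | Br | 1803-2005 | Parliament
EEBO | 755 million | Br | 1470s-1690s | Various
SUP CRT | 130 million | Am | 1790s-2010s | Legal
TIME | 100 million | Am | 1923-2006 | Magazine
BNC | 100 million | Br | 1980s-1993 | Balanced
CAN | 50 million | Can | 1970s-2000s | Balanced
CORE | 50 million |  | 2014 | Web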
Now available: AI/LLM insights integrated into corpus results
These corpora (most of which were created by Mark Davies) are the most widely used online corpora, and they serve many different purposes for teachers and researchers at universities throughout the world. In addition, the corpus data (e.g. full-text, word frequency) has been employed by a wide range of companies in many different fields, especially technology and language learning. (These include tech companies like Amazon, Google, Facebook, Microsoft, IBM, Sony, Disney, Intel, Adobe, and Samsung, as well as language-related companies like Merriam-Webster, Dictionary.com, Grammarly, Duolingo, TurnItIn, Oxford University Press, and Sketch Engine, among many others.)
The links below are for the free online interface. You can also purchase and download the corpora for use on your own computer.
Corpus | # words | Dialect | Time period | Genre(s)
News on the Web (NOW) | 24.7 billion+ | 20 countries | 2010-yesterday | Web: News
iWeb: The Intelligent Web-based Corpus | 14 billion | 6 countries | 2017 | Web
Global Web-Based English (GloWbE) | 1.9 billion | 20 countries | 2012-2013 | Web (incl blogs)
Wikipedia Corpus | 1.9 billion | (Various) | 2014 | Wikipedia
Coronavirus Corpus | 1.5 billion | 20 countries | 2020-2023 | Web: News
Corpus of Contemporary American English (COCA) | 1.0 billion | American | 1990-2019 | Balanced
Corpus of Historical American English (COHA) | 475 million | American | 1820-2019 | Balanced
The TV Corpus | 325 million | 6 countries | 1950-2018 | TV shows
The Movie Corpus | 200 million | 6 countries | 1930-2018 | Movies
Corpus of American Soap Operas | 100 million | American | 2001-2012 | TV shows
Hansard Corpus | 1.6 billion | British | 1803-2005 | Parliament
Early English Books Online (EEBO) | 755 million | British | 1470s-1690s | (Various)
Corpus of US Supreme Court Opinions | 130 million | American | 1790s-2017 | Legal opinions
TIME Magazine Corpus | 100 million | American | 1923-2006 | Magazine
British National Corpus (BNC) | 100 million | British | 1980s-1993 | Balanced
Strathy Corpus (Canada) | 50 million | Canadian | 1970s-2000s | Balanced
CORE Corpus | 50 million | 6 countries | 2014 | Web

From Google Books n-grams:
American English | 155 billion | American | 1500s-2000s | (Various)
British English | 34 billion | British | 1500s-2000 | (Various)