89. The Common Crawl dataset includes at least 16 million unique records of content
90. Critically, OpenAI admits that “datasets we view as higher-quality are sampled
Figure 2: Number of tokens from the 25 most represented top-level domains (left) and websites (right) in C4.EN.
3 Corpus-level statistics
Understanding the provenance of the texts that comprise a dataset is fundamental to understanding the dataset itself, so we begin our analysis of the metadata of C4.EN by characterizing the prevalence of different internet domains as sources of text, the date the websites were first indexed by the Internet Archive, and geolocation of IP addresses of hosted websites.
3.1 Internet domains
Figure 2 (left) shows the 25 most represented top-level domains (TLD),⁹ by number of word tokens in C4.EN (measured using the SpaCy English tokenizer).¹⁰ Unsurprisingly, popular top-level domains such as .com, .org, and .net are well represented. We note that some top-level domains reserved for non-US, English-speaking countries are less represented, and even some domains for countries with a primary language other than English are represented in the top 25 (such as .ru).¹¹ A significant portion of the text comes from .gov websites, reserved for the US government. Another potentially interesting top-level domain is .mil, reserved for the US military. While not in the top 25 TLDs, C4.EN contains 33,874,654 tokens from .mil top-level domain sites, coming from 58,394 unique URLs. There are an additional 1,224,576 tokens (from 2,873 unique URLs) from .mod.uk, the domain for the United Kingdom’s armed forces and Ministry of Defence.

⁹ https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains
¹⁰ https://spacy.io/api/tokenizer
¹¹ We use the TLDExtract (https://pypi.org/project/tldextract/) package to parse the URLs.
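The per-TLD counts described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the corpus is available as an iterable of (url, text) pairs, uses whitespace splitting as a stand-in for the SpaCy English tokenizer, and takes the last hostname label as a stand-in for TLDExtract. The function names (`naive_tld`, `tokens_by_tld`, `suffix_stats`) are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def hostname(url: str) -> str:
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return urlparse(url).hostname or ""


def naive_tld(url: str) -> str:
    """Last dot-separated hostname label (assumption: stand-in for TLDExtract)."""
    host = hostname(url)
    return host.rsplit(".", 1)[-1] if host else ""


def count_tokens(text: str) -> int:
    """Whitespace token count (assumption: stand-in for the SpaCy tokenizer)."""
    return len(text.split())


def tokens_by_tld(docs):
    """Sum token counts per top-level domain over (url, text) pairs."""
    counts = Counter()
    for url, text in docs:
        counts[naive_tld(url)] += count_tokens(text)
    return counts


def suffix_stats(docs, suffix):
    """Return (total tokens, number of unique URLs) for hostnames under a
    suffix such as 'mil' or 'mod.uk'."""
    suffix = suffix.lower().lstrip(".")
    tokens, urls = 0, set()
    for url, text in docs:
        host = hostname(url)
        if host == suffix or host.endswith("." + suffix):
            tokens += count_tokens(text)
            urls.add(url)
    return tokens, len(urls)
```

Calling `tokens_by_tld(docs).most_common(25)` would give a ranking analogous to Figure 2 (left), and `suffix_stats(docs, "mod.uk")` shows why a multi-label suffix needs hostname matching rather than the last label alone.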
Websites
In Figure 2 (right), we show the 25 most represented websites in C4.EN, ranked by total number of tokens. Surprisingly, the cleaned corpus contains substantial amounts of patent text: the single most represented website in the corpus is patents.google.com, and patents.com is also in the top 10. We discuss the implications of this in §4.1.
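A website-level ranking like the one in Figure 2 (right) could be approximated as below. This is a sketch under stated assumptions: documents are (url, text) pairs, a "website" is taken to be the URL's hostname with a leading "www." removed (the text does not specify how websites are delimited), and tokens are counted by whitespace splitting. The names `website` and `top_websites` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def website(url: str) -> str:
    """Hostname with a leading 'www.' stripped (assumed definition of a website)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def top_websites(docs, k=25):
    """Rank websites by total whitespace-token count, highest first."""
    counts = Counter()
    for url, text in docs:
        site = website(url)
        if site:  # skip entries whose URL has no parseable hostname
            counts[site] += len(text.split())
    return counts.most_common(k)
```

Grouping by hostname keeps patents.google.com separate from other google.com subdomains, which matches how the text names individual sites.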
Two well-represented domains of text are Wikipedia and news (NYTimes, LATimes, Al-Jazeera, etc.). These have been extensively used in the training of large language models (e.g., BERT, RoBERTa, GPT-3; Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020). Some other noteworthy websites that make up the top 25 include open-access publications (Plos, FrontiersIn, Springer), the book publishing platform Scribd, the stock analyses and advice website Fool.com, and the distributed file system ipfs.io.¹²
3.2 Utterance Date
Language changes over even short timescales, and the truth or relevance of many statements depends on when they were made. While the actual utterance date is often impossible to obtain for web documents, we use the earliest date a URL was indexed by the Internet Archive as a proxy. We note that using the Internet Archive is not perfect, as it
¹² Note that the distribution of websites in C4.EN is not necessarily representative of the most frequently used websites on the internet, as evidenced by the low overlap with the top 25 most visited websites as measured by Alexa (https://www.alexa.com/topsites).