Google releases shitloads of corpus data

By Filip Salo; published on August 04, 2006.

The people at Google Research have been digging into the Google datacenters to compile a corpus of one trillion words from public web pages. Now they're releasing some of the data they've extracted from it:

We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.

Official Google Research Blog: All Our N-gram are Belong to You
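At toy scale, the filtering Google describes (count every five-word window of running text, keep only the sequences seen at least some threshold number of times) can be sketched like this. The function name and the tiny thresholds are my own for illustration; the real release used cutoffs of 40 for 5-grams and 200 for unigrams.

```python
from collections import Counter

def ngram_counts(words, n=5, min_count=2):
    """Count every n-word sliding window, keeping those seen at least min_count times."""
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram: c for gram, c in counts.items() if c >= min_count}

text = "the quick brown fox jumps over the quick brown fox jumps again".split()
# Only 'the quick brown fox jumps' occurs twice; every other 5-gram occurs once.
print(ngram_counts(text, n=5, min_count=2))
```

The same two-pass idea (count, then threshold) is what makes the released data tractable at all: without the cutoffs, the long tail of one-off sequences would dwarf those 1.1 billion surviving 5-grams.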

One trillion words. That's a 1 followed by twelve zeroes. And the n-gram counts are released on a set of 6 DVDs (whew!). I'm not thrilled that distribution will be handled by the LDC, but I can live with that; it is a bit big for a download.

Oh, and a pearl of wisdom from the announcement:

there's no data like more data