Google releases shitloads of corpus data
The people at Google Research have been digging into the Google datacenters to compile a corpus of one trillion words from public web pages. Now they're releasing some of the data they've extracted from it:
We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.
Official Google Research Blog: All Our N-gram are Belong to You
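For a feel of what "counts for all five-word sequences that appear at least 40 times" means in practice, here is a toy sketch in Python. It is not Google's pipeline. The function name `ngram_counts`, its parameters, and the choice to map rare words to an `<UNK>` placeholder (rather than drop them some other way) are assumptions made for illustration.

```python
from collections import Counter

def ngram_counts(tokens, n, min_count=1, min_word_count=1, unk="<UNK>"):
    """Count n-grams in a token list, keeping only the frequent ones.

    Hypothetical helper: words seen fewer than `min_word_count` times are
    replaced by `unk` (an assumption of this sketch), and n-grams seen
    fewer than `min_count` times are discarded.
    """
    word_freq = Counter(tokens)
    # Replace rare words with a placeholder so they still occupy a slot.
    cleaned = [w if word_freq[w] >= min_word_count else unk for w in tokens]
    # Slide a window of length n over the cleaned token stream.
    grams = Counter(
        tuple(cleaned[i:i + n]) for i in range(len(cleaned) - n + 1)
    )
    # Keep only n-grams that meet the frequency threshold.
    return {gram: count for gram, count in grams.items() if count >= min_count}

# Tiny example with bigrams; the real release would correspond to
# n=5, min_count=40, min_word_count=200 over a trillion tokens.
tokens = "the cat sat on the mat the cat sat on the rug".split()
counts = ngram_counts(tokens, n=2, min_count=2, min_word_count=2)
```

At a trillion words nothing like this fits in memory on one machine, which is part of why the result ships as a precomputed data set.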
One trillion words. That's fourteen kilobillies. And the n-gram counts are released on a set of 6 DVDs (whew!). I'm not thrilled that the distribution will be handled by the LDC (the Linguistic Data Consortium), but I guess even I can live with that. It is a bit big for a download.
Oh, and a pearl of wisdom from the announcement:
there's no data like more data