Say hello to Web 1T 5-gram

By Filip Salo; published on September 25, 2006.

Google's ~~shitloa~~ Web 1T 5-gram data collection (mentioned earlier) is now available from the LDC.

> This data set contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. We expect this data will be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
>
> The n-gram counts were generated from approximately 1 trillion word tokens of text from publicly accessible Web pages.
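
For anyone wondering what "n-grams and their observed frequency counts" boils down to, here is a toy sketch in Python. It is my own illustration, not how Google built the data: the `ngram_counts` helper and the whitespace tokenization are assumptions, and the description above says nothing about how the real text was tokenized.

```python
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every n-gram of length 1..max_n in a list of tokens.

    Hypothetical helper for illustration only; the real data set
    was built from roughly a trillion tokens, not one sentence.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Assumption: naive whitespace tokenization of a made-up sentence.
tokens = "the cat sat on the mat".split()
counts = ngram_counts(tokens)

print(counts[("the",)])        # unigram count
print(counts[("the", "cat")])  # bigram count
```

Keys are tuples of words, so a unigram is a one-element tuple and a five-gram a five-element one. A `Counter` returns 0 for anything it has never seen, which is handy for lookups.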

I don't think the LDC license agreement really allows it, but it would be nice if someone (I suppose it's not likely, but how about Google themselves?) would put up just a minimal web interface for querying the data.

Or a web service (Google API, anyone?).

Or something. Please. Pretty please?

Fat hobbit wants it! (But is so not coughing up $150 for the six DVDs.)