Say hello to Web 1T 5-gram
Google's shitload of data, the Web 1T 5-gram collection (mentioned earlier), is now available from the LDC.
This data set contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. We expect this data will be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
The n-gram counts were generated from approximately 1 trillion word tokens of text from publicly accessible Web pages.
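For anyone wondering what you'd actually do with a pile of counts like that, here's a minimal sketch in Python. It assumes each line of the data looks like `w1 w2 ... wn<TAB>count`, which is my assumption about the file layout, so check the documentation that ships with the DVDs. The helper names (`load_counts`, `cond_prob`) and the sample counts are made up for illustration. It computes a plain relative-frequency estimate of P(word | context), with no smoothing at all, which a real language model would need.

```python
import io


def load_counts(lines):
    """Parse 'w1 w2 ... wn<TAB>count' lines into a dict keyed by word tuple.

    The line layout is an assumption, not taken from the LDC documentation.
    """
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab: n-gram text on the left, count on the right.
        ngram, _, count = line.rpartition("\t")
        counts[tuple(ngram.split(" "))] = int(count)
    return counts


def cond_prob(counts, context, word):
    """Unsmoothed estimate: count(context + word) / count(context)."""
    context = tuple(context)
    denom = counts.get(context, 0)
    if denom == 0:
        return 0.0  # unseen context; a real model would back off or smooth
    return counts.get(context + (word,), 0) / denom


# Made-up sample counts, NOT real Web 1T numbers.
SAMPLE = "fat\t1000\nfat hobbit\t25\nfat cat\t200\n"
counts = load_counts(io.StringIO(SAMPLE))
print(cond_prob(counts, ["fat"], "hobbit"))
```

With the invented numbers above, this prints 0.025 (25 out of 1000). The real collection is far too big for an in-memory dict like this; you'd want an on-disk index or a database behind it, which is exactly why a hosted query interface would be so handy.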
I don't think the LDC license agreement really allows it, but it would be nice if someone (Google themselves, unlikely as that is?) put up just a minimal web interface for querying the data.
Or a web service (Google API, anyone?).
Or something. Please. Pretty please?
Fat hobbit wants it! (But is so not coughing up $150 for the six DVDs.)