Mmmm, words!

By Filip Salo; published on June 14, 2006.

I'm compiling a corpus based on the Swedish Wikipedia. Thanks to their database dumps, it's a fairly simple thing to do (depending on your requirements, of course).

I haven't tokenized it properly, but it looks like I'll end up with about 26 million tokens. I've got about 100 million tokens from other corpora as well, so I expect to be toying around with 125 million(-ish) tokens some time soon. Good times!

When talking about the size of a linguistic corpus in thousands or millions of words, it can be quite difficult to get a sense of how much it really is.

Of course, it varies wildly with the layout, but a single A4-sized page can easily hold 800 words or so. A fairly normal magazine article runs to perhaps 2,000. The Swedish translation of the Bible from 1917 contains around 800,000 words, or 1,000 pages of regular paper. That's a stack of about five centimeters, so my 125,000,000 words would take almost eight meters of shelves. That's more than one and a half full-sized Billy bookcases.
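
For anyone who wants to check the arithmetic, here is a rough sketch in Python. The page, Bible and stack figures are the ones from the paragraph above. The shelf length of a Billy is my own assumption (six shelves of 0.8 meters each), so treat the last number as a ballpark only.

```python
# Back-of-the-envelope figures taken from the text above.
WORDS_PER_PAGE = 800        # one densely set A4 page
BIBLE_WORDS = 800_000       # Swedish Bible translation of 1917
BIBLE_STACK_CM = 5.0        # thickness of those 1,000 pages
CORPUS_WORDS = 125_000_000  # expected corpus size

# Assumption (not from the text): a full-sized Billy offers
# six shelves, each 0.8 m wide, i.e. 4.8 m of shelf in total.
BILLY_SHELF_M = 0.8 * 6


def shelf_metres(words: int) -> float:
    """Return the shelf length in meters needed to store `words` on paper."""
    pages = words / WORDS_PER_PAGE
    cm_per_page = BIBLE_STACK_CM / (BIBLE_WORDS / WORDS_PER_PAGE)
    return pages * cm_per_page / 100


metres = shelf_metres(CORPUS_WORDS)
billies = metres / BILLY_SHELF_M
print(f"{metres:.2f} m of shelf, about {billies:.2f} billies")
```

With these numbers the corpus comes out at a little under eight meters, which under the assumed shelf length is somewhat more than one and a half billies.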

Of course, this stuff is morphosyntactically tagged: every word is annotated with information on part-of-speech and inflection, and that annotation would take up a Billy or two by itself. (Come to think of it, I think "billies" would be an excellent measure of corpus size. You could have a corpus of one millibilly. Beautiful!)

This entry, by the way, has 241 words (in the simple unix wc -w sense).
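
The "wc -w sense" of a word is simply a run of characters between whitespace. A minimal sketch of the same idea in Python follows; the function name `count_words` and the sample string are made up for illustration, and the result only approximates wc -w, which may treat some unusual whitespace characters differently.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated chunks, roughly what `wc -w` reports."""
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())


# Hypothetical sample; punctuation stays attached to its word.
sample = "Mmmm, words!  Good\ttimes!\n"
print(count_words(sample))  # 4
```

Note that this is counting, not tokenizing: "words!" is one word here, whereas a proper tokenizer would split off the exclamation mark.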