ngrams for n diverging to infinity

Submitted by Matteo Vescovi on Wed, 2007-04-25 16:45

That Google released a massive corpus of data based on the vast number of web pages that their Googlebot crawler indexes is old news.

Don't you just love the subtle reference to the "All your base are belong to us" phenomenon? Incidentally, all my X display manager sessions greet me with the catchy "All your base are belong to us" motto. So do my screensavers. I know...

That the six DVDs that contain the data are now available for free to the first fifty universities that express an interest in obtaining them is news!

I pulled some strings to try to get my contacts at the Politecnico di Milano to request the DVDs for me.

I am planning on using the massive ngram database as data to train a better, more accurate, more knowledgeable language model for Soothsayer, the ubiquitous intelligent predictive framework.

I am counting on the ngrams harvested by Google being much more representative than the ones I use to train the current frequentist language model. That should not be a wild assumption to make, as I am currently restricting the training corpus to a single work of fiction, Oscar Wilde's "The Picture of Dorian Gray".
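For readers unfamiliar with the idea, here is a minimal sketch of what a frequentist ngram model boils down to: count how often each word follows a given context in the training text, then rank candidate next words by relative frequency. This is an illustration only, not Soothsayer's actual code. The function names (`tokenize`, `count_ngrams`, `build_model`, `predict`), the naive whitespace tokenizer and the toy sample sentence are all assumptions made up for this example.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def build_model(tokens, n):
    """Map each (n-1)-token context to a Counter of the words that followed it."""
    model = defaultdict(Counter)
    for ngram, count in count_ngrams(tokens, n).items():
        model[ngram[:-1]][ngram[-1]] += count
    return model


def predict(model, context, k=3):
    """Return up to k (word, relative frequency) pairs for the given context."""
    followers = model.get(tuple(context))
    if not followers:
        return []  # context never seen in the training text
    total = sum(followers.values())
    return [(word, count / total) for word, count in followers.most_common(k)]


# Toy training text, standing in for a real corpus.
sample = "the cat sat on the mat and the cat ran"
model = build_model(tokenize(sample), 2)
print(predict(model, ["the"]))
```

In the toy text "cat" follows "the" twice and "mat" only once, so "cat" is ranked first. With a single novel as the corpus, many contexts are simply never seen and `predict` returns nothing, which is exactly why a much larger and more varied ngram source is attractive.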

In the event I cannot get my hands on Google's ngrams, or I cannot use them as I wish due to licensing constraints, I plan on training the model on recent e-texts available from Project Gutenberg. That should make for a decent ngram harvesting source, even if not quite as massive as Google's ngrams!
