ngrams for n diverging to infinity

That Google released a massive corpus of data based on the vast number of web pages that their GoogleBot crawler indexes is old news.

That the 6 DVDs that contain the data are now available for free to the first fifty universities that express their interest in obtaining them is news!

I pulled some strings to try and get my contacts within the Politecnico di Milano to request the DVDs for me.

I am planning on using the massive ngram database as data to train a better, more accurate, more knowledgeable language model for Soothsayer, the ubiquitous intelligent predictive framework.

I am counting on the fact that the ngrams harvested by Google are much more representative of general language use than the ngrams I currently use to train the frequentist language model. It should not be a wild assumption to make, as I am currently restricting the training corpus to a single work of fiction, Oscar Wilde's "The Picture of Dorian Gray".
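
To make the idea concrete, here is a minimal sketch of what a frequentist ngram model boils down to: count how often each word follows a given context in the training text, then rank candidate next words by relative frequency. This is not Soothsayer's actual code. The function names (tokenize, count_ngrams, predict) and the crude tokenizer are assumptions made purely for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def count_ngrams(tokens, n):
    # Map each (n-1)-word context to a Counter of the words that follow it.
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts


def predict(counts, context, k=3):
    # Return up to k (word, relative frequency) pairs for the given context,
    # most frequent first. Unseen contexts yield an empty list.
    followers = counts.get(tuple(context))
    if not followers:
        return []
    total = sum(followers.values())
    return [(word, c / total) for word, c in followers.most_common(k)]


sample = "the cat sat on the mat the cat ran"
bigrams = count_ngrams(tokenize(sample), 2)
suggestions = predict(bigrams, ["the"])
```

With a corpus this small the model only knows what it has seen: any context that never occurs in the training text yields no suggestions at all, which is exactly why a larger and more varied ngram source matters.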

In the event I cannot get my hands on Google's ngrams or I cannot use them as I wish due to licensing constraints, I plan on training the model on recent e-texts available from Project Gutenberg. That should make for a decent ngram harvesting source, even if not quite as massive as Google's ngrams!