Open Source Dictation: Language Model

A language model describes how likely a given succession of words is. For example, "now a daze" and "nowadays" are pronounced exactly the same, but context tells us that "Now a daze I have a smartphone" is far less likely than "Nowadays I have a smartphone". To model such contextual information, speech recognition systems usually use an n-gram model, which stores how likely a specific word is given the words that precede it.
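To make the n-gram idea concrete, here is a minimal sketch of a bigram model (n = 2) trained on a made-up three-sentence corpus. The corpus, the add-one smoothing and all function names are assumptions for illustration; real systems use much larger corpora, longer contexts and better smoothing.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) with add-one smoothing over the observed vocabulary."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_prob(sentence, unigrams, bigrams):
    """Probability of a whole sentence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word, unigrams, bigrams)
    return prob

# Hypothetical toy corpus, only for illustration.
toy_corpus = [
    "nowadays I have a smartphone",
    "nowadays I have a laptop",
    "I was in a daze",
]
unigrams, bigrams = train_bigram(toy_corpus)
p_good = sentence_prob("nowadays I have a smartphone", unigrams, bigrams)
p_bad = sentence_prob("now a daze I have a smartphone", unigrams, bigrams)
print(p_good > p_bad)
```

Even on this tiny corpus, the model assigns the "nowadays" sentence a higher probability than the "now a daze" one, which is exactly the preference the recognizer needs.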

When I compared the existing speech models, the Gigaword language model (with a 64000-word vocabulary) outperformed all other language models. I decided to try to improve upon that model.

Existing Language Models

To see what's what, I again set up a test set. This time, we only want to look at the language model performance with no influence from other components.

This is done by measuring the perplexity of the language model on an input text. The perplexity value basically tells you how "confused" the language model was when seeing the test text. Think about it like this: "The weather today is round" is certainly more confusing than "The weather is good". (For the more mathematically inclined: the perplexity is two to the power of the entropy of the test set.) Ideally, word successions that make sense would yield low perplexity and sentences that don't, very high ones. That way, the recognizer is discouraged from outputting hypotheses like "Now a daze I have a smartphone".
Additionally, we'll also be looking at "out of vocabulary" words. Naturally, a language model only contains information about a limited number of words. Words that are not in the LM cannot be recognized. Therefore, one might be tempted to use ever-growing vocabulary sizes to mitigate the issue. However, this makes the recognizer not only slower but also less accurate: because the language model encodes more transitions, the gap between common and very rare transitions also becomes smaller, increasing perplexity. Moreover, implementation details in CMU SPHINX discourage the use of vocabularies much larger than roughly 64000 words.
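Both measures are easy to state in code. The following is a minimal sketch, not the tooling used for the numbers in this post: it assumes you already have the log2 probability the model assigned to each test word, and the function names and example values are made up.

```python
import math

def perplexity(log2_probs):
    """Two to the power of the entropy, i.e. of the average
    negative log2 probability per test word."""
    entropy = -sum(log2_probs) / len(log2_probs)
    return 2 ** entropy

def oov_rate(tokens, vocabulary):
    """Percentage of test tokens that are missing from the vocabulary."""
    missing = sum(1 for token in tokens if token not in vocabulary)
    return 100.0 * missing / len(tokens)

# A model that gives every one of four test words probability 1/4
# is as "confused" as a uniform choice between four words.
uniform_log_probs = [math.log2(0.25)] * 4
print(perplexity(uniform_log_probs))

# One of four test tokens is unknown to this hypothetical vocabulary.
vocabulary = {"the", "weather", "is", "good"}
print(oov_rate(["the", "weather", "is", "round"], vocabulary))
```

A perplexity of 458.5 can therefore be read as: on average, the model was about as uncertain as if it had to choose uniformly between roughly 458 words at every position.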

Because I didn't need to record this test set, I elected to use a bigger one to get more accurate results. The test set therefore consists of 880 sentences from 495 chat messages, 169 email fragments, 175 sentences from scientific texts and 30 sentences from various news sources. The extracted sentences were not cleaned of names, specialized vocabulary and the like. Because we're aiming for dictation, I modified the corpora to use what's called "verbalized punctuation": punctuation marks are replaced with pseudo-words like ".period" or "(open-parenthesis" that represent what a user should say in the finished system.
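The verbalized punctuation step can be sketched roughly as follows. Only ".period" and "(open-parenthesis" are taken from the text above; the other pseudo-words and the function name are assumptions, and this naive version would also split decimals and abbreviations, which a real preprocessing step has to handle.

```python
import re

# Only ".period" and "(open-parenthesis" appear in the post;
# the remaining entries are guesses that follow the same pattern.
VERBALIZED = {
    ".": ".period",
    "(": "(open-parenthesis",
    ",": ",comma",                # assumed
    ")": ")close-parenthesis",    # assumed
    "?": "?question-mark",        # assumed
}

_PUNCT = re.compile("([" + re.escape("".join(VERBALIZED)) + "])")

def verbalize_punctuation(text):
    """Replace each known punctuation mark with its pseudo-word token."""
    spaced = _PUNCT.sub(lambda m: " " + VERBALIZED[m.group(1)] + " ", text)
    return " ".join(spaced.split())

print(verbalize_punctuation("Hello (world)."))
```

The pseudo-words then behave like ordinary vocabulary entries, so the language model learns where a speaker is likely to dictate a period or a parenthesis.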

To sum up: We are looking for a language model with around 64 thousand words that has the lowest possible perplexity score and the fewest out-of-vocabulary words on our given test set.

So, let's first again compare what's currently out there given our new test set.

Language model     OOVs [%]   Perplexity
HUB 4 (64k)        15.03      506.9
Generic (70k)      14.51      459.7
Gigaword (64k)      9.40      458.5

From this, we can already see why the Gigaword model performed much better than the other two language models in our earlier test. However, it still has almost 10% out-of-vocabulary words on our test set. To understand why, we need to look no further than what the corpus is built from: various newswire sources that are about a decade old by now. Given our problem domain for this experiment, this is obviously not ideal.

Can we do better?

Building a language model isn't exceptionally hard, but you need to find an extensive amount of data that should closely resemble what you want the system to recognize later on.

After a lot of experimenting (tests of all data sets below are available upon request), I settled on the following freely available data sources:

  • English Wikipedia
    Solid foundation over a diverse set of topics. Encyclopaedic writing style.
  • U.S. Congressional Record (2007)
    Somber discussions over a variety of topics using sophisticated vocabulary.
  • Corpus of E-Mails of Enron Employees
    Mixture of business and colloquial messages between employees.
  • Stack Exchange (split between Stack Overflow and all other sites)
    Questions and answers from experts over a variety of domains, many of which are technical (fitting our problem domain).
  • Open (dump graciously provided upon request)
    Everyday, spoken speech.
  • Newsgroups (alt.* with a few exceptions)
    Informal conversations.

I built a separate model for each of these corpora and then combined them into one large "ensemble" model, with mixture weights chosen to minimize the perplexity score on the test set. These mixture weights are visualized in the graph below.
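The post does not say how the mixture weights were computed. A common way to do it for linearly interpolated models is expectation maximization (EM) on the held-out text, since maximizing likelihood is the same as minimizing perplexity. The sketch below assumes that approach; the function name and the probabilities are made up.

```python
import math

def optimize_mixture_weights(component_probs, iterations=50):
    """component_probs[i][k] is the probability that model k assigns to the
    i-th held-out word. Returns interpolation weights found by EM (assumed
    method, not necessarily the one used for the ensemble model)."""
    num_models = len(component_probs[0])
    weights = [1.0 / num_models] * num_models
    for _ in range(iterations):
        totals = [0.0] * num_models
        for probs in component_probs:
            mixture = sum(w * p for w, p in zip(weights, probs))
            # Share of this word's probability contributed by each model.
            for k, (w, p) in enumerate(zip(weights, probs)):
                totals[k] += w * p / mixture
        weights = [t / len(component_probs) for t in totals]
    return weights

def log_likelihood(component_probs, weights):
    """Log2 likelihood of the held-out words under the interpolated model."""
    return sum(math.log2(sum(w * p for w, p in zip(weights, probs)))
               for probs in component_probs)

# Hypothetical per-word probabilities from two component models.
held_out = [[0.5, 0.1], [0.4, 0.1], [0.3, 0.2], [0.05, 0.3]]
weights = optimize_mixture_weights(held_out)
print(weights)
```

Each EM iteration can only keep or improve the likelihood of the held-out text, so the resulting weights are never worse than the uniform starting point.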

For each of the data sets, I also calculated word counts, selected the top 20000 to 35000 words (depending on the variability of the corpus) and removed duplicates to end up with a word list of about 136000 common words across the above corpora. I then further pruned this word list with a large dictionary of valid English words (more than 400000 entries) and manually removed some remaining entries such as foreign names to arrive at a list of around 65000 common English words, to which I limited the ensemble language model.
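The automatic part of this vocabulary selection can be sketched as follows. The corpus names, cut-offs and word lists are toy stand-ins, and the manual clean-up step is not modelled.

```python
from collections import Counter

def top_words(corpus_tokens, n):
    """The n most frequent words of one corpus."""
    return [word for word, _ in Counter(corpus_tokens).most_common(n)]

def build_vocabulary(corpora, top_n, valid_words):
    """corpora: name -> token list; top_n: name -> number of words to keep.
    Merges the per-corpus top lists (a set removes duplicates) and keeps
    only words found in the reference dictionary."""
    candidates = set()
    for name, tokens in corpora.items():
        candidates.update(top_words(tokens, top_n[name]))
    return sorted(candidates & valid_words)

# Hypothetical miniature corpora and dictionary.
corpora = {
    "wiki": "the cat sat the cat the".split(),
    "mail": "pls send pls send pls the".split(),
}
top_n = {"wiki": 2, "mail": 2}
valid_words = {"the", "cat", "sat", "send", "report"}
print(build_vocabulary(corpora, top_n, valid_words))
```

Taking the top words per corpus, rather than from one merged count, keeps a very large corpus from crowding out the characteristic vocabulary of the smaller ones.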

The end result is a model with significantly fewer out of vocabulary words and lower perplexity on our test set than the Gigaword corpus.

Language model     OOVs [%]   Perplexity
Gigaword (64k)      9.40      458.5
Ensemble (65k)      4.53      327.8

In order to perform recognition, we also need a phonetic dictionary. Of the 65k words in the ensemble language model, about 55k were already in the original CMU dictionary. The pronunciations for the remaining 10k words were (mostly) automatically synthesized with the CMU SPHINX g2p framework. While I was at it, I also applied the casing of the (conventional) dictionary to the (otherwise all uppercase) phonetic dictionary and language model. While a bit crude, this takes care of capitalizing names, languages, countries, etc.
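The casing step could look roughly like the sketch below. The data layout, the function name and the lowercase fallback for words missing from the conventional dictionary are assumptions; the post only states that the casing of the conventional dictionary was applied.

```python
def apply_casing(phonetic_entries, cased_words):
    """phonetic_entries: list of (UPPERCASE_WORD, pronunciation) pairs.
    cased_words: words in their conventional casing."""
    casing = {}
    for word in cased_words:
        # Keep the first casing seen; collisions such as "Polish"
        # versus "polish" are not resolved here (hence "a bit crude").
        casing.setdefault(word.upper(), word)
    # Assumption: words without a known casing fall back to lowercase.
    return [(casing.get(word, word.lower()), phones)
            for word, phones in phonetic_entries]

# Illustrative entries in the style of the CMU dictionary.
entries = [("PETER", "P IY T ER"), ("HOUSE", "HH AW S"), ("FOO", "F UW")]
print(apply_casing(entries, ["Peter", "house"]))
```

The same mapping has to be applied to the language model's word list so that both files keep using identical spellings.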

So how does our newly created language model perform compared to the next best thing we tested? On the same test, with the same acoustic model, we decreased our word error rate by almost 2 percentage points, a relative improvement of more than 5 percent.

Acoustic model        Dictionary     Language model   WER [%]
Voxforge 0.4 (cont)   cmudict 0.7    Gigaword 64k     31.02
Voxforge 0.4 (cont)   Ensemble 65k   Ensemble 65k     29.31
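The difference between the absolute and the relative improvement can be checked with a few lines of arithmetic on the two WER values from the table. The function name is made up for this sketch.

```python
def wer_improvement(baseline_wer, new_wer):
    """Absolute gain in percentage points and relative gain in percent."""
    absolute = baseline_wer - new_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative

absolute, relative = wer_improvement(31.02, 29.31)
print(f"{absolute:.2f} points absolute, {relative:.1f} % relative")
```

This gives about 1.71 percentage points, which is about 5.5 percent of the baseline error rate.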



Are you developing the code in a public branch?

Peter Grasch's picture

Yes, the "dictation" branch on Git. But there's not much on there so far.
The models themselves are not code but data.

I probably won't push the data sets anywhere because they are very large (several dozen GB) unless someone asks for them.
But the produced model (which is just a bit over 100 MB) will end up in a public place together with the other material produced at the end of this project.

Best regards,

Thank you!

Dear Peter,

I find this research of yours simply amazing. Speech-to-text is one of the most obvious gaping holes in the F/OSS ecosystem and I doubted anything would happen any time soon. I understand that you won't be able to present us with something like the ominous "Dragon," but your exploration thus far has produced quite a few things I didn't expect at all. I am stunned and really curious about where this will take you and us. Thank you for this work and for letting us take this walk with you through your blog.


Peter Grasch's picture

Thank you Mutlu. Appreciated.

I have waited a long time to see this happen in the Free Software world. Thank you so much!

Thanks for this incredibly interesting blog series. Please keep posting :-)

Hi Peter,

so is this the surprise you promised to us (on your previous web site, around December/January)? Cool :-) And a lot to do for one week - I'm still going through your blog posts.
What kind of computing power is available to you? These things must be quite computationally expensive - do you have access to the university cluster?

Best regards,


Oh - I just saw you're doing this all on your laptop? Unbelievable!

Peter Grasch's picture

Yes, a Lenovo X220t which I kitted out with 16GB of main memory - exactly for dealing with large language models :)

Best regards,

Hi Peter,

Did you use any text normalization techniques ( numbers -> words ) before getting text for language model creation? If so, what were they?

Peter Grasch's picture

No, not really. Numbers are actually not really accounted for in the current LM (they would also have to be treated differently when building the dictionary, as they should not necessarily have to meet the same frequency requirements).

It's one of the many open todos for future versions of the speech model.

Do you have any suggestions on this topic?

Hi, Peter,

First of all: good project!

In 1991 I worked on an LPC algorithm.
Do you know the method used to recognize phonemes?
(I understand that your project relates to the next step in the process of recognition.)

Is there already a model in Spanish for Simon?
I would be happy to assist in your project.

But I'm sure that it will extend more than a week ... :)

My greetings!

Peter Grasch's picture

Hi there!

The pocketsphinx decoder is based on HMMs. I'm unsure what exactly you'd want to know but if you have some in-depth question (e.g. about the decoding algorithms), I'm sure I could point you in the right direction.

There is a Spanish model for SPHINX which will work for Simon. It can be installed through the Simon interface for downloading new acoustic models (please refer to the manual for more information on that).

If you want to join the larger project, feel free to introduce yourself at the mailing list of the Open Speech Initiative:

Best regards,

hi, first, congrats for such a great work!
second, would it be possible to have access to your Ensemble (65k) LM?
thanks a lot,

Peter Grasch's picture

Of course. All data has been published in the follow up blog post:

Hi Grasch,
I am sorry if I am repeating any of the questions posted here.
I have downloaded your acoustic model and language model and made changes to my configuration file such that it picks the same.
I am unable to download your dictionary so I have used Voxforge dictionary with 130K words
But now I am experiencing a sequence of warnings (and the program ends in an Out of Memory error despite increasing heap memory to 1GB). A sample of them is below. Please help me in resolving this error.
19:15:55.192 WARNING dictionary Missing word: choristers
19:15:55.192 WARNING dictionary Missing word: choristers
19:15:55.192 WARNING dictionary Missing word: choristers
19:15:55.192 WARNING dictionary Missing word: choristers
19:15:55.192 WARNING dictionary Missing word: choristers
19:15:55.192 WARNING dictionary Missing word: choristers
19:15:55.202 WARNING dictionary Missing word: chronicon
19:15:55.202 WARNING dictionary Missing word: chronicon
19:15:55.202 WARNING dictionary Missing word: chronicon
19:16:06.654 WARNING dictionary Missing word: §paragraph-sign
19:16:06.654 WARNING dictionary Missing word: §paragraph-sign
19:16:06.654 WARNING dictionary Missing word: §paragraph-sign
19:16:06.654 WARNING dictionary Missing word: §paragraph-sign
19:16:06.654 WARNING dictionary Missing word: §paragraph-sign

Peter Grasch's picture

The archive you downloaded includes the dictionary (essential-sane-65k.fullCased). Use that.
And you may very well be running out of memory - 1GB is not that much...
