Open Source Dictation: Language Model

A language model assigns probabilities to word sequences. For example, “now a daze” and “nowadays” are pronounced exactly the same, but from context we know that “Now a daze I have a smartphone” is far less likely than “Nowadays I have a smartphone”. To model such contextual information, speech recognition systems usually use an n-gram model, which encodes how likely a specific word is given the words that precede it.
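To make the n-gram idea concrete, here is a minimal sketch of a bigram model (n = 2) with add-one smoothing. The three-sentence toy corpus and the function names are made up for illustration; real models are trained on far more text with better smoothing.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # (previous word, word) pairs
    return contexts, bigrams

def bigram_prob(contexts, bigrams, prev, word, vocab_size, alpha=1.0):
    """P(word | prev) with add-alpha smoothing, so unseen pairs get a small probability."""
    return (bigrams[(prev, word)] + alpha) / (contexts[prev] + alpha * vocab_size)

# Toy corpus (hypothetical, for illustration only)
corpus = [
    "nowadays i have a smartphone",
    "nowadays i have a laptop",
    "i was in a daze",
]
contexts, bigrams = train_bigram(corpus)
vocab_size = len({w for s in corpus for w in s.split()} | {"</s>"})

p_likely = bigram_prob(contexts, bigrams, "nowadays", "i", vocab_size)
p_unlikely = bigram_prob(contexts, bigrams, "daze", "i", vocab_size)
print(p_likely > p_unlikely)  # "i" is more probable after "nowadays" than after "daze"
```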

When comparing different existing models, the Gigaword language model (with a vocabulary of 64000 words) outperformed all other language models. I decided to try to improve upon that model.

Existing Language Models

To see what’s what, I again set up a test set. This time, we only want to look at the language model performance with no influence from other components.

This is done by measuring the perplexity of the language model given an input text. The perplexity value basically tells you how “confused” the language model was when seeing the test text. Think about it like this: “The weather today is round” is certainly more confusing than “The weather is good”. (For the more mathematically inclined: the perplexity is two to the power of the cross-entropy, in bits per word, of the model on the test set.) Ideally, word successions that make sense would yield low perplexity and sentences that don’t, very high ones. That way, the recognizer is discouraged from outputting hypotheses like “Now a daze I have a smartphone”.
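As a small worked example of that definition, the sketch below computes perplexity from the probability a model assigned to each word of a text. The probabilities are invented for illustration and do not come from any of the models discussed here.

```python
import math

def perplexity(word_probs):
    """Two to the power of the average negative log2 probability per word."""
    entropy_bits = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** entropy_bits

# Hypothetical per-word probabilities (made up for illustration)
plausible = [0.1, 0.05, 0.2, 0.1]            # e.g. "the weather is good"
confusing = [0.1, 0.05, 0.02, 0.2, 0.0001]   # e.g. "the weather today is round"

print(perplexity(plausible))
print(perplexity(confusing))
```

A model that spreads its probability evenly over a 64000 word vocabulary would assign every word a probability of 1/64000 and therefore end up with a perplexity of 64000, which gives a feeling for the scale of the numbers below.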
Additionally, we’ll also be looking at “out of vocabulary” (OOV) words. Naturally, a language model only contains information about a limited number of words. Words that are not in the LM cannot be recognized. Therefore, one might be tempted to use ever growing vocabulary sizes to mitigate the issue. However, this makes the recognizer not only slower but also less accurate: because the language model encodes more transitions, the gap between common and very rare transitions also becomes smaller, increasing perplexity. Moreover, implementation details in CMU SPHINX discourage the use of vocabularies much larger than roughly 64000 words.
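The out-of-vocabulary rate itself is simple to compute. A minimal sketch, assuming the test text is already tokenized and the vocabulary is available as a set (the function name and toy data are illustrative):

```python
def oov_rate(test_tokens, vocabulary):
    """Percentage of test tokens that do not appear in the vocabulary."""
    missing = sum(1 for token in test_tokens if token not in vocabulary)
    return 100.0 * missing / len(test_tokens)

# Toy data (hypothetical)
vocabulary = {"nowadays", "i", "have", "a"}
tokens = "nowadays i have a smartphone".split()
print(oov_rate(tokens, vocabulary))  # one of five tokens is unknown
```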

Because I didn’t need to record this test set, I elected to use a bigger one to get more accurate results. The test set therefore consists of 880 sentences from 495 chat messages, 169 email fragments, 175 sentences from scientific texts and 30 sentences from various news sources. The extracted sentences were not cleaned of, for example, names or specialized vocabulary. Because we’re aiming for dictation, I modified the corpora to use what’s called “verbalized punctuation”: punctuation marks are replaced with pseudo-words like “.period” or “(open-parenthesis” that represent what a user should say in the finished system.
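A rough sketch of such a replacement step is shown below. Only “.period” and “(open-parenthesis” are taken from this post; the other pseudo-word names are assumptions that follow the same pattern. The approach is deliberately naive and would, for example, also split the dot in “3.5”.

```python
import re

VERBALIZED = {
    ".": ".period",             # named in the post
    "(": "(open-parenthesis",   # named in the post
    ")": ")close-parenthesis",  # assumed name
    ",": ",comma",              # assumed name
    "?": "?question-mark",      # assumed name
}

# One character class that matches any of the punctuation marks above
_PUNCT = re.compile("([" + re.escape("".join(VERBALIZED)) + "])")

def verbalize(text):
    """Replace each punctuation mark with its pseudo-word as a separate token."""
    spaced = _PUNCT.sub(lambda m: " " + VERBALIZED[m.group(1)] + " ", text)
    return " ".join(spaced.split())  # collapse the extra whitespace

print(verbalize("Hello (world)."))
```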

To sum up: We are looking for a language model with around 64 thousand words that has the lowest possible perplexity score and the fewest out-of-vocabulary words on our given test set.

So, let’s first again compare what’s currently out there given our new test set.

Language model    OOVs [%]    Perplexity
HUB 4 (64k)       15.03       506.9
Generic (70k)     14.51       459.7
Gigaword (64k)     9.40       458.5

From this, we can already see why the Gigaword model performed much better than the other two language models in our earlier test. However, it still has almost 10% out-of-vocabulary words on our test set. To understand why, we have to look no further than what the corpus is built from: various news wire sources that are about a decade old by now. Given our problem domain for this experiment, this is obviously not ideal.

Can we do better?

Building a language model isn’t exceptionally hard, but you need a large amount of data that closely resembles what you want the system to recognize later on.

After a lot of experimenting (tests of all data sets below are available upon request), I settled on the following freely available data sources:

  • English Wikipedia
    Solid foundation over a diverse set of topics. Encyclopaedic writing style.
  • U.S. Congressional Record (2007)
    Somber discussions over a variety of topics using sophisticated vocabulary.
  • Corpus of E-Mails of Enron Employees
    Mixture of business and colloquial messages between employees.
  • Stack Exchange (split between Stack Overflow and all other sites)
    Questions and answers from experts over a variety of domains, many of which are technical (fitting our problem domain).
  • Open Subtitles.org (dump graciously provided upon request)
    Everyday, spoken speech.
  • Newsgroups (alt.* with a few exceptions)
    Informal conversations.

I built separate models for each of these corpora, which were then combined into one large “ensemble” model with mixture weights chosen to minimize the perplexity on the test set. These mixture weights are visualized in the graph below.
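The post does not say which tool computed the mixture weights. One common way to find weights for a linear interpolation of models is expectation maximization, sketched here under the assumption that each component model has already assigned a (non-zero) probability to every word of the tuning text. All names and numbers are illustrative.

```python
def optimize_mixture_weights(component_probs, iterations=50):
    """
    component_probs[k][i] is the probability model k assigns to word i
    of the tuning text. Returns one weight per model, summing to 1.
    """
    num_models = len(component_probs)
    num_words = len(component_probs[0])
    weights = [1.0 / num_models] * num_models  # start from a uniform mixture
    for _ in range(iterations):
        responsibility = [0.0] * num_models
        for i in range(num_words):
            # Probability of word i under the current mixture
            mix = sum(w * probs[i] for w, probs in zip(weights, component_probs))
            # Share of that probability contributed by each model
            for k in range(num_models):
                responsibility[k] += weights[k] * component_probs[k][i] / mix
        weights = [r / num_words for r in responsibility]
    return weights

# Hypothetical example: model A fits the tuning text much better than model B
model_a = [0.5, 0.4, 0.5, 0.3]
model_b = [0.01, 0.02, 0.01, 0.05]
weights = optimize_mixture_weights([model_a, model_b])
print(weights)  # nearly all weight goes to model A
```

Each iteration can only raise the likelihood of the tuning text under the mixture, which is the same as lowering its perplexity.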


For each of the data sets, I also calculated word counts, selected the top 20000 to 35000 words (depending on the variability of the corpus) and removed duplicates to end up with a word list of about 136000 common words across the above corpora. I then further pruned this word list with a large dictionary of valid English words (more than 400000 entries) and manually removed a few remaining entries such as foreign names to arrive at a list of around 65000 common English words, to which I limited the ensemble language model.
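The automatic part of this selection could look roughly like the following sketch (the manual cleanup is not modeled). The function name, the corpus names and the tiny toy data are placeholders, not the actual corpora or cut-offs.

```python
from collections import Counter

def build_vocabulary(corpora, top_n, valid_words):
    """
    corpora: dict mapping corpus name -> list of tokens
    top_n: dict mapping corpus name -> how many of its top words to keep
    valid_words: set of words from a reference dictionary
    """
    candidates = set()  # a set removes duplicates across corpora
    for name, tokens in corpora.items():
        counts = Counter(tokens)
        candidates.update(word for word, _ in counts.most_common(top_n[name]))
    # Keep only candidates that the reference dictionary knows
    return candidates & valid_words

# Toy data (hypothetical)
corpora = {
    "wiki": "the cat sat on the mat the cat".split(),
    "mail": "pls send the report pls send".split(),
}
top_n = {"wiki": 2, "mail": 2}
valid_words = {"the", "cat", "send", "report"}
print(sorted(build_vocabulary(corpora, top_n, valid_words)))
```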

The end result is a model with significantly fewer out-of-vocabulary words and lower perplexity on our test set than the Gigaword model.

Language model    OOVs [%]    Perplexity
Gigaword (64k)    9.40        458.5
Ensemble (65k)    4.53        327.8

In order to perform recognition, we also need a phonetic dictionary. Of the 65k words in the ensemble language model, about 55k were already in the original CMU dictionary. The pronunciations for the remaining 10k words were (mostly) automatically synthesized with the CMU SPHINX g2p framework. While I was at it, I also applied the casing of the (conventional) dictionary to the (otherwise all uppercase) phonetic dictionary and language model. While a bit crude, this takes care of e.g., uppercasing names, languages, countries, etc.
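The casing step can be pictured as a simple lookup, sketched below. This is not the script used for the post: the function name is hypothetical, the pronunciations are only illustrative, and lowercasing words that are missing from the cased word list is an assumption.

```python
def apply_casing(phonetic_entries, cased_words):
    """
    phonetic_entries: dict mapping UPPERCASE word -> pronunciation string
    cased_words: iterable of words in their conventional casing
    """
    casing = {}
    for word in cased_words:
        # First spelling wins if a word appears with several casings (crude)
        casing.setdefault(word.upper(), word)
    return {
        casing.get(word, word.lower()): pronunciation  # assumption: lowercase if unknown
        for word, pronunciation in phonetic_entries.items()
    }

# Illustrative entries only
phonetic = {"PETER": "P IY T ER", "HOUSE": "HH AW S", "XYZZY": "Z IH Z IY"}
print(apply_casing(phonetic, ["Peter", "house"]))
```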

So how does our newly created language model perform compared to the next best thing we tested? On the same test, with the same acoustic model, we decreased our word error rate by 1.71 percentage points (from 31.02 % to 29.31 %), a relative improvement of more than 5 percent.

Acoustic model         Dictionary      Language model    WER [%]
Voxforge 0.4 (cont)    cmudict 0.7     Gigaword 64k      31.02
Voxforge 0.4 (cont)    Ensemble 65k    Ensemble 65k      29.31
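For reference, the word error rate is the word-level edit distance between the reference text and the recognizer output, divided by the number of reference words. The sketch below is a generic implementation, not the scoring tool used for these results, followed by the arithmetic behind the relative improvement.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length, in percent."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(ref, 1):
        current = [i]
        for j, hyp_word in enumerate(hyp, 1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous = current
    return 100.0 * previous[-1] / len(ref)

wer = word_error_rate("nowadays i have a smartphone",
                      "now a daze i have a smartphone")
print(wer)  # one substitution and two insertions over five reference words

# Relative improvement between the two results reported in the table
baseline, ensemble = 31.02, 29.31
relative_improvement = 100.0 * (baseline - ensemble) / baseline
print(round(relative_improvement, 2))
```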

Peter Grasch


  1. Yes, the “dictation” branch on Git. But there’s not much on there so far.
    The models themselves are not code but data.

    I probably won’t push the data sets anywhere because they are very large (several dozen GB) unless someone asks for them.
    But the produced model (which is just a bit over 100 MB) will end up in a public place together with the other material produced at the end of this project.

    Best regards,

  2. Dear Peter,

    I find this research of yours simply amazing. Speech-to-text is one of the most obvious gaping holes in the F/OSS ecosystem and I doubted anything would happen any time soon. I understand that you won’t be able to present us with something like the ominous “Dragon,” but your exploration thus far has produced quite a few things I didn’t expect at all. I am stunned and really curious about where this will take you and us. Thank you for this work and for letting us take this walk with you through your blog.


  3. Thanks for this incredibly interesting blog series. Please keep posting 🙂

  4. Hi Peter,

    so is this the surprise you promised to us (on your previous web site, around December/January)? Cool 🙂 And a lot to do for one week – I’m still going through your blog posts.
    What kind of computing power is available to you? These things must be quite computationally expensive – do you have access to the university cluster?

    Viele Grüße,


  5. Oh – I just saw you’re doing this all on your laptop? Unbelievable!

    • Yes, a Lenovo X220t which I kitted out with 16GB of main memory – exactly for dealing with large language models 🙂

      Best regards,

  6. Hi Peter,

    Did you use any text normalization techniques ( numbers -> words ) before getting text for language model creation? If so, what were they?

    • No, not really. Numbers are actually not really accounted for in the current LM (they would have to be treated differently when building the dictionary as well as they should not necessarily need to adhere to the same frequency requirements).

      It’s one of the many open todos for future versions of the speech model.

      Do you have any suggestions on this topic?

      • Hi, Peter,

        First of all: good project!

        In 1991 I worked on an LPC algorithm.
        Do you know the method used to recognize phonemes ?
        (I understand that your project relates to the next step in the process of recognition.)

        Is there already a model in Spanish for Simon?
        I would be happy to assist in your project.

        But I’m sure that it will extend more than a week … 🙂

        My greetings!

        • Hi there!

          The pocketsphinx decoder is based on HMMs. I’m unsure what exactly you’d want to know but if you have some in-depth question (e.g. about the decoding algorithms), I’m sure I could point you in the right direction.

          There is a Spanish model for SPHINX which will work for Simon. It can be installed through the Simon interface for downloading new acoustic models (please refer to the manual for more information on that).

          If you want to join the larger project, feel free to introduce yourself at the mailing list of the Open Speech Initiative: https://mail.kde.org/mailman/listinfo/kde-speech

          Best regards,

          • hi, first, congrats for such a great work!
            second, would it be possible to have access to your Ensemble (65k) LM?
            thanks a lot,

  7. Hi Grasch,
    I am sorry if I am repeating any of the questions posted here.
    I have downloaded your acoustic model and language model and changed my configuration file so that it picks them up.
    I am unable to download your dictionary, so I have used the Voxforge dictionary with 130K words.
    But now I am getting a sequence of warnings (and the program ends in an out-of-memory error despite increasing the heap memory to 1GB). A sample is below. Please help me resolve this error.
    19:15:55.192 WARNING dictionary Missing word: choristers
    19:15:55.192 WARNING dictionary Missing word: choristers
    19:15:55.192 WARNING dictionary Missing word: choristers
    19:15:55.192 WARNING dictionary Missing word: choristers
    19:15:55.192 WARNING dictionary Missing word: choristers
    19:15:55.192 WARNING dictionary Missing word: choristers
    19:15:55.202 WARNING dictionary Missing word: chronicon
    19:15:55.202 WARNING dictionary Missing word: chronicon
    19:15:55.202 WARNING dictionary Missing word: chronicon
    19:16:06.654 WARNING dictionary Missing word: â§paragraph-sign
    19:16:06.654 WARNING dictionary Missing word: â§paragraph-sign
    19:16:06.654 WARNING dictionary Missing word: â§paragraph-sign
    19:16:06.654 WARNING dictionary Missing word: â§paragraph-sign
    19:16:06.654 WARNING dictionary Missing word: â§paragraph-sign

    • The archive you downloaded includes the dictionary (essential-sane-65k.fullCased). Use that.
      And you may very well be running out of memory – 1GB is not that much…
