Open Source Dictation: Acoustic Model

After working a bit on the language model last week, I spent some time improving the acoustic model, which, simply put, is the representation of how spoken words actually sound.

Improving the general model

So far, the best acoustic model for my test set was Nickolay’s Voxforge 0.4 model built from the Voxforge database.

The Voxforge corpus is available under the terms and conditions of the GPL, which means that I was free to try to improve upon that model. Sadly, I quickly realized that, given the tight time constraints of this project, it was computationally infeasible to run many experiments: the training procedure takes around 24 hours to complete on my laptop – the fastest machine at my disposal.

Because of that, I was not able to try some very interesting approaches like vocal tract length normalization (which tries to account for differences in the resonance properties of the varyingly long vocal tracts of different speakers) or MMIE training, although both have been shown to improve word error rates. I was also not able to fine-tune the number of senones used or to clean the training database with forced alignment. Such experiments will have to wait until after the completion of this project – there’s definitely quite a bit of low-hanging fruit.

However, I was still able to boost recognition rates simply by rebuilding the existing Voxforge model to incorporate all the new training data submitted since the last model was created in 2010.

Acoustic model | Dictionary   | Language model | WER
Voxforge 0.4   | Ensemble 65k | Ensemble 65k   | 29.31 %
Voxforge new   | Ensemble 65k | Ensemble 65k   | 27.79 %

This also nicely shows what an impact a growing database of recordings has on recognition accuracy. If you want to help drop that WER further, help today!
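For readers unfamiliar with the metric: WER is the word-level edit distance between the reference transcript and the recognizer’s hypothesis, divided by the number of reference words. A minimal sketch (my own illustration, not the scoring tool used for these experiments):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(h) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            substitution = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)
```

Because insertions are counted too, WER can exceed 100 % on a bad enough hypothesis.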

Adapting the model to my voice

Of course, when building a dictation system for myself, it would be foolish not to adapt this general acoustic model to my own voice. Model adaptation is a fairly sure-fire way to dramatically improve recognition accuracy.

To this end, I recorded about 2 hours’ worth of adaptation data (1,500 recordings). Thanks to Simon’s power training feature, this took only a single afternoon – despite frequent breaks.

I then experimented with MLLR and MAP adaptation with a range of parameters. Although I fully expected this to make a big difference, the actual result is astonishing: the word error rate on the test set drops to almost half – about 15 %.
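To give a feel for what MAP adaptation does, here is the textbook update for a single Gaussian mean: the speaker-independent mean is blended with the statistics collected from the adaptation recordings, weighted by how much adaptation data actually touched that Gaussian. This is only a conceptual sketch (the variable names and the relevance factor `tau` are illustrative), not the actual training-tool implementation:

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP re-estimate of one Gaussian mean.

    prior_mean: speaker-independent mean, shape (D,)
    frames:     adaptation feature vectors, shape (T, D)
    posteriors: occupation probabilities gamma_t for this Gaussian, shape (T,)
    tau:        relevance factor -- larger values trust the prior more
    """
    occupancy = posteriors.sum()        # total soft count for this Gaussian
    weighted_sum = posteriors @ frames  # sum_t gamma_t * x_t
    return (tau * prior_mean + weighted_sum) / (tau + occupancy)
```

With little adaptation data the estimate stays close to the speaker-independent mean; with lots of data it converges to the speaker’s own statistics – which is why MAP keeps improving as you record more.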

Acoustic model                         | Dictionary   | Language model | WER
Voxforge new                           | Ensemble 65k | Ensemble 65k   | 27.79 %
Voxforge new; MAP-adapted to my voice  | Ensemble 65k | Ensemble 65k   | 15.42 %

Because I optimized the adaptation parameters to achieve the lowest possible error rate on the test set, I could potentially have found a configuration that performs well on the test set but not in the general case.
To ensure that this is not the case, I also recorded an evaluation set consisting of 42 sentences taken from a blog post from the beginning of this series, an email, and some old chat messages I wrote on IRC. In contrast to the original test set, this time I am also using vocalized punctuation in the recordings I’m testing – simulating the situation where I would use the finished dictation system to write these texts. This also better matches what the language model was built for. The end result of this synergy? A 13.30 % word error rate on the evaluation set:

Recognition results when dictating a blog post

what speech recognition application are you most looking forward to ?question-mark
with the rising popularity of speech recognition in cars and mobile devices it’s not hard to see that we’re on the cost of making speech recognition of first class input method or across all devices .period
however ,comma it shouldn’t be forgotten that what we’re seeing in our smart phones on laptops today is merely the beginning .period
I am convinced that we will see much more interesting applications off speech recognition technologies in the future .period
so today ,comma I want to ask :colon
what application of speech recognition technology an id you’re looking forwards to the most ?question-mark
for me personally ,comma I honestly wouldn’t know where to begin .period
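The vocalized-punctuation tokens in the transcript above (e.g. `,comma`, `.period`) would be post-processed into actual punctuation before the text reaches the user. A minimal, hypothetical mapping – the token names follow the output shown above, not any particular tool’s conventions:

```python
# Hypothetical spoken-punctuation tokens -> symbols, as they appear above.
PUNCT = {",comma": ",", ".period": ".", ":colon": ":", "?question-mark": "?"}

def depunctuate(hypothesis):
    """Replace vocalized-punctuation tokens, attaching them to the previous word."""
    words = []
    for token in hypothesis.split():
        if token in PUNCT:
            if words:
                words[-1] += PUNCT[token]  # glue ", . : ?" onto the preceding word
            else:
                words.append(PUNCT[token])
        else:
            words.append(token)
    return " ".join(words)
```

For example, `depunctuate("so today ,comma I want to ask :colon")` yields `"so today, I want to ask:"`.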


Peter Grasch


  1. WER objective

    I’ve been following your posts on this subject with great interest.

    While my interest in this area is more along the lines of voice commands (e.g. “next page”, “play video X”, “pause video”, “show news feeds”, “start call to X”, “start chat to X”, “play song X”, “search for X”, “note to self BLA BLA BLA”) and not dictation, per se, having a functional dictation system would be 99.9% of the way to having a functional verbal commands system.

    I’ve tried some canned (quasi-)solutions before, and even hacked something together myself using Voxforge, but I always stumbled on the voice recognition error rates. Even under ideal conditions (no background noise, headset microphone, speaking very clearly and with pauses) it worked unreliably (I had to repeat commands frequently). Under relaxed couch conditions (some background noise, open microphone, speaking normally) it was mostly a failure.

    From your post you are closing in on 10% WER (under optimal conditions, I assume), so I was wondering: what is your WER objective? What is the maximum WER you think is tolerable for a dictation system? And how tolerant of sub-optimal conditions do you want/need it to be?

    Finally, thanks for your work and for keeping us informed.

    p.s. I wonder if we will be able to have a HAL 9000 like interaction (minus the murderous intent) in our life time.

    • Re: WER objective

      Hi Artur,

      Actually, Simon already does very well when it comes to voice control. You really should try it out. For simple commands (say, around 25), you should easily hit > 95 % accuracy with training – even in sub-optimal conditions (it depends on a range of factors, of course).

      For dictation, I think it really depends on the use-case. For e.g. mobile devices even systems with ~ 20 % WER can be useful in a pinch. There is no percentage that suddenly becomes “tolerable”. It also depends a lot on how easy it is to correct inevitable mistakes.

      However, you’re of course correct that I am using optimal conditions for this experiment. I’m using a high quality microphone, no background noise, etc. Again, this is meant as a demonstration, not a product (which would need to incorporate e.g., noise cancellation). I’ll talk about the steps to turn this into a finished product in another post (probably next week).

      As for HAL 9000: I definitely think we’ll see more and more AI creep up in everyday gadgets. Can’t promise it won’t have murderous intent, though.

      Best regards,

  2. Hi Peter,

    Thanks for building such an awesome product. I am following your posts to research speech recognition technology. In this post, you mentioned the WER can be improved to 27.79 % with the “Voxforge new” model. But the only Voxforge model I can find on the internet is 0.4. Can you kindly let me know where I can find “Voxforge new”? Billions of thanks in advance!


  3. Source Code

    Is there any chance you will be posting the source code? If not, is there anywhere I can find a good tutorial on how to use sphinx4 with Voxforge in a setup similar to yours? I want to set up a transcriber and the sphinx4 tutorials aren’t overly helpful.

    Here are a few of the resources I looked over:


    Then, I read over the source in https://www.assembla.com/code/sonido/subversion/nodes/3/sphinx4/src/apps

    Any help would be great Peter, thanks in advance.

  4. I posted the source code

    I posted the source code (everything was developed in the open from the get-go), but I’m uncertain how that may be helpful for you at this stage.

    For SPHINX support, please contact their support team. I especially recommend the #cmusphinx channel on freenode for any questions you have.

    Best regards,
