Open Source Dictation: Acoustic Model

After working a bit on the language model last week, I spent some time improving the acoustic model, which, simply put, is the representation of how spoken words actually sound.

Improving the general model

So far, the best acoustic model for my test set was Nickolay’s Voxforge 0.4 model built from the Voxforge database.

The Voxforge corpus is available under the terms of the GPL, which means I was free to try to improve upon that model. Sadly, I quickly realized that, given the tight time constraints of this project, it was computationally infeasible to run many experiments: the training procedure takes around 24 hours to complete on my laptop – the fastest machine at my disposal.

Because of that, I was not able to try some very interesting approaches such as vocal tract length normalization (which tries to account for differences in the resonance properties of vocal tracts of different lengths across speakers) or MMIE training, even though both have been shown to improve word error rates. I was also unable to fine-tune the number of senones used or to clean the training database with forced alignment. Such experiments will have to wait until after the completion of this project – there is definitely quite a bit of low-hanging fruit.

However, I was still able to boost recognition rates simply by rebuilding the existing Voxforge model to incorporate all the new training data submitted since the last model was created in 2010.

Acoustic model | Dictionary   | Language model | WER
Voxforge 0.4   | Ensemble 65k | Ensemble 65k   | 29.31 %
Voxforge new   | Ensemble 65k | Ensemble 65k   | 27.79 %

This also nicely shows what an impact a growing database of recordings has on recognition accuracy. If you want to help drive that WER down further, you can help by contributing recordings today!
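For reference, the WER figures above are the standard word-level edit distance between the reference transcript and the recognizer output, divided by the length of the reference. A minimal sketch of that computation:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference length,
    computed as a Levenshtein distance over words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of six:
print(round(wer("so today I want to ask", "so today I want to task"), 3))  # → 0.167
```

Real evaluations are usually run with a scoring tool rather than by hand, but the metric itself is no more complicated than this.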

Adapting the model to my voice

Of course, when building a dictation system for myself, it would be foolish not to adapt this general acoustic model to my own voice. Model adaptation is a fairly sure-fire way to dramatically improve recognition accuracy.

To this end, I recorded about 2 hours' worth of adaptation data (1,500 recordings). Thanks to Simon's power training feature, this took only a single afternoon – despite frequent breaks.

I then experimented with MLLR and MAP adaptation with a range of parameters. Although I fully expected this to make a big difference, the actual result is astonishing: the word error rate on the test set drops to almost half – about 15 %.
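The adaptation itself was done with the acoustic-model tooling, but the idea behind MAP adaptation of a Gaussian mean is simple: interpolate between the speaker-independent prior and the statistics collected from the adaptation data, weighted by how much data you have. A toy sketch of that update (the function name and the prior weight `tau` are illustrative, not any tool's actual API; real systems weight frames by state occupancy rather than hard counts):

```python
import numpy as np

def map_update_mean(prior_mean, frames, tau=10.0):
    """MAP re-estimation of a Gaussian mean: a weighted interpolation
    between the prior mean and the sample mean of the adaptation frames.
    tau controls how strongly the prior resists the new data."""
    frames = np.asarray(frames, dtype=float)
    n = len(frames)
    sample_mean = frames.mean(axis=0)
    return (tau * prior_mean + n * sample_mean) / (tau + n)
```

With only a handful of adaptation frames the updated mean stays close to the speaker-independent prior; as the amount of data grows, it converges to the speaker's own sample mean – which is why two hours of recordings move the needle so much.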

Acoustic model                        | Dictionary   | Language model | WER
Voxforge new                          | Ensemble 65k | Ensemble 65k   | 27.79 %
Voxforge new; MAP-adapted to my voice | Ensemble 65k | Ensemble 65k   | 15.42 %

Because I optimized the adaptation parameters to achieve the lowest possible error rate on the test set, I could have found a configuration that performs well on the test set but not in the general case.
To rule this out, I also recorded an evaluation set consisting of 42 sentences from a blog post from the beginning of this series, an email, and some old chat messages I wrote on IRC. In contrast to the original test set, these recordings use vocalized punctuation – simulating the situation where I would use the finished dictation system to write these texts. This also better matches what the language model was built for. The end result: 13.30 % word error rate on the evaluation set:

Recognition results when dictating a blog post

what speech recognition application are you most looking forward to ?question-mark
with the rising popularity of speech recognition in cars and mobile devices it’s not hard to see that we’re on the cost of making speech recognition of first class input method or across all devices .period
however ,comma it shouldn’t be forgotten that what we’re seeing in our smart phones on laptops today is merely the beginning .period
I am convinced that we will see much more interesting applications off speech recognition technologies in the future .period
so today ,comma I want to ask :colon
what application of speech recognition technology an id you’re looking forwards to the most ?question-mark
for me personally ,comma I honestly wouldn’t know where to begin .period


Peter Grasch