Results are in: One Open Source Dictation System Coming Up

Thanks to everyone who participated in the poll last week about what speech recognition project you'd most like to see.
The week is over, and Dictation has emerged as a clear winner!

As promised, I'll now try to build a proof-of-concept prototype of such a system in time for this year's Akademy.

With a system as complex as continuous dictation, there are obviously a wide range of challenges.
Here are just a few of the problems I'll need to tackle in the next two weeks:

  • Acoustic model: The obvious elephant in the room: Any good speech recognition system needs an accurate representation of how it expects users to pronounce the words in its dictionary.
  • Language model: "English" is simply not good enough - or when have you last tried to write "Huzzah!"? We not only need to restrict the vocabulary to a sensible subset but also gain a pretty good understanding of what a user might intend to write. This is important not only to avoid computationally prohibitive vocabulary sizes but also to differentiate between, e.g., "@ home" and "at home".
  • Dictation application: Even given a perfect speech recognition system, dictation is still a bit off. You'll also need software that takes the recognition result, applies formatting (casing, etc.), and allows users to correct recognition mistakes, change the structure, and so on.
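To make the "@ home" vs. "at home" point above concrete, here is a minimal sketch of the statistical idea behind a language model: a bigram model trained on a tiny toy corpus (the corpus and the add-one smoothing are purely illustrative, not the model the prototype will use).

```python
from collections import Counter

# Toy corpus standing in for real training text (purely illustrative).
corpus = "i am at home . she is at home . email me at home @ example".split()

# Count unigrams and bigrams.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing over the toy vocabulary."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# The model prefers "at home" over "@ home", because "at home" actually
# occurs in the training text while "@ home" does not.
p_at = bigram_prob("at", "home")   # high: "at home" seen 3 times
p_sym = bigram_prob("@", "home")   # low: "@ home" never seen
```

A real system would use higher-order n-grams trained on millions of sentences, but the decoding principle is the same: among acoustically similar hypotheses, pick the one the language model scores highest.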

Obviously, I won't be able to solve all of these issues in this short time frame, but I'll do my very best to show off a presentable prototype that addresses all of these areas. Watch this blog for updates over the coming weeks!



Hi :-)
It was never clear to me how to produce - or perfect - an acoustic model. What is the best way to obtain something usable with most free software speech recognition systems?

Peter Grasch:

I'll do a blog post about the creation of the acoustic model that I'll be using for this prototype soon.

In the meantime, have a look at the VoxForge tutorials on model creation.

Basically, you'll have the choice between SPHINX and HTK / Julius; I'd recommend the SPHINX route (as HTK is non-free and SPHINX is more actively developed). Then you compile a corpus of training data, define environment parameters (the dictionary, for example), and let the machine learning algorithms do their job. Of course there's some serious math involved if you want to understand what's actually going on, but that's pretty much the gist of it.

The linked tutorials seem complicated to implement; a condensed version would be useful (first of all to non-technical potential contributors who can't follow the math but can produce voice recordings).
Anyway, my main doubt was just about whether actual recordings of read text are needed, and you've answered that, thanks :-)

Current state-of-the-art speech recognition algorithms rely on taking an FFT of the audio stream, splitting it into slices of some fixed length (e.g. 10 ms), and applying deep neural networks to the resulting slices (e.g. maxout networks, autoencoders, RBMs, etc.).
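The frontend described above can be sketched in a few lines of numpy. This is a minimal illustration, assuming 16 kHz audio and non-overlapping 10 ms frames (numbers not from the thread); real frontends add frame overlap and usually mel filterbanks / MFCCs on top of the raw spectra.

```python
import numpy as np

sample_rate = 16000             # Hz, a common rate for speech (assumption)
frame_len = sample_rate // 100  # 10 ms -> 160 samples per frame

# One second of synthetic "audio": a 440 Hz tone standing in for a mic signal.
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440 * t)

# Split into non-overlapping 10 ms frames.
n_frames = len(audio) // frame_len
frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)

# Magnitude FFT of each (Hann-windowed) frame.
spectra = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))

# 'spectra' is an (n_frames x 81) array of per-slice spectra - the kind of
# fixed-size input a neural network classifier would consume.
```

With 160-sample frames the FFT bins are 100 Hz apart, so the 440 Hz tone shows up as a peak around the fourth or fifth bin of every frame.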

My guess is that - since regular laptops/desktops can't afford the processing time of really deep networks - a big accuracy gain could be had, relative to its cost in CPU time, by intelligently splitting the FFT stream at phoneme boundaries and resizing the segments to a fixed length using bi-linear filtering.

Time permitting, I'd be glad to help out with the machine learning portion!

Oh, and for a huge corpus, movies and subtitles might work ;)

Oh, and another thing: I believe the CPU cost saved by cropping the FFT 'image' to only contain 200-4000 Hz (where most voice data lives) could allow a deeper network, which would lead to even better performance :)
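The cropping idea amounts to discarding FFT bins outside the speech band before the classifier ever sees them. A minimal sketch, again assuming 16 kHz audio and 10 ms frames (my numbers, not the thread's):

```python
import numpy as np

sample_rate = 16000
frame_len = 160  # 10 ms frames

# Centre frequency of each rfft bin: 0, 100, 200, ... 8000 Hz.
freqs = np.fft.rfftfreq(frame_len, d=1 / sample_rate)

# Boolean mask selecting only the 200-4000 Hz speech band.
band = (freqs >= 200) & (freqs <= 4000)

# Crop one frame's magnitude spectrum (random data as a placeholder signal).
spectrum = np.abs(np.fft.rfft(np.random.randn(frame_len)))
cropped = spectrum[band]  # 39 bins instead of 81 -> smaller, cheaper input
```

Halving the input dimensionality this way shrinks the network's first layer, which is where the hoped-for CPU savings would come from.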

Peter Grasch:

Low- and high-pass filters are extremely common preprocessing steps in speech recognition systems, yes.

Peter Grasch:

Yes, neural networks are making a comeback in several applications - including ASR.
However, I am not aware of any such systems having been released under a free license so far. As I'm intending to showcase what's possible *today* (and only gave myself a week's worth of time), developing a new recognizer and a model for that recognizer is sadly not an option (also because one of the drawbacks of DNN-based approaches is the need for a *significant* amount of processing power).
Writing a DNN-based recognizer would be an interesting long-term project, though. Feel free to contact me if you want to work on this.

@Corpus: Using movies will not only give you copyright problems but also means you'll be working with extreme amounts of background noise, digital post-processing, etc. Subtitles are also legally troublesome in theory, but given that the original text can no longer be retrieved after building a language model, and that the actual files used are fan-made, I really don't think publishers would mind. But yeah, I had the same idea a while back and already ran some experiments on a corpus of > 200k movie subtitles - I'll blog about that as well.
