Results are in: One Open Source Dictation System Coming Up

Thanks to everyone who participated in the poll last week about what speech recognition project you'd most like to see.
The week is over, and Dictation has emerged as a clear winner!

As promised, I'll now try to build a proof-of-concept prototype of such a system in time for this year's Akademy.

With a system as complex as continuous dictation, there is obviously a wide range of challenges.
Here are just a few of the problems I'll need to tackle in the next two weeks:

  • Acoustic model: The obvious elephant in the room: Any good speech recognition system needs an accurate representation of how it expects users to pronounce the words in its dictionary.
  • Language model: "English" is simply not good enough - or when did you last try to write "Huzzah!"? We not only need to restrict the vocabulary to a sensible subset but also gain a pretty good understanding of what a user might intend to write. This is important not only to avoid computationally prohibitive vocabulary sizes but also to differentiate, e.g., "@ home" and "at home".
  • Dictation application: Even given a perfect speech recognition system, dictation is still some way off. You'll also need some form of software that handles the recognition result, applies formatting (casing, etc.), and allows users to correct recognition mistakes, change the structure, etc.
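
To make the "@ home" versus "at home" point a bit more concrete, here is a minimal sketch of how a bigram language model could prefer one spelling over the other depending on the preceding word. The three-sentence corpus and the function names are made up for illustration; a real model would be trained on vastly more text and would not be hand-rolled like this.

```python
from collections import Counter

# Tiny made-up corpus, purely for illustration.
CORPUS = [
    "i am at home today",
    "she stayed at home",
    "send it to peter @ home dot org",
]

def train_bigrams(sentences):
    """Count contexts, bigrams and the vocabulary over whitespace tokens."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        vocab.update(tokens)
        contexts.update(tokens[:-1])          # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab

def bigram_prob(prev, word, contexts, bigrams, vocab):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def pick(prev, candidates, contexts, bigrams, vocab):
    """Return the candidate the model finds most likely after prev."""
    return max(candidates, key=lambda w: bigram_prob(prev, w, contexts, bigrams, vocab))

contexts, bigrams, vocab = train_bigrams(CORPUS)
print(pick("stayed", ["at", "@"], contexts, bigrams, vocab))
print(pick("peter", ["at", "@"], contexts, bigrams, vocab))
```

Both candidates sound identical to the acoustic model, so only the surrounding words can tip the balance.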

Obviously, I won't be able to solve all these issues in this short time frame, but I'll do my very best to show off a presentable prototype that addresses all these areas. Watch this blog for updates over the coming weeks!


Comments

Hi :-)
It was never clear to me how to produce, or perfect, an acoustic model. What is the best way to obtain something usable with most free-software speech recognition programs?

Peter Grasch's picture

I'll do a blog post about the creation of the acoustic model that I'll be using for this prototype soon.

In the meantime, have a look at the VoxForge tutorials on model creation.

Basically, you have the choice between SPHINX and HTK / Julius; I'd recommend the SPHINX route (as HTK is non-free and SPHINX is more actively developed). Then you compile a corpus of training data, define environment parameters (the dictionary, for example), and let the machine learning algorithms do their job. Of course, there's some major maths involved if you want to understand what's actually going on, but that's pretty much the gist of it.
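
As a small illustration of the "dictionary" part: before training, it helps to check that every word in the training transcripts actually has a pronunciation. The sketch below assumes a CMU-style text dictionary (one word per line followed by its phones, with alternates marked like "hello(2)"); the entries, transcripts and function names are invented for the example and are not taken from any particular toolkit.

```python
import re

# Made-up dictionary excerpt in the assumed "WORD PHONE PHONE ..." layout.
DICTIONARY_TEXT = """\
hello HH AH L OW
hello(2) HH EH L OW
world W ER L D
"""

def load_dictionary(text):
    """Map each word to a list of pronunciations (each a list of phones)."""
    pronunciations = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        word, *phones = line.split()
        word = re.sub(r"\(\d+\)$", "", word)  # "hello(2)" is an alternate of "hello"
        pronunciations.setdefault(word, []).append(phones)
    return pronunciations

def missing_words(transcripts, pronunciations):
    """Return the transcript words that have no dictionary entry, sorted."""
    words = {w for line in transcripts for w in line.lower().split()}
    return sorted(words - set(pronunciations))

lexicon = load_dictionary(DICTIONARY_TEXT)
print(missing_words(["hello world", "hello akademy"], lexicon))
```

Any word reported here would need a pronunciation added (or its recordings dropped) before the training tools can make use of it.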

The linked tutorials seem complicated to follow; a condensed version would be useful (above all for non-technical potential contributors who can't follow the math but can produce voice recordings).
Anyway, my main doubt was whether actual recordings of read text are needed, and you cleared that up, thanks :-)

The current state-of-the-art speech recognition algorithms rely on doing an FFT of the audio stream, splitting it into slices of some fixed length (e.g. 10 ms), and applying deep neural networks to the resulting slices (maxout networks, autoencoders, RBMs, etc.).
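
For anyone who wants to see the slicing step spelled out, here is a rough, standard-library-only sketch: it cuts a signal into 10 ms frames and computes a magnitude spectrum per frame. It uses a naive DFT instead of a real FFT library, non-overlapping frames, and a synthetic 1 kHz tone as input; all names and numbers are assumptions for the example, and practical front ends usually add overlapping, windowed frames.

```python
import cmath
import math

def frames(samples, sample_rate, frame_ms=10):
    """Split samples into consecutive, non-overlapping frames of frame_ms milliseconds."""
    size = sample_rate * frame_ms // 1000
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (slow, but dependency-free)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

SAMPLE_RATE = 8000
# 100 ms of a synthetic 1 kHz sine wave standing in for recorded audio.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE // 10)]

spectra = [magnitude_spectrum(f) for f in frames(tone, SAMPLE_RATE)]
frame_size = SAMPLE_RATE * 10 // 1000
peak_bin = max(range(len(spectra[0])), key=spectra[0].__getitem__)
print(len(spectra), "frames, peak at", peak_bin * SAMPLE_RATE / frame_size, "Hz")
```

Each entry of `spectra` is one slice of the kind that would then be handed to the network.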

My guess is that, since regular laptops and desktops can't afford the processing time of really deep networks, intelligently splitting the FFT stream by phonemes and resizing the segments to a fixed length using bilinear filtering could give a big accuracy gain relative to its cost in CPU time.
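
A rough sketch of the resizing idea, assuming the phoneme boundaries are already known (finding them is the hard part and is not shown): each segment is a list of spectral frames, and it is stretched or squeezed to a fixed number of frames by linear interpolation along the time axis. This is a simplification of full bilinear filtering, which would also interpolate across frequency, and the function name is made up.

```python
def resize_segment(segment, target_len):
    """Linearly interpolate a list of feature frames along time to target_len frames."""
    if not segment or target_len < 1:
        raise ValueError("need a non-empty segment and a positive target length")
    if len(segment) == 1 or target_len == 1:
        return [list(segment[0]) for _ in range(target_len)]
    scale = (len(segment) - 1) / (target_len - 1)
    resized = []
    for i in range(target_len):
        pos = i * scale                      # fractional position in the source
        lo = int(pos)
        hi = min(lo + 1, len(segment) - 1)
        w = pos - lo                         # weight of the later frame
        resized.append([(1 - w) * a + w * b for a, b in zip(segment[lo], segment[hi])])
    return resized

# Three frames with two features each, stretched to five frames.
short = [[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]]
print(resize_segment(short, 5))
```

The network then always sees inputs of the same shape, regardless of how long the phoneme was spoken.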

Time permitting, I'd be glad to help out with the machine learning portion!

Oh, and for a huge corpus, movies and subtitles might work ;)

Oh, and another thing: I believe the CPU cost saved by cropping the FFT 'image' to only contain 200-4000 Hz (where most voice data lives) could allow a deeper network, which would lead to even better performance :)
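
In code, the cropping amounts to keeping only the FFT bins whose frequency falls inside the band. A minimal sketch, where the 16 kHz sample rate, the 160-point transform and the function names are assumptions chosen for the example:

```python
def band_bins(n_fft, sample_rate, low_hz=200.0, high_hz=4000.0):
    """Indices of the FFT bins whose centre frequency lies in [low_hz, high_hz]."""
    bin_hz = sample_rate / n_fft
    return [k for k in range(n_fft // 2 + 1) if low_hz <= k * bin_hz <= high_hz]

def crop(spectrum, bins):
    """Keep only the selected bins of one magnitude spectrum."""
    return [spectrum[k] for k in bins]

bins = band_bins(n_fft=160, sample_rate=16000)
print(bins[0], bins[-1], len(bins))

# A dummy 81-bin spectrum (bin index as value) to show the effect of cropping.
cropped = crop(list(range(81)), bins)
print(len(cropped))
```

With these numbers the network input shrinks from 81 values per slice to 39.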

Peter Grasch's picture

Low- and high-pass filters are extremely common preprocessing steps in speech recognition systems, yes.

Peter Grasch's picture

Yes, neural networks are making a comeback in several applications - including ASR.
However, I am not aware of any such systems being released under a free license so far. As I intend to showcase what's possible *today* (and only gave myself a week's worth of time), developing a new recognizer and a model for that recognizer is sadly not an option (also because one of the drawbacks of DNN-based approaches is the need for a *significant* amount of processing power).
Writing a DNN-based recognizer would be an interesting long-term project though. Feel free to contact me if you want to work on this.

@Corpus: Using movies will not only give you copyright problems but also means you'll be working with extreme amounts of background noise, digital postprocessing, etc. Subtitles are also legally troublesome in theory, but given that the original text can no longer be retrieved after building a language model, and that the actual files used are fan-made, I really don't think publishers would mind. But yeah, I had that same idea a while back and already ran some experiments on a corpus of > 200k movie subtitles - I'll blog about that as well.
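
To give an idea of the preprocessing involved, here is a minimal sketch that turns a SubRip (.srt) file into plain word lists suitable for counting n-grams. It is not the pipeline used for the corpus mentioned above; the sample cues, the regular expressions and the function names are made up for illustration.

```python
import re

TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")
TAG = re.compile(r"<[^>]+>")  # simple formatting tags such as <i>...</i>

SAMPLE_SRT = """\
1
00:00:01,000 --> 00:00:03,500
<i>Where are you?</i>

2
00:00:04,000 --> 00:00:06,000
I'm at home.
Call me later!
"""

def subtitle_text(srt_text):
    """Return the text of each cue, with index, timing line and tags stripped."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or not TIMING.match(lines[1].strip()):
            continue  # skip cues that don't follow the index/timing/text layout
        cues.append(" ".join(TAG.sub("", line).strip() for line in lines[2:]))
    return cues

def normalize(text):
    """Lowercase and keep only letters and apostrophes, split into words."""
    return re.sub(r"[^a-z' ]+", " ", text.lower()).split()

for cue in subtitle_text(SAMPLE_SRT):
    print(normalize(cue))
```

Only the resulting word sequences feed into the language model statistics; the timing information is discarded.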

