Open Source Dictation: Demo Time

Over the last couple of weeks, I've been working towards a demo of open source speech recognition. I did a review of existing resources, and managed to improve both the acoustic and the language model. That left turning Simon into a real dictation system.

Making Simon work with large-vocabulary models

First of all, I needed to hack Simond a bit to accept and use an n-gram based language model instead of the scenarios grammar whenever the former was available. With this little bit of trickery, Simon was already able to use the models I built in the last weeks.

Sadly, I immediately noticed a big performance issue: Up until now, Simon basically recorded one sample until the user stopped speaking and then started recognizing. While not a problem when the "sentences" are constrained to simple, short commands, this would cause significant lag as the length of the sentences, and therefore the time required for recognition, increased. Even when recognizing faster than real time, this essentially meant that you had to wait for ~ 2 seconds after saying a ~ 3 second sentence.
To keep Simon snappy, I implemented continuous recognition in Simond (for pocketsphinx): Simon now feeds data to the recognizer engine as soon as the initial buffer is filled, making the whole system much more responsive.
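
As a rough illustration of the idea (not Simond's actual code), the following Python sketch buffers incoming audio until an initial buffer is full and from then on hands every chunk to the recognizer right away. The names chunked, stream_to_recognizer, feed and initial_buffer are made up for this example; feed stands in for whatever call passes audio to the decoder.

```python
from typing import Callable, Iterable, Iterator, List


def chunked(samples: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield consecutive chunks of at most chunk_size bytes."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def stream_to_recognizer(chunks: Iterable[bytes],
                         feed: Callable[[bytes], None],
                         initial_buffer: int) -> int:
    """Feed audio to the recognizer as soon as the initial buffer is full.

    `feed` is a hypothetical stand-in for the decoder's input call.
    Returns the number of times feed() was called.
    """
    pending = bytearray()
    started = False
    calls = 0
    for chunk in chunks:
        if started:
            # Recognition is already running: pass data on immediately.
            feed(chunk)
            calls += 1
            continue
        pending.extend(chunk)
        if len(pending) >= initial_buffer:
            # Initial buffer is full: start recognizing while still recording.
            feed(bytes(pending))
            pending.clear()
            started = True
            calls += 1
    if pending:
        # The utterance ended before the initial buffer filled up.
        feed(bytes(pending))
        calls += 1
    return calls


received: List[bytes] = []
audio = bytes(range(10))  # stand-in for recorded samples
calls = stream_to_recognizer(chunked(audio, 3), received.append, initial_buffer=6)
```

The recognizer sees the same audio as before, only in several pieces while the user is still talking instead of in one piece at the end.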

Revisiting the Dictation plugin

Even before this project started, Simon already had a "Dictation" command plugin. Basically, this plugin would just write out everything that Simon recognizes. But that's far from everything there is to dictation from a software perspective.

First of all, I needed to take care of replacing the special words used for punctuation, like ".period", with their associated signs. To do that, I implemented a configurable list of string replaces in the dictation plugin.
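
A minimal sketch of such a replacement list in Python, assuming plain ordered string replacement. Only ".period" comes from the setup described above; the ",comma" entry and the leading space in each search string (used here to attach the sign to the preceding word) are assumptions made for this example.

```python
from typing import List, Tuple

# Ordered (search, replacement) pairs. The leading space in the search
# string is an assumption of this sketch: it removes the gap before the sign.
REPLACEMENTS: List[Tuple[str, str]] = [
    (" .period", "."),
    (" ,comma", ","),  # hypothetical entry, following the same pattern
]


def apply_replacements(result: str, replacements: List[Tuple[str, str]]) -> str:
    """Apply every configured string replacement to a recognition result."""
    for search, replacement in replacements:
        result = result.replace(search, replacement)
    return result


sentence = apply_replacements("well ,comma that works .period", REPLACEMENTS)
```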

An already existing option to add a given text at the end of a recognition result takes care of adding spaces after sentences if configured to do so. I also added the option to uppercase the first letter of every new spoken sentence.
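
These two options could be sketched like this in Python. The function name and parameters are invented for the example, and it assumes that every recognition result is one spoken sentence.

```python
def format_result(result: str, suffix: str = " ", capitalize: bool = True) -> str:
    """Uppercase the first letter of a result and append the configured text."""
    if capitalize and result:
        # str.capitalize() would also lowercase the rest ("KDE" -> "kde"),
        # so only the first character is touched.
        result = result[0].upper() + result[1:]
    return result + suffix


formatted = format_result("simon is a KDE application.")
```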

Then, I set up some shortcut commands that would be useful while dictating ("Go to the end of the document" for ctrl+end or "Delete that" for backspace, for example).

To deal with incorrect recognition results, I also wanted to be able to modify already written text. To do that, I made Simon aware of the currently focused text input field by using AT-SPI 2. I then implemented a special "Select x" command that would search through the current text field and select the text "x" if found. This enables the user to select the offending word(s) to either remove them or simply dictate the correction.
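
The search step behind "Select x" might look roughly like the Python sketch below. It works on a plain string and leaves out the AT-SPI side (reading the focused field's text and applying the selection) entirely. Matching case-insensitively and preferring the last occurrence are assumptions of this example, not necessarily what Simon does.

```python
import re
from typing import Optional, Tuple


def find_selection(text: str, target: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets of the last match of target in text, or None."""
    if not target:
        return None
    # re.escape makes sure the dictated words are matched literally.
    matches = list(re.finditer(re.escape(target), text, re.IGNORECASE))
    if not matches:
        return None
    return matches[-1].span()


field_text = "I can hear you. Come over hear."
selection = find_selection(field_text, "hear")
```

The returned offsets would then be handed to the accessibility layer to select the text, after which "Delete that" or a newly dictated phrase replaces it.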

Demonstration

So without much ado, this is the end result:

What's next?

Of course, this is just the beginning. If we want to build a real, competitive open source speech recognition offering we have to tackle - among others - the following challenges:

  • Turning the adaptation I did manually into an integrated, guided setup procedure for Simon (enrollment).
  • Continuing to work towards better language and acoustic models in general. There's a lot to do there.
  • Improving the user interface for the dictation: We should show off the current (partial) hypothesis even while the user is speaking. That would make the system feel even more responsive.
  • Better accounting for spontaneous input: Simon should be aware of (and ignore) filler words, support mid-sentence corrections, false starts, etc.
  • Integrating semantic logic into the language model: For example, in the current prototype, recognizing "Select x" is pretty tricky because e.g., "Select hear" is not a sentence that makes sense according to the language model - it does in the application, though (select the text "hear" in the written text for correction / deletion).
  • Better incorporating the dictation with traditional command & control: When not dictating texts, we should still exploit the information we do have (available commands) to keep recognition accuracy as high as it is for the limited-vocabulary use case we have now. A mixture (or switching) between grammar and language model should be explored.
  • Better integration in other apps: The AT-SPI information used for correcting mistakes is sadly not consistent across toolkits and widgets. Many KDE widgets are in fact not accessible through AT-SPI (e.g. the document area of Calligra Words does not report itself as a text field). This is mostly down to the fact that no other application currently requires the kind of information Simon does.

Even this rather long list is just a tiny selection of what I can think of right off the top of my head - and I'm not even touching on improvements in e.g. CMU SPHINX.
There's certainly still a lot left to do, but all of it is very exciting and meaningful work.

I'll be at the Akademy conference for the coming week where I'll also be giving a talk about the future of open source speech recognition. If you want to get involved in the development of an open source speech recognition system capable of dictation: Get in touch with me - either in person, or - if you can't make it to Akademy - write me an email!

Comments

awesome!!!

This is amazing, and would really help out in the open source community to assist users that require additional accessibility features.

Unbelievable!!!

You didn't really do all this in the time of a week (week as in 'amount of working hours', not as in 'last week')? I guess people will fall off their chairs in Bilbao! Have fun there and 'Buen viaje!'! (And don't forget to relax a bit... ;-))

Peter Grasch's picture

All together it was probably almost 2 weeks. But I've been experimenting in similar areas for quite some time now, so I knew both where to look and what to do. That certainly helped a lot :)

Best regards,
Peter

Excellent Work Peter!
I would never have thought that such accuracy would be possible with free software, let alone in that time span.
As German is my mother tongue, I wonder if you could build a similar language & acoustic model for the German language?

Cheers

Peter Grasch's picture

I actually tried to build a German speech model back in February but it's much harder as there is far less data publicly available.

I guess we should concentrate on English for now and add more languages later on. But don't worry, I'm a native German speaker as well (as are many other KDE hackers), so I'm sure the German model won't be forgotten.

Best regards,
Peter

Sounds really great, I hope it will be available for other languages in the future :P

Not perfect? It is shockingly brilliant.

This is very impressive! With such high accuracy, this will be very useful in the future.

This feels like a revolution! :)

Thanks for your continued work on Simon. It is the selfless hard work of programmers like you that gives people the opportunity to make the world a better place. Best of luck in your future endeavors.
Peace!

I can't say I'm interested in getting him off (who is he anyways? why can't you europeans keep your youtube videos G-rated?) but that computery stuff you got going on seems really great.

Hi there Peter,
To just tackle this area is a gigantic job. To do it in a short span of time like weeks - even if you were "just" putting it all together - is very impressive.
I have been "stuck" with Windoze for so many years simply because I NEED Dragon voice-2-text.
Worse, I had to use MSWord with it.
I have been agitating with Nuance Corp for years trying to get them to port to Linux and they finally did a partial job with OpenOffice for Windows but then quit.

I'm no programmer, but I am a Writer and will help with testing feedback anytime.

BTW: Your English is probably not the best model acoustically. I have what was called a "mid-Atlantic" English accent - born in Australia but with years in Asia, which suits the Dragon model very well, greatly reducing the error rate.
I've also responded by email to you.

Well done mate. This is awesome. Looking forward to future updates.
Keep up good work. This is a good example of an Open source project which could have a wider impact!

I manage the IT at a Law firm. We are rolling out Dragon and I have it working on Linux but a truly opensource model would be preferred. How can I help this project?

Peter Grasch's picture

Hey Richard,

in general, the best way end-users can help is by sharing the recordings they amass, coupled with a transcription of what is actually being said. However, I am aware that this will probably not be an option when it comes to legal dictations.

Given that, I don't think there is really anything you can do at this early stage.
However, it is always very motivating to hear about potential users, so please consider your comment here your small, but very valuable contribution to the Simon project.
Thanks.

Best regards,
Peter

Hi,

I am trying out various open-source Voice Recognition apps, and have not been able to find the GUI that seems to be available for Simon. Can you please point me at a link?

Thanks in advance for your response.

Peter Grasch's picture

Simon is a graphical application. You can get it here: http://simon.kde.org.

Please be aware that the dictation capabilities showcased in that article are not yet integrated in the latest version available for download.

Best regards,
Peter

Hi Peter,

THX for that! I did follow a few Links, and landed here: http://userbase.kde.org/Simon/Development_Environment#Linux. I run Backbox Linux, which uses Ubuntu as a base, and XFCE as the Desktop Manager. So, I had to make sure that I had all those KDE requirements installed.

I compiled the source code, and ALL IS GOOD!

Now, I need to add some Libraries, and train a few things. BTW, I did watch your YouTube Video here: http://www.youtube.com/watch?v=ghfMMYNOwXo&feature=youtu.be&t=3m19s.

Thanks again!

Best Regards,

JJMacey
http://www.jjmacey.net/

Hi Peter,

Are the fancy dictation abilities you showed now integrated in the current Simon? How do I use it as you just did? I want to get my hands on it ASAP.
Besides building from source, how do I get a binary?

Peter Grasch's picture

Sorry, this is not yet integrated.
Even if you build from source you'd need to do a lot of manual hacking to get it where I got it in the video. Please be a bit patient.

I'm excited for many reasons.
1) You explain the process and operationals superbly, making it easy for the lay user to grasp. Your success is assured because of this alone.
2) I have been looking for precisely this type of product for years
3) The interface looks modern
4) You're passionate about something that will actually transform computing
5) In your Akademy talk you spoke of the ability to factor languages (like code) into specific app use, rendering real world coding as a possibility

Can not wait to get my hands on this.

Peter Grasch's picture

Thank you for your kind words!

How much training (in hours) with your voice was needed to get to that quality?

Hi Peter,

I have been looking for a VR solution like yours for months.
This is really amazing and I hope it could help me to create educational stuff for kids.
Schools are interested in new methods of teaching and "Simon says" - I want it!

We need dictation tools for edu games for schools, and your CMU Sphinx based solution would be the best answer.
When will Simon's dictation tools be ready for developers and explorers?

Impressive stuff. I would be very interested to contribute as a beta tester and/or provide samples and training for the base language models.

Hello, I was looking at converting all the linux.conf.au Videos to text, and it looks like this might be good enough to do the trick. However I am rather new to this, so:
1) Simon keeps on asking for a Simon Base Model file. All I know is that it has an ".sbm" extension, but I haven't been able to find any files with an .sbm extension in any of the linked files or on the voxforge site.
2) It seems that maybe what I actually want is Transcriber from sphinx. I think I am meant to edit edu/cmu/sphinx/demo/transcriber/config.xml to point to your files, but it needs e.g. a .gram file and I can't find a file with the .gram extension in your files. Does it exist, just with a different extension?
3) BTW, do you know if it would be easy to automatically add information on precisely *when* a particular phrase was spoken in the video (useful for subtitles)?

"Sorry, this is not yet integrated. Even if you build from source you'd need to do a lot of manual hacking to get it where I got it in the video. Please be a bit patient."

..any updates on the status of the dictation plugin? I compiled / packaged simon on PCLinuxOS, and while it seems to work, the real reason I wanted it was for the dictation.

Thanks

Peter Grasch's picture

Daniel: About 2 hours.

Cybertex: I hope to have a consumer-ready prototype by this summer.

Rob: Thank you. Have a look at VoxForge.org to submit your recordings: http://www.voxforge.org/home/read

John: The dictation functionality is still not integrated in a way where you can use Simon's UI to set everything up. It is not intended for end users at the moment. Sorry. (for the record: 2. is better, yes, but .gram is a grammar - you'll need to use an n-gram instead; you may want to use the one I published in the "Wrapping up" follow up post; 3. yes, this is supported by sphinx.)

Travis: We're still working on it. Expect something by this summer.

