Open Source Dictation: Demo Time

Over the last couple of weeks, I’ve been working towards a demo of open source speech recognition. I reviewed the existing resources and managed to improve both the acoustic and the language model. That left turning Simon into a real dictation system.

Making Simon work with large-vocabulary models

First of all, I needed to hack Simond a bit to accept and use an n-gram-based language model instead of the scenario grammar whenever the former was available. With this little bit of trickery, Simon was already able to use the models I built over the last weeks.

Sadly, I immediately noticed a big performance issue: up until now, Simon basically recorded one sample until the user stopped speaking and only then started recognizing. While this is not a problem when the “sentences” are constrained to simple, short commands, it causes significant lag as the length of the sentences, and therefore the time required for recognition, increases. Even when recognizing faster than real time, this essentially meant that you had to wait for ~2 seconds after saying a ~3 second sentence.
To keep Simon snappy, I implemented continuous recognition in Simond (for pocketsphinx): Simon now feeds data to the recognizer engine as soon as the initial buffer is filled, making the whole system much more responsive.
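The difference between the old and the new behavior is just control flow: instead of buffering the whole utterance and decoding afterwards, data is handed to the decoder as soon as an initial buffer is filled. The sketch below illustrates that loop; StubDecoder and the function names are illustrative stand-ins, not Simond’s actual code (pocketsphinx’s real decoding cycle does use a start/process/end pattern, but the engine itself is mocked out here).

```python
# Sketch: feed audio chunks to the recognizer as they arrive instead of
# waiting until the user stops speaking. StubDecoder is a stand-in for
# the real pocketsphinx engine; only the control flow matters here.

class StubDecoder:
    """Pretends to decode: collects chunks and joins them as the hypothesis."""
    def __init__(self):
        self.chunks = []

    def start_utt(self):
        self.chunks = []

    def process_raw(self, chunk):
        self.chunks.append(chunk)

    def end_utt(self):
        pass

    def hyp(self):
        return " ".join(self.chunks)


def recognize_streaming(decoder, audio_source, initial_buffer=2):
    """Start feeding the decoder once `initial_buffer` chunks have arrived,
    instead of buffering the complete utterance first."""
    decoder.start_utt()
    pending = []
    for chunk in audio_source:
        pending.append(chunk)
        if len(pending) >= initial_buffer:  # initial buffer is filled:
            for c in pending:               # hand everything over immediately
                decoder.process_raw(c)
            pending.clear()
    for c in pending:                       # flush whatever is left at the end
        decoder.process_raw(c)
    decoder.end_utt()
    return decoder.hyp()
```

With real audio, most of the decoding work is already done by the time the user finishes the sentence, which is what makes the system feel responsive.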

Revisiting the Dictation plugin

Even before this project started, Simon already had a “Dictation” command plugin. Basically, this plugin would simply write out everything Simon recognized. But there is far more to dictation from a software perspective.

First of all, I needed to take care of replacing the special words used for punctuation, like “.period”, with their associated signs. To do that, I implemented a configurable list of string replacements in the dictation plugin.
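Such a replacement list boils down to a simple ordered mapping applied to the recognized text. A minimal sketch (the spoken forms and the rule of swallowing the preceding space are illustrative, not Simon’s actual defaults):

```python
# Sketch of a configurable punctuation replacement list: each entry maps
# the spoken form to the sign it should produce. The entries here are
# examples, not Simon's shipped configuration.
REPLACEMENTS = [
    (".period", "."),
    (",comma", ","),
    ("?question mark", "?"),
    ("!exclamation mark", "!"),
]

def apply_replacements(text, replacements=REPLACEMENTS):
    for spoken, sign in replacements:
        # Also remove the space the recognizer puts before the spoken form,
        # so "world .period" becomes "world." rather than "world ."
        text = text.replace(" " + spoken, sign).replace(spoken, sign)
    return text
```

For example, `apply_replacements("hello world .period")` yields `"hello world."`.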


An existing option to append a given text to each recognition result takes care of adding spaces after sentences, if configured to do so. I also added an option to uppercase the first letter of every newly spoken sentence.
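Combined, those two options amount to a small post-processing step per recognized sentence. A sketch (function name and defaults are mine, for illustration):

```python
def postprocess(sentence, append=" "):
    """Sketch of the two dictation options: uppercase the first letter of
    the spoken sentence, then append a configured string (e.g. a space
    so the next sentence doesn't run into this one)."""
    if sentence:
        sentence = sentence[0].upper() + sentence[1:]
    return sentence + append
```

So `postprocess("this is a test.")` produces `"This is a test. "`, ready for the next sentence to be appended.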

Then, I set up some shortcut commands that would be useful while dictating (“Go to the end of the document” for ctrl+end or “Delete that” for backspace, for example).

To deal with incorrect recognition results, I also wanted to be able to modify already written text. To do that, I made Simon aware of the currently focused text input field by using AT-SPI 2. I then implemented a special “Select x” command that searches the current text field and selects the text “x” if found. This enables the user to select the offending word(s) and either remove them or simply dictate the correction.
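The core of such a command is straightforward once the field’s content is available through the accessibility layer: find the spoken target in the text and turn it into a selection range. A sketch (the function, its case-insensitive matching, and the returned (start, end) convention are assumptions for illustration; the actual selection is then performed via AT-SPI 2):

```python
def select_command(field_text, spoken):
    """Sketch of the 'Select x' correction command: if the utterance starts
    with 'select', search the focused field's text (obtained via AT-SPI 2
    in Simon's case) for the remainder and return the selection range as
    (start, end) character offsets, or None if nothing matches."""
    prefix = "select "
    if not spoken.lower().startswith(prefix):
        return None
    target = spoken[len(prefix):].strip()
    start = field_text.lower().find(target.lower())
    if start == -1:
        return None
    return (start, start + len(target))
```

For example, against the field content `"this is a test"`, the utterance `"Select test"` resolves to the range `(10, 14)`; the user can then delete the selection or dictate over it.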


So, without further ado, this is the end result:

What’s next?

Of course, this is just the beginning. If we want to build a real, competitive open source speech recognition offering we have to tackle – among others – the following challenges:

  • Turning the adaptation I did manually into an integrated, guided setup procedure for Simon (enrollment).
  • Continuing to work towards better language and acoustic models in general. There’s a lot to do there.
  • Improving the user interface for the dictation: We should show off the current (partial) hypothesis even while the user is speaking. That would make the system feel even more responsive.
  • Better accounting for spontaneous input: Simon should be aware of (and ignore) filler words, support mid-sentence corrections, false starts, etc.
  • Integrating semantic logic into the language model: for example, in the current prototype, recognizing “Select x” is pretty tricky because e.g. “Select hear” is not a sentence that makes sense according to the language model – it does in the application, though (select the text “hear” in the written text for correction / deletion).
  • Better integrating dictation with traditional command & control: when not dictating text, we should still exploit the information we do have (the available commands) to keep recognition accuracy as high as it is for the current limited-vocabulary use case. A mixture of (or switching between) grammar and language model should be explored.
  • Better integration in other apps: The AT-SPI information used for correcting mistakes is sadly not consistent across toolkits and widgets. Many KDE widgets are in fact not accessible through AT-SPI (e.g. the document area of Calligra Words does not report to be a text field). This is mostly down to the fact that no other application currently requires the kind of information Simon does.

Even this rather long list is just a tiny selection of what I can think of right off the top of my head – and I’m not even touching on improvements in e.g. CMU SPHINX.
There’s certainly still a lot left to do, but all of it is very exciting and meaningful work.

I’ll be at the Akademy conference for the coming week where I’ll also be giving a talk about the future of open source speech recognition. If you want to get involved in the development of an open source speech recognition system capable of dictation: Get in touch with me – either in person, or – if you can’t make it to Akademy – write me an email!


Peter Grasch


  1. This is amazing, and would really help out in the open source community to assist users that require additional accessibility features.

  2. I’m floored…

    You didn’t really do all this in the time of a week (week as in ‘amount of working hours’, not as in ‘last week’)? I guess people will fall off their chairs in Bilbao! Have fun there and ‘Buen viaje!’! (And don’t forget to relax a bit… ;-))

  3. All together it was probably almost 2 weeks. But I’ve been experimenting in similar areas for quite some time now, so I knew both where to look and what to do. That certainly helped a lot 🙂

    Best regards,

  4. Great!

    Excellent Work Peter!
    I would never have thought that such accuracy would be possible with free software, let alone in that time span.
    As German is my mother tongue, I wonder if you could build a similar language & acoustic model for German?


  5. I actually tried to build a German speech model back in February but it’s much harder as there is far less data publicly available.

    I guess we should concentrate on English for now and add more languages later on. But don’t worry, I’m a native German speaker as well (as are many other KDE hackers), so I’m sure the German model won’t be forgotten.

    Best regards,

  6. Great !!!

    Sounds really great; I hope it will be available for other languages in the future 😛

  7. This is very impressive! With such high accuracy, this will be very useful in the future.

  8. Very nice work!

    Thanks for your continued work on Simon. It is the selfless hard work of programmers like you that give people the opportunity to make the world a better place. Best of luck in your future endeavors.

  9. Off the Hook

    I can’t say I’m interested in getting him off (who is he anyways? why can’t you europeans keep your youtube videos G-rated?) but that computery stuff you got going on seems really great.

  10. VERY impressive project – and thanks!

    Hi there Peter,
    To just tackle this area is a gigantic job. To do it in a short span of time like weeks – even if you were “just” putting it all together is very impressive.
    I have been “stuck” with Windoze for so many years simply because I NEED Dragon voice-2-text.
    Worse, I had to use MSWord with it.
    I have been agitating with Nuance Corp for years trying to get them to port to Linux and they finally did a partial job with OpenOffice for Windows but then quit.

    I’m no programmer, but I am a Writer and will help with testing feedback anytime.

    BTW: Your English is probably not the best model acoustically. I have what was called a “mid-Atlantic” English accent – born in Australia but with years in Asia, which suits the Dragon model very well, greatly reducing the error rate.
    I’ve also responded by email to you.

  11. Brilliant!

    Well done mate. This is awesome. Looking forward to future updates.
    Keep up good work. This is a good example of an Open source project which could have a wider impact!

  12. OK, how can I help?

    I manage the IT at a Law firm. We are rolling out Dragon and I have it working on Linux but a truly opensource model would be preferred. How can I help this project?

  13. Re: Richard

    Hey Richard,

    in general, the best way end-users can help is by sharing the recordings they amass, coupled with transcriptions of what is actually being said. However, I am aware that this will probably not be an option when it comes to legal dictation.

    Given that, I don’t think there is really anything you can do at this early stage.
    However, it is always very motivating to hear about potential users, so please consider your comment here your small, but very valuable contribution to the Simon project.

    Best regards,

  14. Simon GUI


    I am trying out various open-source voice recognition apps, and have not been able to find the GUI that seems available for Simon. Can you please point me at a link?

    Thanks in advance for your response.

  15. Re: Simon GUI

    Simon is a graphical application. You can get it here: http://simon.kde.org.

    Please be aware that the dictation capabilities showcased in that article are not yet integrated in the latest version available for download.

    Best regards,

  16. Simon GUI

    Hi Peter,

    THX for that! I did follow a few links, and landed here: http://userbase.kde.org/Simon/Development_Environment#Linux. I run Backbox Linux, which uses Ubuntu as a base and XFCE as the desktop manager. So I had to make sure that I had all those KDE requirements installed.

    I compiled the source code, and ALL IS GOOD!

    Now, I need to add some Libraries, and train a few things. BTW, I did watch your YouTube Video here: http://www.youtube.com/watch?v=ghfMMYNOwXo&feature=youtu.be&t=3m19s.

    Thanks again!

    Best Regards,


  17. Is the dictation capability integrated into Simon?

    Hi Peter, how are you?
    I care about Simon’s dictation: has it been integrated recently, or is it still on your grand agenda?

  18. Dictation capabilities in current Windows version

    Hi Peter,

    Are the fancy dictation abilities you showed now integrated in the current Simon? How do I use it as you just did? I want to get my hands on it ASAP.
    Besides building from source, how do I get a binary?

  19. Sorry, this is not yet

    Sorry, this is not yet integrated.
    Even if you build from source you’d need to do a lot of manual hacking to get it where I got it in the video. Please be a bit patient.

  20. Very exciting!

    I’m excited for many reasons.
    1) You explain the process and operation superbly, making it easy for the lay user to grasp. Your success is assured because of this alone.
    2) I have been looking for precisely this type of product for years
    3) The interface looks modern
    4) You’re passionate about something that will actually transform computing
    5) In your Akademy talk you spoke of the ability to factor languages (like code) into specific app use, rendering real world coding as a possibility

    Can not wait to get my hands on this.

  21. How much training (in hours) with your voice was needed to get to that quality?

  22. Great work and and really impressive!

    Hi Peter,

    I have been looking for a voice recognition solution like yours for months.
    This is really amazing and I hope it could help me to create educational stuff for kids.
    Schools are interested in new methods of teaching, and “Simon says” – I want it!

    We need dictation tools for educational games for schools, and your CMU Sphinx-based solution would be the best answer.
    When will Simon’s dictation tools be ready for developers and explorers?

  23. Impressive stuff. I would be very interested to contribute as a beta tester and/or provide samples and training for the base language models.

  24. Hello, I was looking at converting all the linux.conf.au Videos to text, and it looks like this might be good enough to do the trick. However I am rather new to this, so:
    1) Simon keeps on asking for a Simon Base Model file. All I know is that it has an “.sbm” extension, but I haven’t been able to find any files with an .sbm extension in any of the linked files or on the voxforge site.
    2) It seems that maybe what I actually want is Transcriber from sphinx. I think I am meant to edit edu/cmu/sphinx/demo/transcriber/config.xml to point to your files, but it needs e.g. a .gram file and I can’t find a file with the .gram extension in your files. Does it exist, just with a different extension?
    3) BTW, do you know if it would be easy to automatically add information on precisely *when* a particular phrase was spoken in the video (useful for subtitles)?

  25. Dictation plugin status update?

    “Sorry, this is not yet integrated. Even if you build from source you’d need to do a lot of manual hacking to get it where I got it in the video. Please be a bit patient.”

    ..any updates on the status of the dictation plugin? I compiled / packaged simon on PCLinuxOS, and while it seems to work, the real reason I wanted it was for the dictation.


  26. Replies

    Daniel: About 2 hours.

    Cybertex: I hope to have a consumer-ready prototype by this summer.

    Rob: Thank you. Have a look at VoxForge.org to submit your recordings: http://www.voxforge.org/home/read

    John: The dictation functionality is still not integrated in a way where you can use Simon’s UI to set everything up. It is not intended for end users at the moment. Sorry. (for the record: 2. is better, yes, but .gram is a grammar – you’ll need to use an n-gram instead; you may want to use the one I published in the “Wrapping up” follow up post; 3. yes, this is supported by sphinx.)

    Travis: We’re still working on it. Expect something by this summer.

  27. Continuous Adaptation

    Is it possible to eventually support a continuous adaptation system in simon? For example, given the following captured input:

    1] Voice capture
    2] Resulting, corrected (manually), transcript

    Routinely (overnight, weekly, etc, in the daemon) adapt the voice and language model:

    Language model
    1] Add new words from the transcript to the language model, if the word passes a certain threshold (i.e. occurrence: maintain a tally and only integrate weekly)
    2] Determine new word pronunciation from recording, add to phonetic dictionary.

    Acoustic model
    1] Adapt previous day’s recording given recording and text
    2] Capability (optional, of course) to review transcript/recording and upload to voxforge. Cached recordings/transcripts purged routinely (daily, weekly, etc).

    All automatically.

    Benefits include reducing tedious training, greatly reducing WER over time, and hopefully providing a significant bump to voxforge’s corpus.

    Is something like this even possible? And feasible, would the simon developers be interested?

  28. Hey,

    the adaptation of the language model doesn’t work like that. You can never recognize words that the recognizer doesn’t know about; hence, you can’t “add words to the LM from the transcript” because they won’t be in it. Ideally, you’ll get an unknown-word marker – often, though, you’ll get a similar-sounding word, because that’s all the system knows about.

    The adaptation of the AM kind of works like that: you create new training data from recordings and their automatic transcriptions, selecting only those where the decoder was reasonably confident. You can google “conservative training” for more information.
    Simon already has a very primitive version of this implemented. Open the Simond settings (through KSimond, for example) and tick the box “Keep recognition samples” in the Simond section. This will make Simond keep all samples used for decoding, plus the decoder’s log for them. You can find them in ~/.kde/share/apps/simond/models//active.
    If you use Sam for acoustic modelling, you can use the “Import recognition samples” function on the main screen. Point it to the folder mentioned above and it can write a prompts file, with the transcriptions taken from the recognition results. If your decoder engine supports confidence scores (currently only the Julius backend does), you can also specify a minimum confidence score for a sample to be used.
    It’s fairly trivial to upload those samples to Voxforge – even from within Simon, if you want: import the training data (it can read the prompts file Sam wrote) and submit it using the “Contribute samples” feature. As always, you can check the manual for more info.

    Thanks for your suggestions, though!

    Best regards,

    • I don’t know how much information you can get from AT-SPI 2, but I do think this suggestion is possible:

      Language Model:
      from simon/julius: access to audio and recognition output, i.e. the correlation between phonemes and words
      from at-spi: access to “corrected” words (at the very least by comparing current, manually corrected content with automatic content produced by the sre)
      From this you can isolate the phonemes of the corrected words. Given these phonemes you can create new pronunciation dictionary entries. Since you already have a correlation between phoneme/word, forced alignment isn’t even necessary (unless there are multiple adjacent corrected words). You can use the surrounding text to generate n-gram models.

      Acoustic model
      Conservative training shouldn’t be necessary if you have access to the manually corrected transcript. That said, what you describe nonetheless sounds promising. You make it seem as though the work you describe with language models is already implemented and working. Does this mean you’ve integrated this work upstream?

  29. Great! Where is the download link for this feature? I am using the Simon 0.4.1 Windows version.
    Is it not free or not open source?
