Open Source Dictation: Scoping out the Problem

Today I want to start with the first “process story” of creating a prototype of an open source dictation system.

Project scope

Given around a week's worth of time, I'll build a demonstrative prototype of a continuous speech recognition system for dictating texts such as emails, chat messages or reports, using only open resources and technologies.

Dictation systems are usually developed for a target user group and then adapted to a single user (the one who'll be using the system). For this prototype, the target user group is "English speaking techies" and I myself will be the end-user to whom the system will be adapted. The software to process and handle the recognition result will be Simon. Any additions or modifications to the software will be made public.

During the course of the project, I’ll be referencing different data files and resources. Unless otherwise noted, those resources are available to the public under free licenses. If you need help to find them or would like more information (including any developed models), please contact me.

Evaluating existing models

I started by developing a sensible test case for the recognizer by selecting a total of 39 sentences of mixed complexity from various sources, including a review of "Man of Steel", a couple of news articles from CNN and slashdot, and some blog posts right here on PlanetKDE. This, I feel, represents a nice cross-section of different writing styles and topics, in line with what the target user group would likely want to write.

I then recorded these sentences myself (speaking rather quickly and without pauses) and ran recognition tests with PocketSphinx and various existing acoustic and language models to see how they’d perform.
Specifically, I measured the "Word Error Rate" (WER): the number of word-level errors (substitutions, insertions and deletions) needed to turn the recognizer's output into the perfect (manual) transcription, divided by the number of words in that reference transcription. You can find more information on Wikipedia. Lower WER is better.
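For the curious: WER is just a word-level Levenshtein (edit) distance over the reference transcription. A minimal sketch in Python (the function name and the tiny example sentences are mine, purely for illustration, not part of any toolkit):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Real scoring tools additionally normalize case and punctuation and align whole utterance sets, but the core metric is exactly this ratio.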

Acoustic model       Dictionary            Language model   WER
HUB4 (cont)          HUB4 (cmudict 0.6a)   HUB4             53.21 %
HUB4 (cont)          cmudict 0.7           Generic          58.32 %
HUB4 (cont)          HUB4 (cmudict 0.6a)   Gigaword, 64k    49.62 %
WSJ (cont)           HUB4 (cmudict 0.6a)   HUB4             42.81 %
WSJ (cont)           cmudict 0.7           Generic          50.69 %
WSJ (cont)           cmudict 0.7           Gigaword, 64k    41.07 %
HUB4 (semi)          HUB4 (cmudict 0.6a)   HUB4             38.23 %
HUB4 (semi)          cmudict 0.7           Generic          56.64 %
HUB4 (semi)          cmudict 0.7           Gigaword, 64k    36.18 %
Voxforge 0.4 (cont)  HUB4 (cmudict 0.6a)   HUB4             32.67 %
Voxforge 0.4 (cont)  cmudict 0.7           Generic          42.50 %
Voxforge 0.4 (cont)  cmudict 0.7           Gigaword, 64k    31.02 %

So, what can we take away from these tests? Overall, the scores are fairly poor and any system based on these models would be almost unusable in practice. There are several reasons for that. Firstly, I am not a native English speaker, so my accent definitely plays a role here. Secondly, many sentences I recorded for the test corpus are purposefully complex (e.g., "Together they reinvent the great granddaddy of funnybook strongmen as a struggling orphan whose destined for greater things.") to make the comparisons between different models more meaningful. And thirdly: the models used are nowhere near perfect.

For comparison, I also analyzed the results of Google's public speech recognition API, which managed a surprisingly poor 32.72 % WER on the same test set. If you compare that with the values above, it actually performed worse than the best of the open source alternatives. I re-ran the test twice and can only assume that either their public API is using a simplified model for computational reasons or that their system really doesn't like my accent.
Edit: An American native speaker offered to record my test set to eliminate the accent from the equation so I re-ran the comparison of Google’s API with the best model above with his recordings and found the two systems to produce pretty much equivalent word error rates (Google: 27.83 %, Voxforge: 27.22 %).

All things considered then, 31.02 % WER for a speaker independent dictation task on a 64k word vocabulary is still a solid start and a huge win for the Voxforge model!

Fine print: The table above should not be interpreted as a definitive comparison between the tested models. The test set is comparatively small and limited to my own voice which, as mentioned above, is by no means representative.
If you're a researcher trying to find the best acoustic model for your own decoding task, you should definitely do your own comparison; it's really easy and definitely worth your while.

Peter Grasch


  1. Hello Peter,

    Just out of curiosity: in the past, did you compare your software with the commercial application "Dragon NaturallySpeaking" (version 12, as of now) by Nuance?

    More precisely, given the dictation as a task to carry out, how does Simon behave compared to Dragon?
    Do you have some results to show ? 🙂

    In Italy, Dragon is quite often suggested as THE main software to buy to perform this kind of task (dictation).
    Therefore, I am pretty sure it would be *extremely* useful to read some sort of comparison…

    Since Simon runs on Windows as well, it should not be too hard to use both of them together on the same platform (as regards Dragon, you might simply install a demo).

    Best regards!

    Silvio Grosso

  2. I have not compared them as I’m just starting to seriously work on dictation in Simon.

    However, I have no doubt that Dragon would – at this stage at least – easily outperform any model I can create right now.

    They have been perfecting dictation for decades now and have spent a considerable budget not only on algorithms and software development but also on the data acquisition for their corpora.

    In short: If you want to perform dictation right now, buy Dragon.

    Best regards,

  3. AT-SPI

    all with Simon is just amazing! I’m enjoying all the process!

    Maybe this is not the best place to ask but I want you to help me.

    I have Simon in a windows 7 virtual machine and also I have it in Fedora. In windows, it doesn’t have the AT-SPI plug-in but in fedora, it does! I just want to do this: http://www.youtube.com/watch?v=mjVc8bKRdqA
    in windows, do you know what happens?

    I saw the source files in Windows, and it has the AT-SPI files in the plug-ins folder. But it doesn't appear as a choice when I click the "manage plug-ins" button in the Simon client.

    I hope someone knows, thanks.

    • AT-SPI is a Linux only technology. It doesn’t exist on Windows.

      However, Simon’s AT-SPI plugin is still highly experimental – even on Linux. I would not recommend running it yet.

  4. Hi Peter,

    A very informative blog there.

    I am trying to improve PocketSphinx’s recognition accuracy for a specific application and had a couple of questions. The idea is to reconfirm your results for my accent and then move on to adapting acoustic models if required.

    – Is there a centralized repository of the language models/acoustic models/dictionaries that you have used for comparison? This link
    http://www.speech.cs.cmu.edu/sphinx/models/ just mentions the HUB and WSJ language models.

    – How did you configure PocketSphinx to run with long audio files? Currently, if I run pocketsphinx_continuous.exe with the -infile parameter, the transcribed result always has just a few words (as if it expects the result to conform to a predetermined length).

    – Does pocketSphinx provide the WER as an output or did you calculate them manually?

    I would really appreciate any help in this regard.


    • Hey Neeraj,

      you can find a nice collection of models here: http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Languag

      pocketsphinx_continuous is fine, but you probably have "long" pauses in your audio. The continuous executable still does silence segmentation (uselessly), so it will terminate when it finds a long period of silence. You can trim those pauses out with Audacity, for example, but it's better to just split the recording on silence (again, Audacity can help you).
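      To make the splitting idea concrete, here is a minimal sketch in Python over raw PCM samples. The function name, threshold and minimum-silence values are made up for illustration; a real setup would use a proper voice activity detector (or Audacity, as above) rather than a plain amplitude cutoff:

```python
def split_on_silence(samples, threshold=500, min_silence=8000):
    """Split a mono PCM sample sequence into voiced chunks, cutting wherever
    at least `min_silence` consecutive samples fall below `threshold`."""
    chunks, current, silent_run = [], [], 0
    for s in samples:
        current.append(s)
        if abs(s) < threshold:
            silent_run += 1
            if silent_run == min_silence:
                voiced = current[:-silent_run]  # drop the silent tail
                if voiced:
                    chunks.append(voiced)
                current, silent_run = [], 0
        else:
            silent_run = 0
    if any(abs(s) >= threshold for s in current):
        chunks.append(current)  # flush trailing voiced audio
    return chunks

# Two bursts of "speech" separated by a long silent gap yield two chunks:
chunks = split_on_silence([1000] * 100 + [0] * 50 + [1000] * 100,
                          threshold=500, min_silence=20)
print(len(chunks))  # 2
```

      Each resulting chunk can then be written out as its own WAV file and fed to the decoder separately, so the segmenter never sees a pause long enough to terminate on.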

      SphinxTrain has a script called word_align.pl that can be used to calculate WER.

      In general, please contact the SPHINX team for more support on their ecosystem.

      Best regards,

      • Hey Peter,

        I did end up using word_align.pl to calculate the WER. Just that my numbers are worse than the ones that you have observed. I am testing this on the VoxForge corpus and I assumed the WER would be around 20% (without any adaptation), since a lot of those recordings have an American accent. Guess not.

        I plan to repeat my experiments with the latest generic language models as suggested by the Sphinx team.

        Thanks again.


  5. I actually found the Voxforge model to be very “foreigner’s English” friendly.
    The best possible model for you would probably be one based on TED LIUM. There’s been a new release recently, maybe I can find some time to build a model off of it soon. Watch the blog 🙂

    Best regards,

  6. Hi Peter!

    I find very useful what you have here, but what else? How’s your research? Do you still perform SR?

    For a long time I found MS SAPI on Windows had (by far) much better accuracy (and lower WER) than PSphinx. This in the context of Human-Robot Interaction with non-native speakers. Plus, there is no need for a dictionary, language model, corpora, etc. Just a Win7 SAPI 5.3 out of the box, interfaced with C#.

    Now, for odd reasons I'm getting into PSphinx, with not-that-good results. Kaldi is a promising option too. Do you have any comments regarding this? Should I skip Sphinx and deal straight with Simon?

    Thanks for any advice!

    • Hi Azkar!

      Yes, still working in the ASR field, but not much to report publicly at the moment.

      MS SAPI may be superior to PSphinx for some workloads, yes, but its license makes it unusable in the scope of a project like Simon.

      I don’t know enough about your concrete project to give you any advice but if you don’t want to deal with low level data, Simon may be a good idea, yes. It does not support KALDI at this point, though. Please keep in mind that neither SPHINX nor KALDI were meant to be used by end-users so if you don’t have an ASR background, it may be difficult to get them to do what you want.

      Best regards,
