Open Source Dictation: Scoping out the Problem

Today I want to start with the first "process story" of creating a prototype of an open source dictation system.

Project scope

Given around a week's worth of time, I'll build a demonstrative prototype of a continuous speech recognition system for the task of dictating texts such as emails, chat messages or reports, using only open resources and technologies.

Dictation systems are usually developed for a target user group and then adapted to a single user (the one who'll be using the system). For this prototype, the target user group is "English-speaking techies" and I myself will be the end user to whom the system will be adapted. The software to process and handle the recognition result will be Simon. Any additions or modifications to the software will be made public.

During the course of the project, I'll be referencing different data files and resources. Unless otherwise noted, those resources are available to the public under free licenses. If you need help finding them or would like more information (including any developed models), please contact me.

Evaluating existing models

I started by developing a sensible test case for the recognizer by selecting a total of 39 sentences of mixed complexity from various sources, including a review of "Man of Steel", a couple of news articles from CNN and Slashdot, and some blog posts right here on PlanetKDE. This, I feel, represents a nice cross-section of different writing styles and topics that is in line with what the target user group would probably intend to write.

I then recorded these sentences myself (speaking rather quickly and without pauses) and ran recognition tests with PocketSphinx and various existing acoustic and language models to see how they'd perform.
Specifically, I measured what is called "Word Error Rate" or "WER", which basically tells you the percentage of words the system got wrong when comparing the perfect (manual) transcription to the one created by the recognizer. You can find more information on Wikipedia. Lower WER is better.
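
To make the metric concrete, here is a minimal sketch of how WER is commonly computed: the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcription, divided by the number of words in the reference. The function name and the simple lowercase/whitespace tokenization are my own assumptions for illustration; this is not the tooling that produced the numbers in this post.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent between a reference and a hypothesis.

    Assumption: words are compared case-insensitively after splitting
    on whitespace; real evaluation tools normalize text more carefully.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # i deletions are needed against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return 100.0 * previous[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the quick brown fox", "the quick brown box"))
```

Because insertions count as errors too, the WER can exceed 100 % when the recognizer produces many more words than were actually spoken.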

Acoustic model       Dictionary           Language model  WER
HUB4 (cont)          HUB4 (cmudict 0.6a)  HUB4            53.21 %
HUB4 (cont)          cmudict 0.7          Generic         58.32 %
HUB4 (cont)          HUB4 (cmudict 0.6a)  Gigaword, 64k   49.62 %
WSJ (cont)           HUB4 (cmudict 0.6a)  HUB4            42.81 %
WSJ (cont)           cmudict 0.7          Generic         50.69 %
WSJ (cont)           cmudict 0.7          Gigaword, 64k   41.07 %
HUB4 (semi)          HUB4 (cmudict 0.6a)  HUB4            38.23 %
HUB4 (semi)          cmudict 0.7          Generic         56.64 %
HUB4 (semi)          cmudict 0.7          Gigaword, 64k   36.18 %
Voxforge 0.4 (cont)  HUB4 (cmudict 0.6a)  HUB4            32.67 %
Voxforge 0.4 (cont)  cmudict 0.7          Generic         42.5 %
Voxforge 0.4 (cont)  cmudict 0.7          Gigaword, 64k   31.02 %

So, what can we take away from these tests? Overall, recognition accuracy is fairly low (the error rates are high) and any system based on those models would be almost unusable in practice. There are several reasons for this. Firstly, I am not a native English speaker, so my accent definitely plays a role here. Secondly, many sentences I recorded for the test corpus are purposefully complex (e.g., "Together they reinvent the great granddaddy of funnybook strongmen as a struggling orphan whose destined for greater things.") to make the comparisons between different models more meaningful. And thirdly, the models used are nowhere near perfect.

For comparison, I also analyzed the results of Google's public speech recognition API, which managed only a surprisingly poor 32.72 % WER on the same test set. If you compare that with the values above, it actually performed worse than the best of the open source alternatives. I re-ran the test twice and I can only assume that either their public API is using a simplified model for computational reasons or that their system really doesn't like my accent.
Edit: An American native speaker offered to record my test set to eliminate the accent from the equation, so I re-ran the comparison of Google's API with the best model above using his recordings and found the two systems to produce pretty much equivalent word error rates (Google: 27.83 %, Voxforge: 27.22 %).

All things considered then, 31.02 % WER for a speaker independent dictation task on a 64k word vocabulary is still a solid start and a huge win for the Voxforge model!

Fine print: The table above should not be interpreted as a definitive comparison between the tested models. The test set is comparatively small and limited to my own voice which, as mentioned above, is by no means representative.
If you're a researcher trying to find the best acoustic model for your own decoding task, you should definitely do your own comparison; it's really easy and definitely worth your while.



Hello Peter,

Just out of curiosity: in the past, did you compare your software with the commercial application "Dragon Naturally Speaking" (version 12, as of now) by Nuance?

More precisely, given the dictation as a task to carry out, how does Simon behave compared to Dragon?
Do you have some results to show ? :-)

In Italy, Dragon is quite often suggested as THE main software to buy to perform this kind of task (dictation).
Therefore, I am pretty sure it would be *extremely* useful to read some sort of comparison...

Since Simon runs on Windows as well, it should not be too hard to use both of them together on the same platform (as regards Dragon, you might simply install a demo).

Best regards!

Silvio Grosso

Peter Grasch:

I have not compared them as I'm just starting to seriously work on dictation in Simon.

However, I have no doubt that Dragon would - at this stage at least - easily outperform any model I can create right now.

They have been perfecting dictation for decades now and have spent a considerable budget not only on algorithms and software development but also on the data acquisition for their corpora.

In short: If you want to perform dictation right now, buy Dragon.

Best regards,

all with Simon is just amazing! I'm enjoying all the process!

Again, all in Simon is just amazing! I'm enjoying all the process!

Maybe this is not the best place to ask but I want you to help me.

I have Simon in a Windows 7 virtual machine and I also have it in Fedora. In Windows, it doesn't have the AT-SPI plug-in, but in Fedora it does! I just want to do this:
in windows, do you know what happens?

I saw the source files in Windows, and it has the AT-SPI files in the plug-ins folder. But it doesn't appear as a choice when I click the "manage plug-ins" button in the Simon client.

I hope someone knows, thanks.

Peter Grasch:

AT-SPI is a Linux-only technology. It doesn't exist on Windows.

However, Simon's AT-SPI plugin is still highly experimental - even on Linux. I would not recommend running it yet.

Thanks, Peter.

