Open Source Dictation: Scoping out the Problem

Today I want to start with the first "process story": creating a prototype of an open source dictation system.

Project scope

Given around a week's worth of time, I'll build a demonstrative prototype of a continuous speech recognition system for dictating texts such as emails, chat messages or reports, using only open resources and technologies.

Dictation systems are usually developed for a target user group and then adapted to a single user (the one who'll actually be using the system). For this prototype, the target user group is "English speaking techies" and I myself will be the end user to whom the system will be adapted. The software to process and handle the recognition result will be Simon. Any additions or modifications to the software will be made public.

During the course of the project, I'll be referencing different data files and resources. Unless otherwise noted, these resources are available to the public under free licenses. If you need help finding them or would like more information (including any developed models), please contact me.

Evaluating existing models

I started by developing a sensible test case for the recognizer: I selected a total of 39 sentences of mixed complexity from various sources, including a review of "Man of Steel", a couple of news articles from CNN and Slashdot, and some blog posts right here on PlanetKDE. This, I feel, represents a nice cross-section of different writing styles and topics, in line with what the target user group would probably want to write.

I then recorded these sentences myself (speaking rather quickly and without pauses) and ran recognition tests with PocketSphinx and various existing acoustic and language models to see how they'd perform.
Specifically, I measured the "Word Error Rate" (WER), which tells you the percentage of words the system got wrong when comparing the recognizer's output to a perfect (manual) transcription. You can find more information on Wikipedia. Lower WER is better.
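Under the hood, WER is just a word-level edit distance: the number of substitutions, deletions and insertions needed to turn the recognizer's output back into the reference transcript, divided by the number of words in the reference. Here is a minimal Python sketch of that calculation (for illustration only; not the exact scoring script I used):

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions) / #reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] holds the edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[len(ref)][len(hyp)] / float(len(ref))

# One substitution in a four-word reference comes out to 25 % WER:
print(wer("the cat sat down", "the cat sad down"))  # 0.25
```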

Acoustic model        Dictionary            Language model   WER
HUB4 (cont)           HUB4 (cmudict 0.6a)   HUB4             53.21 %
HUB4 (cont)           cmudict 0.7           Generic          58.32 %
HUB4 (cont)           HUB4 (cmudict 0.6a)   Gigaword, 64k    49.62 %
WSJ (cont)            HUB4 (cmudict 0.6a)   HUB4             42.81 %
WSJ (cont)            cmudict 0.7           Generic          50.69 %
WSJ (cont)            cmudict 0.7           Gigaword, 64k    41.07 %
HUB4 (semi)           HUB4 (cmudict 0.6a)   HUB4             38.23 %
HUB4 (semi)           cmudict 0.7           Generic          56.64 %
HUB4 (semi)           cmudict 0.7           Gigaword, 64k    36.18 %
Voxforge 0.4 (cont)   HUB4 (cmudict 0.6a)   HUB4             32.67 %
Voxforge 0.4 (cont)   cmudict 0.7           Generic          42.5 %
Voxforge 0.4 (cont)   cmudict 0.7           Gigaword, 64k    31.02 %

So, what can we take away from these tests? Overall, the error rates are fairly high and any system based on these models would be almost unusable in practice. There are several reasons for this: Firstly, I am not a native English speaker, so my accent definitely plays a role here. Secondly, many sentences I recorded for the test corpus are purposefully complex (e.g., "Together they reinvent the great granddaddy of funnybook strongmen as a struggling orphan whose destined for greater things.") to make the comparisons between different models more meaningful. And thirdly, the models used are nowhere near perfect.

For comparison, I also analyzed the results of Google's public speech recognition API, which scored a surprisingly measly 32.72 % WER on the same test set. Compared with the values above, it actually performed worse than the best of the open source alternatives. I re-ran the test twice and can only assume that either their public API uses a simplified model for computational reasons or their system really doesn't like my accent.
Edit: An American native speaker offered to record my test set to take the accent out of the equation, so I re-ran the comparison between Google's API and the best model above using his recordings; the two systems produced pretty much equivalent word error rates (Google: 27.83 %, Voxforge: 27.22 %).

All things considered, 31.02 % WER for a speaker-independent dictation task on a 64k word vocabulary is still a solid start and a huge win for the Voxforge model!

Fine print: The table above should not be interpreted as a definitive comparison of the tested models. The test set is comparatively small and limited to my own voice which, as mentioned above, is by no means representative.
If you're a researcher trying to find the best acoustic model for your own decoding task, you should definitely do your own comparison; it's really easy and definitely worth your while.
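To show just how little code is involved, here is a sketch of decoding a single test recording with PocketSphinx's Python bindings. The model paths are placeholders (any acoustic model, language model and dictionary combination from the table above would do), and it assumes a 16 kHz, 16 bit mono WAV file and an installed pocketsphinx Python package:

```python
from pocketsphinx import Decoder

# Placeholder paths: point these at the models you want to evaluate.
config = Decoder.default_config()
config.set_string('-hmm', 'models/voxforge_en_cd_cont')  # acoustic model directory
config.set_string('-lm', 'models/gigaword_64k.lm.dmp')   # language model
config.set_string('-dict', 'models/cmudict.0.7a.dic')    # pronunciation dictionary
decoder = Decoder(config)

# Decode one test recording (16 kHz, 16 bit, mono WAV).
with open('test_sentence_01.wav', 'rb') as f:
    f.read(44)  # crude: skip the 44-byte WAV header
    decoder.start_utt()
    decoder.process_raw(f.read(), False, True)  # pass the whole utterance at once
    decoder.end_utt()

print(decoder.hyp().hypstr)  # compare this against your reference transcript
```

Loop that over your test set, feed the hypotheses into the WER calculation above, and you have your own comparison.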

Comments

Hello Peter,

Just out of curiosity: in the past, did you compare your software with the commercial application "Dragon NaturallySpeaking" (version 12, as of now) by Nuance?

More precisely, given the dictation as a task to carry out, how does Simon behave compared to Dragon?
Do you have some results to show? :-)

In Italy, Dragon is quite often suggested as THE software to buy for this kind of task (dictation).
Therefore, I am pretty sure it would be *extremely* useful to read some sort of comparison...

Since Simon runs on Windows as well, it should not be too hard to use both of them on the same platform (as regards Dragon, you might simply install a demo).

Best regards!

Silvio Grosso

I have not compared them as I'm just starting to seriously work on dictation in Simon.

However, I have no doubt that Dragon would - at this stage at least - easily outperform any model I can create right now.

They have been perfecting dictation for decades now and have spent a considerable budget not only on algorithms and software development but also on data acquisition for their corpora.

In short: If you want to perform dictation right now, buy Dragon.

Best regards,
Peter

Everything with Simon is just amazing! I'm enjoying the whole process!

Maybe this is not the best place to ask, but I'm hoping you can help me.

I have Simon in a Windows 7 virtual machine and I also have it on Fedora. On Windows it doesn't have the AT-SPI plug-in, but on Fedora it does! I just want to do this on Windows: http://www.youtube.com/watch?v=mjVc8bKRdqA
Do you know what's going wrong?

I looked at the installed files on Windows, and the AT-SPI files are there in the plug-ins folder. But it doesn't appear as a choice when I click the "manage plug-ins" button in the Simon client.

I hope someone knows, thanks.

AT-SPI is a Linux-only technology. It doesn't exist on Windows.

However, Simon's AT-SPI plugin is still highly experimental - even on Linux. I would not recommend running it yet.

Thanks, Peter.

Hi Peter,

A very informative blog there.

I am trying to improve PocketSphinx's recognition accuracy for a specific application and had a couple of questions. The idea is to reconfirm your results for my accent and then move on to adapting acoustic models if required.

- Is there a centralized repository of the language models/acoustic models/dictionaries that you used for the comparison? This link http://www.speech.cs.cmu.edu/sphinx/models/ just mentions the HUB and WSJ language models.

- How did you configure PocketSphinx to run with long audio files? Currently, if I run pocketsphinx_continuous.exe with the -infile parameter, the transcribed result always has just a few words (as if it expects the result to conform to a predetermined length).

- Does PocketSphinx provide the WER as an output, or did you calculate it manually?

I would really appreciate any help in this regard.

Thanks,
Neeraj
