Today I want to start with the first “process story” of creating a prototype of an open source dictation system.
Given around a week’s worth of time, I’ll build a demonstrative prototype of a continuous speech recognition system for the task of dictating texts such as emails, chat messages or reports, using only open resources and technologies.
Dictation systems are usually developed for a target user group and then adapted to a single user (the one who’ll be using the system). For this prototype, the target user group is “English-speaking techies” and I myself will be the end user to whom the system will be adapted. The software to process and handle the recognition result will be Simon. Any additions or modifications to the software will be made public.
During the course of the project, I’ll be referencing different data files and resources. Unless otherwise noted, those resources are available to the public under free licenses. If you need help finding them or would like more information (including any developed models), please contact me.
Evaluating existing models
I started by developing a sensible test case for the recognizer by selecting a total of 39 sentences of mixed complexity from various sources, including a review of “Man of Steel”, a couple of news articles from CNN and Slashdot, and some blog posts right here on PlanetKDE. This, I feel, represents a nice cross-section of different writing styles and topics that is in line with what the target user group would probably intend to write.
I then recorded these sentences myself (speaking rather quickly and without pauses) and ran recognition tests with PocketSphinx and various existing acoustic and language models to see how they’d perform.
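Specifically, I measured what is called “Word Error Rate” or “WER”, which basically tells you the percentage of words the system got wrong when comparing the perfect (manual) transcription to the one created by the recognizer. More precisely, it counts the words that were substituted, deleted or inserted and divides that by the number of words in the manual transcription. You can find more information on Wikipedia. Lower WER is better.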
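To make the metric concrete, here is a minimal sketch of how WER can be computed over a test set. It is not the scoring tool used for the numbers in this post; the function names and the simple normalization (lowercasing and splitting on whitespace) are assumptions made for illustration, and real scoring tools usually normalize punctuation and numbers more carefully.

```python
def edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, deletions and insertions
    needed to turn ref_words into hyp_words (Levenshtein distance)."""
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(pairs):
    """WER in percent over (reference, hypothesis) sentence pairs.

    Errors and reference words are summed over the whole test set
    before dividing, so long sentences weigh more than short ones.
    """
    errors = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        # Assumed normalization: case-insensitive, whitespace tokens.
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        errors += edit_distance(ref_words, hyp_words)
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        raise ValueError("reference transcriptions contain no words")
    return 100.0 * errors / total_ref_words


if __name__ == "__main__":
    # Made-up example pair, not taken from the actual test set.
    test_set = [("the quick brown fox", "the quick brown box jumps")]
    print(f"{word_error_rate(test_set):.2f} %")
```

In the made-up example, one word is substituted and one is inserted against a four-word reference, which gives a WER of 50 %. Because insertions count as errors, WER can exceed 100 % when the recognizer outputs many more words than were actually spoken.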
|Acoustic model|Dictionary|Language model|WER|
|---|---|---|---|
|HUB4 (cont)|HUB4 (cmudict 0.6a)|HUB4|53.21 %|
|HUB4 (cont)|cmudict 0.7|Generic|58.32 %|
|HUB4 (cont)|HUB4 (cmudict 0.6a)|Gigaword, 64k|49.62 %|
|WSJ (cont)|HUB4 (cmudict 0.6a)|HUB4|42.81 %|
|WSJ (cont)|cmudict 0.7|Generic|50.69 %|
|WSJ (cont)|cmudict 0.7|Gigaword, 64k|41.07 %|
|HUB4 (semi)|HUB4 (cmudict 0.6a)|HUB4|38.23 %|
|HUB4 (semi)|cmudict 0.7|Generic|56.64 %|
|HUB4 (semi)|cmudict 0.7|Gigaword, 64k|36.18 %|
|Voxforge 0.4 (cont)|HUB4 (cmudict 0.6a)|HUB4|32.67 %|
|Voxforge 0.4 (cont)|cmudict 0.7|Generic|42.5 %|
|Voxforge 0.4 (cont)|cmudict 0.7|Gigaword, 64k|31.02 %|
So, what can we take away from these tests? Overall, the error rates are fairly high and any system based on those models would be almost unusable in practice. There are several reasons for the poor results. Firstly, I am not a native English speaker, so my accent definitely plays a role here. Secondly, many sentences I recorded for the test corpus are purposefully complex (e.g., “Together they reinvent the great granddaddy of funnybook strongmen as a struggling orphan whose destined for greater things.”) to make the comparisons between different models more meaningful. And thirdly, the models used are nowhere near perfect.
For comparison, I also analyzed the results of Google’s public speech recognition API, which managed only a surprisingly poor 32.72 % WER on the same test set. If you compare that with the values above, it actually performed worse than the best of the open source alternatives. I re-ran the test twice and I can only assume that either their public API is using a simplified model for computational reasons or that their system really doesn’t like my accent.
Edit: An American native speaker offered to record my test set to take the accent out of the equation. I re-ran the comparison of Google’s API with the best model above on his recordings and found that the two systems produce pretty much equivalent word error rates (Google: 27.83 %, Voxforge: 27.22 %).
All things considered then, 31.02 % WER for a speaker-independent dictation task on a 64k-word vocabulary is still a solid start and a huge win for the Voxforge model!
If you’re a researcher trying to find the best acoustic model for your own decoding task, you should definitely do your own comparison; it’s really easy and well worth your while.