Over the last couple of weeks, I’ve been working towards a demo of open source speech recognition. I reviewed the existing resources and managed to improve both the acoustic and the language model. What remained was turning Simon into a real dictation system.
Making Simon work with large-vocabulary models
First of all, I needed to hack Simond a bit to accept and use an n-gram based language model instead of the scenario grammar whenever the former is available. With this little bit of trickery, Simon was already able to use the models I built over the last few weeks.
Sadly, I immediately noticed a big performance issue: up until now, Simon basically recorded one sample until the user stopped speaking and only then started recognizing. While this is not a problem when the “sentences” are constrained to simple, short commands, it causes significant lag as the length of the sentences, and therefore the time required for recognition, increases. Even when recognizing faster than real time, this essentially meant that you had to wait for ~2 seconds after saying a ~3 second sentence.
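To make the arithmetic concrete, here is a back-of-the-envelope sketch. It assumes the decoder runs at a constant real-time factor (processing time divided by audio length); the value 0.67 is only inferred from the "~2 seconds for ~3 seconds" figure above and is not a measured number, and the function name is made up for illustration.

```python
def batch_lag_seconds(speech_seconds: float, real_time_factor: float) -> float:
    """Wait after the user stops speaking when decoding only starts at that point.

    Assumes a constant real-time factor, which is a simplification.
    """
    return speech_seconds * real_time_factor


# A decoder running at roughly 0.67x real time on a 3 second sentence:
lag = batch_lag_seconds(3.0, 0.67)
print(f"about {lag:.1f} seconds of waiting")
```

The lag grows linearly with the length of the utterance, which is why short commands never exposed the problem.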
To keep Simon snappy, I implemented continuous recognition in Simond (for pocketsphinx): Simon now feeds data to the recognizer engine as soon as the initial buffer is filled, making the whole system much more responsive.
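The idea can be sketched roughly as follows. This is not Simond's code and not the pocketsphinx API: `StubDecoder`, `feed` and `result` are hypothetical stand-ins that only count what they receive, and the chunk size is arbitrary. The point is merely that audio is handed over chunk by chunk instead of as one big sample at the end.

```python
from typing import Iterator


def chunks(samples: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield successive chunks of raw audio; the last one may be shorter."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


class StubDecoder:
    """Hypothetical stand-in for a streaming decoder; it only keeps counters."""

    def __init__(self) -> None:
        self.bytes_seen = 0
        self.calls = 0

    def feed(self, chunk: bytes) -> None:
        # A real decoder would advance its search here while audio keeps arriving.
        self.bytes_seen += len(chunk)
        self.calls += 1

    def result(self) -> str:
        return f"decoded {self.bytes_seen} bytes in {self.calls} chunks"


def recognize_continuously(samples: bytes, chunk_size: int, decoder: StubDecoder) -> str:
    """Feed audio to the decoder as soon as each chunk is available."""
    for chunk in chunks(samples, chunk_size):
        decoder.feed(chunk)
    # By the time the speaker stops, only the final chunk is still unprocessed.
    return decoder.result()


print(recognize_continuously(b"\x00" * 10, 4, StubDecoder()))
```

With a decoder that keeps up with real time, the remaining work after the user stops speaking shrinks from the whole utterance to roughly one chunk.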
Revisiting the Dictation plugin
Even before this project started, Simon already had a “Dictation” command plugin. Basically, this plugin would just write out everything that Simon recognizes. But that’s far from everything there is to dictation from a software perspective.
First of all, I needed to take care of replacing the special words used for punctuation, like “.period”, with their associated signs. To do that, I implemented a configurable list of string replaces in the dictation plugin.
An already existing option to add a given text at the end of a recognition result takes care of adding spaces after sentences if configured to do so. I also added the option to uppercase the first letter of every new spoken sentence.
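As a minimal sketch of this post-processing, assuming the replacement list is a plain sequence of (spoken, written) string pairs: only “.period” comes from the text above, while “,comma”, the leading space in the patterns, and the function and parameter names are assumptions made for illustration, not the plugin's actual configuration.

```python
from typing import List, Tuple

# Hypothetical replacement list; the leading space glues the sign to the previous word.
REPLACEMENTS: List[Tuple[str, str]] = [
    (" .period", "."),
    (" ,comma", ","),
]


def format_result(
    recognized: str,
    replacements: List[Tuple[str, str]],
    append_text: str = " ",
    uppercase_first: bool = True,
) -> str:
    """Apply the string replaces, capitalize the first letter, append trailing text."""
    text = recognized
    for spoken, written in replacements:
        text = text.replace(spoken, written)
    if uppercase_first and text:
        text = text[0].upper() + text[1:]
    return text + append_text


print(repr(format_result("this is a test .period", REPLACEMENTS)))
```

Keeping the replacements as data rather than code is what makes them configurable per language model.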
Then, I set up some shortcut commands that would be useful while dictating (“Go to the end of the document” for ctrl+end or “Delete that” for backspace, for example).
To deal with incorrect recognition results, I also wanted to be able to modify already written text. To do that, I made Simon aware of the currently focused text input field by using AT-SPI 2. I then implemented a special “Select x” command that searches through the current text field and selects the text “x” if found. This enables the user to select the offending word(s) to either remove them or simply dictate the correction.
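The search part of “Select x” can be sketched like this. It deliberately leaves out AT-SPI: the field's text is passed in as a plain string, and the function only computes the character offsets that would then be selected. Matching case-insensitively and preferring the last occurrence (the most recently dictated one) are assumptions of this sketch, not necessarily what Simon does.

```python
from typing import Optional, Tuple


def find_selection(field_text: str, target: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets of the last match of target, or None.

    Uses a simple lower-case comparison, which is adequate for typical text.
    """
    if not target:
        return None
    start = field_text.lower().rfind(target.lower())
    if start == -1:
        return None
    return start, start + len(target)


text = "I can here you. Come here"
print(find_selection(text, "here"))
print(find_selection(text, "hear"))
```

The returned offsets are what an accessibility interface would need in order to set the selection in the focused widget.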
So without much ado, this is the end result:
Of course, this is just the beginning. If we want to build a real, competitive open source speech recognition offering we have to tackle – among others – the following challenges:
- Turning the adaption I did manually into an integrated, guided setup procedure for Simon (enrollment).
- Continuing to work towards better language and acoustic models in general. There’s a lot to do there.
- Improving the user interface for the dictation: We should show off the current (partial) hypothesis even while the user is speaking. That would make the system feel even more responsive.
- Better accounting for spontaneous input: Simon should be aware of (and ignore) filler words, support mid-sentence corrections, false starts, etc.
- Integrating semantic logic into the language model: For example, in the current prototype, recognizing “Select x” is pretty tricky because e.g., “Select hear” is not a sentence that makes sense according to the language model – it does in the application, though (select the text “hear” in the written text for correction / deletion).
- Better integrating dictation with traditional command & control: When not dictating texts, we should still exploit the information we do have (available commands) to keep recognition accuracy as high as it is for the limited-vocabulary use case we have now. A mixture of (or switching between) grammar and language model should be explored.
- Better integration in other apps: The AT-SPI information used for correcting mistakes is sadly not consistent across toolkits and widgets. Many KDE widgets are in fact not accessible through AT-SPI (e.g. the document area of Calligra Words does not report itself as a text field). This is mostly down to the fact that no other application currently requires the kind of information Simon does.
Even this rather long list is just a tiny selection of what I can think of right off the top of my head – and I’m not even touching on improvements in e.g. CMU SPHINX.
There’s certainly still a lot left to do, but all of it is very exciting and meaningful work.
I’ll be at the Akademy conference for the coming week where I’ll also be giving a talk about the future of open source speech recognition. If you want to get involved in the development of an open source speech recognition system capable of dictation: Get in touch with me – either in person, or – if you can’t make it to Akademy – write me an email!