As I had previously announced, I am resigning my active positions in Simon and KDE Speech.
As part of handing the project over to an eventual successor, I had announced a day-long workshop on speech recognition basics for anyone who’s interested. Mario Fux of Randa fame took me up on that offer. In a long and intense Jitsi meeting we discussed basic theory, went through all the processes involved in creating and adapting language and acoustic models, and looked at the Simon codebase. But maybe most importantly of all, we talked about what I also want to outline in this blog post: what Simon is, what it could and should be, and how to get there.
As some of you may know, Simon started as a speech control solution for people with speech impediments. This use case required immense flexibility when it comes to speech model processing. But flexibility doesn’t come cheap – it eliminates many simplifying assumptions. This increases the complexity of the codebase substantially, makes adding new features more difficult and ultimately also leads to a more confusing user interface.
Just to give you an example of how deep this problem runs, I remember fretting over what to call the smallest lexical element of the recognition vocabulary. Can we even call it a “word”? It may actually not be in Simon – this is up to the user’s configuration.
Now, everyone reading this would be forgiven for asking why, almost 9 years in, Simon hasn’t simply been streamlined yet. The answer is that removing what makes Simon difficult to understand and difficult to maintain would necessarily also mean removing most of what makes Simon a great tool for what it was engineered to be: an extremely flexible speech control platform, allowing even people with speech impediments to control all kinds of systems in (almost) all kinds of environments.
Over the years, it became clear that Simon’s core functionality could also be useful to a much wider audience.
Eventually, this led to the decision to shoehorn a “simple” speech recognition system for end-users into the Simon concept.
The logic was simple: Putting both use cases in the same application allowed for easier branding, and additional developers, who would hopefully be attracted by the prospect of working on a speech recognition system for a larger audience, would automatically improve the system for both use-cases. Moreover, the shared codebase could ease maintenance and further development.
In hindsight, however, this was a mistake.
Ostensibly making Simon easier to use meant that all the complexity, which was purposely still there to support the core use case, needed to be wrapped in another layer of “simplification”, which in practice only complicated the codebase further. For the end-user this was problematic as well: the low-level settings were merely hidden under a thin veil of convenience over a power-user system.
In my opinion, it’s time to treat Simon’s two personalities as two separate projects that simply share common libraries for common tasks.
Simon itself should remain the tool for power users, letting them fiddle with vocabulary transcriptions, grammar rules and adaptation configurations. It’s really quite good at this, and there is a genuine need, as Simon’s adoption shows.
The new project should be a straightforward dictation-enabled command-and-control system for the Plasma desktop: Plasma’s answer to the built-in speech recognition of Windows and OS X, so to speak. This project’s task would be vastly simpler than Simon’s, allowing a substantially leaner codebase. Let’s look at a small list of simplifying assumptions that could never hold in Simon, but which would be appropriate for this new project:
- As the system will be dictation-enabled, it will necessarily only work for languages where a dictation-capable acoustic model already exists. Therefore, the capability to create acoustic models from scratch is not required.
- As dictation-capable speech models would need to be built anyway, a common model architecture can be enforced, removing the need to support HTK / Julius.
- As generic speech models (base models) will be used, the pronunciations of words can be assumed to be known (for example, following the “rules” for “US English”). Therefore, users would not need to transcribe their words, as this can be done automatically through grapheme-to-phoneme conversion (the g2p model would be part of the speech model distribution). This, together with the switch from grammars to n-grams, would eliminate the need for what were the entire “Vocabulary” and “Grammar” sections in the Simon UI.
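To make the automatic-transcription idea concrete, here is a deliberately toy sketch of g2p-style lookup: known words come from the base model’s lexicon, and out-of-vocabulary words get a crude letter-to-sound fallback. Real g2p models (e.g. the statistical ones trained with Sequitur or Phonetisaurus) are far more sophisticated; the lexicon entries and rules below are invented for illustration only.

```python
# Toy grapheme-to-phoneme (g2p) sketch: lexicon lookup with a crude
# rule-based fallback for out-of-vocabulary words. All entries below
# are illustrative, not taken from any real base model.

# A tiny pronunciation lexicon, as would ship with a base model.
LEXICON = {
    "open": ["OW", "P", "AH", "N"],
    "file": ["F", "AY", "L"],
}

# Naive one-letter-one-phone fallback rules (hypothetical).
LETTER_RULES = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "l": "L", "m": "M", "n": "N",
    "o": "AO", "p": "P", "r": "R", "s": "S", "t": "T", "u": "AH",
}

def transcribe(word):
    """Return a phoneme sequence: lexicon hit if known, rules otherwise."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return [LETTER_RULES[ch] for ch in word if ch in LETTER_RULES]

print(transcribe("file"))    # known word -> lexicon pronunciation
print(transcribe("editor"))  # OOV word -> rule-based guess
```

The point is not the quality of the fallback, but that the user never has to enter a transcription by hand, which is exactly what lets the “Vocabulary” section disappear from the UI.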
But talk is cheap. Let’s look at a prototype. Let’s look at Lera.
Lera’s main user interface is a simple app indicator that gets out of the way. Clicking on it opens the configuration dialog.
Lera’s configuration dialog (mockup, non-functional) is an exercise in austerity. A drop-down lets the user choose the speech model, which should default to one matching the system’s language if available. A list of scenarios, auto-enabled based on installed applications, shows what can be controlled and how. The user should be able to improve performance by going through training (in the second tab) and to configure when Lera should be listening (in the third tab).
Here’s the best part: Lera is a working prototype. Only the core functionality, the actual decoding, is implemented, but it works out of the box, powered by an improved version of the speech model I presented on this blog in 2013, enabling continuous “dictation” in English. The model is available in Lera’s git repository; so far, the only output produced is a small popup showing the recognition result.
I implemented this prototype mostly to show off what I think the future of open-source speech recognition should look like, and how you could get started on getting there. Lera’s whole codebase has 1099 lines, 821 of which are responsible for recording audio. The actual integration of the SPHINX speech recognizer is only a handful of lines. The model, too, is built with absolute simplicity in mind. There’s no “secret sauce”, just a basic SPHINX acoustic model built from open corpora (see the readme in the model folder).
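To give a feel for how small that “handful of lines” really is, here is a sketch of the same kind of integration using the pocketsphinx Python bindings. Lera itself is C++/Qt, so this is an illustration of the API shape rather than Lera’s actual code, and the model and audio paths are placeholders for whatever base model ships with the application.

```python
# Sketch: driving a CMU SPHINX (pocketsphinx) decoder over a raw audio
# file. Paths are placeholders; the audio is assumed to be 16 kHz,
# 16-bit mono PCM, matching the acoustic model.
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm',  'model/acoustic')      # acoustic model directory
config.set_string('-lm',   'model/language.lm')   # n-gram language model
config.set_string('-dict', 'model/lexicon.dict')  # pronunciation lexicon
decoder = Decoder(config)

decoder.start_utt()
with open('utterance.raw', 'rb') as audio:
    while chunk := audio.read(4096):
        decoder.process_raw(chunk, False, False)  # feed audio incrementally
decoder.end_utt()

if decoder.hyp() is not None:
    print(decoder.hyp().hypstr)                   # the recognition result
```

Everything hard lives inside the decoder and the model; the application code really is just configuration, an audio loop, and reading back the hypothesis.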
Lera is, above all, a starting point. The next steps would be to move Simon’s “eventsimulation” library into a separate framework to be shared between Lera and Simon. Lera could then use it to type out the recognition results (see Simon’s Dictation plugin). Then, I would suggest porting a simplified notion of “Scenarios” to Lera, which should really contain only a set of commands and maybe context information (vocabulary and “grammar” can be synthesized automatically from the command triggers). Implementing training (acoustic model adaptation) would then complete a very sensible, very usable version 1.0.
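The parenthetical above, synthesizing vocabulary and “grammar” from the command triggers, is only a few lines of work. A toy sketch (the trigger phrases are invented, and a real implementation would turn the counts into a proper n-gram language model rather than stopping at raw bigram counts):

```python
# Sketch: derive the recognizer vocabulary and simple bigram counts
# from scenario command triggers, so the user never edits vocabulary
# or grammar by hand. Trigger phrases below are invented examples.
from collections import Counter

TRIGGERS = [
    "open file",
    "close file",
    "open new tab",
]

def synthesize(triggers):
    """Return (sorted vocabulary, bigram counts with sentence markers)."""
    vocab = sorted({w for phrase in triggers for w in phrase.split()})
    bigrams = Counter()
    for phrase in triggers:
        words = ["<s>"] + phrase.split() + ["</s>"]
        bigrams.update(zip(words, words[1:]))
    return vocab, bigrams

vocab, bigrams = synthesize(TRIGGERS)
print(vocab)                      # every word a trigger can produce
print(bigrams[("open", "file")])  # how often "open file" was observed
```

Because the command triggers are already part of each scenario, both artifacts fall out automatically whenever a scenario is enabled or disabled, which is precisely why Lera needs no “Vocabulary” or “Grammar” UI.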
Sadly, as I mentioned before, I will not be able to work on this any longer. I do, however, consider open-source speech recognition to be an important project, and would love to see it continued. If Lera kindled your interest, feel free to clone the repo and hack on it a little. It’s fun. I promise.