Passing the Torch

This fall, the Simon project will turn 9 years old. What started with a team of ambitious 17-year-olds and a school project evolved, over the course of more than 3000 commits and 4 major releases, into a sophisticated speech-recognition platform.

Like most open-source projects, Simon saw periods of stagnation and periods of explosive growth. It saw GSoCs and Open Academies, research projects and commercial deployments. And through it all, I was proud to be Simon’s maintainer. Almost 9 years in, however, it is time for me to take a step back, pass on the torch, and focus on a new adventure.

At the end of this month, I will start an exciting new career at Apple’s Siri team in Paris. I will therefore sadly no longer be able to serve as Simon’s maintainer, starting on the 20th of August.
I hereby want to announce an open call to find a new maintainer for KDE’s speech recognition efforts.

To help my successor get started, I will hold an online workshop on the upcoming Tuesday, 4th of August 2015 (starting at around 11am, Vienna time), where I will introduce speech-recognition basics, outline Simon’s codebase, and answer any arising questions. I am also planning on discussing my sketches and ideas on where I personally think the project should be going, and how to get there. After the workshop, I will stay available to answer any questions, especially when it comes to the workings of the actual speech recognition, via email until the 20th, to make the transition as smooth as possible.

I encourage everyone interested in potentially leading the Simon team to introduce themselves on the kde-speech@kde.org mailing list. Prior knowledge about speech recognition is not required.

At this point, I also want to extend my sincere gratitude to everyone involved in turning this project into what it has become. Franz and Mathias Stieger, Alexander Breznik, Frederik Gladhorn, Yash Shah, Vladislav Sitalo, Adam Nash, Patrick von Reth, Manfred Scheucher, Phillip Goriup, Martin Gigerl, Bettina Sturmann, Susanne Tschernegg, Jos Poortvliet, Lydia Pintscher, the KDE e.V. board, and the countless others: thank you!


Speech-Based, Natural Language Conversational Recommender Systems

A while ago, I published a post about ReComment, a speech-based recommender system. Well, today I want to talk about SpeechRec: ReComment’s bigger, better, faster, stronger brother.


With ReComment, we showed that a speech-based interface can enable users to specify their preferences more quickly and more accurately. Specifically, ReComment exploited information from attributes such as “a little” or “a lot”, which naturally occur in spoken input, to refine its user model.
With SpeechRec, I wanted to build on the idea that spoken language carries more semantic information than traditional user interfaces typically allow users to express. To do so, it integrates paralingual features into the recommendation strategy, giving proportionally more weight to requirements that are said in a more forceful manner. For example, this makes it possible to assign more weight to the constraint “Cheaper!” than to the statement “Cheaper..?”. In other words: SpeechRec doesn’t just listen to what you are saying and how you are phrasing it, but also to how you are pronouncing it.
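
To make the weighting idea a bit more concrete, here is a minimal Python sketch. It is not SpeechRec's actual code (the prototype is C++): the `Critique` class, the linear formula, the `arousal_gain` constant, and the assumption that arousal arrives as a number between 0 and 1 are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Critique:
    """One spoken requirement plus how it was said (hypothetical structure)."""
    attribute: str   # product attribute the user talked about, e.g. "price"
    modifier: float  # lexical nuance: "a little" < 1.0 < "a lot" (assumed scale)
    arousal: float   # paralingual arousal estimate, assumed to lie in [0, 1]


def critique_weight(c: Critique, base: float = 1.0, arousal_gain: float = 1.0) -> float:
    """Scale a critique's weight by its lexical modifier and its arousal.

    The linear form and both constants are illustrative assumptions.
    """
    arousal = min(max(c.arousal, 0.0), 1.0)  # clamp noisy estimates
    return base * c.modifier * (1.0 + arousal_gain * arousal)


forceful = Critique("price", modifier=1.0, arousal=0.9)  # "Cheaper!"
hesitant = Critique("price", modifier=1.0, arousal=0.2)  # "Cheaper..?"

# The forceful request ends up with the larger weight.
print(critique_weight(forceful), critique_weight(hesitant))
```

The only point of the sketch is the ordering: the same words, said more forcefully, count for more in the user model.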

Moreover, I wanted to up the ante when it came to task complexity. The ReComment prototype recommended compact digital cameras, which turned out to be a problem domain where a user’s requirements are fairly predictable, and finding a fitting product is arguably easy for the majority of users. To provoke conflicting requirements, and therefore better highlight the strengths and weaknesses of the recommender system under test, SpeechRec was evaluated with the domain of laptops. And let me tell you: recommending laptops to what were primarily students at a technical university is quite a challenge :).

Another problem we observed in the evaluation of ReComment was that some people did not seem to “trust” that the system would understand complex input, and instead defaulted to very simple, command-like sentences, which, in turn, carried less information about the user’s true, hidden preferences. SpeechRec was therefore engineered to provoke natural user interaction by engaging users in a human-like, mixed-initiative, spoken sales dialog with an animated avatar.


The developed prototype was written in C++, using Qt4. The speech recognition system was realized with the open-source speech recognition solution Simon, using a custom, domain-specific speech model that was especially adapted to the pervasive Styrian dialect. Simon was modified to integrate OpenEAR, which was used to evaluate a statement’s “arousal” value, to realize the paralingual weighting discussed above (this modification can be found in Simon’s “emotion” branch).

The avatar was designed in Blender 3D. MARY TTS was used as a text-to-speech service.

A comprehensive product database of more than 600 notebooks was collected from various sources. Each product was annotated with information on 40 attributes, ranging from simple ones like price to the display panel technology used. As early pilots showed that users, facing a “natural” sales dialog, did not hesitate to also talk about subjective attributes (e.g., “I want a device that looks good”), SpeechRec’s database additionally included sentiment towards 41 distinct aspects, sourced through automatic sentiment analysis from thousands of customer reviews. MongoDB was used as a DBMS. Optimality criteria further allowed SpeechRec to decode statements such as “Show me one with a better graphics card” or “I want one with good battery life”.

SpeechRec’s recommendation strategy was based on the incremental critiquing approach as described by Reilly et al., with a fuzzy satisfaction function.
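
For readers who have not come across critiquing-based recommenders, here is a rough Python sketch of the general idea: every critique the user has voiced so far is kept, and candidate products are ranked by how well they satisfy that history, with a degree of satisfaction between 0 and 1 instead of a hard yes/no. This is not the code or the exact satisfaction function used in SpeechRec. The linear ramp, the 10% tolerance, the weights, and the two example laptops are assumptions, and other parts of the approach (such as similarity to the currently shown product) are left out.

```python
def fuzzy_at_most(value, limit, tolerance=0.1):
    """Degree in [0, 1] to which `value <= limit` holds.

    Fully satisfied up to `limit`, then falls linearly to 0 at
    `limit * (1 + tolerance)`. The ramp shape is an assumption.
    """
    if value <= limit:
        return 1.0
    slack = limit * tolerance
    if slack <= 0:
        return 0.0
    return max(0.0, 1.0 - (value - limit) / slack)


def fuzzy_at_least(value, limit, tolerance=0.1):
    """Mirror image of fuzzy_at_most for "at least" requirements."""
    if value >= limit:
        return 1.0
    slack = limit * tolerance
    if slack <= 0:
        return 0.0
    return max(0.0, 1.0 - (limit - value) / slack)


def compatibility(product, critiques):
    """Weighted average satisfaction over all critiques voiced so far."""
    total = sum(weight for _, _, weight in critiques)
    if total == 0:
        return 1.0  # no critiques yet: every product is equally acceptable
    score = sum(weight * check(product[attr]) for attr, check, weight in critiques)
    return score / total


# Hypothetical products and critique history: (attribute, check, weight).
laptops = [
    {"name": "A", "price": 1150, "screen": 13.3},
    {"name": "B", "price": 1250, "screen": 15.6},
]
critiques = [
    ("price", lambda v: fuzzy_at_most(v, 1200), 1.0),    # "at most 1200 euros"
    ("screen", lambda v: fuzzy_at_least(v, 15.0), 2.0),  # stressed more strongly
]

best = max(laptops, key=lambda p: compatibility(p, critiques))
print(best["name"])
```

Because satisfaction is gradual, a laptop that is slightly over budget but meets a more heavily weighted requirement can still come out on top, which is the kind of compromise a hard filter would never offer.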


Check out the video demonstration of a typical interaction session below.

Note how SpeechRec takes the initiative in the beginning, asking for the user’s primary use case. It does this because it determined that it did not yet have sufficient information to make a sensible recommendation (mixed-initiative dialog strategy).
Also note how the system corrects its mistake of favoring the “price” attribute over the “screen size” attribute when the user complains about it (towards the end). SpeechRec instead comes up with a different, more favorable compromise, which even includes slightly bending the user’s request for something that costs at most 1200 euros (“It’s just a little more, and it’s worth it for you.”).

Direct link: Watch Video on YouTube
(The experiment was conducted in German to find more native-speaking testers in Austria; be sure to turn on subtitles!)

Results and Further Information

The empirical study we conducted showed that the nuanced user input extracted from the natural language processing and paralingual analysis enabled SpeechRec to find better-fitting products significantly more quickly than a traditional, knowledge-based recommender system. Further comparison showed that the full version of SpeechRec described above also substantially outperformed a restricted version of itself, which was configured to not act on the extracted lexical and paralingual nuances. This confirms that the richer user model facilitated by spoken natural language interaction is a major contributor to SpeechRec’s increased recommendation performance.

More information about SpeechRec and related research can be found in the journal paper P. Grasch and A. Felfernig. On the Importance of Subtext in Recommender Systems, icom, 14(1):41-52, 2015, and in my Master’s thesis.

SpeechRec was published under the terms of the GPLv2 and is available on GitHub. All dependencies and external components are free software and available at their respective websites.