Speech-Based, Natural Language Conversational Recommender Systems

A while ago, I published a post about ReComment, a speech-based recommender system. Well, today I want to talk about SpeechRec: ReComment’s bigger, better, faster, stronger brother.


With ReComment we showed, that a speech-based interface can enable users to specify their preferences quicker and more accurately. Specifically, ReComment exploited information from attributes such as “a little” or “a lot” which naturally occur in spoken input, to refine its user model.
With SpeechRec, I wanted to build on the idea, that spoken language carries more semantic information than traditional user interfaces typically allow to express, by integrating paralingual features in the recommendation strategy, to give proportionally more weight to requirements that are said in a more forceful manner. For example, this allows to assign more weight to the constraint “Cheaper!, than to the statement “Cheaper..?”. In other words: SpeechRec doesn’t just listen to what you are saying, and how you are phrasing it, but also how you are pronouncing it.

Moreover, I wanted to up the ante when it came to task complexity. The ReComment prototype recommended compact digital cameras, which turned out to be a problem domain where a user’s requirements are fairly predictable, and finding a fitting product is arguably easy for the majority of users. To provoke conflicting requirements, and therefore better highlight the strengths and weaknesses of the recommender system under test, SpeechRec was evaluated with the domain of laptops. And let me tell you: recommending laptops to what were primarily students at a technical university is quite a challenge :).

Another problem we observed in the evaluation of ReComment, was that some people did not seem to “trust” that the system would understand complex input, and instead defaulted to very simple, command-like sentences, which, in turn, carried less information about the user’s true, hidden preferences. SpeechRec was therefore engineered to provoke natural user interaction by engaging user’s in a human-like, mixed initiative, spoken sales dialog with an animated avatar.


The developed prototype was written in C++, using Qt4. The speech recognition system was realized with the open-source speech recognition solution Simon, using a custom, domain-specific speech model that was especially adapted to the pervasive Styrian dialect. Simon was modified to integrate OpenEAR, which was used to evaluate a statement’s “arousal” value, to realize the paralingual weighting discussed above (this modification can be found in Simon’s “emotion” branch).

The used avatar was designed in Blender 3D. MARY TTS was used as a text-to-speech service.

A comprehensive product database of more than 600 notebook was collected from various sources. Each product was annotated with information on 40 attributes, ranging from simple ones like price to the used display panel technology. As early pilots showed that users, facing a “natural” sales dialog, did not hesitate to also talk about subjective attributes (e.g., “I want a device that looks good”), SpeechRec’s database additionally included sentiment towards 41 distinct aspects, sourced through automatic sentiment analysis from thousands of customer reviews. MongoDB was used as a DBMS. Optimality criteria further allowed SpeechRec to further decode statements such as “Show me one with a better graphics card” or “I want one with good battery life”.

SpeechRec’s recommendation strategy was based on the incremental critiquing approach as described by Reilly et al, with a fuzzy satisfaction function.


Check out the video demonstration of a typical interaction session below.

Note, how SpeechRec takes initiative in the beginning, asking for the user’s primary use case. It does this, because it determined that it did not yet have sufficient information to make a sensible recommendation (mixed-initiative dialog strategy).
Also note, how the system corrects it’s mistake of favoring to satisfy the “price” attribute over the “screen size” attribute, when the user complains about it (towards the end). SpeechRec instead comes up with a different, more favorable compromise, which even includes slightly bending the user’s request for something that costs at most 1200 euros (“It’s just a little more, and it’s worth it for you.”).

Direct link: Watch Video on Youtube
(The experiment was conducted in German to find more native speaking testers in Austria; be sure to turn on subtitles!)

Results and Further Information

The conducted empirical study showed, that the nuanced user input extracted from the natural language processing and paralingual analysis enabled SpeechRec to find better fitting products significantly quicker than when a traditional, knowledge-based recommender system. Further comparison showed that the full version of SpeechRec described above also substantially outperformed a restricted version of itself, which was configured to not act on the extracted lexical and paralingual nuances, confirming that the richer user model facilitated by spoken natural language interaction is a major contribution toward the increase in recommendation performance of SpeechRec.

More information about SpeechRec and related research an be found in the journal paper P. Grasch and A. Felfernig. On the Importance of Subtext in Recommender Systems, icom, 14(1):41-52, 2015, and in my Master’s thesis.

SpeechRec was published under the terms of the GPLv2, and is available on Github. All dependencies and external components are free software and available at their respective websites.

Facebooktwitterredditpinterestlinkedinmailby feather

Peter Grasch