You pick, I work: Dictation, Assistant or Translator?

A little while ago, I mentioned that I'll be giving a talk about the current state of open source speech recognition at this year's Akademy.
As part of that talk, I want to show off a tech-demo of a moonshot use case of open source speech recognition to not only demonstrate what is already possible, but also show off the limits of the current state of the art.

So a couple of days ago, I asked what application of speech recognition technology would be most interesting for you, and many of you responded. I extracted the three options that broadly cover all suggestions: dictation (like Dragon NaturallySpeaking), a virtual assistant (like Siri) and simultaneous translation (like Star Trek's universal translator).

You now get to pick one of those three from the poll below.

After the poll closes (a week from now), I'll take the idea that received the most votes and devote about a week to building a prototype based on currently available open source language processing tools. This prototype will then be demonstrated at this year's Akademy.

Happy voting!


Poll: Dictation, Assistant or Translator?

Dictation: 48% (124 votes)
Virtual personal assistant: 36% (94 votes)
Simultaneous translation: 16% (41 votes)
Total votes: 259


For the virtual assistant use case, don't you share some common goals with Nepomuk, where they are working on human-entered search queries? When I read about that project, I thought right away about combining it with speech recognition for Plasma Active.

Peter Grasch:

Yes, I thought about that as well. I'd actually have a couple of ideas for all the implementations in the poll.

Then again, please keep in mind that this is about a tech demo, not a shippable product. That comes later.

If Simon implements full dictation, then all of the others can follow. Already, if you can launch KRunner through Simon and dictate the search, you can use the KDE web shortcuts to search Amazon, IMDb, DuckDuckGo and various other websites, which is almost an assistant. Dictation combined with the Nepomuk natural language parser makes for even more interesting uses. And of course one would need dictation to even implement a live translator.

Peter Grasch:

Yes, all three ideas require a large-vocabulary continuous speech recognizer.
However, there is more to dictation than that. Think about corrections, upper and lower casing, punctuation marks, etc.

It's also the one project (of the three) where most of the action will happen outside of Simon, requiring special communication (for example, you'll want to control the whole input field in KRunner if you're going to do correction).
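To make the "more to dictation" point concrete, here is a minimal sketch of the kind of post-processing a dictation front end needs on top of raw recognizer output. It is not how Simon does it; the function name `format_dictation` and the tiny table of spoken punctuation are assumptions made up for this illustration.

```python
import re

# Hypothetical table of spoken punctuation (an assumption for this sketch);
# a real dictation system needs a far larger, language-specific one.
SPOKEN_PUNCTUATION = {
    "comma": ",",
    "period": ".",
    "question mark": "?",
    "exclamation mark": "!",
}


def format_dictation(raw: str) -> str:
    """Turn lowercase recognizer output into punctuated, capitalized text."""
    text = raw
    # Handle longer spoken forms first so multi-word entries are matched whole.
    for spoken in sorted(SPOKEN_PUNCTUATION, key=len, reverse=True):
        symbol = SPOKEN_PUNCTUATION[spoken]
        # Consume the whitespace before the spoken form so the symbol
        # attaches directly to the preceding word.
        text = re.sub(r"\s*\b" + re.escape(spoken) + r"\b", symbol, text)
    # Capitalize the first letter of the text and of every new sentence.
    text = re.sub(
        r"(^|[.?!]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    return text.strip()


print(format_dictation("hello comma how are you question mark i am fine period"))
```

Note the obvious weakness: a sentence that legitimately contains the word "period" gets mangled. Telling commands apart from content is exactly the sort of problem that makes dictation more than "just" recognition.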

Peter Grasch:

Oh and because this has come up more than once already: it's really not realistic to "dictate" e.g., "gg: #include says file not found" into krunner or "cd ~/bin" into Konsole. That's not how dictation works.

To be able to say that, one would need a specialized model that expects stuff like "gg colon hashtag include" and knows how to handle it. Such a model, however, would be near-useless for, say, writing a letter.

For searches and the like, retrofitting existing interfaces with "dictation" is honestly probably the worst thing you could do. It's much easier, for example, to recognize "do a Google search on hashtag include ..." than the query above.
This, then, is where the "virtual assistant" project becomes relevant.
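As a rough sketch of why the command form is easier to deal with: once the utterance follows a fixed carrier phrase, mapping it to an action is a small pattern match. Everything here is an assumption for illustration (the phrase pattern, the `parse_search_command` helper and the two URL templates); it is not the planned design of the prototype.

```python
import re
from typing import Optional
from urllib.parse import quote_plus

# Hypothetical search engines and URL templates, for illustration only.
SEARCH_ENGINES = {
    "google": "https://www.google.com/search?q={}",
    "amazon": "https://www.amazon.com/s?k={}",
}

# Carrier phrase: "do a <engine> search on/for <query>".
COMMAND = re.compile(
    r"^do an? (?P<engine>\w+) search (?:on|for) (?P<query>.+)$",
    re.IGNORECASE,
)


def parse_search_command(utterance: str) -> Optional[str]:
    """Return a search URL for a recognized command, or None otherwise."""
    match = COMMAND.match(utterance.strip())
    if match is None:
        return None
    template = SEARCH_ENGINES.get(match.group("engine").lower())
    if template is None:
        return None
    # URL-encode the free-form part of the utterance.
    return template.format(quote_plus(match.group("query")))


print(parse_search_command("do a Google search on hashtag include"))
print(parse_search_command("Google lol cats"))
```

The carrier phrase gives the recognizer a predictable context, and only the part after "on" is free-form.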

Thank you for your response. As far as the web shortcuts go, most of them do have a long version and can be delimited by a space instead of a colon. So "google" followed by one's search terms is possible: "Google lol cats" or "amazon go Girl" should bring up the search results in a more natural way. But yes, "gg", "amz" and the other shorthand shortcuts followed by a colon would not make any sense.

Peter Grasch:

I'm afraid this is still far from a good idea because "Google lol cats" is really not a common "sentence". Even if I'm doing my very best (and I'm trying) to model expected sentences from a vast number of sources, it's unlikely that I'll hit "Google lol cats" even once.
That means you'll be fighting the recognizer's "intuition".
Try doing that with a human, if you want: the next time you chat with someone about something mundane (non-technical, not geek culture), respond to a question with "Google lol cats". I'd be almost certain that she or he won't understand it without you repeating it. That's because we humans do the same thing: we expect certain responses, and if there's one that's really far off the mark, we usually require a second take to recognize it (case in point: "yeah.... wait, what?").
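The recognizer's "intuition" can be illustrated with a toy statistical language model. The sketch below uses an add-one smoothed bigram model over a made-up four-sentence corpus; the corpus, the function names and the choice of bigrams are all assumptions for illustration and say nothing about the models actually used here.

```python
import math
from collections import Counter

# Made-up toy corpus; a real model is trained on vastly more text.
CORPUS = [
    "how are you today",
    "what are you doing today",
    "i am doing fine",
    "how is the weather today",
]


def train_bigrams(sentences):
    """Count context words and word pairs, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        unigrams.update(tokens[:-1])  # every token that serves as a context
        bigrams.update(zip(tokens, tokens[1:]))
    # One extra slot so that unseen words share some probability mass.
    return unigrams, bigrams, len(vocab) + 1


def log_prob(sentence, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a whole sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log(
            (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        )
    return total


unigrams, bigrams, vocab_size = train_bigrams(CORPUS)
for sentence in ("how are you today", "google lol cats"):
    print(sentence, log_prob(sentence, unigrams, bigrams, vocab_size))
```

Even though "google lol cats" is the shorter sentence, it scores lower, because none of its word pairs were ever seen. During decoding, a low score like this makes acoustically similar but more "expected" word sequences win.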

I think that dictation would be a really useful addition to this new project (KDE: Artikulate).

Peter Grasch:

Artikulate is great but pronunciation training != dictation. This needs different algorithms.
