Open Source Dictation: Language Model
A language model assigns probabilities to word sequences. For example, "now a daze" and "nowadays" are pronounced exactly the same, but from context we know that "Now a daze I have a smartphone" is far less likely than "Nowadays I have a smartphone". To model such contextual information, speech recognition systems usually use an n-gram model, which encodes how likely a specific word is given the words that precede it.
When I compared the existing models earlier, the Gigaword language model (with a 64,000-word vocabulary) outperformed all other language models. I decided to try to improve upon that model.
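To make the n-gram idea concrete, here is a minimal sketch of a bigram model with add-one smoothing. The tiny corpus, the function names and the smoothing choice are illustrative assumptions, not what any of the models discussed here actually use.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # (previous word, word) pairs
    return contexts, bigrams

def bigram_prob(contexts, bigrams, prev, word, vocab_size, alpha=1.0):
    """P(word | prev) with add-alpha smoothing (an assumed, very simple scheme)."""
    return (bigrams[(prev, word)] + alpha) / (contexts[prev] + alpha * vocab_size)

# Toy training data (assumption, for illustration only).
corpus = [
    "nowadays I have a smartphone",
    "nowadays I have a laptop",
    "I was in a daze",
]
contexts, bigrams = train_bigram(corpus)
vocab = {w for s in corpus for w in s.lower().split()} | {"</s>"}

p_seen = bigram_prob(contexts, bigrams, "nowadays", "i", len(vocab))
p_unseen = bigram_prob(contexts, bigrams, "nowadays", "daze", len(vocab))
print(p_seen, p_unseen)  # the observed succession gets the higher probability
```

Real models use longer contexts (typically trigrams) and better smoothing, but the principle is the same: successions seen often in the training text get high probabilities.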
Existing Language Models
To see what's what, I again set up a test set. This time, we only want to look at the language model performance with no influence from other components.
This is done by measuring the perplexity of the language model given an input text. The perplexity value basically tells you how "confused" the language model was when seeing the test text. Think about it like this: "The weather today is round" is certainly more confusing than "The weather is good". (For the more mathematically inclined: the perplexity is two to the power of the entropy of the test set.) Ideally, word successions that make sense yield low perplexity and sentences that don't yield very high values. That way, the recognizer is discouraged from outputting hypotheses like "Now a daze I have a smartphone".
Additionally, we'll also be looking at "out of vocabulary" (OOV) words. Naturally, a language model only contains information about a limited number of words, and words that are not in the LM cannot be recognized. One might therefore be tempted to use ever-growing vocabulary sizes to mitigate the issue. However, this makes the recognizer not only slower but also less accurate: because the language model encodes more transitions, the gap between common and very rare transitions becomes smaller, which increases perplexity. Moreover, implementation details in CMU SPHINX discourage the use of vocabularies much larger than roughly 64,000 words.
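As a rough sketch of that definition: if we already have the probability the model assigned to each token of the test text, the perplexity follows directly. The probabilities below are made up for illustration; a real toolkit would derive them from the n-gram model.

```python
import math

def perplexity(token_probs):
    """Two to the power of the average negative log2 probability per token."""
    entropy = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** entropy

# Hypothetical per-token probabilities (assumptions, not measured values).
sensible = [0.2, 0.3, 0.25, 0.4]    # e.g. "The weather is good"
confusing = [0.2, 0.3, 0.25, 0.001]  # last word is very unexpected

print(perplexity(sensible))
print(perplexity(confusing))  # noticeably higher
```

A model that gave every token a probability of 0.25 would have a perplexity of 4: it is as "confused" as if it had to pick uniformly among four words at each step.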
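The OOV percentage itself is simple to compute. The following is a minimal sketch that assumes whitespace tokenization and a vocabulary held in a set; the words are invented for the example.

```python
def oov_rate(tokens, vocabulary):
    """Percentage of test tokens that do not appear in the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vocabulary)
    return 100.0 * missing / len(tokens)

# Toy vocabulary and test sentence (assumptions for illustration).
vocabulary = {"nowadays", "i", "have", "a", "smartphone"}
tokens = "nowadays i have a chromebook".split()

rate = oov_rate(tokens, vocabulary)
print(rate)  # one of five tokens is unknown
```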
Because I didn't need to record this test set, I elected to use a bigger one to get more accurate results. The test set therefore consists of 880 sentences from 495 chat messages, 169 email fragments, 175 sentences from scientific texts and 30 sentences from various news sources. The extracted sentences were not cleaned of e.g., names or specialized vocabulary. Because we're aiming for dictation, I modified the corpora to use what's called "verbalized punctuation": punctuation marks are replaced with pseudo-words like ".period" or "(open-parenthesis" that represent what a user should say in the finished system.
To sum up: We are looking for a language model with around 64 thousand words that has the lowest possible perplexity score and the fewest out-of-vocabulary words on our given test set.
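A conversion to verbalized punctuation could look roughly like the sketch below. Only ".period" and "(open-parenthesis" are taken from the text above; the other pseudo-words and the regular-expression approach are assumptions for illustration.

```python
import re

VERBALIZED = {
    ".": ".period",             # from the text above
    "(": "(open-parenthesis",   # from the text above
    ")": ")close-parenthesis",  # assumed pseudo-word
    ",": ",comma",              # assumed pseudo-word
    "?": "?question-mark",      # assumed pseudo-word
}

# Character class matching any punctuation mark we know how to verbalize.
_PUNCT = re.compile("[" + re.escape("".join(VERBALIZED)) + "]")

def verbalize(text):
    """Replace punctuation marks with stand-alone pseudo-words."""
    spaced = _PUNCT.sub(lambda m: f" {VERBALIZED[m.group(0)]} ", text)
    return " ".join(spaced.split())  # collapse the extra whitespace

result = verbalize("Nowadays I have a smartphone (a cheap one).")
print(result)
```

This naive version would also split decimals such as "3.5" or abbreviations such as "e.g.", so a real corpus needs additional normalization before this step.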
So, let's first again compare what's currently out there given our new test set.
| Language model | OOVs [%] | Perplexity |
|---|---|---|
| HUB 4 (64k) | 15.03% | 506.9 |
From this, we can already see why the Gigaword corpus performed much better than the other two language models in our earlier test. However, it still has almost 10% out-of-vocabulary words on our test set. To understand why, we have to look no further than what the corpus is built from: various news wire sources that are about a decade old by now. Given our problem domain for this experiment, this is obviously not ideal.
Can we do better?
Building a language model isn't exceptionally hard, but you need to find an extensive amount of data that should closely resemble what you want the system to recognize later on.
After a lot of experimenting (tests of all data sets below are available upon request), I settled on the following freely available data sources:
- English Wikipedia
Solid foundation over a diverse set of topics. Encyclopaedic writing style.
- U.S. Congressional Record (2007)
Somber discussions over a variety of topics using sophisticated vocabulary.
- Corpus of E-Mails of Enron Employees
Mixture of business and colloquial messages between employees.
- Stack Exchange (split between Stack Overflow and all other sites)
Questions and answers from experts over a variety of domains, many of which are technical (fitting our problem domain).
- Open Subtitles.org (dump graciously provided upon request)
Everyday, spoken speech.
- Newsgroups (alt.* with a few exceptions)
I built a separate model for each of these corpora and then combined them into one large "ensemble" model, with mixture weights chosen to minimize the perplexity on the test set. These mixture weights are visualized in the graph below.
For each of the data sets, I also calculated word counts, selected the top 20,000 to 35,000 words (depending on the variability of the corpus) and removed duplicates to end up with a word list of about 136,000 common words across the above corpora. I then further pruned this word list with a large dictionary of valid English words (more than 400,000 entries) and manually removed a few remaining entries such as foreign names to arrive at a list of around 65,000 common English words, to which I limited the ensemble language model.
The end result is a model with significantly fewer out-of-vocabulary words and lower perplexity on our test set than the Gigaword model.
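One common way to find such mixture weights for a linear interpolation of models is expectation-maximization (EM). The sketch below is an assumption about how this can be done, not necessarily the procedure or tool used for the ensemble model; the probabilities are invented.

```python
def optimize_mixture_weights(component_probs, iterations=50):
    """EM for linear interpolation weights.

    component_probs[i][k] is the probability that model k assigned to
    token i of the tuning text (hypothetical input format).
    """
    k = len(component_probs[0])
    weights = [1.0 / k] * k  # start from a uniform mixture
    for _ in range(iterations):
        totals = [0.0] * k
        for probs in component_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            for j in range(k):
                # Share of this token's probability contributed by model j.
                totals[j] += weights[j] * probs[j] / mix
        weights = [t / len(component_probs) for t in totals]
    return weights

# Two hypothetical component models scored on four tokens.
component_probs = [
    [0.50, 0.10],
    [0.40, 0.05],
    [0.30, 0.10],
    [0.01, 0.20],  # the second model is better on this token
]
weights = optimize_mixture_weights(component_probs)
print(weights)  # the first model gets the larger weight
```

Each iteration cannot decrease the likelihood of the tuning text, so the perplexity of the interpolated model on that text never gets worse from one iteration to the next.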
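The word list selection can be sketched as follows: take the most frequent words of each corpus, merge them, and keep only those found in a reference dictionary. Corpus names, cut-offs and word lists here are toy assumptions; the manual clean-up step is not modeled.

```python
from collections import Counter

def top_words(tokens, n):
    """The n most frequent words of one corpus."""
    return {word for word, _ in Counter(tokens).most_common(n)}

def build_vocabulary(corpora, top_n, valid_words):
    """Union of per-corpus top words, restricted to a dictionary of valid words."""
    candidates = set()
    for name, tokens in corpora.items():
        candidates |= top_words(tokens, top_n[name])  # set union removes duplicates
    return candidates & valid_words

# Toy corpora and cut-offs (assumptions for illustration).
corpora = {
    "wiki": "the cat sat on the mat the cat".split(),
    "mail": "pls send the report pls the".split(),
}
top_n = {"wiki": 2, "mail": 2}
valid_words = {"the", "cat", "send", "report"}

vocabulary = build_vocabulary(corpora, top_n, valid_words)
print(sorted(vocabulary))
```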
|Language model||OOVs [%]||Perplexity|
In order to perform recognition, we also need a phonetic dictionary. Of the 65k words in the ensemble language model, about 55k were already in the original CMU dictionary. The pronunciations for the remaining 10k words were (mostly) automatically synthesized with the CMU SPHINX g2p framework. While I was at it, I also applied the casing of the (conventional) dictionary to the (otherwise all uppercase) phonetic dictionary and language model. While a bit crude, this takes care of e.g., uppercasing names, languages, countries, etc.
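The casing step can be sketched as a simple lookup. This is an assumed implementation: the first casing found in the conventional dictionary wins, and words missing from that dictionary fall back to lowercase, which is a choice made here only for illustration.

```python
def apply_casing(words, cased_dictionary):
    """Map all-uppercase entries to the casing used in a conventional dictionary."""
    lookup = {}
    for entry in cased_dictionary:
        lookup.setdefault(entry.upper(), entry)  # first casing seen wins
    # Assumed fallback: unknown words are simply lowercased.
    return [lookup.get(word, word.lower()) for word in words]

# Toy data (assumptions for illustration).
cased_dictionary = ["London", "English", "smartphone"]
words = ["LONDON", "ENGLISH", "SMARTPHONE", "XYZZY"]

cased = apply_casing(words, cased_dictionary)
print(cased)
```

The crudeness shows when a dictionary contains both "Polish" and "polish": only one of them can survive the lookup.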
So how does the resulting language model perform compared to the next best one we tested? On the same test set, with the same acoustic model, the word error rate dropped by 1.71 percentage points (from 31.02 % to 29.31 %), a relative improvement of more than 5 percent.
| Acoustic model | Dictionary | Language model | WER |
|---|---|---|---|
| Voxforge 0.4 (cont) | cmudict 0.7 | Gigaword, 64k | 31.02 % |
| Voxforge 0.4 (cont) | Ensemble 65k | Ensemble 65k | 29.31 % |
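For reference, the word error rate is the word-level edit distance between the reference text and the recognizer output, divided by the number of reference words. The sketch below shows that calculation on a made-up sentence pair and then reproduces the absolute and relative improvement from the two WER values in the table; the function name is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)

example = word_error_rate("nowadays i have a smartphone",
                          "now a daze i have a smartphone")
print(example)  # three errors against five reference words

# Values from the table above.
gigaword_wer = 31.02
ensemble_wer = 29.31
absolute = gigaword_wer - ensemble_wer
relative = 100.0 * absolute / gigaword_wer
print(round(absolute, 2), round(relative, 2))
```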