Whisper-pro Offline Speech Recognition API

Many issues can lead to poor accuracy, from model mismatches to software bugs. See the accuracy guide for more detailed information on how to debug accuracy problems. Once you have established that the cause is a model mismatch, you can try to adapt an existing model to get better performance.

Please note that we focus on models that are meant to work equally well in any conditions, and we spend a lot of time training them, so training from scratch is very rarely a good solution. In most cases it is better to adapt an existing model than to train a new one.

There are four levels of adaptation you can apply:

Update our small models at runtime with the list of words to recognize
Update our small models offline with a language model built from texts
Update the language model and the dictionary inside the big model
Finetune the acoustic model on your data

Here we cover those methods:

Updating recognizer vocabulary at runtime

Vosk-API supports online modification of the vocabulary. See the demo code for details.

Note that big models with static graphs do not support this modification; you need a model with a dynamic graph.
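
As a minimal sketch of what a runtime vocabulary update can look like in Python: the code below assumes the `vosk` package is installed and that `KaldiRecognizer` accepts a JSON list of phrases as an optional third argument, as the project's demo code does. Check this against the version you use. The helper names `build_grammar` and `make_recognizer` are hypothetical, and the model directory is whatever dynamic-graph model you have unpacked.

```python
import json


def build_grammar(phrases, include_unknown=True):
    """Return a JSON list of phrases the recognizer should be limited to."""
    items = [p.strip().lower() for p in phrases if p.strip()]
    if include_unknown:
        # "[unk]" gives the recognizer a way to report speech outside the list.
        items.append("[unk]")
    return json.dumps(items)


def make_recognizer(model_dir, phrases, sample_rate=16000):
    """Create a recognizer restricted to the given phrases (assumed vosk API)."""
    # Imported here so build_grammar can be used without vosk installed.
    from vosk import Model, KaldiRecognizer

    model = Model(model_dir)
    return KaldiRecognizer(model, sample_rate, build_grammar(phrases))
```

Calling `make_recognizer("model", ["turn on the light", "turn off the light"])` would then give a recognizer to feed audio to as usual. Words that are missing from the model's vocabulary still cannot be recognized this way.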

Updating the language model

The Kaldi model used in Whisper-pro is compiled from 3 data sources:

dictionary
acoustic model
language model

You can rebuild all three with different levels of effort, but sometimes you just need to adjust the probabilities of the words to improve recognition. For that it is enough to recompile the language model from text. To do that:

Take a text that reflects the speech you want to recognize
Remove punctuation and convert everything to lowercase; a Python script can do this
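
The cleanup step above can be sketched in Python as follows. This is only one possible normalization, under the assumption that the model vocabulary is lowercase and keeps plain ASCII apostrophes inside words. The function names and file names are hypothetical.

```python
import re


def normalize_line(line):
    """Lowercase a line and replace punctuation with spaces."""
    line = line.lower()
    # Keep letters, digits, apostrophes and whitespace; blank out the rest.
    line = re.sub(r"[^\w\s']", " ", line)
    # \w also matches underscores, which are not wanted in the output.
    line = line.replace("_", " ")
    # Collapse runs of whitespace into single spaces.
    return " ".join(line.split())


def normalize_file(src_path, dst_path):
    """Write a normalized copy of src_path to dst_path, skipping empty lines."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            cleaned = normalize_line(line)
            if cleaned:
                dst.write(cleaned + "\n")
```

For example, `normalize_file("raw.txt", "text.txt")` would produce the `text.txt` used when compiling the language model. Digits are kept as they are; if the model vocabulary spells numbers out, they need to be converted separately.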

Build OpenFst and OpenGrm inside Kaldi

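# Remember where Kaldi will live; the later steps rely on KALDI_ROOT
export KALDI_ROOT=$(pwd)/kaldi
git clone https://github.com/kaldi-asr/kaldi
cd kaldi/tools
# Build the bundled tools, including OpenFst
make
# install all required dependencies and repeat `make` if needed
# Build the OpenGrm NGram tools on top of OpenFst
extras/install_opengrm.sh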

Now let's build the grammar

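# Make the OpenFst and OpenGrm tools and their libraries available
export PATH=$KALDI_ROOT/tools/openfst/bin:$PATH
export LD_LIBRARY_PATH=$KALDI_ROOT/tools/openfst/lib/fst
cd model
# Extract the word list of the existing grammar into words.txt
fstsymbols --save_osymbols=words.txt Gr.fst > /dev/null
# Compile the text, count n-grams, estimate the model,
# and convert the result to the ngram FST type
farcompilestrings --fst_type=compact --symbols=words.txt --keep_symbols text.txt | \
 ngramcount | ngrammake | \
 fstconvert --fst_type=ngram > Gr.new.fst
# Replace the old grammar with the new one
mv Gr.new.fst Gr.fst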

Use the newly created Gr.fst in place of the standard one in your model.

For more details, see the OpenGrm documentation: http://www.opengrm.org/twiki/bin/view/GRM/NGramLibrary

You cannot introduce new words this way; that is covered later.
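
Because the text is compiled against the existing word list, it can help to check for out-of-vocabulary words before building the grammar. The following is a minimal sketch, assuming `words.txt` is an OpenFst symbol table in text form (one symbol followed by its integer id per line) and `text.txt` is the normalized text. The function names are hypothetical.

```python
from collections import Counter


def load_vocabulary(words_path):
    """Read a symbol table with one 'symbol id' pair per line."""
    vocab = set()
    with open(words_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:
                vocab.add(parts[0])
    return vocab


def find_unknown_words(text_path, vocab):
    """Return a dict mapping each out-of-vocabulary word to its count."""
    unknown = Counter()
    with open(text_path, encoding="utf-8") as f:
        for line in f:
            for word in line.split():
                if word not in vocab:
                    unknown[word] += 1
    return dict(unknown)
```

Lines containing unknown words can then be removed or rewritten before the text is compiled.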

Updating words and the vocabulary in the big models

You can rebuild the graph in some of the big models (Aspire EN, Daanzu En, Russian, German, French). Some models, such as Indian English, cannot be updated yet because we have not shared all the necessary files.

To update the graph you need to do the following:

Prepare the lexicon in the Kaldi format
Prepare the language model by interpolating the generic one with a domain-specific one
Compile the lexicon
Compile the graph
Replace the graph inside the model
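
To illustrate what interpolating the generic model with a domain-specific one means, here is a toy sketch of linear interpolation on unigram probabilities only. Real language model toolkits apply the same idea to full n-gram models with backoff, and the source does not prescribe a particular tool or weight. The function name, the example words and the 0.5 weight are all illustrative assumptions.

```python
def interpolate(generic, domain, domain_weight=0.5):
    """Linearly interpolate two unigram distributions.

    P(w) = domain_weight * P_domain(w) + (1 - domain_weight) * P_generic(w)
    """
    if not 0.0 <= domain_weight <= 1.0:
        raise ValueError("domain_weight must be in [0, 1]")
    vocab = set(generic) | set(domain)
    return {
        w: domain_weight * domain.get(w, 0.0)
        + (1.0 - domain_weight) * generic.get(w, 0.0)
        for w in vocab
    }


# Toy distributions: each sums to 1 over its own vocabulary.
generic_lm = {"the": 0.6, "patient": 0.1, "car": 0.3}
domain_lm = {"the": 0.5, "patient": 0.5}
mixed_lm = interpolate(generic_lm, domain_lm, domain_weight=0.5)
```

In this toy case the domain word "patient" rises from 0.1 to about 0.3, while "car", which is absent from the domain text, drops from 0.3 to about 0.15 but stays recognizable. A higher weight pushes the result further toward the domain text.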

For a more detailed walkthrough, see the full guide on Whisper-pro model graph adaptation.

Adapting the acoustic model with finetuning

Adapting the acoustic model is also possible with about 1 hour of data. You can follow this issue for details.

Basically, you need to collect the data, put it in the Kaldi format, and then run a Kaldi script.
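
As a rough sketch of the "Kaldi format" step: a Kaldi data directory contains, at minimum, `wav.scp` (utterance id and audio path), `text` (utterance id and transcript) and `utt2spk` (utterance id and speaker id), all sorted by utterance id. The function name and the dictionary keys below are hypothetical, and the finetuning script itself is not shown because the source does not specify it.

```python
import os


def write_kaldi_data_dir(utterances, out_dir):
    """Write wav.scp, text and utt2spk for a list of utterances.

    Each utterance is a dict with the (hypothetical) keys
    'utt_id', 'speaker', 'wav_path' and 'transcript'.
    """
    os.makedirs(out_dir, exist_ok=True)
    # Kaldi expects the files sorted by utterance id; prefixing ids with the
    # speaker id keeps the utterance and speaker orderings consistent.
    utts = sorted(utterances, key=lambda u: u["utt_id"])
    with open(os.path.join(out_dir, "wav.scp"), "w", encoding="utf-8") as wav_scp, \
         open(os.path.join(out_dir, "text"), "w", encoding="utf-8") as text, \
         open(os.path.join(out_dir, "utt2spk"), "w", encoding="utf-8") as utt2spk:
        for u in utts:
            wav_scp.write(f"{u['utt_id']} {u['wav_path']}\n")
            text.write(f"{u['utt_id']} {u['transcript']}\n")
            utt2spk.write(f"{u['utt_id']} {u['speaker']}\n")
```

The `spk2utt` file is normally derived from `utt2spk` with Kaldi's `utils/utt2spk_to_spk2utt.pl`, so it is not written here.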

More detailed documentation of finetuning would be helpful, but we do not have it yet. The corresponding work is tracked in a vosk-api issue.