Many issues can lead to poor accuracy, from model mismatch to software bugs. See the accuracy guide for more detail on how to debug accuracy problems. Once you have established that the cause is a model mismatch, you can try to adapt an existing model to get better accuracy.
Please note that we focus on models that work equally well in any conditions, and we spend a lot of time training them, so training from scratch is very rarely a good solution. In most cases it is better to adapt an existing model than to train a new one.
There are four levels of adaptation you can apply:
Here we cover those methods:
Vosk-API supports online modification of the vocabulary. See the demo code for details.
Note that big models with static graphs do not support this modification; you need a model with a dynamic graph.
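As a minimal sketch of what runtime vocabulary modification can look like from Python: the recognizer accepts a JSON list of phrases as an optional third argument, and `[unk]` is used in the Vosk demo code to absorb everything outside the list. The helper names `build_grammar` and `make_recognizer` are hypothetical, and lowercasing the phrases is an assumption about the model's vocabulary.

```python
import json


def build_grammar(phrases, allow_unknown=True):
    """Return a JSON list of phrases to pass to the recognizer as its grammar."""
    # Assumption: the model vocabulary is lowercase, so phrases are lowercased
    # and internal whitespace is collapsed. Blank entries are skipped.
    items = [" ".join(p.lower().split()) for p in phrases if p.strip()]
    if allow_unknown:
        # "[unk]" lets the recognizer report speech outside the phrase list.
        items.append("[unk]")
    return json.dumps(items)


def make_recognizer(model_dir, sample_rate, phrases):
    """Create a recognizer limited to `phrases`.

    Needs the vosk package and a model with a dynamic graph.
    """
    # Imported lazily so that build_grammar can be used without vosk installed.
    from vosk import Model, KaldiRecognizer

    model = Model(model_dir)
    return KaldiRecognizer(model, sample_rate, build_grammar(phrases))
```

For example, `make_recognizer("model", 16000, ["turn on the light", "turn off the light"])` would restrict recognition to those two commands, assuming a dynamic-graph model is unpacked in a directory named `model`.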
The Kaldi model used in Vosk is compiled from 3 data sources:
You can rebuild all three with different levels of effort, but sometimes you just need to adjust the probabilities of the words to improve recognition. For that, it is enough to recompile the language model from text. To do that:
Take a text that reflects the speech you want to recognize.
Remove punctuation and convert everything to lowercase; you can do this with a Python script.
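The normalization step could be sketched as follows. This is only one possible approach: it keeps letters, digits and apostrophes and replaces everything else with spaces. The function names and the file names in the usage note are illustrative. Digits are left as they are, so numbers may still need to be spelled out by hand if the model vocabulary has no digit tokens.

```python
import re


def normalize_line(line):
    """Lowercase a line and strip punctuation, keeping apostrophes inside words."""
    line = line.lower()
    # \w is Unicode-aware in Python 3, so non-English letters survive.
    # Underscores count as \w, so they are removed explicitly.
    line = re.sub(r"[^\w\s']|_", " ", line)
    # Collapse runs of whitespace left behind by the substitution.
    return " ".join(line.split())


def normalize_file(src_path, dst_path):
    """Write a normalized copy of src_path to dst_path, skipping empty lines."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            cleaned = normalize_line(line)
            if cleaned:
                dst.write(cleaned + "\n")
```

Calling `normalize_file("raw.txt", "text.txt")` would produce the `text.txt` file that the language model commands read.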
# Get Kaldi, build its tools and install OpenGrm
export KALDI_ROOT=`pwd`/kaldi
git clone https://github.com/kaldi-asr/kaldi
cd kaldi/tools
make
# install all required dependencies and repeat `make` if needed
extras/install_opengrm.sh
export PATH=$KALDI_ROOT/tools/openfst/bin:$PATH
export LD_LIBRARY_PATH=$KALDI_ROOT/tools/openfst/lib/fst
# Rebuild the language model from text.txt, which is expected inside the model directory
cd model
fstsymbols --save_osymbols=words.txt Gr.fst > /dev/null
farcompilestrings --fst_type=compact --symbols=words.txt --keep_symbols text.txt | \
ngramcount | ngrammake | \
fstconvert --fst_type=ngram > Gr.new.fst
mv Gr.new.fst Gr.fst
Use the newly created Gr.fst in place of the standard one in your model.
For more details, see the OpenGrm documentation: http://www.opengrm.org/twiki/bin/view/GRM/NGramLibrary
You cannot introduce new words this way; that is covered later.
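Because new words cannot be added at this step, it can help to check the adaptation text against the model vocabulary before rebuilding. The sketch below assumes `words.txt` is the symbol table saved by `fstsymbols` (one `word id` pair per line) and that `text.txt` is already normalized. The function names are hypothetical.

```python
from collections import Counter


def load_vocabulary(lines):
    """Collect the words from symbol-table lines of the form 'word id'."""
    vocab = set()
    for line in lines:
        parts = line.split()
        if parts:
            vocab.add(parts[0])
    return vocab


def find_oov(lines, vocab):
    """Count the words in the text lines that are missing from the vocabulary."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            if word not in vocab:
                counts[word] += 1
    return counts
```

Both functions accept any iterable of lines, so open file objects work directly: `find_oov(open("text.txt", encoding="utf-8"), load_vocabulary(open("words.txt", encoding="utf-8")))`. A non-empty result lists the words that the recompiled language model would not be able to cover.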
You can rebuild the graph in some of the big models (Aspire EN, Daanzu EN, Russian, German, French). Some models, such as Indian English, cannot be updated yet because we have not shared all the necessary files.
To update the graph you need to do the following:
For a more detailed walkthrough, see the full guide on Vosk model graph adaptation.
Adapting the acoustic model is also possible with about 1 hour of data. You can follow this issue for details.
Basically, you need to collect the data, put it in the Kaldi format, and then run a Kaldi script.
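To illustrate the "put it in the Kaldi format" step, here is a minimal sketch that writes the three core files of a Kaldi data directory: `wav.scp`, `text` and `utt2spk`. The tuple layout and function names are assumptions made for this example, and the finetuning script itself is not shown. Utterance ids are assumed to be unique ASCII strings, conventionally prefixed with the speaker id.

```python
import os


def kaldi_data_lines(utterances):
    """Build the lines of wav.scp, text and utt2spk.

    `utterances` is an iterable of (utt_id, speaker_id, wav_path, transcript).
    """
    files = {"wav.scp": [], "text": [], "utt2spk": []}
    # Kaldi expects every file to be sorted by utterance id.
    for utt_id, speaker, wav_path, transcript in sorted(utterances, key=lambda u: u[0]):
        files["wav.scp"].append(f"{utt_id} {wav_path}")
        files["text"].append(f"{utt_id} {transcript}")
        files["utt2spk"].append(f"{utt_id} {speaker}")
    return files


def write_kaldi_data_dir(out_dir, utterances):
    """Write wav.scp, text and utt2spk into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for name, lines in kaldi_data_lines(utterances).items():
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")
    # spk2utt is usually derived afterwards with Kaldi's
    # utils/utt2spk_to_spk2utt.pl rather than written by hand.
```

The transcripts should be normalized in the same way as the language model text, so that every word matches the model vocabulary.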
More detailed documentation of finetuning would be helpful, but we do not have it yet. The corresponding request is tracked in a vosk-api issue.