Mistral introduces Voxtral, a family of open-source speech recognition and understanding models. It’s about time. We haven’t seen a comparable open-source model since OpenAI’s Whisper, and that was quite a while ago.
The models come in 3B and 24B sizes and outperform Whisper on most benchmarks. However, they require more powerful hardware: the largest Whisper variant is just 1.5B. The larger size is a direct consequence of Voxtral also being a regular language model. Another consequence is that the models are likely to be harder to control in a pure transcription setting.
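To put the hardware gap in rough numbers, here is a back-of-the-envelope sketch of weights-only memory. It assumes 16-bit weights (2 bytes per parameter) and uses the nominal parameter counts quoted above; activations, caches, and framework overhead are ignored, so real requirements will be higher. The function name and labels are illustrative only.

```python
# Rough, weights-only memory estimate.
# Assumption: 16-bit (fp16/bf16) weights, i.e. 2 bytes per parameter.
# Activations, KV cache, and framework overhead are NOT included.
BYTES_PER_PARAM = 2

# Nominal parameter counts in billions, as quoted in the text.
MODEL_SIZES_B = {
    "Whisper (largest)": 1.5,
    "Voxtral 3B": 3.0,
    "Voxtral 24B": 24.0,
}


def weight_memory_gb(params_billions: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    # billions of params * bytes per param = billions of bytes = GB
    return params_billions * bytes_per_param


if __name__ == "__main__":
    for name, size in MODEL_SIZES_B.items():
        print(f"{name}: ~{weight_memory_gb(size):.0f} GB of weights")
```

Under these assumptions the largest Whisper needs about 3 GB for weights, the 3B model about 6 GB, and the 24B model about 48 GB, which puts the latter beyond a single consumer GPU unless it is quantized.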
The models are available on Hugging Face, through the Mistral API, and in Mistral's Le Chat.
They also currently lack diarization support (identifying who spoke when). It's on the roadmap, but in the meantime we still have to use the somewhat clunky pyannote-audio for this purpose.