New models with audio support on the block:
Introducing frontier open source speech understanding models.
mistral.ai
Basically, they added ASR support on top of their existing 3B and 24B models and got really nice results with both. Having a built-in LLM is awesome as well, depending on your application.
I'm personally more interested in the STT part, not really the LLM itself, so I'll be giving the 3B model a run this weekend and comparing it to my current WhisperX setup. If someone comes up with nice quants for the 24B version, I may end up giving it a go as well.
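For anyone wanting to run a similar comparison, here is a minimal sketch of word error rate (WER), the usual metric for comparing STT output against a reference transcript. It is plain Python and does not call either model: the function name `word_error_rate` and the sample strings are made up for illustration, and the normalization (lowercasing and whitespace splitting) is a simplifying assumption, so punctuation differences will count as errors unless you strip them first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split is enough normalization here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # dropping all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, not real model output.
    reference = "the cat sat on the mat"
    model_a = "the cat sat on the mat"
    model_b = "the cat sit on mat"
    print("model A WER:", word_error_rate(reference, model_a))
    print("model B WER:", word_error_rate(reference, model_b))
```

Feed it the same hand-checked reference transcript for both models and the lower number wins. For anything serious, a dedicated library with proper text normalization is probably a better choice than this sketch.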