I haven't found a single good (working, easy to deploy cross-platform on CPU/CUDA/Apple Silicon) implementation of streaming + diarization, and I have looked at everything from WhisperX to pyannote to WhisperKit.
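For reference, the usual offline combo looks something like the sketch below (file name and HF token are placeholders); note that everything in it is batch, nothing streams, which is exactly the gap:

```python
# Offline (non-streaming) transcription + diarization via WhisperX,
# which wraps pyannote for the speaker step. "sample.wav" and the
# HF token are placeholders; this is a sketch, not a drop-in solution.
import whisperx

device = "cuda"  # the CPU/Apple Silicon story is exactly the pain point
audio = whisperx.load_audio("sample.wav")

# 1. Transcribe in batches.
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Align for word-level timestamps (needed to attribute words to speakers).
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize with pyannote and merge speaker labels into the transcript.
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

print(result["segments"][0])  # each segment now carries a "speaker" key
```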
I'm curious which of the Whisper derivatives is actually the fastest?
faster-whisper claims a 4x speedup over base Whisper, and I've found WhisperX to be faster still (for longer audio, where it can do batch inference), at least on consumer GPUs.
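For concreteness, this is the kind of head-to-head I mean (my own sketch; the model names, file, and batch_size are arbitrary, and the numbers will vary by GPU):

```python
# Informal timing of faster-whisper (sequential decoding on the CTranslate2
# runtime) vs WhisperX (VAD-chunked batched inference) on the same file.
import time

import whisperx
from faster_whisper import WhisperModel

AUDIO = "long_audio.wav"  # placeholder; batching only pays off on long audio

fw = WhisperModel("large-v3", device="cuda", compute_type="float16")
t0 = time.time()
segments, _ = fw.transcribe(AUDIO)
_ = [s.text for s in segments]  # generator: decoding happens as it's consumed
print(f"faster-whisper: {time.time() - t0:.1f}s")

wx = whisperx.load_model("large-v2", "cuda", compute_type="float16")
audio = whisperx.load_audio(AUDIO)
t0 = time.time()
wx.transcribe(audio, batch_size=16)
print(f"WhisperX:       {time.time() - t0:.1f}s")
```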
So is AiOla's claimed "50% speedup" actually noteworthy?
From my understanding, faster-whisper optimizes inference without changing the model itself, whereas here they seem to be changing the model architecture without applying the other optimizations.
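To make that concrete (my sketch, not anything from AiOla's post): faster-whisper's gains come from running the stock OpenAI weights on the CTranslate2 runtime, via a conversion roughly like this (the output directory name is arbitrary):

```python
# faster-whisper doesn't retrain or modify Whisper: it converts the
# unchanged OpenAI checkpoint into CTranslate2's optimized runtime format.
import ctranslate2

converter = ctranslate2.converters.TransformersConverter(
    "openai/whisper-large-v3",       # stock weights, architecture untouched
    copy_files=["tokenizer.json"],   # ship the tokenizer with the weights
)
converter.convert("whisper-large-v3-ct2", quantization="float16")

# faster_whisper.WhisperModel("whisper-large-v3-ct2") loads this directory.
# A changed architecture would need converter/runtime support first, which
# is why it may not be plug-and-play for the existing optimized backends.
```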
A 50% speedup on its own doesn't make this the current best choice for production, but I imagine it could become the new base model that all of the inference optimizations are applied to.
I wonder if it's plug-and-play, or if faster-whisper and the others would need to reimplement it from scratch.
If you're interested, you might as well check out Gladia; at least they have a pricing section and let you use it as a developer, rather than just asking you to "Request a Demo".
And while a sibling comment links to the GitHub repository, there is no such link anywhere on their website.
---
Edit: My bad, for some reason I first checked the website instead of the blog post. Looks much more interesting now.
Looks like they left out all the training code, presumably for commercial reasons (it only just came out, so it's conceivable they're still cleaning up that side of the code, but I doubt it). Totally their call, given the effort they've put in; it's just a shame.