[flagged] AiOla open-sources ultra-fast ‘multi-head’ speech recognition model (aiola.com)
71 points by cheptsov on Aug 3, 2024 | 14 comments


Does it do speaker recognition / diarization? I can't see it from the repo README.


I haven't found a single good (working, easy to deploy cross-platform on CPU/CUDA/Apple Silicon) implementation of streaming + diarization, and I have looked at everything from WhisperX to pyannote to WhisperKit.

Any suggestions would be very welcome!



I'm curious which of the Whisper derivatives is actually the fastest?

faster-whisper claims a 4x speedup over base Whisper, and I've found WhisperX to be faster still (for longer audio, where it can do batched inference), at least on consumer GPUs.

So with AiOla saying "50% speedup", is that actually noteworthy?


From my understanding, faster-whisper optimizes inference without changing the model itself. Here they seem to be changing the model architecture but not applying the other optimizations.

50% on its own doesn’t make this the current best choice for production. But I imagine this could become the new base model that all of the inference optimizations are applied to.

Wonder if it's plug-and-play, or if faster-whisper and the others would need to reimplement it from scratch?
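If the two kinds of gains really are independent, they would stack multiplicatively. A rough back-of-the-envelope (the figures are just the claims quoted in this thread, not measurements):

```python
# Back-of-the-envelope: if an architecture change and inference-level
# optimizations are independent, their speedups multiply.
arch_speedup = 1.5   # AiOla's claimed "50% speedup"
infer_speedup = 4.0  # faster-whisper's claimed speedup over base Whisper

combined = arch_speedup * infer_speedup
print(f"{combined:.1f}x")  # 6.0x over base Whisper, if the gains stack
```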


Is this even faster? https://github.com/Vaibhavs10/insanely-fast-whisper

If so, is the quality still acceptable?


Depends what you mean by “fast”.

I’ve tested WhisperLive, it’s basically real-time (i.e. low latency).


Doesn't WhisperLive just use faster-whisper under the hood? Which can be way faster than real time.


There's a difference between speed & latency when it comes to performance.

faster-whisper can process audio faster than real time, but AFAIK vanilla Whisper needs an audio "frame" a few seconds long to do inference/STT.

WhisperLive fixes that, and reduces latency to a few tens or hundreds of ms.
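A toy calculation makes the speed-vs-latency distinction concrete (the frame sizes below are illustrative assumptions, not measured values): even at 10x-realtime throughput, latency is bounded below by how much audio must be buffered before each inference.

```python
# Toy model of throughput vs. latency (illustrative numbers, not
# benchmarks). Throughput is how fast audio is chewed through overall;
# latency is how long a spoken word waits before its transcript appears,
# dominated by how much audio is buffered per inference.

def min_latency_seconds(frame_seconds, realtime_factor):
    # Wait to fill the audio frame, then run inference on it.
    inference_time = frame_seconds / realtime_factor
    return frame_seconds + inference_time

# A Whisper setup needing a 5 s audio frame, running at 10x realtime:
print(min_latency_seconds(5.0, 10.0))            # 5.5 s to first text
# A streaming wrapper inferring on 0.2 s chunks at the same speed:
print(round(min_latency_seconds(0.2, 10.0), 2))  # 0.22 s
```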


What sort of performance are you needing?

For comparison, my Apple M1 MacBook (2021) can infer whisper-medium at roughly 10x realtime. It takes about 20 min to process three hours of audio.
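For reference, "Nx realtime" just means N seconds of audio transcribed per second of compute, so those two figures are consistent:

```python
# "10x realtime": each second of compute transcribes ~10 seconds of audio.
audio_minutes = 3 * 60   # three hours of audio
realtime_factor = 10

processing_minutes = audio_minutes / realtime_factor
print(processing_minutes)  # 18.0 minutes, consistent with "about 20 min"
```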


IIRC Whisper works on wave files. Can this do real time low latency continuous ASR?


Nothing of interest here, it's an ad.

If you're interested, you might as well check out Gladia, at least they have a pricing section and allow you to use it as a developer, unlike just asking you to "Request a Demo".

And while a sibling comment links to the GitHub repository, their entire website does not contain such a link.

---

Edit: My bad, for some reason I first checked the website instead of the blog post. Looks much more interesting now.



Looks like they left out all the training code, presumably for commercial reasons (though it only just came out, so it's conceivable they're still cleaning up that side of the code; I doubt it, though). Totally their call, given the effort they've put in, but a shame.



