[flagged] AiOla open-sources ultra-fast ‘multi-head’ speech recognition model (aiola.com)
71 points by cheptsov on Aug 3, 2024 | 14 comments


Does it do speaker recognition / diarization? I can't see it from the repo README.


I haven't found a single good (working, easy to deploy cross-platform on CPU/CUDA/Apple Silicon) implementation of streaming + diarization, and I have looked at everything from WhisperX to pyannote to WhisperKit.

Any suggestions would be very welcome!



I'm curious which of the Whisper derivatives is actually the fastest?

faster-whisper claims a 4x speedup over base Whisper, and I've found WhisperX to be faster still (for longer audio, where it can do batched inference), at least on consumer GPUs.

So with AiOla saying "50% speedup", is that actually noteworthy?


From my understanding, faster-whisper optimizes inference without changing the model itself. Here they seem to be changing the model architecture but not applying the other optimizations.

50% on its own doesn’t make this the current best choice for production. But I imagine this could become the new base model that all of the inference optimizations are applied to.

Wonder if it's plug-and-play, or if faster-whisper and the others would need to reimplement it from scratch?
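If the two kinds of gains really are independent, they would stack multiplicatively. A rough back-of-the-envelope (the figures are just the claims quoted in this thread, not measurements):

```python
# Back-of-the-envelope: if an architecture change and inference-level
# optimizations are independent, their speedups multiply.
arch_speedup = 1.5   # AiOla's claimed "50% speedup"
infer_speedup = 4.0  # faster-whisper's claimed speedup over base Whisper

combined = arch_speedup * infer_speedup
print(f"{combined:.1f}x")  # 6.0x over base Whisper, if the gains stack
```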


Is this even faster? https://github.com/Vaibhavs10/insanely-fast-whisper

If so, is the quality still acceptable?


Depends what you mean by “fast”.

I’ve tested WhisperLive, it’s basically real-time (i.e. low latency).


Doesn't WhisperLive just use faster-whisper under the hood? Which can be way faster than real time.


There's a difference between speed & latency when it comes to performance.

faster-whisper can process audio faster than real time, but AFAIK vanilla Whisper needs an audio "frame" a few seconds long to do inference/STT.

WhisperLive fixes that, and reduces latency to a few tens or hundreds of ms.
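A toy calculation makes the speed-vs-latency distinction concrete (the frame sizes below are illustrative assumptions, not measured values): even at 10x-realtime throughput, latency is bounded below by how much audio must be buffered before each inference.

```python
# Toy model of throughput vs. latency (illustrative numbers, not
# benchmarks). Throughput is how fast audio is chewed through overall;
# latency is how long a spoken word waits before its transcript appears,
# dominated by how much audio is buffered per inference.

def min_latency_seconds(frame_seconds, realtime_factor):
    # Wait to fill the audio frame, then run inference on it.
    inference_time = frame_seconds / realtime_factor
    return frame_seconds + inference_time

# A Whisper setup needing a 5 s audio frame, running at 10x realtime:
print(min_latency_seconds(5.0, 10.0))            # 5.5 s to first text
# A streaming wrapper inferring on 0.2 s chunks at the same speed:
print(round(min_latency_seconds(0.2, 10.0), 2))  # 0.22 s
```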


What sort of performance are you needing?

For comparison, my Apple M1 MacBook (2021) can infer whisper-medium at roughly 10x realtime. It takes about 20 min to process three hours of audio.
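For reference, "Nx realtime" just means N seconds of audio transcribed per second of compute, so those two figures are consistent:

```python
# "10x realtime": each second of compute transcribes ~10 seconds of audio.
audio_minutes = 3 * 60   # three hours of audio
realtime_factor = 10

processing_minutes = audio_minutes / realtime_factor
print(processing_minutes)  # 18.0 minutes, consistent with "about 20 min"
```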


IIRC Whisper works on wave files. Can this do real time low latency continuous ASR?


Nothing of interest here, it's an ad.

If you're interested, you might as well check out Gladia, at least they have a pricing section and allow you to use it as a developer, unlike just asking you to "Request a Demo".

And while a sibling comment links to the GitHub repository, their entire website does not contain such a link.

---

Edit: My bad, for some reason I first checked the website instead of the blog post. Looks much more interesting now.



Looks like they left out all the training code, presumably for commercial reasons (though it only just came out, so it's conceivable they're still cleaning up that side of the code; I doubt it, though). Totally their call, given the effort they've put in, but a shame.



