
> For all of the hype around LLMs, this general area (image generation and graphical assets) seems to me to be the big long-term winner of current-generation AI. It hits the sweet spot for the fundamental limitations of the methods:

I am biased (I work at Rev.com and Rev.ai), but I totally agree and would add one more thing: transcription. Accurate human transcription takes a really, really long time to do right - often a ratio of 3:1 to 10:1 of transcriptionist time to original audio length.

Though ASR is only ~90-95% accurate on much "average" audio, it is often 100% accurate on high-quality audio.
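(For context on figures like "90-95% accurate": ASR accuracy is conventionally reported as 1 minus the word error rate, i.e. the word-level edit distance between hypothesis and reference divided by the reference length. A minimal self-contained sketch - the function name is my own, not any particular library's API:)

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

So a transcript with one wrong word out of twenty is "95% accurate" in the usual sense, even though that one word can still change the meaning.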

It's not only a cost savings thing, but there are entire industries that are popping up around AI transcription that just weren't possible before with human speed and scale.



Also the other way around: text to speech. We're at the point where I can finally listen to computer generated voice for extended periods of time without fatigue.

There was a project mentioned here on HN where someone was creating audio book versions of content in the public domain that would never have been converted through the time and expense of human narrators because it wouldn't be economically feasible. That's a huge win for accessibility. Screen readers are also about to get dramatically better.


I’d add image to text - I use this all the time. For instance, I’ll take a photo of a board or device, and ChatGPT/Claude/pick your frontier multimodal model is almost always able to classify it accurately and describe details, including chipsets, pinouts, etc.


I tried using ChatGPT for some handwritten text I couldn't make out and it failed miserably, just made stuff up.

Tried it on a PDF and it didn't even read the PDF.

I'm sure we'll get there, but... it's a real shame it lies when it can't figure something out.


Are you using 4o?

First, lying requires agency and intent, which LLMs don’t have, so they can’t lie.

Yes, it makes stuff up when you put garbage in and uncritically consume the garbage. The key isn’t to look at it as an outsourcing of agency or an easy button, but as a tool that gets you started on stuff and a new way of interacting with computers. It also confidently asserts things that are untrue or subtly off base. To that extent, and in a very real sense, this is a very early preview of the technology - of a completely new computing technique that only reached bare-minimum usability in the last two years. Would you rather not have early access, or have to wait 20 years as accountants and product managers strangle it?

For OCR, I’m surprised anyone who has used it before would scan illegible handwriting in and expect to get anything but garbage out, with no indication that the garbage is semantically wrong. Frontier multimodal LLMs do an amazing job - compared to the state of the art a year ago. Do they do an amazing job compared to an ever-shifting goalpost? Have all the guard rails of a mature, 30-year-old software technique even been discovered yet? No. But I’ll tell you, the early days of HTTP were nothing like today. Was HTTP useless because it was so unreliable and flaky? No, it was amazing for those with the patience and the capacity to dream of building something truly remarkable at the time, like Google or Amazon or eBay.

The PDF issue you had is not expected. I upload PDFs all the time. For instance, when I’m working on something - like restringing some Hunter Douglas blinds in my house recently - I upload the instructions for the restring kit to a ChatGPT or Claude session, and it becomes something I can ask iteratively how to tackle what I’m working on as I hit challenging spots in the process. It’s not always right, and it confidently tells me subtly wrong things. But I pretty quickly realize what’s right and what isn’t as I work, and the problem is usually something ambiguous in the instructions that requires a lot more context on something very specific and likely not documented publicly anywhere. But 80% of the time my questions get answered as I work. That’s -amazing-: I can scan a paper instruction sheet into a computer and get step-by-step guidance that I can interactively interrogate using my voice as I work, and it literally understands everything I ask and gives me cogent if sometimes off answers. This is like literally the definition of the future I was promised.


> a project mentioned here on HN where someone was creating audio book versions of content in the public domain

Maybe this: https://news.ycombinator.com/item?id=40961385


That's the one! Thanks!


As an ex-Rev transcriber, I can think of the worst one I ever did.

It was a video for ESPN of an indoor motocross race, and the transcription was for the commentators. There were two fundamental problems:

1) The bike noise made the commentators almost inaudible

2) The commentators were using the [well-known to fans] nicknames of all the racers, and not their real names

I haven't used Rev for about three years, so I don't know how much better your auto-transcription system has gotten. I'd hope AI can solve #1, but #2 is a very hard problem to solve, simply because of the domain knowledge required. The nicknames were like Buttski McDumpleface etc and took a bunch of Googling to figure out.

I eventually got fired from Rev simply because the moderators haven't heard of the Oxford comma :p


I agree. I think it's more of a niche use-case than image models (and fundamentally harder to evaluate), but transcription and summarization are my current front-runner for winning use-case of LLMs.

That said, "hallucination" is more of a fundamental problem for this area than it is for imagery, which is why I still think imagery is the most interesting category.


Are there any models that can do diarization well yet?

I need one for a product, and the state of the art, e.g. pyannote, is so bad it's better not to use them.
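(Even with a mediocre diarization model, a lot of the perceived quality comes from post-processing: models tend to over-segment, splitting one speaker's turn into many short fragments. A common cleanup step is merging consecutive segments from the same speaker when the gap between them is short. A minimal, library-agnostic sketch - the `Segment` type and the 0.5s gap threshold are my own assumptions, not pyannote's API:)

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str   # label, e.g. "SPEAKER_00"

def merge_segments(segments: list[Segment], max_gap: float = 0.5) -> list[Segment]:
    """Merge consecutive same-speaker segments separated by less than max_gap seconds."""
    merged: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if (merged
                and merged[-1].speaker == seg.speaker
                and seg.start - merged[-1].end <= max_gap):
            # Extend the previous turn instead of starting a new one
            merged[-1] = Segment(merged[-1].start,
                                 max(merged[-1].end, seg.end),
                                 seg.speaker)
        else:
            merged.append(seg)
    return merged
```

Tuning that gap threshold (and a minimum-duration filter) per domain often matters as much as the choice of model.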


Deepgram has been pretty good for our product. Fast and fairly accurate for English.


Do they have a local model?

I keep getting burned by APIs with stupid restrictions that make use cases impossible that would be trivial if you could run the thing locally.


German public television switched to automatic transcriptions a few years back already.



