Ultimately, there’s some intersection of accuracy x cost x speed that’s ideal, which can be different per use case. We’ll surface all of those metrics shortly so that you can pick the best model for the job along those axes.
We wanted to keep the focus on (1) foundation VLMs and (2) open source OCR models.
We had Mistral previously but had to remove it because their hosted OCR API was very unstable and unfortunately returned a lot of garbage results.
Paddle, Nanonets, and Chandra are being added shortly!
MistralOCR works stably for me when I first upload the file to their server and then run the OCR. I also had some issues before when giving a URL directly to the OCR API; not sure if that's what you're doing?
School transcripts are, surprisingly, one of the hardest document types to parse. What makes them tricky is (1) the multi-column tabular layouts and (2) the data ambiguity.
Transcript data usually lives in some sort of table, but these are some of the hardest tables for OCR or LLMs to interpret. There are all kinds of edge cases: tables split across pages, nested cells, side-by-side columns, etc. The tabular layout breaks every off-the-shelf OCR engine we've run across (and we've benchmarked all of them). To make it worse, there's no consistency at all: basically every school in the country has its own format.
Here's what we've seen help in these cases:
1. VLM-based review and correction of OCR errors in tables. OCR is still critical for determinism, but VLMs really excel at visually interpreting the long tail.
2. Using both HTML and Markdown as LLM input formats. For some of the edge cases, Markdown simply cannot represent certain structures (e.g. a table cell nested within a table cell). HTML is a much better representation for this, and models are trained on a lot of HTML data. A rough sketch of both ideas is below.
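To make that concrete, here's a minimal sketch of the idea (not our actual pipeline): hand the OCR'd table to a VLM as HTML alongside the page image, and ask it to fix what the OCR got wrong. The model name, prompt wording, and the `review_table` helper are just illustrative assumptions.

```python
# Sketch only: VLM review of an OCR'd table, passed as HTML so nested/merged
# cells survive. Model name and prompt are placeholders, not a real product config.
import base64
from openai import OpenAI

client = OpenAI()

def review_table(page_image_path: str, ocr_table_html: str) -> str:
    with open(page_image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # any capable VLM works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Here is a table extracted by OCR, as HTML:\n\n"
                    f"{ocr_table_html}\n\n"
                    "Compare it against the page image and return corrected HTML. "
                    "Preserve nested cells, merged cells, and column order exactly."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# An example of a structure Markdown tables can't express: a cell that
# contains its own table (common in transcript "term" blocks).
nested_table_html = """
<table>
  <tr><th>Term</th><th>Courses</th></tr>
  <tr>
    <td>Fall 2023</td>
    <td>
      <table>
        <tr><td>MATH 101</td><td>A</td><td>4.0</td></tr>
        <tr><td>ENG 102</td><td>B+</td><td>3.0</td></tr>
      </table>
    </td>
  </tr>
</table>
"""
```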
The data ambiguity is a whole set of problems on its own (e.g. how do you normalize what a "semester" is across all the different ways it can be written). Eval sets + automated prompt engineering can get you pretty far though.
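As a toy illustration of that ambiguity (purely hypothetical code, not how any particular product does it), even something as basic as the term label needs a normalization step before two transcripts can be compared. The `normalize_term` helper and alias table below are made up for the example.

```python
# Toy normalization: collapse the many surface forms of an academic term
# ("Fall 2023", "FA23", "Autumn Semester 2023") into one canonical value.
import re

TERM_ALIASES = {
    "fall": "FALL", "fa": "FALL", "autumn": "FALL",
    "spring": "SPRING", "sp": "SPRING",
    "summer": "SUMMER", "su": "SUMMER",
}

def normalize_term(raw: str) -> str | None:
    match = re.search(r"(fall|fa|autumn|spring|sp|summer|su)\D*?(\d{2,4})", raw, re.I)
    if not match:
        return None  # route to review / an eval set instead of guessing
    season = TERM_ALIASES[match.group(1).lower()]
    year = match.group(2)
    year = year if len(year) == 4 else "20" + year
    return f"{season} {year}"

assert normalize_term("FA23") == "FALL 2023"
assert normalize_term("Autumn Semester 2023") == "FALL 2023"
```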
Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.ai/).
Thanks! Datalab is great; I've met Vik a few times and their team has done some impressive work. We also support the conversion-to-markdown use case and might be a better fit depending on your needs. Feel free to create an account to try it out!
It's very dependent on the use case. That's why we offer a native evals experience in the product, so you can directly measure the % accuracy diffs between the two modes for your exact docs.
As a rule of thumb, light processing mode is great for (1) most classification tasks, (2) splitting on smaller docs, (3) extraction on simpler documents, or (4) latency sensitive use cases.
Exactly correct! We've had users migrate over from other providers because our granular pricing enabled new use cases that weren't feasible to do before.
One interesting thing we've learned: most production pipelines end up using a combination of the two (e.g. cheap classification and splitting, paired with performance extraction).
Feedback heard. Pricing is hard, and we've iterated on this multiple times so far.
Our goal is to provide customers with as much transparency & flexibility as possible. Our pricing has 2 axes:
- the complexity of the task
- performance processing vs cost-optimized processing
Complexity matters because, for example, classification is much easier than extraction, so it should be cheaper. That unlocks a wide range of use cases, such as tagging and filtering pipelines.
A performance toggle is also important because not all use cases are created equal. Just as it's valuable to be able to choose between cheaper and best-in-class foundation models, the same applies to document tasks.
For certain use cases, you might be willing to take a slight hit to accuracy in exchange for better costs and latency. To support this, we offer a "light" processing mode (with significantly lower prices) that uses smaller models, fewer VLMs, and more heuristics under the hood.
For other use cases, you simply want the highest accuracy possible. Our "performance" processing mode is a great fit for that, which enables layout models, signature detection, handwriting VLMs, and the most performant foundation models.
In fact, most pipelines we've seen in production end up combining the two (cheap classification and splitting, paired with performance extraction).
Without this level of granularity, we'd either be overcharging some customers or undercharging others. I definitely understand how this can be confusing, though; we'll work on making our docs better!