Is finetuning still worth it as models advance so fast? Any real-world use cases?
For example, Bloomberg trained a GPT-3.5 class LLM on their financial data last year and soon after GPT-4-8k outperformed it on nearly all finance tasks.
We ended up focusing on having high-quality eval data and an architecture that makes switching to new models easy.
I have non-English human data annotated in a format that was designed for a very specific health-related study. LLMs have never seen these annotations, non-English LLMs are not a top priority for companies, and we can only use offline-first ones for data privacy reasons.
In this scenario fine-tuning a general purpose LM works wonders.
What is your strategy for this? Do you finetune when a new flagship model is made available? You said local first, so I'm guessing you might have finetuned llama. But there are llama fine-tunes available which have better performance than the base model. How do you choose?
Our strategy is to take a well-known, battle-tested model as a base, train it, and then hopefully one day release the fine-tuned model on HuggingFace.
Other than that, fine-tunes don't really matter for us because not many people are rushing to beat the top models on (say) Georgian POS tagging or Urdu sentiment analysis.
As long as the model can turn language into a reasonable vector, we're happy with it.
Fine tuning can be useful if you need to generate lots of output in a particular format. You can fine-tune on formatted messages, and then the model will produce that format automatically. That could save a bunch of tokens explaining the output format in every prompt.
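To make that concrete, here's a minimal sketch of what such training data might look like (the chat-style JSONL below mirrors the format several fine-tuning APIs accept; the content and file name are made-up examples):

```python
import json

# Hypothetical training examples: the user message is bare, and the assistant
# reply is already in the target format, so after fine-tuning you no longer
# need to spell out the format in every prompt.
examples = [
    {
        "messages": [
            {"role": "user", "content": "Extract: order arrived damaged, refund requested."},
            {"role": "assistant", "content": '{"issue": "damaged item", "action": "refund"}'},
        ]
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```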
Does this Python package control the LLMs using something other than text? Or is the end result still that the Python package wraps your prompt with additional instructions that become part of the prompt itself?
Looks like it actually changes how you do token generation to conform to a given context-free grammar. It's a way to structure how you sample from the model rather than a tweak to the prompt, so it's more efficient and guarantees that the output matches the formal grammar.
The output of the LLM is not just one token, but a probability distribution over all possible next tokens. The tool you use to generate output samples from this distribution with various techniques, and you can put constraints on it, like not being too repetitive. Some tools support getting very specific about the allowed output format, e.g. https://github.com/ggerganov/llama.cpp/blob/master/grammars/... So even if the LLM says that an invalid token is the most likely next token, the tool will never select it for output. It will only sample from valid tokens.
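If you go the llama.cpp route, a minimal sketch of grammar-constrained sampling might look like this (using the llama-cpp-python bindings as I understand them; the model path and the toy grammar are placeholders):

```python
from llama_cpp import Llama, LlamaGrammar

# Toy GBNF grammar: the output must be a JSON object with a single "answer" string.
GBNF = r'''
root   ::= "{" ws "\"answer\"" ws ":" ws string ws "}"
string ::= "\"" [^"]* "\""
ws     ::= [ \t\n]*
'''

llm = Llama(model_path="model.gguf")  # placeholder path to any GGUF model
grammar = LlamaGrammar.from_string(GBNF)

# During sampling, tokens that would violate the grammar are masked out,
# so the completion is guaranteed to match the format.
out = llm("Answer in JSON: what is the capital of France?", grammar=grammar, max_tokens=64)
print(out["choices"][0]["text"])
```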
I still haven't seen really convincing evidence that fine tuning is useful for the internal corporate data use-case (as opposed to RAG which seems to work really well.)
Finetuning would never add any new knowledge. For internal corporate data, use RAG or train a model from scratch. Finetuning would not help anyone answer a question from that data.
Will share a longer post on this once I finish it. I have tested this multiple times, on bigger models, on custom smaller models, and it does not work.
In a strict sense, finetuning can add new knowledge, but for that you need millions of tokens and multiple full training runs without LoRA or PEFT. For practical purposes, it does not.
I get the sense that you want to do the anti-RAG: take some relatively small corpus of data, train a LoRA, and then magically have a chatbot that knows your stuff... yeah, that will not work.
But chatbots are only a single use case. And broadly I think this pattern of LLM-as-store-of-knowledge is a bad one (of course, until ASI, and then it isn't).
That said, you absolutely can impart new knowledge through fine-tuning. Millions of tokens is a rather small hurdle to overcome. And if you're not retraining with original/general data, then your model will become very specialized and possibly overfit...which is not an issue in many instances, and may even be desirable.
On traditional NLP tasks like POS tagging and feature tagging, LLMs fall far below dedicated NLP pipelines. However, fine tuning bridges the gap between the two quite a bit.
It's a narrow domain, but so is most of programming. I think if you're just training a general purpose LLM to be more inclined towards your data -- no, fine tuning is probably not very relevant. But if you're trying to solve a very specific yet fuzzy problem, and LLMs can get you _part_ of the way there, fine tuning is likely your best bet.
Not the OP, but you can take a look at https://spacy.io/usage/spacy-101 to get a sense of what traditional NLP tasks look like. These tasks can be done much faster than with LLMs when you use appropriate tooling (such as spaCy), and they don't risk hallucination.
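For example, POS tagging with spaCy is only a few lines (assuming you've downloaded the small English pipeline):

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Bloomberg trained a large language model on financial data.")

# Fast, deterministic, and no hallucination risk: each token gets a POS tag.
for token in doc:
    print(token.text, token.pos_, token.tag_)
```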
In this comment I was referring to POS tagging and feature extraction.
Another use case we have for fine tuning is to reduce a 5-shot prompt that we have to run hundreds of times per request down to a “0-shot” prompt (heavy emphasis on the double quotes).
Run five-shot on gpt-4o a couple thousand times, then fine tune Cohere's Command R, Haiku, Llama 3 8B, or whichever small-but-mighty LLM on those outputs. You can reduce costs by 99%, or somewhere in that ballpark, without really sacrificing quality on 99% of the queries.
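Roughly, the distillation loop might look like the sketch below (the model names, the query list, and the file name are placeholders, not anyone's actual pipeline):

```python
import json
from openai import OpenAI  # assumes the official openai Python client, v1+

client = OpenAI()
FEW_SHOT_PREFIX = "..."  # the expensive 5-shot examples, elided here

def label_with_big_model(query: str) -> str:
    """Run the full 5-shot prompt on the large model once, offline."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": FEW_SHOT_PREFIX + query}],
    )
    return resp.choices[0].message.content

# In practice this would be a sample of real traffic; placeholders here.
queries = ["example query 1", "example query 2"]

# Build a distillation set: store only the bare query plus the big model's answer,
# then fine-tune a small model on these pairs so it behaves "0-shot" at serving time.
with open("distill.jsonl", "w") as f:
    for query in queries:
        record = {"prompt": query, "completion": label_with_big_model(query)}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```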
Function calling might be one reason. If your app has a lot of custom functions for interacting with tools, fine tuning may be preferred over spending context tokens on the tool definitions in every request.
Can you recommend a tutorial/document about fine tuning a small model for selecting the correct functions? It would be great to have something local that can replace the OpenAI functions/tools API.
My sense is that finetuning might still have a role if your use case is high-volume and has a narrow, specific goal. For example, we have GPT-based summaries of our customer contact calls. It's fairly high volume (a typical bank will handle millions of calls a year).
We're considering finetuning for 2 reasons:
(1) the current system prompt with instructions is getting quite large* and we see GPT struggling more to stick to the instructions. We could finetune with our historic summaries and a simplified prompt to slightly improve the quality of the summaries / lower the token count (performance). The idea would be to then continue improving our system prompt again from that new starting point. Exploration still to be done, though.
(2) we might be able to finetune a much smaller model to do what the larger model is currently doing (cost, sustainability)
* The instruction prompt is large because there are lots of specific needs for the summaries: e.g. how to write the summary, what to exclude (health, etc.), what to include (actions taken). We also give examples of good vs. bad summaries to improve performance.
Fine tuning is generally not worth it over RAG with newer models unless you have a niche data set that's very different from the pretraining data. LoRAs/control vectors are also generally better than full fine tuning, as they're less prone to inducing forgetting or hallucinations, and they're usually "good enough" when combined with RAG.
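For reference, a LoRA setup with Hugging Face PEFT is only a few lines (a sketch; the base model name and hyperparameters are placeholders, and target_modules needs to match the attention layer names of whatever model you pick):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model; any causal LM from the Hub works the same way.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Only the low-rank adapter weights are trained, which is why LoRA tends to
# disturb the base model's existing knowledge less than full fine-tuning.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()
```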
I use a programming language that isn’t publicly available and it would be useful to have a model understand it. AI isn’t something I’m super knowledgeable about but fine-tuning sounds like a good way to accomplish that.
Quality suffers at very large contexts, and you might want to use your large context for something else (e.g. a long current conversation or a lot of recent data).