Is finetuning still worth it as models advance so fast? Any real-world use cases?
For example, Bloomberg trained a GPT-3.5 class LLM on their financial data last year and soon after GPT-4-8k outperformed it on nearly all finance tasks.
We ended up focusing on having high-quality eval data and an architecture that makes switching to new models easy.
I have non-English human data annotated in a format that was designed for a very specific health-related study. LLMs have never seen these annotations, non-English LLMs are not a top priority for companies, and we can only use offline-first ones for data privacy reasons.
In this scenario fine-tuning a general purpose LM works wonders.
What is your strategy for this? Do you finetune when a new flagship model is made available? You said local first, so I'm guessing you might have finetuned llama. But there are llama fine-tunes available which have better performance than the base model. How do you choose?
Our strategy is to take a well-known, battle-tested model as a base, train it, and then hopefully one day release the fine-tuned model on HuggingFace.
Other than that, fine-tunes don't really matter for us because not many people are rushing to beat the top models on (say) Georgian POS tagging or Urdu sentiment analysis.
As long as the model can turn language into a reasonable vector, we're happy with it.
Fine tuning can be useful if you need to generate lots of output in a particular format. You can fine-tune on formatted messages, and then the model will produce that format automatically. That could save a bunch of tokens explaining the output format in every prompt.
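To make that concrete, here's a minimal sketch of what such training data might look like (the chat-style JSONL below mirrors the format several fine-tuning APIs accept; the content and file name are made-up examples):

```python
import json

# Hypothetical training examples: the user message is bare, and the assistant
# reply is already in the target format, so after fine-tuning you no longer
# need to spell out the format in every prompt.
examples = [
    {
        "messages": [
            {"role": "user", "content": "Extract: order arrived damaged, refund requested."},
            {"role": "assistant", "content": '{"issue": "damaged item", "action": "refund"}'},
        ]
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```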
Does this Python package control the LLMs using something other than text? Or is the end result still that the Python package wraps your prompt with additional instructions that become part of the prompt itself?
Looks like it actually changes how you do token generation to conform to a given context-free grammar. It's a way to structure how you sample from the model rather than a tweak to the prompt, so it's more efficient and guarantees that the output matches the formal grammar.
The output of the LLM is not just one token, but a probability distribution over all possible next tokens. The tool you use to generate output samples from this distribution with various techniques, and you can put constraints on it, like not being too repetitive. Some tools support getting very specific about the allowed output format, e.g. https://github.com/ggerganov/llama.cpp/blob/master/grammars/... So even if the LLM says that an invalid token is the most likely next token, the tool will never select it for output. It will only sample from valid tokens.
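If you go the llama.cpp route, a minimal sketch of grammar-constrained sampling might look like this (using the llama-cpp-python bindings as I understand them; the model path and the toy grammar are placeholders):

```python
from llama_cpp import Llama, LlamaGrammar

# Toy GBNF grammar: the output must be a JSON object with a single "answer" string.
GBNF = r'''
root   ::= "{" ws "\"answer\"" ws ":" ws string ws "}"
string ::= "\"" [^"]* "\""
ws     ::= [ \t\n]*
'''

llm = Llama(model_path="model.gguf")  # placeholder path to any GGUF model
grammar = LlamaGrammar.from_string(GBNF)

# During sampling, tokens that would violate the grammar are masked out,
# so the completion is guaranteed to match the format.
out = llm("Answer in JSON: what is the capital of France?", grammar=grammar, max_tokens=64)
print(out["choices"][0]["text"])
```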
I still haven't seen really convincing evidence that fine tuning is useful for the internal corporate data use-case (as opposed to RAG which seems to work really well.)
Finetuning would never add any new knowledge. For internal corporate data, use RAG or train a model from scratch. Finetuning would not help anyone answer a question from that data.
Will share a longer post on this once I finish it. I have tested this multiple times, on bigger models, on custom smaller models, and it does not work.
In a strict sense, finetuning can add new knowledge, but for that you need millions of tokens and multiple full training runs without LoRA or PEFT. For practical purposes, it does not.
I get the sense that you want to do the anti-RAG: take some relatively small corpus of data, train a LoRA, and then magically have a chatbot that knows your stuff... yeah, that will not work.
But chatbots are only a single use case. And broadly I think this pattern of LLM-as-store-of-knowledge is a bad one (of course, until ASI, and then it isn't).
That said, you absolutely can impart new knowledge through fine-tuning. Millions of tokens is a rather small hurdle to overcome. And if you're not retraining with original/general data, then your model will become very specialized and possibly overfit...which is not an issue in many instances, and may even be desirable.
On traditional NLP tasks like POS tagging and feature tagging, LLMs fall far below dedicated NLP pipelines. However, fine tuning bridges the gap between the two quite a bit.
It's a narrow domain, but so is most of programming. I think if you're just training a general purpose LLM to be more inclined towards your data -- no, fine tuning is probably not very relevant. But if you're trying to solve a very specific yet fuzzy problem, and LLMs can get you _part_ of the way there, fine tuning is likely your best bet.
Not the OP, but you can take a look at https://spacy.io/usage/spacy-101 to get a sense of what traditional NLP tasks look like. These tasks can be done much faster than with LLMs when you use appropriate tooling (such as spaCy), and they don't risk hallucination.
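For example, POS tagging with spaCy is only a few lines (assuming you've downloaded the small English pipeline):

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Bloomberg trained a large language model on financial data.")

# Fast, deterministic, and no hallucination risk: each token gets a POS tag.
for token in doc:
    print(token.text, token.pos_, token.tag_)
```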
In this comment I was referring to POS tagging and feature extraction.
Another use case we have for fine tuning is to reduce a 5-shot prompt that we have to run hundreds of times per request down to a “0-shot” prompt (heavy emphasis on the double quotes).
Run five-shot on gpt-4o a couple thousand times, then fine tune Cohere's Command R, Haiku, Llama 3 8B, or whichever small-but-mighty LLM on those outputs. You can reduce costs by 99%, or somewhere in that ballpark, without really sacrificing quality on 99% of the queries.
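Roughly, the distillation loop might look like the sketch below (the model names, the query list, and the file name are placeholders, not anyone's actual pipeline):

```python
import json
from openai import OpenAI  # assumes the official openai Python client, v1+

client = OpenAI()
FEW_SHOT_PREFIX = "..."  # the expensive 5-shot examples, elided here

def label_with_big_model(query: str) -> str:
    """Run the full 5-shot prompt on the large model once, offline."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": FEW_SHOT_PREFIX + query}],
    )
    return resp.choices[0].message.content

# In practice this would be a sample of real traffic; placeholders here.
queries = ["example query 1", "example query 2"]

# Build a distillation set: store only the bare query plus the big model's answer,
# then fine-tune a small model on these pairs so it behaves "0-shot" at serving time.
with open("distill.jsonl", "w") as f:
    for query in queries:
        record = {"prompt": query, "completion": label_with_big_model(query)}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```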
Function calling might be one reason. If your app has a lot of custom functions for interacting with tools, fine tuning may be preferred over spending context tokens on the tool definitions in every request.
Can you recommend a tutorial/document about fine tuning a small model for selecting the correct functions? It would be great to have something local that can replace the OpenAI functions/tools API.
My sense is that finetuning might still have a role if your use case is high-volume and has a narrow, specific goal. For example, we have GPT-based summaries of our customer contact calls. It's fairly high volume (a typical bank will handle millions of calls a year).
We're considering finetuning for 2 reasons:
(1) the current system prompt with instructions is getting quite large* and we see GPT struggling more to stick to the instructions. We could finetune with our historic summaries and a simplified prompt to slightly improve the quality of the summaries / lower the token count (performance). The idea would be to then continue improving our system prompt again from that new starting point. Exploration still to be done, though.
(2) we might be able to finetune a much smaller model to do what the larger model is currently doing (cost, sustainability)
* The instruction prompt is large because there are lots of specific needs for the summaries: e.g. how to write the summary, what to exclude (health, etc.), what to include (actions taken). We also give examples of good vs. bad summaries to improve performance.
Fine tuning is generally not worth it over RAG with newer models unless you have a niche data set that's very different from the pretraining data. LoRAs/control vectors are also generally better than full fine tuning, as they're less prone to inducing forgetting or hallucinations, and they're usually "good enough" when combined with RAG.
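For reference, a LoRA setup with Hugging Face PEFT is only a few lines (a sketch; the base model name and hyperparameters are placeholders, and target_modules needs to match the attention layer names of whatever model you pick):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model; any causal LM from the Hub works the same way.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Only the low-rank adapter weights are trained, which is why LoRA tends to
# disturb the base model's existing knowledge less than full fine-tuning.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()
```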
I use a programming language that isn’t publicly available and it would be useful to have a model understand it. AI isn’t something I’m super knowledgeable about but fine-tuning sounds like a good way to accomplish that.
Quality suffers at very large contexts, and you might want to use your large context for something else (e.g. a long current conversation or a lot of recent data).