
Is finetuning still worth it as models advance so fast? Any real-world use cases?

For example, Bloomberg trained a GPT-3.5-class LLM on their financial data last year, and soon after, GPT-4-8k outperformed it on nearly all finance tasks.

We ended up focusing on having high-quality eval data and an architecture that makes switching to new models easy.



Yes.

I have non-English human data annotated in a format that was designed for a very specific health-related study. LLMs have never seen these annotations, non-English LLMs are not a top priority for companies, and we can only use offline-first ones for data privacy reasons.

In this scenario fine-tuning a general purpose LM works wonders.


What is your strategy for this? Do you finetune when a new flagship model is made available? You said local first, so I'm guessing you might have finetuned llama. But there are llama fine-tunes available which have better performance than the base model. How do you choose?


Our strategy is to take a well-known, battle-tested model as a base, train it, and then hopefully one day release the fine-tuned model on HuggingFace.

Other than that, fine-tunes don't really matter for us because not many people are rushing to beat the top models on (say) Georgian POS tagging or Urdu sentiment analysis.

As long as the model can turn language into a reasonable vector, we're happy with it.


Fine tuning can be useful if you need to generate lots of output in a particular format. You can fine-tune on formatted messages, and then the model will generate that automatically. That could save a bunch of tokens explaining the output format in every prompt.
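
For illustration, a minimal sketch of what such fine-tuning data might look like in the OpenAI-style chat JSONL format (the task and field values here are invented):

  # Each training example pairs a bare user message with an assistant
  # reply already in the target format, so the deployed model emits
  # that format without per-prompt formatting instructions.
  import json

  examples = [
      {"messages": [
          {"role": "user", "content": "Summarize: ACME Q3 revenue rose 12%..."},
          {"role": "assistant", "content": '{"summary": "Revenue up 12%", "sentiment": "positive"}'},
      ]},
      # ...hundreds more like this...
  ]

  with open("train.jsonl", "w") as f:
      for ex in examples:
          f.write(json.dumps(ex) + "\n")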


Instead of fiddling with the prompt, which is unreliable, you can use structured generation: https://github.com/outlines-dev/outlines
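
Roughly like this, going by the project's README at the time (the API may have changed since):

  import outlines

  # Load a local model and constrain generation to one of two labels;
  # the sampler can only ever produce these exact strings.
  model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
  generator = outlines.generate.choice(model, ["Positive", "Negative"])
  answer = generator("Review: the pizza arrived cold. Sentiment:")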


Does this Python package control the LLM using something other than text? Or is the end result still that the Python package wraps your prompt with additional text containing additional instructions that become part of the prompt itself?


Looks like it actually changes how you do token generation to conform to a given context-free grammar. It's a way to structure how you sample from the model rather than a tweak to the prompt, so it's more efficient and guarantees that the output matches the formal grammar.

There's a reference to the paper that describes the method at the bottom of the README: https://arxiv.org/pdf/2307.09702


The output of the LLM is not just one token, but a statistical distribution across all possible output tokens. The tool you use to generate output will sample from this distribution with various techniques, and you can put constraints on it like not being too repetitive. Some of them support getting very specific about the allowed output format, e.g. https://github.com/ggerganov/llama.cpp/blob/master/grammars/... So even if the LLM says that an invalid token is the most likely next token, the tool will never select it for output. It will only sample from valid tokens.
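
A toy, library-agnostic sketch of the mechanism (the vocabulary and "grammar" here are invented for illustration):

  import numpy as np

  vocab = ["{", '"name"', ":", '"Ann"', "}", "hello"]

  def allowed(generated):
      # Stand-in for a real grammar: a fixed {"name": ...} JSON shape.
      last = generated[-1] if generated else None
      return {None: {"{"}, "{": {'"name"'}, '"name"': {":"},
              ":": {'"Ann"'}, '"Ann"': {"}"}}.get(last, {"}"})

  def sample(logits, generated):
      mask = np.array([t in allowed(generated) for t in vocab])
      probs = np.exp(logits - logits.max()) * mask  # zero invalid tokens
      probs /= probs.sum()                          # renormalize over valid ones
      return vocab[np.random.choice(len(vocab), p=probs)]

  out = []
  for _ in range(5):
      fake_logits = np.random.randn(len(vocab))  # stands in for the model
      out.append(sample(fake_logits, out))
  print("".join(out))  # always {"name":"Ann"}, whatever the logits say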


No, it limits which tokens the LLM can output. The output is guaranteed to follow the schema.


>Is finetuning still worth it as models advance so fast? Any real-world use cases?

Internal corporate data GPT4 was never exposed to?


I still haven't seen really convincing evidence that fine tuning is useful for the internal corporate data use-case (as opposed to RAG which seems to work really well.)


Finetuning would never add any new knowledge. For internal corporate data, use RAG or train a model from scratch. Finetuning would not help anyone answer a question from that data.


This isn’t true at all. Fine tune with a single sample and see what happens to your model.


Will share a longer post on this once I finish it. I have tested this multiple times, on bigger models, on custom smaller models, and it does not work.

In a strict sense, finetuning can add new knowledge, but for that you need millions of tokens and multiple runs without using LoRA or PEFT. For practical purposes, it does not.


I get the sense that you want to do the anti-RAG: take some relatively small corpus of data, train a lora, and then magically have a chatbot that knows your stuff...yeah, that will not work.

But chatbots are only one single use-case. And broadly I think this pattern of LLM-as-store-of-knowledge is a bad one (of course, until ASI, and then it isn't).

That said, you absolutely can impart new knowledge through fine-tuning. Millions of tokens is a rather small hurdle to overcome. And if you're not retraining with original/general data, then your model will become very specialized and possibly overfit...which is not an issue in many instances, and may even be desirable.


Bring on the new post and more details, this stuff is interesting.


PEFT is not for adding knowledge; that's obvious


How does one do RAG? I see it mentioned like 20 times


This article reviews some of the most advanced RAG methods: https://medium.com/@krtarunsingh/advanced-rag-techniques-unl...

(not mine)


You supply the LLM with the subset of your data that's relevant to the specific prompt. That's RAG.
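
In sketch form (retrieval here is toy word overlap; a real system would use an embedding model and a vector store, and call_llm is a hypothetical stand-in for whatever LLM API you use):

  docs = [
      "Refunds are processed within 14 days of the return request.",
      "The Berlin office opens at 9am on weekdays.",
      "Premium accounts include priority support and API access.",
  ]

  def retrieve(query, k=1):
      # Rank documents by words shared with the query (toy retrieval).
      q = set(query.lower().split())
      return sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)[:k]

  def answer(query):
      context = "\n".join(retrieve(query))
      prompt = f"Answer using only this context:\n{context}\n\nQ: {query}"
      return call_llm(prompt)  # hypothetical: your LLM call goes here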


When it comes to traditional NLP tasks like POS tagging and feature tagging, LLMs are far below dedicated NLP pipelines. However, fine tuning bridges the gap between the two quite a bit.

It's a narrow domain, but so is most of programming. I think if you're just training a general purpose LLM to be more inclined towards your data -- no, fine tuning is probably not very relevant. But if you're trying to solve a very specific yet fuzzy problem, and LLMs can get you _part_ of the way there, fine tuning is likely your best bet.


Can you share a bit more about which tasks you're discussing?


Not the OP, but you can take a look at https://spacy.io/usage/spacy-101 to get a sense of what traditional NLP tasks look like. These tasks can be done much faster than with LLMs given appropriate tooling (such as spaCy), and they don't risk hallucination.
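
For instance, POS tagging with spaCy is a few deterministic lines (requires: python -m spacy download en_core_web_sm):

  import spacy

  nlp = spacy.load("en_core_web_sm")
  doc = nlp("Apple is opening a new office in Berlin next year.")
  for token in doc:
      print(token.text, token.pos_, token.tag_)  # part-of-speech tags
  for ent in doc.ents:
      print(ent.text, ent.label_)                # entities: Apple/ORG, Berlin/GPE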


I’ve tried PII anonymization with standard NLP approaches and LLMs have been (way) better at this task in my experience.


Also you can do it at scale with SparkNLP: https://sparknlp.org

Many of the use cases I've seen for LLMs would actually be better with NLP.


Which kinds of use cases?


In this comment I was referring to POS tagging and feature extraction.

Another use case we have for fine tuning is reducing a 5-shot prompt that we have to run hundreds of times per request down to a "0-shot" (heavy emphasis on the double quotes).

Run the five-shot prompt on gpt-4o a couple thousand times, then fine-tune on cohere's command-r or haiku or llama3 8b or whichever small but mighty llm. You can reduce costs by 99%, or somewhere in that ballpark, without really sacrificing quality on 99% of the queries.
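
A rough sketch of that distillation loop (five_shot_prompt, call_gpt4o, and load_queries are hypothetical placeholders):

  import json

  # Collect the expensive teacher's 5-shot outputs as 0-shot training
  # data for a small model.
  with open("distill.jsonl", "w") as f:
      for query in load_queries():                       # placeholder data source
          output = call_gpt4o(five_shot_prompt + query)  # teacher, 5-shot
          f.write(json.dumps({"messages": [
              {"role": "user", "content": query},        # no examples here
              {"role": "assistant", "content": output},
          ]}) + "\n")

  # Fine-tune the small model on distill.jsonl; at inference time it
  # sees only the bare query, so you stop paying for five examples
  # on every call.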


Function calling might be one reason. If your app has a lot of custom functions for interacting with tools, fine tuning may be preferred over using context tokens.


Can you recommend a tutorial/document about fine tuning a small model for selecting the correct functions? Would be great to have something local that can replace the OpenAI functions/tools API.


The linked GitHub article above describes syntax for fine tuning Mistral for function calling.

Here is an example of data prepared for fine tuning Llama for function calling...

https://huggingface.co/datasets/mzbac/function-calling-llama...

I'm unaware of any comprehensive guides - we're still in the wild west.
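
For a sense of the shape, one training example looks roughly like this (illustrative only, not the exact format of that dataset; the tool name and schema are invented):

  example = {
      "messages": [
          {"role": "user", "content": "What's the weather in Oslo?"},
          # The target output is the structured call, not prose.
          {"role": "assistant",
           "tool_calls": [{"name": "get_weather",
                           "arguments": '{"city": "Oslo"}'}]},
      ],
      "tools": [{"name": "get_weather",
                 "parameters": {"city": {"type": "string"}}}],
  }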


My sense is that finetuning might still have a role if your use is high-volume and with a narrow/specific goal. For example, we have GPT based summaries of our customer contact calls. It's fairly high volume (a typical bank will handle millions of calls a year).

We're considering finetuning for 2 reasons:

(1) the current system prompt with instructions is getting quite large* and we increasingly see GPT struggle to stick to the instructions. We could finetune with our historic summaries and a simplified prompt to slightly improve the performance of the summaries / lower the token count (performance). The idea would be to then continue improving our system prompt again from that new starting point. Exploration still to be done though.

(2) we might be able to finetune a much smaller model to do what the larger model is currently doing (cost, sustainability)

* The instruction prompt is large because there are lots of specific needs for the summaries (e.g. how to write the summary, what to exclude (health, etc.), what to include (actions taken)), and we also give examples of good vs. bad summaries to improve performance.


I think the bitter lesson kicks in shortly after you complete fine tuning, when OpenAI releases their next model, which performs better at the same task.


I'd summarize it as prompting for input, finetuning for output.

RAG is a far better option for making a model work with your data/information.

Finetuning is better for making it output in a language, style, data format, programming language, etc.


Fine tuning is generally not worth it over RAG with newer models unless you have a niche data set that's very different from the pretraining data. LoRAs/control vectors are also generally better as they don't induce forgetting or hallucinations, and they're usually "good enough" with RAG.
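
For reference, a LoRA setup with the peft library looks roughly like this (model name and hyperparameters are illustrative):

  from transformers import AutoModelForCausalLM
  from peft import LoraConfig, get_peft_model

  base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
  config = LoraConfig(
      r=16,                                 # rank of the low-rank updates
      lora_alpha=32,                        # scaling factor
      target_modules=["q_proj", "v_proj"],  # attention projections only
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(base, config)
  model.print_trainable_parameters()  # typically well under 1% of weights

Because the base weights stay frozen and only the small adapter matrices train, forgetting is limited, which is part of why this combines well with RAG.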


I think their training methodology was suspect.

Also I think there are a bunch of important techniques that just aren't shared.


My startup has data the Internet doesn't have.

I want the model to fix its knowledge based on my data, and to be fine-tuned so it's already adjusted to my use case.

I also want a feedback loop, and I can't keep sending more and more context with the payload just because my feedback loop adds / finetunes data.


I use a programming language that isn’t publicly available and it would be useful to have a model understand it. AI isn’t something I’m super knowledgeable about but fine-tuning sounds like a good way to accomplish that.


Quality suffers at very large contexts, and you might want to use your large context for something else (e.g. a long current conversation or a lot of recent data).


Finance is a very mainstream, broad, english-language dominated field. Most people don't work in a field that is so generalist-LLM friendly.


No, fine tuning seems pointless right now. You can't even fine-tune the frontier models.


This seems to be the narrative Microsoft was pushing at the Build conference this week.


Weird, because they make one of the best small language models (phi series) which is great for finetuning.


Not all domains will receive enough representation in training datasets, due to interest and/or access.

In the current landscape, specialized, smaller systems will still be the most efficient way ahead in the near future.



