A bit of an unpopular opinion, it seems, but I would actually bet that current prompt engineering is just a short-term thing.
As the performance of LLMs continues to improve, I actually expect they will become much better at understanding not-so-well-formed prompts, especially when you take into consideration that they are now trained with RLHF on _real_ users' input.
So it will probably become less of an engineering problem and more a matter of articulating what exactly you want.
Learning to say what you want is a skill. Much like you can get better at searching, you can get better at saying what you want.
The framework described in the blog post seems like a more formal way to do it, but there are other ways to iterate in conversation. After seeing the first result, you can explain better what you want. If you're not expecting to repeat the query then maybe that's good enough?
I expect there will be better UIs that encourage iteration. Maybe you see a list of suggested prompts that are similar and decide which one you really want?
True, but “learning to figure out what someone wanted to say but wasn’t able to express” is also a skill, and I expect LLMs will be able to learn that pretty well.
Imagine you prompt ChatGPT4.5 and it doesn’t give you what you want. You click the thumbs down. ChatGPT says “Hold on, let me try again”. (Behind the scenes, OpenAI runs your prompt through their “prompt improver”, replaces your prompt with that prompt, and just shows you the output, no point showing the optimized prompt since it might be gibberish to a human). The new response actually is what you want, so you click thumbs up on the new output. That “thumbs down, thumbs up” pattern generates high quality labeled data for training the prompt improver, with very little cost.
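A rough sketch of what that labeling could look like behind the scenes (every name and field here is hypothetical, just to make the pattern concrete):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class PromptImprovementExample:
        """One (original, rewritten) prompt pair, labeled by the user's reaction."""
        original_prompt: str
        improved_prompt: str
        accepted: bool  # True if the rewritten prompt's output got a thumbs up

    def record_feedback(original_prompt: str, improved_prompt: str,
                        first_vote: str, second_vote: str) -> Optional[PromptImprovementExample]:
        # The "thumbs down, then thumbs up" pattern is the signal:
        # the original prompt failed, the automatically rewritten one succeeded.
        if first_vote == "down":
            return PromptImprovementExample(
                original_prompt=original_prompt,
                improved_prompt=improved_prompt,
                accepted=(second_vote == "up"),
            )
        return None  # the original prompt already worked; nothing to learn here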
Yes, "thumbs up, thumbs down" voting is a pretty good way to collect directionally unambiguous feedback that can easily be aggregated across many users to serve as training data.
But it's terrible interaction design for communicating what you want. Imagine trying to fulfill the 1001st most common user need and pressing "thumbs down" 1000 times until you finally get there.
In such cases, being able to construct a more specific prompt would save you a lot of time.
Remember when Google search actually found the words you were searching for, instead of mapping them to some quasi-common search keywords in a shallow pool of top-ranked websites?
Right now, most of these AI systems are no better than current Google search. Weird and ironic.
The next major leap in LLMs (in the next year) is probably going to be the prompt context size. Right now we have 2k, 4k, 8k ... but OpenAI also has a 32k model that they're not really giving access to unfortunately.
The 8k model is nice but it's GPT4 so it's slow.
I think the thing that you're missing is that zero-shot learning is VERY hard, but anything > GPT3 is actually pretty good once you give it some real-world examples.
I think prompt engineering is going to be here for a while just because, on a lot of tasks, examples are needed.
Doesn't mean it needs to be a herculean effort, of course. Just that you need to come up with some concrete examples.
This is going to be ESPECIALLY true with Open Source LLMs that aren't anywhere near as sophisticated as GPT4.
In fact, I think there's a huge opportunity to use GPT4 to train the prompts of smaller models, come up with more examples, and help improve their precision/recall without massive prompt engineering efforts.
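To make "give it some real-world examples" concrete, here's a minimal few-shot prompt sketch; the task, labels, and examples are all invented for illustration:

    # Build a few-shot prompt: a short instruction followed by labeled examples.
    examples = [
        ("The package arrived two weeks late and the box was crushed.", "negative"),
        ("Setup took five minutes and it worked on the first try.", "positive"),
        ("It does what it says, nothing more, nothing less.", "neutral"),
    ]

    def build_prompt(new_text: str) -> str:
        lines = ["Classify the sentiment of each review as positive, negative, or neutral.", ""]
        for review, label in examples:
            lines += [f"Review: {review}", f"Sentiment: {label}", ""]
        lines += [f"Review: {new_text}", "Sentiment:"]
        return "\n".join(lines)

    print(build_prompt("Battery life is great but the screen scratches easily."))

The same shape is how a bigger model could bootstrap a smaller one: have GPT4 fill in the "Sentiment:" lines for unlabeled inputs, then reuse those completions as the smaller model's few-shot examples.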
>> The next major leap in LLMs (in the next year) is probably going to be the prompt context size. Right now we have 2k, 4k, 8k ... but OpenAI also has a 32k model that they're not really giving access to unfortunately.
Saw this article today about a different approach that opens up orders of magnitude larger contexts
This is just a ToS violation, which will just result in loss of access to OpenAI. There is nothing they can do to stop you from commercially competing, given there is no copyright law precedent.
It can be argued that if you build a model using their outputs such that you can then stop using their API, your model is effectively competing with theirs.
Let’s just say that if you’re a startup or SMB, you do not want to be the one dragged to court to iron out whether this holds or not.
They're probably talking about the ToS a user would've had to agree to when using their services. It's actually a lot more permissive than I expected:
> Restrictions. You may not (i) use the Services in a way that infringes, misappropriates or violates any person’s rights; (ii) reverse assemble, reverse compile, decompile, translate or otherwise attempt to discover the source code or underlying components of models, algorithms, and systems of the Services (except to the extent such restrictions are contrary to applicable law); (iii) use output from the Services to develop models that compete with OpenAI;
> use output from the Services to develop models that compete with OpenAI
Sure, but the ways of acquiring those outputs legally have vampiric licensing that binds you to those ToS, since the re-licenser is bound by the original ToS.
It's like distributing GPL code in a nonfree application. Even if you didn't "consent to [the original author's] ToS," you are still going to be bound to it via the redistributor's license.
There’s no license. OpenAI is not an author of their models’ outputs, and they know it.
OpenAI can’t just start suing random people on the street without any legal basis. That’s how lawyers become not-lawyers.
There’s only a (somewhat dubiously enforceable) ToS contract between OpenAI and the user of OpenAI’s website. This is probably bullshit too (what legitimate interest does OpenAI have in a model output that doesn’t even belong to them?), but it’s less obviously bullshit.
> Even if you didn't "consent to [the original author's] ToS," you are still going to be bound to it via the redistributor's license.
In the context of the GPL, are there real examples of judgements which bind defendants to a license they never saw or knew anything about, because of the errant actions of an intermediary?
It gives OpenAI a legal basis to launch a lawsuit if they want to.
Would it succeed? Is it right? Do they care? Eh.
…but if I, as some random reddit user, say I might sue you for training an LLM on data that may or may not have my posts in it, you can probably safely ignore me.
If you go and build a massive LLM using high-quality data that couldn’t possibly have come from anywhere other than OpenAI, and they have a log of all the content that API key XXX generated, then they both know and have a legal basis for litigation.
There’s a difference, even if you’re a third party (not the owner of the api key) or don’t care.
(And I’m not saying they would, or even that they would win; but it’s sufficient cause for them to be able to make a case if they want to.)
Just being able to make a case doesn't mean they will consider the legal fees and resulting judgment to be valuable enough to their business, nor that the suit will even make it into the courts resulting in a final judgment.
A lot of behavior that rides this line is rationalized via a careful cost-benefit analysis.
Sure, I'm just saying that in that cost-benefit analysis the 'risk of the case failing and getting nothing from it' is significantly lower; it's your call as a consumer to do your due diligence and decide:
"I don't think they'll be bothered chasing after me"
vs.
"If it came to it I think the court will rule that they don't have a case after we play the pay-the-lawyers-all-the-money game"
vs.
"how screwed am I if they do, I lose and I have clearly, blatantly and provably violated their terms of service"
^
...because, and this is the point I'm making, there is no question: it is very, very obvious if you do this, and it's not very difficult for them to prove it.
All they need is to slap you with a discovery order, look at your training data, and compare it to their output logs.
I remember being a "good Google querier" before autocomplete rendered that mostly irrelevant. While I think you're right to some degree, you still have to articulate exactly what you want and need from this machine, and no amount of the LLM guessing at your intent will ever replace specifically and explicitly stating your needs and goals. I see a continuing relationship between the complexity of the task and the required complexity of the request.
They still "work" in a sense but it soft ignores them and will also guess at what else you might mean, polluting your results. Some things like site: still work as expected. Been this way for years now.
> They still "work" in a sense but it soft ignores them and will also guess at what else you might mean, polluting your results
Doesn't it usually ask the user if they want to Search instead for "X" and then give you the results? It's annoying when Google thinks it knows best when I use ", but it seems to work as it ever has after clicking the "Search instead" link.
Being able to compose a good query is still relevant I think! My peer once asked me for help with a mathematical problem, for which they could not find help online - after not much searching I could find a relevant page, given the same information/problem statement.
Google autocomplete using your query history also reduces the information you learn from suggestions as you do the searching...
While in the past "indexDB.set undefined in " might autocomplete to show safari first, indicating a vendor-specific bug, it'll often now prefill with some noun from whatever you last searched (e.g. "main window") to "help" you.
Haven't found a way to disable that; it's annoying for understanding bugs, situations/context, and root causes.
I can imagine the response including questions, e.g. “do you mean X or Y, here are the implications”. That way the machine gets better at helping us clarify our thinking. Right now it just blurts out a response that fits your instruction with no attempt to clarify.
And still _a lot of people_ cannot effectively use search engines. It is less about technical capabilities of the search engine, and more about (trained) skills related to finding and filtering information.
To talk to other humans, we literally have a whole field of writing, with courses that teach how to write technical documentation or research grants and so much more.
There’s already a whole industry on how to talk to the human language model, and humans are currently way smarter.
The models are quite good at this already, so while (of course) they will be getting better, the (much) larger gain in performance will be from users giving up more and more privacy ultimately (or allowing local models more access).
Writing better prompts is not as big of a deal as people keep making it, and it exposes how lazy people have really gotten in light of these new tools. If you ask your friend to make a website and then are mad that they used Python on the backend instead of Rust... well, you didn't specify, so it's not really your friend's fault. The fact that specifications are needed to fulfill tasks, or that information about your availability is needed when you're planning to do something, etc., should not be heralded as some sort of amazing "engineering". The term is sickeningly stupid.
Having domain knowledge and expertise helps with communication and how to correctly identify good design - there is nothing interesting really going on here.
I don't think so. It might improve a little but not to the point of making it unnecessary.
The problem is not the LLM; the problem is on the other side of the keyboard. No matter how good they are, LLMs can't read minds; they can only guess from what you write, and they won't be able to help you unless you give them enough information to express the problem. And it is not an issue specific to LLMs: we already do "prompt engineering" when we are talking to humans. We don't call it that, but it is the same idea: write your message carefully to make sure the person on the other side understands your request.
Maybe, but there is a balance to be found, if the AI asked you for perfect clarity, it would be more like a programming language than a natural language model.
The whole point of GPTs is that they are able to guess what you need based on incomplete prompts and how people usually react to these incomplete prompts. Prompt engineering is the art of telling just enough information for the AI to understand the specificity of your request, but still let it fill in the blanks.
Even as LLMs get better over time at understanding ill-formed prompts, I expect that API prices will still continue to depend on the number of tokens used. That’s an incentive to minimize tokens, so “prompt engineering” might stick around, even if just for cost optimization.
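If you want to see where the tokens (and the cost) actually go, something like tiktoken makes it easy to compare prompts; a minimal sketch, assuming the tiktoken package is installed and the model name is valid:

    import tiktoken

    def count_tokens(text: str, model: str = "gpt-4") -> int:
        """Count how many tokens a prompt consumes for the given model's tokenizer."""
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))

    verbose = "Could you please, if at all possible, kindly summarize the following text for me?"
    terse = "Summarize the following text:"
    print(count_tokens(verbose), "vs", count_tokens(terse))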
Do you not expect a trend of token prices decreasing over time? There will be businesses using a less cutting-edge model, and the difference in how many words a prompt is won't be a big contributing factor to the total spend of the business.
Good point. On the other hand, for every business that sticks to a less advanced model, there might be a competitor around the corner running the cutting-edge one in an attempt to serve customers better.
I was in your place a few months ago with that opinion. I thought this was a short-term thing and that models would just get easier to use, like many other tools. But that won't stop someone from having to use them with prompt engineering, even if it seems pretty trivial.
It's like the Community Manager role: social networks and building a community seem like something very easy to do, but still, someone needs to do it.
Not so sure about that. The biggest part of prompt engineering I am seeing is of the kind that sets up context to bootstrap a discussion on a predetermined domain.
As I've said elsewhere, in most knowledge work context is key to getting viable results. I don't think something like this is ever going to get automated away, especially in the cases where the context comes from proprietary knowledge.
It depends on how you define "short term". If you mean until AGI, then sure. Until then, however, for anything that is going to potentially generate revenue you will need to consider the points raised by the article to keep costs manageable, to avoid performance regressions, etc.
Interesting: my vision of the future LLM interface also envisions one where more bits of information per second are required per interaction to operate it to exactly the spec that you want. But that's exactly because it'll just be a plain old engineering problem.
I think that fundamentally the UIs will become more realtime. The models will, because of much lower latencies and more efficient inference throughput, become realtime autosuggest: prompt tuning, with i/o feedback at
(reading wpm)/(”ui wpm”)
In fact it might be interesting to just have a model optimize for “likely comprehensible and as concise as possible” rather than “most alike the human dataset after RLHF alignment”, just for this bandrate idea.
Articulating what you want is prompt engineering. Techniques will adapt to technological progression but the people really getting everything they can out of these systems will still be considered engineers.
> Especially when you take into consideration that they are now trained with RLHF on _real_ users' input.
When I started to see people post their amazement at how good the pricing was I began to realize that once again, we are the product right now. We are the new training data and are even paying a nominal fee to be it.
People spent years and years learning how to get the best answers with the least possible effort, and search engines evolved with it. Seems pretty insane to me that we have now devolved into asking insanely specific and obtuse questions to receive obtuse answers to any questions.
I expect the exact opposite. As more rules and regulations get put in place, prompt engineering is going to be the new software development. "I would like you to pretend I need a lawyer dealing in a commercial lease that..."
How does that make sense? LLMs are machines that produce output from input, and the position and distribution of that input in the latent space is highly predictive of the output. It seems fairly uncontroversial to expect that some knowledge of the tokens and their individual contribution to that distribution in combination with the others, and some intuition for the multivariate nonlinear behavior of the hidden layers, is exactly what would let you utilize this machine for anything useful.
Regular people type all sorts of shit into Google, but power users know how to query Google effectively to work with the system. Knowing the right keywords is often half the work. I don't understand how the architecture of current LLMs is going to work around that feature.