This is overstated and easily disproved. ChatGPT produces accurate facts as a matter of course. Test it right now: "How tall is the Statue of Liberty?", "In what state was Abraham Lincoln born?", etc. There is an endless list of factual questions for which it produces accurate and correct answers.
It loves to hallucinate API methods that don't exist. It can struggle with individual logic problems or questions. But these limitations have clear explanations. It's doing language completion, so it will infer probable things that don't actually exist. It wasn't designed to do logic problems, so it will struggle with whole classes of them.
Dismissing it as completely unreliable is unnecessary hyperbole. It's a software tool that has strengths, weaknesses, and limitations. As with any other software, it's up to us to learn what those are and how it can be used to make useful products.
It’s completely unreliable because it sounds confident and plausible even when it’s wrong. Essentially it fails in ways that are very hard to detect.
If you ask it a question, you have no way to know whether it’s right or not unless you already know the answer.
I’ve asked it many questions about my area of expertise, distributed systems, and it was often flat-out wrong, but to a non-expert it sounded perfectly plausible. I asked my wife, a physician, to try it out and she reported the exact same problem. Some of the answers it gave even veered from merely annoying to actually dangerous.
That doesn’t mean it can’t be useful, but using it to answer important questions or using it to teach yourself something you don’t already know is dangerous.
Weirdly enough, people seem to view this only from one side: "ChatGPT is an expert at lying!".
Actually, it was explicitly instructed to sound confident, serious, and expert.
The important thing here is: to most people, sounding like that is enough to make them believe it.
This heuristic is trivial to hack, gets exploited not just by ChatBots and is downright dangerous on a societal level.
Acquiring and vetting information is a task of crucial importance situated in between journalism and science. Just trawling the internet and autocompleting the acquired driftwood is a weird approach to begin with. Doing so without transparency in private companies is simply folly.
I know flesh and blood humans who are confident in their knowledge but still manage to make shit up and get facts wrong. I make up my own mind about what's true and what's useful by considering an ensemble of sources.
In my experience with ChatGPT, it gets less wrong and more right than most humans, and it's able to find relationships among facts and concepts that are impossible with Wikipedia and Google.
It's extremely useful and powerful in its current form, in spite of its many limitations.
With other tools, we don't demand perfection to find them useful.
Just because a hammer sometimes misses the nail and puts a hole in the wall, just because Wikipedia sometimes gets vandalized with lies, doesn't mean hammers and Wikipedia aren't essential tools.
A good hammer is designed with an optimal weight distribution that makes it easy to hit the nail. ChatGPT may be better than using your own head at times, but people will still sue sooner rather than later over lost digits.
However, it provides an excellent opportunity for us mere humans to reflect upon the way we organize our society in regard to vital information processing.
We are awfully bad at it. So bad, actually, that we can't even tell how it should work in the first place, and we're already disturbed by a piece of software spouting nonsense.
Yeah, it's kinda funny how we simultaneously expect the ideal AI to have human intelligence, but also not to make human mistakes. Guess what? If you build a perfect replica of a human brain, it will make human errors, because that's what human brains do.
And incidentally, it's the consequences of those errors that teach us not to repeat them... which is the same feedback loop we use to train AI in the first place!
>I know flesh and blood humans who are confident in their knowledge but still manage to make shit up and get facts wrong. I make up my own mind about what's true and what's useful by considering an ensemble of sources.
It's basically like political candidate Kari Lake in Arizona. She is an extremely confident speaker, talking about the conspiracies against her and how she really won the election for AZ governor, none of which has any evidence or support. She is such an excellent speaker that she's chilling to me, because people will often just believe someone who is so well spoken. ChatGPT is that. Imagine connecting what ChatGPT says with a "super confident fascist human avatar" that is kind of like Kari Lake. It will happen, because it's obvious.
We are just going to have ever more persuasive unreliable sources of information.
It may describe a number of people you have worked with, but it does not describe the average medical textbook, which will usually describe the best knowledge we have of a condition. If ChatGPT produces something that looks like it came from a medical textbook, then it is hard to blame many people for believing it. More generally, people started to trust computers more than people as soon as calculators demonstrated their reliability.
I don't know, I get this feeling ChatGPT has also read all the quack books where the condition depends on the alignment of stars when you were born, or how chicken entrails land when the shaman "studies" them. Those books are also written confidently, without giving any sign of being completely fabricated.
In the end, why do people believe what they believe? The answer is it connects with what they already believe. That's it. If you had a diet of what we call science, you'll have a foothold in a whole bunch of arenas where you can feel yourself forward, going from one little truth to another. If you are a blank slate with a bit of quackery seeded onto it, you end up believing the stars predict your life and that you can communicate with dead people with a Ouija board.
ChatGPT doesn't have an "already believe". It just has a "humans on the panel give me reward" mechanism, and all it's doing is reflecting what it got rewarded for. Sometimes that's the scientific truth, sometimes it's crap. All the time it's confident, because that's what's rewarded.
And hopefully after someone shows you that they can be described that way, you stop trusting what they say (at least about the thing they are known to be unreliable about).
A whole lot of people can be trusted within a reasonably identified set of constraints, and we do trust them there, while discounting them in areas where they're known to spew bullshit. It's very rarely all or nothing.
Obviously. ChatGPT will randomly spew bullshit about nearly any topic, though. So you can really only trust it for things you are already an expert in or things that are very easy to verify.
It's not random. It's probabilistic. There is a very big difference. The types of errors are predictable, which means they can be improved, and empirically so if you follow the academic literature.
I have read enough of the academic literature to say with some confidence that the majority of the errors I’m talking about are not probabilistic in any meaningful sense. At least not in a way that can be predicted from the amount of readily available training data.
I don’t trust a random Hacker News comment by someone I don’t know or can’t verify any further than I can throw it, so in that sense they are probably similar.
The comment I was replying to was “the errors you’re talking about are probabilistic if you read the literature” my response is “no they aren’t I have read the literature.”
Note that I’m talking about a specific class of error and proving a negative is difficult enough that I’m not diving through papers to find citations for something n levels deep in a hacker news thread.
Here you go, here's a bunch of papers that you have not read. If you had read them then you would know that the errors are predictable and therefore there are many measurable ways to make improvements.
I took the time to read through the only one of those that looked like it was peer reviewed and read the abstracts for the rest.
"Survey of Hallucination in Natural Language Generation" only provided promising methods for detecting hallucinations in summarization tasks, which are of course much easier to detect. Searching arXiv for a list of non-reviewed papers that sound like they might be related to the topic at hand is a fun debate strategy. But no one else is reading this far into an old thread, so I'm not sure who you're trying to convince.
None of these papers prove your claims about hallucinations, and most aren't even trying to. And even if they did, the errors I'm saying aren't meaningfully probabilistic aren't hallucinations in the first place.
What if, instead of asking ChatGPT for facts, we asked it for plausible methods of determining those facts? So rather than answering "The Statue of Liberty is 151 feet tall," it could respond with "Look it up yourself on Wikipedia."
ChatGPT has made up sources when I've asked for them, so I wouldn't 100% trust it to provide great sources.
Also, if ChatGPT just redirects you to search, doesn't that remove most of the value?
> Also, if ChatGPT just redirects you to search, doesn't that remove most of the value?
That's pretty much the conclusion I've already come to. I have to verify everything ChatGPT tells me, so using it is pointless if I already know where/how to look something up.
> It’s completely unreliable because it sounds confident and plausible even when it’s wrong.
That applies to my boss, college professors, Wikipedia, or my neighbor after a couple of beers. It's not designed to give the correct medical steps to diagnose a brain injury.
The problem is most people do believe these confident facts spoken by a leader; they aren't evaluating them. An easy example is the fairly widespread belief that somehow the US presidential election was stolen. People said it, but the mountains of evidence don't exist. Yet millions of people believe it.
You're not thinking about this probabilistically. This isn't about "correct" and "incorrect". There is an entire universe between those two poles with regards to most kinds of knowledge. Facts are a specific kind of knowledge, and frankly, not that important. It's more important to know why the Civil War was fought than to know the exact dates and locations of each battle.
I've heard people say they wouldn't use ChatGPT even if there was "only a 1 in a billion chance that it made their bank account details public"...
May I introduce you to some very real probabilities:
If you are living in America there is a 1 in 615 chance that your cause of death will be the result of an automobile accident.
So yes, it is unlikely that we will ever create a tool that can answer with 100% confidence. It is also unlikely that a manufacturing process will result in a widget that conforms to allowed tolerances 100% of the time.
However, in manufacturing this is understood. A defect rate of 3.4 per million widgets is considered an incredibly high process capability.
These tools are being made more reliable everyday. Please have a realistic goal in mind.
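Putting those two figures side by side (just redoing the arithmetic with the numbers quoted above, nothing more):

    # Hypothetical "1 in a billion" leak fear vs. the quoted 1-in-615 lifetime car-accident odds.
    p_leak = 1 / 1_000_000_000
    p_car = 1 / 615

    print(p_car / p_leak)  # ~1.6 million: the everyday risk is over a million times larger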
Edit: Well I've learned this much. Many of you are not Bayesians!
I asked it 3 questions about distributed systems, 2 of the answers were so fundamentally wrong that they would lead to a complete misunderstanding of the subjects, and if you tried to implement systems based on that understanding they would be broken in hard to detect ways.
The problem is that an LLM can only be as reliable as its training material.
For example I wrote a blog post on the 2 generals problem. In the comments there is more text from inexperienced people asserting that a solution exists than there is text from my original article.
An LLM trained on that article and comments will always be wrong.
LLMs can be trained on smaller vetted training sets sure, but they also currently require massive amounts of data and there’s no guarantee that just waiting a few months or years will improve reliability enough to fix these issues.
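A toy sketch of that point (the snippets and counts are made up, not taken from the actual post): if the objective is just to continue text the way the corpus does, and confident wrong assertions outnumber the correct article, the most probable continuation is the wrong one.

    from collections import Counter

    # Hypothetical training snippets: one correct claim from the article,
    # several confident but wrong claims from the comment section.
    corpus = (
        ["the two generals problem has no solution"] * 1
        + ["the two generals problem is easily solved with retries"] * 4
    )

    # Crude stand-in for "pick the most probable continuation".
    most_probable = Counter(corpus).most_common(1)[0][0]
    print(most_probable)  # the wrong claim wins, because it dominates the corpus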
Isn't part of the problem that an LLM will complete a statement with words that form a kind of statistical midpoint of the things that might come next? But when we're dealing with specific things in the real world, the true way to complete a sentence will not be that kind of statistical function; in fact, it will only reliably follow from actually knowing the answer.
So, it's sort of like driving a car based not on a complete map of the area, but on a partial map where the missing pieces are filled in from statistical averages of things that appear generally in the world, but not necessarily in that location. We all may be impressed at how close the predictions come to reality, but the only truly reliable thing is that those predictions, measured with precision, will be wrong. And if you rely on them without verifying them yourself, you might have a serious problem.
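As a crude numeric caricature of that "filled in from statistical averages" idea (the statue heights are approximate and purely illustrative): an average over similar-looking cases gives an answer that is plausible for the category but wrong for the specific thing.

    # Approximate heights, in feet, of some other famous statues a model might "know".
    similar_statues = {
        "Christ the Redeemer": 98,
        "The Motherland Calls": 279,
        "Statue of Unity": 597,
    }

    # Filling the gap in the "map" with a category-level average...
    guess = sum(similar_statues.values()) / len(similar_statues)
    print(round(guess))  # ~325 ft: plausible-sounding, but the Statue of Liberty is 151 ft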
I’m going to refrain from epistemological instruction but I would recommend you start with the Greeks and at least cover Kant and the analytical philosophers that followed.
The “true” way to answer a proposition is not even remotely as simple as you’ve described.
Or we could just deflate "truth" for this particular discussion, since what matters is whether predicting the next word works well enough without knowledge of the world. Pragmatically speaking, that's what matters.
Yes, but you have to know how to improve the results, and the techniques don't always work. I asked for a table of sports results from a specific date in the past that's easy to look up, and it gave me wrong results. But they looked plausible. When I asked it about a specific result, it gave me a completely misleading answer, but one that would be plausible to someone who doesn't know the sport very well. After I informed it of the wrong answer, it corrected itself by giving a different wrong answer. I tried that a couple of times with the same result.
This becomes a problem when people start to treat LLMs as authoritative on facts.
I have no strong interest either way. I played with ChatGPT with no expectations going in, and I’ve played with it for many hours since. I have found it severely lacking for many purposes for which people have said it’s suited. That’s it.
Consider why ChatGPT is banned on Stack Overflow despite the median quality already being poor there (because its answers/questions are even worse than that; the more answers from ChatGPT, the worse the site).
The irony here is that I get more help from ChatGPT than Stack Overflow.
I suspect I'm better at using the tool than most people. So if most people are not good at using the tool and then submitting their poor results to Stack Overflow, I can see the problem.
> Dismissing it as completely unreliable is unnecessary hyperbole
No it is not. If it hallucinates facts in 5% of cases, that is completely unreliable. You're basically saying "yes, it produces facts, as long as you already know the fact and double-check it!" You cannot trust it with knowledge-based questions whose answers you don't already know.
I just asked it the height of the Statue of Liberty. I had to then look it up for myself just to see if it was telling the truth, because there's no citation or 'confidence level'. How is that useful?
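To put a number on why even a modest-sounding error rate feels completely unreliable (assuming, purely for illustration, a flat 5% hallucination rate and independent questions):

    # Chance of at least one hallucinated answer across a session of n questions.
    p_wrong = 0.05
    for n in (1, 10, 20, 50):
        print(n, round(1 - (1 - p_wrong) ** n, 2))
    # 1 -> 0.05, 10 -> 0.4, 20 -> 0.64, 50 -> 0.92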
Yep. Ask it a popular question, and it might get it right. Ask an equivalent question that is unknown and it gets it hilariously wrong.
> What is the volume of the Statue of Liberty?
The Statue of Liberty is a hollow copper statue, and its volume can be calculated by multiplying its height, width, and depth. The statue's height is 151 feet (46 meters) from the base to the torch, and its width is 35 feet (10.7 meters) at the waist. The depth of the statue, or the thickness of the copper shell, is about 2.25 inches (5.7 centimeters) throughout most of the statue.
Using these measurements, the volume of the Statue of Liberty can be calculated as follows:
Volume = Height x Width x Depth
Volume = 151 ft x 35 ft x 0.1875 ft (2.25 inches converted to feet)
Volume = 985.3125 cubic feet
Therefore, the volume of the Statue of Liberty is approximately 985.3125 cubic feet (or 27.86 cubic meters).
It sounds confident, the maths looks correct, but the answer is entirely wrong in multiple ways. It might be interesting to see what prompt you would need to use for it to calculate, say, the cylindrical volume of the main body.
It's actually 990.9375. Curious how it botched the multiplication but still got an almost-right answer (to that multiplication, not to the actual question.)
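For what it's worth, redoing the multiplication ChatGPT printed above (this only checks its arithmetic, not the statue's actual volume, which height x width x shell thickness doesn't give anyway):

    # Reproduce ChatGPT's own formula: height x width x copper shell thickness.
    height_ft = 151
    width_ft = 35
    depth_ft = 2.25 / 12  # 2.25 inches = 0.1875 ft

    print(height_ft * width_ft * depth_ft)  # 990.9375, not the 985.3125 it reported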
I will point out that it's frequently useful with knowledge-based questions where it's hard to generate a correct answer, but easy to verify whether an answer is correct.
You don't consult most people you know on things you'd consult a specialist about, but asking ChatGPT is portrayed as something closer to consulting a specialist than to consulting someone on the street.
> This is overstated and easily disproved. ChatGPT produces accurate facts as a matter of course. test it right now
"the idea that this stopped clock is broken is overstated and easily disproved. the clock produces accurate time as a matter of course. go ahead and ask what time is it, just make sure it is 3:45am or 3:45pm"
What? The argument here is that ChatGPT giving factual answers is a mere coincidence, not at all what the model was trained to do. It's a broken clock: it can tell you the correct time in very specific contexts, but you shouldn't rely on it as your source of factual information. If you feed it enough data saying the Statue of Liberty is 1 cm tall, it will happily answer a query with that "fact".
Any analogy is incorrect if you stretch it enough, otherwise it wouldn't be an analogy...
My clock analogy works up to this: ChatGPT's success in factually answering a query is merely a happy coincidence, so it does not work well as a primary source of facts. Exactly like... a broken clock. It correctly tells the time twice a day, but it does not work well as a primary source of timekeeping.
Please don't read more deeply into the analogy than that :)
Nope, not random behavior. ChatGPT works by predicting the continuation of a sentence. It has been trained on enough data to emulate some pretty awesome and deep statistical structure in human language. Some studies even argue it has built world models in some contexts, but I'd say that needs more careful analysis. Nonetheless, in no way, shape or form has it developed a sense of right vs wrong, real vs fiction, in a way that lets you depend on it for precise, factual information. It's a language model. If enough data says bananas are larger than the Empire State Building, it would repeat that, even if it's absurd.
I didn’t say it was random behavior. You did when you said it was a happy coincidence.
I know it is just a language model. I know that if you took the same model and trained it on some other corpus that it would produce different results.
But it wasn’t, so it doesn’t have enough data to say that bananas are larger than the Empire State Building; not that it would really matter anyway.
One important part of this story that you’re missing is that even if there were no texts about bananas and skyscrapers, the model could infer a relationship between those based on the massive amounts of other size comparisons. It is comparing everything to everything else.
See the Norvig-Chomsky debate for a concrete example of how a language model can create sentences that have never existed.
> the model could infer a relationship between those based on the massive amounts of other size comparisons
That is true! But would it be factually correct? That's the whole point of my argument.
The knowledge and connections that it acquires come from its training data, and it is trained for completing well-structured sentences, not correct ones. Its training data is the freaking internet. ChatGPT stating facts is a happy coincidence because (1) the internet is filled with incorrect information, (2) its training is wired for mimicking human language's rich statistical structure, not generating factual sentences, and (3) its own powerful and awesome inference capabilities can make it hallucinate completely false but convincingly-structured sentences.
Sure, it can regurgitate simple facts accurately, especially those that are repeated enough in its training corpus. But it fails for more challenging queries.
For a personal anecdote, I tried asking it for some references for a particular topic I needed to review in my masters dissertation. It gave me a few papers, complete with title, author, year, and a short summary. I got really excited. Turns out all the papers it referenced were completely hallucinated :)
Clock correctness is relative. If the antique windup clock in your living room is off by 5 minutes, it's still basically right. But if the clock in your smartphone is 5 minutes off, something has clearly gone wrong.
Nor is it only incorrect one billionth of the time, as you seem to be indicating through your hypotheticals. Depending on what I've asked it about, it can be incorrect at an extremely high rate.
To riff off the statue of liberty, a conversation (edited for brevity, to remove the ChatGPT waffling):
How tall is the Statue of Liberty?
The Statue of Liberty is approximately 305 feet or 93 meters tall from the ground to the tip of the torch.
When did terrorists attack the statue of liberty?
The Statue of Liberty has not been attacked by terrorists. However, there have been some reports and rumors circulating on the internet about such attacks.
In which works of fiction did terrorists attack the statue of liberty?
I am not aware of any works of fiction in which terrorists attack the Statue of Liberty. While the Statue of Liberty has been featured in many works of fiction, including movies, books, and TV shows, I cannot recall any specific instance of it being attacked by terrorists.
Do you know about the game Deus Ex?
Yes, Deus Ex is a popular action role-playing video game series developed by Ion Storm and published by Square Enix. The game takes place in various locations around the world, including New York City and the Statue of Liberty. In the game, the player must thwart a terrorist attack on the Statue of Liberty by infiltrating the terrorist organization and gathering intelligence.
(end of chat)
It directly contradicts itself here. The summary for Deus Ex is also wrong (terrorists attacked the statue of liberty before the game, and only the very first part of the game features the statue and it's not to "thwart a terrorist attack on the Statue of Liberty").
It can generate very accurate things, and it can generate nonsense. However, you're never quite sure which it is, and there is no "well, I think it was something like this or that, but I don't know for sure". So the only way to know for sure it's accurate is to verify it with another source. And if you need to verify the accuracy then it is actually just inaccurate.
Of course, humans can also generate nonsense, but you shouldn't measure ChatGPT vs. all of humanity, because then it's one vs. the worst of several billion. That's not a fair match. Compared to a reasonable person, ChatGPT is horribly inaccurate.
> It directly contradicts itself here. The summary for Deus Ex is also wrong (terrorists attacked the statue of liberty before the game, and only the very first part of the game features the statue and it's not to "thwart a terrorist attack on the Statue of Liberty").
I wouldn't call that a contradiction. I think it's fair to only count attacks that go through.
It says it doesn't know about fictional terror attacks on the statue of liberty, and then it does point out one.
Also, turns out "The Statue of Liberty has not been attacked by terrorists" isn't even correct either: https://en.wikipedia.org/wiki/Black_Tom_explosion – you could argue a bit about the semantics of "terror attack" and whether it was the statue of liberty that was attacked, but it turns up on all the search results (I actually wanted to find other fictional accounts).
> It says it doesn't know about fictional terror attacks on the statue of liberty, and then it does point out one.
"I wouldn't call that a contradiction. I think it's fair to only count attacks that go through. "
It "knows" about a thwarted plan for an attack. I don't think it's a "direct contradiction" to "know" about that but still say "I am not aware of any works of fiction in which terrorists attack the Statue of Liberty."
It's not thwarted. Look at the picture from the previous comment. Does that look "thwarted" to you? ChatGPT confuses a lot of things here, and now you're confused because ChatGPT is confused.
There is a terror attack, but it's NOT the case that "the player must thwart a terrorist attack on the Statue of Liberty". It's merely something that happened in the fictional setting, before the story even takes place.
In many ways this thread demonstrates my point quite well: it's kind-of roughly correct, but mixes up some crucial things which does make a meaningful difference, and now someone unfamiliar with the subject matter is confused as well. In this case, it's just a game and of no real importance, but it goes to show ChatGPT should not be used for anything of any importance in its current state.
"I am not aware of any works of fiction in which terrorists attack the Statue of Liberty" followed by "I know about Deus Ex" sounds like a contradiction to me (even though the summary is wrong). Had it instead said "I've heard of it, but don't really know the details" or something to that effect instead of an authoritative statement would be different.
But it doesn't really matter; it's inconsistent and unreliable. Whether you could classify this or that as a "contradiction" isn't important and is mostly just pedantic arguing.
It's not a contradiction because it doesn't know about the successful attack in the Deus Ex backstory. At least, that's the claim in the conversation you quoted. Wrong but not self-contradictory.
> But it doesn't really matter; it's inconsistent and unreliable. Whether you could classify this or that as a "contradiction" isn't important and is mostly just pedantic arguing.
But it wasn't inconsistent in your conversation. That's my point. It was wrong in a self-consistent way.
> it doesn't know about the successful attack in the Deus Ex backstory
There ya go:
What's the backstory of the statue of liberty in Deus Ex?
In the game Deus Ex, the Statue of Liberty serves as a central location and plot point. The statue has been converted into a highly secured base for the United Nations Anti-Terrorist Coalition (UNATCO), which is the organization that the player character, JC Denton, works for.
The backstory of the statue in the game is that it was damaged by terrorists in an attack prior to the events of the game. The terrorist group responsible for the attack is a shadowy organization known as "Majestic 12," which serves as the game's primary antagonist.
In response to the attack, the statue was converted into a heavily fortified UNATCO headquarters, which serves as the player's base of operations for much of the game. The statue is also home to various high-tech facilities and laboratories, where UNATCO develops advanced weaponry and nanotechnology.
Throughout the game, the player learns more about the backstory of the statue and the role it plays in the game's overarching conspiracy plot.
Was I not clear enough that I was talking about the single transcript? I really don't know how to be clearer.
You're now demonstrating inconsistency between conversations. Great. But your earlier claim was that it directly contradicted itself inside that specific conversation. I don't think it did.
(And no, a new conversation where it directly contradicts itself inside the same conversation won't change my mind, because I already know it can do that. I was just saying it didn't in your original example.)
I continued the previous conversation, but okay. You're being exceedingly "but akshually" pedantic about the entire thing. This entire conversation has zero value and nothing of any significant meaning was discussed at all. Congratulations; I hope you feel very smart about yourself. I was a fool to even reply to your initial comment to start with.
It's fair to call it untrustworthy, however. In other words, it's not safe to trust or believe anything it says without additional validation.
My policy is that teams should use ChatGPT but that everyone is individually responsible for their tasks. In other words, it's a tool, not a replacement, and if the tool does something wrong the responsibility still resides with the employee to the degree that they could have validated the results, but failed to do so. I think this strikes a good balance that preserves human jobs for as long as possible.
A lot of fictional books will contain claims that are true in the real world. It doesn’t fundamentally change the nature of what the author is doing, or whether you should trust any claim chosen at random from a fictional book.
You’re missing the entire point the op was making.
The fact that these queries come back correct is not because that is the intent of ChatGPT’s “query engine”; it’s because it just so happens to have been fed data that synthesizes to something sane this time (vs. confident crap any other time).
Compare this to a database where queries are realized against an actual corpus of data to give actual answers.
The purpose of ChatGPT is to provide fluent gibberish when given a prompt. It’s a LARGE LANGUAGE MODEL. It’s only able to give you a response that looks good as language (over a distribution of words), not anything actually knowledgeable.
When I query ChatGPT about a book or show I know well, it will start giving wrong answers after a few questions. I’ve had it tell me that I was wrong when I was trying to correct it. I’ve also had ChatGPT generate syntactically invalid code that it provided results for. And valid code that produced different outputs from its explanation when I ran it. And then there are times when I’m having it generate fiction, and it gives me a nonsensical scenario. It’s also mistaken me for an LLM on one occasion.
>test it right now: "How tall is the Statue of Liberty?", "In what state was Abraham Lincoln born?", etc. There are an infinite list of factual questions for which it produces accurate and correct answers.
Those are easy because you can verify them. But you actually did go back and check to make sure they were right after you got answers out of it, right? And why is that, don't you trust it to just be right?
This isn't new, but after a bit of prompt hacking, here's ChatGPT very confidently telling you the wrong height of the Statue of Liberty[1]. I made it off by 20 feet, so there's no way anyone would seriously know.