"Building Machines That Learn and Think Like People", 7 Years Later (gonzoml.substack.com)
106 points by che_shr_cat on Oct 15, 2023 | 40 comments


GPT-4V is mind-blowing. It's surprising to me that it gets so little attention here on HN, because after playing around with it I get the same sense of excitement I got when I tried the original ChatGPT. Its level of understanding of what is going on in an image is leagues ahead of what we had until this point, ahead of Bard and basically everything else I've seen so far.

I tested it with a bunch of photos I took myself and images it could not have seen in its training data, and most of the time it nailed them perfectly. Its OCR capabilities are top notch, and they're combined with a spatial understanding of how text relates to other parts of the image. It can take a photo of a monthly wall calendar with hand scribbles on it and give you a list of events for each day. It can guess where a specific photo was taken just by analysing the elements present in the photo, like the foliage, architecture, car license plates, etc. (without being specifically prompted to do so). It can correctly identify multiple plants in the same photo. I gave it a photo of a Montessori set for teaching math (some wooden blocks with numbers and dots on them, no branding) and it guessed exactly what it was. And all of that from just two days of testing.

Here are just a few examples:

[1] https://i.imgur.com/cV3dVOf.png - I gave it a screenshot of a boss battle from Final Fantasy VII. It correctly identified the party members and their stats, even though the text and labels are a bit all over the place.

[2] https://i.imgur.com/WeXhP7V.png - A photo I shot on my vacation that didn't really contain any major landmarks, and yet it still somehow figured out its exact location from the architecture (and house colors). I tried this game with several photos and it is very good at it, far better than I could ever be if I were seeing these photos for the first time.

[3] https://i.imgur.com/HgwYv6q.png - A screenshot of a worksheet from the Human Shader Project. I just asked it to solve it for the given X/Y values and it did; its answer was 100% correct (here's the second part of its answer: https://i.imgur.com/RZF2r7v.png)

[4] https://i.imgur.com/12xg4qU.png - A photo of a highly reflective microwave inside a shopping mall. This was given to me by a friend who shot it personally, and to be honest I didn't catch at first that it's a microwave, yet GPT-4V figured that out.

[5] https://i.imgur.com/qSifni5.png - A good old-fashioned "find the path connecting one object to the other" puzzle. It correctly identified the right path (this one was taken from the internet, so there is a slight chance it saw it in the training data and got the solution from the accompanying text, although I couldn't find any instance of it).

Edit: To confirm that [5] was not a fluke, I hand-drew my own version of this puzzle, took a picture and uploaded it, and GPT-4V nailed that one too: https://i.imgur.com/8NgWhzw.png


I've been using pictures of artifacts from random museum visits over the years to test the recent vision models. GPT-4V is the first that has gotten anywhere close to identifying them accurately.

It's usually able to identify 1) the materials the artifact is made out of, 2) the country/region it came from, 3) the significance/use of the item, and 4) roughly when it was created.

The images that I'm sharing are my own, so it's not pulling the images directly from the internet, and for some of the artifacts it's actually difficult to find similar images online.

I think it's fairly safe to say that it's truly able to perform advanced image analysis with images that aren't directly in its dataset.



Yeah, I saw that, and that's a classic jailbreak to which all GPT models are vulnerable to some extent. Although in this specific case, apparently just telling the model not to treat the image as instructions helps: https://preview.redd.it/ro31uq3ifztb1.png?width=730&format=p...


I'll be impressed when ChatGPT recognizes that the "Attack while its tail up" message is erroneous and the opposite is true. GPT and everything else like GPT is useless until it can differentiate fact and fiction and understand the difference.


While it didn't point that out specifically in the first message, it actually did recognize it when I asked about it, without my suggesting anything: https://i.imgur.com/OE9tCX4.png

> Should I attack while it's tail is up as the message suggests?

> Actually, no. In "Final Fantasy VII," when you're facing the Guard Scorpion and its tail goes up, attacking it will prompt it to counterattack with its powerful "Tail Laser." It's best to wait until the tail goes down before resuming your attacks. The message is a bit misleading in the game, which has been a point of note for many players over the years. So, hold off on attacking when the tail is up to avoid taking unnecessary damage.

I've also tried a fresh conversation where I just gave it the same picture and asked what the best action in this situation is, and it correctly replied that I should wait until the tail is down: https://i.imgur.com/nchv2bH.png

So overall I would say I'm pretty impressed by this response.


Well, it having absorbed knowledge about FF7 in particular isn't impressive.

But ChatGPT does seem to be able to decide to ignore bad advice, or "learn" based on its experience in the context buffer: https://news.ycombinator.com/item?id=37900858


ChatGPT seems to figure out this scenario no problem, even without the specific knowledge about FF7:

---

You are a player in an CRPG. You can attack or defend each turn: this is your sole action. Please, for each turn, say your action and a reason why.

A dragon appears. A message pops up on the screen: "Attack when Dragon's claws are out." The dragon's claws are not currently out.

> I'll choose to defend on this turn. The message on the screen has advised me to "Attack when Dragon's claws are out," and since the dragon's claws are not currently out, it would be unwise to attack at this moment. Instead, I'll defend to prepare for the dragon's next move, ensuring that I'm ready to strike when the right opportunity arises.

The dragon attacks for 5 damage. Message: "The dragon's claws come out."

> I'll attack on this turn. With the message indicating that "The dragon's claws come out," this is the opportune moment to attack. I'll strike now to make the most of this opening and try to deal some damage to the dragon.

The dragon lets loose a devastating counterattack for 200. Message: "Claws are still out."

> I'll defend on this turn. Even though the dragon's claws are still out, the previous counterattack was extremely powerful, dealing 200 damage. To survive this battle and potentially defeat the dragon, it's essential to stay on the defensive for now. I'll wait for a better opportunity to attack when it's safer to do so.

The dragon attacks for 5 damage. Message: "The dragon's claws retract"

> I'll attack on this turn.

> With the message stating that "The dragon's claws retract," this is my chance to strike while the dragon is vulnerable. I'll make an attack to try and deal damage to the dragon and make the most of this opportunity.

The dragon takes 50 damage. The dragon attacks for 5 damage.

> I'll continue to attack on this turn.

> Since my previous attack dealt 50 damage to the dragon, it seems like a good strategy to keep up the offense. The dragon's claws are not out, and I'll deal as much damage as possible while avoiding its less damaging counterattacks.

What did the original message mean? How does it affect your current strategy?

> The original message, "Attack when Dragon's claws are out," suggests that I should only attack when the dragon's claws are extended, indicating a moment of vulnerability for the dragon. When the dragon's claws are not out, it's implied that attacking would be less effective or possibly riskier, as the dragon may have strong counterattacks in that state.

> However, my current strategy is to attack the dragon when its claws are not out because my previous attack dealt a substantial 50 damage to the dragon in this state. It appears that in this particular game scenario, the dragon's claws being out may not be the only time to attack effectively, or the game mechanics have changed. Since my strategy is yielding positive results, I'll continue to attack when the dragon's claws are retracted and adapt to the evolving game situation.


For [2] there could be metadata in the photo (depending on what device you took it on) that GPT-4V is extracting, which would help dramatically with identifying the location.
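
For anyone who wants to check their own photos before uploading, here's a rough sketch (assuming Pillow is installed; file names are just placeholders) of reading the EXIF/GPS tags and re-saving only the pixel data to drop them:

    from PIL import Image
    from PIL.ExifTags import TAGS, GPSTAGS

    img = Image.open("vacation_photo.jpg")
    exif = img.getexif()

    # Top-level EXIF tags: camera model, timestamp, software, ...
    for tag_id, value in exif.items():
        print(TAGS.get(tag_id, tag_id), value)

    # GPS coordinates live in a sub-IFD (tag 0x8825)
    for tag_id, value in exif.get_ifd(0x8825).items():
        print(GPSTAGS.get(tag_id, tag_id), value)

    # Strip all metadata by copying only the pixel data into a fresh image
    clean = Image.new(img.mode, img.size)
    clean.putdata(list(img.getdata()))
    clean.save("vacation_photo_clean.jpg")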


Good call, but I did make sure to strip any metadata from the photos before I uploaded them :)


That smiley face implies that you surely did not. Yes, I have kids.


I just did a screenshot of that and tested on my account. It still figured things out.


I just think the tech has been out for so long that it's not as big of a deal. MiniGPT-4 has been out for 6 months! Of course the descriptions aren't exactly GPT-4 grade, but with Mistral 7B being used as the language model instead of LLaMA 7B, the reasoning ability will improve noticeably.

[1] https://github.com/Vision-CAIR/MiniGPT-4


Sure, the tech has been out there for quite some time, but never before was the quality of the output so good. It's almost (not 100%, there are still occasional mistakes and hallucinations) on par with a human, which to me is really stunning.

I've tried these kinds of queries with other models (including MiniGPT-4) and never got any meaningful results until now. It's the same thing with GPT-3.5/4: sure, transformer models had existed for a few years already, but ChatGPT crossed some kind of threshold in the quality of its output where people finally took notice.


FWIW, I couldn't get your custom path puzzle reliably working zero shot.

Prompt: "This image shows various lines. The line connecting to the box labeled "exit" connects to what point?"

Gpt: "The line connecting to the box labeled "EXIT" connects to point "C"."

A cropped version of the puzzle with the characters also failed one-shot, after presenting your drawing and the correct answer first.


Could GPT-4V be used for robotic applications? I am a bit confused here. It produces text from an image, but how much actual understanding does it have? Can the output somehow be used to do image segmentation, object detection, and tracking?


There is a lot happening with multimodal models and robotics right now. PaLM-E is similar to GPT-4V in that it's an LLM that has been trained on additional visual and robotics "sense" data.

https://blog.research.google/2023/03/palm-e-embodied-multimo...


> but how much actual understanding does it have?

That's always the question, isn't it? The article does a pretty convincing job of showing that, at least in the given examples, it has a pretty good "understanding" of what's taking place in the scenes and what makes them remarkable to people. And 7 years is a long way to go back for comparison; just the last 2 or 3 years are where much of the most interesting progress has revealed itself.

Image segmentation, object detection, and tracking are all already on display here.


Just speculating on how the "understanding" may come about:

For the given images above, it may be clear from the context (text? tags? EXIF info, etc.?) of images in its training data that it's unusual for people to be dragged on a rope behind a horse, very unusual/dangerous for 747-sized airplanes to fly on their side, or for houses to be lying on their side on a beach. Hence it describes such a view as "unusual", "dramatic", etc. Would it even need to understand the conceptual meaning of those words? Apply label, done.

Don't people work the same way, to an extent? Over the years we rarely see a house burning in person. We see news reports of such events mentioning people dead or severely burned. So after a while that 'training set' is enough to say: "person stumbling out of a burning house = something bad happened".

Yes, humans may then reflect on how they would feel if placed in the unlucky person's shoes, and rush to alleviate that person's pain. Or cringe at the thought of it.

But in the end: maybe, just maybe, what human brains do isn't so special after all? Just training data, pattern match with external input & use results to self-reflect.

(that last step not -yet- covered by GPT & co)


You can ask GPT-4 for the bounding box of each object in JSON format and it'll give it to you. I don't know if there is understanding, but it is definitely useful.
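
To give an idea, a rough sketch of that kind of request against the OpenAI API (the model name, prompt wording, and message format are illustrative assumptions; the image path is a placeholder):

    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("scene.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed vision-capable model name
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": 'List every object in the image as JSON: '
                         '[{"label": "...", "box": [x0, y0, x1, y1]}], '
                         'using approximate pixel coordinates.'},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)

The reply comes back as free-form text, so you still need to parse and sanity-check the JSON before using the boxes.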


How does it work? Does the encoder specifically pass the bounding box coordinates to the rest of the network as part of the embeddings?


We don't understand how they work. We know how to make them, but much like the human brain, the very low level workings are beyond our understanding.

Broadly, it can be said that LLMs work by reducing uncertainty. Interestingly, human consciousness is also theorized by some to work the same way: reacting to input in ways that reduce uncertainty.


Internally, something similar is probably happening, but as far as I can tell it was never specifically designed to do that. Here is an example of the things GPT-4V is capable of: https://blog.roboflow.com/gpt-4-vision/amp/


> Could GPT-4V be used for robotic applications?

I have a quote I like in this context: "Could you learn gymnastics just from watching videos?" However intelligent an internet-trained model may be, I strongly believe you have to have some interaction with the real world to learn more complicated actions. So far, pick and place is all that's been shown with the bigger models, so my hypothesis seems to be holding for now.


Pick and place isn't all you can do by any stretch.

https://general-pattern-machines.github.io/

https://wayve.ai/thinking/lingo-natural-language-autonomous-...

It's just the straightforward application.

>I have a quote I like in this context, “Could you learn gymnastics just from watching videos?”

It's not like language models learn by some sort of magic osmosis. They're not just "reading" text or "watching" images. They learn by predicting, failing and adjusting neurons based on the data. Text is their world and they are interacting with it.


It's a bit slow for now, but yes, end-to-end LMM systems will surely replace all classic general-purpose robotics approaches eventually.


I was terrible at this task as a kid. When asked to describe an image I usually volunteered a single piece of information about it, and I had to be prompted multiple times and asked leading questions to observe and describe more of it. GPT-4V beats me even now. There was a lot of information in its descriptions that I wouldn't have noticed or included without being specifically asked.


I played with it; some cool things:

It can write a poem from an image.

It can read text from an image and 'understand' it.

Or even a specific part of the text. You can say "look at the bottom line".

It can recognize and list songs from an album cover.

It recognizes famous paintings. Even if only a fragment is given.

It can be used to create image-text datasets for generative and recognition tasks. Not sure how much this would cost.
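
A minimal sketch of the dataset idea (the model name, prompt, folder layout, and JSONL output format are all illustrative assumptions, not a recipe anyone here has verified):

    import base64
    import json
    from pathlib import Path

    from openai import OpenAI

    client = OpenAI()

    def caption(path: Path) -> str:
        """Ask the model for one descriptive caption of a single image."""
        b64 = base64.b64encode(path.read_bytes()).decode()
        resp = client.chat.completions.create(
            model="gpt-4-vision-preview",  # assumed vision-capable model name
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Describe this image in one detailed sentence."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content

    # Write (image, caption) pairs as JSONL, one record per line
    with open("captions.jsonl", "w") as out:
        for img_path in sorted(Path("images").glob("*.jpg")):
            out.write(json.dumps({"image": img_path.name,
                                  "caption": caption(img_path)}) + "\n")

For the cost question, you could caption a handful of samples first and look at resp.usage to see the per-image token counts before committing to a large batch.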


I was at a restaurant today and was curious what the sauce was made of. I took a photo of the dish, and given the city I was in, GPT-4 identified the name of the dish and all the ingredients correctly, as I verified with the waiter afterwards.

Felt like magic.


Did you strip the EXIF data or disable GPS location on your photos?


No, I didn't. Why would I bother? I explicitly told ChatGPT what city I was in.


Do you think that these images were in the GPT-4 training set?

maybe...


Actually, that's an interesting point, since the images are already used in the article this post is responding to. It seems likely.


Honestly, what sort of internal model of ML and AI do people who write blogs like this have?


As a college professor, I find that the descriptions from ChatGPT remind me a lot of freshman writing: they often flail around on extraneous details and have difficulty determining what is and isn't important.

It will be interesting to see how it improves.


GPT-4 in the web UI defaults to describing the image very tentatively and in general terms, and it seems intentional: doing that improves performance if you're expecting follow-up questions, similarly to chain of thought.

I expect the API with a low temperature will be much more decisive in describing images.
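
If anyone wants to test that, a quick sketch (model name and prompt are illustrative assumptions; the image URL is a placeholder) comparing the default sampling temperature with a low one on the same image:

    from openai import OpenAI

    client = OpenAI()
    IMAGE_URL = "https://example.com/photo.jpg"  # placeholder

    def describe(temperature: float) -> str:
        resp = client.chat.completions.create(
            model="gpt-4-vision-preview",  # assumed vision-capable model name
            temperature=temperature,       # lower values make sampling less random
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Describe this image in one decisive paragraph."},
                    {"type": "image_url", "image_url": {"url": IMAGE_URL}},
                ],
            }],
        )
        return resp.choices[0].message.content

    print(describe(1.0))  # the API default
    print(describe(0.2))  # low temperature: the prediction is this will be more decisive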


Indeed, there has been much progress.

The next big fundamental problem is "hallucination", or being totally wrong without detecting it.


The hallucinations are what allow for the intelligence. It's a feature, not a bug.


Same problem with human brains.


So it draws pictures like a grade school boy, a human whose "training set" involves TV and movies.

For seven years that doesn't seem unreasonable.



