Hacker News

GPT-4V is great for reasoning about what is on the screen. However, it struggles with precision. For example, it is not able to specify the coordinates to tap when it decides to tap an icon. That's where the object detection and accessibility elements help. We can precisely locate interactive elements.


Have you tried putting a pixel grid over the image with labelled guidelines every 100px?

Was one thing I never got around to testing with DemoTime but was always curious about.
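For anyone curious what that overlay looks like in practice, here is a minimal sketch, assuming Pillow is available. The function name, the red 1px guidelines, and the corner coordinate labels are just one illustrative choice, not anything from DemoTime:

```python
from PIL import Image, ImageDraw

def add_grid(img, step=100):
    """Draw labelled guidelines every `step` pixels so a vision model
    can reference absolute coordinates on the screenshot."""
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for x in range(0, w, step):
        draw.line([(x, 0), (x, h)], fill=(255, 0, 0), width=1)
        draw.text((x + 2, 2), str(x), fill=(255, 0, 0))   # label along top edge
    for y in range(0, h, step):
        draw.line([(0, y), (w, y)], fill=(255, 0, 0), width=1)
        draw.text((2, y + 2), str(y), fill=(255, 0, 0))   # label along left edge
    return img

# Demo on a blank canvas; in practice you'd open the screenshot instead.
gridded = add_grid(Image.new("RGB", (300, 200), "white"))
```

The gridded image then goes to the model with a note like "coordinates are labelled every 100px", so it can anchor its answers to the visible guidelines instead of estimating absolute positions.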

Anyway, sorry for the tangent. This is a nice product; congratulations on the launch.

Always good to see substantial tech.


Thanks! Yes, we experimented with that! I think that because GPT sees images in patches, it has a hard time with absolute positioning, but that's just a guess.


I've done something similar and found the same thing. It also could not calibrate even when I drew a dot at its last suggested coordinates.

"You said the play button was at 100, 200 and a green circle is drawn there. Is the circle located on the button, or do you need to adjust it?"

Something along those lines. It was also given the size of the image.

Nope, it's in the right ballpark, but it could not make fine adjustments or get any closer to the button.
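For concreteness, the verify-and-adjust loop described above can be sketched like this. The function name and exact prompt wording are hypothetical, not from the experiment itself:

```python
def calibration_prompt(label, x, y):
    """Build the follow-up question after drawing a green marker at the
    model's last suggested coordinates, asking it to confirm or adjust."""
    return (
        f"You said the {label} was at {x}, {y} and a green circle "
        f"is drawn there. Is the circle located on the {label}, "
        f"or do you need to adjust the coordinates?"
    )

followup = calibration_prompt("play button", 100, 200)
```

The idea is a feedback loop: draw the marker on the screenshot, re-send the annotated image with this prompt, and repeat until the model says the marker is on target. As the comment notes, in practice the model stays in the right ballpark but fails to converge on the exact button.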



