GPT-4V is great at reasoning about what is on the screen, but it struggles with precision. For example, when it decides to tap an icon, it cannot reliably produce the pixel coordinates of that icon. That's where object detection and accessibility elements help: with them, we can locate interactive elements precisely. A rough sketch of that division of labor is below.
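Here is a minimal Python sketch of the idea, assuming GPT-4V picks *which* element to tap by label while the accessibility tree or an object detector supplies the *where*. Names like `UIElement` and `resolve_tap_target` are illustrative, not from any actual codebase:

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    label: str      # accessibility label or detector class name
    x: int          # bounding box in screen pixels
    y: int
    width: int
    height: int

    @property
    def center(self) -> tuple[int, int]:
        return (self.x + self.width // 2, self.y + self.height // 2)

def resolve_tap_target(chosen_label: str, elements: list[UIElement]) -> tuple[int, int]:
    """Map the label GPT-4V chose to exact coordinates from the detected elements."""
    matches = [e for e in elements if e.label.lower() == chosen_label.lower()]
    if not matches:
        raise ValueError(f"No on-screen element labeled {chosen_label!r}")
    return matches[0].center

# In practice, `elements` would come from the accessibility tree
# or an object detector run on the screenshot.
elements = [
    UIElement("Settings", x=24, y=880, width=64, height=64),
    UIElement("Camera", x=120, y=880, width=64, height=64),
]
print(resolve_tap_target("Settings", elements))  # (56, 912)
```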
Thanks! Yes, we experimented with that! I think because of the way GPT-4V sees images as a grid of patches, it has a hard time with absolute positioning, but that's just a guess.
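If that guess is right, the intuition is easy to see in a toy example. The 32-pixel patch size below is an arbitrary assumption for illustration: nearby pixels collapse into the same patch token, so the model can only localize to the grid, not to the pixel.

```python
# Illustrative only: if a vision encoder tokenizes a screenshot into
# 32x32-pixel patches, any position the model "sees" is quantized to that grid.
PATCH = 32

def patch_of(x: int, y: int) -> tuple[int, int]:
    """Grid cell containing pixel (x, y)."""
    return (x // PATCH, y // PATCH)

print(patch_of(56, 912))  # (1, 28)
print(patch_of(57, 913))  # (1, 28) -- distinct pixels, same patch token
print(patch_of(80, 900))  # (2, 28) -- only 24px away, but a different cell
```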