OmniParser V2 – A simple screen parsing tool towards pure vision based GUI agent

rgovostes · 2025-02-15T20:45:39 1739652339

The OS has additional information including how different graphics layers are composited, and what accessibility metadata is attached to interface elements. It ought to be useful to exploit this to do better than screenshot parsing.

icodar · 2025-02-15T19:31:16 1739647876

This is not the intended use, but it works well for parsing document layout from images.

nighthawk454 · 2025-02-15T22:33:24 1739658804

One ponders the connections with the Recall feature.

NewUser76312 · 2025-02-15T20:15:28 1739650528

Very cool work. Accurate GUI text and element parsing is exactly the kind of input that LLMs need to be effective agents.