Hacker News
OmniParser V2 – A simple screen parsing tool towards pure vision based GUI agent (github.com/microsoft)
64 points by punnerud 10 months ago
4 comments
rgovostes 10 months ago
The OS has additional information including how different graphics layers are composited, and what accessibility metadata is attached to interface elements. It ought to be useful to exploit this to do better than screenshot parsing.
icodar 10 months ago
This is not the intended use, but it works well for parsing document layout from images.
nighthawk454 10 months ago
One ponders the connections with the Recall feature
NewUser76312 10 months ago
Very cool work. Accurate GUI text and element parsing is exactly the kind of input that LLMs need to be effective agents.