
I had an older version that used simplified HTML, and it got to decent performance with GPT-4o and Gemini, but at the cost of 10x the token usage. You are right: identifying the interactable elements and pulling their values into a prompt structure that explicitly lists the allowed next actions can boost performance, especially if done with grammar-constrained output such as structured outputs or guidance-llm. However, I saw that Claude had similar levels of performance with pure vision, and I felt that vision + more training would beat a specialized DOM algorithm due to "the bitter lesson".

BTW I really like your handling of browser tabs, I think it's really clever.
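
For anyone curious what "pulling out the interactable elements into a prompt structure" might look like, here is a rough sketch using only the Python standard library. It is not the code from either project; the tag list, the `InteractableExtractor` and `build_prompt` names, and the JSON reply format are all made up for illustration.

```python
from html.parser import HTMLParser

# Assumption: a small, non-exhaustive map of tags to the action they allow.
INTERACTABLE = {"a": "click", "button": "click", "input": "type",
                "textarea": "type", "select": "select"}
CLICKABLE_INPUTS = {"submit", "button", "checkbox", "radio"}


class InteractableExtractor(HTMLParser):
    """Collects interactable elements with an index, label, value and action."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = None  # <a>/<button> whose inner text we are still reading

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTABLE:
            return
        a = dict(attrs)
        action = INTERACTABLE[tag]
        if tag == "input" and (a.get("type") or "text").lower() in CLICKABLE_INPUTS:
            action = "click"
        el = {
            "id": len(self.elements),
            "tag": tag,
            "action": action,
            "label": a.get("aria-label") or a.get("placeholder") or a.get("name") or "",
            "value": a.get("value") or "",
        }
        self.elements.append(el)
        if tag in ("a", "button"):
            self._open = el

    def handle_data(self, data):
        # Use the inner text as the label only if no attribute supplied one.
        if self._open is not None and not self._open["label"]:
            self._open["label"] = data.strip()

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open["tag"]:
            self._open = None


def build_prompt(html, goal):
    """Return (prompt text, element list) for a hypothetical LLM call."""
    parser = InteractableExtractor()
    parser.feed(html)
    lines = [f"Goal: {goal}", "Interactable elements:"]
    for el in parser.elements:
        value = f' value="{el["value"]}"' if el["value"] else ""
        lines.append(
            f'[{el["id"]}] <{el["tag"]}> "{el["label"]}"{value} -> allowed: {el["action"]}'
        )
    lines.append('Reply with JSON: {"element": <id>, "action": <allowed action>, "text": <optional>}')
    return "\n".join(lines), parser.elements


if __name__ == "__main__":
    page = ('<div><a href="/login">Log in</a>'
            '<input name="q" placeholder="Search" value="cats">'
            '<button>Go</button></div>')
    prompt, _ = build_prompt(page, "search for dogs")
    print(prompt)
```

The idea is that the model only ever sees a short indexed list instead of the full DOM, and the fixed reply shape is what you would then enforce with structured outputs or a grammar.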



Fair. Also, Claude will probably only get better at this, since they kinda want people to use computer use. We're gonna try to do the best of both worlds.

Thanks man, Magnus came up with it this morning haha!



