OmniParser API

Self-hosted API for Microsoft's OmniParser that converts UI screenshots to structured data

A self-hosted version of Microsoft's OmniParser image-to-text model. Built to overcome the limitations of the original Gradio interface, this FastAPI implementation offers significantly faster processing and eliminates rate limits.

OmniParser is a general screen-parsing tool that converts UI screenshots into a structured format to improve existing LLM-based UI agents. Its training datasets include an interactable icon detection dataset, curated from popular web pages and automatically annotated to highlight clickable and actionable regions, and an icon description dataset, designed to associate each UI element with its corresponding function.
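
To make "screenshot in, structured data out" concrete, here is a minimal client-side sketch in Python. It is not this project's documented API: the URL `http://localhost:8000/parse`, the `image_base64` payload field, and the response fields (`elements`, `bbox`, `description`, `interactable`) are all assumptions chosen for illustration, so check the actual server routes and schema before relying on any of them.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint: the real route and port are not stated in this document.
API_URL = "http://localhost:8000/parse"


def build_parse_request(image_bytes: bytes, url: str = API_URL) -> urllib.request.Request:
    """Wrap a screenshot in a JSON POST request (the payload field name is assumed)."""
    payload = {"image_base64": base64.b64encode(image_bytes).decode("ascii")}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def clickable_labels(response: dict) -> list[str]:
    """Collect descriptions of interactable elements from an assumed response shape."""
    return [
        el["description"]
        for el in response.get("elements", [])
        if el.get("interactable")
    ]


# A made-up response, so the sketch runs without a server.
sample_response = {
    "elements": [
        {"bbox": [10, 20, 110, 60], "description": "Search button", "interactable": True},
        {"bbox": [0, 0, 400, 18], "description": "Page title", "interactable": False},
    ]
}

request = build_parse_request(b"\x89PNG placeholder bytes")
print(request.get_method(), request.full_url)
print(clickable_labels(sample_response))
```

Against a running server, the request would be sent with `urllib.request.urlopen(request)` and the body decoded with `json.loads`. The sketch stops short of that so it stays runnable offline.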

Originally developed to power my web-browsing agent, OneQuery.

Project Info

Created: November 15, 2024