Zerlo.net Browser AI: Technical Details

This article explains how the experimental Browser AI by Zerlo.net works, where development currently stands, and where it is headed. The project is primarily a demand test.

Zerlo Team · 13.07.2025 · AI Development · 5 min

1. Introduction: Transparency for Browser AI

The term "AI" is widely used, and questions about the technical details behind it are justified. We aim to be transparent about our experimental Browser AI. This article explains the prototype: its functionality, current limitations, and development goals. The project primarily serves as a demand test. We present the facts without marketing language.

2. Technical Architecture of the Browser AI

Our Browser AI consists of several components, each running in an isolated Docker container under Kubernetes. The Screenshot Capture layer uses an instrumented Chromium build to produce a PNG screenshot of the browser viewport every 1–2 seconds. The screenshots go to a specialized Vision Encoder, a ResNet hybrid trained on 224x224 patches, which identifies visual elements such as buttons, text, and input fields. The LLM Controller, a GPT derivative with a 10,000-token context window, plans actions (click, input, scroll) based on this visual information. The Action Runner executes these actions with Puppeteer, including retry logic, and then requests new screenshots. A Memory Store built on LiteFS and Redis holds the history and system state. A Cost Guardrail caps token usage at 12,000 tokens per action. With open-weights models, this keeps the cost per action at approximately $0.0001. The average latency per action is about 600 milliseconds.
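To make the data flow concrete, here is a minimal sketch of one capture, encode, plan, execute cycle. It is not our production code: the names `run_step`, `Action`, and `TokenBudgetExceeded` are hypothetical, the four components are passed in as plain functions, and the guardrail is simplified to a single check after planning. Only the 12,000-token limit and the three action kinds come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

MAX_TOKENS_PER_ACTION = 12_000  # guardrail limit quoted in the article


@dataclass(frozen=True)
class Action:
    kind: str         # "click", "input", or "scroll"
    target: str       # hypothetical label for a detected element
    tokens_used: int  # tokens the planner spent on this action


class TokenBudgetExceeded(Exception):
    """Raised when planning one action costs more than the guardrail allows."""


def run_step(
    capture: Callable[[], bytes],
    encode: Callable[[bytes], List[str]],
    plan: Callable[[List[str]], Action],
    execute: Callable[[Action], bool],
    max_retries: int = 2,
) -> Tuple[Action, int]:
    """Run one capture -> encode -> plan -> execute cycle.

    Returns the action and the number of execution attempts it took.
    """
    screenshot = capture()         # PNG bytes of the viewport
    elements = encode(screenshot)  # detected buttons, texts, input fields
    action = plan(elements)        # planner picks the next action

    # Simplified guardrail: reject the action if planning was too expensive.
    if action.tokens_used > MAX_TOKENS_PER_ACTION:
        raise TokenBudgetExceeded(
            f"{action.tokens_used} tokens exceeds {MAX_TOKENS_PER_ACTION}"
        )

    # One initial attempt plus up to max_retries retries.
    for attempt in range(1, max_retries + 2):
        if execute(action):
            return action, attempt
    raise RuntimeError(f"{action.kind} on {action.target!r} failed after retries")


if __name__ == "__main__":
    outcomes = iter([False, True])  # first execution fails, the retry succeeds
    action, attempts = run_step(
        capture=lambda: b"fake-png",
        encode=lambda png: ["button:Login"],
        plan=lambda elements: Action("click", elements[0], tokens_used=900),
        execute=lambda a: next(outcomes),
    )
    print(action.kind, action.target, attempts)
```

Passing the components in as functions keeps the sketch runnable without a browser. In a real deployment each callable would wrap a separate service.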

3. Current Development Status and Success Rates (July 2025)

As of July 2025, the Browser AI is an advanced prototype, and success rates vary by task. Login tasks that involve reading two fields succeed about 75% of the time; CAPTCHAs, 2FA, and login redirects are the main obstacles. Newsletter forms are filled out correctly in approximately 68% of cases; honeypot fields can interfere here. A PDF download via a click chain succeeds in 55% of cases because path recognition still has gaps. A price comparison across three shops succeeds around 40% of the time, mainly because of cookie banners and varying shop structures. These rates refer to error-free execution without manual correction. Typically, three to five attempts are needed for stable task execution.
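A rough way to relate the per-attempt rates to the three-to-five-attempt figure is to ask how likely at least one of several attempts is to succeed. The sketch below assumes attempts are independent, which is a simplification: a CAPTCHA that blocks the first attempt will probably block the next one too. The function name `success_within` is hypothetical.

```python
def success_within(p: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - p) ** attempts


# Per-attempt success rates quoted in the article (July 2025).
RATES = {
    "login": 0.75,
    "newsletter form": 0.68,
    "PDF download": 0.55,
    "price comparison": 0.40,
}

if __name__ == "__main__":
    for task, p in RATES.items():
        three = success_within(p, 3)
        five = success_within(p, 5)
        print(f"{task}: 3 tries {three:.1%}, 5 tries {five:.1%}")
```

Under the independence assumption, the 40% price-comparison rate would compound to roughly 78% after three tries and 92% after five. Real-world numbers are likely lower, because failures on the same site tend to repeat.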

4. Reasons for the Experimental Nature

The Browser AI is experimental because the web itself is complex. Constantly changing DOM structures (classes, IDs) pose a challenge. Our Vision Encoder is compact, which can impair precise detection of very small buttons. A single action can require up to 20 LLM calls for planning and safety. Special cases such as Shadow DOM, iframes, and modals are frequent and require specific handling. Because the tool is based solely on screenshots, it makes decisions using only the visible viewport, much like a human interacting with the web purely through screenshots. As a result, it is not yet consistently reliable.
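The viewport restriction can be illustrated with a little geometry. The sketch below is an assumption-laden illustration, not our implementation: `Rect`, `visible_fraction`, and `scroll_needed` are hypothetical names, and coordinates are page pixels with the origin at the top left. It shows why an element below the fold is invisible to a screenshot-only agent until a scroll action brings it into view.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    width: int
    height: int


def visible_fraction(element: Rect, viewport: Rect) -> float:
    """Fraction of the element's area that lies inside the viewport."""
    area = element.width * element.height
    if area <= 0:
        return 0.0
    overlap_w = min(element.x + element.width, viewport.x + viewport.width) - max(
        element.x, viewport.x
    )
    overlap_h = min(element.y + element.height, viewport.y + viewport.height) - max(
        element.y, viewport.y
    )
    if overlap_w <= 0 or overlap_h <= 0:
        return 0.0
    return (overlap_w * overlap_h) / area


def scroll_needed(element: Rect, viewport: Rect) -> int:
    """Vertical scroll in pixels to bring the element into view.

    Positive scrolls down, negative scrolls up, 0 means no vertical scroll.
    """
    if element.y < viewport.y:
        return element.y - viewport.y
    element_bottom = element.y + element.height
    viewport_bottom = viewport.y + viewport.height
    if element_bottom > viewport_bottom:
        return element_bottom - viewport_bottom
    return 0


if __name__ == "__main__":
    viewport = Rect(0, 0, 1280, 720)
    half_hidden_button = Rect(100, 700, 80, 40)
    print(visible_fraction(half_hidden_button, viewport))
    print(scroll_needed(half_hidden_button, viewport))
```

In this example the button straddles the bottom edge of a 1280x720 viewport, so only part of it appears in the screenshot and the agent would have to scroll before it could click reliably.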

Illustration of the Browser AI in Action

Source: zerlo.net

Our experimental Browser AI operates solely on visual data. Every action is based on what is visible on the screen. This is its strength and its greatest limitation.

5. Roadmap Q3/Q4 2025: Planned Developments

The roadmap for Q3 and Q4 2025 is set. Self-Play Fine-Tuning has priority: it trains the agent autonomously on synthetic websites. A Hierarchical Memory Planner will be implemented to break large goals down into manageable steps. The Consent Solver will use a specialized model to recognize and close cookie banners reliably. We also plan to introduce User Macros, which let users save their own click sequences as "Gold Runs." The system will be trained on these to increase efficiency and reliability.
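Since User Macros are still on the roadmap, no storage format has been settled. As a purely illustrative sketch, a Gold Run could be stored as an ordered list of steps that round-trips through JSON. Every name here (`MacroStep`, `GoldRun`, the field names, the example URL) is a hypothetical placeholder, not a committed design.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class MacroStep:
    action: str      # "click", "input", or "scroll"
    target: str      # hypothetical description of the element to act on
    value: str = ""  # text to type for "input" steps, empty otherwise


@dataclass
class GoldRun:
    name: str
    start_url: str
    steps: List[MacroStep] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the run, including nested steps, to a JSON string."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, raw: str) -> "GoldRun":
        """Rebuild a run from the JSON produced by to_json."""
        data = json.loads(raw)
        steps = [MacroStep(**step) for step in data["steps"]]
        return cls(name=data["name"], start_url=data["start_url"], steps=steps)


if __name__ == "__main__":
    run = GoldRun(
        name="newsletter signup",
        start_url="https://example.com/newsletter",  # placeholder URL
        steps=[
            MacroStep("input", "email field", "user@example.com"),
            MacroStep("click", "subscribe button"),
        ],
    )
    restored = GoldRun.from_json(run.to_json())
    print(restored == run)
```

A plain, ordered record like this would be easy to share and to replay step by step. Whether it carries enough information for training is an open question.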

6. Long-Term Vision: The Universal Web Copilot

Our long-term vision extends beyond 2026. The goal is to develop a universal web copilot. This copilot will handle simple tasks such as logging in, booking, canceling, and paying. Additionally, seamless integration with calendars, email systems, and file storage is planned. A community-based task marketplace, similar to GitHub Actions, will allow users to share pre-made automations. For sensitive applications like online banking, local execution is planned to maximize security. The ultimate goal is automated browsing in the background for a "zero-wait experience," where web interactions occur without active user involvement.

Source: Zerlo.net

On the official Zerlo.net Browser AI page, you can test the project. Your interaction helps us assess the demand and further develop the tool.

7. Purpose of the Project: A Demand Test

The release of this Browser AI primarily serves one purpose: a DEMAND TEST. We use this prototype to collect valid data. Questions include: How many users engage? What tasks can be handled in real-world use? How often do operations fail, and why? If there is a quantifiable demand, we are ready to invest significantly in development, hosting, support, and an API. Otherwise, the project remains an open-source prototype.
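The questions above map onto a few simple aggregates. The following sketch shows how they could be computed from run logs. The record shape (`RunRecord`) and the `summarize` function are assumptions for illustration only; they do not describe what the prototype actually logs.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class RunRecord:
    user_id: str
    task: str
    succeeded: bool
    failure_reason: Optional[str] = None  # e.g. "captcha", "cookie banner"


def summarize(records: Iterable[RunRecord]) -> Dict[str, Any]:
    """Aggregate the demand-test questions: who, which tasks, how often it fails, why."""
    rows = list(records)
    failures = [r for r in rows if not r.succeeded]
    return {
        "unique_users": len({r.user_id for r in rows}),
        "runs_per_task": dict(Counter(r.task for r in rows)),
        "failure_rate": len(failures) / len(rows) if rows else 0.0,
        "failure_reasons": dict(
            Counter(r.failure_reason or "unknown" for r in failures)
        ),
    }


if __name__ == "__main__":
    sample = [
        RunRecord("u1", "login", True),
        RunRecord("u1", "login", False, "captcha"),
        RunRecord("u2", "price comparison", False, "cookie banner"),
        RunRecord("u3", "pdf download", False),
    ]
    print(summarize(sample))
```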

Every piece of feedback, every click, and every bug report helps us evaluate the necessity and direction of this project.
The Zerlo.net AI Team
Shaping the Future of Browsing

8. How You Can Help and Outlook

Your contribution is important. Test our Browser AI in your daily life: let it perform tasks, report errors, and tell us which tasks the AI should handle. Your experiences will determine whether this project grows beyond the prototype stage. Visit zerlo.net/en/browser-ai to participate.
