Execution plan for the rework described in
REARCHITECTURE.md. Designed to be driven by many sub-agents in parallel — but safely (agents never edit the same file). Governing rule: EXTREME PARITY with codex on engine features; preserve only the intentional divergences listed inREARCHITECTURE.md("Guiding principle"). Status: plan only — do not implement yet. This is the map we execute when we decide to go.
Legend: [ ] not started · [~] in progress · [x] implemented and parity/e2e tests green. A phase is done only when every box under it (including its Gate) is [x]. Nothing gets [x] until it runs and passes tests (Rule 8). Keep this list in sync as WPs land.
Phase 0 — Carve
- [~] 0.1 Module carve — ~16 cohesive modules extracted (constants, images, auth, goals, memory_citations, rollback, codex_headers, review, config_overrides, persistence, prompts, tool_output, request_user_input, assistant_markup, token_usage), lib.rs 47.3k→44.3k, 488 tests green. Plateau: the remaining tangled core (providers_glue, context, compact, session, events, agents_md, hooks, skills, plugins, subagents, turn) can't be cleanly split by pure code-motion — it needs the async + interface rework it's a prerequisite for. Finish the rest during the async migration (Phase 2).
- 0.2 Freeze interfaces (provider/route, ToolRuntime/Approvable/Sandboxable, context, store, errors)
- 0.3 Parity-test harness (cassettes, fixtures, corpus scaffold)
- Gate 0 — compiles (sync), modules carved, interfaces frozen, harness runs
Phase 1 — browser-use-llm
- 1.1
schema/(LlmRequest/Message/ContentPart/LlmEvent/Usage/ids/options/error) — 5 round-trip tests green - 1.2
route/traits + client + Lifecycle + ToolStream — traits + Lifecycle/ToolStream/SSE framing + async ModelClient/executor all done (retry/backoff, rate-limit parse, secret redaction, network-freedecode_chunkstest).6708ac4. - 1.3
protocols/openai_responses— build_body + SSE decoder via Lifecycle/ToolStream; 6 fixture tests - 1.4
protocols/openai_chat(Ollama/OpenRouter/DeepSeek/Fireworks) — 5 fixture tests (tool calls correlated by stream index) - 1.5
protocols/anthropic_messages— Messages wire format (system-block array, thinking-signature round-trip, event-named SSE); 9 fixture tests. (Claude-Code OAuth is auth, not wire format — lands with the auth/client layer.) - 1.6
route/executor(retry / rate-limit / redaction) - 1.7
providers/facades + openai-compatible{provider,base_url}table - 1.8
tool.rs/tool_runtime.rs - 1.9 Drop codex/ChatGPT backend (de-codex)
- Gate 1 — 3 protocols pass cassette parity; live smoke (OpenAI/Anthropic/Ollama); codex backend gone
Phase 2 — Async spine + orchestrator seam
- 2.1 Async turn loop (codex parity)
- 2.2 Context manager + real token accounting + context-message injection
- 2.3 ToolOrchestrator + runtime/approval/sandbox seam (sandbox=None stub)
- 2.4 Session/resume async over SQLite (write-sink)
- Gate 2 — full async turn end-to-end through orchestrator; persist + resume
Phase 3 — Tools → parity
- 3.1 shell + unified_exec
- 3.2 apply_patch (+ turn_diff_tracker)
- 3.3 view_image (KEEP sync/serial)
- 3.4 update_plan
- 3.5 request_user_input
- 3.6 tool_search + deferred exposure
- 3.7 web_search
- 3.8 browser tools adapter
- 3.9 python tool adapter
- 3.10 mcp tool dispatch
- Gate 3 — every tool parity/divergence-tested
Phase 4 — Subsystems → parity
- 4.1 Compaction (codex parity, model-based; no-LLM path removed)
- 4.2 Context-message system (reference_context alignment)
- 4.3 MCP transports (HTTP/SSE/OAuth/elicitation; fix plugin-oauth bug)
- 4.4 Subagents (roles-as-config, depth limits, event-notify mailbox)
- 4.5 Goals/budget parity
- 4.6 Skills & plugins parity (+ install/elicitation)
- 4.7 Hooks (PermissionRequest event + Prompt/Agent handlers)
- 4.8 Prompts (de-brand; selection parity)
- 4.9 Rollout/resume hardening (archival/fork/truncation over SQLite)
- Gate 4 — subsystem parity tests green; no-LLM compaction gone
Phase 5 — Safety substrate
- 5.1 OS sandboxes (seatbelt/landlock+seccomp/bwrap/windows) + process-hardening
- 5.2 execpolicy (Starlark) + amendments + command_canonicalization
- 5.3 Network proxy/MITM + deferred network approval
- 5.4 Guardian (LLM reviewer, circuit breaker, fork reviewer)
- 5.5 Wire approvals into orchestrator; flip sandbox None→real; PermissionRequest precedence
- Gate 5 — commands sandboxed; approvals/guardian enforced; escalation works
Phase 6 — Advanced parity
- 6.1 code_mode (V8 tools-as-code)
- 6.2 agent_jobs (CSV fan-out)
- 6.3 shared-service subthreads (codex_delegate equiv)
- 6.4 memories ingestion/storage
- 6.5 connectors (OAuth MCP) + session prewarm
Cross-cutting
- De-codex audit (no residual
codex/CODEX_/oai/chatgptexcept intentional) - Final full e2e pass (browser + tools + compaction + subagents + resume, against a live model)
- The unit of work is a work package (WP). Each WP = one sub-agent task, going deep / first-principles.
- WPs are grouped into phases (by dependency) and, within a phase, into parallel groups (WPs that touch disjoint files can run at the same time).
- There is no human in the loop. The orchestrating agent drives everything: it runs each phase as a wave — spin up the group's agents in parallel, review their design notes, let them implement, run the phase gate (integration + parity + e2e tests), merge, then start the next wave. It also does the serial work (the Phase 0 carve) itself.
- The goal is completeness, not speed. Parallelism here is for clean isolation and depth — not to finish fast. Every subsystem must be fully implemented, run, and tested; a phase is not done until it actually works.
- Disjoint files only. Two agents may run concurrently only if their owned-file sets do not intersect. Every WP below lists Owns (exact files/dirs).
- The carve gates everything. While the engine is one 47k-line
lib.rs, parallel edits collide. Phase 0 (the split) must complete before high-parallelism phases. Phase 0 itself is low-parallelism. - Interfaces first, frozen. Shared types/traits/errors are built serially, frozen as contracts, then feature agents code against them (not against each other).
- Worktree isolation for concurrent coders. Each parallel agent works in its own git worktree (or branch); the orchestrating agent merges at the phase gate. (Workflow
isolation: "worktree".) - Each WP de-brands its own files. The de-codex rename (REARCHITECTURE §4) is done inside each WP's owned files — never as a global sweep mid-phase (that would touch everyone's files). A final audit WP catches stragglers.
- A phase isn't done until its gate is green and merged. No starting a dependent wave on un-merged work.
- Tests first, commit first. Each WP writes + commits its parity/e2e tests before implementing the feature, then commits working increments frequently (don't batch a whole subsystem into one commit). The Phase 0 carve commits per extraction step.
- Run it, don't just compile it. Acceptance = the behavior runs and tests pass against a live model (see §5 Testing) — not that it builds.
This is the "go extremely deep into every single thing" contract. Reviewed by the coordinator before implementation starts:
- Purpose — what is this actually for? (the agent-/user-facing behavior it provides)
- Mechanism — how does codex do it (types, flow, file:line)? How do we do it today?
- Heuristics — enumerate codex's exact heuristics/constants/thresholds/decision-rules (the real numbers).
- Heuristic decision — which value/behavior do we pick? Default = match codex exactly. Any divergence must be one of the sanctioned ones (REARCHITECTURE "Guiding principle") or explicitly justified here.
- Parity gap — precisely what differs/missing today vs codex.
- Tasks — concrete implementation steps.
- Parity tests — tests that prove behavior matches codex (same triggers → same outputs → same limits). "Feature exists" is not acceptance; "behaves identically" is.
- Owns — exact file list (collision-safety).
crates/
browser-use-protocol/ # exists — shared wire/event types (de-branded)
browser-use-llm/ # NEW — provider/protocol layer (opencode design)
browser-use-core/ # agent engine, split into submodules:
src/turn/ # the async turn loop
src/context/ # context manager (REAL tokens) + context-message injection
src/compact/ # compaction (codex parity, model-based)
src/session/ # session lifecycle, resume, rollback/fork (over store)
src/orchestrator/ # ToolOrchestrator + ToolRuntime/Approvable/Sandboxable seam
src/tools/<tool>.rs # one file per tool handler
src/goals/ # goals/budget
src/subagents/ # multi-agent v1+v2
src/skills/ # skills + plugins
src/hooks/ # hooks engine
src/prompts/ # prompt assembly + catalog
browser-use-mcp/ # NEW — MCP client + transports (extracted from mcp.rs)
browser-use-sandbox/ # NEW — OS sandboxes + execpolicy + network proxy + hardening
browser-use-guardian/ # NEW — LLM approval reviewer
browser-use-store/ # exists — SQLite (extend, keep)
browser-use-browser/ # exists — CDP/JS harness (black box)
browser-use-python-worker/ # exists
browser-use-tui/ # exists
browser-use-cli/ # exists
Engine stays one crate (browser-use-core) with submodules — submodule files give enough isolation for one-agent-per-file parallelism without the cost of extracting a dozen crates. Clearly separable concerns (llm, mcp, sandbox, guardian) become their own crates.
Phase 0 Carve + interfaces ── gates everything
│
├── Phase 1 browser-use-llm core ──┐
│ ├── Phase 2 Async spine + orchestrator seam
└───────────────────────────────────┘ │
├── Phase 3 Tools → parity (HIGH parallel)
├── Phase 4 Subsystems → parity (HIGH parallel)
│ │
└── Phase 5 Safety substrate (needs orchestrator seam)
│
Phase 6 Advanced parity (HIGH parallel)
Legend — P = parallelism of the phase. Each WP: Owns (files) · Dep (prereqs) · Group (parallel cohort).
Goal: modular layout + frozen interfaces, behavior-preserving, still sync. This is the bottleneck; do it carefully with frequent commits. Do not parallelize across lib.rs.
| WP | Owns | Dep | Group |
|---|---|---|---|
| 0.1 Module/crate skeleton + mechanical move of code into submodules/crates (no behavior change, sync) | whole workspace (esp. core/src/lib.rs → submodules) |
— | solo |
0.2 Freeze interfaces: ModelProvider/route traits, ToolRuntime/Approvable/Sandboxable, context-manager trait, store trait, error enums |
core/src/*/mod.rs trait files, browser-use-llm/src/route/*.rs (signatures only) |
0.1 | after 0.1 |
| 0.3 Parity-test harness: provider cassettes, fixture framework, codex-parity corpus scaffolding | tests/parity/** (new) |
— | parallel w/ 0.2 |
Gate 0: workspace compiles (sync), all subsystems live in their own module/crate, public interfaces frozen, parity harness runnable. Now one-agent-per-module is collision-safe.
Sequencing tip: 0.1 may be split into ordered extraction commits by subsystem (turn → context → compact → tools → goals → subagents → skills → hooks → mcp → store), but each commit is serial (single owner) because they all touch
lib.rs.
Foundation (serial), then protocols (parallel). See REARCHITECTURE §3.
| WP | Owns | Dep | Group |
|---|---|---|---|
1.1 schema/ — LlmRequest/Message/ContentPart/LlmEvent/Usage/ids/options/error |
llm/src/schema/** |
0.2 | foundation (serial) |
1.2 route/ traits + client + Lifecycle + ToolStream helpers |
llm/src/route/**, llm/src/protocols/utils/** |
1.1 | foundation (serial) |
1.3 protocols/openai_responses.rs |
that file | 1.2 | wave A (parallel) |
1.4 protocols/openai_chat.rs (Ollama/OpenRouter/DeepSeek) |
that file | 1.2 | wave A |
1.5 protocols/anthropic_messages.rs (+ Claude-Code OAuth, identity/tool-name canon, system-block, freeform→JSON downgrade) |
that file + providers/anthropic.rs |
1.2 | wave A |
1.6 route/executor.rs — retry/backoff, rate-limit headers, secret redaction |
that file | 1.2 | wave A |
1.7 providers/ facades + OpenAI-compatible {provider,base_url} table |
llm/src/providers/** |
1.2 | wave A |
1.8 tool.rs/tool_runtime.rs (define-once tools → per-protocol lowering) |
those files | 1.2 | wave A |
| 1.9 Drop the codex/ChatGPT backend (de-codex) | remove CodexResponsesProvider + its headers/OAuth/env |
0.1 | wave A |
Gate 1: all three protocols pass cassette parity tests; live smoke against OpenAI, Anthropic (key + OAuth), Ollama; codex backend gone.
The engine becomes async; everything routes through one orchestrator. Tightly coupled WPs — coordinate.
| WP | Owns | Dep | Group |
|---|---|---|---|
2.1 Async turn loop — codex parity: unbounded loop{} on needs_follow_up, CancellationToken tree, FuturesOrdered tool scheduling, abort/interrupt |
core/src/turn/** |
1.x, 0.2 | spine (coordinate w/ 2.3) |
2.2 Context manager — REAL token accounting (per-provider Usage), history assembly/normalization, context-message injection |
core/src/context/** |
1.1 | parallel w/ 2.4 |
2.3 ToolOrchestrator + ToolRuntime/Approvable/Sandboxable seam (sandbox=None stub, auto-approve); unify shell/exec/apply_patch/mcp routing |
core/src/orchestrator/** |
0.2 | spine (coordinate w/ 2.1) |
| 2.4 Session/resume async over SQLite — keep store; wire async load/replay | core/src/session/**, store async API |
0.2 | parallel w/ 2.2 |
Gate 2: an async agent runs a full turn end-to-end through the orchestrator (sandbox off), dispatches tools, persists + resumes a session.
All depend on Phase 2; mutually disjoint → run as one big wave.
| WP | Owns | Parity target |
|---|---|---|
| 3.1 shell + unified_exec | core/src/tools/shell.rs, unified_exec*.rs |
persistent PTY sessions, stdin, timeouts, truncation; route through orchestrator |
| 3.2 apply_patch | core/src/tools/apply_patch.rs |
V4A format + turn_diff_tracker parity |
| 3.3 view_image | core/src/tools/view_image.rs |
KEEP sync/serial (intentional) — test that it never runs parallel with browser actions |
| 3.4 update_plan | core/src/tools/plan.rs |
spec + semantics parity |
| 3.5 request_user_input | core/src/tools/request_user_input.rs |
parity |
| 3.6 tool_search + deferred exposure | core/src/tools/tool_search.rs |
bm25 + deferral threshold parity |
| 3.7 web_search (hosted) | core/src/tools/web_search.rs |
parity |
| 3.8 browser tools adapter | core/src/tools/browser.rs |
thin adapter over browser-use-browser (divergence, keep) |
| 3.9 python tool adapter | core/src/tools/python.rs |
adapter over python-worker (divergence, keep) |
| 3.10 mcp tool dispatch | core/src/tools/mcp.rs |
calls browser-use-mcp |
Gate 3: every tool has a parity (or divergence) test passing.
All disjoint modules; depend on Phase 2 (+3 for tool-touching ones). The core "EXTREME PARITY" wave.
| WP | Owns | Parity target |
|---|---|---|
| 4.1 Compaction | core/src/compact/** |
codex parity, model-based only (remove no-LLM dump): triggers on real token usage (Total + BodyAfterPrefix + full-window), exact summary prompt, preserve full COMPACT_USER_MESSAGE_MAX_TOKENS=20_000 user budget, byte-identical summary prefix, context-window-exceeded drop-oldest handling |
| 4.2 Context-message system | core/src/context/messages.rs |
decision: keep our name-tagged messages vs adopt codex reference_context diff — pick one, justify; match codex injection points/content |
| 4.3 MCP transports | browser-use-mcp/** |
add HTTP/SSE/streamable + OAuth + elicitation; honor plugin type/oauth (fix the silently-ignored bug); keep stdio |
| 4.4 Subagents | core/src/subagents/** |
role-as-config-layer, spawn-depth limits, mailbox semantics; event-notify (in-memory) instead of the 50ms SQLite poll — messages still dumped to SQLite for the record, never polled |
| 4.5 Goals/budget | core/src/goals/** |
audit accounting formula + steering events to parity |
| 4.6 Skills & plugins | core/src/skills/** |
discovery/budgeting/mentions parity + plugin install/elicitation |
| 4.7 Hooks | core/src/hooks/** |
add PermissionRequest event + Prompt/Agent handler kinds |
| 4.8 Prompts | core/src/prompts/** |
de-brand; parity with codex prompt selection; keep browser preamble + interaction-skills + sync-image notes |
| 4.9 Rollout/resume hardening | core/src/session/** (≠4.x), store |
archival, fork modes, truncation parity — over SQLite |
Gate 4: each subsystem's parity tests pass; no-LLM compaction is gone.
Depends on Phase 2 orchestrator seam. 5.1–5.4 parallel (separate crates); 5.5 integrates last.
| WP | Owns | Parity target |
|---|---|---|
| 5.1 OS sandboxes + process-hardening | browser-use-sandbox/src/{seatbelt,landlock,bwrap,windows,hardening}.rs |
per-platform enforcement parity |
| 5.2 execpolicy (Starlark) + amendments + command_canonicalization | browser-use-sandbox/src/execpolicy/** |
declarative allowlist + persisted prefix amendments |
| 5.3 Network proxy/MITM + deferred network approval | browser-use-sandbox/src/network/** |
per-host Allow/Deny/Ask, persisted rules |
| 5.4 Guardian (LLM reviewer) | browser-use-guardian/** |
fail-closed, circuit breaker, ephemeral fork reviewer session |
5.5 Wire approvals into orchestrator; flip sandbox None→real; PermissionRequest hooks precedence |
core/src/orchestrator/** |
full approval→sandbox→escalate flow parity |
Gate 5: commands run sandboxed, approvals/guardian enforced, escalation works.
| WP | Owns | Notes |
|---|---|---|
| 6.1 code_mode (V8 tools-as-code) | browser-use-codemode/** (new) |
nested tool calls via the router |
| 6.2 agent_jobs (CSV fan-out) | core/src/subagents/jobs.rs |
bounded concurrency, crash-recovery |
| 6.3 shared-service subthreads (codex_delegate equiv) | core/src/subagents/delegate.rs |
approval forwarding to parent |
| 6.4 memories ingestion/storage | browser-use-memories/** (new) or store |
beyond citation parsing |
| 6.5 connectors (OAuth MCP) + session prewarm | browser-use-mcp/connectors.rs, core/src/session/prewarm.rs |
remote tool ecosystem |
- Phase 0: ~1 agent (serial carve) + 1 (harness). Bottleneck — invest here.
- Phase 1: 2 serial (schema, route) → 7-way parallel wave.
- Phase 2: ~3-4 agents, coordinated (spine).
- Phase 3: ~10-way parallel (one per tool).
- Phase 4: ~9-way parallel (one per subsystem).
- Phase 5: ~4-way parallel + 1 integrator.
- Phase 6: ~5-way parallel.
Peak concurrency ≈ 9–10 agents (Phases 3/4) — but parallelism is for isolation/depth, not speed. Each agent works in its own worktree; the orchestrating agent merges at gates and runs the integration + parity + e2e tests. We optimize for everything implemented, run, and tested, not for finishing fast.
- De-codex rename: done inside each WP's owned files; a final audit WP greps for residual
codex/CODEX_/oai/chatgptand confirms only intentional references remain. - Parity corpus: grow the codex-parity fixture set continuously; every WP adds its own cases (own files).
- End-to-end testing (required): every phase gate runs the real agent against a live model, and a final full e2e pass exercises browser flows + tools + compaction + subagents + resume. Use the existing codex auth (already configured) as the live-model vehicle for dev/testing. (Transition note: the codex backend provider is slated for removal — REARCHITECTURE §4 — so once OpenAI/Anthropic direct auth is wired, e2e should run through those; codex auth stays a fallback test vehicle until then.)
- Commit cadence: tests committed first, then frequent working-increment commits per WP (Rule 7).
- Decision log:
DECISIONS.mdrecords sanctioned divergences + each "codex offered options, we picked X".
- Context messages: align with codex's typed
reference_contextmechanism for parity; our extra kinds become additional typed items. - Subagent wait: event-notify (in-memory); SQLite is write-only here.
- Crate granularity: engine submodules + separate crates for
llm/mcp/sandbox/guardian. - Async migration: top-down behind frozen Phase-0 interfaces, optimized for correctness/testability (not speed).
- v1 providers: OpenAI (Responses), Anthropic (Messages + Claude-Code OAuth), Ollama, DeepSeek, OpenRouter, Fireworks (via the
openai-chatprotocol where compatible). More are just config entries — added later freely.
This plan is intentionally static and review-ready. When we decide to execute, each WP becomes a sub-agent task seeded with its "Owns" set and the per-WP design-note template; phases run as waves with gates between them.