CUA-Val

Course project for AA 228V: Validation of Autonomous Systems.

Full paper (PDF): cua_val.pdf

Problem

Computer-use agents (CUAs) execute irreversible actions on behalf of users, such as placing orders, filling out forms, and booking appointments. Mistakes are costly and hard to undo. For example, a malicious third-party seller on Amazon could embed a prompt injection in a product listing to trick a CUA into buying an irrelevant item outside the user's budget.

One natural safeguard is runtime monitoring: an external observer that flags violations before damage is done. But existing monitors (ShieldAgent, GuardAgent, WebGuard, RvLLM) all read from the same channel as the agent: the rendered DOM or page text. If a malicious webpage embeds a prompt injection that fools the agent, it may fool the monitor too.

Key Idea

Because existing monitors share the agent's observation channel, they inherit its injection vulnerabilities. The key idea of CUA-Val is to separate channels: a script monitor reads server-validated JSON (injection-immune in our attack model), while an LLM monitor reads page text (semantically capable but vulnerable).

Figure: observation channels, existing monitors vs. CUA-Val.

Existing monitors (ShieldAgent, GuardAgent, WebGuard, RvLLM):
    Web Page (hidden injection, e.g. <div hidden>…</div>)
        → DOM/text → CUA
        → DOM/text → Monitor
    ✗ same channel: both can be compromised

CUA-Val (ours):
    Web Page    → DOM/text → LLM monitor (vulnerable)  ┐
    Backend API → JSON     → Script monitor (immune)   ┘ union policy
    ✓ separated channels: an attacker must evade both

The script monitor queries structured, server-validated JSON from the backend API. In our attack model this channel is injection-immune: no client-side text can alter the server state the monitor reads. The LLM monitor reads page text to additionally catch semantic violations (goal drift, legitimacy) that structured data cannot express. Both run in parallel under a union policy: a violation flagged by either channel is surfaced, forcing an attacker to evade both simultaneously.
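The union policy is simple to state in code. Below is a minimal sketch, not the actual implementation: the order schema (`total_price`, `quantity`), the `policy` caps, and the injected `ask_llm` callable standing in for the LLM monitor's model call are all illustrative assumptions.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Verdict:
    violation: bool
    reasons: list = field(default_factory=list)


def script_monitor(order_json: str, policy: dict) -> Verdict:
    """Deterministic checks on server-validated JSON. Injection-immune:
    client-side page text cannot alter this channel."""
    order = json.loads(order_json)  # hypothetical schema
    reasons = []
    if order["total_price"] > policy["max_price"]:
        reasons.append(f"price {order['total_price']} exceeds cap {policy['max_price']}")
    if order["quantity"] > policy["max_quantity"]:
        reasons.append(f"quantity {order['quantity']} exceeds cap {policy['max_quantity']}")
    return Verdict(bool(reasons), reasons)


def llm_monitor(page_text: str, task: str, ask_llm) -> Verdict:
    """Semantic check on rendered page text. Vulnerable to injection, but
    catches violations (goal drift, legitimacy) that JSON cannot express."""
    flagged = ask_llm(f"Task: {task}\nPage: {page_text}\nPolicy violation? yes/no")
    return Verdict(flagged, ["LLM flagged semantic violation"] if flagged else [])


def union_policy(script_v: Verdict, llm_v: Verdict) -> Verdict:
    """Surface a violation if EITHER channel flags one."""
    return Verdict(script_v.violation or llm_v.violation,
                   script_v.reasons + llm_v.reasons)
```

Even if the LLM monitor is fully compromised (`ask_llm` always answers no), the script channel still flags any hard-constraint breach, which is the deterministic detection floor discussed in the results.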

Results

Detection Rates Across Models

Figure 1. Detection rates across six LLMs (ordered by LLM-only TPR). Left: all violated scenarios (n=55). Right: injection scenarios only (n=19). The weaker the LLM, the larger the hybrid lift.

Across 70 scenarios and six LLMs (9B to frontier scale), the hybrid monitor narrows the TPR range across models from 14.5–90.9% (LLM-only) to 70.9–96.4%. Under prompt injection, it maintains ≥73% TPR for all models (even when the LLM is fully compromised), because the script monitor provides a deterministic detection floor.

CMA-ES Falsification

Figure 2. CMA-ES evasion rate by attack type and monitor (100 evaluations per cell). Hard constraints (authority, false state) are unfalsifiable with the hybrid monitor.

Using CMA-ES falsification and cross-entropy importance sampling, we find that hard constraints (quantity, price) enforced on structured state are unfalsifiable. Soft-constraint evasions exist but occupy negligible volume under random attacks (p* < 10⁻⁴).
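For intuition on the importance-sampling side, here is a minimal sketch of the cross-entropy method on a toy one-dimensional attack parameter: the proposal distribution is iteratively tilted toward the rare evasion region, then an importance-weighted estimate recovers the tiny violation probability under the nominal attack distribution. Everything here (the Gaussian attack model, the scalar score, the fixed proposal width) is an illustrative assumption, not the paper's setup.

```python
import math
import random


def ce_rare_event_prob(score, threshold, mu0=0.0, sigma=1.0,
                       n=2000, elite_frac=0.1, iters=20, seed=0):
    """Estimate p* = P(score(x) >= threshold) for x ~ N(mu0, sigma) via the
    cross-entropy method. The proposal mean is tilted toward high scores;
    sigma is kept fixed for simplicity (a common stabilization)."""
    rng = random.Random(seed)
    mu = mu0
    for _ in range(iters):
        xs = sorted((rng.gauss(mu, sigma) for _ in range(n)),
                    key=score, reverse=True)
        elite = xs[:max(1, int(elite_frac * n))]
        if score(elite[-1]) >= threshold:
            break  # proposal now reaches the rare region
        mu = sum(elite) / len(elite)  # tilt mean toward the elite samples

    def logpdf(x, m):
        # Unnormalized log-density; the shared normalizer cancels in the ratio.
        return -0.5 * ((x - m) / sigma) ** 2

    # Importance-sampling estimate: weight = nominal density / proposal density.
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    return sum(math.exp(logpdf(x, mu0) - logpdf(x, mu))
               for x in xs if score(x) >= threshold) / n


# Toy check: P(x >= 4) for x ~ N(0, 1) is ~3.2e-5, far too rare for naive
# Monte Carlo at this sample size, but recoverable once the sampler is tilted.
p_hat = ce_rare_event_prob(lambda x: x, 4.0)
```

Naive sampling with n = 2000 would almost never see the event; the tilted proposal concentrates samples near the threshold and the weights correct for the bias, which is what makes probabilities on the order of 10⁻⁴ measurable at all.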

End-to-end: in 45 browser-agent trials, hybrid-retry matches the unmonitored success rate (27%) while surfacing 7.1 violations per run. It also reduces wall time by 8%, as correction messages steer the agent away from dead-end loops. Monitor overhead is negligible at 2.4% of wall time.