EvolFuzz • Ron Yifeng Wang

Course project for CS 295, Software Engineering.

Full paper (PDF): evolfuzz.pdf

Problem

Coverage-guided greybox fuzzers like AFL++ ship with hand-tuned mutation heuristics, such as bit flips, arithmetic, block operations, and splicing, applied with static probability distributions. These are general-purpose by design. We ask whether LLM-guided program synthesis can automatically discover target-specific mutation strategies instead.

Key Idea

EvolFuzz applies LLM-guided program synthesis within an AlphaEvolve-style evolutionary framework to discover target-specific mutation strategies automatically. The system evolves custom mutators written in C and has the following components:

An evolutionary database maintains a diverse population of mutator programs (80 programs, 5 islands)
An LLM code generator (Gemini-3.1-Pro) produces candidate C mutators by modifying marked evolution blocks in a diff-based manner
A cascade evaluator compiles each candidate into a shared library and fuzzes the target for 30 seconds
Fitness scoring keeps candidates that discover more edge transitions than vanilla AFL++
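
To make the components concrete, here is a minimal sketch of the kind of C custom mutator such a system could evolve. It uses AFL++'s custom mutator entry points (`afl_custom_init`, `afl_custom_fuzz`, `afl_custom_deinit`). Everything else is an assumption for illustration and is not taken from EvolFuzz: the `EVOLVE-BLOCK` comment markers, the `evol_mutator_t` struct, the `MAX_OUT` cap, and the single bit-flip strategy inside the marked block.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* A real mutator would #include "afl-fuzz.h"; an opaque forward declaration
   keeps this sketch compilable on its own. */
typedef struct afl_state afl_state_t;

#define MAX_OUT (1u << 16) /* hypothetical cap on the output buffer */

typedef struct {
  afl_state_t *afl; /* AFL++ internal state, available to evolved code */
  uint32_t rng;     /* private xorshift32 state */
  uint8_t *out;     /* reusable output buffer */
} evol_mutator_t;

static uint32_t next_rand(evol_mutator_t *m) {
  uint32_t x = m->rng;
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  return m->rng = x;
}

void *afl_custom_init(afl_state_t *afl, unsigned int seed) {
  evol_mutator_t *m = calloc(1, sizeof(*m));
  if (!m) return NULL;
  m->out = malloc(MAX_OUT);
  if (!m->out) { free(m); return NULL; }
  m->afl = afl;
  m->rng = seed ? seed : 1u; /* xorshift32 must not start at zero */
  return m;
}

size_t afl_custom_fuzz(void *data, uint8_t *buf, size_t buf_size,
                       uint8_t **out_buf, uint8_t *add_buf,
                       size_t add_buf_size, size_t max_size) {
  evol_mutator_t *m = data;
  assert(m && out_buf);
  (void)add_buf;      /* splice donor, unused in this sketch */
  (void)add_buf_size;

  size_t len = buf_size;
  if (len > max_size) len = max_size;
  if (len > MAX_OUT) len = MAX_OUT;
  if (len) memcpy(m->out, buf, len);

  /* EVOLVE-BLOCK-START (hypothetical marker: the LLM edits only this region) */
  if (len > 0) {
    size_t pos = next_rand(m) % len;
    m->out[pos] ^= (uint8_t)(1u << (next_rand(m) % 8)); /* flip one bit */
  }
  /* EVOLVE-BLOCK-END */

  *out_buf = m->out;
  return len;
}

void afl_custom_deinit(void *data) {
  evol_mutator_t *m = data;
  if (!m) return;
  free(m->out);
  free(m);
}
```

A file like this would be compiled into a shared library and handed to AFL++ through the `AFL_CUSTOM_MUTATOR_LIBRARY` environment variable. Keeping the fixed scaffolding outside the marked block is one way to let diff-based edits change strategy without breaking the interface.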

[System diagram: Evolutionary Database (MAP-Elites · 80 programs · 5 islands) → best program → LLM Code Generator (Gemini 3.1 Pro · diff-based edits) → candidate C source → Cascade Evaluator (1. gcc compile · 2. AFL++ fuzz, 30s) → coverage ratio → Fitness Scorer (keep if score > 1.0×) → Evolutionary Database]

EvolFuzz system loop. The database provides parent programs → the LLM generates candidate C source → the evaluator compiles & fuzzes → the scorer keeps improvements → back to the database.
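
The scoring step in this loop can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the `eval_result_t` struct and the function names are hypothetical, and a candidate that fails to compile is assumed to score 0.0 so that it is never kept.

```c
#include <stdbool.h>

/* Outcome of the two-stage cascade for one candidate (hypothetical type). */
typedef struct {
  bool compiled;       /* stage 1: did the compile step succeed? */
  unsigned long edges; /* stage 2: edge transitions found in the fuzz run */
} eval_result_t;

/* Coverage ratio = candidate edges / vanilla AFL++ edges on the same target.
   Candidates that never reached stage 2 get 0.0. */
double coverage_ratio(eval_result_t candidate, unsigned long vanilla_edges) {
  if (!candidate.compiled || vanilla_edges == 0) return 0.0;
  return (double)candidate.edges / (double)vanilla_edges;
}

/* Keep only strict improvements over the vanilla baseline (score > 1.0x). */
bool keep_candidate(eval_result_t candidate, unsigned long vanilla_edges) {
  return coverage_ratio(candidate, vanilla_edges) > 1.0;
}
```

Ordering the cascade so that the cheap compile check runs first means a broken candidate costs almost nothing, while only compilable candidates spend the 30-second fuzzing budget.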

Mutators are evolved in C rather than Python because C mutators receive a pointer to AFL++'s internal state, which includes dictionary tokens, corpus metadata, and execution statistics. In our ablation study, we found that this richer information drives the improvement.

On libpng, the LLM discovered two key format-aware heuristics through coverage feedback alone: endian-aware multi-byte values (50% chance of byte-swap, since PNG uses big-endian but x86 is little-endian) and insert-or-overwrite splicing (preserving existing PNG chunk structure instead of destroying it).
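
The two heuristics can be sketched in C as follows. This is one plausible reading of the description above, not the evolved mutator itself: the function names, the caller-supplied `coin` bit, and the fallback from insert to overwrite when the buffer is full are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value in four bytes, big-endian or little-endian. */
void put_u32(uint8_t *dst, uint32_t v, int big_endian) {
  for (int i = 0; i < 4; i++) {
    int shift = big_endian ? 8 * (3 - i) : 8 * i;
    dst[i] = (uint8_t)(v >> shift);
  }
}

/* Endian-aware overwrite: the low bit of `coin` (a random word supplied by
   the caller) picks the byte order, so each order is used half of the time.
   Returns 1 if the value was written, 0 if it did not fit. */
int overwrite_u32_endian_aware(uint8_t *buf, size_t len, size_t pos,
                               uint32_t v, uint32_t coin) {
  if (len < 4 || pos > len - 4) return 0;
  put_u32(buf + pos, v, (int)(coin & 1u));
  return 1;
}

/* Insert-or-overwrite splice. With `insert` set and enough spare capacity,
   donor bytes are inserted and the tail shifts right, so the bytes that
   follow (e.g. later chunks) survive. Otherwise the donor overwrites in
   place and the length is unchanged. Returns the new length. */
size_t splice_bytes(uint8_t *buf, size_t len, size_t cap, size_t pos,
                    const uint8_t *donor, size_t n, int insert) {
  assert(buf && donor && len <= cap);
  if (pos > len) pos = len;
  if (insert && n <= cap - len) {
    memmove(buf + pos + n, buf + pos, len - pos); /* shift the tail right */
    memcpy(buf + pos, donor, n);
    return len + n;
  }
  if (n > len - pos) n = len - pos; /* clamp to the bytes available */
  memcpy(buf + pos, donor, n);
  return len;
}
```

For example, PNG stores chunk lengths as big-endian 32-bit integers, so writing the value 13 with `big_endian` set yields the bytes `00 00 00 0D`, whereas the little-endian order yields `0D 00 00 00`.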

Results

Coverage Improvement

Figure 1. Coverage ratio (evolved / vanilla) across five MAGMA targets (30s, deterministic seed). Structured binary formats benefit most.

All five evolved mutators outperform vanilla AFL++. The largest gain is on OpenSSL’s DER/ASN.1 parsing (+32%), with consistent 5–6% improvements on libpng, libtiff, and libxml2. sqlite3 (+1.4%) benefits least, since text-based formats are harder to improve with byte-level mutations.

Bug-Finding Quality

Figure 2. Trigger rate ratio (EvolFuzz / vanilla) per bug, on the MAGMA ground-truth benchmark. Even when bug counts are tied, evolved mutators produce 2–28× higher trigger rates.

On MAGMA’s ground-truth benchmark (30-minute campaigns), EvolFuzz triggers one additional bug on libpng (PNG007). More importantly, trigger-rate analysis shows that evolved mutators are 2–28× more likely to trigger a vulnerability per reach of the buggy code across all targets. For sqlite3/SQL002, despite fewer total reaches (4.1M vs 5.4M), EvolFuzz triggers the bug 28× more often.
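
The trigger rate ratio can be written down as a small helper. This sketch assumes that a trigger rate is the number of triggers divided by the number of reaches, which is our reading of "per code reach"; the function names are hypothetical and no measured trigger counts are implied.

```c
/* Fraction of reaches of the buggy code that actually trigger the bug. */
double trigger_rate(unsigned long triggers, unsigned long reaches) {
  return reaches ? (double)triggers / (double)reaches : 0.0;
}

/* Ratio of the evolved mutator's trigger rate to the vanilla baseline's.
   Returns 0.0 when the baseline never triggers, since the ratio is then
   undefined. */
double trigger_rate_ratio(unsigned long evolved_triggers,
                          unsigned long evolved_reaches,
                          unsigned long vanilla_triggers,
                          unsigned long vanilla_reaches) {
  double base = trigger_rate(vanilla_triggers, vanilla_reaches);
  if (base <= 0.0) return 0.0;
  return trigger_rate(evolved_triggers, evolved_reaches) / base;
}
```

Under this definition a fuzzer can reach the buggy code less often and still score a higher ratio, as long as a larger share of its reaches satisfy the trigger condition.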

Evolution Dynamics

We share two findings from the evolution trace:

Key breakthroughs can happen late. On OpenSSL, the first 25 iterations hovered near 1.00×, then a strategy shift at iterations 26–27 jumped coverage to 1.32×.
Across all targets, Gemini-3.1-Pro achieved a 100% compile rate. This shows that a frontier LLM can generate working mutators without any additional prompting, fine-tuning, or postprocessing.