[WIP] Benchmarking Local LLMs Against Coding Agent Harnesses

I’ve been running a small benchmark, harness-bench, that pairs local LLMs (served via llama.cpp’s llama-server) with agent harnesses (Aider, Claude Code, OpenCode, Pi, Qwen CLI) on 16 software-engineering tasks across Python, PyTorch, JAX, C, C++, Rust, and SQL. Each (model, harness, task) cell is sandboxed: the agent works inside a scratch workspace/, and grading is done by a hidden test.sh the agent never sees. The current sweep is 17 model-quants × 5 harnesses × 16 tasks = 1360 runs on a single M3 Max / 128 GB laptop.
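
The orchestration around each cell is deliberately boring – roughly the sketch below, with paths and the harness invocation as illustrative assumptions (the real runner lives in the private repo):

# A sketch of one graded cell. The agent runs inside a scratch workspace and
# never sees tasks/<id>/test.sh; the grader gets the workspace path afterwards.
import json, pathlib, subprocess, tempfile, time

def run_cell(task_id, harness_cmd, prompt):
    workspace = pathlib.Path(tempfile.mkdtemp(prefix=f"{task_id}-"))
    t0 = time.monotonic()
    subprocess.run([*harness_cmd, prompt], cwd=workspace, timeout=1800)
    graded = subprocess.run(["bash", f"tasks/{task_id}/test.sh", str(workspace)])
    return {"task": task_id, "passed": graded.returncode == 0,
            "secs": round(time.monotonic() - t0, 1)}

# One JSONL record per (harness, task) cell:
print(json.dumps(run_cell("p2_shortest_path", ["pi", "-p", "--no-session"], "<prompt>")))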

(The benchmark repository itself – task prompts, hidden graders, raw agent traces – is private, to keep the task set out of training corpora. If the prompts and graders ended up in a future model’s pretraining data, the benchmark would stop measuring what it claims to measure. Aggregated results, the per-cell CSV, and the plotting source for this post are public; the tasks are not.)

These are very preliminary findings – several of the numbers below deserve a careful re-run before I’d trust the rankings to two decimal places. But the headline patterns are stable enough across the Q4 and Q8 sweeps that I think they’re worth writing down.

The 16 tasks span seven languages and a wide spread of difficulties:

[Figure: Per-task difficulty across the full sweep]

The hardest end (pt3_rope_gqa, jax1_complex_lp, pt7_prompt_blend, pt6_generate_cached, rs1_arena, pt5_logit_lens) is what discriminates the top tier of (model, harness) cells from the rest – everyone passes sql1_recursive and p2_shortest_path.

Q1 – What are the best (model, harness) combinations?

The single perfect cell on the matrix is Qwen3.6-27B (UD-Q4_K_XL) + pi – 16/16 at ~207 s/task. It is the only combination across all 85 tested cells (Q4 sweep + Q8 sweep) that clears the full benchmark.

If you care about throughput, gpt-oss-120b (MXFP4) + pi scores 15/16 at ~34 s/task – about 6× faster than the top combo for one extra failure, and the fastest member of the 15/16 tier overall. If you want a “dense-feeling” mid-size model, Qwen3.6-35B-A3B (UD-Q4_K_XL) + qwen hits 15/16 at ~108 s/task.

Eight cells cleared 15/16 or better:

Combo                                     Pass   Avg time
Qwen3.6-27B UD-Q4_K_XL + pi               16/16  207 s
gpt-oss-120b MXFP4 + pi                   15/16   34 s
Qwen3.6-35B-A3B UD-Q4_K_XL + qwen         15/16  108 s
Qwen3.6-35B-A3B Q8_0 + claude             15/16  244 s
gemma-4-26B-A4B-it UD-Q4_K_XL + opencode  15/16  307 s
Qwen3.6-27B Q8_0 + qwen                   15/16  395 s
gemma-4-31b-it Q4_K_M + pi                15/16  422 s
Qwen3.6-27B Q8_0 + opencode               15/16  462 s

pi shows up 3 times in the 15/16-or-better tier; qwen and opencode twice; claude once; aider not at all.

A clearer way to see the same information is to plot every cell in the matrix on (avg time, pass rate). The Pareto frontier is what you actually care about as a user:

[Figure: Speed-accuracy frontier across all (model, quant, harness) cells]

Qwen3.6-27B + pi (top-left blue circle, ~207 s) is the only 100% point. gpt-oss-120b + pi (purple diamond at the far left) gets you ~94% in ~34 s – about 6× faster, one extra failure. The bottom-right cluster of cells past 700 s is mostly aider and opencode paying for verbose tool-use scaffolding without recovering accuracy.
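
For completeness, extracting the frontier is a single sorted pass over the aggregated cells – seeded here with the three combos just named rather than the full matrix:

# Keep a cell iff nothing faster has an equal-or-better pass rate.
def pareto(cells):  # cells: [(name, avg_secs, pass_rate), ...]
    frontier, best = [], -1.0
    for name, secs, rate in sorted(cells, key=lambda c: c[1]):
        if rate > best:
            frontier.append((name, secs, rate))
            best = rate
    return frontier

print(pareto([("Qwen3.6-27B+pi", 207, 16/16),
              ("gpt-oss-120b+pi", 34, 15/16),
              ("Qwen3.6-35B-A3B+qwen", 108, 15/16)]))
# -> the 34 s and 207 s points survive; the 108 s point is dominated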

Q2 – How do the models rank?

Marginalising over the 5 harnesses (Q4 sweep, 5 × 16 = 80 cells per model):

Model                        Quant       Pass   Rate   Avg agent time
Qwen3.6-27B                  UD-Q4_K_XL  66/80  82.5%  474 s
gemma-4-31b-it               Q4_K_M      65/80  81.2%  582 s
Qwen3.6-35B-A3B              UD-Q4_K_XL  64/80  80.0%  215 s
gpt-oss-120b                 MXFP4       62/80  77.5%   67 s
Qwen3-Coder-Next             UD-Q4_K_XL  56/80  70.0%  162 s
gemma-4-26B-A4B-it           UD-Q4_K_XL  56/80  70.0%  397 s
Qwen3.5-35B-A3B              Q4_K_M      55/80  68.8%  123 s
Qwen3.5-27B                  Q4_K_M      55/80  68.8%  402 s
gpt-oss-20b                  Q4_K_M      45/80  56.2%   96 s
Qwen3-Omni-30B-A3B-Instruct  Q4_K_M      27/80  33.8%  113 s
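
Mechanically, this table (and the harness table in Q3) is one groupby over the public per-cell CSV – the column names below are my assumption about its layout:

# One row per (model, harness, task); "passed" is a 0/1 column.
import pandas as pd

cells = pd.read_csv("results/q4_cells.csv")
by_model = cells.groupby("model")["passed"].agg(passes="sum", total="size", rate="mean")
by_harness = cells.groupby("harness")["passed"].agg(passes="sum", total="size", rate="mean")
print(by_model.sort_values("rate", ascending=False))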

A few things stand out. The dense Qwen3.6-27B and the (relatively) new gemma-4-31b-it are roughly tied at the top, but the MoE Qwen3.6-35B-A3B (3B active in a 35B body) is essentially as accurate at less than half the wall-clock cost. gpt-oss-120b at MXFP4 buys another ~3× speedup for ~2.5 points of pass rate. And Qwen3-Omni-30B-A3B-Instruct – the omni-modal sibling of Qwen3.6-35B-A3B – is the worst model in the sweep at 33.8%, suggesting that omni-tuning has a real cost for code-only agentic use.

Plotted against total parameter count (point size = active parameters, colour = family):

[Figure: Model size vs pass rate, Q4 sweep]

Three things to take from this. First, size alone is not the right axis – the 120-billion-parameter gpt-oss-120b lands below the 27-billion-parameter Qwen3.6-27B, and the omni-modal 30B point is the lowest in the chart. Second, active parameters do most of the work: in the 25-35B band the dense models (Qwen3.6-27B, gemma-4-31b-it) and the 3B-active MoEs (Qwen3.6-35B-A3B) sit on essentially the same accuracy line, while the gpt-oss-120b point (drawn much smaller, since point size tracks active parameters) sits ~5 points below. Third, family is the strongest single signal – the omni-tuned point is dramatically off-trend.

Q3 – How do the harnesses rank?

Marginalising over the 10 Q4 models (10 × 16 = 160 cells per harness):

Harness   Pass     Rate   Avg agent time
pi        123/160  76.9%  163 s
qwen      120/160  75.0%  191 s
claude    106/160  66.2%  306 s
opencode  102/160  63.8%  271 s
aider     100/160  62.5%  384 s

pi is the strongest harness on this benchmark – and also the fastest of the top three. The accuracy gap between pi/qwen and claude/opencode/aider is a clear ~9-14 percentage points, big enough that I wouldn’t want to call it noise. (Caveat: Claude Code in particular is being talked to via the Anthropic-compat shim against a local llama-server, which is not the configuration its prompt scaffolding was tuned for.)

[Figure: Per-harness pass rate, Q4 sweep]

The (model × harness) breakdown shows that the harness ranking isn’t uniform across models – some models are very harness-sensitive (gpt-oss-120b collapses with opencode, Qwen3.6-27B collapses with aider), others are relatively flat (Qwen3-Omni, gpt-oss-20b). Note the single 16/16 cell (top-left) and the gemma-4-26B-A4B-it + opencode outlier (94% pass rate against opencode’s otherwise mediocre numbers) – a clear hint that any cross-harness conclusion based on a single cell is shaky.

[Figure: Pass rate per (model, harness) cell, Q4 sweep]

Q4 – Do models or harnesses cheat by reading the hidden tests?

This is the question I was most curious about going in, and the answer is more interesting than I expected.

I grepped every agent.log for references to test.sh, tasks/.../test.sh, or “grader” – both in the model’s own reasoning and in tool calls (Read, Bash, Glob). Per harness:

Harness   Mentions test.sh / grader  Actually read or ran the hidden grader
pi         0                          0
qwen       5                          0
claude     7                          0
aider     10                          0
opencode  23                         14
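
The scan itself is nothing fancier than the pass below – the trace layout and tool-call markers are assumptions about the (private) log format:

# Separate "mentions the grader in prose" from "actually touched it via a tool".
import pathlib, re

MENTION = re.compile(r"test\.sh|grader", re.IGNORECASE)
TOUCHED = re.compile(r"\b(Read|Bash|Glob)\b.*test\.sh")

for log in sorted(pathlib.Path("results").rglob("agent.log")):
    text = log.read_text(errors="replace")
    n = len(MENTION.findall(text))
    if n:
        print(f"{log}: {n} mention(s); touched grader: {bool(TOUCHED.search(text))}")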

For qwen, claude, aider, the references are all benign: the model paraphrases the prompt’s own mention of “the hidden grader” or “the grader checks …” while reasoning out loud. None of them tried to glob, read, or invoke the grader.

opencode is a different story. Across 14 distinct (model, task) cells, opencode either ran bash <repo>/tasks/<id>/test.sh <workspace> directly to get a pass/fail signal, or read the contents of the hidden test.sh file with its Read tool – the latter giving the model the literal reference implementation and tolerance constants. A representative trace from Qwen3.5-27B-Q8 + opencode on pt3_rope_gqa:

AssertionError: output does not match either RoPE layout
  (half=0.9700, interleaved=0.8935)
→ Read ../../../../../tasks/pt3_rope_gqa/test.sh
"Let me look at the test file to understand exactly what it's expecting"
← Edit rope_gqa.py

Of those 14 peeking cells, 13 passed. So this isn’t a marginal effect: when opencode peeks at the grader, it usually wins. Now, two important caveats:

  1. The hidden test.sh files live outside the workspace (under tasks/<id>/test.sh). The benchmark is sandboxed only by convention – the agent has full filesystem access to ~/workspace/harness-bench/, so all five harnesses could technically peek. Only opencode actually does, by default. That’s a property of the harness, not of the underlying model.
  2. The two distinct behaviours (running the grader vs. reading the grader source) carry very different weights. Running bash test.sh workspace to iterate on a pass/fail signal is closer to “generous test-time compute” than to leaking labels; reading the test.sh source – which contains reference numerical tolerances and sometimes inlined reference implementations – is straight-up data leakage.

I do not see evidence of pi, qwen, claude, or aider attempting either of these things in any of the 640 (= 4 harnesses × 160) Q4 cells I inspected. Nor do I see models hard-coding test inputs into their solutions – the pt6_generate_cached task has an explicit anti-cheat (it counts embedding-layer token visits to catch implementations that ignore the KV cache and just re-run the model), and pass rates there look behaviourally consistent.
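
For reference, counting embedding-layer token visits is a one-hook affair in PyTorch – a sketch of that style of anti-cheat, not the actual pt6 grader (gpt2 stands in for the model under test):

# A KV-cached generate() embeds each token roughly once (prompt + one per new
# token); an implementation that re-runs the full prefix every step embeds
# O(n^2) tokens, and the counter gives it away.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

visits = 0
def count_visits(module, inputs, output):
    global visits
    visits += inputs[0].numel()  # token ids embedded in this forward call

hook = model.get_input_embeddings().register_forward_hook(count_visits)
ids = tok("def fib(n):", return_tensors="pt").input_ids
model.generate(ids, max_new_tokens=16, use_cache=True)
hook.remove()
print(visits)  # ~ prompt_len + 16 with the cache; far larger without it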

So: yes, one harness cheats by default, and it’s opencode. The implication is that any cross-harness comparison that includes opencode is mildly contaminated – specifically, ~13% of opencode’s passes (13 of 102, the peeking cells that went on to pass) coincide with grader peeking. If I subtract those, opencode’s pass rate drops from 63.8% (102/160) to roughly 56% (89/160), which would put it last. I’m planning to re-run with a stricter sandbox (chroot or container) before drawing harder conclusions, when/if I find the time.

Q5 – Does 4-bit vs 8-bit quantisation make a difference?

For the seven sub-50B models, I re-ran the full sweep at Q8_0. The headline is that Q8 is a slight net regression on this benchmark, not an improvement:

Model             Q4 Pass  Q8 Pass  Δ   Q4 jax1  Q8 jax1
Qwen3.6-35B-A3B   64/80    66/80    +2  5/5      3/5
Qwen3.6-27B       66/80    65/80    -1  2/5      4/5
gemma-4-31b-it    65/80    59/80    -6  0/5      0/5
Qwen3.5-27B       55/80    52/80    -3  2/5      0/5
Qwen3.5-35B-A3B   55/80    51/80    -4  5/5      3/5
Qwen3-Coder-Next  56/80    55/80    -1  0/5      2/5
gpt-oss-20b       45/80    52/80    +7  3/5      2/5
7-model total     406/560  400/560  -6  17/35    14/35

[Figure: Q4 → Q8 paired pass-rate slope chart]
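
The Δ column (and its -6 total) is a two-line computation over the two sweeps’ per-cell CSVs – file and column names are assumptions:

import pandas as pd

q4 = pd.read_csv("results/q4_cells.csv").groupby("model")["passed"].sum()
q8 = pd.read_csv("results/q8_cells.csv").groupby("model")["passed"].sum()
delta = (q8 - q4).dropna().astype(int)  # Q4-only models drop out as NaN
print(delta.sort_values())              # gemma-4-31b-it -6 ... gpt-oss-20b +7
print(delta.sum())                      # -6 across the seven paired models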

A few observations:

  • The aggregate is essentially flat (-6 / 560 = -1%) – well within the noise I’d expect from re-running a stochastic agent loop. Q4 is not obviously degraded relative to Q8 at this scale.
  • Q8 is also strictly slower. Qwen3.5-27B averages 402 s/task at Q4 and takes noticeably longer at Q8; the bandwidth cost of 8-bit weights on Apple Silicon is real.
  • Per-model, the swings are noisy and not all in the same direction. Only gpt-oss-20b clearly benefits from Q8 (+7, the biggest single-model swing in either direction). gemma-4-31b-it loses the most at Q8 (-6, mostly distributed across small per-task regressions, including a 13/16 → 8/16 collapse for gemma-4-31b-it + opencode).
  • Quantisation effects show up most clearly on the hardest, numerically-finicky tasks. The two cleanest examples are jax1_complex_lp (ComplEx + filtered MRR) – where Q8 gives 14/35 vs Q4’s 17/35, mixed at the model level – and pt7_prompt_blend, the only task where Q8 cleanly beats Q4 (51% vs 42%). The dominant pt7 failure mode is a structural one (returning unmixed branch logits instead of probability-space-averaged log-probs), and Q8 appears to help models stay coherent over the longer reasoning chain that pt7 needs.

The summary I’d give on quantisation is: Q4_K_M (or UD-Q4_K_XL) is the right default on Apple Silicon for this kind of agentic-coding workload. You give up almost nothing on the 16-task average and you get 1.5-2× the throughput. The two exceptions are gpt-oss-20b (small, gains real accuracy at Q8) and any workload that leans heavily on long-chain numerical correctness like pt7_prompt_blend.

How to actually run the leading combinations

If you just want to reproduce the top cells of the matrix on your own laptop, the recipe is: launch llama-server in one terminal, point a harness at it in another. Every harness wrapper in harness-bench expects the same alias (bench-model) on :8001, so the only thing that changes between rows of the leaderboard is the -hf flag on the server and the harness binary you call.

Server: llama-server

-hf auto-downloads the GGUF from HuggingFace; the q8 KV cache is what makes 131k context fit in 128 GB. One line per top combo:

# Qwen3.6-27B (UD-Q4_K_XL) -- top combo with `pi`: 16/16 at ~207 s/task
llama-server -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL --alias bench-model --port 8001 --host 127.0.0.1 --ctx-size 131072 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --jinja -np 1

# gpt-oss-120b (MXFP4, 32k ctx) -- fastest 15/16 combo with `pi` (~34 s/task)
llama-server -hf ggml-org/gpt-oss-120b-GGUF --alias bench-model --port 8001 --host 127.0.0.1 --ctx-size 32768 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --jinja -np 1

# Qwen3.6-35B-A3B (MoE, 3B active, UD-Q4_K_XL) -- 15/16 with `qwen` at ~108 s/task
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL --alias bench-model --port 8001 --host 127.0.0.1 --ctx-size 131072 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --jinja --chat-template-kwargs '{"enable_thinking":false}' -np 1

# gemma-4-26B-A4B-it (UD-Q4_K_XL) -- smallest 15/16 cell, with `opencode` (~307 s/task)
llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --alias bench-model --port 8001 --host 127.0.0.1 --ctx-size 131072 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --jinja -np 1

# gemma-4-31b-it (Q4_K_M, 65k ctx) -- 15/16 with `pi` at ~422 s/task
llama-server -hf unsloth/gemma-4-31b-it-GGUF:Q4_K_M --alias bench-model --port 8001 --host 127.0.0.1 --ctx-size 65536 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --jinja -np 1
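
On the “q8 KV cache makes 131k fit” claim, the back-of-envelope arithmetic is below – the layer/head dimensions are invented for illustration, not the real dims of any model above:

# KV bytes = 2 (K and V) x layers x kv_heads x head_dim x ctx x bytes/elem
def kv_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

ctx, layers, kv_heads, hdim = 131_072, 48, 8, 128
print(kv_bytes(ctx, layers, kv_heads, hdim, 2) / 2**30)  # f16 cache: ~24 GiB
print(kv_bytes(ctx, layers, kv_heads, hdim, 1) / 2**30)  # q8_0: ~12 GiB (+ block scales)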

Wait for curl http://127.0.0.1:8001/v1/models to return something before pointing a harness at it. (The orchestrator uses scripts/wait_ready.sh for this – worth stealing if you script your own runs.)
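
If you script your runs in Python rather than shell, a stand-in for that wait loop (not the actual wait_ready.sh) looks like:

import time, urllib.request

def wait_ready(url="http://127.0.0.1:8001/v1/models", timeout=300):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as r:
                if r.status == 200:
                    return
        except OSError:  # connection refused while the model is still loading
            pass
        time.sleep(2)
    raise TimeoutError(f"{url} not ready after {timeout}s")

wait_ready()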

Harness: interactive vs. non-interactive

Each harness has two modes worth knowing about: non-interactive (read prompt, do work, exit – this is what the benchmark runs and what you want for scripted use) and interactive (drop into a REPL/TUI, useful when you want to iterate or look over the agent’s shoulder). Pick the one matching your use case:

# pi -- top harness on this benchmark (76.9% on Q4 sweep)
# Non-interactive: -p reads the prompt and exits; -nc/-ns/-ne/-np disable AGENTS.md /
# skills / extensions / prompt-template discovery (fairness in the sweep)
pi --provider local-llama --model bench-model -p --no-session -nc -ns -ne -np "<prompt>"
# Interactive: drop the -p flag (and keep the discovery flags off if you want a clean run)
pi --provider local-llama --model bench-model

# qwen -- fastest near-top harness (75.0%)
# Non-interactive: pass the prompt as a positional argument; --approval-mode yolo
# auto-approves all tool calls (you really do want this for an unattended run)
qwen --openai-base-url "http://127.0.0.1:8001/v1" --openai-api-key "dummy" --model "bench-model" --approval-mode yolo "<prompt>"
# Interactive: omit the prompt argument
qwen --openai-base-url "http://127.0.0.1:8001/v1" --openai-api-key "dummy" --model "bench-model"

# claude -- via the Anthropic-compat shim against llama-server
# Non-interactive: -p reads stdin / a positional prompt; --bare disables MCP autoload;
# --dangerously-skip-permissions is the equivalent of qwen's `yolo`
ANTHROPIC_BASE_URL="http://127.0.0.1:8001" ANTHROPIC_API_KEY="sk-ant-local-dummy" \
  claude --bare --model bench-model -p --dangerously-skip-permissions "<prompt>"
# Interactive: drop -p (the TUI manages the prompt itself)
ANTHROPIC_BASE_URL="http://127.0.0.1:8001" ANTHROPIC_API_KEY="sk-ant-local-dummy" \
  claude --bare --model bench-model --dangerously-skip-permissions

# opencode -- run command is one-shot; opencode without it is the TUI
# Non-interactive: `opencode run` with the prompt
opencode run -m "llamacpp/bench-model" "<prompt>"
# Interactive: just `opencode` in the workspace
opencode

# aider -- git-native; --message is one-shot, no flags = REPL
# Non-interactive: --message + --yes-always; --no-stream is friendlier to local servers
OPENAI_API_BASE="http://127.0.0.1:8001/v1" OPENAI_API_KEY="dummy" \
  aider --model "openai/bench-model" --no-git --yes-always --no-stream --message "<prompt>"
# Interactive: drop --message and you get the standard aider REPL
OPENAI_API_BASE="http://127.0.0.1:8001/v1" OPENAI_API_KEY="dummy" \
  aider --model "openai/bench-model" --no-git

A few footguns worth flagging:

  • pi reads its provider wiring from ~/.pi/agent/models.json. A minimal entry pointing at the local server:

    {"providers": {"local-llama": {
        "baseUrl": "http://127.0.0.1:8001/v1",
        "api": "openai-completions",
        "apiKey": "dummy",
        "models": [{"id": "bench-model", "input": ["text"],
                    "contextWindow": 131072, "maxTokens": 16384}]}}}

    Without this, --provider local-llama will fail with a “no such provider” error before the harness even talks to the server.
  • Claude Code goes through the Anthropic-compat path on :8001 (no /v1). The other four harnesses use the OpenAI-compat path on :8001/v1. Easy to mix up.
  • opencode peeks at hidden files by default (see Q4). If you’re running it on tasks with hidden graders living outside the workspace, that’s a problem; if you’re using it for normal coding work, it’s a feature. Either way, worth knowing.
  • -np 1 on llama-server disables prompt-batch parallelism, which keeps memory predictable on the M3 Max. If you’re on a bigger machine, drop it.

If you want the orchestrator to do all of the above for you – start the server, walk a harness through every task, append a JSONL record per (harness, task) – the one-liner is:

python3 scripts/run_bench.py --model unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL --ctx 131072 --harnesses pi --tasks all --resume

--resume skips cells already present in results/<model>.jsonl, which is what makes long sweeps survivable.

The harness-bench repo (task prompts, hidden graders, raw agent traces) is private for the contamination reasons noted at the top – happy to share access on request for collaborators or for replication, just drop me a line.
