[WIP] Benchmarking Local LLMs Against Coding Agent Harnesses

I’ve been running a small benchmark, harness-bench, that pairs local LLMs (served via llama.cpp’s llama-server) with agent harnesses (Aider, Claude Code, OpenCode, Pi, Qwen CLI) on 16 software-engineering tasks across Python, PyTorch, JAX, C, C++, Rust, and SQL. Each (model, harness, task) cell is sandboxed: the agent works inside a scratch workspace/, and grading is done by a hidden test.sh the agent never sees. The current sweep is 17 model-quants × 5 harnesses × 16 tasks = 1360 runs on a single M3 Max / 128 GB laptop.
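
The orchestration around each cell is deliberately boring – roughly the sketch below, with paths and the harness invocation as illustrative assumptions (the real runner lives in the private repo):

# A sketch of one graded cell. The agent runs inside a scratch workspace and
# never sees tasks/<id>/test.sh; the grader gets the workspace path afterwards.
import json, pathlib, subprocess, tempfile, time

def run_cell(task_id, harness_cmd, prompt):
    workspace = pathlib.Path(tempfile.mkdtemp(prefix=f"{task_id}-"))
    t0 = time.monotonic()
    subprocess.run([*harness_cmd, prompt], cwd=workspace, timeout=1800)
    graded = subprocess.run(["bash", f"tasks/{task_id}/test.sh", str(workspace)])
    return {"task": task_id, "passed": graded.returncode == 0,
            "secs": round(time.monotonic() - t0, 1)}

# One JSONL record per (harness, task) cell:
print(json.dumps(run_cell("p2_shortest_path", ["pi", "-p", "--no-session"], "<prompt>")))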

(The benchmark repository itself – task prompts, hidden graders, raw agent traces – is private, to keep the task set out of training corpora. If the prompts and graders ended up in a future model’s pretraining data, the benchmark would stop measuring what it claims to measure. Aggregated results, the per-cell CSV, and the plotting source for this post are public; the tasks are not.)

These are very preliminary findings – several of the numbers below deserve a careful re-run before I’d trust the rankings to two decimal places. But the headline patterns are stable enough across the Q4 and Q8 sweeps that I think they’re worth writing down.

The 16 tasks span seven languages and a wide spread of difficulties:

[Figure: Per-task difficulty across the full sweep]

The hardest end (pt3_rope_gqa, jax1_complex_lp, pt7_prompt_blend, pt6_generate_cached, rs1_arena, pt5_logit_lens) is what discriminates the top tier of (model, harness) cells from the rest – everyone passes sql1_recursive and p2_shortest_path.

Q1 – What are the best (model, harness) combinations?

The single perfect cell on the matrix is Qwen3.6-27B (UD-Q4_K_XL) + pi – 16/16 at ~207 s/task. It is the only combination across all 85 tested cells (Q4 sweep + Q8 sweep) that clears the full benchmark.

If you care about throughput, gpt-oss-120b (MXFP4) + pi scores 15/16 at ~34 s/task – about 6× faster than the top combo for one extra failure, and the fastest member of the 15/16 tier overall. If you want a “dense-feeling” mid-size model, Qwen3.6-35B-A3B (UD-Q4_K_XL) + qwen hits 15/16 at ~108 s/task.

Eight cells cleared 15/16 or better:

Combo                                     Pass   Avg time
Qwen3.6-27B UD-Q4_K_XL + pi               16/16  207 s
gpt-oss-120b MXFP4 + pi                   15/16   34 s
Qwen3.6-35B-A3B UD-Q4_K_XL + qwen         15/16  108 s
Qwen3.6-35B-A3B Q8_0 + claude             15/16  244 s
gemma-4-26B-A4B-it UD-Q4_K_XL + opencode  15/16  307 s
Qwen3.6-27B Q8_0 + qwen                   15/16  395 s
gemma-4-31b-it Q4_K_M + pi                15/16  422 s
Qwen3.6-27B Q8_0 + opencode               15/16  462 s

pi shows up 3 times in the 15/16-or-better tier; qwen and opencode twice; claude once; aider not at all.

A clearer way to see the same information is to plot every cell in the matrix on (avg time, pass rate). The Pareto frontier is what you actually care about as a user:

[Figure: Speed-accuracy frontier across all (model, quant, harness) cells]

Qwen3.6-27B + pi (top-left blue circle, ~207 s) is the only 100% point. gpt-oss-120b + pi (purple diamond at the far left) gets you ~94% in ~34 s – about 6× faster, one extra failure. The bottom-right cluster of cells past 700 s is mostly aider and opencode paying for verbose tool-use scaffolding without recovering accuracy.
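
For completeness, extracting the frontier is a single sorted pass over the aggregated cells – seeded here with the three combos just named rather than the full matrix:

# Keep a cell iff nothing faster has an equal-or-better pass rate.
def pareto(cells):  # cells: [(name, avg_secs, pass_rate), ...]
    frontier, best = [], -1.0
    for name, secs, rate in sorted(cells, key=lambda c: c[1]):
        if rate > best:
            frontier.append((name, secs, rate))
            best = rate
    return frontier

print(pareto([("Qwen3.6-27B+pi", 207, 16/16),
              ("gpt-oss-120b+pi", 34, 15/16),
              ("Qwen3.6-35B-A3B+qwen", 108, 15/16)]))
# -> the 34 s and 207 s points survive; the 108 s point is dominated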

Q2 – How do the models rank?

Marginalising over the 5 harnesses (Q4 sweep, 5 × 16 = 80 cells per model):

Model                        Quant       Pass   Rate   Avg agent time
Qwen3.6-27B                  UD-Q4_K_XL  66/80  82.5%  474 s
gemma-4-31b-it               Q4_K_M      65/80  81.2%  582 s
Qwen3.6-35B-A3B              UD-Q4_K_XL  64/80  80.0%  215 s
gpt-oss-120b                 MXFP4       62/80  77.5%   67 s
Qwen3-Coder-Next             UD-Q4_K_XL  56/80  70.0%  162 s
gemma-4-26B-A4B-it           UD-Q4_K_XL  56/80  70.0%  397 s
Qwen3.5-35B-A3B              Q4_K_M      55/80  68.8%  123 s
Qwen3.5-27B                  Q4_K_M      55/80  68.8%  402 s
gpt-oss-20b                  Q4_K_M      45/80  56.2%   96 s
Qwen3-Omni-30B-A3B-Instruct  Q4_K_M      27/80  33.8%  113 s
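
Mechanically, this table (and the harness table in Q3) is one groupby over the public per-cell CSV – the column names below are my assumption about its layout:

# One row per (model, harness, task); "passed" is a 0/1 column.
import pandas as pd

cells = pd.read_csv("results/q4_cells.csv")
by_model = cells.groupby("model")["passed"].agg(passes="sum", total="size", rate="mean")
by_harness = cells.groupby("harness")["passed"].agg(passes="sum", total="size", rate="mean")
print(by_model.sort_values("rate", ascending=False))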

A few things stand out. The dense Qwen3.6-27B and the (relatively) new gemma-4-31b-it are roughly tied at the top, but the MoE Qwen3.6-35B-A3B (3B active in a 35B body) is essentially as accurate at less than half the wall-clock cost. gpt-oss-120b at MXFP4 buys another ~3× speedup for ~2.5 points of pass rate. And Qwen3-Omni-30B-A3B-Instruct – the omni-modal sibling of Qwen3.6-35B-A3B – is the worst model in the sweep at 33.8%, suggesting that omni-tuning has a real cost for code-only agentic use.

Plotted against total parameter count (point size = active parameters, colour = family):

[Figure: Model size vs pass rate, Q4 sweep]

Three things to take from this. First, size alone is not the right axis – the 120-billion-parameter gpt-oss-120b lands below the 27-billion-parameter Qwen3.6-27B, and the omni-modal 30B point is the lowest in the chart. Second, active parameters do most of the work: in the 25-35B band the dense models (Qwen3.6-27B, gemma-4-31b-it) and the 3B-active MoEs (Qwen3.6-35B-A3B) sit on essentially the same accuracy line, while the gpt-oss-120b point (drawn much smaller, since point size tracks active parameters) sits ~5 points below. Third, family is the strongest single signal – the omni-tuned point is dramatically off-trend.

Q3 – How do the harnesses rank?

Marginalising over the 10 Q4 models (10 × 16 = 160 cells per harness):

Harness   Pass     Rate   Avg agent time
pi        123/160  76.9%  163 s
qwen      120/160  75.0%  191 s
claude    106/160  66.2%  306 s
opencode  102/160  63.8%  271 s
aider     100/160  62.5%  384 s

pi is the strongest harness on this benchmark – and also the fastest of the top three. The accuracy gap between pi/qwen and claude/opencode/aider is a clear ~9-14 percentage points, big enough that I wouldn’t want to call it noise. (Caveat: Claude Code in particular is being talked to via the Anthropic-compat shim against a local llama-server, which is not the configuration its prompt scaffolding was tuned for.)

[Figure: Per-harness pass rate, Q4 sweep]

The (model × harness) breakdown shows that the harness ranking isn’t uniform across models – some models are very harness-sensitive (gpt-oss-120b collapses with opencode, Qwen3.6-27B collapses with aider), others are relatively flat (Qwen3-Omni, gpt-oss-20b). Note the single 16/16 cell (top-left) and the gemma-4-26B-A4B-it + opencode outlier (94% pass rate against opencode’s otherwise mediocre numbers) – a clear hint that any cross-harness conclusion based on a single cell is shaky.

[Figure: Pass rate per (model, harness) cell, Q4 sweep]

Q4 – Do models or harnesses cheat by reading the hidden tests?

This is the question I was most curious about going in, and the answer is more interesting than I expected.

I grepped every agent.log for references to test.sh, tasks/.../test.sh, or “grader” – both in the model’s own reasoning and in tool calls (Read, Bash, Glob). Per harness:

Harness   Mentions test.sh / grader  Actually read or ran the hidden grader
pi         0                          0
qwen       5                          0
claude     7                          0
aider     10                          0
opencode  23                         14
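
The scan itself is nothing fancier than the pass below – the trace layout and tool-call markers are assumptions about the (private) log format:

# Separate "mentions the grader in prose" from "actually touched it via a tool".
import pathlib, re

MENTION = re.compile(r"test\.sh|grader", re.IGNORECASE)
TOUCHED = re.compile(r"\b(Read|Bash|Glob)\b.*test\.sh")

for log in sorted(pathlib.Path("results").rglob("agent.log")):
    text = log.read_text(errors="replace")
    n = len(MENTION.findall(text))
    if n:
        print(f"{log}: {n} mention(s); touched grader: {bool(TOUCHED.search(text))}")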

For qwen, claude, aider, the references are all benign: the model paraphrases the prompt’s own mention of “the hidden grader” or “the grader checks …” while reasoning out loud. None of them tried to glob, read, or invoke the grader.

opencode is a different story. Across 14 distinct (model, task) cells, opencode either ran bash <repo>/tasks/<id>/test.sh <workspace> directly to get a pass/fail signal, or read the contents of the hidden test.sh file with its Read tool – the latter giving the model the literal reference implementation and tolerance constants. A representative trace from Qwen3.5-27B-Q8 + opencode on pt3_rope_gqa:

AssertionError: output does not match either RoPE layout
  (half=0.9700, interleaved=0.8935)
→ Read ../../../../../tasks/pt3_rope_gqa/test.sh
"Let me look at the test file to understand exactly what it's expecting"
← Edit rope_gqa.py

Of those 14 peeking cells, 13 passed. So this isn’t a marginal effect: when opencode peeks at the grader, it usually wins. Now, two important caveats:

  1. The hidden test.sh files live outside the workspace (under tasks/<id>/test.sh). The benchmark is sandboxed only by convention – the agent has full filesystem access to ~/workspace/harness-bench/, so all five harnesses could technically peek. Only opencode actually does, by default. That’s a property of the harness, not of the underlying model.
  2. The two distinct behaviours (running the grader vs. reading the grader source) carry very different weights. Running bash test.sh workspace to iterate on a pass/fail signal is closer to “generous test-time compute” than to leaking labels; reading the test.sh source – which contains reference numerical tolerances and sometimes inlined reference implementations – is straight-up data leakage.

I do not see evidence of pi, qwen, claude, or aider attempting either of these things in any of the 640 (= 4 harnesses × 160) Q4 cells I inspected. Nor do I see models hard-coding test inputs into their solutions – the pt6_generate_cached task has an explicit anti-cheat (it counts embedding-layer token visits to catch implementations that ignore the KV cache and just re-run the model), and pass rates there look behaviourally consistent.
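
For reference, counting embedding-layer token visits is a one-hook affair in PyTorch – a sketch of that style of anti-cheat, not the actual pt6 grader (gpt2 stands in for the model under test):

# A KV-cached generate() embeds each token roughly once (prompt + one per new
# token); an implementation that re-runs the full prefix every step embeds
# O(n^2) tokens, and the counter gives it away.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

visits = 0
def count_visits(module, inputs, output):
    global visits
    visits += inputs[0].numel()  # token ids embedded in this forward call

hook = model.get_input_embeddings().register_forward_hook(count_visits)
ids = tok("def fib(n):", return_tensors="pt").input_ids
model.generate(ids, max_new_tokens=16, use_cache=True)
hook.remove()
print(visits)  # ~ prompt_len + 16 with the cache; far larger without it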

So: yes, one harness cheats by default, and it’s opencode. The implication is that any cross-harness comparison that includes opencode is mildly contaminated – specifically, ~13% of opencode’s passes (13 of 102, the peeking cells that went on to pass) coincide with grader peeking. If I subtract those, opencode’s pass rate drops from 63.8% (102/160) to roughly 56% (89/160), which would put it last. I’m planning to re-run with a stricter sandbox (chroot or container) before drawing harder conclusions, when/if I find the time.

Q5 – Does 4-bit vs 8-bit quantisation make a difference?

For the seven sub-50B models, I re-ran the full sweep at Q8_0. The headline is that Q8 is a slight net regression on this benchmark, not an improvement:

Model             Q4 Pass  Q8 Pass  Δ   Q4 jax1  Q8 jax1
Qwen3.6-35B-A3B   64/80    66/80    +2  5/5      3/5
Qwen3.6-27B       66/80    65/80    -1  2/5      4/5
gemma-4-31b-it    65/80    59/80    -6  0/5      0/5
Qwen3.5-27B       55/80    52/80    -3  2/5      0/5
Qwen3.5-35B-A3B   55/80    51/80    -4  5/5      3/5
Qwen3-Coder-Next  56/80    55/80    -1  0/5      2/5
gpt-oss-20b       45/80    52/80    +7  3/5      2/5
7-model total     406/560  400/560  -6  17/35    14/35

[Figure: Q4 → Q8 paired pass-rate slope chart]
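
The Δ column (and its -6 total) is a two-line computation over the two sweeps’ per-cell CSVs – file and column names are assumptions:

import pandas as pd

q4 = pd.read_csv("results/q4_cells.csv").groupby("model")["passed"].sum()
q8 = pd.read_csv("results/q8_cells.csv").groupby("model")["passed"].sum()
delta = (q8 - q4).dropna().astype(int)  # Q4-only models drop out as NaN
print(delta.sort_values())              # gemma-4-31b-it -6 ... gpt-oss-20b +7
print(delta.sum())                      # -6 across the seven paired models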

A few observations:

  • The aggregate is essentially flat (-6 / 560 = -1%) – well within the noise I’d expect from re-running a stochastic agent loop. Q4 is not obviously degraded relative to Q8 at this scale.
  • Q8 is also strictly slower. Qwen3.5-27B averages 402 s/task at Q4 and takes noticeably longer at Q8; the bandwidth cost of 8-bit weights on Apple Silicon is real.
  • Per-model, the swings are noisy and not all in the same direction. Only gpt-oss-20b clearly benefits from Q8 (+7, the biggest single-model swing in either direction). gemma-4-31b-it loses the most at Q8 (-6, mostly distributed across small per-task regressions, including a 13/16 → 8/16 collapse for gemma-4-31b-it + opencode).
  • Quantisation effects show up most clearly on the hardest, numerically-finicky tasks. The two cleanest examples are jax1_complex_lp (ComplEx + filtered MRR) – where Q8 gives 14/35 vs Q4’s 17/35, mixed at the model level – and pt7_prompt_blend, the only task where Q8 cleanly beats Q4 (51% vs 42%). The dominant pt7 failure mode is a structural one (returning unmixed branch logits instead of probability-space-averaged log-probs), and Q8 appears to help models stay coherent over the longer reasoning chain that pt7 needs.

The summary I’d give on quantisation is: Q4_K_M (or UD-Q4_K_XL) is the right default on Apple Silicon for this kind of agentic-coding workload. You give up almost nothing on the 16-task average and you get 1.5-2× the throughput. The two exceptions are gpt-oss-20b (small, gains real accuracy at Q8) and any workload that leans heavily on long-chain numerical correctness like pt7_prompt_blend.

How to actually run the leading combinations

If you just want to reproduce the top cells of the matrix on your own laptop, the recipe is: launch llama-server in one terminal, point a harness at it in another. Every harness wrapper in harness-bench expects the same alias (bench-model) on :8001, so the only thing that changes between rows of the leaderboard is the -hf flag on the server and the harness binary you call.

Server: llama-server

-hf auto-downloads the GGUF from HuggingFace; the q8 KV cache is what makes 131k context fit in 128 GB. One line per top combo:

# Qwen3.6-27B (UD-Q4_K_XL) -- top combo with `pi`: 16/16 at ~207 s/task
llama-server -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL --alias bench-model --port 8001 --host 127.0.0.1 --ctx-size 131072 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --jinja -np 1

# gpt-oss-120b (MXFP4, 32k ctx) -- fastest 15/16 combo with `pi` (~34 s/task)
llama-server -hf ggml-org/gpt-oss-120b-GGUF --alias bench-model --port 8001 --host 127.0.0.1 --ctx-size 32768 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --jinja -np 1

# Qwen3.6-35B-A3B (MoE, 3B active, UD-Q4_K_XL) -- 15/16 with `qwen` at ~108 s/task
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL --alias bench-model --port 8001 --host 127.0.0.1 --ctx-size 131072 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --jinja --chat-template-kwargs '{"enable_thinking":false}' -np 1

# gemma-4-26B-A4B-it (UD-Q4_K_XL) -- smallest 15/16 cell, with `opencode` (~307 s/task)
llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --alias bench-model --port 8001 --host 127.0.0.1 --ctx-size 131072 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --jinja -np 1

# gemma-4-31b-it (Q4_K_M, 65k ctx) -- 15/16 with `pi` at ~422 s/task
llama-server -hf unsloth/gemma-4-31b-it-GGUF:Q4_K_M --alias bench-model --port 8001 --host 127.0.0.1 --ctx-size 65536 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --jinja -np 1
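
On the “q8 KV cache makes 131k fit” claim, the back-of-envelope arithmetic is below – the layer/head dimensions are invented for illustration, not the real dims of any model above:

# KV bytes = 2 (K and V) x layers x kv_heads x head_dim x ctx x bytes/elem
def kv_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

ctx, layers, kv_heads, hdim = 131_072, 48, 8, 128
print(kv_bytes(ctx, layers, kv_heads, hdim, 2) / 2**30)  # f16 cache: ~24 GiB
print(kv_bytes(ctx, layers, kv_heads, hdim, 1) / 2**30)  # q8_0: ~12 GiB (+ block scales)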

Wait for curl http://127.0.0.1:8001/v1/models to return something before pointing a harness at it. (The orchestrator uses scripts/wait_ready.sh for this – worth stealing if you script your own runs.)
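
If you script your runs in Python rather than shell, a stand-in for that wait loop (not the actual wait_ready.sh) looks like:

import time, urllib.request

def wait_ready(url="http://127.0.0.1:8001/v1/models", timeout=300):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as r:
                if r.status == 200:
                    return
        except OSError:  # connection refused while the model is still loading
            pass
        time.sleep(2)
    raise TimeoutError(f"{url} not ready after {timeout}s")

wait_ready()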

Harness: interactive vs. non-interactive

Each harness has two modes worth knowing about: non-interactive (read prompt, do work, exit – this is what the benchmark runs and what you want for scripted use) and interactive (drop into a REPL/TUI, useful when you want to iterate or look over the agent’s shoulder). Pick the one matching your use case:

# pi -- top harness on this benchmark (76.9% on Q4 sweep)
# Non-interactive: -p reads the prompt and exits; -nc/-ns/-ne/-np disable AGENTS.md /
# skills / extensions / prompt-template discovery (fairness in the sweep)
pi --provider local-llama --model bench-model -p --no-session -nc -ns -ne -np "<prompt>"
# Interactive: drop the -p flag (and keep the discovery flags off if you want a clean run)
pi --provider local-llama --model bench-model

# qwen -- fastest near-top harness (75.0%)
# Non-interactive: pass the prompt as a positional argument; --approval-mode yolo
# auto-approves all tool calls (you really do want this for an unattended run)
qwen --openai-base-url "http://127.0.0.1:8001/v1" --openai-api-key "dummy" --model "bench-model" --approval-mode yolo "<prompt>"
# Interactive: omit the prompt argument
qwen --openai-base-url "http://127.0.0.1:8001/v1" --openai-api-key "dummy" --model "bench-model"

# claude -- via the Anthropic-compat shim against llama-server
# Non-interactive: -p reads stdin / a positional prompt; --bare disables MCP autoload;
# --dangerously-skip-permissions is the equivalent of qwen's `yolo`
ANTHROPIC_BASE_URL="http://127.0.0.1:8001" ANTHROPIC_API_KEY="sk-ant-local-dummy" \
  claude --bare --model bench-model -p --dangerously-skip-permissions "<prompt>"
# Interactive: drop -p (the TUI manages the prompt itself)
ANTHROPIC_BASE_URL="http://127.0.0.1:8001" ANTHROPIC_API_KEY="sk-ant-local-dummy" \
  claude --bare --model bench-model --dangerously-skip-permissions

# opencode -- run command is one-shot; opencode without it is the TUI
# Non-interactive: `opencode run` with the prompt
opencode run -m "llamacpp/bench-model" "<prompt>"
# Interactive: just `opencode` in the workspace
opencode

# aider -- git-native; --message is one-shot, no flags = REPL
# Non-interactive: --message + --yes-always; --no-stream is friendlier to local servers
OPENAI_API_BASE="http://127.0.0.1:8001/v1" OPENAI_API_KEY="dummy" \
  aider --model "openai/bench-model" --no-git --yes-always --no-stream --message "<prompt>"
# Interactive: drop --message and you get the standard aider REPL
OPENAI_API_BASE="http://127.0.0.1:8001/v1" OPENAI_API_KEY="dummy" \
  aider --model "openai/bench-model" --no-git

A few footguns worth flagging:

  • pi reads its provider wiring from ~/.pi/agent/models.json. A minimal entry pointing at the local server:

    {"providers": {"local-llama": {
        "baseUrl": "http://127.0.0.1:8001/v1",
        "api": "openai-completions",
        "apiKey": "dummy",
        "models": [{"id": "bench-model", "input": ["text"],
                    "contextWindow": 131072, "maxTokens": 16384}]}}}

    Without this, --provider local-llama will fail with a “no such provider” error before the harness even talks to the server.
  • Claude Code goes through the Anthropic-compat path on :8001 (no /v1). The other four harnesses use the OpenAI-compat path on :8001/v1. Easy to mix up.
  • opencode peeks at hidden files by default (see Q4). If you’re running it on tasks with hidden graders living outside the workspace, that’s a problem; if you’re using it for normal coding work, it’s a feature. Either way, worth knowing.
  • -np 1 on llama-server disables prompt-batch parallelism, which keeps memory predictable on the M3 Max. If you’re on a bigger machine, drop it.

If you want the orchestrator to do all of the above for you – start the server, walk a harness through every task, append a JSONL record per (harness, task) – the one-liner is:

python3 scripts/run_bench.py --model unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL --ctx 131072 --harnesses pi --tasks all --resume

--resume skips cells already present in results/<model>.jsonl, which is what makes long sweeps survivable.

The harness-bench repo (task prompts, hidden graders, raw agent traces) is private for the contamination reasons noted at the top – happy to share access on request for collaborators or for replication, just drop me a line.
