<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>neuralnoise.com</title>
    <description>Homepage of Dr Pasquale Minervini &lt;br/&gt; Researcher/Faculty at the University of Edinburgh, School of Informatics &lt;br/&gt; Co-Founder and CTO at Miniml.AI &lt;br/&gt; ELLIS Scholar, Edinburgh Unit</description>
    <link>https://neuralnoise.com///</link>
    <atom:link href="https://neuralnoise.com///feed.xml" rel="self" type="application/rss+xml" />
    <pubDate>Tue, 28 Apr 2026 14:37:49 +0100</pubDate>
    <lastBuildDate>Tue, 28 Apr 2026 14:37:49 +0100</lastBuildDate>
    <generator>Jekyll v4.3.4</generator>
    
      <item>
        <title>[WIP] Benchmarking Local LLMs Against Coding Agent Harnesses</title>
        <description>&lt;p&gt;I’ve been running a small benchmark, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;harness-bench&lt;/code&gt;, that pairs &lt;strong&gt;local LLMs&lt;/strong&gt; (served via &lt;a href=&quot;https://github.com/ggml-org/llama.cpp&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;llama.cpp&lt;/code&gt;&lt;/a&gt;’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;llama-server&lt;/code&gt;) with &lt;strong&gt;agent harnesses&lt;/strong&gt; (Aider, Claude Code, OpenCode, Pi, Qwen CLI) on 16 software-engineering tasks across Python, PyTorch, JAX, C, C++, Rust, and SQL. Each &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(model, harness, task)&lt;/code&gt; cell is sandboxed: the agent only sees a scratch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;workspace/&lt;/code&gt; and grading is done by a hidden &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test.sh&lt;/code&gt; that the agent never sees. The current sweep is 17 model-quants × 5 harnesses × 16 tasks = &lt;strong&gt;1360 runs&lt;/strong&gt; on a single M3 Max / 128 GB laptop.&lt;/p&gt;

&lt;p&gt;(The benchmark repository itself – task prompts, hidden graders, raw agent traces – is &lt;strong&gt;private&lt;/strong&gt;, to keep the task set out of training corpora. If the prompts and graders ended up in a future model’s pretraining data, the benchmark would stop measuring what it claims to measure. Aggregated results, the per-cell CSV, and the plotting source for this post are public; the tasks are not.)&lt;/p&gt;

&lt;p&gt;These are very preliminary findings – some of the questions below probably deserve a careful re-run before I’d trust the rankings to two decimal places. But the headline patterns are stable enough across the Q4 and Q8 sweeps that I think it’s worth writing them down.&lt;/p&gt;

&lt;p&gt;The 16 tasks span seven languages and a wide spread of difficulties:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/harness-bench/task_difficulty.svg&quot; alt=&quot;Per-task difficulty across the full sweep&quot; class=&quot;nn-chart center-image&quot; width=&quot;92%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The hardest end (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pt3_rope_gqa&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jax1_complex_lp&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pt7_prompt_blend&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pt6_generate_cached&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rs1_arena&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pt5_logit_lens&lt;/code&gt;) is what discriminates the top tier of (model, harness) cells from the rest – everyone passes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sql1_recursive&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p2_shortest_path&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;q1--what-are-the-best-model-harness-combinations&quot;&gt;Q1 – What are the best (model, harness) combinations?&lt;/h3&gt;

&lt;p&gt;The single perfect cell on the matrix is &lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-27B&lt;/code&gt; (UD-Q4_K_XL) + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pi&lt;/code&gt;&lt;/strong&gt; – 16/16 at ~207 s/task. It is the only combination across all 85 tested cells (Q4 sweep + Q8 sweep) that clears the full benchmark.&lt;/p&gt;

&lt;p&gt;If you care about throughput, &lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gpt-oss-120b&lt;/code&gt; (MXFP4) + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pi&lt;/code&gt;&lt;/strong&gt; scores 15/16 at ~34 s/task – about 6× faster than the top combo for one extra failure, and the fastest member of the 15/16 tier overall. If you want a “dense-feeling” mid-size model, &lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-35B-A3B&lt;/code&gt; (UD-Q4_K_XL) + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qwen&lt;/code&gt;&lt;/strong&gt; hits 15/16 at ~108 s/task.&lt;/p&gt;

&lt;p&gt;Eight cells cleared 15/16 or better:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Combo&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Pass&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Avg time&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-27B&lt;/code&gt; UD-Q4_K_XL + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pi&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;16/16&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;207 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gpt-oss-120b&lt;/code&gt; MXFP4 + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pi&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;15/16&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;34 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-35B-A3B&lt;/code&gt; UD-Q4_K_XL + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qwen&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;15/16&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;108 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-35B-A3B&lt;/code&gt; Q8_0 + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;claude&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;15/16&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;244 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gemma-4-26B-A4B-it&lt;/code&gt; UD-Q4_K_XL + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;15/16&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;307 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-27B&lt;/code&gt; Q8_0 + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qwen&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;15/16&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;395 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gemma-4-31b-it&lt;/code&gt; Q4_K_M + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pi&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;15/16&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;422 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-27B&lt;/code&gt; Q8_0 + &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;15/16&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;462 s&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pi&lt;/code&gt; shows up 3 times in the 15/16-or-better tier; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qwen&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt; twice; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;claude&lt;/code&gt; once; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aider&lt;/code&gt; not at all.&lt;/p&gt;

&lt;p&gt;A clearer way to see the same information is to plot every cell in the matrix on (avg time, pass rate). The pareto frontier is what you actually care about as a user:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/harness-bench/speed_vs_accuracy.svg&quot; alt=&quot;Speed-accuracy frontier across all (model, quant, harness) cells&quot; class=&quot;nn-chart center-image&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-27B + pi&lt;/code&gt; (top-left blue circle, ~207 s) is the only 100% point. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gpt-oss-120b + pi&lt;/code&gt; (purple diamond at the far left) gets you ~94% in ~34 s – about 6× faster, one extra failure. The bottom-right cluster of cells past 700 s is mostly &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aider&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt; paying for verbose tool-use scaffolding without recovering accuracy.&lt;/p&gt;

&lt;h3 id=&quot;q2--ranking-between-models&quot;&gt;Q2 – Ranking between models&lt;/h3&gt;

&lt;p&gt;Marginalising over the 5 harnesses (Q4 sweep, 5 × 16 = 80 cells per model):&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Quant&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Pass&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Rate&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Avg agent time&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-27B&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;UD-Q4_K_XL&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;66/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;82.5%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;474 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gemma-4-31b-it&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Q4_K_M&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;65/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;81.2%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;582 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-35B-A3B&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;UD-Q4_K_XL&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;64/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;80.0%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;215 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gpt-oss-120b&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;MXFP4&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;62/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;77.5%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;67 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3-Coder-Next&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;UD-Q4_K_XL&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;56/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;70.0%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;162 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gemma-4-26B-A4B-it&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;UD-Q4_K_XL&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;56/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;70.0%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;397 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.5-35B-A3B&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Q4_K_M&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;55/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;68.8%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;123 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.5-27B&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Q4_K_M&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;55/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;68.8%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;402 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gpt-oss-20b&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Q4_K_M&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;45/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;56.2%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;96 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3-Omni-30B-A3B-Instruct&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Q4_K_M&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;27/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;33.8%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;113 s&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;A few things stand out. The dense &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-27B&lt;/code&gt; and the (relatively) new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gemma-4-31b-it&lt;/code&gt; are roughly tied at the top, but the &lt;em&gt;MoE&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-35B-A3B&lt;/code&gt; (3B active in a 35B body) is essentially as accurate at less than half the wall-clock cost. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gpt-oss-120b&lt;/code&gt; at MXFP4 buys you another ~3× speedup for ~3% accuracy. And &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3-Omni-30B-A3B-Instruct&lt;/code&gt; – the omni-modal sibling of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-35B-A3B&lt;/code&gt; – is the worst model in the sweep at 33.8%, suggesting that omni-tuning has a real cost for code-only agentic use.&lt;/p&gt;

&lt;p&gt;Plotted against total parameter count (point size = active parameters, colour = family):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/harness-bench/size_vs_pass_rate.svg&quot; alt=&quot;Model size vs pass rate, Q4 sweep&quot; class=&quot;nn-chart center-image&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Two things to take from this. First, &lt;strong&gt;size alone is not the right axis&lt;/strong&gt; – the 120-billion-parameter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gpt-oss-120b&lt;/code&gt; lands below the 27-billion-parameter &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-27B&lt;/code&gt;, and the omni-modal 30B point is the lowest in the chart. Second, &lt;strong&gt;active parameters do most of the work&lt;/strong&gt;: in the 25-35B band the dense models (Qwen3.6-27B, gemma-4-31b-it) and the 3B-active MoEs (Qwen3.6-35B-A3B) sit on essentially the same accuracy line, and the (much smaller) gpt-oss-120b point sits ~5 points below. Family is the strongest single signal – the omni-tuned point is dramatically off-trend.&lt;/p&gt;

&lt;h3 id=&quot;q3--ranking-between-harnesses&quot;&gt;Q3 – Ranking between harnesses&lt;/h3&gt;

&lt;p&gt;Marginalising over the 10 Q4 models (10 × 16 = 160 cells per harness):&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Harness&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Pass&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Rate&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Avg agent time&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pi&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;123/160&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;76.9%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;163 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qwen&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;120/160&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;75.0%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;191 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;claude&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;106/160&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;66.2%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;306 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;102/160&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;63.8%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;271 s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aider&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;100/160&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;62.5%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;384 s&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/badlogic/pi-coding-agent&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pi&lt;/code&gt;&lt;/a&gt; is the strongest harness on this benchmark – and also the &lt;em&gt;fastest&lt;/em&gt; of the top three. The accuracy gap between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pi&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qwen&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;claude&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aider&lt;/code&gt; is a clear ~10-15 percentage points, big enough that I wouldn’t want to call it noise. (Caveat: Claude Code in particular is being talked to via the Anthropic-compat shim against a local &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;llama-server&lt;/code&gt;, which is not the configuration its prompt scaffolding was tuned for.)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/harness-bench/harness_ranking.svg&quot; alt=&quot;Per-harness pass rate, Q4 sweep&quot; class=&quot;nn-chart center-image&quot; width=&quot;92%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The (model × harness) breakdown shows that the harness ranking isn’t uniform across models – some models are very harness-sensitive (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gpt-oss-120b&lt;/code&gt; collapses with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-27B&lt;/code&gt; collapses with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aider&lt;/code&gt;), others are relatively flat (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3-Omni&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gpt-oss-20b&lt;/code&gt;). Note the single 100-pass cell (top-left) and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gemma-4-26B-A4B-it + opencode&lt;/code&gt; outlier (94 pass rate against &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt;’s overall mediocre numbers) – a clear hint that any cross-harness conclusion based on a single cell is shaky.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/harness-bench/model_harness_heatmap.svg&quot; alt=&quot;Pass rate per (model, harness) cell, Q4 sweep&quot; class=&quot;nn-chart center-image&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;q4--do-models-or-harnesses-cheat-by-reading-the-hidden-tests&quot;&gt;Q4 – Do models or harnesses cheat by reading the hidden tests?&lt;/h3&gt;

&lt;p&gt;This is the question I was most curious about going in, and the answer is more interesting than I expected.&lt;/p&gt;

&lt;p&gt;I grepped every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;agent.log&lt;/code&gt; for references to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test.sh&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tasks/.../test.sh&lt;/code&gt;, or “grader” – both in the model’s own reasoning and in tool calls (Read, Bash, Glob). Per harness:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Harness&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Mentions test.sh / grader&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Actually read or ran the hidden grader&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pi&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qwen&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;5&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;claude&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;7&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aider&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;10&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;23&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;&lt;strong&gt;14&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;For &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qwen&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;claude&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aider&lt;/code&gt;, the references are all benign: the model paraphrases the prompt’s own mention of “the hidden grader” or “the grader checks …” while reasoning out loud. None of them tried to glob, read, or invoke the grader.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt; is a different story.&lt;/strong&gt; Across 14 distinct (model, task) cells, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt; either ran &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bash /Users/pminervi/workspace/harness-bench/tasks/&amp;lt;id&amp;gt;/test.sh &amp;lt;workspace&amp;gt;&lt;/code&gt; directly to get a pass/fail signal, or read the contents of the hidden &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test.sh&lt;/code&gt; file with its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Read&lt;/code&gt; tool – the latter giving the model the literal reference implementation and tolerance constants. A representative trace from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.5-27B-Q8 + opencode&lt;/code&gt; on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pt3_rope_gqa&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;AssertionError: output does not match either RoPE layout
  (half=0.9700, interleaved=0.8935)
→ Read ../../../../../tasks/pt3_rope_gqa/test.sh
&quot;Let me look at the test file to understand exactly what it&apos;s expecting&quot;
← Edit rope_gqa.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Of those 14 peeking cells, &lt;strong&gt;13 passed&lt;/strong&gt;. So this isn’t a marginal effect: when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt; peeks at the grader, it usually wins. Now, two important caveats:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The hidden &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test.sh&lt;/code&gt; files live &lt;em&gt;outside&lt;/em&gt; the workspace (under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tasks/&amp;lt;id&amp;gt;/test.sh&lt;/code&gt;). The benchmark is sandboxed only by convention – the agent has full filesystem access to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/workspace/harness-bench/&lt;/code&gt;, so all five harnesses &lt;em&gt;could&lt;/em&gt; technically peek. Only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt; actually does, by default. That’s a property of the harness, not of the underlying model.&lt;/li&gt;
  &lt;li&gt;The two distinct behaviours (running the grader vs. reading the grader source) carry very different weights. Running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bash test.sh workspace&lt;/code&gt; to iterate on a pass/fail signal is closer to “generous test-time compute” than to leaking labels; reading the test.sh source – which contains reference numerical tolerances and sometimes inlined reference implementations – is straight-up data leakage.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I do &lt;em&gt;not&lt;/em&gt; see evidence of pi, qwen, claude, or aider attempting either of these things in any of the 640 (= 4 harnesses × 160) Q4 cells I inspected. Nor do I see models hard-coding test inputs into their solutions – the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pt6_generate_cached&lt;/code&gt; task has an explicit anti-cheat (it counts embedding-layer token visits to catch implementations that ignore the KV cache and just re-run the model), and pass rates there look behaviourally consistent.&lt;/p&gt;

&lt;p&gt;So: yes, one harness cheats by default, and it’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt;. The implication is that any cross-harness comparison that includes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt; is mildly contaminated – specifically, ~14% of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt;’s passes (14 of 102) coincide with grader peeking. If I subtract those, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt;’s pass rate drops from 63.8% to roughly 55%, which would put it last. I’m planning to re-run with a stricter sandbox (chroot or container) before drawing harder conclusions.&lt;/p&gt;

&lt;h3 id=&quot;q5--does-4-bit-vs-8-bit-quantisation-make-a-difference&quot;&gt;Q5 – Does 4-bit vs 8-bit quantisation make a difference?&lt;/h3&gt;

&lt;p&gt;For the seven sub-50B models I re-ran the full sweep at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Q8_0&lt;/code&gt;. The headline is that &lt;strong&gt;Q8 is a slight net regression&lt;/strong&gt;, not an improvement, on this benchmark:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Q4 Pass&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Q8 Pass&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Δ&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Q4 jax1&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Q8 jax1&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-35B-A3B&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;64/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;66/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;+2&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;5/5&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;3/5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.6-27B&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;66/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;65/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;-1&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2/5&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;4/5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gemma-4-31b-it&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;65/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;59/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;-6&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0/5&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0/5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.5-27B&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;55/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;52/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;-3&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2/5&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0/5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.5-35B-A3B&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;55/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;51/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;-4&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;5/5&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;3/5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3-Coder-Next&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;56/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;55/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;-1&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0/5&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2/5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gpt-oss-20b&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;45/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;52/80&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;+7&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;3/5&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2/5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;7-model total&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;&lt;strong&gt;406/560&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;&lt;strong&gt;400/560&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;&lt;strong&gt;-6&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;&lt;strong&gt;17/35&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;&lt;strong&gt;14/35&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;img src=&quot;/images/harness-bench/q4_vs_q8.svg&quot; alt=&quot;Q4 → Q8 paired pass rate slope chart&quot; class=&quot;nn-chart center-image&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;A few observations:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;The aggregate is essentially flat (-6 / 560 = -1%)&lt;/strong&gt; – well within the noise I’d expect from re-running a stochastic agent loop. Q4 is &lt;em&gt;not&lt;/em&gt; obviously degraded relative to Q8 at this scale.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Q8 is also strictly slower.&lt;/strong&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Qwen3.5-27B&lt;/code&gt; averages 402 s/task at Q4 vs longer at Q8; the bandwidth cost of 8-bit weights on Apple Silicon is real.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Per-model, the swings are noisy and not all in the same direction.&lt;/strong&gt; Only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gpt-oss-20b&lt;/code&gt; clearly &lt;em&gt;benefits&lt;/em&gt; from Q8 (+7, the biggest single-model swing in either direction). &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gemma-4-31b-it&lt;/code&gt; &lt;em&gt;loses&lt;/em&gt; the most at Q8 (-6, mostly distributed across small per-task regressions, including a 13/16 → 8/16 collapse for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gemma-4-31b-it + opencode&lt;/code&gt;).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Quantisation effects show up most clearly on the hardest, numerically-finicky tasks.&lt;/strong&gt; The two cleanest examples are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jax1_complex_lp&lt;/code&gt; (ComplEx + filtered MRR) – where Q8 gives 14/35 vs Q4’s 17/35, mixed at the model level – and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pt7_prompt_blend&lt;/code&gt;, the only task where Q8 &lt;em&gt;cleanly&lt;/em&gt; beats Q4 (51% vs 42%). The dominant pt7 failure mode is a structural one (returning unmixed branch logits instead of probability-space-averaged log-probs), and Q8 appears to help models stay coherent over the longer reasoning chain that pt7 needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The summary I’d give on quantisation is: &lt;strong&gt;Q4_K_M (or UD-Q4_K_XL) is the right default on Apple Silicon for this kind of agentic-coding workload&lt;/strong&gt;. You give up almost nothing on the 16-task average and you get 1.5-2× the throughput. The two exceptions are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gpt-oss-20b&lt;/code&gt; (small, gains real accuracy at Q8) and any workload that leans heavily on long-chain numerical correctness like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pt7_prompt_blend&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;whats-still-wip&quot;&gt;What’s still WIP&lt;/h3&gt;

&lt;p&gt;This post is tagged WIP for a reason. Loose ends I want to close before I’d call any of this final:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Stricter sandboxing&lt;/strong&gt; so &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt;’s grader-peeking behaviour doesn’t contaminate the harness ranking. I want to know what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opencode&lt;/code&gt; actually scores when it can’t see &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test.sh&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Multiple seeds per cell&lt;/strong&gt;. Right now each &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(model, harness, task)&lt;/code&gt; is one run. Variance on the borderline cells (the difference between 13/16 and 14/16) is probably comparable to the harness-to-harness gap I’m citing.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Task coverage&lt;/strong&gt;. Sixteen tasks is small, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pt3_rope_gqa&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rs1_arena&lt;/code&gt; are doing a lot of the work in separating the top tier. I’d like to add a few more genuinely-hard tasks, especially in C/C++ and SQL where the catalog is currently thin.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Vendor-hosted baselines.&lt;/strong&gt; The benchmark is currently entirely local. Running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;claude&lt;/code&gt; against the actual Anthropic API and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qwen&lt;/code&gt; against Alibaba’s API would let me decompose how much of the harness gap is “the harness scaffolding” vs “the local model is weaker than what the harness was designed against”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The harness-bench repo (task prompts, hidden graders, raw agent traces) is private for the contamination reasons noted at the top – happy to share access on request for collaborators or for replication, just drop me a line. The aggregated per-cell CSV and the seaborn source for every figure in this post are public and live alongside the post itself, under &lt;a href=&quot;https://github.com/pminervini/neuralnoise-uno/tree/main/scripts/plots/harness_bench&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scripts/plots/harness_bench/&lt;/code&gt;&lt;/a&gt; – run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;python3 scripts/plots/harness_bench/make_plots.py&lt;/code&gt; to regenerate.&lt;/p&gt;
</description>
        <pubDate>Tue, 28 Apr 2026 01:00:00 +0100</pubDate>
        <link>https://neuralnoise.com///2026/harness-bench-wip/</link>
        <guid isPermaLink="true">https://neuralnoise.com///2026/harness-bench-wip/</guid>
        
        <category>llama-cpp</category>
        
        <category>agents</category>
        
        <category>coding-agents</category>
        
        <category>quantisation</category>
        
        <category>local-llms</category>
        
        
        <category>tools</category>
        
        <category>agents</category>
        
        <category>benchmarks</category>
        
      </item>
    
      <item>
        <title>Deep Research MCP</title>
        <description>&lt;p&gt;To make my life a bit easier, I built &lt;a href=&quot;https://github.com/pminervini/deep-research-mcp&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;deep-research-mcp&lt;/code&gt;&lt;/a&gt;, a small Python agent that exposes several “deep research” backends through a single &lt;a href=&quot;https://modelcontextprotocol.io&quot;&gt;Model Context Protocol&lt;/a&gt; server (&lt;a href=&quot;https://www.anthropic.com/news/model-context-protocol&quot;&gt;Anthropic, 2024&lt;/a&gt;). It allows Claude Code, Codex, Gemini CLI, or any MCP client to fire off long-running research tasks against whichever backend the user prefers.&lt;/p&gt;

&lt;h3 id=&quot;what-the-server-exposes&quot;&gt;What the server exposes&lt;/h3&gt;

&lt;p&gt;The MCP server exposes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;deep_research&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;research_with_context&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;research_status&lt;/code&gt;. The first kicks off a task; the second resumes one after an optional clarification round; the third polls. Everything else – provider selection, timeouts, clarification models, system prompts – lives in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/.deep_research&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;architecture&quot;&gt;Architecture&lt;/h3&gt;

&lt;div class=&quot;mermaid&quot;&gt;
flowchart LR
    Client[&quot;Claude Code / Codex / Gemini CLI&quot;] --&amp;gt;|MCP| Server[&quot;FastMCP server&quot;]
    Server --&amp;gt; Agent[&quot;DeepResearchAgent&quot;]
    Agent --&amp;gt; Clar[&quot;Clarification&quot;]
    Agent --&amp;gt; Instr[&quot;Instruction builder&quot;]
    Agent --&amp;gt; Backend{{&quot;Backend&quot;}}
    Backend --&amp;gt; OAI[&quot;OpenAI Responses / Chat Completions&quot;]
    Backend --&amp;gt; Gem[&quot;Gemini Deep Research&quot;]
    Backend --&amp;gt; DrT[&quot;DR-Tulu /chat&quot;]
    Backend --&amp;gt; ODR[&quot;Open Deep Research (smolagents)&quot;]
&lt;/div&gt;

&lt;script src=&quot;/js/mermaid.min.js&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  mermaid.initialize({ startOnLoad: false, theme: &apos;default&apos; });
  mermaid.run({ querySelector: &apos;.mermaid&apos; });
&lt;/script&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DeepResearchAgent&lt;/code&gt; optionally runs a clarification pass (triage the query, ask follow-up questions, enrich it), optionally rewrites the query into a longer research brief, and then delegates to one of four backends behind a common interface. All backends return the same normalised record – report, citations, reasoning steps, task id, execution time – so the MCP tools and the CLI do not care which one ran.&lt;/p&gt;

&lt;h3 id=&quot;backends&quot;&gt;Backends&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;OpenAI Responses API&lt;/strong&gt; &lt;em&gt;(default)&lt;/em&gt;. Uses a Deep Research model such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o4-mini-deep-research&lt;/code&gt; with built-in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;web_search_preview&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;code_interpreter&lt;/code&gt; tools, background mode, and polling. Follows the reference pattern in OpenAI’s cookbook (&lt;a href=&quot;https://cookbook.openai.com/examples/deep_research_api/introduction_to_deep_research_api_agents&quot;&gt;OpenAI, 2025a&lt;/a&gt;; &lt;a href=&quot;https://openai.com/index/introducing-deep-research/&quot;&gt;OpenAI, 2025b&lt;/a&gt;).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;OpenAI Chat Completions&lt;/strong&gt;. Same backend file, different code path (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;api_style = &quot;chat_completions&quot;&lt;/code&gt;). No built-in tools, no polling – just a blocking chat call. This is the escape hatch for any OpenAI-compatible endpoint: Perplexity’s Sonar Deep Research (&lt;a href=&quot;https://docs.perplexity.ai/getting-started/models/models/sonar-deep-research&quot;&gt;Perplexity, 2025&lt;/a&gt;), Groq, Together, Ollama, vLLM, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;llama.cpp&lt;/code&gt;’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;llama-server&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Gemini Deep Research&lt;/strong&gt;. Implemented against the Interactions API, with Google Search and URL context as built-in tools (&lt;a href=&quot;https://blog.google/products/gemini/google-gemini-deep-research/&quot;&gt;Google, 2024&lt;/a&gt;).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;DR-Tulu&lt;/strong&gt;. Allen AI’s open research agent, called over its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/chat&lt;/code&gt; endpoint; the client-side integration is intentionally thin (&lt;a href=&quot;https://github.com/allenai/dr-tulu&quot;&gt;AI2, 2025&lt;/a&gt;).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Open Deep Research&lt;/strong&gt;. A self-contained &lt;a href=&quot;https://github.com/huggingface/smolagents&quot;&gt;smolagents&lt;/a&gt; stack with a text browser and search tools, using LiteLLM as the model layer (&lt;a href=&quot;https://huggingface.co/blog/open-deep-research&quot;&gt;Roucher et al., 2025&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;why-mcp&quot;&gt;Why MCP?&lt;/h3&gt;

&lt;p&gt;Plugging a deep-research tool into Claude Code takes one line:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;claude mcp add deep-research &lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; uv run &lt;span class=&quot;nt&quot;&gt;--directory&lt;/span&gt; /path/to/deep-research-mcp deep-research-mcp
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Stdio for local spawning, Streamable HTTP on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/mcp&lt;/code&gt; for a shared server. The repository also ships a CLI (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cli/deep-research-cli.py&lt;/code&gt;) and a Textual TUI (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cli/deep-research-tui.py&lt;/code&gt;) that can either run the agent directly or act as MCP clients against the HTTP endpoint – useful for debugging providers without involving the model in the loop.&lt;/p&gt;

&lt;h3 id=&quot;clarification&quot;&gt;Clarification&lt;/h3&gt;

&lt;p&gt;Optional, disabled by default in my setup and probably going away at some point. Three small chat models – a triage model, a clarifier, and an instruction builder – decide whether a query is underspecified, ask follow-up questions, and merge the answers into a longer brief before the expensive provider call. The pattern is adapted from OpenAI’s cookbook. On local endpoints (Ollama, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;llama-server&lt;/code&gt;), clarification needs a model with reasonable structured-output behaviour; very small models trip the triage step.&lt;/p&gt;

&lt;h3 id=&quot;notes&quot;&gt;Notes&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Long-running tasks can take hours. Client timeouts matter more than the agent’s own – raise &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MCP_TOOL_TIMEOUT&lt;/code&gt; (Claude Code) or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tool_timeout_sec&lt;/code&gt; (Codex) and poll &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;research_status&lt;/code&gt; rather than blocking a single tool call.&lt;/li&gt;
  &lt;li&gt;Clarification and instruction building remain OpenAI-compatible even when the main provider is Gemini or DR-Tulu; set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clarification_base_url&lt;/code&gt; separately.&lt;/li&gt;
  &lt;li&gt;The repo ships a Claude Code skill and a Codex skill so the assistant knows how to use the tools without re-reading the README.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Code, tests, and setup: &lt;a href=&quot;https://github.com/pminervini/deep-research-mcp&quot;&gt;github.com/pminervini/deep-research-mcp&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Tue, 21 Apr 2026 01:00:00 +0100</pubDate>
        <link>https://neuralnoise.com///2026/deep-research-mcp/</link>
        <guid isPermaLink="true">https://neuralnoise.com///2026/deep-research-mcp/</guid>
        
        <category>mcp</category>
        
        <category>deep-research</category>
        
        <category>agents</category>
        
        <category>claude-code</category>
        
        
        <category>tools</category>
        
        <category>agents</category>
        
      </item>
    
      <item>
        <title>Some Notes on Gradient Estimation</title>
        <description>&lt;p&gt;Assume we have a scalar function $f(x)$ of interest, such as a reward we want to maximise or a loss we want to minimise, and that $x$ is drawn from a distribution $p_{\theta}(x)$ parameterised by $\theta$.
A natural quantity to study is the expected value of $f$ under this distribution,&lt;/p&gt;

\[J(\theta) = \mathbb{E}_{x \sim p_{\theta}(x)}[f(x)],\]

&lt;p&gt;and we would like to optimise $J(\theta)$ with respect to $\theta$.
If we want to do this via gradient-based optimisation, we need a way to compute, or at least estimate, the gradient $\nabla_{\theta} J(\theta)$.
The difficulty is that $\theta$ enters $J$ through the sampling distribution rather than through the integrand directly, so we cannot simply differentiate $f$ and call it a day.
The rest of these notes collects some of the standard tricks for getting around this.&lt;/p&gt;

&lt;h3 id=&quot;reinforce-based-estimators&quot;&gt;REINFORCE-Based Estimators&lt;/h3&gt;

&lt;h4 id=&quot;reinforce&quot;&gt;REINFORCE&lt;/h4&gt;

&lt;p&gt;Assuming $f$ depends on $\theta$ only through the sample $x$,&lt;/p&gt;

\[\begin{aligned}
\nabla_{\theta} J(\theta)
&amp;amp;= \nabla_{\theta} \int p_{\theta}(x) f(x)\, dx \\
&amp;amp;= \int f(x)\, \nabla_{\theta} p_{\theta}(x)\, dx \\
&amp;amp;= \int f(x)\, p_{\theta}(x)\, \nabla_{\theta} \log p_{\theta}(x)\, dx \\
&amp;amp;= \mathbb{E}_{x \sim p_{\theta}(x)} \left[ f(x)\, \nabla_{\theta} \log p_{\theta}(x) \right].
\end{aligned}\]

&lt;p&gt;This is the REINFORCE, or score-function, gradient estimator
(&lt;a href=&quot;https://doi.org/10.1007/BF00992696&quot;&gt;Williams, 1992&lt;/a&gt;).&lt;/p&gt;

&lt;h4 id=&quot;reinforce--baseline&quot;&gt;REINFORCE + Baseline&lt;/h4&gt;

&lt;p&gt;For any baseline $b$ that does not depend on the sampled variable $x$,&lt;/p&gt;

\[\mathbb{E}_{x \sim p_{\theta}(x)}
\left[
\left(
f(x) - b
\right)
\nabla_{\theta} \log p_{\theta}(x)
\right]
=
\nabla_{\theta} J(\theta),\]

&lt;p&gt;because&lt;/p&gt;

\[\mathbb{E}_{x \sim p_{\theta}(x)}
\left[
b \nabla_{\theta} \log p_{\theta}(x)
\right]
=
b \nabla_{\theta} \int p_{\theta}(x)\, dx
= 0.\]

&lt;p&gt;So subtracting such a baseline preserves unbiasedness.
For a scalar parameter, writing $s(x) = \nabla_{\theta} \log p_{\theta}(x)$, the variance is&lt;/p&gt;

\[\operatorname{Var}\left[(f(x) - b)s(x)\right]
=
\mathbb{E}\left[(f(x) - b)^{2} s(x)^{2}\right]
- \left(\nabla_{\theta} J(\theta)\right)^{2}.\]

&lt;p&gt;As a function of $b$, this is a quadratic:&lt;/p&gt;

\[\mathbb{E}[s(x)^{2}]\, b^{2}
- 2 \mathbb{E}[f(x) s(x)^{2}]\, b
+ \text{const},\]

&lt;p&gt;so it is minimized at&lt;/p&gt;

\[b^{\star}
=
\frac{\mathbb{E}[f(x) s(x)^{2}]}{\mathbb{E}[s(x)^{2}]}.\]

&lt;p&gt;Therefore an appropriate baseline reduces variance, and the optimal constant baseline is $b^{\star}$.&lt;/p&gt;

&lt;h4 id=&quot;proximal-policy-optimization&quot;&gt;Proximal Policy Optimization&lt;/h4&gt;

&lt;p&gt;Let $p_{\mathrm{old}}$ be the sampling policy and let $A(x)$ be an advantage estimate treated as constant with respect to $\theta$.
The unclipped policy-gradient surrogate is&lt;/p&gt;

\[L_{\mathrm{PG}}(\theta)
=
\mathbb{E}_{x \sim p_{\mathrm{old}}(x)}
\left[
r_{\theta}(x)\, A(x)
\right],
\qquad
r_{\theta}(x)
=
\frac{p_{\theta}(x)}{p_{\mathrm{old}}(x)}.\]

&lt;p&gt;Then&lt;/p&gt;

\[\begin{aligned}
\nabla_{\theta} L_{\mathrm{PG}}(\theta)
&amp;amp;=
\mathbb{E}_{x \sim p_{\mathrm{old}}(x)}
\left[
A(x)\, \nabla_{\theta} r_{\theta}(x)
\right] \\
&amp;amp;=
\mathbb{E}_{x \sim p_{\mathrm{old}}(x)}
\left[
A(x)\, r_{\theta}(x)\, \nabla_{\theta} \log p_{\theta}(x)
\right].
\end{aligned}\]

&lt;p&gt;So PPO is just an importance-weighted REINFORCE estimator.
Its clipped objective replaces $r_{\theta}(x) A(x)$ by&lt;/p&gt;

\[\min
\left(
r_{\theta}(x) A(x),
\operatorname{clip}(r_{\theta}(x), 1-\varepsilon, 1+\varepsilon) A(x)
\right),\]

&lt;p&gt;which keeps the same REINFORCE form when the ratio is unclipped and truncates the update when it is too large
(&lt;a href=&quot;https://arxiv.org/abs/1707.06347&quot;&gt;Schulman et al., 2017&lt;/a&gt;).&lt;/p&gt;

&lt;h4 id=&quot;group-relative-policy-optimization&quot;&gt;Group Relative Policy Optimization&lt;/h4&gt;

&lt;p&gt;For the same prompt $q$, let $o_{1}, \ldots, o_{G} \sim \pi_{\mathrm{old}}(\cdot \mid q)$ be a group of sampled outputs with rewards $R_{1}, \ldots, R_{G}$.
GRPO uses the group-normalized advantage&lt;/p&gt;

\[\hat{A}_{i}
=
\frac{R_{i} - \operatorname{mean}(R_{1}, \ldots, R_{G})}
{\operatorname{std}(R_{1}, \ldots, R_{G})}.\]

&lt;p&gt;Ignoring clipping and the reference-policy KL term, the surrogate objective is&lt;/p&gt;

\[L_{\mathrm{GRPO}}(\theta)
=
\mathbb{E}
\left[
\frac{1}{G}
\sum_{i=1}^{G}
\frac{1}{|o_{i}|}
\sum_{t=1}^{|o_{i}|}
r_{i,t}(\theta)\, \hat{A}_{i}
\right],\]

&lt;p&gt;with token-level importance ratio&lt;/p&gt;

\[r_{i,t}(\theta)
=
\frac{\pi_{\theta}(o_{i,t} \mid q, o_{i,&amp;lt;t})}
{\pi_{\mathrm{old}}(o_{i,t} \mid q, o_{i,&amp;lt;t})}.\]

&lt;p&gt;Therefore&lt;/p&gt;

\[\nabla_{\theta} L_{\mathrm{GRPO}}(\theta)
=
\mathbb{E}
\left[
\frac{1}{G}
\sum_{i=1}^{G}
\frac{1}{|o_{i}|}
\sum_{t=1}^{|o_{i}|}
\hat{A}_{i}\,
r_{i,t}(\theta)\,
\nabla_{\theta} \log \pi_{\theta}(o_{i,t} \mid q, o_{i,&amp;lt;t})
\right].\]

&lt;p&gt;So GRPO is again REINFORCE with importance weighting, but the baseline is estimated from the rewards of the other samples in the same group rather than from a critic; the full GRPO objective then adds PPO-style clipping and a KL penalty to a reference policy
(&lt;a href=&quot;https://arxiv.org/abs/2402.03300&quot;&gt;Shao et al., 2024&lt;/a&gt;).&lt;/p&gt;

&lt;h4 id=&quot;reinforce-leave-one-out&quot;&gt;REINFORCE Leave-One-Out&lt;/h4&gt;

&lt;p&gt;For $K$ independent samples $x_{1}, \ldots, x_{K} \sim p_{\theta}(x)$, define the leave-one-out baseline&lt;/p&gt;

\[b_{k} = \frac{1}{K-1} \sum_{j \neq k} f(x_{j}).\]

&lt;p&gt;Then&lt;/p&gt;

\[\widehat{\nabla_{\theta} J}
=
\frac{1}{K}
\sum_{k=1}^{K}
\left[
\left(
f(x_{k}) - b_{k}
\right)
\nabla_{\theta} \log p_{\theta}(x_{k})
\right].\]

&lt;p&gt;This is unbiased because, conditional on $x_{-k}$, the baseline $b_{k}$ does not depend on $x_{k}$, so&lt;/p&gt;

\[\mathbb{E}
\left[
b_{k} \nabla_{\theta} \log p_{\theta}(x_{k})
\mid x_{-k}
\right]
=
b_{k}\,
\mathbb{E}
\left[
\nabla_{\theta} \log p_{\theta}(x_{k})
\right]
= 0.\]

&lt;p&gt;This is the leave-one-out control-variate idea underlying REINFORCE Leave-One-Out (RLOO) and VIMCO
(&lt;a href=&quot;https://proceedings.mlr.press/v48/mnihb16.html&quot;&gt;Mnih and Rezende, 2016&lt;/a&gt;;
&lt;a href=&quot;https://aclanthology.org/2024.acl-long.662/&quot;&gt;Ahmadian et al., 2024&lt;/a&gt;).&lt;/p&gt;

&lt;h4 id=&quot;augment-reinforce-merge--disarm&quot;&gt;Augment-REINFORCE-Merge / DisARM&lt;/h4&gt;

&lt;p&gt;For a Bernoulli random variable $z \sim \mathrm{Bernoulli}(\sigma(\phi))$, the Augment-REINFORCE-Merge (ARM) estimator is&lt;/p&gt;

\[\nabla_{\phi} \mathbb{E}[f(z)]
=
\mathbb{E}_{u \sim \mathrm{Uniform}(0, 1)}
\left[
\left(
f(\mathbf{1}[u &amp;gt; \sigma(-\phi)])
-
f(\mathbf{1}[u &amp;lt; \sigma(\phi)])
\right)
\left(
u - \frac{1}{2}
\right)
\right].\]

&lt;p&gt;DisARM integrates out the continuous augmentation and, for Bernoulli logits $\alpha_{\theta}$ and an antithetic pair $(b, \tilde{b})$, yields&lt;/p&gt;

\[\widehat{\nabla_{\theta} J}
=
\sum_{i}
\left[
\frac{1}{2}
\left(
f(b) - f(\tilde{b})
\right)
(-1)^{\tilde{b}_{i}}
\mathbf{1}[b_{i} \neq \tilde{b}_{i}]
\sigma(|(\alpha_{\theta})_{i}|)
\nabla_{\theta} (\alpha_{\theta})_{i}
\right].\]

&lt;p&gt;These are unbiased binary estimators based on antithetic sampling
(&lt;a href=&quot;https://arxiv.org/abs/1807.11143&quot;&gt;Yin and Zhou, 2019&lt;/a&gt;;
&lt;a href=&quot;https://arxiv.org/abs/2006.10680&quot;&gt;Dong et al., 2020&lt;/a&gt;).&lt;/p&gt;

&lt;h4 id=&quot;rebar&quot;&gt;REBAR&lt;/h4&gt;

&lt;p&gt;Let $b = H(z)$ with $z \sim p(z \mid \theta)$ and let $\tilde{z} \sim p(z \mid b, \theta)$ be a conditional relaxed sample.
With the control variate $c(z) = \eta f(\sigma_{\lambda}(z))$, REBAR uses the unbiased estimator&lt;/p&gt;

\[\widehat{\nabla_{\theta} J}
=
\left[
f(b) - \eta f(\sigma_{\lambda}(\tilde{z}))
\right]
\nabla_{\theta} \log p(b \mid \theta)
+
\eta \nabla_{\theta} f(\sigma_{\lambda}(z))
-
\eta \nabla_{\theta} f(\sigma_{\lambda}(\tilde{z})).\]

&lt;p&gt;So REBAR is REINFORCE plus a debiased continuous-relaxation control variate
(&lt;a href=&quot;https://arxiv.org/abs/1703.07370&quot;&gt;Tucker et al., 2017&lt;/a&gt;).&lt;/p&gt;

&lt;h4 id=&quot;relax&quot;&gt;RELAX&lt;/h4&gt;

&lt;p&gt;RELAX replaces the hand-designed REBAR control variate by a learned differentiable surrogate $c_{\phi}$.
With the same notation as above,&lt;/p&gt;

\[\widehat{\nabla_{\theta} J}
=
\left[
f(b) - c_{\phi}(\tilde{z})
\right]
\nabla_{\theta} \log p(b \mid \theta)
+
\nabla_{\theta} c_{\phi}(z)
-
\nabla_{\theta} c_{\phi}(\tilde{z}).\]

&lt;p&gt;This estimator is unbiased for any $c_{\phi}$; the point is to learn $c_{\phi}$ so as to reduce variance
(&lt;a href=&quot;https://arxiv.org/abs/1711.00123&quot;&gt;Grathwohl et al., 2018&lt;/a&gt;).&lt;/p&gt;

&lt;h3 id=&quot;pathwise-derivative--reparameterization-trick&quot;&gt;Pathwise Derivative / Reparameterization Trick&lt;/h3&gt;

&lt;p&gt;If we can write the random variable as a differentiable transformation
$x = g_{\theta}(\varepsilon)$ with $\varepsilon \sim p(\varepsilon)$ independent of $\theta$, then&lt;/p&gt;

\[\begin{aligned}
\nabla_{\theta} J(\theta)
&amp;amp;= \nabla_{\theta} \mathbb{E}_{\varepsilon \sim p(\varepsilon)}
\left[
f(g_{\theta}(\varepsilon))
\right] \\
&amp;amp;= \mathbb{E}_{\varepsilon \sim p(\varepsilon)}
\left[
\nabla_{\theta} f(g_{\theta}(\varepsilon))
\right] \\
&amp;amp;= \mathbb{E}_{\varepsilon \sim p(\varepsilon)}
\left[
\frac{\partial f}{\partial x}
\frac{\partial g_{\theta}(\varepsilon)}{\partial \theta}
\right].
\end{aligned}\]

&lt;p&gt;This is the pathwise, or reparameterization, gradient estimator
(&lt;a href=&quot;https://arxiv.org/abs/1312.6114&quot;&gt;Kingma and Welling, 2013&lt;/a&gt;;
&lt;a href=&quot;https://arxiv.org/abs/1401.4082&quot;&gt;Rezende et al., 2014&lt;/a&gt;).&lt;/p&gt;

&lt;h3 id=&quot;evolution-strategies&quot;&gt;Evolution Strategies&lt;/h3&gt;

&lt;p&gt;For a Gaussian search distribution over parameters,&lt;/p&gt;

\[J(\theta) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)}[f(\theta + \sigma \varepsilon)]
= \mathbb{E}_{x \sim \mathcal{N}(\theta, \sigma^{2} I)}[f(x)].\]

&lt;p&gt;Applying the same identity,&lt;/p&gt;

\[\begin{aligned}
\nabla_{\theta} J(\theta)
&amp;amp;= \mathbb{E}_{x \sim \mathcal{N}(\theta, \sigma^{2} I)}
\left[
f(x)\, \nabla_{\theta} \log \mathcal{N}(x; \theta, \sigma^{2} I)
\right] \\
&amp;amp;= \mathbb{E}_{x \sim \mathcal{N}(\theta, \sigma^{2} I)}
\left[
f(x)\, \frac{x - \theta}{\sigma^{2}}
\right] \\
&amp;amp;= \frac{1}{\sigma}\,
\mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)}
\left[
f(\theta + \sigma \varepsilon)\, \varepsilon
\right].
\end{aligned}\]

&lt;p&gt;Equivalently, if we define the scaled perturbation $\delta = \sigma \varepsilon$, then
$\delta \sim \mathcal{N}(0, \sigma^{2} I)$ and&lt;/p&gt;

\[\nabla_{\theta} J(\theta)
= \frac{1}{\sigma^{2}}\,
\mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^{2} I)}
\left[
f(\theta + \delta)\, \delta
\right].\]

&lt;p&gt;This is the same estimator: it is only a change of variables.
Indeed, substituting $\delta = \sigma \varepsilon$ gives&lt;/p&gt;

\[\frac{1}{\sigma}\,
\mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)}
\left[
f(\theta + \sigma \varepsilon)\, \varepsilon
\right]
=
\frac{1}{\sigma^{2}}\,
\mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^{2} I)}
\left[
f(\theta + \delta)\, \delta
\right].\]

&lt;p&gt;So Evolution Strategies are just REINFORCE applied to a Gaussian distribution over parameters; the standardized-noise form $\varepsilon \sim \mathcal{N}(0, I)$ is the one used by
(&lt;a href=&quot;https://arxiv.org/abs/1703.03864&quot;&gt;Salimans et al., 2017&lt;/a&gt;),
while the scaled-perturbation form $\delta \sim \mathcal{N}(0, \sigma^{2} I)$ is the notation used by
(&lt;a href=&quot;https://www.inference.vc/evolutionary-strategies-embarrassingly-parallelizable-optimization/&quot;&gt;Huszár, 2017&lt;/a&gt;)
and
(&lt;a href=&quot;https://davidbarber.github.io/blog/2017/04/03/variational-optimisation/&quot;&gt;Barber, 2017&lt;/a&gt;).&lt;/p&gt;

&lt;h3 id=&quot;straight-through-estimation&quot;&gt;Straight-Through Estimation&lt;/h3&gt;

&lt;p&gt;Let $z = H(a_{\theta}(\xi))$ be a hard discrete decision, where $H$ is a step function or an $\arg\max$.
The exact pathwise derivative is unusable because $\partial H / \partial a$ is zero almost everywhere or undefined.
The Straight-Through Estimator (STE) replaces the backward derivative by that of a smooth surrogate $\tilde{H}$:&lt;/p&gt;

\[\frac{\partial z}{\partial a}
\; := \;
\frac{\partial \tilde{H}(a)}{\partial a}.\]

&lt;p&gt;Therefore, for $J(\theta) = \mathbb{E}_{\xi}[L(z)]$,&lt;/p&gt;

\[\nabla_{\theta} J(\theta)
\approx
\mathbb{E}_{\xi}
\left[
\frac{\partial L}{\partial z}
\frac{\partial \tilde{H}(a_{\theta}(\xi))}{\partial a}
\frac{\partial a_{\theta}(\xi)}{\partial \theta}
\right].\]

&lt;p&gt;This is not an exact identity: STE is a biased surrogate-gradient estimator
(&lt;a href=&quot;https://arxiv.org/abs/1308.3432&quot;&gt;Bengio et al., 2013&lt;/a&gt;).&lt;/p&gt;

&lt;h3 id=&quot;implicit-maximum-likelihood-estimation&quot;&gt;Implicit Maximum Likelihood Estimation&lt;/h3&gt;

&lt;p&gt;Let $\operatorname{MAP}(\theta)$ denote the maximum a posteriori solution, i.e. the feasible discrete structure that maximizes the score induced by $\theta$.
In practice this can be top-$k$ selection, shortest-path inference via Dijkstra, or an integer linear programming solver.
If we add random perturbations $\varepsilon \sim \rho$ to the scores and solve the perturbed optimization problem, we obtain&lt;/p&gt;

\[z_{\theta}(\varepsilon) = \operatorname{MAP}(\theta + \varepsilon),
\qquad
\varepsilon \sim \rho.\]

&lt;p&gt;The map $\varepsilon \mapsto \operatorname{MAP}(\theta + \varepsilon)$ pushes the perturbation law $\rho$ forward to an implicit distribution over feasible solutions,
\(p_{\mathrm{PM}}(z; \theta)
=
\mathbb{P}_{\varepsilon \sim \rho}
\left[
z = \operatorname{MAP}(\theta + \varepsilon)
\right].\)
Sampling from this implicit distribution by drawing $\varepsilon \sim \rho$ and solving one perturbed MAP problem is the perturb-and-MAP construction introduced by
(&lt;a href=&quot;https://doi.org/10.1109/ICCV.2011.6126242&quot;&gt;Papandreou and Yuille, 2011&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Let $g = \nabla_{z} f(z_{\theta}(\varepsilon))$ be the downstream gradient and define target parameters&lt;/p&gt;

\[\theta^\prime = \theta - \lambda g.\]

&lt;p&gt;Here $\theta^\prime$ is chosen so as to move in a direction expected to reduce the downstream loss.
This is the key idea behind perturbation-based implicit differentiation: in the smooth-marginal setting, perturbing parameters in the negative downstream-gradient direction yields a lower-loss solution
(&lt;a href=&quot;https://papers.nips.cc/paper/4107-implicit-differentiation-by-perturbation&quot;&gt;Domke, 2010&lt;/a&gt;),
and I-MLE adapts this idea to the discrete perturb-and-MAP setting.&lt;/p&gt;

&lt;p&gt;For a constrained exponential-family distribution&lt;/p&gt;

\[p(z; \theta) \propto \exp(\langle z, \theta \rangle - A(\theta)),\]

&lt;p&gt;the negative log-likelihood of one observation $\hat{z}$ is&lt;/p&gt;

\[\ell(\theta; \hat{z}) = A(\theta) - \langle \hat{z}, \theta \rangle,\]

&lt;p&gt;so its gradient is&lt;/p&gt;

\[\nabla_{\theta} \ell(\theta; \hat{z})
=
\nabla_{\theta} A(\theta) - \hat{z}
=
\mu(\theta) - \hat{z},\]

&lt;p&gt;where $\mu(\theta) = \mathbb{E}_{p(z; \theta)}[z]$.
More generally, if the target is itself a distribution $q(z; \theta^\prime)$ from the same exponential family, the implicit maximum-likelihood objective is&lt;/p&gt;

\[L(\theta, \theta^\prime)
=
-\mathbb{E}_{z \sim q(z; \theta^\prime)}[\log p(z; \theta)],\]

&lt;p&gt;and its gradient is&lt;/p&gt;

\[\nabla_{\theta} L(\theta, \theta^\prime)
=
\mu(\theta) - \mathbb{E}_{z \sim q(z; \theta^\prime)}[z]
=
\mu(\theta) - \mu(\theta^\prime).\]

&lt;p&gt;Implicit Maximum Likelihood Estimation (I-MLE) chooses the target distribution $q$ through $\theta^\prime$ and then approximates these marginals by perturb-and-MAP samples:&lt;/p&gt;

\[\widehat{\nabla_{\theta} J}
\propto
z_{\theta}(\varepsilon) - z_{\theta^\prime}(\varepsilon)
=
\operatorname{MAP}(\theta + \varepsilon)
-
\operatorname{MAP}(\theta - \lambda g + \varepsilon).\]

&lt;p&gt;So the I-MLE estimator is obtained by taking the maximum-likelihood gradient
\(\mu(\theta) - \mu(\theta^\prime)\)
for exponential-family distributions and replacing the intractable marginals by perturb-and-MAP approximations
\(\mu(\theta) \approx \mathbb{E}_{\varepsilon}[z_{\theta}(\varepsilon)],
\qquad
\mu(\theta^\prime) \approx \mathbb{E}_{\varepsilon}[z_{\theta^\prime}(\varepsilon)].\)&lt;/p&gt;

&lt;p&gt;With multiple independent perturbations, one uses&lt;/p&gt;

\[\widehat{\nabla_{\theta} J}
=
\frac{1}{S}
\sum_{i=1}^{S}
\left[
\operatorname{MAP}(\theta + \varepsilon_{i})
-
\operatorname{MAP}(\theta - \lambda g_{i} + \varepsilon_{i})
\right],
\qquad
\varepsilon_{i} \sim \rho,
\qquad
g_{i} = \nabla_{z} f(z_{\theta}(\varepsilon_{i})),\]

&lt;p&gt;with the factor $1 / \lambda$ often absorbed into the learning rate.
This is a biased finite-difference estimator based on an implicit target distribution
(&lt;a href=&quot;https://arxiv.org/abs/2106.01798&quot;&gt;Niepert et al., 2021&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Adaptive I-MLE (AIMLE) extends this by adapting the perturbation scale $\lambda$ during training, explicitly trading off sparsity and bias in the estimate, and by also considering centered finite differences
(&lt;a href=&quot;https://arxiv.org/abs/2209.04862&quot;&gt;Minervini et al., 2022&lt;/a&gt;).&lt;/p&gt;

&lt;h3 id=&quot;gumbel-softmax--gumbel-sigmoid&quot;&gt;Gumbel-Softmax / Gumbel-Sigmoid&lt;/h3&gt;

&lt;p&gt;For a categorical variable with logits $\ell_{\theta}$, sample i.i.d. Gumbel noise
$g_{i} = - \log(-\log u_{i})$ with $u_{i} \sim \mathrm{Uniform}(0, 1)$, and define&lt;/p&gt;

\[y_{i}
=
\frac{\exp((\ell_{\theta, i} + g_{i}) / \tau)}
{\sum_{j} \exp((\ell_{\theta, j} + g_{j}) / \tau)}.\]

&lt;p&gt;Then $y$ is a differentiable function of $\theta$ and noise $g$, so the pathwise gradient applies:&lt;/p&gt;

\[\nabla_{\theta} J(\theta)
=
\mathbb{E}_{g}
\left[
\nabla_{\theta} f(y_{\theta}(g))
\right]
=
\mathbb{E}_{g}
\left[
\frac{\partial f}{\partial y}
\frac{\partial y_{\theta}(g)}{\partial \theta}
\right].\]

&lt;p&gt;This is the pathwise, or reparameterization, gradient estimator underlying Gumbel-Softmax
(&lt;a href=&quot;https://arxiv.org/abs/1611.01144&quot;&gt;Jang et al., 2017&lt;/a&gt;;
&lt;a href=&quot;https://arxiv.org/abs/1611.00712&quot;&gt;Maddison et al., 2017&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In the binary case, often called Gumbel-Sigmoid or Binary Concrete, sample logistic noise
$\lambda = \log u - \log(1-u)$ with $u \sim \mathrm{Uniform}(0, 1)$ and define&lt;/p&gt;

\[y = \sigma \left( \frac{a_{\theta} + \lambda}{\tau} \right).\]

&lt;p&gt;Then the same reparameterization argument gives&lt;/p&gt;

\[\nabla_{\theta} \mathbb{E}_{\lambda}[f(y_{\theta}(\lambda))]
=
\mathbb{E}_{\lambda}
\left[
\frac{\partial f}{\partial y}
\frac{\partial y_{\theta}(\lambda)}{\partial \theta}
\right].\]

&lt;h3 id=&quot;measure-valued-derivatives&quot;&gt;Measure-Valued Derivatives&lt;/h3&gt;

&lt;p&gt;Suppose the derivative of the measure can be decomposed as&lt;/p&gt;

\[\nabla_{\theta_{i}} p(x; \theta)
=
c_{\theta_{i}}
\left[
p_{i}^{+}(x; \theta) - p_{i}^{-}(x; \theta)
\right].\]

&lt;p&gt;Then&lt;/p&gt;

\[\begin{aligned}
\nabla_{\theta_{i}} J(\theta)
&amp;amp;=
\int f(x)\, \nabla_{\theta_{i}} p(x; \theta)\, dx \\
&amp;amp;=
c_{\theta_{i}}
\left(
\mathbb{E}_{p_{i}^{+}(x; \theta)}[f(x)]
-
\mathbb{E}_{p_{i}^{-}(x; \theta)}[f(x)]
\right).
\end{aligned}\]

&lt;p&gt;So the gradient becomes a scaled difference of two expectations under the positive and negative parts of the weak derivative
(&lt;a href=&quot;https://www.jmlr.org/papers/v21/19-346.html&quot;&gt;Mohamed et al., 2020&lt;/a&gt;).&lt;/p&gt;
</description>
        <pubDate>Sun, 29 Mar 2026 00:00:00 +0000</pubDate>
        <link>https://neuralnoise.com///2026/some-notes-on-gradient-estimation/</link>
        <guid isPermaLink="true">https://neuralnoise.com///2026/some-notes-on-gradient-estimation/</guid>
        
        <category>gradient estimation</category>
        
        <category>reinforce</category>
        
        <category>evolution strategies</category>
        
        <category>optimization</category>
        
        
        <category>machine learning</category>
        
        <category>reinforcement learning</category>
        
        <category>optimization</category>
        
      </item>
    
      <item>
        <title>Real-World Impact of Our Research</title>
        <description>&lt;p&gt;Academic research sometimes risks to be disconnected from real-world applications. Research from our group demonstrated significant real-world impact across multiple domains, from improving the efficiency of LLM inference and training to new state-of-the-art evaluation protocols, and contributing to several industry products.&lt;/p&gt;

&lt;h2 id=&quot;industry-adoption&quot;&gt;Industry Adoption&lt;/h2&gt;

&lt;h3 id=&quot;kv-cache-compression&quot;&gt;KV Cache Compression&lt;/h3&gt;

&lt;p&gt;Our work on &lt;a href=&quot;https://arxiv.org/abs/2406.11430&quot;&gt;KV cache compression&lt;/a&gt;, presented at EMNLP 2024 as an oral presentation (top 9% of papers), has been directly adopted by NVIDIA’s &lt;a href=&quot;https://github.com/NVIDIA/kvpress&quot;&gt;KV Press&lt;/a&gt; library. This library is now widely used across the industry for reducing the inference footprint of LLMs&lt;/p&gt;

&lt;h3 id=&quot;mmlu-redux&quot;&gt;MMLU-Redux&lt;/h3&gt;

&lt;p&gt;Our &lt;a href=&quot;https://arxiv.org/abs/2406.04127&quot;&gt;MMLU-Redux benchmark&lt;/a&gt;, presented at NAACL 2025, is being widely adopted for evaluating LLMs across multiple frontier labs. After identifying a concerning 57% error rate in the original MMLU benchmark’s Virology subset, we created a manually curated, expert-verified dataset of 5,700 questions. Some industry adopters include &lt;strong&gt;&lt;a href=&quot;https://huggingface.co/deepseek-ai/DeepSeek-R1&quot;&gt;DeepSeek&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href=&quot;https://arxiv.org/abs/2505.09388&quot;&gt;Alibaba Qwen&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href=&quot;https://github.com/Tencent-Hunyuan/Hunyuan-A13B&quot;&gt;Tencent Hunyuan&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href=&quot;https://github.com/MoonshotAI/Kimi-K2&quot;&gt;Moonshot KIMI&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href=&quot;https://github.com/NVIDIA/NeMo-Skills&quot;&gt;NVIDIA NeMo Skills&lt;/a&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;a href=&quot;https://github.com/LG-AI-EXAONE/EXAONE-4.0&quot;&gt;LG AI Research EXAONE&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h3 id=&quot;frontier-model-pre-training&quot;&gt;Frontier Model Pre-Training&lt;/h3&gt;

&lt;p&gt;One of our pre-training techniques from “&lt;a href=&quot;https://arxiv.org/abs/2402.13991&quot;&gt;Analysing The Impact of Sequence Composition on Language Model Pre-Training&lt;/a&gt;” (a project led by &lt;a href=&quot;https://yuzhaouoe.github.io/&quot;&gt;Yu Zhao&lt;/a&gt; and presented at ACL 2024), specifically &lt;strong&gt;intra-document causal masking&lt;/strong&gt;, was adopted by Meta in training &lt;a href=&quot;https://ai.meta.com/blog/meta-llama-3/&quot;&gt;Llama3&lt;/a&gt;, their flagship line of LLMs, by Hugging Face for training their &lt;a href=&quot;https://huggingface.co/blog/smollm3&quot;&gt;SmolLM3 models&lt;/a&gt;, and by Allen AI for training their &lt;a href=&quot;https://arxiv.org/abs/2512.13961&quot;&gt;OLMo 3 models&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;complex-query-answering&quot;&gt;Complex Query Answering&lt;/h3&gt;

&lt;p&gt;Our work on answering complex queries on large and incomplete Knowledge Graphs (“&lt;a href=&quot;https://arxiv.org/abs/2011.03459&quot;&gt;Complex Query Answering with Neural Link Predictors&lt;/a&gt;”, which received an &lt;strong&gt;&lt;a href=&quot;https://iclr-conf.medium.com/announcing-iclr-2021-outstanding-paper-awards-9ae0514734ab&quot;&gt;Outstanding Paper Award&lt;/a&gt;&lt;/strong&gt; at ICLR 2021), is also having some impact beyond academia. Based on personal communications, our complex query answering techniques are being adopted by several tech companies to develop their AI products (I am happy to discuss this verbally). We are now continuing this line of research – for example, see our &lt;a href=&quot;https://arxiv.org/abs/2301.12313&quot;&gt;NeurIPS 2023&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/2410.12537&quot;&gt;ICML 2025&lt;/a&gt; papers on this topic.&lt;/p&gt;

&lt;h3 id=&quot;mmlongbench&quot;&gt;MMLongBench&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2505.10610&quot;&gt;MMLongBench&lt;/a&gt;, a benchmark for evaluating multimodal long-context models by &lt;a href=&quot;https://zhaowei-wang-nlp.github.io/&quot;&gt;Zhaowei Wang&lt;/a&gt; et al., is being adopted by &lt;strong&gt;Xiaomi&lt;/strong&gt; in their &lt;a href=&quot;https://arxiv.org/abs/2506.03569&quot;&gt;MiMo-VL&lt;/a&gt; project, a vision-language model for general visual understanding and multimodal reasoning.&lt;/p&gt;

&lt;h2 id=&quot;media-coverage&quot;&gt;Media Coverage&lt;/h2&gt;

&lt;p&gt;Our research received some media coverage. For example, our work “&lt;a href=&quot;https://arxiv.org/abs/2502.05092&quot;&gt;Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs&lt;/a&gt;”, led by the amazing &lt;a href=&quot;https://saxenarohit.github.io/&quot;&gt;Rohit Saxena&lt;/a&gt;, received extensive media coverage, including &lt;a href=&quot;https://www.nytimes.com/2025/05/05/technology/ai-hallucinations-chatgpt-google.html&quot;&gt;&lt;strong&gt;The New York Times&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://www.vice.com/en/article/ai-still-cant-tell-time-or-read-a-calendar/&quot;&gt;&lt;strong&gt;VICE&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://gizmodo.com/ai-sucks-at-reading-clocks-2000576329&quot;&gt;&lt;strong&gt;Gizmodo&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://www.theengineer.co.uk/content/news/ai-systems-fail-to-read-clocks-and-decode-calendars/&quot;&gt;&lt;strong&gt;The Engineer&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://uk.news.yahoo.com/edinburgh-university-study-says-most-080351077.html&quot;&gt;&lt;strong&gt;Yahoo! News&lt;/strong&gt;&lt;/a&gt;, and &lt;a href=&quot;https://techxplore.com/&quot;&gt;&lt;strong&gt;Tech Xplore&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Our project “&lt;a href=&quot;https://arxiv.org/abs/2507.14417&quot;&gt;Inverse Scaling in Test-Time Compute&lt;/a&gt;” led by &lt;a href=&quot;https://aryopg.github.io/&quot;&gt;Aryo Gema&lt;/a&gt; in collaboration with &lt;a href=&quot;https://alignment.anthropic.com/2025/inverse-scaling/&quot;&gt;Anthropic&lt;/a&gt;, also received some media attention, with coverage in &lt;a href=&quot;https://venturebeat.com/ai/anthropic-researchers-discover-the-weird-ai-problem-why-thinking-longer-makes-models-dumber/&quot;&gt;&lt;strong&gt;VentureBeat&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://yourstory.com/ai-story/anthropic-inverse-scaling-test-time-compute&quot;&gt;&lt;strong&gt;YourStory&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://www.aitimes.com/news/articleView.html?idxno=200904&quot;&gt;&lt;strong&gt;AI Times&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://www.techzine.eu/news/applications/133252/thinking-too-long-makes-ai-models-dumber/&quot;&gt;&lt;strong&gt;TechZine&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://theinterviewtimes.com/anthropic-study-reveals-ai-inverse-scaling/&quot;&gt;&lt;strong&gt;The Interview Times&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://www.ainews.com/p/anthropic-finds-longer-ai-reasoning-can-hurt-model-performance&quot;&gt;&lt;strong&gt;AI News&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&quot;https://www.marktechpost.com/2025/07/30/too-much-thinking-can-break-llms-inverse-scaling-in-test-time-compute/&quot;&gt;&lt;strong&gt;MarkTechPost&lt;/strong&gt;&lt;/a&gt;, and others.&lt;/p&gt;
</description>
        <pubDate>Tue, 15 Jul 2025 01:00:00 +0100</pubDate>
        <link>https://neuralnoise.com///2025/research-impact/</link>
        <guid isPermaLink="true">https://neuralnoise.com///2025/research-impact/</guid>
        
        <category>machine learning</category>
        
        <category>natural language processing</category>
        
        <category>research</category>
        
        <category>impact</category>
        
        <category>industry</category>
        
        <category>real-world</category>
        
        
        <category>machine learning</category>
        
        <category>natural language processing</category>
        
        <category>research</category>
        
        <category>impact</category>
        
        <category>industry</category>
        
      </item>
    
      <item>
        <title>March 2025 in Research</title>
        <description>&lt;p&gt;We have been working on language model evaluation, knowledge utilization, efficiency, and multimodal reasoning. We had papers at &lt;strong&gt;ICLR 2025&lt;/strong&gt;, &lt;strong&gt;NAACL 2025 (x3)&lt;/strong&gt;, &lt;strong&gt;AAAI 2025&lt;/strong&gt;, and others, along with several ongoing works.&lt;/p&gt;

&lt;h3 id=&quot;naacl2025-controlling-knowledge--reasoning&quot;&gt;NAACL 2025 – Controlling Knowledge &amp;amp; Reasoning&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2410.15999&quot;&gt;Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering&lt;/a&gt;, by &lt;a href=&quot;https://huggingface.co/yuzhaouoe&quot;&gt;Yu Zhao&lt;/a&gt; et al. – We introduce &lt;strong&gt;SpARE&lt;/strong&gt;, a training‑free method to control whether an LLM relies on its internal parametric knowledge or given context when conflicts arise. By analyzing mid‑layer activations with sparse autoencoders, &lt;strong&gt;SpARE&lt;/strong&gt; identifies conflict signals and manipulates them to steer the model at inference time, significantly improving performance on open‑domain QA compared to prior methods. &lt;em&gt;(Oral presentation)&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2406.04127&quot;&gt;Are We Done with MMLU?&lt;/a&gt;, by &lt;a href=&quot;https://aryopg.github.io/&quot;&gt;Aryo Gema&lt;/a&gt; and many others – We analyze the Massive Multitask Language Understanding benchmark, uncovering a fairly high error rate – for example, in the &lt;em&gt;Virology&lt;/em&gt; subset, 57% of sampled questions had issues. We introduce &lt;strong&gt;MMLU‑Redux&lt;/strong&gt;, a manually curated subset of 5,700 expert‑verified questions, and show that corrected evaluations can substantially alter model rankings. MMLU‑Redux is open‑sourced also adopted for example by &lt;a href=&quot;https://arxiv.org/abs/2412.19437&quot;&gt;DeepSeek&lt;/a&gt; and &lt;a href=&quot;https://arxiv.org/abs/2412.15115&quot;&gt;Qwen&lt;/a&gt;!&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2502.05867&quot;&gt;Self-Training Large Language Models for Tool-Use Without Demonstrations&lt;/a&gt;, based on &lt;a href=&quot;https://twitter.com/neluo19&quot;&gt;Ne Luo&lt;/a&gt;’s MSc project – We explore whether LLMs can learn tool usage (e.g., search engines, calculators) without hand‑crafted examples. Starting with zero‑shot prompts, we generate synthetic tool‑using traces and then fine‑tune the model with them. On PopQA, the self‑trained model gains +3.7% accuracy, though results vary on other datasets, highlighting both promise and challenges in autonomous tool‑use learning. &lt;a href=&quot;https://twitter.com/neluo19&quot;&gt;Ne Luo&lt;/a&gt; is looking for a PhD position, &lt;a href=&quot;https://twitter.com/neluo19&quot;&gt;contact her&lt;/a&gt; if you are interested in working with her!&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;iclr2025-learning--evaluation&quot;&gt;ICLR 2025 – Learning &amp;amp; Evaluation&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2410.19406&quot;&gt;An Auditing Test to Detect Behavioral Shift in Language Models&lt;/a&gt;, by the amazing &lt;a href=&quot;https://scholar.google.com/citations?user=1BMnCH0AAAAJ&amp;amp;hl=en&quot;&gt;Leo Richter&lt;/a&gt; – We propose a method for continual &lt;strong&gt;Behavioral Shift Auditing (BSA)&lt;/strong&gt; of LLMs. This statistical test monitors an LLM’s outputs for significant deviations from a reference model’s behavior, with theoretical guarantees on detecting genuine shifts while avoiding false alarms. Our BSA approach relies on catching subtle changes in a model’s toxicity and translation performance after fine-tuning, using only a few hundred examples, offering a practical tool to ensure that an LLM remains aligned during its deployment/lifetime.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;reasoning-and-planning-for-large-language-models&quot;&gt;Reasoning and Planning for Large Language Models&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2502.05092&quot;&gt;Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs&lt;/a&gt;, by &lt;a href=&quot;https://saxenarohit.github.io/&quot;&gt;Rohit Saxena&lt;/a&gt; et al. – We introduce &lt;strong&gt;ClockQA&lt;/strong&gt; and &lt;strong&gt;CalendarQA&lt;/strong&gt; for testing multimodal LLMs’ temporal reasoning from images, revealing widespread failures and motivating models with better time‑date understanding. This article got plenty of media coverage – e.g., on &lt;a href=&quot;https://gizmodo.com/ai-sucks-at-reading-clocks-2000576329&quot;&gt;Gizmodo&lt;/a&gt;, &lt;a href=&quot;https://www.theengineer.co.uk/content/news/ai-systems-fail-to-read-clocks-and-decode-calendars/&quot;&gt;The Engineer&lt;/a&gt;, &lt;a href=&quot;https://uk.news.yahoo.com/edinburgh-university-study-says-most-080351077.html&quot;&gt;Yahoo! News&lt;/a&gt;, &lt;a href=&quot;https://www.vice.com/en/article/ai-still-cant-tell-time-or-read-a-calendar/&quot;&gt;VICE&lt;/a&gt;, &lt;a href=&quot;https://techxplore.com/news/2025-03-ai-struggles-clocks-calendars.html&quot;&gt;Tech Xplore&lt;/a&gt;, and many others!&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;aaai2025-efficient-inference&quot;&gt;AAAI 2025 – Efficient Inference&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2312.10193&quot;&gt;Adaptive Computation Modules: Granular Conditional Computation for Efficient Inference&lt;/a&gt;, by &lt;a href=&quot;https://scholar.google.com/citations?user=2HBGrsEAAAAJ&amp;amp;hl=en&quot;&gt;Bartosz Wójcik&lt;/a&gt;, &lt;a href=&quot;https://alessiodevoto.github.io/&quot;&gt;Alessio Devoto&lt;/a&gt;, et al. – We propose &lt;strong&gt;Adaptive Computation Modules (ACMs)&lt;/strong&gt; for dynamic, per‑token computation in Transformers. ACMs consist of cascaded sub‑modules with gating functions that allow easy tokens to exit early. We propose a distillation method to retrofit pre‑trained models with ACMs, cutting inference cost without accuracy loss in vision and speech tasks, offering a plug‑and‑play approach to more efficient AI systems!&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;coling2025-multilingual-resources&quot;&gt;COLING 2025 – Multilingual Resources&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2406.14425&quot;&gt;SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages&lt;/a&gt;, by &lt;a href=&quot;https://x.com/GayaGhazaryan&quot;&gt;Gayane Ghazaryan&lt;/a&gt;, &lt;a href=&quot;https://osoblanco.github.io/&quot;&gt;Erik Arakelyan&lt;/a&gt; (who just got his PhD and joined NVIDIA! 🚀) et al. – &lt;strong&gt;SynDARin&lt;/strong&gt; synthesizes QA datasets in low‑resource languages (e.g., Armenian) by generating English questions via LLMs from parallel corpora, translating and validating them. The resulting 16,000+ QA pairs produce a challenging benchmark where models often perform near chance, highlighting critical gaps and enabling rapid evaluation in languages lacking resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;frontiers-in-ai2025-human-ai-collaboration&quot;&gt;Frontiers in AI 2025 – Human-AI Collaboration&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.frontiersin.org/articles/10.3389/frai.2023.1464690&quot;&gt;Fostering Effective Hybrid Human-LLM Reasoning and Decision Making&lt;/a&gt; – We examine frameworks combining LLMs and human judgment for complex tasks, offering design principles for AI‑assisted decision systems. Through case studies, we show that integrating LLM‑generated insights with human oversight yields more reliable and interpretable outcomes than either alone, providing guidelines for principled human‑in‑the‑loop systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;whatsbrewing&quot;&gt;What’s Brewing&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2504.02911&quot;&gt;Noiser: Bounded Input Perturbations for Attributing Large Language Models&lt;/a&gt;, by &lt;a href=&quot;https://qasemii.github.io/&quot;&gt;Reza Madani&lt;/a&gt; et al. – &lt;strong&gt;Noiser&lt;/strong&gt; perturbs input embeddings to attribute token importance, introducing an “answerability” check to validate attributions. Outperforming gradients and attention, Noiser offers robust post‑hoc explanations for LLM predictions.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2503.23415&quot;&gt;An Analysis of Decoding Methods for LLM-based Agents for Faithful Multi-Hop Question Answering&lt;/a&gt;, by Alex, Sanad, and others amazing students at the UoE – We analyse how faithfulness‑enhancing decoding (e.g., DeCoRe) within the ReAct agent framework improves multi‑hop QA, boosting HotpotQA F1 from 19.5 to 32.6, underscoring the role of decoding in reliable LLM reasoning.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2503.02812&quot;&gt;Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression&lt;/a&gt;, led by &lt;a href=&quot;https://nathangodey.github.io&quot;&gt;Nathan Godey&lt;/a&gt; – &lt;strong&gt;Q-Filters&lt;/strong&gt; uses query‑key geometric projections to filter past tokens on the fly, compressing KV cache without retraining and matching attention‑based methods like SnapKV, enabling long‑context generation with minimal memory.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2502.17540&quot;&gt;PosterSum: A Multimodal Benchmark for Scientific Poster Summarization&lt;/a&gt;, by the amazing &lt;a href=&quot;https://saxenarohit.github.io/&quot;&gt;Rohit Saxena&lt;/a&gt; – &lt;strong&gt;PosterSum&lt;/strong&gt; offers 16,000+ re search posters paired with abstracts for evaluating vision‑language summarization. Our “Segment &amp;amp; Summarize” approach secures a 3.1% ROUGE‑L gain, highlighting this benchmark’s challenge.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Thu, 17 Apr 2025 01:00:00 +0100</pubDate>
        <link>https://neuralnoise.com///2025/march-research/</link>
        <guid isPermaLink="true">https://neuralnoise.com///2025/march-research/</guid>
        
        <category>machine learning</category>
        
        <category>natural language processing</category>
        
        <category>research</category>
        
        <category>academia</category>
        
        
        <category>machine learning</category>
        
        <category>natural language processing</category>
        
        <category>research</category>
        
        <category>academia</category>
        
      </item>
    
      <item>
        <title>Postdoc Position in Multimodal Foundation Models</title>
        <description>&lt;p&gt;Amazing opportunity to join our team at the &lt;a href=&quot;https://informatics.ed.ac.uk/&quot;&gt;School of Informatics, University of Edinburgh&lt;/a&gt;! The School of Informatics is seeking a &lt;strong&gt;Postdoctoral Research Associate&lt;/strong&gt; to work on evaluating and improving multimodal foundation models, with a particular focus on Vision-Language Models (VLMs).&lt;/p&gt;

&lt;h3 id=&quot;about-the-position&quot;&gt;About the Position&lt;/h3&gt;

&lt;p&gt;This is a full-time position running until January 2029, fully funded by the &lt;a href=&quot;https://www.genai.ac.uk/&quot;&gt;AI Hub in Generative Models&lt;/a&gt;. The successful candidate will join the &lt;a href=&quot;https://edinburghnlp.inf.ed.ac.uk/&quot;&gt;Edinburgh NLP Group&lt;/a&gt;, one of the best NLP research groups in the world!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Details:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Duration:&lt;/strong&gt; Fixed-term contract until January 2029&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Application Deadline:&lt;/strong&gt; April 8th, 2025, 12:59 AM (UK time)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more details or to apply, visit &lt;a href=&quot;https://edin.ac/3DDQK1o&quot;&gt;https://edin.ac/3DDQK1o&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For informal enquiries, feel free to reach out to me directly at &lt;a href=&quot;mailto:p.minervini@ed.ac.uk&quot;&gt;p.minervini@ed.ac.uk&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/posts/genai25.png&quot; alt=&quot;Multimodal Foundation Models Research&quot; class=&quot;center-image&quot; width=&quot;80%&quot; /&gt;&lt;/p&gt;
</description>
        <pubDate>Sat, 01 Mar 2025 00:00:00 +0000</pubDate>
        <link>https://neuralnoise.com///2025/multimodal/</link>
        <guid isPermaLink="true">https://neuralnoise.com///2025/multimodal/</guid>
        
        <category>machine learning</category>
        
        <category>vision-language models</category>
        
        <category>research</category>
        
        <category>academia</category>
        
        <category>edinburgh</category>
        
        <category>postdoc</category>
        
        <category>job opening</category>
        
        
        <category>machine learning</category>
        
        <category>computer vision</category>
        
        <category>research</category>
        
        <category>academia</category>
        
        <category>edinburgh</category>
        
        <category>job opportunity</category>
        
      </item>
    
      <item>
        <title>November 2024 in Research</title>
        <description>&lt;p&gt;My amazing collaborators will be presenting three papers at &lt;a href=&quot;https://2024.emnlp.org/&quot;&gt;EMNLP 2024&lt;/a&gt; (main track), a leading conference in natural language processing, happening in Miami later this month! A few weeks ago I also blogged about our &lt;a href=&quot;https://2024.aclweb.org/&quot;&gt;ACL 2024&lt;/a&gt;, &lt;a href=&quot;https://icml.cc/&quot;&gt;ICML 2024&lt;/a&gt;, and &lt;a href=&quot;https://colmweb.org/&quot;&gt;CoLM 2024&lt;/a&gt; papers – you can check the post &lt;a href=&quot;https://neuralnoise.com/2024/research/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;our-work-at-emnlp-2024&quot;&gt;Our work at EMNLP 2024&lt;/h3&gt;

&lt;p&gt;We will be presenting three papers this year at EMNLP, a flagship NLP conference:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2406.11430&quot;&gt;A Simple and Effective $L_{2}$ Norm-Based Strategy for KV Cache Compression&lt;/a&gt;, by &lt;a href=&quot;https://huggingface.co/yuzhaouoe&quot;&gt;Yu Zhao&lt;/a&gt;, &lt;a href=&quot;https://alessiodevoto.github.io/&quot;&gt;Alessio Devoto&lt;/a&gt;, et al. – we introduce a simple strategy for compressing the Key-Value (KV) cache in large language models by utilizing the $L_{2}$ norm of key embeddings; specifically, we found a correlation between low $L_{2}$ norms and high attention scores, allowing them to identify influential KV pairs before querying.
&lt;img src=&quot;/images/papers/attentions.jpg&quot; alt=&quot;Attention Values&quot; /&gt;
&lt;img src=&quot;/images/papers/key-norms.jpg&quot; alt=&quot;Key Norms&quot; /&gt;
&lt;em&gt;For example, here we can see the attention distributions for five heads at layer 9 in Llama2-7B – we can see that the attention scores (top) and the key $L_{2}$ norms (bottom) are highly correlated.&lt;/em&gt; —
Our method effectively reduces KV cache size by up to 90% without loss of accuracy and is compatible with &lt;a href=&quot;https://github.com/Dao-AILab/flash-attention&quot;&gt;FlashAttention&lt;/a&gt;. This paper will be presented as an Oral – top 8% of the accepted papers! An extended version of this paper will also be presented at the &lt;a href=&quot;https://neurips2024-enlsp.github.io/&quot;&gt;Efficient Natural Language and Speech Processing&lt;/a&gt; workshop at NeurIPS 2024! &lt;a href=&quot;https://x.com/PMinervini/status/1856300083140047120&quot;&gt;EMNLP Poster&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2305.13214&quot;&gt;Atomic Inference for NLI with Generated Facts as Atoms&lt;/a&gt;, by &lt;a href=&quot;https://lama.doc.ic.ac.uk/team/joe&quot;&gt;Joe Stacey&lt;/a&gt; et al. – we propose an atomic inference approach for Natural Language Inference (NLI) that decomposes inputs into individual facts or atoms, and explicitly models the entailment relationships between such atoms. Furthermore, we propose a multi-stage fact generation process and a specialized training regime for incorporates such facts, achieving state-of-the-art results in several hard NLI tasks. Our best system, FGLR, produces significantly more robust and accurate results than large-scale language models while providing clear interpretability guarantees by identifying the specific atoms responsible for each prediction! Joe wrote an amazing &lt;a href=&quot;https://www.marekrei.com/blog/creating-interpretable-models-with-atomic-inference/&quot;&gt;blog post&lt;/a&gt; on this work, check it out!&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2410.15438&quot;&gt;Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs&lt;/a&gt;, by &lt;a href=&quot;https://scholar.google.com/citations?hl=en&amp;amp;user=8AWfEb0AAAAJ&amp;amp;view_op=list_works&amp;amp;sortby=pubdate&quot;&gt;Xin Zhou&lt;/a&gt;, &lt;a href=&quot;https://scholar.google.ca/citations?hl=zh-CN&amp;amp;user=z0phsK4AAAAJ&amp;amp;view_op=list_works&amp;amp;sortby=pubdate&quot;&gt;Ping Nie&lt;/a&gt; et al. – we analyse Mixture-of-Expert (MoE)-based Large Language Models (LLMs) in the context of Retrieval-Augmented Generation (RAG); we identify the groups of experts that are primarily responsible for RAG-related behaviors, such as identifying whether the parametric knowledge is sufficient to solve a given knowledge-intensive task; assessing the quality of retrieved documents; and improving the utilisation of context. Based on these findings, we propose several strategies to improve the efficiency and effectiveness of RAG systems by adjusting expert activations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;whats-brewing&quot;&gt;What’s brewing&lt;/h3&gt;

&lt;p&gt;We have several super-interesting works in the pipeline! Here are some of them:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2410.15999&quot;&gt;Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering&lt;/a&gt;, by &lt;a href=&quot;https://huggingface.co/yuzhaouoe&quot;&gt;Yu Zhao&lt;/a&gt; et al. – we introduce SpARE, a training-free method that leverages pre-trained &lt;a href=&quot;https://blog.eleuther.ai/autointerp/&quot;&gt;sparse auto-encoders&lt;/a&gt; (SAEs) to control the knowledge selection behavior of large language models (LLMs) when faced with conflicts between their internal (parametric) knowledge and external (contextual) information. By identifying and manipulating functional features within the LLMs’ internal activations, SpARE can steer the model to prioritize either parametric or contextual knowledge during inference. We show that SpARE is surprisingly effective at resolving knowledge conflicts in open-domain question-answering tasks, producing significantly better results than existing representation engineering and contrastive decoding methods. The insights in this paper are based on another paper, &lt;a href=&quot;https://arxiv.org/abs/2410.16090&quot;&gt;Analysing the Residual Stream of Language Models Under Knowledge Conflicts&lt;/a&gt; also by &lt;a href=&quot;https://huggingface.co/yuzhaouoe&quot;&gt;Yu Zhao&lt;/a&gt; et al. that will appear in the &lt;a href=&quot;https://sites.google.com/view/mint-2024/home&quot;&gt;Workshop on Foundation Model Interventions @ NeurIPS 2024&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2411.02830&quot;&gt;Mixtures of In-Context Learners&lt;/a&gt;, by &lt;a href=&quot;https://honggiwon.github.io/&quot;&gt;Giwon Hong&lt;/a&gt; et al. – we propose Mixtures of In-Context Learners (MoICL), a method that trains a set of &lt;em&gt;experts&lt;/em&gt; via in-context learning, and learns a weighting function to merge their outputs, addressing many of the limitations of standard in-context learning (ICL). MoICL yields significantly more accurate results than many strong baselines (up to +13% compared to ICL and LENS); reduces inference time by achieving similar performance with fewer demonstrations; and shows greater robustness to out-of-domain, imbalanced, or noisy demonstrations.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2410.18860&quot;&gt;DeCoRe: Decoding by Contrasting Retrieval Heads to Mitigate Hallucinations&lt;/a&gt;, by &lt;a href=&quot;https://aryopg.github.io/&quot;&gt;Aryo Gema&lt;/a&gt; et al. – we introduce DeCoRe (Decoding by Contrasting Retrieval Heads), a novel, training-free decoding strategy designed to mitigate hallucinations in large language models (LLMs). DeCoRe works by masking specific &lt;a href=&quot;https://arxiv.org/abs/2404.15574&quot;&gt;retrieval heads&lt;/a&gt; — attention heads responsible for extracting relevant contextual information — to induce hallucinations, and then contrast the outputs of the base LLM and the masked LLM, using conditional entropy as a guide. DeCoRe significantly improves performance on tasks requiring high contextual faithfulness, such as summarization, instruction following, and open-book question answering, and surprisingly (to us), it also helps with factual recall!&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2410.11900&quot;&gt;FLARE: Faithful Logic-Aided Reasoning and Exploration&lt;/a&gt;, by &lt;a href=&quot;https://osoblanco.github.io/&quot;&gt;Erik Arakelyan&lt;/a&gt; et al. – we introduce FLARE (Faithful Logic-Aided Reasoning and Exploration), a framework designed to improve the reasoning abilities of LLMs in knowledge-intensive reasoning tasks. FLARE use an intermediate logic programming-inspired representation of the reasoning process by generating Prolog code and simulating a program execution, ensuring that the reasoning process remains faithful and interpretable without relying on external solvers. FLARE achieves state-of-the-art results on seven out of nine diverse reasoning benchmarks, and we identify a strong correlation between the faithfulness of the reasoning process and the downstream model accuracy.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Fri, 01 Nov 2024 00:00:00 +0000</pubDate>
        <link>https://neuralnoise.com///2024/nov-research/</link>
        <guid isPermaLink="true">https://neuralnoise.com///2024/nov-research/</guid>
        
        <category>machine learning</category>
        
        <category>natural language processing</category>
        
        <category>research</category>
        
        <category>academia</category>
        
        <category>edinburgh</category>
        
        <category>research</category>
        
        
        <category>machine learning</category>
        
        <category>natural language processing</category>
        
        <category>research</category>
        
        <category>academia</category>
        
        <category>edinburgh</category>
        
      </item>
    
      <item>
        <title>July 2024 in Research</title>
        <description>&lt;p&gt;My amazing collaborators will be presenting several works at &lt;a href=&quot;https://2024.aclweb.org/&quot;&gt;ACL 2024&lt;/a&gt;, &lt;a href=&quot;https://icml.cc/&quot;&gt;ICML 2024&lt;/a&gt;, and &lt;a href=&quot;https://colmweb.org/&quot;&gt;CoLM 2024&lt;/a&gt; in the upcoming weeks/months!&lt;/p&gt;

&lt;h3 id=&quot;our-work-at-acl-2024&quot;&gt;Our work at ACL 2024&lt;/h3&gt;

&lt;p&gt;We will be presenting four papers this year at ACL, the flagship NLP conference:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2402.13991&quot;&gt;Analysing The Impact of Sequence Composition on Language Model Pre-Training&lt;/a&gt;, by &lt;a href=&quot;https://huggingface.co/yuzhaouoe&quot;&gt;Yu Zhao&lt;/a&gt; et al. – we analyse several language model pre-training schemes and find out that, e.g., intra-document causal masking helps both in terms of pre-training dynamics, and accuracy on a wide array of downstream tasks! This approach was later adopted by &lt;a href=&quot;https://llama.meta.com/&quot;&gt;Llama 3&lt;/a&gt;, Meta’s flagship language model family. This paper will be presented as an Oral – top 8% of the accepted papers!&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2305.13235&quot;&gt;SparseFit: Few-shot Prompting with Sparse Fine-tuning for Jointly Generating Predictions and Natural Language Explanations&lt;/a&gt;, by &lt;a href=&quot;https://scholar.google.com/citations?user=KqAy5mQAAAAJ&amp;amp;hl=en&quot;&gt;Jesus Solano&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/in/mardhiyah-sanni/?originalSubdomain=uk&quot;&gt;Mardhiyah Sanni&lt;/a&gt; et al.  – we introduce SparseFit, a sparse fine-tuning method that uses few-shot prompting and discrete prompts to efficiently generate both predictions and natural language explanations (NLEs) with large pre-trained language models, achieving competitive performance while significantly reducing the number of fine-tuning parameters! [&lt;a href=&quot;https://x.com/PMinervini/status/1822912368940081265&quot;&gt;poster&lt;/a&gt;]&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2311.07556&quot;&gt;Using Natural Language Explanations to Improve Robustness of In-context Learning&lt;/a&gt;, by &lt;a href=&quot;https://xlhex.github.io&quot;&gt;Xuanli He&lt;/a&gt; et al. – we found that integrating NLEs into in-context learning significantly improves the robustness of large language models against adversarial inputs, and show that generating NLEs with frontier models in a few-shot setting can significantly improve accuracy on challenging natural language inference tasks compared to traditional in-context learning and human-generated NLEs.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2406.13229&quot;&gt;Probing the Emergence of Cross-lingual Alignment during LLM Training&lt;/a&gt;, by &lt;a href=&quot;https://x.com/ErikaaWang&quot;&gt;Hetong Wang&lt;/a&gt; et al. – we analyse how cross-lingual alignment emerges during the training of multilingual large language models by probing neuron activity in different languages. We find that higher neuron overlap between languages correlates strongly with improved zero-shot cross-lingual transfer performance, but also identifies phases during training where both alignment and performance degrade, offering new insights into the dynamics of multilingual model training! [&lt;a href=&quot;https://x.com/ErikaaWang/status/1822297520334094724&quot;&gt;poster&lt;/a&gt;]&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;our-work-at-icml-2024&quot;&gt;Our work at ICML 2024&lt;/h3&gt;

&lt;p&gt;We will be presenting three works this year at &lt;a href=&quot;https://icml.cc/&quot;&gt;ICML&lt;/a&gt; – one in the main conference and two in co-located workshops:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2404.08458&quot;&gt;On the Independence Assumption in Neurosymbolic Learning&lt;/a&gt;, by  &lt;a href=&quot;https://www.emilevankrieken.com&quot;&gt;Emile van Krieken&lt;/a&gt; et al. – we analyse the common assumption in neurosymbolic learning that symbols are conditionally independent given the input, and argue that this assumption biases models towards deterministic solutions and limits their ability to express uncertainty.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2407.15516&quot;&gt;Attention Is All You Need But You Don’t Need All Of It For Inference of Large Language Models&lt;/a&gt;, by &lt;a href=&quot;https://www.linkedin.com/in/georgy-tyukin-644048150/?originalSubdomain=uk&quot;&gt;Georgy Tyukin&lt;/a&gt; et al. – we investigate the effects of removing MLP and attention layers in large language models during inference, finding that removing deeper attention layers can only marginally reduce performance while significantly improving inference speed!&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://openreview.net/forum?id=N0lEOF2eDm&quot;&gt;An Auditing Test to Detect Behavioral Shift in Language Models&lt;/a&gt;, by &lt;a href=&quot;https://scholar.google.com/citations?hl=en&amp;amp;user=1BMnCH0AAAAJ&amp;amp;view_op=list_works&amp;amp;sortby=pubdate&quot;&gt;Leo Richter&lt;/a&gt; et al. – we propose a continuous online auditing framework to detect behavioural shifts in language models, ensuring that deployed models remain aligned with societal values and preventing vendors or attackers from covertly deploying unaligned models for malicious purposes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;our-work-at-colm-2024&quot;&gt;Our work at CoLM 2024&lt;/h3&gt;

&lt;p&gt;The &lt;a href=&quot;https://colmweb.org/&quot;&gt;Conference on Language Modeling&lt;/a&gt; (CoLM) is a very new thing. I have been area-chairing for CoLM this year, and I’m really impressed by the quality of all submissions!
We will be presenting two papers:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2404.16041&quot;&gt;Forklift: An Extensible Neural Lifter&lt;/a&gt;, by &lt;a href=&quot;https://jordiae.com/&quot;&gt;Jordi Armengol-Estapé&lt;/a&gt; et al. – we introduce Forklift, a framework that uses neural models to translate assembly code across different instruction set architectures by “lifting” source assembly code into an intermediate representation, thereby reducing the engineering effort required for cross-architecture software migration!&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2405.15984&quot;&gt;Evaluating the Adversarial Robustness of Retrieval-Based In-Context Learning for Large Language Models&lt;/a&gt;, by &lt;a href=&quot;https://simon-yu.netlify.app/&quot;&gt;Simon Yu&lt;/a&gt;, &lt;a href=&quot;https://probe2.github.io/&quot;&gt;Jie He&lt;/a&gt; et al. – we analyse the adversarial robustness of retrieval-based in-context learning, finding that while retrieval-augmented methods improve robustness against test sample attacks, they increase vulnerability to adversarially perturbed demonstrations; to address this, we propose a new training-free defence method, which significantly improves adversarial robustness.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Mon, 01 Jul 2024 01:00:00 +0100</pubDate>
        <link>https://neuralnoise.com///2024/research/</link>
        <guid isPermaLink="true">https://neuralnoise.com///2024/research/</guid>
        
        <category>machine learning</category>
        
        <category>natural language processing</category>
        
        <category>research</category>
        
        <category>academia</category>
        
        <category>edinburgh</category>
        
        <category>research</category>
        
        
        <category>machine learning</category>
        
        <category>natural language processing</category>
        
        <category>research</category>
        
        <category>academia</category>
        
        <category>edinburgh</category>
        
      </item>
    
      <item>
        <title>Looking for Postdocs, June 2024 Edition</title>
        <description>&lt;p&gt;We have an opening for a 3-year postdoc – &lt;a href=&quot;https://ellis.eu/jobs/post-doctoral-research-associate&quot;&gt;more details are available here&lt;/a&gt; – on a project funded by Huawei via the Huawei-Edinburgh Joint Lab initiative, with me as the Principal Investigator (PI).&lt;/p&gt;

&lt;p&gt;The researcher will work on projects involving the design and application of improving the robustness and trustworthiness of Large Language Models when solving complex reasoning tasks, while improving their explainability and generalisation properties. They will be part of the &lt;a href=&quot;https://edinburghnlp.inf.ed.ac.uk/&quot;&gt;Edinburgh NLP Group&lt;/a&gt;, a world-leading research group in Natural Language Processing.&lt;/p&gt;
</description>
        <pubDate>Sat, 01 Jun 2024 01:00:00 +0100</pubDate>
        <link>https://neuralnoise.com///2024/postdoc/</link>
        <guid isPermaLink="true">https://neuralnoise.com///2024/postdoc/</guid>
        
        <category>machine learning</category>
        
        <category>natural language processing</category>
        
        <category>hallucinations</category>
        
        <category>reasoning</category>
        
        <category>academia</category>
        
        <category>edinburgh</category>
        
        <category>postdoc</category>
        
        
        <category>machine learning</category>
        
        <category>natural language processing</category>
        
        <category>hallucinations</category>
        
        <category>reasoning</category>
        
        <category>academia</category>
        
        <category>edinburgh</category>
        
        <category>postdoc</category>
        
      </item>
    
      <item>
        <title>Looking for Postdocs!</title>
        <description>&lt;p&gt;We have an opening for a 2-year postdoc – &lt;a href=&quot;https://elxw.fa.em3.oraclecloud.com/hcmUI/CandidateExperience/en/sites/CX_1001/job/5583&quot;&gt;more details are available here&lt;/a&gt; – on a project titled &lt;a href=&quot;https://web.inf.ed.ac.uk/eliai/projects/gradient-based-learning-of-complex-latent-structur&quot;&gt;Gradient-based Learning of Complex Latent Structures&lt;/a&gt;, with me as the Principal Investigator (PI), and &lt;a href=&quot;http://nolovedeeplearning.com/&quot;&gt;Antonio Vergari&lt;/a&gt; (&lt;a href=&quot;https://web.inf.ed.ac.uk/anc&quot;&gt;IANC&lt;/a&gt;) and &lt;a href=&quot;https://ducdauge.github.io/&quot;&gt;Edoardo Ponti&lt;/a&gt; (&lt;a href=&quot;https://web.inf.ed.ac.uk/ilcc&quot;&gt;ILCC&lt;/a&gt;) as co-PIs. The position is entirely funded by the &lt;a href=&quot;https://web.inf.ed.ac.uk/eliai&quot;&gt;Edinburgh Laboratory for Integrated Artificial Intelligence&lt;/a&gt; (ELIAI) – if you want to know more, feel free to reach out!&lt;/p&gt;

&lt;p&gt;You can apply &lt;a href=&quot;https://elxw.fa.em3.oraclecloud.com/hcmUI/CandidateExperience/en/sites/CX_1001/job/5583&quot;&gt;at this link&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;project-description&quot;&gt;Project description&lt;/h3&gt;

&lt;p&gt;Imposing structural constraints on the latent representations learned by deep neural models has several applications, which can improve their explainability, their robustness, and their ability to generalise to out-of-domain distributions. For example, we can learn more explainable models by making them selectively decide which parts of the input to consider; and we can improve their generalisation properties by learning representations suitable for reasoning tasks, such as deductive reasoning and planning, and comply with any desired constraints. For instance, the intermediate structure can represent a relational graph between objects in the world; the relationships between multiple sub-questions in a complex question; or computation graphs which can be executed to produce a prediction.&lt;/p&gt;

&lt;p&gt;In this project, we aim to investigate how we can derive better methods for back-propagating through mixed continuous-discrete complex latent structures, and how we can leverage them for learning more explainable, data-efficient, and robust deep neural models. The reason why discrete latent representations are not widely adopted by deep neural models is that they tend to not interact well with gradient-based optimisation methods, but this started to change recently (e.g., see &lt;a href=&quot;https://arxiv.org/abs/2106.01798&quot;&gt;Niepert et al., 2021&lt;/a&gt;; &lt;a href=&quot;https://arxiv.org/abs/2209.04862&quot;&gt;Minervini et al. 2022&lt;/a&gt;), enabling a wide range of applications and use cases.&lt;/p&gt;

&lt;p&gt;Related papers:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Niepert, Minervini, and Franceschi - &lt;a href=&quot;https://arxiv.org/abs/2106.01798&quot;&gt;Implicit MLE: Backpropagating Through Discrete Exponential Family Distributions&lt;/a&gt;. NeurIPS 2021&lt;/li&gt;
  &lt;li&gt;Minervini, Franceschi, and Niepert - &lt;a href=&quot;https://arxiv.org/abs/2209.04862&quot;&gt;Adaptive Perturbation-Based Gradient Estimation for Discrete Latent Variable Models&lt;/a&gt;. AAAI 2023&lt;/li&gt;
  &lt;li&gt;Ahmed, Teso, Chang, Van den Broeck, Vergari - &lt;a href=&quot;https://arxiv.org/abs/2206.00426&quot;&gt;Semantic Probabilistic Layers for Neuro-Symbolic Learning&lt;/a&gt;. NeurIPS 2022&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;position&quot;&gt;Position&lt;/h3&gt;

&lt;p&gt;The post holder will work on projects involving the design and application of deep learning models with discrete latent structures for improving their explainability, generalisation, and robustness properties. They will be part of the new &lt;a href=&quot;https://web.inf.ed.ac.uk/eliai&quot;&gt;Edinburgh Laboratory for Integrated Artificial Intelligence&lt;/a&gt; and the &lt;a href=&quot;https://edinburghnlp.inf.ed.ac.uk/&quot;&gt;Edinburgh NLP Group&lt;/a&gt;, a world-leading research group in Natural Language Processing.&lt;/p&gt;

&lt;p&gt;The School of Informatics is one of the largest research centres in Computer Science in Europe, and it has been &lt;a href=&quot;https://www.ed.ac.uk/informatics/news-events/stories/2022/informatics-ref2021-results-global-reach-genuine-i&quot;&gt;ranked #1 in the UK&lt;/a&gt; in terms of research power by a large margin. The Edinburgh NLP Group is consistently ranked among the &lt;a href=&quot;https://csrankings.org/#/index?nlp&amp;amp;world&quot;&gt;world’s leading research groups&lt;/a&gt; in Natural Language Processing. We are offering an exciting opportunity to work in an interdisciplinary, collaborative, friendly, and supportive environment, integrating different sub-fields of Computer Science and Artificial Intelligence.&lt;/p&gt;
</description>
        <pubDate>Tue, 01 Nov 2022 00:00:00 +0000</pubDate>
        <link>https://neuralnoise.com///2022/postdoc/</link>
        <guid isPermaLink="true">https://neuralnoise.com///2022/postdoc/</guid>
        
        <category>machine learning</category>
        
        <category>natural language processing</category>
        
        <category>knowledge graphs</category>
        
        <category>neuro-symbolic reasoning</category>
        
        <category>academia</category>
        
        <category>edinburgh</category>
        
        <category>postdoc</category>
        
        
        <category>machine learning</category>
        
        <category>natural language processing</category>
        
        <category>knowledge graphs</category>
        
        <category>neuro-symbolic reasoning</category>
        
        <category>academia</category>
        
        <category>edinburgh</category>
        
        <category>postdoc</category>
        
      </item>
    
  </channel>
</rss>
