We have been working on language model evaluation, knowledge utilization, efficiency, and multimodal reasoning. We had papers accepted at ICLR 2025, NAACL 2025 (×3), AAAI 2025, and other venues, along with several works in progress.
NAACL 2025 – Controlling Knowledge & Reasoning
- Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering, by Yu Zhao et al. – We introduce SpARE, a training-free method to control whether an LLM relies on its internal parametric knowledge or on the given context when the two conflict. By analyzing mid-layer activations with sparse autoencoders, SpARE identifies conflict signals and manipulates them to steer the model at inference time, significantly improving performance on open-domain QA compared to prior methods. (Oral presentation; a toy sketch of this style of activation steering appears after this list.)
- Are We Done with MMLU?, by Aryo Gema and many others – We analyze the Massive Multitask Language Understanding benchmark, uncovering a fairly high error rate – for example, in the Virology subset, 57% of sampled questions had issues. We introduce MMLU-Redux, a manually curated subset of 5,700 expert-verified questions, and show that corrected evaluations can substantially alter model rankings. MMLU-Redux is open-sourced and has already been adopted by, for example, DeepSeek and Qwen!
- Self-Training Large Language Models for Tool-Use Without Demonstrations, based on Ne Luo’s MSc project – We explore whether LLMs can learn tool usage (e.g., search engines, calculators) without hand-crafted examples. Starting with zero-shot prompts, we generate synthetic tool-using traces and then fine-tune the model on them; a rough sketch of this loop appears below. On PopQA, the self-trained model gains +3.7% accuracy, though results vary on other datasets, highlighting both the promise and the challenges of autonomous tool-use learning. Ne Luo is looking for a PhD position – contact her if you are interested in working with her!
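To give a flavour of what SAE-based steering looks like in practice, here is a minimal, purely illustrative sketch (not the actual SpARE code): a forward hook encodes a layer's hidden states with a pre-trained sparse autoencoder, rescales a handful of latent features assumed to signal knowledge conflict, and decodes back into the residual stream. The SAE, the feature indices, and the hook point are all placeholders.

```python
def make_sae_steering_hook(sae_encode, sae_decode, feature_ids, scale=4.0):
    """Forward hook that edits hidden states in SAE latent space (toy sketch).

    sae_encode / sae_decode: encoder and decoder of a pre-trained sparse
    autoencoder for this layer (placeholders). feature_ids: indices of latent
    features assumed to mediate context-vs-parametric knowledge selection.
    scale > 1 amplifies those features; scale < 1 suppresses them.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        z = sae_encode(hidden)             # project into the sparse latent space
        z[..., feature_ids] *= scale       # steer the selected features
        steered = sae_decode(z)            # map back to the residual stream
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on a mid-layer of a Hugging Face-style decoder:
# handle = model.model.layers[14].register_forward_hook(
#     make_sae_steering_hook(sae.encode, sae.decode, feature_ids=[123, 456]))
# ...generate...
# handle.remove()
```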
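For the tool-use self-training work, the loop can be caricatured as rejection sampling: prompt the model zero-shot, execute whatever tool calls it writes, and keep only the traces that recover the gold answer as fine-tuning data. Everything below (the prompt format, the `TOOL[...]` syntax, and the `run_tool_calls` helper) is a simplifying assumption, not the paper's exact setup.

```python
def run_tool_calls(trace, tools):
    # Placeholder: parse `TOOL[name](args)` lines, call tools[name](args),
    # and append an `OBSERVATION: ...` line after each call.
    return trace

def generate_tool_traces(model, questions_with_answers, tools, n_samples=4):
    """Collect synthetic tool-use traces by rejection sampling (toy sketch).

    model: anything with a .generate(prompt) -> str interface (placeholder).
    tools: dict mapping tool names (e.g. "search", "calculator") to callables.
    """
    kept = []
    for question, gold in questions_with_answers:
        for _ in range(n_samples):
            prompt = ("Answer the question. You may call tools by writing "
                      "`TOOL[name](arguments)` on its own line.\n"
                      f"Question: {question}\n")
            trace = run_tool_calls(model.generate(prompt), tools)  # zero-shot
            if gold.lower() in trace.lower():   # keep traces that reach the answer
                kept.append({"prompt": prompt, "completion": trace})
                break
    return kept   # then fine-tune on `kept` (e.g. standard supervised FT / LoRA)
```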
ICLR 2025 – Learning & Evaluation
- An Auditing Test to Detect Behavioral Shift in Language Models, by the amazing Leo Richter – We propose a method for continual Behavioral Shift Auditing (BSA) of LLMs. This statistical test monitors an LLM’s outputs for significant deviations from a reference model’s behavior, with theoretical guarantees on detecting genuine shifts while avoiding false alarms. BSA can catch subtle changes in a model’s toxicity and translation performance after fine-tuning using only a few hundred examples, offering a practical tool to ensure that an LLM remains aligned throughout its deployment.
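As a rough intuition for what such an audit does (this is a plain permutation test with a tolerance band, deliberately much simpler than the sequential test in the paper): score the reference and deployed models' outputs on the same prompts with a behaviour metric such as toxicity, tolerate small differences, and flag a shift only when the gap is both large and statistically significant.

```python
import numpy as np

def behaviour_shift_check(ref_scores, cur_scores, tolerance=0.05,
                          n_perm=10_000, alpha=0.01, seed=0):
    """Flag a behavioural shift between two models (illustrative stand-in).

    ref_scores / cur_scores: per-prompt behaviour scores (e.g. toxicity
    probabilities) for the reference and deployed model on the same prompts.
    """
    rng = np.random.default_rng(seed)
    ref, cur = np.asarray(ref_scores), np.asarray(cur_scores)
    observed = abs(cur.mean() - ref.mean())
    if observed <= tolerance:            # differences inside the band are tolerated
        return False
    pooled = np.concatenate([ref, cur])
    count = 0
    for _ in range(n_perm):              # permutation null: behaviour unchanged
        rng.shuffle(pooled)
        diff = abs(pooled[len(ref):].mean() - pooled[:len(ref)].mean())
        count += diff >= observed
    p_value = (count + 1) / (n_perm + 1)
    return p_value < alpha               # True => the audit flags a shift
```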
Reasoning and Planning for Large Language Models
- Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs, by Rohit Saxena et al. – We introduce ClockQA and CalendarQA for testing multimodal LLMs’ temporal reasoning from images, revealing widespread failures and motivating models with a better understanding of times and dates. This paper got plenty of media coverage – e.g., on Gizmodo, The Engineer, Yahoo! News, VICE, Tech Xplore, and many others!
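To see why this should be "easy", the ground truth behind CalendarQA-style questions is plain date arithmetic that a few lines of Python get exactly right (a toy illustration, not the benchmark's construction code):

```python
import calendar
from datetime import date, timedelta

year = 2025
# "What day of the week is 1 January 2025?"
print(date(year, 1, 1).strftime("%A"))        # Wednesday
# "How many Fridays are there in February 2025?"
print(sum(1 for week in calendar.monthcalendar(year, 2)
          if week[calendar.FRIDAY] != 0))     # 4
# "What is the 100th day of the year?"
print(date(year, 1, 1) + timedelta(days=99))  # 2025-04-10
```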
AAAI 2025 – Efficient Inference
- Adaptive Computation Modules: Granular Conditional Computation for Efficient Inference, by Bartosz Wójcik, Alessio Devoto, et al. – We propose Adaptive Computation Modules (ACMs) for dynamic, per-token computation in Transformers. ACMs consist of cascaded sub-modules with gating functions that allow easy tokens to exit early. We also introduce a distillation method to retrofit pre-trained models with ACMs, cutting inference cost without accuracy loss in vision and speech tasks and offering a plug-and-play approach to more efficient AI systems!
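A toy rendering of the idea (a deliberate simplification, not the ACM implementation): a module built from cascaded sub-blocks where a small gate decides, per token, how many sub-blocks to apply, so easy tokens stop early.

```python
import torch
import torch.nn as nn

class ToyAdaptiveModule(nn.Module):
    """Cascaded sub-blocks with a per-token gate choosing compute depth.

    Illustrative only: real ACMs are distilled from pre-trained layers and use
    a trained gating network; here the gate is a linear head with hard argmax,
    and inactive tokens are masked rather than skipped, so no compute is
    actually saved -- this just shows the control flow.
    """
    def __init__(self, d_model=256, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                          nn.Linear(d_model, d_model))
            for _ in range(n_blocks)])
        self.gate = nn.Linear(d_model, n_blocks + 1)   # predicts a depth in 0..n_blocks

    def forward(self, x):                      # x: (batch, seq, d_model)
        depth = self.gate(x).argmax(dim=-1)    # per-token number of sub-blocks to run
        out = x
        for i, block in enumerate(self.blocks):
            active = (depth > i).unsqueeze(-1) # tokens still in the cascade
            out = torch.where(active, out + block(out), out)
        return out

# y = ToyAdaptiveModule()(torch.randn(2, 10, 256))   # -> (2, 10, 256)
```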
COLING 2025 – Multilingual Resources
- SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages, by Gayane Ghazaryan, Erik Arakelyan (who just got his PhD and joined NVIDIA! 🚀) et al. – SynDARin synthesizes QA datasets in low‑resource languages (e.g., Armenian) by generating English questions via LLMs from parallel corpora, translating and validating them. The resulting 16,000+ QA pairs produce a challenging benchmark where models often perform near chance, highlighting critical gaps and enabling rapid evaluation in languages lacking resources.
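One plausible shape for the translate-and-validate step (a guess at a reasonable heuristic, not the paper's actual filter): translate each generated QA pair and keep it only if the translated answer is still grounded in the translated passage.

```python
def translate_and_validate(qa_pairs, translate):
    """Filter machine-translated QA pairs (illustrative heuristic only).

    qa_pairs: dicts with English "passage", "question", "answer" fields.
    translate: callable str -> str into the target low-resource language
               (an MT model or API -- a placeholder here).
    """
    validated = []
    for pair in qa_pairs:
        passage_t = translate(pair["passage"])
        answer_t = translate(pair["answer"])
        # Keep the pair only if the translated answer still appears in the
        # translated passage; otherwise translation has likely broken it.
        if answer_t and answer_t.lower() in passage_t.lower():
            validated.append({"passage": passage_t,
                              "question": translate(pair["question"]),
                              "answer": answer_t})
    return validated
```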
Frontiers in AI 2025 – Human-AI Collaboration
- Fostering Effective Hybrid Human-LLM Reasoning and Decision Making – We examine frameworks combining LLMs and human judgment for complex tasks, offering design principles for AI‑assisted decision systems. Through case studies, we show that integrating LLM‑generated insights with human oversight yields more reliable and interpretable outcomes than either alone, providing guidelines for principled human‑in‑the‑loop systems.
What’s Brewing
- Noiser: Bounded Input Perturbations for Attributing Large Language Models, by Reza Madani et al. – Noiser perturbs input embeddings with bounded noise to attribute token importance, and introduces an “answerability” check to validate the attributions. Outperforming gradient- and attention-based baselines, Noiser offers robust post-hoc explanations for LLM predictions; a toy sketch of this style of perturbation-based attribution appears after this list.
- An Analysis of Decoding Methods for LLM-based Agents for Faithful Multi-Hop Question Answering, by Alex, Sanad, and other amazing students at the UoE – We analyse how faithfulness-enhancing decoding (e.g., DeCoRe) within the ReAct agent framework improves multi-hop QA, boosting HotpotQA F1 from 19.5 to 32.6 and underscoring the role of decoding in reliable LLM reasoning.
- Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression, led by Nathan Godey – Q-Filters uses query-key geometric projections to filter past tokens on the fly, compressing the KV cache without retraining and matching attention-based methods like SnapKV, enabling long-context generation with minimal memory. A toy rendering of the idea also appears after this list.
- PosterSum: A Multimodal Benchmark for Scientific Poster Summarization, by the amazing Rohit Saxena – PosterSum offers 16,000+ research posters paired with abstracts for evaluating vision-language summarization. Our “Segment & Summarize” approach secures a 3.1% ROUGE-L gain, highlighting this benchmark’s challenge.
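For Noiser, here is a toy version of perturbation-based attribution (bounded noise on one token's embedding at a time, importance measured as the drop in the predicted token's probability; the actual method and its answerability check are more involved, and the wrapper interface below is assumed):

```python
import torch

@torch.no_grad()
def perturbation_attribution(model, embedder, input_ids, noise_scale=0.1, n_samples=8):
    """Score token importance via bounded embedding perturbations (toy sketch).

    model: callable taking inputs_embeds and returning next-token logits
    (e.g. a thin wrapper around an HF model); embedder: the token embedding
    layer. Both are placeholders.
    """
    embeds = embedder(input_ids)                          # (1, seq, d)
    base = model(inputs_embeds=embeds)[0, -1].softmax(-1)
    target = base.argmax()                                # predicted next token
    scores = torch.zeros(input_ids.shape[1])
    for i in range(input_ids.shape[1]):
        drop = 0.0
        for _ in range(n_samples):
            noisy = embeds.clone()
            bound = (noise_scale * noisy[0, i].abs().max()).item()  # bounded noise
            noisy[0, i] += torch.empty_like(noisy[0, i]).uniform_(-bound, bound)
            p = model(inputs_embeds=noisy)[0, -1].softmax(-1)[target]
            drop += (base[target] - p).clamp(min=0).item()
        scores[i] = drop / n_samples          # larger drop => more important token
    return scores
```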
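And for Q-Filters, a toy rendering of the geometric intuition (score each cached key by its projection onto an estimated query direction and keep the top-k; the real method derives its filtering directions more carefully from the geometry of the query distribution and operates per layer and head):

```python
import torch

def prune_kv_cache(keys, values, query_direction, keep=1024):
    """Keep the cached tokens whose keys align with a query direction (toy sketch).

    keys, values: (seq, d_head) cached tensors for one attention head.
    query_direction: (d_head,) unit vector estimating where queries point
    (here, e.g., the mean of recent normalised queries -- an assumption).
    """
    if keys.shape[0] <= keep:
        return keys, values
    scores = keys @ query_direction                 # projection onto the direction
    idx = scores.topk(keep).indices.sort().values   # top-k, original order preserved
    return keys[idx], values[idx]

# Hypothetical usage inside a generation loop, per layer and head:
# q_dir = torch.nn.functional.normalize(recent_queries.mean(dim=0), dim=-1)
# k_cache, v_cache = prune_kv_cache(k_cache, v_cache, q_dir, keep=1024)
```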