The Quest for the Right Mediator: Surveying Mechanistic Interpretability for NLP Through the Lens of Causal Mediation Analysis

Authors

  • Aaron Mueller, Boston University
  • Jannik Brinkmann, University of Mannheim
  • Millicent Li, Northeastern University
  • Samuel Marks, Anthropic
  • Koyena Pal, Northeastern University
  • Nikhil Prakash, Northeastern University
  • Can Rager, Northeastern University
  • Aruna Sankaranarayanan, Massachusetts Institute of Technology
  • Arnab Sen Sharma, Northeastern University
  • Jiuding Sun, Stanford University
  • Eric Todd, Northeastern University
  • David Bau, Northeastern University
  • Yonatan Belinkov, Technion – Israel Institute of Technology

Keywords

Interpretability, Causal Mediation Analysis, Language Model, Causal Graph, Counterfactual

Abstract

Interpretability provides a toolset for understanding how and why language models behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making it difficult to measure progress and compare the pros and cons of different techniques. Furthermore, while mechanistic understanding is frequently discussed, the basic causal units underlying these mechanisms are often not explicitly defined. In this article, we propose a perspective on interpretability research grounded in causal mediation analysis. Specifically, we describe the history and current state of interpretability taxonomized according to the types of causal units (mediators) employed, as well as methods used to search over mediators. We discuss the pros and cons of each mediator, providing insights as to when particular kinds of mediators and search methods are most appropriate. We argue that this framing yields a more cohesive narrative of the field and helps researchers select appropriate methods based on their research objective. Our analysis yields actionable recommendations for future work, including the discovery of new mediators and the development of standardized evaluations tailored to these goals.

Author Biographies

  • Aaron Mueller, Boston University

    Asst. Prof. of Computer Science at Boston University (Fall 2025).

  • David Bau, Northeastern University

    Asst. Prof. of Computer Science at Northeastern University.

  • Yonatan Belinkov, Technion – Israel Institute of Technology

    Asst. Prof. of Computer Science at the Technion.

Published

2026-06-27