Aryaman Arora

About · Blog · Google Scholar · GitHub · Twitter · Email · CV

I am a second-year Ph.D. student at Stanford University advised by Dan Jurafsky and Christopher Potts. I work on interpretability of language models. Not only am I curious about how language models work, but I also want to discover principles that can enable better language models.

I completed my B.S. in Computer Science and Linguistics at Georgetown University, where I worked with Nathan Schneider on computational linguistics. I interned at ETH Zürich with Ryan Cotterell, working on information theory, and have also interned at Apple and Redwood Research.

Greatest Hits [» more papers]

Mechanistic evaluation of Transformers and state space models
Aryaman Arora, Neil Rathi, Nikil Roashan Selvam, Róbert Csordás, Dan Jurafsky, Christopher Potts
arXiv:2505.15105, 2025

AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders
Zhengxuan Wu*, Aryaman Arora*, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, Christopher Potts
ICML, 2025 Spotlight

ReFT: Representation finetuning for language models
Zhengxuan Wu*, Aryaman Arora*, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, Christopher Potts
NeurIPS, 2024 Spotlight

CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Aryaman Arora, Dan Jurafsky, Christopher Potts
ACL, 2024 Outstanding Paper Award, Senior Area Chair Award