Learning to Reason in LLMs by Expectation Maximization

Dec 23, 2025

Junghyun Lee

Branislav Kveton

Anup Rao

Subhojyoti Mukherjee

Ryan A. Rossi

Sunav Choudhary

Alexa Siu

Abstract

Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive a reward-based filtered expectation-maximization (FEM) objective for learning to reason. This view connects EM to modern reward-based optimization and shows that the main challenge lies in designing a sampling distribution over rationales that justify correct answers. We instantiate and compare three sampling schemes: rejection sampling with a budget, the self-taught reasoner (STaR), and prompt posterior sampling (PPS), which keeps only the rationalization stage of STaR, the stage that conditions on the correct answer in the prompt. We experiment with LLM-as-a-judge calibration and summarization-from-feedback tasks, where conditioning on the correct answer provides strong guidance for generating rationales. Our experiments show that PPS is more effective than the other sampling schemes and that the choice of sampling scheme can have a significant impact on performance.
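To make the three sampling schemes concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the `Model` interface, the hint wording in `with_hint`, the toy model, and every function name are hypothetical, and the filter simply keeps a rationale when the sampled answer matches the correct one (a 0/1 reward). The fine-tuning (M) step on the kept rationales is omitted.

```python
from functools import partial
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: a model maps a prompt to (rationale, answer).
Model = Callable[[str], Tuple[str, str]]
HINT = "Hint: the correct answer is "  # assumed prompt wording, not from the paper


def with_hint(question: str, correct: str) -> str:
    """Build a prompt that conditions on the correct answer."""
    return f"{question}\n{HINT}{correct}."


def rejection_sample(model: Model, question: str, correct: str,
                     budget: int) -> Optional[str]:
    """Sample from the plain prompt up to `budget` times; keep the first hit."""
    for _ in range(budget):
        rationale, answer = model(question)
        if answer == correct:
            return rationale
    return None


def star_sample(model: Model, question: str, correct: str) -> Optional[str]:
    """STaR-style: try the plain prompt, then fall back to rationalization."""
    rationale, answer = model(question)
    if answer == correct:
        return rationale
    rationale, answer = model(with_hint(question, correct))
    return rationale if answer == correct else None


def pps_sample(model: Model, question: str, correct: str) -> Optional[str]:
    """PPS-style: only the rationalization stage (correct answer in the prompt)."""
    rationale, answer = model(with_hint(question, correct))
    return rationale if answer == correct else None


def filtered_e_step(sampler, model: Model,
                    data: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Collect (question, rationale, answer) triples that pass the filter."""
    kept = []
    for question, correct in data:
        rationale = sampler(model, question, correct)
        if rationale is not None:
            kept.append((question, rationale, correct))
    return kept


def toy_model(prompt: str) -> Tuple[str, str]:
    """Toy stand-in for an LLM: it only answers correctly when given the hint."""
    if HINT in prompt:
        answer = prompt.split(HINT, 1)[1].rstrip(".")
        return (f"reasoning that leads to {answer}", answer)
    return ("a guess without guidance", "unknown")


data = [("What is 2 + 2?", "4"), ("What is 3 * 3?", "9")]
kept_rs = filtered_e_step(partial(rejection_sample, budget=3), toy_model, data)
kept_star = filtered_e_step(star_sample, toy_model, data)
kept_pps = filtered_e_step(pps_sample, toy_model, data)
```

With this deliberately weak toy model, rejection sampling exhausts its budget without keeping anything, while the STaR-style and PPS-style samplers each keep one rationale per question. A real setup would replace `toy_model` with calls to an LLM and would fine-tune on the kept triples.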

Type

Preprint

Publication

arXiv preprint arXiv:2512.20169

Last updated on Dec 23, 2025

RL LLMs Reasoning Probabilistic Machine Learning

Authors

Junghyun Lee

PhD Candidate in AI

PhD candidate at KAIST AI, jointly advised by Se-Young Yun and Chulhee Yun. I work on interactive machine learning, theoretical aspects of LLMs, learning/optimization theory, and statistical analysis of large networks.
