arxiv:2411.04282

Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

Published on Nov 6
· Submitted by hlnchen on Nov 11
#2 Paper of the day

Abstract

Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps. While prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through our proposed optimization approach in a self-improvement manner. The code of LaTRO is available at https://github.com/SalesforceAIResearch/LaTRO.
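
For readers unfamiliar with the variational view, the standard evidence lower bound for a latent-variable model is sketched below. It is not quoted from the paper. The symbols are assumptions made for illustration: x is the question, y the answer, z a latent reasoning trajectory, p the model, and q a proposal distribution over trajectories. In a setup like the one the abstract describes, the same LLM would presumably serve as both the proposal and the answer model, but the exact objective in the paper may differ.

```latex
% Generic ELBO (Jensen's inequality); symbols are illustrative assumptions.
\begin{align}
\log p(y \mid x)
  &= \log \mathbb{E}_{z \sim p(\cdot \mid x)} \big[ p(y \mid x, z) \big] \\
  &= \log \mathbb{E}_{z \sim q(\cdot \mid x)}
     \left[ \frac{p(z \mid x)}{q(z \mid x)} \, p(y \mid x, z) \right] \\
  &\ge \mathbb{E}_{z \sim q(\cdot \mid x)} \big[ \log p(y \mid x, z) \big]
     - D_{\mathrm{KL}}\big( q(z \mid x) \,\|\, p(z \mid x) \big)
\end{align}
```

The first term rewards trajectories that make the correct answer likely. The second keeps the trajectory distribution close to the prior.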

Community


Chain-of-Thought (CoT) prompting has demonstrated the strong reasoning capabilities of LLMs. But how do we train them to reason?
Introducing LaTent Reasoning Optimization (LaTRO): a principled framework that formulates the reasoning trajectory as a latent variable and optimizes the reasoning via variational approaches.

  • LaTRO performs well: on GSM8K, we improve zero-shot accuracy by an average of 12.5% over three base models: Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B.
  • LaTRO is reward-model-free: surprisingly but reasonably, the log probability of producing the correct answer after the reasoning trajectory serves as a natural reward function, which we call "self-rewarding". No need to train additional reward models as in RLHF!
  • LaTRO shifts inference-time scaling back to training time by self-generating multiple reasoning trajectories and self-rewarding them against the ground truth during each training update.
  • Free side benefit: LaTRO can compress the length of reasoning trajectories. On GSM8K, a model with 200 reasoning tokens achieves 78% of the performance of a model with 500 reasoning tokens.
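
The self-rewarding idea in the bullets above can be illustrated with a small, framework-free sketch. It is not the authors' implementation. It assumes the per-token log probabilities of the ground-truth answer, conditioned on the question and a sampled reasoning trajectory, have already been computed by some model. The function names `self_reward` and `leave_one_out_advantages` are hypothetical, and the leave-one-out baseline is one common variance-reduction choice, assumed here for illustration.

```python
from typing import List, Sequence


def self_reward(answer_token_logprobs: Sequence[float]) -> float:
    """Reward for one sampled rationale (hypothetical helper).

    The input is assumed to hold log p(answer token | question, rationale,
    previous answer tokens) for each ground-truth answer token; their sum
    is the log probability of the full correct answer.
    """
    return sum(answer_token_logprobs)


def leave_one_out_advantages(rewards: Sequence[float]) -> List[float]:
    """Centre each reward on the mean reward of the other samples.

    This baseline is an assumption of this sketch, not a confirmed
    detail of LaTRO.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two sampled rationales")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Made-up answer log-probs for three rationales sampled for one question.
sampled_answer_logprobs = [
    [-0.1, -0.2],  # rationale that makes the correct answer likely
    [-1.0, -2.0],  # rationale that makes it unlikely
    [-0.5, -0.4],
]
rewards = [self_reward(lp) for lp in sampled_answer_logprobs]
advantages = leave_one_out_advantages(rewards)
```

In a training loop, each advantage would weight the log-likelihood gradient of its rationale, so trajectories that make the correct answer more probable are reinforced without a separate reward model.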

