Research
Towards Multi-Turn RLHF
Aligning language models with conversational trajectories
The Problem
Current reinforcement learning from human feedback (RLHF) techniques overwhelmingly optimize language models on single-turn interactions: one prompt, one response, one evaluation.
However, real human–AI interaction is multi-turn: meaning and usefulness emerge from the continuity of dialogue. Training only on isolated Q–A pairs introduces a structural misalignment between training objectives and deployment realities.
Consider a tutoring session. A model might give a technically correct answer in turn 1, yet fail to engage with the confusion the student expresses in turn 2, or contradict its own explanation by turn 4. Single-turn RLHF cannot capture or optimize for these conversational qualities.
Three Failure Modes
Myopic Rewards
Single-turn reward models cannot capture qualities like “the model stayed consistent across five turns” or “the clarification in turn 3 resolved confusion introduced in turn 1.” Both the temporal structure of conversation and cross-turn coherence remain invisible to isolated evaluations.
Misaligned Optimization
Models are optimized for one-shot helpfulness, while users judge performance on conversation arcs. A model might maximize turn-level rewards while producing incoherent multi-turn trajectories.
Missing Benchmarks
Current evaluation suites such as MT-Bench, HELM, and AlpacaEval assess at most short, two-turn exchanges rather than full dialogue trajectories. We lack standardized ways to measure conversational coherence, memory consistency, or long-horizon task success.
We frame multi-turn dialogue as an RL episode, where each utterance is a step and the final outcome provides a trajectory-level reward.
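One way to write this framing down as a sketch (the notation here, states s_t, utterances a_t, and policy π_θ, is ours and not fixed by the project):

```latex
% A dialogue with T assistant turns, treated as one RL episode.
% s_t: conversation history before turn t;  a_t: the model's utterance at turn t.
\tau = (s_1, a_1, s_2, a_2, \ldots, s_T, a_T), \qquad R(\tau) \in \mathbb{R}
% Training then maximizes the expected trajectory-level reward under the policy:
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]
```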
Our Approach
Key Components
Trajectory Rewards
Instead of evaluating isolated turns, human or synthetic annotators rate entire conversations. This captures qualities like coherence across turns, successful task completion, and maintained context.
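For concreteness, a whole-conversation annotation might look like the record below; the schema and rubric fields are hypothetical, chosen only to mirror the qualities named above.

```python
from dataclasses import dataclass

@dataclass
class TurnRecord:
    role: str   # "user" or "assistant"
    text: str

@dataclass
class TrajectoryAnnotation:
    """One annotator's rating of an entire conversation (illustrative schema)."""
    conversation: list[TurnRecord]
    coherence: int            # 1-5: consistency of claims and explanations across turns
    task_completion: int      # 1-5: did the dialogue accomplish the user's goal?
    context_maintenance: int  # 1-5: did later turns honor earlier facts and constraints?
    overall: int              # 1-5: holistic trajectory-level judgment
    notes: str = ""           # free-text rationale, useful for auditing labels
```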
Credit Assignment
We propagate trajectory-level rewards back through individual turns using policy gradient methods, temporal difference learning, or Monte Carlo rollouts. This allows the model to learn which conversational moves contributed to overall success.
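As a minimal sketch of the Monte Carlo option: the trajectory-level reward is placed on the final turn and discounted backward, so every earlier turn receives a learning signal. The discount factor and the optional per-turn rewards are illustrative choices, not settled parts of the proposal.

```python
def per_turn_returns(num_turns: int,
                     trajectory_reward: float,
                     per_turn_rewards: list[float] | None = None,
                     gamma: float = 0.95) -> list[float]:
    """Spread a trajectory-level reward back over individual assistant turns.

    Each turn t receives the discounted sum of everything that happens from t
    onward, so earlier turns are credited (or blamed) for the final outcome.
    """
    rewards = list(per_turn_rewards) if per_turn_rewards else [0.0] * num_turns
    rewards[-1] += trajectory_reward          # terminal reward lands on the last turn

    returns = [0.0] * num_turns
    running = 0.0
    for t in reversed(range(num_turns)):      # backward pass: G_t = r_t + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a 4-turn conversation rated +1.0 overall.
# Turn-level returns decay toward earlier turns: roughly [0.857, 0.903, 0.95, 1.0]
print(per_turn_returns(num_turns=4, trajectory_reward=1.0))
```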
Coherence Models
We train specialized reward models to score properties like memory consistency, tone stability, and context preservation—qualities that only emerge across multiple turns.
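A coherence scorer could take many forms; the sketch below assumes per-turn embeddings from a pretrained language model and uses simple mean pooling with a scalar head. This is one plausible shape, not the project's prescribed architecture.

```python
import torch
import torch.nn as nn

class CoherenceRewardModel(nn.Module):
    """Scores a whole conversation for a cross-turn property (e.g., memory consistency).

    Illustrative architecture: embeddings for each turn are pooled and mapped to a
    single scalar; in practice the turn embeddings would come from a pretrained LM.
    """

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, turn_embeddings: torch.Tensor) -> torch.Tensor:
        # turn_embeddings: (batch, num_turns, hidden_dim)
        pooled = turn_embeddings.mean(dim=1)    # simple mean pooling over turns
        return self.scorer(pooled).squeeze(-1)  # (batch,) scalar coherence scores
```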
Hybrid Supervision
Our approach combines dense per-turn feedback (for local quality) with sparse trajectory-level feedback (for conversational coherence), balancing immediate correctness with long-horizon helpfulness.
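A minimal way to combine the two signals, assuming both a per-turn score and a trajectory score are available, is a weighted blend; the weights below are placeholders to tune, not values from the proposal.

```python
def hybrid_rewards(turn_scores: list[float],
                   trajectory_score: float,
                   turn_weight: float = 0.5,
                   traj_weight: float = 0.5) -> list[float]:
    """Blend dense per-turn feedback with a sparse trajectory-level signal.

    Every turn keeps its local score (immediate correctness) and additionally
    shares the conversation-level score (long-horizon coherence).
    """
    return [turn_weight * s + traj_weight * trajectory_score for s in turn_scores]

# Example: locally strong turns in a conversation rated poorly overall (0.2)
# end up with muted rewards, discouraging "good turns, bad trajectory" behavior.
print(hybrid_rewards([0.9, 0.8, 0.7], trajectory_score=0.2))
```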
Expected Contributions
If this direction proves fruitful, we aim to deliver:
A conceptual framework for conversation-level alignment that extends current RLHF theory to multi-turn settings.
Datasets of multi-turn dialogues annotated with trajectory-level preference labels, enabling community research.
Benchmarks measuring long-horizon conversational qualities like coherence, memory, and task success across turns.
A prototype training pipeline demonstrating conversation-level RLHF at scale, with open-source implementation.
Open Challenges
This approach faces several meaningful obstacles:
Data Collection
Annotating entire dialogues is more expensive and cognitively demanding than rating individual turns. Inter-annotator agreement may be lower for subjective conversational qualities. Scaling to thousands of trajectory-level labels requires careful protocol design.
Evaluation Complexity
Designing objective, reproducible metrics for conversational flow remains an open problem. Human judgments of dialogue quality are context-dependent and nuanced. Automated evaluation risks optimizing for proxy metrics rather than true conversational helpfulness.
Training Stability
Multi-turn credit assignment introduces optimization challenges. Long-horizon policy gradients may suffer from high variance. Balancing exploration of conversational strategies with exploitation of known-good patterns requires careful algorithmic design.
Computational Cost
Training with trajectory-level optimization is inherently more computationally intensive than single-turn RLHF. Generating, evaluating, and learning from full conversations scales poorly compared to isolated examples.
Research Agenda
Start with constrained environments where conversation-level success is well-defined—negotiation games, tutoring tasks with measurable learning outcomes, multi-step planning scenarios.
Build a conversation preference dataset by collecting A vs. B dialogue comparisons, starting with synthetic conversations and progressively incorporating human dialogues.
Train reward models for trajectory quality using the preference data, exploring both learned models and rule-based heuristics for conversational coherence (a loss sketch follows this list).
Adapt policy optimization methods—PPO, DPO variants, and other recent RLHF techniques—to multi-step feedback settings, measuring both performance and training stability.
Release benchmarks and datasets to the research community, inviting collaboration and enabling systematic progress on conversation-level alignment.
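For the reward-model step above, a standard Bradley-Terry preference loss carries over from single responses to whole conversations. The sketch below assumes some trajectory scorer (such as the coherence model sketched earlier) already produces one scalar per dialogue; it is not the project's fixed training recipe.

```python
import torch
import torch.nn.functional as F

def trajectory_preference_loss(score_chosen: torch.Tensor,
                               score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss over pairs of whole conversations.

    score_chosen / score_rejected: (batch,) scalar scores for the preferred and
    dispreferred dialogue in each A-vs-B comparison. Minimizing this loss pushes
    the reward model to rank the preferred trajectory higher.
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage with made-up scores for two A-vs-B comparisons:
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.5])
loss = trajectory_preference_loss(chosen, rejected)
print(loss.item())  # scalar you would backpropagate through a real scorer
```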
Current Stage
Hypothesis formulation
Literature review in progress
Seeking early feedback
Next Steps
Q1 2026: Dataset design
Q2 2026: Prototype experiments
Q3 2026: Early results
Get Involved
Feedback on the research blueprint
Exploration partners and collaborators
Conversation datasets and field contexts
Currently self-funded by Tanay Pratap — if the research areas resonate, please reach out: tanay@ongroundlabs.org
References
This work builds on foundational research in RLHF:
Ouyang et al., 2022 – Training language models to follow instructions with human feedback
Bai et al., 2022 – Constitutional AI: Harmlessness from AI feedback
Christiano et al., 2017 – Deep reinforcement learning from human preferences