Towards Multi-Turn RLHF

Aligning language models with conversational trajectories

Launching 2026 · Currently in exploration

The Problem

Current reinforcement learning from human feedback (RLHF) techniques overwhelmingly optimize language models on single-turn interactions: one prompt, one response, one evaluation.

However, real human–AI interaction is multi-turn, where meaning and usefulness emerge from the continuity of dialogue. Training only on isolated Q–A pairs introduces a structural misalignment between training objectives and deployment realities.

Consider a tutoring session. A model might give a technically correct answer in turn 1, but fail to address the confusion the student expresses in turn 2, or contradict its own explanation by turn 4. Single-turn RLHF cannot capture or optimize for these conversational qualities.

Three Failure Modes

Myopic Rewards

Single-turn reward models cannot capture qualities like “the model stayed consistent across five turns” or “the clarification in turn 3 resolved confusion introduced in turn 1.” Both the temporal structure of conversation and coherence across turns remain invisible to isolated evaluations.

Misaligned Optimization

Models are optimized for one-shot helpfulness, while users judge performance on conversation arcs. A model might maximize turn-level rewards while producing incoherent multi-turn trajectories.

Missing Benchmarks

Current evaluation suites—MT-Bench, HELM, AlpacaEval—rarely assess full dialogue trajectories; even MT-Bench extends only to a second turn. We lack standardized ways to measure conversational coherence, memory consistency, or long-horizon task success.

We frame multi-turn dialogue as an RL episode, where each utterance is a step and the final outcome provides a trajectory-level reward.
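
To make this framing concrete, here is a minimal sketch of an episode container under that view. The class and field names (`Turn`, `DialogueEpisode`) are illustrative placeholders, not a committed interface.

```python
# Minimal sketch of the dialogue-as-episode framing (names are placeholders).
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    user_message: str      # the observation for this step
    model_response: str    # the action taken by the policy


@dataclass
class DialogueEpisode:
    turns: List[Turn] = field(default_factory=list)
    trajectory_reward: float = 0.0  # one scalar judged over the whole conversation

    def state_at(self, t: int) -> str:
        """The 'state' at step t is simply the conversation history so far."""
        history = []
        for turn in self.turns[:t]:
            history.append(f"User: {turn.user_message}")
            history.append(f"Assistant: {turn.model_response}")
        return "\n".join(history)
```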

Our Approach

We propose conversation-level RLHF, where the unit of optimization shifts from individual responses to complete dialogue trajectories.

Key Components

Trajectory Rewards

Instead of evaluating isolated turns, human or synthetic annotators rate entire conversations. This captures qualities like coherence across turns, successful task completion, and maintained context.
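
As a purely illustrative example of what a conversation-level annotation might look like (the field names and rating dimensions below are assumptions, not a finalized schema), the ratings attach to the whole dialogue rather than to any single turn:

```python
# Illustrative conversation-level annotation record (all field names are placeholders).
trajectory_annotation = {
    "conversation_id": "tutoring-session-0042",
    "turns": [
        {"user": "Can you explain recursion?",
         "assistant": "Recursion is when a function calls itself..."},
        {"user": "I'm still confused about the base case.",
         "assistant": "Think of the base case as the stopping condition..."},
    ],
    # Ratings apply to the conversation as a whole, not to any single turn.
    "trajectory_ratings": {
        "coherence": 4,          # consistency across turns, 1-5
        "task_success": 5,       # did the conversation achieve its goal?
        "context_retention": 4,  # did later turns build on earlier ones?
    },
}
```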

Credit Assignment

We propagate trajectory-level rewards back through individual turns using policy gradient methods, temporal difference learning, or Monte Carlo rollouts. This allows the model to learn which conversational moves contributed to overall success.
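
As a hedged sketch of the simplest of these options, a Monte Carlo / REINFORCE-style estimator, the single end-of-conversation reward can be discounted back to each turn and used to weight per-turn log-probabilities. Here `turn_logprobs` and the discount `gamma` are assumed inputs, and this is only one candidate among the methods listed above.

```python
import torch


def turn_level_returns(trajectory_reward: float,
                       num_turns: int,
                       gamma: float = 0.95) -> torch.Tensor:
    """Discount a single end-of-conversation reward back to each turn.

    The return for turn t is gamma^(T-1-t) * R: a plain Monte Carlo return
    with zero intermediate rewards, so earlier turns receive a more heavily
    discounted share of the credit.
    """
    exponents = torch.arange(num_turns - 1, -1, -1, dtype=torch.float32)
    return (gamma ** exponents) * trajectory_reward


def reinforce_loss(turn_logprobs: torch.Tensor,
                   trajectory_reward: float,
                   gamma: float = 0.95) -> torch.Tensor:
    """REINFORCE-style surrogate loss for one conversation.

    `turn_logprobs` holds the summed token log-probabilities of each model
    turn, i.e. log pi(response_t | history_t), one entry per turn.
    """
    returns = turn_level_returns(trajectory_reward, turn_logprobs.shape[0], gamma)
    return -(turn_logprobs * returns).sum()
```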

Coherence Models

We train specialized reward models to score properties like memory consistency, tone stability, and context preservation—qualities that only emerge across multiple turns.
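
A minimal sketch of such a coherence reward model, assuming some upstream encoder produces a pooled embedding of the full dialogue (the encoder itself is omitted), might pair a scalar scoring head with a pairwise preference loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoherenceRewardModel(nn.Module):
    """Scalar scoring head over a pooled embedding of the full dialogue.

    The encoder that produces `dialogue_embedding` (e.g. a frozen language
    model) is assumed and omitted here.
    """

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, dialogue_embedding: torch.Tensor) -> torch.Tensor:
        return self.score_head(dialogue_embedding).squeeze(-1)


def pairwise_preference_loss(score_preferred: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss on whole-conversation comparisons:
    push the preferred dialogue's score above the rejected one's."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```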

Hybrid Supervision

Our approach combines dense per-turn feedback (for local quality) with sparse trajectory-level feedback (for conversational coherence), balancing immediate correctness with long-horizon helpfulness.
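
One simple way to combine the two signals, assuming a linear blend with illustrative weights `alpha` and `beta`, is to scale the dense per-turn scores and attach the sparse trajectory reward to the final turn:

```python
from typing import List


def blended_turn_rewards(per_turn_scores: List[float],
                         trajectory_reward: float,
                         alpha: float = 1.0,
                         beta: float = 1.0) -> List[float]:
    """Blend dense per-turn feedback with a sparse conversation-level reward.

    The trajectory reward is attached to the final turn; credit assignment
    (as sketched earlier) then distributes it backwards during optimization.
    """
    rewards = [alpha * score for score in per_turn_scores]
    if rewards:
        rewards[-1] += beta * trajectory_reward
    return rewards
```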

Expected Contributions

If this direction proves fruitful, we aim to deliver:

A conceptual framework for conversation-level alignment that extends current RLHF theory to multi-turn settings.

Datasets of multi-turn dialogues annotated with trajectory-level preference labels, enabling community research.

Benchmarks measuring long-horizon conversational qualities like coherence, memory, and task success across turns.

A prototype training pipeline demonstrating conversation-level RLHF at scale, with open-source implementation.

Open Challenges

This approach faces several meaningful obstacles:

Data Collection

Annotating entire dialogues is more expensive and cognitively demanding than rating individual turns. Inter-annotator agreement may be lower for subjective conversational qualities. Scaling to thousands of trajectory-level labels requires careful protocol design.

Evaluation Complexity

Designing objective, reproducible metrics for conversational flow remains an open problem. Human judgments of dialogue quality are context-dependent and nuanced. Automated evaluation risks optimizing for proxy metrics rather than true conversational helpfulness.

Training Stability

Multi-turn credit assignment introduces optimization challenges. Long-horizon policy gradients may suffer from high variance. Balancing exploration of conversational strategies with exploitation of known-good patterns requires careful algorithmic design.
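
One standard mitigation worth noting here, sketched only as an illustration rather than a settled design choice, is subtracting a simple baseline and normalizing the per-turn returns before the policy-gradient update:

```python
import torch


def normalized_advantages(turn_returns: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Subtract the mean return (a simple baseline) and rescale to unit variance.

    Baseline subtraction is a standard variance-reduction step for
    long-horizon policy gradients; it reduces the gradient's variance
    without changing its expectation.
    """
    return (turn_returns - turn_returns.mean()) / (turn_returns.std() + eps)
```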

Computational Cost

Training with trajectory-level optimization is inherently more computationally intensive than single-turn RLHF. Generating, evaluating, and learning from full conversations scales poorly compared to isolated examples.

Research Agenda

We're approaching this in stages:

Start with constrained environments where conversation-level success is well-defined—negotiation games, tutoring tasks with measurable learning outcomes, multi-step planning scenarios.

Build a conversation preference dataset by collecting A vs. B dialogue comparisons, starting with synthetic conversations and progressively incorporating human dialogues.

Train reward models for trajectory quality using the preference data, exploring both learned models and rule-based heuristics for conversational coherence.

Adapt policy optimization methods—PPO, DPO variants, and other recent RLHF techniques—to multi-step feedback settings, measuring both performance and training stability (a trajectory-level DPO-style loss is sketched after this list).

Release benchmarks and datasets to the research community, inviting collaboration and enabling systematic progress on conversation-level alignment.
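
As referenced in the fourth stage above, here is a hedged sketch of how a DPO-style objective might extend to whole conversations: per-turn log-probabilities of the model's responses are summed over the trajectory before forming the usual preference margin against a reference model. The function and argument names are assumptions for illustration, not a finalized design.

```python
import torch
import torch.nn.functional as F


def trajectory_dpo_loss(policy_logprobs_preferred: torch.Tensor,
                        policy_logprobs_rejected: torch.Tensor,
                        ref_logprobs_preferred: torch.Tensor,
                        ref_logprobs_rejected: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """DPO-style preference loss extended to whole conversations.

    Each tensor holds one log-probability per model turn; summing over turns
    treats the full trajectory as the unit being preferred or rejected.
    """
    margin_preferred = policy_logprobs_preferred.sum() - ref_logprobs_preferred.sum()
    margin_rejected = policy_logprobs_rejected.sum() - ref_logprobs_rejected.sum()
    return -F.logsigmoid(beta * (margin_preferred - margin_rejected))
```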

Current Stage

Hypothesis formulation

Literature review in progress

Seeking early feedback

Next Steps

Q1 2026: Dataset design

Q2 2026: Prototype experiments

Q3 2026: Early results

Get Involved

On Ground Labs is preparing for a 2026 launch. While these directions are still being explored, we're gathering the people and context that will shape the work.

Feedback on the research blueprint

Point out gaps, related work, or alternative framings we should investigate before launch.

Exploration partners and collaborators

Researchers, engineers, or students who want to co-design experiments during the exploration phase.

Conversation datasets and field contexts

Organisations willing to share multi-turn dialogue data or host pilots as we shape the 2026 rollout.

Currently self-funded by Tanay Pratap — if the research areas resonate, please reach out: tanay@ongroundlabs.org

References

This work builds on foundational research in RLHF:

Ouyang et al., 2022 – Training language models to follow instructions with human feedback

Bai et al., 2022 – Constitutional AI: Harmlessness from AI feedback

Christiano et al., 2017 – Deep reinforcement learning from human preferences