Support for realistic multi-step rollouts via async vLLM API #3284

Open
BjarniHaukur opened this issue Apr 12, 2025 · 6 comments · May be fixed by #3469

Labels
✨ enhancement New feature or request 🏋 GRPO Related to GRPO

Comments

@BjarniHaukur

BjarniHaukur commented Apr 12, 2025

Feature request

I propose adding a new OpenAI-compatible vLLM API server for use with the GRPOTrainer.

The implementation mirrors the weight syncing logic from trl/scripts/vllm_serve.py, but offloads most complexity to the existing vllm.entrypoints.openai.api_server infrastructure.

This enables training on significantly more complex rollouts than the standard synchronous .generate() endpoint can support. By supporting the OpenAI API interface, it also allows seamless integration with a wide range of existing agent frameworks and products.

This direction is a step toward reproducing pipelines like OpenHands LM 32B. I strongly suspect that Claude 3.7 Sonnet was trained in a similar fashion, iteratively reinforced using rollouts generated through its own Claude Code scaffolding.

Motivation

Currently, TRL only supports synchronous, batched .generate() calls for inference. This restricts the types of rollouts that can be created, especially in domains that benefit from multi-step approaches, tool use, or environment interaction.

I’ve been using TRL for my Master’s thesis on reinforcement learning for language models in the program repair domain. In several GRPO experiments, I repeatedly encountered the same limitation: with .generate(), all context construction, planning, and feedback extraction must happen within a single call. For example, in tasks from SWE-Gym, the model needs to generate code edits for real repositories. To do this in one .generate() call, the user must manually construct the relevant repo context and later parse outputs like diffs to extract useful reward signals. This makes experimentation slow and always feels like “reinventing the wheel.”

Rather than building ad-hoc scaffolding from scratch, I began exploring how to integrate existing coding agents like Aider directly into the training loop. These agents already support rich workflows such as repo mapping, diff parsing, and iterative interaction—and they use the OpenAI API interface. Enabling TRL to train models through this interface would allow us to run them in situ, inside the same environment they’re meant to be deployed in.

This proposal aims to bridge that gap and enable more realistic, multi-step training workflows as a first-class feature in TRL.

Your contribution

I have already developed an initial working implementation in this PR: #3285, which introduces vllm_serve_openai_compatible.py.

I intend to wrap up remaining loose ends and properly test this approach, both for functional correctness and throughput benchmarking.

The draft PR also includes a few project-specific utilities (WIP) to illustrate how this can be used in practice. For example, it shows how to parallelize existing Aider instances that interact with this server to generate training data.

One open issue is how to reliably access full conversation histories for each rollout. Since API calls happen internally within the agent, we cannot assume access to .get_conversation_history() or similar. A possible approach is to record all requests and responses server-side and map them back to the original prompt to reconstruct complete rollouts to train on.
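To make that idea concrete, here is a rough sketch of the reconstruction step (purely illustrative; reconstruct_rollout and the shape of request_log are assumptions of mine, not existing TRL or vLLM APIs). It assumes the server has appended every request/response pair it served, in order, and matches a rollout back to its original prompt via the first user message:

def reconstruct_rollout(original_prompt: str, request_log: list[dict]) -> list[dict]:
    # Walk the log backwards so we find the *last* request of the rollout,
    # which already carries the accumulated message history.
    for entry in reversed(request_log):
        messages = entry["request"]["messages"]
        if any(m["role"] == "user" and original_prompt in m["content"] for m in messages):
            final_reply = entry["response"]["choices"][0]["message"]
            return messages + [final_reply]
    return []  # no logged request matched this prompt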

I’d be happy to align the implementation with TRL’s design goals and iterate toward something mergeable.

@github-actions github-actions bot added ✨ enhancement New feature or request 🏋 GRPO Related to GRPO labels Apr 12, 2025
@BjarniHaukur BjarniHaukur changed the title OpenAI-Compatible vLLM Server with Weight Synchronization Support for Realistic Multi-Step Rollouts via Async vLLM API in TRL Apr 14, 2025
@BjarniHaukur BjarniHaukur changed the title Support for Realistic Multi-Step Rollouts via Async vLLM API in TRL Support for Realistic Multi-Step Rollouts via Async vLLM API Apr 14, 2025
@BjarniHaukur BjarniHaukur changed the title Support for Realistic Multi-Step Rollouts via Async vLLM API Support for realistic multi-step rollouts via async vLLM API Apr 14, 2025
@qgallouedec
Member

Great job on all the work you've done, and thanks for sharing it with the TRL community!

This likely involves significant changes to TRL, but your motivations seem solid and well thought out—it feels like the right time to explore this direction.

Would it be possible to break your work into several smaller PRs? That would make the review process much smoother. For example, you could start with a PR focused on leveraging the vLLM server, followed by another that integrates the tools/agents. (Of course, feel free to divide it differently if you think there's a better approach.)

@BjarniHaukur
Author

BjarniHaukur commented Apr 16, 2025

Will do!

I believe I’ve found a clean abstraction that minimizes the impact on existing code. Specifically, I’m exploring repurposing vllm_client to become a proper client interface, rather than just a wrapper around vllm_serve.py. With this in place, users could extend it and implement their own .generate() as needed.

The only other change would be to pass full data dictionaries (rather than just prompts) into .generate(), and expect modified dictionaries in return.

I’ll keep iterating on this until I find something that’s both elegant and fits my specific use case. Once it’s settled, I’ll split it into smaller, reviewable PRs.

I believe this could meaningfully lower the barrier to entry in this specific domain of RL training.

Here’s a minimal example showing how my use case looks now. With this, the "normal" GRPOTrainer setup can directly train on SWE-Gym with no extra scaffolding:

import os, multiprocessing as mp
from contextlib import redirect_stdout, redirect_stderr
from typing import Any
from datasets import load_dataset
from aider.coders import Coder
from aider.models import Model
from aider.io import InputOutput
from trl import GRPOConfig, GRPOTrainer
from trl.extras.vllm_client import VLLMClient

# clone_repo_at_commit, get_head_commit_diff, and clean_repo_dir are
# project-specific git helpers from the draft PR, omitted here for brevity.

class AiderClient(VLLMClient):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        os.environ["OPENAI_API_BASE"] = f"http://{self.host}:{self.server_port}/v1/completions"

    def process_one(self, data: dict[str, Any]) -> tuple[str, list]:
        orig = os.getcwd()
        temp = clone_repo_at_commit(data["repo_url"], data["base_commit"])
        try:
            os.chdir(temp)
            # Silence Aider's console output while it works on the problem statement
            with open(os.devnull, "w") as d, redirect_stdout(d), redirect_stderr(d):
                coder = Coder.create(main_model=Model("openai/our-model"), io=InputOutput(yes=True), suggest_shell_commands=False)
                coder.run(data["problem_statement"])
                messages = coder.format_chat_chunks().all_messages()
                diff = get_head_commit_diff(temp)
        finally:
            os.chdir(orig)
            clean_repo_dir(temp)
        return diff, messages

    def generate(self, data: list[dict[str, Any]], timeout: int = 300, **kwargs) -> list[dict[str, Any]]:
        # Run one Aider instance per sample in parallel
        with mp.Pool(min(len(data), mp.cpu_count())) as p:
            results = p.map_async(self.process_one, data).get(timeout=timeout)
        for sample, (diff, messages) in zip(data, results):
            sample["generated_diff"] = diff
            sample["messages"] = messages
        return data

trainer = GRPOTrainer(
    args=GRPOConfig(use_vllm=True),
    client=AiderClient(host="0.0.0.0", server_port=8000),
    train_dataset=load_dataset("SWE-Gym/SWE-Gym", split="train")
)

trainer.train()

@qgallouedec
Member

> I believe I’ve found a clean abstraction that minimizes the impact on existing code. Specifically, I’m exploring repurposing vllm_client to become a proper client interface, rather than just a wrapper around vllm_serve.py. With this in place, users could extend it and implement their own .generate() as needed.

Ok, modifying the client-server seems acceptable to me, especially if it can allow easier customization.
I look forward to hearing about your progress, keep us posted!

@kwanUm

kwanUm commented May 14, 2025

Hey @BjarniHaukur – I’m also looking at migrating GRPO roll-outs to an online vLLM setup for better performance and agent-style usability.
Also, really like the idea of letting the server keep the full conversation so the client doesn’t have to resend context; we’re doing that today in a separate RolloutManager on the client side, but server-side seems cleaner.
Do you already have a branch/PR that swaps the current LLM() offline path for AsyncLLMEngine? Have you been able to test throughput or latency compared to the batched .generate() route?
Would love to follow the work and possibly help test / contribute.

@BjarniHaukur
Author

BjarniHaukur commented May 14, 2025

Hey @kwanUm, still working on it. I've been super close to finishing this for a while now. The main problem resides in the online weight syncing behavior for AsyncLLM. I've tried a bunch of things out, but the system ends up in a deadlock somewhere internally in vLLM. That isn't too surprising: vLLM only recently merged a PR adding collective_rpc_async, which is the method we would need to initiate the weight updates. I'm also checking out SGLang as an async rollout client. That seems a bit more mature, but it's still a WIP on my end.

I've put quite a lot of thought into how it would be best to integrate custom clients, and I'm relatively convinced of my approach. It decouples all the generation logic from the GRPOTrainer and offloads it to client.generate(), which receives the inputs dictionaries containing everything from your HF dataset and returns prompts/completions plus anything the user wants to pass to the reward functions.

(overly simplified example)

from abc import ABC, abstractmethod
from typing import TypedDict

class GenerationResult(TypedDict, total=False):
    """GRPO payload with required prompt/completion; extras allowed."""
    # Shared inputs across N rollouts in GRPO (across many GenerationResults)
    prompt: list[dict[str, str]]  # {role: str, content: str}
    # This comes after that, N different rollouts of the same prompt
    completion: list[dict[str, str]]  # {role: str, content: str}
    # Extra keys and values are forwarded to the user specified reward functions


class VLLMClient(ABC):
    @abstractmethod
    def generate(self, data: list[dict], **kwargs) -> list[GenerationResult]:
        pass

# Inside GRPOTrainer
...
output = client.generate(inputs)
...
rewards = reward_func(**output)
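
For instance, a user-specified reward function could then consume one of those extra keys directly (the generated_diff key here is just illustrative, following the AiderClient example above):

def diff_reward(prompts, completions, generated_diff, **kwargs):
    # Reward rollouts that actually produced a non-empty diff
    return [1.0 if diff.strip() else 0.0 for diff in generated_diff]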

You can check out my working branch (though, fair warning, it's not stable or ready at all); it might help you.
https://github.com/ASSERT-KTH/trl/tree/dev/trl

I'll post here again when I have something more concrete. Would love some help in integrating this type of behavior into TRL though! There's some semblance of it in verl, openpipe/art, and verifiers, but nothing that quite checks all the boxes. Most existing approaches miss at least one of the following: full OpenAI compatibility (true async multi-step rollouts, tool-calling support, etc.) or general ease of use.

@BjarniHaukur
Author

Hey @qgallouedec

Finally got it working and found an abstraction that I believe could fit in TRL (#3469).

The new vllm_serve_async.py script behaves exactly like vllm_serve.py. It supports tensor and data parallelism but runs into the same error you mentioned here: vllm-project/vllm#17079.

Instead of extending VLLMClient, I opted for adding a rollout_func to the GRPOTrainer. This gives users full control over how rollouts are generated and lets them forward reward signals that are not present in the completions (e.g. a program's runtime, test coverage, or other environment-based feedback).

When vllm_mode is set to "async_server", the launched server exposes a fully featured, OpenAI-compatible /v1/ endpoint with tool-calling support. All endpoint complexity is offloaded to vLLM’s upstream implementation; the script just mirrors their design and adds our weight syncing logic.

trl vllm-serve-async \
  --model Qwen/Qwen3-8B \
  --max_model_len 8192 \
  --enable-auto-tool-choice \
  --reasoning_parser deepseek_r1 \
  --tool-call-parser hermes

This allows any LLM-powered application with measurable reward metrics to be trained with GRPOTrainer by minimally wrapping the app in a rollout_func which interacts with this server.
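
To give a rough idea of the shape, here is a minimal rollout_func sketch (the exact signature may differ from what's in the PR; the OpenAI client usage, the single-call loop, and the runtime_s key are purely illustrative):

from openai import OpenAI

oai = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")

def rollout_func(inputs: list[dict], **kwargs) -> list[dict]:
    results = []
    for sample in inputs:
        prompt = [{"role": "user", "content": sample["problem_statement"]}]
        # Any multi-step agent loop (tool calls, environment feedback, ...) could
        # run here instead of this single chat completion call.
        response = oai.chat.completions.create(model="Qwen/Qwen3-8B", messages=prompt)
        completion = [{"role": "assistant", "content": response.choices[0].message.content}]
        # Extra keys (illustrative) are forwarded to the reward functions
        # alongside the standard prompt/completion.
        results.append({"prompt": prompt, "completion": completion, "runtime_s": 0.0})
    return results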

My CodeRepairRL project provides an example of a rollout_func using a terminal-based coding agent (Nano-Agent).

@kwanUm, this might be of interest to you too!
