Scalable policy iteration in language space
This repository contains the NLAC (Natural Language Actor-Critic) codebase.
- Install verl using the official installation instructions.
- Prepare your data in parquet format (a minimal sketch follows this list).
- Update the configuration in `nlrl/tasks/config/nlac_trainer.yaml` or override options via the command line.
- Run training: `bash train_example.sh`

For detailed instructions (including implementing your own environment), see the sections below.
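If you are building a dataset from scratch, the sketch below shows one way to write a training parquet file with pandas. The column names (`data_source`, `prompt`, `reward_model`) are assumptions modeled on verl-style datasets, not a documented schema; check the dataset preparation scripts in this repository for the exact fields expected.

```python
# Hedged sketch of writing a training parquet file with pandas.
# The column names below are assumptions based on verl-style datasets;
# verify them against the dataset scripts in this repository.
import pandas as pd

rows = [
    {
        "data_source": "openai/gsm8k",  # assumed field: selects the reward function
        "prompt": [{"role": "user", "content": "Natalia sold clips to 48 friends..."}],
        "reward_model": {"style": "rule", "ground_truth": "72"},  # assumed field
    }
]

pd.DataFrame(rows).to_parquet("data/train.parquet")
```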
The main configuration is in `nlrl/tasks/config/nlac_trainer.yaml`. Key sections include:

- Data Configuration (`data`):
  - `train_files`: Path to training data (parquet format)
  - `val_files`: Path to validation data
  - `max_prompt_length`: Maximum prompt length
  - `max_response_length`: Maximum response length
  - `max_turn_length`: Maximum length per turn in multi-turn conversations
- Model Configuration (`actor_rollout_ref.model`):
  - `path`: Path to the base model (HuggingFace format or HDFS path)
  - `external_lib`: External library for model loading
  - `use_rmpad`: Whether to use rmpad (remove padding)
  - Note: Only FSDP training is supported
- Rollout Configuration (`actor_rollout_ref.rollout`):
  - `name`: Backend name; only `sglang` or `vllm` are supported (verl supports only these two)
    - Use `sglang` for agent loops and multi-turn interactions (recommended)
    - Use `vllm` for standard single-turn rollout
  - `multi_turn.enable`: Enable multi-turn interactions
  - `multi_turn.max_user_turns`: Maximum user turns
  - `multi_turn.max_assistant_turns`: Maximum assistant turns
  - `multi_turn.interaction_config_path`: Path to the interaction config
  - `agent.default_agent_loop`: Agent loop type (defaults to `nlrl`)
  - `agent.agent_loop_config_path`: Path to the agent loop config
  - `refine_kwargs.num_refinement_steps`: Number of refinement steps
  - `refine_kwargs.num_nlq_samples`: Number of NLQ evaluation samples
- Training Configuration (`trainer`):
  - `project_name`: Project name for logging
  - `experiment_name`: Experiment name
  - `nnodes`: Number of nodes
  - `n_gpus_per_node`: GPUs per node
  - `total_epochs`: Total training epochs
  - `save_freq`: Checkpoint save frequency
  - `test_freq`: Validation frequency
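A quick way to sanity-check the resolved configuration is to load the YAML with OmegaConf, shown in the sketch below. This assumes the file can be loaded standalone and that, as in verl-based trainers, dotted-path overrides such as `trainer.total_epochs=1` map onto these keys; the override values here are placeholders.

```python
# Hedged sketch: inspect the trainer config with OmegaConf.
# Assumes verl-style dotted overrides map onto the keys listed above.
from omegaconf import OmegaConf

cfg = OmegaConf.load("nlrl/tasks/config/nlac_trainer.yaml")

# Apply the same dotted-path overrides you would pass on the command line.
overrides = OmegaConf.from_dotlist([
    "data.train_files=data/train.parquet",  # placeholder path
    "trainer.n_gpus_per_node=8",
])
cfg = OmegaConf.merge(cfg, overrides)

print(OmegaConf.to_yaml(cfg.trainer))
```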
To use custom environments, implement them in `nlrl/environments/` following the `BaseEnvironment` interface:
```python
from nlrl.environments.base import BaseEnvironment, register_env


@register_env
class MyEnvironment(BaseEnvironment):
    env_str_prefix = "MyEnv"

    @classmethod
    def from_env_str(cls, env_str: str):
        # Parse the environment string and create an instance
        pass

    def step(self, action: str, timeout=None):
        # Execute the action and return (success, output)
        pass

    @property
    def reward(self):
        # Return the current reward
        pass

    @property
    def finished(self):
        # Return completion status
        pass
```

The environment string format is: `"EnvPrefix@{...json_config...}"`
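To make the contract concrete, here is a hedged usage sketch for the hypothetical `MyEnvironment` above, once its methods are implemented. It assumes `from_env_str` returns an instance and that `step` returns a `(success, output)` tuple, as the interface comments suggest; the `"target"` key in the JSON payload is invented for illustration.

```python
# Hedged usage sketch for the hypothetical MyEnvironment above.
# The "target" key is made up for illustration; real configuration keys
# depend on your environment implementation.
import json

env_config = {"target": 42}
env_str = f"MyEnv@{json.dumps(env_config)}"

env = MyEnvironment.from_env_str(env_str)

while not env.finished:
    action = "some model-generated action"       # normally produced by the policy
    success, output = env.step(action, timeout=30)
    if not success:
        break

print("episode reward:", env.reward)
```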
The codebase supports various reward functions configured via `reward_model.style`:

- `rule-openai/gsm8k`: GSM8K rule-based scoring
- `rule-lighteval/MATH`: MATH dataset scoring
- `code-sandbox`: Code execution in a sandbox
- `verifier_service`: External verifier service
- `agent_env`: Environment-based rewards
- And more (see `nlrl/utils/reward_score.py`)
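The actual implementations live in `nlrl/utils/reward_score.py`. As an illustration only, rule-based scorers in verl-style codebases are usually plain functions that compare an extracted final answer against the ground truth; the sketch below assumes a GSM8K-style `#### <answer>` format, and its name and signature are assumptions rather than the repository's API.

```python
# Hedged sketch of a rule-based scorer. The name, signature, and the
# "#### <answer>" convention are assumptions modeled on GSM8K-style scoring;
# see nlrl/utils/reward_score.py for the actual implementations.
import re


def compute_score(solution_str: str, ground_truth: str) -> float:
    """Return 1.0 if the final '#### <answer>' matches the ground truth, else 0.0."""
    match = re.search(r"####\s*(-?[\d,\.]+)", solution_str)
    if match is None:
        return 0.0
    answer = match.group(1).replace(",", "")
    return 1.0 if answer == ground_truth else 0.0
```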
If you use this code, please cite:
```bibtex
@article{nlac2025,
  title={Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space},
  author={Hong, Joey and Liu, Kang and Ling, Zhan and Chen, Jiecao and Levine, Sergey},
  journal={arXiv preprint arXiv:2512.04601},
  year={2025}
}
```