Scalable policy iteration in language space
This repository contains the NLAC (Natural Language Actor-Critic) codebase.
- Install verl using the official installation instructions.
- Prepare your data in parquet format (a minimal sketch follows this list).
- Update the configuration in `nlrl/tasks/config/nlac_trainer.yaml` or override options via the command line.
- Run training: `bash train_example.sh`

For detailed instructions (including implementing your own environment), see the sections below.
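If you are building a dataset from scratch, the sketch below shows one way to write a training parquet file with pandas. The column names (`data_source`, `prompt`, `reward_model`) are assumptions modeled on verl-style datasets, not a documented schema; check the dataset preparation scripts in this repository for the exact fields expected.

```python
# Hedged sketch of writing a training parquet file with pandas.
# The column names below are assumptions based on verl-style datasets;
# verify them against the dataset scripts in this repository.
import pandas as pd

rows = [
    {
        "data_source": "openai/gsm8k",  # assumed field: selects the reward function
        "prompt": [{"role": "user", "content": "Natalia sold clips to 48 friends..."}],
        "reward_model": {"style": "rule", "ground_truth": "72"},  # assumed field
    }
]

pd.DataFrame(rows).to_parquet("data/train.parquet")
```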
The main configuration is in `nlrl/tasks/config/nlac_trainer.yaml`. Key sections include:

- Data Configuration (`data`):
  - `train_files`: Path to training data (parquet format)
  - `val_files`: Path to validation data
  - `max_prompt_length`: Maximum prompt length
  - `max_response_length`: Maximum response length
  - `max_turn_length`: Maximum length per turn in multi-turn conversations
- Model Configuration (`actor_rollout_ref.model`):
  - `path`: Path to the base model (HuggingFace format or HDFS path)
  - `external_lib`: External library for model loading
  - `use_rmpad`: Whether to use rmpad (remove padding)
  - Note: Only FSDP training is supported
- Rollout Configuration (`actor_rollout_ref.rollout`):
  - `name`: Backend name; only `sglang` or `vllm` are supported (verl supports only these two)
    - Use `sglang` for agent loops and multi-turn interactions (recommended)
    - Use `vllm` for standard single-turn rollout
  - `multi_turn.enable`: Enable multi-turn interactions
  - `multi_turn.max_user_turns`: Maximum user turns
  - `multi_turn.max_assistant_turns`: Maximum assistant turns
  - `multi_turn.interaction_config_path`: Path to the interaction config
  - `agent.default_agent_loop`: Agent loop type (defaults to `nlrl`)
  - `agent.agent_loop_config_path`: Path to the agent loop config
  - `refine_kwargs.num_refinement_steps`: Number of refinement steps
  - `refine_kwargs.num_nlq_samples`: Number of NLQ evaluation samples
- Training Configuration (`trainer`):
  - `project_name`: Project name for logging
  - `experiment_name`: Experiment name
  - `nnodes`: Number of nodes
  - `n_gpus_per_node`: GPUs per node
  - `total_epochs`: Total training epochs
  - `save_freq`: Checkpoint save frequency
  - `test_freq`: Validation frequency
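A quick way to sanity-check the resolved configuration is to load the YAML with OmegaConf, shown in the sketch below. This assumes the file can be loaded standalone and that, as in verl-based trainers, dotted-path overrides such as `trainer.total_epochs=1` map onto these keys; the override values here are placeholders.

```python
# Hedged sketch: inspect the trainer config with OmegaConf.
# Assumes verl-style dotted overrides map onto the keys listed above.
from omegaconf import OmegaConf

cfg = OmegaConf.load("nlrl/tasks/config/nlac_trainer.yaml")

# Apply the same dotted-path overrides you would pass on the command line.
overrides = OmegaConf.from_dotlist([
    "data.train_files=data/train.parquet",  # placeholder path
    "trainer.n_gpus_per_node=8",
])
cfg = OmegaConf.merge(cfg, overrides)

print(OmegaConf.to_yaml(cfg.trainer))
```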
To use custom environments, implement them in `nlrl/environments/` following the `BaseEnvironment` interface:
```python
from nlrl.environments.base import BaseEnvironment, register_env


@register_env
class MyEnvironment(BaseEnvironment):
    env_str_prefix = "MyEnv"

    @classmethod
    def from_env_str(cls, env_str: str):
        # Parse the environment string and create an instance
        pass

    def step(self, action: str, timeout=None):
        # Execute the action and return (success, output)
        pass

    @property
    def reward(self):
        # Return the current reward
        pass

    @property
    def finished(self):
        # Return completion status
        pass
```

The environment string format is: `"EnvPrefix@{...json_config...}"`
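To make the contract concrete, here is a hedged usage sketch for the hypothetical `MyEnvironment` above, once its methods are implemented. It assumes `from_env_str` returns an instance and that `step` returns a `(success, output)` tuple, as the interface comments suggest; the `"target"` key in the JSON payload is invented for illustration.

```python
# Hedged usage sketch for the hypothetical MyEnvironment above.
# The "target" key is made up for illustration; real configuration keys
# depend on your environment implementation.
import json

env_config = {"target": 42}
env_str = f"MyEnv@{json.dumps(env_config)}"

env = MyEnvironment.from_env_str(env_str)

while not env.finished:
    action = "some model-generated action"       # normally produced by the policy
    success, output = env.step(action, timeout=30)
    if not success:
        break

print("episode reward:", env.reward)
```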
The codebase supports various reward functions configured via `reward_model.style`:

- `rule-openai/gsm8k`: GSM8K rule-based scoring
- `rule-lighteval/MATH`: MATH dataset scoring
- `code-sandbox`: Code execution in a sandbox
- `verifier_service`: External verifier service
- `agent_env`: Environment-based rewards
- And more (see `nlrl/utils/reward_score.py`)
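The actual implementations live in `nlrl/utils/reward_score.py`. As an illustration only, rule-based scorers in verl-style codebases are usually plain functions that compare an extracted final answer against the ground truth; the sketch below assumes a GSM8K-style `#### <answer>` format, and its name and signature are assumptions rather than the repository's API.

```python
# Hedged sketch of a rule-based scorer. The name, signature, and the
# "#### <answer>" convention are assumptions modeled on GSM8K-style scoring;
# see nlrl/utils/reward_score.py for the actual implementations.
import re


def compute_score(solution_str: str, ground_truth: str) -> float:
    """Return 1.0 if the final '#### <answer>' matches the ground truth, else 0.0."""
    match = re.search(r"####\s*(-?[\d,\.]+)", solution_str)
    if match is None:
        return 0.0
    answer = match.group(1).replace(",", "")
    return 1.0 if answer == ground_truth else 0.0
```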
If you use this code, please cite:
```bibtex
@article{nlac2025,
  title={Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space},
  author={Hong, Joey and Liu, Kang and Ling, Zhan and Chen, Jiecao and Levine, Sergey},
  journal={arXiv preprint arXiv:2512.04601},
  year={2025}
}
```