
Natural Language Actor-Critic

Scalable policy iteration in language space

Paper | Website


Overview

This repository contains the NLAC (Natural Language Actor-Critic) codebase.

Quick Start

  1. Install verl using the official installation instructions.

  2. Prepare your data in Parquet format (see the data sketch after this list).

  3. Update the configuration in nlrl/tasks/config/nlac_trainer.yaml, or override settings on the command line.

  4. Run training:

bash train_example.sh
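
As a reference for step 2, the sketch below writes a toy Parquet dataset with pandas. The column layout follows a common verl convention (data_source, a chat-format prompt, and a reward_model dict carrying the ground truth); this is an assumption, so check your task's dataloader for the exact schema it expects.

import pandas as pd

# Toy training set in a verl-style parquet layout (assumed, not verified
# against this repo's dataloader); adapt the fields to your task.
rows = [{
    "data_source": "openai/gsm8k",
    "prompt": [{"role": "user", "content": "What is 2 + 3?"}],
    "reward_model": {"style": "rule", "ground_truth": "5"},
}]
pd.DataFrame(rows).to_parquet("train.parquet")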

For detailed instructions (including implementing your own environment), see the sections below.

Details

Important Configurations

The main configuration is in nlrl/tasks/config/nlac_trainer.yaml. Key sections include (a programmatic loading sketch follows the list):

  1. Data Configuration (data):

    • train_files: Path to training data (parquet format)
    • val_files: Path to validation data
    • max_prompt_length: Maximum prompt length
    • max_response_length: Maximum response length
    • max_turn_length: Maximum length per turn in multi-turn conversations
  2. Model Configuration (actor_rollout_ref.model):

    • path: Path to the base model (HuggingFace format or HDFS path)
    • external_lib: External library for model loading
    • use_rmpad: Whether to remove padding tokens (sequence packing) for efficiency
    • Note: Only FSDP training is supported
  3. Rollout Configuration (actor_rollout_ref.rollout):

    • name: Rollout backend; sglang and vllm are the only backends verl supports
      • Use sglang for agent loops and multi-turn interactions (recommended)
      • Use vllm for standard single-turn rollouts
    • multi_turn.enable: Enable multi-turn interactions
    • multi_turn.max_user_turns: Maximum user turns
    • multi_turn.max_assistant_turns: Maximum assistant turns
    • multi_turn.interaction_config_path: Path to interaction config
    • agent.default_agent_loop: Agent loop type (defaults to nlrl)
    • agent.agent_loop_config_path: Path to agent loop config
    • refine_kwargs.num_refinement_steps: Number of refinement steps
    • refine_kwargs.num_nlq_samples: Number of natural-language Q-function (NLQ) evaluation samples
  4. Training Configuration (trainer):

    • project_name: Project name for logging
    • experiment_name: Experiment name
    • nnodes: Number of nodes
    • n_gpus_per_node: GPUs per node
    • total_epochs: Total training epochs
    • save_freq: Checkpoint save frequency
    • test_freq: Validation frequency
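
To inspect or tweak these settings programmatically, the sketch below loads the YAML with OmegaConf (the configuration library verl builds on) and applies command-line-style dot-list overrides. It assumes the file is plain OmegaConf-compatible YAML; if it uses Hydra defaults lists, compose it with Hydra instead. The keys shown come from the list above; the values are illustrative.

from omegaconf import OmegaConf

# Load the trainer config and merge in dot-list overrides, mirroring
# what command-line overrides accomplish.
cfg = OmegaConf.load("nlrl/tasks/config/nlac_trainer.yaml")
overrides = OmegaConf.from_dotlist([
    "data.max_prompt_length=2048",
    "actor_rollout_ref.rollout.name=sglang",
    "trainer.total_epochs=3",
])
cfg = OmegaConf.merge(cfg, overrides)
print(OmegaConf.to_yaml(cfg.trainer))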

Environment Implementation

To use custom environments, implement them in nlrl/environments/ following the BaseEnvironment interface:

import json

from nlrl.environments.base import BaseEnvironment, register_env

@register_env
class MyEnvironment(BaseEnvironment):
    # Prefix that routes environment strings to this class.
    env_str_prefix = "MyEnv"

    def __init__(self, config: dict):
        self.config = config
        self._reward = 0.0
        self._finished = False

    @classmethod
    def from_env_str(cls, env_str: str):
        # Parse an environment string of the form "MyEnv@{...json_config...}":
        # strip the prefix and decode the JSON payload.
        _, _, payload = env_str.partition("@")
        return cls(json.loads(payload))

    def step(self, action: str, timeout=None):
        # Execute the action and return (success, output).
        # Placeholder logic: terminate with reward 1.0 after one step.
        self._reward = 1.0
        self._finished = True
        return True, f"executed: {action}"

    @property
    def reward(self):
        # Current scalar reward.
        return self._reward

    @property
    def finished(self):
        # Whether the episode has terminated.
        return self._finished

The environment string format is: "EnvPrefix@{...json_config...}"
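
For example, the class above would be constructed from an environment string like the one below (max_steps is a hypothetical config key, used purely for illustration):

env = MyEnvironment.from_env_str('MyEnv@{"max_steps": 5}')
success, output = env.step("echo hello")
print(success, output, env.reward, env.finished)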

Reward Functions

The codebase supports various reward functions configured via reward_model.style:

  • rule-openai/gsm8k: GSM8K rule-based scoring
  • rule-lighteval/MATH: MATH dataset scoring
  • code-sandbox: Code execution in sandbox
  • verifier_service: External verifier service
  • agent_env: Environment-based rewards
  • And more (see nlrl/utils/reward_score.py)
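
For orientation, a rule-based scorer in the style of the GSM8K entry typically extracts a final answer and compares it to the ground truth. The sketch below is a hypothetical illustration of that pattern, not the code in nlrl/utils/reward_score.py:

import re

def gsm8k_style_score(response: str, ground_truth: str) -> float:
    # Hypothetical rule-based scorer: treat the last number in the
    # response as the final answer and compare it to the ground truth.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not numbers:
        return 0.0
    try:
        return 1.0 if float(numbers[-1]) == float(ground_truth) else 0.0
    except ValueError:
        return 0.0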

Citation

If you use this code, please cite:

@article{nlac2025,
  title={Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space},
  author={Hong, Joey and Liu, Kang and Ling, Zhan and Chen, Jiecao and Levine, Sergey},
  journal={arXiv preprint arXiv:2512.04601},
  year={2025}
}
