TRACED: Transition-aware Regret Approximation with Co-Learnability for Environment Design

License: CC BY-NC 4.0 arXiv

TRACED (Transition-aware Regret Approximation with Co-Learnability for Environment Design) is an unsupervised environment-design algorithm that refines regret estimation by decomposing it into value and transition prediction errors, and integrates a lightweight co-learnability metric to quantify the transfer benefits between tasks. It outperforms existing methods (PLR⊥, DR, ACCEL, ADD, and CENIE) in zero-shot transfer on MiniGrid mazes and BipedalWalker.
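As a rough illustration of the idea described above, the sketch below combines a value error and a transition-prediction error into a regret proxy, then adds a weighted co-learnability term to obtain a level priority. This is not the formula from the paper or from this codebase: the `LevelStats` fields, both function names, and the additive form are assumptions chosen for illustration. Only the two weight names mirror the `--transition_prob_weight` and `--colearnability_weight` flags used in the example commands.

```python
from dataclasses import dataclass


@dataclass
class LevelStats:
    """Per-level statistics gathered from rollouts (all fields are hypothetical)."""
    value_error: float       # e.g. a value-prediction error measured on the level
    transition_error: float  # e.g. a transition-prediction error measured on the level
    colearnability: float    # e.g. an estimate of how much the level helps other levels


def approx_regret(stats: LevelStats, transition_prob_weight: float) -> float:
    # Assumed form: value error plus a weighted transition-prediction error.
    return stats.value_error + transition_prob_weight * stats.transition_error


def level_priority(stats: LevelStats,
                   transition_prob_weight: float,
                   colearnability_weight: float) -> float:
    # Assumed form: regret proxy plus a weighted co-learnability bonus.
    regret = approx_regret(stats, transition_prob_weight)
    return regret + colearnability_weight * stats.colearnability


if __name__ == "__main__":
    stats = LevelStats(value_error=0.2, transition_error=0.1, colearnability=0.5)
    print(level_priority(stats, transition_prob_weight=1.5, colearnability_weight=0.6))
```

Refer to the paper and `train.py` for the actual definitions; the sketch only shows how two weights of this kind could trade off the three signals.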

TRACED Performance

For more information, please see our project page.

🎥 Demonstrations

BipedalWalker

MiniGrid

📖 Installation

Ensure you have Conda installed. Optionally, a GPU with CUDA support speeds up experiments.

conda create --name traced python=3.8
conda activate traced
pip install -r requirements.txt
git clone https://github.com/openai/baselines.git
cd baselines && pip install -e . && cd ..
pip install pyglet==1.5.11

Note: If you see AttributeError: module 'numpy' has no attribute 'bool', replace instances of np.bool with np.bool_. See NumPy 1.20 deprecations.
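If many files are affected, the replacement can be scripted. The following is a minimal sketch, assuming NumPy is imported as `np` in the affected sources; the helper name `replace_np_bool` is hypothetical and not part of this repository. It rewrites bare `np.bool` while leaving `np.bool_` and longer names untouched.

```python
import re

# Match "np.bool" only when it is a complete name: the trailing \b prevents a
# match inside "np.bool_" or "np.boolean", and the leading \b skips "mynp.bool".
_NP_BOOL = re.compile(r"\bnp\.bool\b")


def replace_np_bool(source: str) -> str:
    """Return `source` with every bare `np.bool` replaced by `np.bool_`."""
    return _NP_BOOL.sub("np.bool_", source)


if __name__ == "__main__":
    before = "mask = np.zeros(4, dtype=np.bool)\nflag = np.bool_(True)\n"
    print(replace_np_bool(before))
```

Review the resulting diff before committing, since the pattern also matches occurrences inside comments and strings.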

🚀 Training

All of TRACED's training logic is contained in train.py, which you can invoke directly:

python -m train {options}

Rather than manually specifying every flag, you can generate a full command from a JSON configuration:

python train_scripts/make_cmd.py --json {config} --num_trials 1

This prints the {options} string that encodes all the settings in train_scripts/grid_configs/<config_name>.json.
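To make the idea concrete, here is a minimal sketch of turning a settings dictionary into a `--key=value` option string of the kind shown in the example commands. It is not the logic of `make_cmd.py`: the function name `config_to_flags` and the flat JSON layout are assumptions for illustration, and the real configuration files may be structured differently.

```python
import json


def config_to_flags(config: dict) -> str:
    """Render each key/value pair as a --key=value flag, in insertion order."""
    return " ".join(f"--{key}={value}" for key, value in config.items())


if __name__ == "__main__":
    # Hypothetical flat config; the keys mirror flags from the example commands.
    raw = '{"env_name": "BipedalWalker-Adversarial-Easy-v0", "lr": 0.0003, "use_plr": true}'
    options = config_to_flags(json.loads(raw))
    print(f"python -m train {options}")
```

JSON booleans are rendered through Python's `str`, so `true` becomes `True`, matching the `--use_plr=True` style used in the example commands.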

Example commands
# TRACED on BipedalWalker

python -m train --xpid=ued-BipedalWalker-Adversarial-Easy-v0--noexpgrad-lr0.0003-epoch5-mb32-v0.5-gc0.5-henv0.01-ha0.001-plr0.9-rho0.5-n1000-st0.5-positive_value_loss-rank-t0.1-editor1.0-random-n3-baseeasy-tl_0 --env_name=BipedalWalker-Adversarial-Easy-v0 --use_gae=True --gamma=0.99 --gae_lambda=0.9 --seed=4361 --recurrent_arch=lstm --recurrent_agent=False --recurrent_adversary_env=False --recurrent_hidden_size=1 --use_global_critic=False --lr=0.0003 --num_steps=2048 --num_processes=16 --num_env_steps=2000000000 --ppo_epoch=5 --num_mini_batch=32 --entropy_coef=0.001 --value_loss_coef=0.5 --clip_param=0.2 --clip_value_loss=False --adv_entropy_coef=0.01 --max_grad_norm=0.5 --algo=ppo --use_plr=True --level_replay_prob=0.9 --level_replay_rho=0.5 --level_replay_seed_buffer_size=1000 --level_replay_score_transform=rank --level_replay_temperature=0.1 --staleness_coef=0.5 --no_exploratory_grad_updates=True --use_editor=True --level_editor_prob=1.0 --level_editor_method=random --num_edits=3 --base_levels=easy --use_lstm=False --use_behavioural_cloning=False --kl_loss_coef=0.0 --kl_update_step=1 --use_kl_only_agent=False --log_interval=10 --screenshot_interval=200 --log_grad_norm=True --normalize_returns=True --checkpoint_basis=student_grad_updates --archive_interval=5000 --use_traced=True --colearnability_weight=0.6 --transition_prob_weight=1.5 --handle_timelimits=True --level_replay_strategy=positive_value_loss --test_env_names=BipedalWalker-v3,BipedalWalkerHardcore-v3,BipedalWalker-Med-Stairs-v0,BipedalWalker-Med-PitGap-v0,BipedalWalker-Med-StumpHeight-v0,BipedalWalker-Med-Roughness-v0 --log_dir=/data/DCD/TRACED --test_interval=100 --test_num_episodes=10 --test_num_processes=2 --log_plr_buffer_stats=True --log_replay_complexity=True --checkpoint=True --log_action_complexity=False
# TRACED on MiniGrid mazes

python -m train --xpid=ued-MultiGrid-GoalLastEmptyAdversarialEnv-Edit-v0--noexpgrad-lstm256a-lr0.0001-epoch5-mb1-v0.5-gc0.5-henv0.0-ha0.0-plr0.8-rho0.5-n4000-st0.3-positive_value_loss-rank-t0.3-editor1.0-random-n5-baseeasy-tl_0 --env_name=MultiGrid-GoalLastEmptyAdversarialEnv-Edit-v0 --use_gae=True --gamma=0.995 --gae_lambda=0.95 --seed=7664 --recurrent_arch=lstm --recurrent_agent=True --recurrent_adversary_env=False --recurrent_hidden_size=256 --use_global_critic=False --lr=0.0001 --num_steps=256 --num_processes=32 --num_env_steps=250000000 --ppo_epoch=5 --num_mini_batch=1 --entropy_coef=0.0 --value_loss_coef=0.5 --clip_param=0.2 --clip_value_loss=True --adv_entropy_coef=0.0 --max_grad_norm=0.5 --algo=ppo --use_plr=True --level_replay_prob=0.8 --level_replay_rho=0.5 --level_replay_seed_buffer_size=4000 --level_replay_score_transform=rank --level_replay_temperature=0.3 --staleness_coef=0.3 --no_exploratory_grad_updates=True --use_editor=True --level_editor_prob=1.0 --level_editor_method=random --num_edits=5 --base_levels=easy --use_lstm=False --use_behavioural_cloning=False --kl_loss_coef=0.0 --kl_update_step=1 --use_kl_only_agent=False --log_interval=25 --screenshot_interval=1000 --log_grad_norm=False --handle_timelimits=True --checkpoint_basis=student_grad_updates --archive_interval=5000 --use_traced=True --colearnability_weight=1.0 --transition_prob_weight=1.0 --level_replay_strategy=positive_value_loss --test_env_names=MultiGrid-SixteenRooms-v0,MiniGrid-SimpleCrossingS9N1-v0,MiniGrid-FourRooms-v0,MultiGrid-SmallCorridor-v0,MultiGrid-Labyrinth-v0,MultiGrid-Maze-v0,MultiGrid-PerfectMazeMedium-v0 --log_dir=./logs/traced --log_action_complexity=True --log_plr_buffer_stats=True --log_replay_complexity=True --reject_unsolvable_seeds=False --checkpoint=True

πŸ” Evaluation

Evaluating a single model

The following command evaluates a <model>.tar in an experiment results directory, <xpid>, in a base log output directory <log_dir> for <num_episodes> episodes in each of the environments named <env_name1>, <env_name2>, and <env_name3>, and outputs the results as a .csv in <result_dir>.

python -m eval \
--base_path <log_dir> \
--xpid <xpid> \
--model_tar <model> \
--env_names <env_name1>,<env_name2>,<env_name3> \
--num_episodes <num_episodes> \
--result_path <result_dir>
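When launching evaluations from a script, it can help to build the argument list programmatically. The sketch below is one way to do that using only the flags shown in the command above; the helper name `build_eval_argv` is hypothetical and not part of this repository.

```python
import sys
from typing import List, Sequence


def build_eval_argv(base_path: str, xpid: str, model_tar: str,
                    env_names: Sequence[str], num_episodes: int,
                    result_path: str) -> List[str]:
    """Assemble the argv list for `python -m eval` with the flags shown above."""
    return [
        sys.executable, "-m", "eval",
        "--base_path", base_path,
        "--xpid", xpid,
        "--model_tar", model_tar,
        # eval expects a single comma-separated string of environment names.
        "--env_names", ",".join(env_names),
        "--num_episodes", str(num_episodes),
        "--result_path", result_path,
    ]


if __name__ == "__main__":
    argv = build_eval_argv("<log_dir>", "<xpid>", "<model>",
                           ["<env_name1>", "<env_name2>"], 10, "<result_dir>")
    print(" ".join(argv))
```

The returned list can be passed to `subprocess.run(argv, check=True)` from the repository root, which avoids shell quoting issues with long experiment ids.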

Evaluating multiple models

Similarly, the following command evaluates all models named <model>.tar in experiment results directories matching the prefix <xpid_prefix>. This prefix argument is useful for evaluating models from a set of training runs with the same hyperparameter settings. The resulting .csv will contain a column for each model matched and evaluated this way.

python -m eval \
--base_path <log_dir> \
--prefix <xpid_prefix> \
--model_tar <model> \
--env_names <env_name1>,<env_name2>,<env_name3> \
--num_episodes <num_episodes> \
--accumulator mean \
--result_path <result_dir>

Evaluating on zero-shot benchmarks

Replacing the --env_names=... argument with the --benchmark=<benchmark> argument will perform evaluation over a set of benchmark test environments for the domain specified by <benchmark>. The various zero-shot benchmarks are described below:

| benchmark | Description |
| --- | --- |
| maze | Human-designed mazes, including singleton and procedurally-generated designs. |
| bipedal | BipedalWalker-v3, BipedalWalkerHardcore-v3, and isolated challenges for stairs, stumps, pit gaps, and ground roughness. |

🌐 Environments & Baselines

🧭 MiniGrid Mazes

The MiniGrid-based mazes from Dennis et al, 2020 and Jiang et al, 2021 require agents to perform partially-observable navigation. Various human-designed singleton and procedurally-generated mazes allow testing of zero-shot transfer performance to out-of-distribution configurations.

Experiments from Jiang et al, 2021

| Method | json config |
| --- | --- |
| PLR⊥ | minigrid/25_blocks/mg_25b_robust_plr.json |
| PLR | minigrid/25_blocks/mg_25b_plr.json |
| DR | minigrid/25_blocks/mg_25b_dr.json |

Experiments from Parker-Holder et al, 2022

| Method | json config |
| --- | --- |
| TRACED | minigrid/60_blocks_uniform/mg_60b_uni_traced.json |
| ACCEL (from empty) | minigrid/60_blocks_uniform/mg_60b_uni_accel_empty.json |
| PLR⊥ (Uniform(0-60) blocks) | minigrid/mg_60b_uni_robust_plr.json |
| DR (Uniform(0-60) blocks) | minigrid/mg_60b_uni_dr.json |

🦿🦿 BipedalWalker

The BipedalWalker environment requires continuous control of a 2D bipedal robot over challenging terrain with various obstacles, using a proprioceptive observation. The zero-shot transfer configurations, used in Parker-Holder et al, 2022, include BipedalWalkerHardcore, environments featuring each challenge (i.e. ground roughness, stump, pit gap, and stairs) in isolation, as well as extremely challenging configurations discovered by POET in Wang et al, 2019.

| Method | json config |
| --- | --- |
| TRACED | bipedal/bipedal_traced.json |
| ACCEL | bipedal/bipedal_accel.json |
| PLR⊥ | bipedal/bipedal_robust_plr.json |
| DR | bipedal/bipedal_dr.json |

Current environment support

| Method | 🧭 MiniGrid mazes | 🦿🦿 BipedalWalker |
| --- | --- | --- |
| TRACED | ✅ | ✅ |
| ACCEL | ✅ | ✅ |
| PLR⊥ | ✅ | ✅ |
| PLR | ✅ | ✅ |
| DR | ✅ | ✅ |

📊 Monitoring

By default, train.py generates a folder in the directory specified by the --log_dir argument, named according to --xpid. This folder contains the main training logs, logs.csv, and periodic screenshots of generated levels in the directory screenshots. Each screenshot uses the naming convention update_<number of PPO updates>.png. When ACCEL is turned on, the screenshot naming convention also includes information about whether the level was replayed via PLR and the mutation generation number for the level, i.e. how many mutation cycles led to this level.
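For browsing screenshots in training order, the update count can be parsed out of the file name. This is a minimal sketch that handles only the plain `update_<number of PPO updates>.png` convention described above; the helper name `sort_screenshots` is hypothetical, and the extended ACCEL naming is deliberately not parsed because its exact format is not specified here.

```python
import re
from typing import Iterable, List, Tuple

# Matches only the plain convention, e.g. "update_200.png".
_SCREENSHOT = re.compile(r"^update_(\d+)\.png$")


def sort_screenshots(filenames: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (update_number, filename) pairs sorted by update number."""
    matched = []
    for name in filenames:
        m = _SCREENSHOT.match(name)
        if m:  # silently skip files that do not follow the plain convention
            matched.append((int(m.group(1)), name))
    return sorted(matched)


if __name__ == "__main__":
    names = ["update_1000.png", "update_200.png", "notes.txt", "update_400.png"]
    for update, name in sort_screenshots(names):
        print(update, name)
```

Sorting on the parsed integer avoids the lexicographic ordering problem where `update_1000.png` would otherwise sort before `update_200.png`.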

For logging with Weights & Biases (wandb), you can enable logging by passing --use_wandb true and --wandb_key <your_wandb_key> as command-line arguments.

💾 Checkpoints

Latest checkpoint: The latest model checkpoint is saved as model.tar. The model is checkpointed every --checkpoint_interval number of updates. When setting --checkpoint_basis=num_updates (default), the checkpoint interval corresponds to the number of rollout cycles (which includes one rollout for each student and teacher). Otherwise, when --checkpoint_basis=student_grad_updates, the checkpoint interval corresponds to the number of PPO updates performed by the student agent only. This latter checkpoint basis allows comparing methods based on the number of gradient updates actually performed by the student agent, which can differ from the number of rollout cycles, as methods based on Robust PLR, like ACCEL, do not perform student gradient updates every rollout cycle.
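The difference between the two bases is easiest to see on a toy run. The sketch below is an illustration only, not the logic in `train.py`: the function name `checkpoint_cycles` and the boolean list marking which rollout cycles include a student gradient update are assumptions.

```python
from typing import List, Sequence


def checkpoint_cycles(cycle_has_student_update: Sequence[bool],
                      checkpoint_interval: int,
                      basis: str) -> List[int]:
    """Return the 1-based rollout cycles at which a checkpoint would be saved."""
    saved = []
    student_updates = 0
    for cycle, has_update in enumerate(cycle_has_student_update, start=1):
        if has_update:
            student_updates += 1
        if basis == "num_updates":
            # Count every rollout cycle, whether or not the student was updated.
            due = cycle % checkpoint_interval == 0
        elif basis == "student_grad_updates":
            # Count only the cycles in which the student took a gradient step.
            due = has_update and student_updates % checkpoint_interval == 0
        else:
            raise ValueError(f"unknown checkpoint basis: {basis}")
        if due:
            saved.append(cycle)
    return saved


if __name__ == "__main__":
    # Toy schedule: the student is updated on only some rollout cycles.
    schedule = [False, True, False, True, True, False, True, False]
    print(checkpoint_cycles(schedule, 2, "num_updates"))
    print(checkpoint_cycles(schedule, 2, "student_grad_updates"))
```

On this schedule the first basis saves at cycles 2, 4, 6, and 8, while the second saves at cycles 4 and 7, when the student has completed its second and fourth gradient updates.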

Archived checkpoints: Separate archived model checkpoints can be saved at specific intervals by specifying a positive value for the argument --archive_interval. For example, setting --archive_interval=1250 and --checkpoint_basis=student_grad_updates will result in saving model checkpoints named model_1250.tar, model_2500.tar, and so on. These archived models are saved in addition to model.tar, which always stores the latest checkpoint, based on --checkpoint_interval.
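The archive naming scheme can be reproduced with a few lines, which is handy when scripting evaluation over every archived model. This is a minimal sketch; the helper name `archived_checkpoint_names` and the `total_updates` parameter are hypothetical.

```python
from typing import List


def archived_checkpoint_names(archive_interval: int, total_updates: int) -> List[str]:
    """List the archive file names expected after `total_updates` updates."""
    if archive_interval <= 0:
        # Archiving is only enabled for a positive interval.
        return []
    return [f"model_{n}.tar"
            for n in range(archive_interval, total_updates + 1, archive_interval)]


if __name__ == "__main__":
    print(archived_checkpoint_names(1250, 4000))
```

With an interval of 1250 and 4000 updates performed so far, this yields model_1250.tar, model_2500.tar, and model_3750.tar.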

🤝 Acknowledgements

Built on top of the Dual Curriculum Design (DCD) codebase. We thank the authors of:

Misc

  • Please note that this codebase may not exactly reproduce the results reported in the paper due to potential human errors during code migration. If you observe any discrepancies in performance, feel free to reach out; we'd appreciate your feedback.

📄 License

This project is licensed under CC BY-NC 4.0. Note that the repository relies on third-party libraries subject to their respective licenses.

Citation

@article{TRACED,
  title={TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design},
  author={Cho, Geonwoo and Im, Jaegyun and Lee, Jihwan and Yi, Hojun and Kim, Sejin and Kim, Sundong},
  journal={arXiv preprint arXiv:2506.19997},
  year={2025}
}

About

Fork of the official implementation of TRACED (Transition-aware Regret Approximation with Co-learnability for Environment Design).
