RELAAX is a framework designed to:
- Simplify research and development of Reinforcement Learning applications and algorithms by taking care of the underlying infrastructure
- Provide a usable and scalable implementation of state-of-the-art Reinforcement Learning algorithms
- Simplify deployment of Agents and Environments for training, and exploitation of the trained Agents at scale, on popular cloud platforms
RELAAX components:
- The Reinforcement Learning eXchange (RLX) protocol connects RL Agents with the RL Environment
- The RELAAX Client wraps the details of the RLX protocol implementation and exposes a simple API for exchanging States, Rewards, and Actions between the scalable RL Server and the Environment
- The RELAAX Server allows developers to run RL Agents locally or at scale on popular cloud platforms. See more details below.
- RELAAX provides implementations of popular RL algorithms to simplify RL application development and research
- RELAAX is integrated into a hosted service where you can deploy your RL cluster on AWS, GCP, and Azure in just a few steps
- Quick start
- System Architecture
- RELAAX Clients
- RELAAX Server
- Algorithms
- Deployment in Cloud
We recommend you use an isolated Python environment to run RELAAX; Virtualenv or Anaconda are examples. If you're using the system's Python environment, you may need to run pip install commands with sudo. On OSX / macOS, we recommend using Homebrew to install a current Python version.
- Install Docker

- Clone the RELAAX repo:

  ```bash
  git clone git@github.com:deeplearninc/relaax.git
  ```

- Install RELAAX:

  ```bash
  cd relaax
  pip install -e .
  ```

- Build the DA3C bridge:

  ```bash
  algorithms/da3c/bridge/bridge.sh
  ```

- Install TensorFlow

- Create a training directory:

  ```bash
  cd ..
  mkdir training
  cd training
  ```

- Build a Docker image named gym (use sudo if needed):

  ```bash
  docker build -f ../relaax/environments/OpenAI_Gym/Dockerfile -t gym ../relaax
  ```

- Open a new terminal window, navigate to the training directory, and run the parameter server:

  ```bash
  relaax-parameter-server --config ../relaax/config/da3c_gym_boxing.yaml
  ```

- Open a new terminal window, navigate to the training directory, and run the RLX server:

  ```bash
  relaax-rlx-server --config ../relaax/config/da3c_gym_boxing.yaml --bind 0.0.0.0:7001
  ```

- Use the ifconfig command to find the IP of your localhost. Remember it.

- Open a new terminal window, navigate to the training directory, and run the environment inside the gym Docker image (use sudo if needed):

  ```bash
  docker run -ti gym <LOCALHOST_IP>:7001 Boxing-v0
  ```

- Open a new terminal window, navigate to the training directory, and run TensorBoard:

  ```bash
  tensorboard --logdir metrics_gym_boxing
  ```

- TensorBoard prints a URL to use. Open it in a browser to examine training progress.
- Environment - a computer simulation, game, or "hardware" in the real world (say, an industrial manipulator, robot, or car). To accelerate learning, a number of Environments can be run in parallel.
- RELAAX Client - a simple library embedded into the Environment. It collects the State and Reward from the Environment, sends them to the RELAAX Server, receives Action(s) back, and communicates them to the Environment.
- RLX Server - listens on a port for connections from RELAAX Clients. After a connection is accepted, it starts a Worker and passes control over communication with the client to that Worker.
- Worker - communicates with the client and runs the Agent's NN. Each parallel replica of the Environment/Client has a corresponding replica of the Agent.
- Parameter Server - one or several nodes which run the Global Function NN (Q, value, or policy function). Parameter Server node(s) communicate with Workers over a GRPC bridge to synchronize the state of the Global Function NN with the Agents.
- Checkpoints - storage where the Parameter Server saves the state of the Global Function NN; when the system is restarted, it may restore the Global Function NN state from a previously stored checkpoint and continue learning.
- Metrics - Workers and the Parameter Server send various metrics to the Metrics node; developers may view these metrics in a web browser by connecting to the Metrics node.
The Client is a small library used to communicate with RL Agents. It can be used with Environments implemented in many popular programming languages or embedded into specialized hardware systems. Currently the client supports Arcade Learning Environment (ALE), OpenAI Gym, and OpenAI Universe environments. At the moment the client is implemented in Python; later on we are planning to implement client code in C/C++, Ruby, GO, etc. to simplify integration of other environments.
The Reinforcement Learning eXchange protocol is a simple protocol implemented over TCP using JSON (it will later be moved to Protobuf). It allows sending the State of the Environment and the Reward to the Server and delivering the Action from the Agent back to the Environment.
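As an illustration only, an exchange over this protocol might look like the following sketch; the actual message schema and framing are defined by the RLX implementation, and the field names here are assumptions:

```python
import json
import socket

# Connect to the RLX Server (environment side).
sock = socket.create_connection(('localhost', 7001))
stream = sock.makefile('rw')

# Environment -> Agent: reward for the previous action plus the current state.
stream.write(json.dumps({'command': 'send', 'reward': 1.0,
                         'state': [0.1, 0.2, 0.3]}) + '\n')
stream.flush()

# Agent -> Environment: the next action, as a JSON reply.
reply = json.loads(stream.readline())
```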
TODO: links to actual files
```
relaax
  client
    rlx_client.py
      class Client - agent interface for the environment
        def init(state) - start training by sending the initial state to the agent;
          returns the first action from the agent
        def send(reward, state) - send the reward for the previous action and the current environment state;
          returns the next action from the agent
        def reset(reward) - send the reward for the previous action and reset the agent;
          returns the cumulative reward for the last episode
        def metrics() - get metrics object
        def disconnect() - disconnect the environment from the agent
      class Failure - raised in case of a failure on the agent's side
```
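A minimal sketch of an environment loop driving this API; the constructor signature and the dummy environment below are assumptions for illustration, not part of the documented interface:

```python
from relaax.client.rlx_client import Client, Failure

class DummyEnv(object):
    """Stand-in environment: trivial state, fixed reward, 10-step episodes."""
    def reset(self):
        self.steps = 0
        return [0.0]
    def step(self, action):
        self.steps += 1
        return [float(self.steps)], 1.0, self.steps >= 10  # state, reward, terminal

env = DummyEnv()
client = Client('localhost:7001')  # assumed constructor: RLX Server address
try:
    action = client.init(env.reset())              # initial state -> first action
    for _ in range(100):
        state, reward, terminal = env.step(action)
        if terminal:
            episode_reward = client.reset(reward)  # close episode, get cumulative reward
            action = client.init(env.reset())
        else:
            action = client.send(reward, state)    # reward + state -> next action
except Failure:
    pass  # failure on the agent's side; a real client would reconnect
finally:
    client.disconnect()
```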
The Arcade Learning Environment (ALE) is a framework that allows developing AI agents for Atari 2600 games. It is built on top of the Atari 2600 emulator Stella and separates the details of emulation from agent design. You can find additional information about ALE and Atari games in the official Google group.
- Pull the Docker image:

  ```bash
  $ docker pull deeplearninc/relaax-ale
  ```

- Run the server:

  Open a new terminal window, navigate to the training directory, and run honcho:

  ```bash
  $ honcho -f ../relaax/config/da3c_ale_boxing.Procfile start
  ```

  It is assumed that the training directory is located next to the relaax repository at the same level. You may also create it anywhere else, as long as you write the right paths into the appropriate *.Procfile within the relaax repo.

- Run a client:

  There are 3 predefined run cases for the pulled Docker image.

  The first case:

  ```bash
  $ docker run --rm -ti \
      -v /path_to_atari_roms_folder:/roms \
      --name ale deeplearninc/relaax-ale \
      SERVER_IP:7001 boxing
  ```

  This runs Docker in interactive mode (-ti) and automatically removes the container when it stops (--rm); the container is named ale for convenience. You have to provide a shared folder on your computer where the Atari game ROMs are stored via the -v parameter. Use the ifconfig command to find the IP of your relaax server, which is run by honcho. The command launches one instance of the game environment within the container, defined by the last parameter, boxing (it launches the Atari Boxing game).

  The second case:

  ```bash
  $ docker run --rm -ti \
      -v /path_to_atari_roms_folder:/roms \
      --name ale deeplearninc/relaax-ale \
      SERVER_IP:7001 boxing 4
  ```

  This adds a third parameter, equal to 4 here, which defines the number of games to launch within the container for parallel training.

  The third case:

  ```bash
  $ docker run --rm -ti \
      -p IP:PORT:5900 \
      -v /path_to_atari_roms_folder:/roms \
      --name ale deeplearninc/relaax-ale \
      SERVER_IP:7001 boxing display
  ```

  This passes display as the last argument to run the game in display mode; it therefore maps a port on your computer for a VNC connection to the visual session.

  For example, the full command to run the clients and a server on a single machine (behind NAT) looks as follows:

  ```bash
  $ docker run --rm -ti \
      -p 192.168.2.103:15900:5900 \
      -v /opt/atari-game-roms:/roms \
      --name ale deeplearninc/relaax-ale \
      192.168.2.103:7001 boxing display
  ```

  You can then connect to the client's visual output via your VNC client, for example:

  ```
  Server: 192.168.2.103:15900
  Passwd: relaax
  Color depth: True color (24 bit)
  ```
A sample configuration for experiments with ALE can be found here:

relaax/config/da3c_ale_boxing.yaml

This sample is set up for the Atari Boxing game, which has a discrete set of actions. You may therefore use the discrete version of our Distributed A3C, or set another algorithm there:

```yaml
algorithm:
  path: ../relaax/algorithms/da3c
```

The action_size and state_size parameters for Atari Boxing are:

```yaml
action_size: 18                 # action size for the given game ROM (18 fits ALE Boxing)
state_size: [84, 84]            # dimensions of the input screen frame of an Atari game
```

You should check / change these parameters if you want to use another environment.
How to build your own Docker Image:

- Navigate to the ALE folder within the relaax repo:

  ```bash
  $ cd path_to_relaax_repo/environments/ALE
  ```

- Build the Docker image with the following commands:

  ```bash
  $ docker build -f Dockerfile -t your_docker_hub_name/image_name ../..
  # or you can build without your Docker Hub username, for example:
  $ docker build -f Dockerfile -t relaax-ale-vnc ../..
  ```

OpenAI Gym is an open-source library: a collection of test problem environments that you can use to work out your reinforcement learning algorithms.
- Pull the Docker image:

  ```bash
  $ docker pull deeplearninc/relaax-gym
  ```

- Run the server:

  Open a new terminal window, navigate to the training directory, and run honcho:

  ```bash
  $ honcho -f ../relaax/config/da3cc_gym_walker.Procfile start
  ```

  It is assumed that the training directory is located next to the relaax repository at the same level. You may also create it anywhere else, as long as you write the right paths into the appropriate *.Procfile within the relaax repo.

- Run a client:

  There are 3 predefined run cases for the pulled Docker image.

  The first case:

  ```bash
  $ docker run --rm -ti \
      --name gym deeplearninc/relaax-gym \
      SERVER_IP:7001 BipedalWalker-v2
  ```

  This runs Docker in interactive mode (-ti) and automatically removes the container when it stops (--rm); the container is named gym for convenience. Use the ifconfig command to find the IP of your relaax server, which is run by honcho. The command launches one instance of the environment within the container, defined by the last parameter, BipedalWalker-v2 (the name of the Gym environment).

  The second case:

  ```bash
  $ docker run --rm -ti \
      --name gym deeplearninc/relaax-gym \
      SERVER_IP:7001 BipedalWalker-v2 4
  ```

  This adds a third parameter, equal to 4 here, which defines the number of environments to launch within the container for parallel training.

  The third case:

  ```bash
  $ docker run --rm -ti \
      -p IP:PORT:5900 \
      --name gym deeplearninc/relaax-gym \
      SERVER_IP:7001 BipedalWalker-v2 display
  ```

  This passes display as the last argument to run the environment in display mode; it therefore maps a port on your computer for a VNC connection to the visual session.

  For example, the full command to run the clients and a server on a single machine (behind NAT) looks as follows:

  ```bash
  $ docker run --rm -ti \
      -p 192.168.2.103:15900:5900 \
      --name gym deeplearninc/relaax-gym \
      192.168.2.103:7001 BipedalWalker-v2 display
  ```

  You can then connect to the client's visual output via your VNC client, for example:

  ```
  Server: 192.168.2.103:15900
  Passwd: relaax
  Color depth: True color (24 bit)
  ```
A sample configuration for experiments with OpenAI Gym can be found here:

relaax/config/da3cc_gym_walker.yaml

This sample is set up for the BipedalWalker-v2 environment, which operates with a continuous action space. You may therefore use the continuous version of our Distributed A3C, or set another algorithm there:

```yaml
algorithm:
  path: ../relaax/algorithms/da3c_cont
```

The action_size and state_size parameters for BipedalWalker-v2 are:

```yaml
action_size: 4                  # action size for the given environment
state_size: [24]                # array of dimensions for the input observation
```

You should check / change these parameters if you want to use another environment.
How to build your own Docker Image:

- Navigate to the OpenAI Gym folder within the relaax repo:

  ```bash
  $ cd path_to_relaax_repo/environments/OpenAI_Gym
  ```

- Build the Docker image with the following commands:

  ```bash
  $ docker build -f Dockerfile -t your_docker_hub_name/image_name ../..
  # or you can build without your Docker Hub username, for example:
  $ docker build -f Dockerfile -t relaax-gym-vnc ../..
  ```

DeepMind Lab is a 3D learning environment based on id Software's Quake III Arena. It provides a suite of challenging 3D navigation and puzzle-solving tasks for learning agents, especially with deep reinforcement learning.
- Pull the Docker image:

  ```bash
  $ docker pull deeplearninc/relaax-lab
  ```

- Run the server:

  Open a new terminal window, navigate to the training directory, and run honcho:

  ```bash
  $ honcho -f ../relaax/config/da3c_lab_demo.Procfile start
  ```

  It is assumed that the training directory is located next to the relaax repository at the same level. You may also create it anywhere else, as long as you write the right paths into the appropriate *.Procfile within the relaax repo.

- Run a client:

  There are 3 predefined run cases for the pulled Docker image.

  The first case:

  ```bash
  $ docker run --rm -ti \
      --name lab deeplearninc/relaax-lab \
      SERVER_IP
  ```

  This runs Docker in interactive mode (-ti) and automatically removes the container when it stops (--rm); the container is named lab for convenience. Use the ifconfig command to find the IP of your relaax server, which is run by honcho. The command launches one instance of the Lab environment within the container with the nav_maze_static_01 map, which is predefined by default (see the list of the default Lab maps).

  The second case:

  ```bash
  $ docker run --rm -ti \
      --name lab deeplearninc/relaax-lab \
      SERVER_IP 4 nav_maze_static_02
  ```

  This adds a second parameter, equal to 4 here, which defines the number of environments to launch within the container for parallel training. A map can be set by the third parameter; otherwise nav_maze_static_01 is used by default.

  The third case:

  ```bash
  $ docker run --rm -ti \
      -p IP:PORT:6080 \
      --name lab deeplearninc/relaax-lab \
      SERVER_IP display
  ```

  This passes display as the last argument to run the environment in display mode; it therefore maps a port on your computer for a VNC connection to the visual session. A map can also be set by the third parameter.

  For example, the full command to run the clients and a server on a single machine (behind NAT) looks as follows:

  ```bash
  $ docker run --rm -ti \
      -p 6080:6080 \
      --name lab deeplearninc/relaax-lab \
      192.168.2.103 display nav_maze_static_03
  ```

  You can connect to the client's visual output via your browser by opening the http://127.0.0.1:6080/vnc.html URL. You will see a web form for credentials; leave all fields intact and press 'Connect'. You will see a running game.
A sample configuration for experiments with DeepMind Lab can be found here:

relaax/config/da3c_lab_demo.yaml

The action_size and state_size parameters for this configuration are:

```yaml
action_size: 11                 # the full action size for the Lab environment
state_size: [84, 84]            # dimensions of the environment's input screen
```

The full action_size set consists of 11 types of interaction:
- look_left
- look_right
- look_up
- look_down
- strafe_left
- strafe_right
- forward
- backward
- fire
- jump
- crouch
While training, this action set is shrunk to 6 actions by the --shrink parameter, which is set to true by default.
How to build your own Docker Image:

- Navigate to the DeepMind Lab folder within the relaax repo:

  ```bash
  $ cd path_to_relaax_repo/environments/DeepMind_Lab
  ```

- Build the Docker image with the following commands:

  ```bash
  $ docker build -f Dockerfile -t your_docker_hub_name/image_name ../..
  # or you can build without your Docker Hub username, for example:
  $ docker build -f Dockerfile -t relaax-lab-vnc ../..
  ```

The main purpose of the RLX Server is to run agents exploring and exploiting environments. You can run several RLX Servers on several computers; run one RLX Server per computer. The RLX Server starts, opens the specified port, and starts listening on it. When a client connects to the port, the RLX Server accepts the connection, forks itself as a new process, and starts a new worker to process the connection from the client. Accepting a connection means opening a new connection on another port, so relax firewall rules on the RLX Server node to allow connections on arbitrary ports.
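The accept-and-fork flow can be sketched as follows. This is an illustration of the behavior described above, not the actual server code; run_worker stands in for the worker's message loop:

```python
import os
import socket

def serve(bind_host='0.0.0.0', bind_port=7001):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((bind_host, bind_port))
    listener.listen(64)
    while True:
        conn, addr = listener.accept()   # new client (environment) connects
        if os.fork() == 0:               # child process becomes the worker
            listener.close()
            run_worker(conn)             # message loop between agent and client
            os._exit(0)
        conn.close()                     # parent just keeps listening
```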
The RLX Server implements dynamic loading of algorithm code. Several example algorithms are in <relaax_repo>/algorithms. Feel free to copy and modify them according to your needs.
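Dynamic loading can be pictured with importlib, as in this sketch; the real loader may differ in details:

```python
import importlib.util
import os

def load_algorithm(path):
    """Load an algorithm package from an arbitrary directory given in config.yaml."""
    spec = importlib.util.spec_from_file_location(
        os.path.basename(path), os.path.join(path, '__init__.py'))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module  # expected to export Config, Agent, ParameterServer, etc.

algorithm = load_algorithm('../relaax/algorithms/da3c')
```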
The RLX Server denies starting a new worker in case of insufficient memory. To implement this feature, on each new connection the RLX Server calculates the mean memory consumption per child (worker) process and compares it with the amount of available memory. Swap memory is not taken into account during the comparison. If available memory is not enough, the RLX Server immediately closes the new connection. Please note that a typical client tries to reconnect in case of any network issue. This is how load balancing and autoscaling are implemented: when the load balancer routes a new connection to an overloaded RLX Server node, the RLX Server closes the connection and the client repeats the connection attempt. Eventually, the connection is routed to a node with enough memory and training starts. Appropriate configuration of the cluster autoscaler (based on a low-memory threshold) is required to utilize this feature.
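A minimal sketch of this admission check, assuming psutil for memory statistics (the actual implementation may differ):

```python
import psutil

def can_accept_new_worker(worker_pids):
    """Deny a new worker when the mean RSS of existing workers exceeds
    available physical memory; swap is deliberately ignored."""
    available = psutil.virtual_memory().available  # physical memory only
    if not worker_pids:
        return True  # no workers yet, nothing to average; accept
    rss = [psutil.Process(pid).memory_info().rss for pid in worker_pids]
    return sum(rss) / len(rss) <= available
```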
Another balancing feature is a regular connection drop on the worker side. After a specified timeout, the worker drops the connection with the client on the next learning episode reset. The client automatically reconnects through the load balancer, evening the load between working RLX Server nodes.
TODO: links to actual files
```
relaax
  server
    rlx_server
      main.py
        def main(): - parse the command line,
          read the configuration YAML file, and
          run the server
      server.py  TODO: introduce OO structure to server.py
        def run(...): - load the algorithm definition,
          start listening for incoming connections;
          - on each incoming connection:
            check for available memory,
            start a separate process;
          - in the separate process:
            create a new Agent,
            create a new Worker(Agent),
            run the new Worker
```
When you install RELAAX on your node, you get the relaax-rlx-server command.

If you're going to run training locally, use the following command line:

```bash
relaax-rlx-server --config config.yaml --bind localhost:7001 --parameter-server localhost:7000 --log-level WARNING
```

If you're going to run training on a cluster, use the following command line. The differences are the parameter server IP and the timeout that enables the load balancer:

```bash
relaax-rlx-server --config config.yaml --bind 0.0.0.0:7001 --parameter-server parameter-server:7000 --log-level WARNING --timeout 120
```

Available options are:

```
  -h, --help                    show help message and exit
  --config FILE                 configuration YAML file, see below
  --log-level LEVEL             set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
  --bind HOST:PORT              address to serve (host:port)
  --parameter-server HOST:PORT  parameter server address (host:port)
  --timeout TIMEOUT             worker stops on game reset after the given timeout (not needed for a local run)
```

Both the RLX Server (i.e., the workers) and the Parameter Server share the same configuration file. The file describes the algorithm to use and algorithm-specific parameters.
Configuration file example (relaax/config/da3c_ale_boxing.yaml):

```yaml
---
# relaax-parameter-server command line
relaax-parameter-server:
  --bind: localhost:7000
  --checkpoint-dir: checkpoints/boxing_a3c
  --log-level: WARNING
  --metrics-dir: metrics_ale_demo

# relaax-rlx-server command line
relaax-rlx-server:
  --bind: localhost:7001
  --parameter-server: localhost:7000
  --log-level: WARNING

# The number and meaning of these keys depend on the specific algorithm.
# path points to the algorithm directory. In this case we use one from the
# RELAAX repo. Feel free to create your own algorithm and use it for training.
algorithm:
  path: ../relaax/algorithms/da3c
  action_size: 4                # action size for the given game ROM (18 fits ALE Boxing)
  episode_len: 5                # local loop size for one episode
  gpu: false                    # to use GPU, set to true
  lstm: false                   # to use LSTM instead of FF, set to true
  max_global_step: 1e8          # maximum number of global steps for the training
  initial_learning_rate: 7e-4
  entropy_beta: 0.01            # entropy regularization constant
  rewards_gamma: 0.99           # rewards discount factor
  RMSProp:
    decay: 0.99
    epsilon: 0.1
    gradient_norm_clipping: 40
```

The Worker is the main training unit. The RLX Server starts a worker as a separate process on each new connection from a client. The new worker runs an agent and handles communication between the agent and the environment inside the client. Workers do not have a separate configuration or command line; both are inherited from the RLX Server when the worker is forked.
TODO: links to actual files
```
relaax
  server
    rlx_server
      worker.py
        class Worker
          def run(...): - using socket_protocol,
            run the message loop between the agent (local) and the client (remote)
```
The Parameter Server stores and updates the agents' Global Function NN. The Parameter Server is the hub of a star topology where workers are the leaves. If the selected algorithm allows sharding, the Parameter Server can be distributed over several nodes (shards) depending on load.

The Parameter Server is implemented as a GRPC server. The GRPC service definition depends on the specific RL algorithm and is bundled with the algorithm definition.

The Parameter Server implements dynamic loading of algorithm code (the same is true for workers). Several example algorithms are in the RELAAX repo. Feel free to copy and modify them according to your needs.

The Parameter Server stores the Global Function NN on the local file system (convenient for local training) or in AWS S3 storage (a must-have for training on a cluster).

Global Function NN states are stored in the form of checkpoints. Each checkpoint is marked with a training step number, which allows storing multiple checkpoints for the same training run to investigate training progress. When the Parameter Server starts, it searches the specified checkpoint location and loads the last saved checkpoint.
The Parameter Server saves a checkpoint:
- at regular intervals (default 15 min; this can be changed in config.yaml)
- when the training is over, i.e., the algorithm reports that the required number of training steps is done
- when it is stopped by a SIGINT signal (for example, Ctrl-C in the terminal running the Parameter Server)
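These three triggers can be sketched as one monitor loop; saver and training_done are hypothetical stand-ins for the actual Parameter Server internals:

```python
import signal
import time

SAVE_INTERVAL = 15 * 60  # default: every 15 minutes, configurable in config.yaml

def monitor_loop(saver, training_done):
    def on_sigint(signum, frame):
        saver.save_checkpoint()          # checkpoint on SIGINT (e.g., Ctrl-C)
        raise SystemExit
    signal.signal(signal.SIGINT, on_sigint)

    last_save = time.time()
    while not training_done():           # algorithm reports required steps done
        if time.time() - last_save >= SAVE_INTERVAL:
            saver.save_checkpoint()      # regular-interval checkpoint
            last_save = time.time()
        time.sleep(1)
    saver.save_checkpoint()              # final checkpoint when training is over
```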
TODO: links to actual files
```
relaax
  server
    parameter_server
      main.py
        def main(): - parse the command line,
          read the configuration YAML file,
          configure the checkpoint saver/loader, and
          run the server
      server.py  TODO: introduce OO structure to server.py
        def run(...): - load the algorithm definition,
          configure the algorithm parameter server,
          load the latest checkpoint if any,
          start the parameter server in a separate thread,
          run the monitor loop
```
When you install RELAAX on your node, you get the relaax-parameter-server command.

If you're going to run training locally, use the following command line:

```bash
relaax-parameter-server --config config.yaml --bind localhost:7000 --log-level WARNING --checkpoint-dir training/checkpoints --metrics-dir training/metrics
```

If you're going to run training on a cluster, use the following command line. The differences are the parameter server IP and the checkpoint and metrics locations:

```bash
relaax-parameter-server --config config.yaml --bind 0.0.0.0:7000 --log-level WARNING --checkpoint-aws-s3 my_bucket training/checkpoints --aws-keys aws-keys.yaml --metrics-dir training/metrics --metrics-aws-s3 my_bucket training/metrics
```

Available options are:

```
  -h, --help                    show help message and exit
  --log-level LEVEL             set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
  --config FILE                 configuration YAML file
  --bind HOST:PORT              address to serve (host:port)
  --checkpoint-dir DIR          training checkpoint directory
  --checkpoint-aws-s3 BUCKET KEY
                                AWS S3 bucket and key for training checkpoints
  --metrics-dir DIR             metrics data directory
  --metrics-aws-s3 BUCKET KEY
                                AWS S3 bucket and key for training metrics data
  --aws-keys FILE               YAML file containing AWS access and secret keys
```

Do not use both the --checkpoint-dir and --checkpoint-aws-s3 flags in the same command line.

The configuration file is the same as for the RLX Server. Please use the same configuration for the Parameter Server and the RLX Server; otherwise, training will fail.

You need to supply credentials to allow the Parameter Server to use AWS S3. The aws-keys.yaml file provides them:

```yaml
---
access: YOUR_ACCESS_KEY_HERE
secret: YOUR_SECRET_ACCESS_KEY_HERE
```

An algorithm is a usual Python package, but the RELAAX server loads algorithms dynamically. Dynamic loading simplifies algorithm development outside a Python package structure. The path to the selected algorithm is defined in config.yaml or on the command line. All algorithms follow the structure defined in the relaax/algorithm_base directory:
```
relaax
  algorithm_base
    parameter_server_base.py
      class ParameterServerBase
        def __init__(config, saver, metrics) - initialize parameter server
        def close(): - close parameter server
        def restore_latest_checkpoint(): - restore latest checkpoint
        def save_checkpoint(): - save new checkpoint
        def checkpoint_location(): - return human-readable checkpoint location
        def global_t(): - return current global learning step
        def bridge(): - return bridge interface
    agent_base.py
      class AgentBase
        def act(state): - take state and get action
        def reward_and_reset(reward): - take reward and reset training
        def reward_and_act(reward, state): - take reward and state and get action
        def metrics(): - get metrics object
    bridge_base.py
      class BridgeBase
        def increment_global_t(): - increment current global learning step
        def apply_gradients(gradients): - apply gradients to Global Function NN
        def get_values(): - get Global Function NN
        def metrics(): - get metrics object
      class BridgeControlBase
        def parameter_server_stub(parameter_server_url): - return parameter server stub object
        def start_parameter_server(address, service): - start parameter server with bind address and ParameterServerService object
    config_base.py
      class ConfigBase
        def __init__(config): - initialize configuration from loaded config.yaml
```
The algorithm package should export the following symbols:

```
class Config(ConfigBase) - algorithm configuration
class ParameterServer(ParameterServerBase) - implements the parameter server for the algorithm
  TODO: simplify API
class Agent(AgentBase) - learning agent of the algorithm
class Bridge(BridgeBase) - implements the bridge between agent and parameter server
class BridgeControl(BridgeControlBase) - controls the bridge between agent and parameter server
```

TODO: links to actual files
TODO: complete
```
relaax
  algorithms
    da3c
      __init__.py - algorithm API (see previous section)
      common
        lstm.py - long short-term memory NN
        network.py - algorithm NN
        config.py
          class Config - algorithm configuration
            def __init__(config): - initialize configuration from loaded config.yaml
        bridge
          bridge.sh - script to compile GRPC bridge
          bridge.proto - data bridge GRPC service
            service ParameterServer
              rpc IncrementGlobalT() - increment and get current global learning step
              rpc ApplyGradients() - apply gradients to Global Function NN
              rpc GetValues() - get Global Function NN
              rpc StoreScalarMetric() - store scalar metric value
          bridge.py - data bridge between rlx_server and parameter server;
            wraps the GRPC service defined in bridge.proto
            class BridgeControl
              def parameter_server_stub(): - return parameter server stub object (BridgeBase)
              def start_parameter_server(): - start parameter server with bind address and BridgeBase object
      agent
        agent.py
          class Agent - learning agent of the algorithm
            def act(): - take state and get action
            def reward_and_act(): - take reward and state and get action
            def reward_and_reset(): - take reward and reset training
            def metrics(): - get metrics object
        network.py - agent's facet of the algorithm NN
          def make(): - make agent's part of the algorithm NN
      parameter_server
        network.py - parameter server's facet of the algorithm NN
          def make(): - make parameter server's part of the algorithm NN
        parameter_server.py
          class ParameterServer - implements the parameter server for the algorithm
            def __init__(): - create new server
            def close(): - close server
            def restore_latest_checkpoint(): - restore latest checkpoint using given checkpoint saver
            def save_checkpoint(): - save checkpoint using given checkpoint saver
            def checkpoint_location(): - get human-readable checkpoint storage location
            def global_t(): - get current global learning step
            def bridge(): - return bridge interface
```
The purpose of the bridge is to provide data transport between the workers and the Parameter Server. Each worker and the Parameter Server has its own copy of the Global Function NN. The bridge provides the means to synchronize these Global Functions and allows distributing the training process across different processes on different computational nodes.

The bridge is part of the algorithm and is implemented as a thin wrapper over a GRPC service.

A minimal bridge GRPC service includes methods to update the Global Function on the Parameter Server and to synchronize the Global Function on the workers. This is the GRPC service for the Distributed A3C algorithm:
```proto
service ParameterServer {
    rpc IncrementGlobalT(NullMessage) returns (Step) {}
    rpc ApplyGradients(stream NdArray) returns (NullMessage) {}
    rpc GetValues(NullMessage) returns (stream NdArray) {}
    rpc StoreScalarMetric(ScalarMetric) returns (NullMessage) {}
}
```
The corresponding Parameter Server API looks like this (relaax/algorithms/da3c/common/bridge/__init__.py):
```python
class ParameterServerService(object):
    def increment_global_t(self):
        # increments the learning step on the Parameter Server
        return global_t

    def apply_gradients(self, gradients):
        # applies gradients from an Agent to the Parameter Server
        pass

    def get_values(self):
        # pulls the Global Function NN from the Parameter Server to an Agent
        return values

    def metrics(self):
        # gets the metrics object
        return metrics_object
```

Metrics are a way to gather information about the training process over time. RELAAX uses TensorFlow to gather metrics and TensorBoard to present them. Metrics can be gathered from the Parameter Server, the workers (agents), and the environments (clients).
Parameter Server:

```python
self.metrics().scalar('training_velocity', velocity, x=parameter_server.global_t())
```

Agent:

```python
self.metrics().scalar('act latency', latency, x=agent.global_t)
```

Environment:

```python
client.metrics().scalar('act latency on client', latency)
```

This call stores a metric with the given name and value. All metrics are stored as mappings from the training global step to the given values. All metrics can be browsed in real time during training with TensorBoard attached to the training cluster or to the local training.
DA3C gathers the following metrics:
- episode reward
- episode length
- episode time
- reward per time
- policy loss
- value loss
- grad (with global norm)
- entropy
- agent action latency (with/without network latency)
It's recommended to use an isolated Python environment to run RELAAX; Virtualenv or Anaconda are examples.

- Install PIP - a tool to install Python packages.

- Install TensorFlow (TODO: link)

- To install the training environment, clone the RELAAX Git repository:

  ```bash
  git clone git@github.com:deeplearninc/relaax.git
  ```

- Then navigate to the repository root and install the relaax package and all dependent packages:

  ```bash
  cd <relaax_repo>
  pip install .
  ```

If you are going to modify RELAAX code itself, then install it in "develop mode":

- Install PIP - a tool to install Python packages.

- Install TensorFlow (TODO: link)

- Clone the RELAAX Git repository:

  ```bash
  git clone TODO: add repo path
  ```

- Then navigate to the repository root and install the relaax package and all dependent packages:

  ```bash
  cd <relaax_repo>
  pip install -e .
  ```

- Build the algorithm bridges:

  ```bash
  <relaax_repo>/relaax/algorithms/bridge.sh
  ```

Inspired by the original paper - Asynchronous Methods for Deep Reinforcement Learning from DeepMind.
Environment (Client) - each client connects to a particular Agent (Learner).

The main role of any client is feeding data to an Agent by transferring state, reward, and terminal signals (for episodic tasks, if the episode ends). The client updates these signals at each time step, receives the action signal from the Agent, and sends the updated values back to it.

- Process State: each state can be passed through a filtering procedure before transfer (if you define one). It could be a color, edge, or blob transformation (for image input), or more complex pyramidal, Kalman, or spline filters.

Agent (Parallel Learner) - one or more Agents can connect to a Global Learner.

The main role of any agent is to run the main training loop. The agent synchronizes its neural network weights with the global network by copying the latter at the beginning of the loop. It then performs N steps of receiving the client's signals and sending actions back; these N steps are similar to batch collection. Once a batch is collected, the agent computes the loss (w.r.t. the collected data) and passes it to the optimizer. The RMSProp optimizer computes gradients, which are sent to the Global Learner to update its neural network weights. Several agents work in parallel and may update the global network concurrently (see the loop sketch below).
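In pseudocode terms, the loop looks roughly like this sketch; all names here (sync_from_global, compute_loss, env) are illustrative assumptions:

```python
def agent_loop(agent, global_learner, env, N=5):
    state = env.reset()
    while True:
        agent.sync_from_global(global_learner)   # copy global NN weights
        batch = []
        for _ in range(N):                       # collect an N-step batch
            action = agent.act(state)
            next_state, reward, terminal = env.step(action)
            batch.append((state, action, reward))
            state = env.reset() if terminal else next_state
            if terminal:
                break
        loss = agent.compute_loss(batch)         # policy loss + value loss
        grads = agent.compute_gradients(loss)    # local RMSProp gradient pass
        global_learner.apply_gradients(grads)    # asynchronous global update
```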
- Agent's Neural Network: we use the network architecture from this paper by Mnih et al. (by default; a sketch follows this list).
  - Input: 3D input passed through 2D convolutions (default: 84x84x4).
  - Convolution Layer #1: 16 filters with an 8x8 kernel and stride 4 in both directions, followed by ReLU (by default).
  - Convolution Layer #2: 32 filters with a 4x4 kernel and stride 2 in both directions, followed by ReLU (by default). The sequence of two convolutions allows capturing nonlinear dependencies.
  - Fully Connected Layer: 256 hidden units, followed by ReLU (by default).
  - Policy: outputs a number of units equal to the action size, passed through a softmax operator (by default). This is the Actor's output, representing a probability distribution over actions (summing to one) for the state-action value function Q(s, a).
  - Value: outputs one value without applying additional operators (by default). This is the Critic's output, representing the value function V(s) - how good this state is (equal to the expected return from this point).
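For reference, the default architecture above can be sketched with tf.keras; the layer sizes follow the text, while the framework choice and everything else here are assumptions, not the actual RELAAX code:

```python
import tensorflow as tf

def build_da3c_network(action_size=18):
    inputs = tf.keras.Input(shape=(84, 84, 4))  # 4 stacked 84x84 frames
    x = tf.keras.layers.Conv2D(16, 8, strides=4, activation='relu')(inputs)
    x = tf.keras.layers.Conv2D(32, 4, strides=2, activation='relu')(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(256, activation='relu')(x)
    policy = tf.keras.layers.Dense(action_size, activation='softmax')(x)  # Actor
    value = tf.keras.layers.Dense(1)(x)                                   # Critic
    return tf.keras.Model(inputs, [policy, value])
```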
- Total Loss: the scalar sum of the value and policy losses (a code sketch of this loss follows the list).

- Value Loss: the sum (over all batch samples) of the squared difference between the total discounted reward (R) and the value of the current sample state V(s), i.e., the expected accumulated reward from this time step:

  R = ri + gamma * V(s from N+1 step)

  where ri is the immediate reward of this sample, gamma is the discount factor (a constant for the model), and V(s from N+1 step) is the value of the state following the N-th state; if the next state is terminal, then V = 0.

- Policy Loss: the output of the policy (P) is an array with a probability distribution over all possible actions for the given sample state. The batch of samples is concatenated into a matrix.

  Policy Loss = log(P) * A * TD + entropy

  where A holds one-hot vectors for the chosen action of each sample. log(P) * A produces a sparse matrix, which we reduce to a column vector. TD = (R - V) is the temporal difference between the total discounted reward (R) and the value of the current sample state V(s), also a column vector. entropy = -sum(P * log(P), index=1) * entropy_beta_coefficient: after multiplying by the policy (P) likelihood, we sum the resulting matrix over rows to produce a column vector, then multiply by entropy_beta_coefficient = 0.01. Finally, we sum all column vectors and reduce the result to a scalar.

- Softmax Action: we more often choose the actions that have higher probability. This helps to explore many state-action pairs at the beginning of training; we become more confident in some actions as training proceeds, and the probability distribution becomes more acute. It also helps to solve the "path along the cliff" problem with a high reward at the end.

- RMSProp Optimizer: we use this optimizer per the original paper. RMSProp is more customizable than, for instance, Adam, and you can get better results if you fit it with appropriate parameters. We set learning rate = 7e-4 for RMSProp and linearly anneal this value over the training process w.r.t. the global training step. We also set decay = 0.99 and epsilon = 0.01 for the optimizer. The agent's RMSProp is used only to compute gradients w.r.t. the current agent's neural network weights and the given loss, while all moments and slots of the optimizer are stored (and shared) at the Global Learner.

- Gradients: we clip the computed gradients before transferring them:

  output_grads = computed_grads * 40.0 / l2norm(computed_grads)

- Synchronize Weights: we copy weights from the global network to the agent's network at the start of every training loop (every N steps).
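A sketch of the loss above, written with TensorFlow ops for concreteness; tensor shapes and the exact sign conventions are assumptions (the sketch negates the objective so it can be minimized):

```python
import tensorflow as tf

def da3c_loss(policy, value, actions_one_hot, R, entropy_beta=0.01):
    """policy: (batch, actions); value, R: (batch,); actions_one_hot: (batch, actions)."""
    log_p = tf.math.log(tf.clip_by_value(policy, 1e-10, 1.0))
    td = R - tf.stop_gradient(value)                   # TD = R - V for the policy term
    log_pa = tf.reduce_sum(log_p * actions_one_hot, axis=1)
    entropy = -tf.reduce_sum(policy * log_p, axis=1)   # per-sample entropy
    policy_loss = -tf.reduce_sum(log_pa * td + entropy_beta * entropy)
    value_loss = tf.reduce_sum(tf.square(R - value))   # sum of squared differences
    return policy_loss + value_loss
```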
Global Learner - one for the whole algorithm (training process).

The main role of the Global Learner is to update its own neural network weights by receiving gradients from the Agents and to send these weights back to the Agents for synchronization. The Global Learner can be sharded to speed up the training process.

- Global Neural Network: the network architecture is similar to the Agent's.

- RMSProp Optimizer: has the same parameters as the Agent's, but is used only to apply received gradients. This RMSProp stores the moments and slots that are global for all Agents.
You can also specify hyperparameters for training in the provided params.yaml file:

```yaml
episode_len: 5                # training loop size for one batch
max_global_step: 1e8          # maximum number of global steps for the training
initial_learning_rate: 7e-4   # initial learning rate
entropy_beta: 0.01            # entropy regularization constant
rewards_gamma: 0.99           # discount factor for rewards
RMSProp:                      # optimizer's parameters
  decay: 0.99
  epsilon: 0.1
  gradient_norm_clipping: 40
```
Breakout with DA3C-FF and 8 parallel agents: score performance is similar to DeepMind paper

Breakout with DA3C-FF and 8 parallel agents: in this case we significantly outperform DeepMind, but we have some instability in the training process (in any case, DeepMind shows only 34 points after 80M steps)

A version of the Distributed A3C algorithm that can cope with a continuous action space. Inspired by the original paper - Asynchronous Methods for Deep Reinforcement Learning from DeepMind.
Most of the parts are the same as in the previous scheme, excluding:

- Signal Filtering: performed by a Zfilter, y = (x - mean) / std, using running estimates of the mean and std, inspired by this source. You can filter both states and rewards; we use it only for states by default.

- Agent's (Global) Neural Network: we use an architecture similar to the A3C paper. By default, each continuous state passes through a filtering procedure before being transferred to the Input.
  - Input: vector of filtered state input (default: 24).
  - Fully Connected Layer: 128 hidden units, followed by ReLU (by default).
  - LSTM: 128 memory cells (by default).
  - Value: outputs one value without applying additional operators (by default). This is the Critic's output, representing the value function V(s) - how good this state is (equal to the expected return from this point).
  - Policy: the Actor's output is split into mu and sigma:
    - mu: a scalar linear output.
    - sigma: a scalar output passed through the SoftPlus operator.

  You can also specify your own architecture in the provided JSON file.
- Choose Action: we use random sampling w.r.t. the given mu and sigma.

- Total Loss: the scalar sum of the value and policy losses.
  - Value Loss: the same as in the previous scheme.
  - Policy Loss: GausNLL * TD + entropy, where GausNLL is the Gaussian negative log-likelihood:

    GausNLL = (sum(log(sigma), index=1) + batch_size * log(2*pi)) / 2 - sum(power, index=1)

    where power = (A - mu)^2 * exp(-log(sigma)) * -0.5 produces a column vector. TD = (R - V) is the temporal difference between the total discounted reward (R) and the value of the current sample state V(s). entropy = -sum(0.5 * log(2 * pi * sigma) + 1, index=1) * entropy_beta_coefficient; the resulting matrix is summed over rows to produce a column vector, with entropy_beta_coefficient = 0.001.

We also use a smaller learning rate = 1e-4.
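A NumPy sketch of the continuous policy head described above: sampling an action from N(mu, sigma) and the Gaussian negative log-likelihood used in the loss. Shapes (batch x action dims) and function names are assumptions for illustration:

```python
import numpy as np

def sample_action(mu, sigma):
    return np.random.normal(mu, sigma)  # random sampling w.r.t. mu and sigma

def gaussian_nll(actions, mu, sigma):
    """Per-sample negative log-likelihood of actions under N(mu, sigma),
    summed over action dimensions (axis 1)."""
    var = sigma ** 2
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (actions - mu) ** 2 / var, axis=1)
```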
Measure how fast the Agent returns an Action in response to the State sent by the Client:
| Node Type | Number of clients | Latency |
|---|---|---|
| m4.xlarge | 32 | 323.23ms |
| m4.xlarge | 64 | ???ms |
| m4.xlarge | 48 | ???ms |
| c4.xlarge | 48 | ???ms |
| c4.xlarge | 64 | ???ms |
| c4.xlarge-m4.xlarge | 64 | ???ms |
| c4.xlarge-m4.xlarge | 96 | ???ms |
| c4.xlarge-m4.xlarge | 128 | ???ms |
| c4.2xlarge | 232 | ???ms |
| c4.2xlarge | 271 | ???ms |
TBD - Latency chart (Show latency of the agents over time)
| Node Type | Number of clients | Performance |
|---|---|---|
| m4.xlarge | 32 | 99 steps per sec |
| m4.xlarge | 64 | 171 steps per sec |
| m4.xlarge | 48 | 167 steps per sec |
| c4.xlarge | 48 | 169 steps per sec |
| c4.xlarge | 64 | 207 steps per sec |
| c4.xlarge-m4.xlarge | 64 | 170 steps per sec |
| c4.xlarge-m4.xlarge | 96 | 167 steps per sec |
| c4.xlarge-m4.xlarge | 128 | 177 steps per sec |
| c4.2xlarge | 232 | 232 steps per sec |
| c4.2xlarge | 271 | 271 steps per sec |
These are other algorithms we are working on and plan to make runnable on the RELAAX server:

- TRPO-GAE. Inspired by:
- ACER (A3C with experience replay). Inspired by:
- UNREAL. Inspired by:
- Distributed DQN (Gorila). Inspired by:
- PPO with L-BFGS (similar to TRPO). Inspired by:
- CEM. Inspired by:
- DDPG. Inspired by:
To train RL Agents at scale, the RELAAX Server and supported Environments can be deployed in the cloud (AWS, GCP, Azure).

RELAAX comes with scripts and an online service to allocate all required network components (VPC, subnets, load balancers), autoscaling groups, instances, etc., and to provision software on the appropriate instances.





