Reacher Continuous Control using DDPG
This project uses Unity ML-Agents version 0.4.0, so the following setup steps are needed.
Install Anaconda from the link below:
https://www.anaconda.com/products/individual
Create the virtual environment:
```bash
# Create the virtual env DQN
conda create -n DQN_navigation python=3.6
# activate environment
source activate DQN_navigation
# clone the udacity repo
git clone https://github.com/udacity/deep-reinforcement-learning.git
# go to the python folder of the repo
cd deep-reinforcement-learning/python
# install the unityagents package from this folder
pip install -e .
# clone this repository (Reacher_Continuous_Control)
cd ..
git clone https://github.com/TriKnight/Reacher_Continuous_Control.git
# install the requirements from our package
cd Reacher_Continuous_Control
pip install -r requirements.txt
conda install -c anaconda ipykernel
python -m ipykernel install --user --name=DQN_navigation
```
- Open Jupyter Notebook:
jupyter notebook
- Open Reacher_Continuous_Control
- Change the kernel: Kernel > Change Kernel > DQN_navigation
Note that your project submission need only solve one of the two versions of the environment: one (1) agent or twenty (20) agents.

Download the environment from one of the links below. You need only select the environment that matches your operating system:
Version 1: One (1) Agent
- Linux: click here
- Mac OSX: click here
- Windows (32-bit): click here
- Windows (64-bit): click here
Version 2: Twenty (20) Agents
- Linux: click here
- Mac OSX: click here
- Windows (32-bit): click here
- Windows (64-bit): click here
The task is episodic, and in order to solve the environment, your agent must get an average score of +30 over 100 consecutive episodes.
- Observations: Each agent receives a 33-dimensional observation vector with measurements such as the relative positions and orientations of the arm's links, and the relative position and speed of the target.
- Actions: Each agent moves its arm by applying an action vector of 4 torque values, two for each of the 2 actuated joints.
- Rewards: Each agent receives a reward of +0.1 for every step its end effector is within the goal location. The environment is considered solved once the agent achieves an average reward of +30 over 100 episodes (a short environment-loading sketch follows this list).
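The snippet below is a minimal sketch of how this environment can be loaded and inspected with the `unityagents` package installed above. The binary path `Reacher_Linux/Reacher.x86_64` is an assumption; replace it with the file you downloaded for your operating system and chosen version.

```python
from unityagents import UnityEnvironment
import numpy as np

# Path to the downloaded Reacher binary -- adjust to your OS and version (assumed path).
env = UnityEnvironment(file_name="Reacher_Linux/Reacher.x86_64")

# The default brain controls the agent(s).
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

# Reset in training mode and inspect the observation/action spaces.
env_info = env.reset(train_mode=True)[brain_name]
num_agents = len(env_info.agents)                    # 1 or 20, depending on the version
state_size = env_info.vector_observations.shape[1]   # 33
action_size = brain.vector_action_space_size         # 4
print(num_agents, state_size, action_size)

# One random interaction step: each action entry must lie in [-1, 1].
actions = np.clip(np.random.randn(num_agents, action_size), -1, 1)
env_info = env.step(actions)[brain_name]
print(env_info.rewards, env_info.local_done)
env.close()
```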
The barrier for solving the second version of the environment is slightly different, to take into account the presence of many agents. In particular, your agents must get an average score of +30 (over 100 consecutive episodes, and over all agents). Specifically,
- After each episode, we add up the rewards that each agent received (without discounting), to get a score for each agent. This yields 20 (potentially different) scores. We then take the average of these 20 scores.
- This yields an average score for each episode (where the average is over all 20 agents).
The environment is considered solved when the average (over 100 episodes) of those average scores is at least +30 (see the scoring sketch below).
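As a small illustration of this scoring rule, the sketch below averages the 20 per-agent returns after each episode and tracks a 100-episode moving average. The names `record_episode` and `agent_returns` are hypothetical helpers, not part of the repository code.

```python
import numpy as np
from collections import deque

scores_window = deque(maxlen=100)   # averages of the last 100 episodes
episode_averages = []               # one average score per episode

def record_episode(agent_returns):
    """agent_returns: array of shape (20,) holding each agent's undiscounted return."""
    episode_score = np.mean(agent_returns)      # average over all 20 agents
    episode_averages.append(episode_score)
    scores_window.append(episode_score)
    # Solved once the mean of the last 100 episode averages reaches +30.
    return len(scores_window) == 100 and np.mean(scores_window) >= 30.0
```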
In this project, we use Deep Deterministic Policy Gradient (DDPG) to solve the environment; a sketch of its core update step follows the list below.
- DDPG is an off-policy algorithm.
- DDPG can only be used for environments with continuous action spaces.
- DDPG can be thought of as being deep Q-learning for continuous action spaces.
- The Spinning Up implementation of DDPG does not support parallelization.
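The sketch below illustrates that core update: the critic is regressed toward a bootstrapped target computed with the target networks, the actor follows the deterministic policy gradient, and both target networks are soft-updated. The network sizes and the `GAMMA`/`TAU` values are illustrative assumptions, not necessarily the hyperparameters used in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

GAMMA, TAU = 0.99, 1e-3   # assumed values, for illustration only

class Actor(nn.Module):
    """Deterministic policy: state -> action in [-1, 1]^action_size."""
    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, action_size), nn.Tanh())

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Q-function: (state, action) -> scalar value."""
    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size + action_size, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def ddpg_update(actor, actor_target, critic, critic_target,
                actor_opt, critic_opt, batch):
    # batch tensors are shaped (batch_size, ...), with rewards/dones as (batch_size, 1).
    states, actions, rewards, next_states, dones = batch

    # Critic: regress Q(s, a) toward r + gamma * (1 - done) * Q'(s', mu'(s')).
    with torch.no_grad():
        next_actions = actor_target(next_states)
        q_targets = rewards + GAMMA * (1 - dones) * critic_target(next_states, next_actions)
    critic_loss = F.mse_loss(critic(states, actions), q_targets)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: maximize Q(s, mu(s)) by minimizing its negative.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft-update the target networks toward the local networks.
    for target, local in ((actor_target, actor), (critic_target, critic)):
        for t_param, l_param in zip(target.parameters(), local.parameters()):
            t_param.data.copy_(TAU * l_param.data + (1 - TAU) * t_param.data)
```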





