Incomplete relabelling of trajectories in HER #96

@rstrudel

Hi,

First, thanks a lot for releasing such a nice repository. I have been using it for a few months now, and I appreciate that it is well written and that most of the code is self-explanatory. I learned a lot just by using it.

I am using SAC-HER and ran into many divergence issues, which I eventually fixed. One of the main problems came from the relabelling of samples in the buffer: https://github.com/vitchyr/rlkit/blob/90195b24604f513403e4d0fe94db372d16700523/rlkit/data_management/obs_dict_replay_buffer.py#L228-L239
Given a batch, a new set of rewards is computed according to the updated set of goals. However, the terminals are not updated, even though some states may be terminal under the new goal.

This matters in the Bellman update, where the terminals variable appears: https://github.com/vitchyr/rlkit/blob/90195b24604f513403e4d0fe94db372d16700523/rlkit/torch/sac/sac.py#L128

If the reward is of the form -distance(state, goal) and an episode only ends because the maximum path length is reached, then not updating the terminals has little impact. That may be why this bug went unnoticed. However, I am working with a sparse reward which is 1 if distance(state, goal) < epsilon. In this case, if the terminals are not updated, the Q-function blows up. Indeed, assume that target_q_values = 1 at the goal: if terminals = 0, then q_target = 2, at the next iteration q_target = 3, and so on. If terminals = 1, i.e. if the state is terminal according to the resampled goal, then q_target = 1.
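To make the arithmetic concrete, here is a minimal sketch. It assumes the target has the usual form reward + (1 - terminal) * discount * next_q, that the next state is also at the goal, and that discount = 1 so the numbers match the ones above. The function names are illustrative and not taken from rlkit.

```python
def bellman_target(reward, terminal, next_q, discount=1.0):
    # Assumed target form: bootstrap from next_q only when the state is not terminal.
    return reward + (1.0 - terminal) * discount * next_q


def iterate_target(terminal, steps, discount=1.0, reward=1.0, q0=1.0):
    # Repeatedly feed the target back in as the next-state value,
    # mimicking a goal state whose successor is also at the goal.
    q = q0
    history = []
    for _ in range(steps):
        q = bellman_target(reward, terminal, q, discount)
        history.append(q)
    return history


not_relabelled = iterate_target(terminal=0.0, steps=3)  # terminals left at 0
relabelled = iterate_target(terminal=1.0, steps=3)      # terminals set to 1 at the goal
```

Under these assumptions the target grows by one reward per iteration when terminals stay at 0 and stays at the reward when they are set to 1. With a discount below 1 the growth would be bounded by 1 / (1 - discount) rather than unbounded, which can still be far from the intended value of 1.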

So in my fork of your repository, I replaced:

    new_rewards = self.env.compute_rewards(
        new_actions,
        new_next_obs_dict,
    )

by

    new_rewards, new_terminals = self.env.compute_rewards(
        new_actions,
        new_next_obs_dict,
    )

where new_terminals is 1 if distance(state, goal) < epsilon in my case. This fixed the Q-function blow-up issue.
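For illustration, a reward function that returns both quantities could look roughly like the sketch below. The function name, the achieved_goal / desired_goal keys, and the epsilon default are assumptions made for this example, not rlkit's actual API. It uses plain Python lists to stay self-contained; a real implementation would likely be vectorized.

```python
import math


def compute_rewards_and_terminals(next_obs_dict, epsilon=0.05):
    # Hypothetical helper: sparse reward of 1 when the achieved goal is within
    # epsilon of the (relabelled) desired goal, and a matching terminal flag.
    rewards, terminals = [], []
    pairs = zip(next_obs_dict["achieved_goal"], next_obs_dict["desired_goal"])
    for achieved, desired in pairs:
        reached = math.dist(achieved, desired) < epsilon
        rewards.append(1.0 if reached else 0.0)
        terminals.append(1.0 if reached else 0.0)
    return rewards, terminals


# Two relabelled transitions: the first reaches its new goal, the second does not.
batch = {
    "achieved_goal": [[0.0, 0.0], [1.0, 1.0]],
    "desired_goal": [[0.0, 0.01], [0.0, 0.0]],
}
new_rewards, new_terminals = compute_rewards_and_terminals(batch)
```

Returning the terminals from the same call keeps them consistent with the relabelled rewards, since both are derived from the same distance check.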
