Nachum O, Gu SS, Lee H, Levine S. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems 2018 (pp. 3303-3313). Code: https://github.com/tensorflow/models/tree/master/research/efficient-hrl

Abstract

Proposes a Hierarchical Reinforcement Learning method where a higher-level controller learns to propose goals and a lower-level controller learns to reach them. It also utilizes off-policy samples for training, so that far fewer environment interactions are needed, which matters in real-world scenarios.

Why is it good?

Previous methods [43, 10, 11, 18, 36] required careful task-specific design and on-policy training, which makes them difficult to apply to real-world problems.

The proposed method can be learned with a modest number of interaction samples, because it utilizes off-policy experience for both higher- and lower-level training.

Core Technology

Lower-level controllers are supervised with goals that are learned and proposed by the higher-level controllers. They also introduce an off-policy correction to handle the non-stationarity problem: higher-level transitions collected under an old lower-level policy become invalid once the lower-level policy changes, because the same goal no longer produces the same lower-level behavior.

Verification Method

Any Argument or Idea?

Next Paper to read

Details

Math

Structure of the Network

The higher level outputs a goal $g_t$ to the lower level, but only emits a new goal once every c steps. I think you can see it as a path planner giving the current sub-goal $g_t$.

\[g_t = \begin{cases} \mu^{hi}(s_t) & \text{if } t \equiv 0 \ (\mathrm{mod}\ c) \\ h(s_{t-1}, g_{t-1}, s_t) = s_{t-1} + g_{t-1} - s_t & \text{otherwise} \end{cases}\]
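
A minimal sketch of this goal update, assuming goals are relative state offsets stored as numpy arrays and `high_policy` is the higher-level actor; the function name and signature are mine:

def next_goal(t, c, s_t, high_policy, s_prev=None, g_prev=None):
    # re-sample a new goal from the higher level every c steps; otherwise
    # apply the goal transition h(s_{t-1}, g_{t-1}, s_t) = s_{t-1} + g_{t-1} - s_t
    if t % c == 0:
        return high_policy(s_t)
    return s_prev + g_prev - s_t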

Off-Policy Corrections for Higher-Level Training

Let $g_t$ be the goal stored in the higher-level experience, computed before the lower level was updated, and let $\tilde{g}_t$ be the re-labelled goal. We want to replace $g_t$ with a $\tilde{g}_t$ under which the current lower-level policy would most likely have produced the observed actions $a_{t:t+c-1}$.

Old             New
Higher          Higher
    |               |
    | g_t           | \tilde{g_t}
    v               v
Lower           Lower

What they did is the following (a code sketch is given after the list):

  1. Sample 8 candidate goals $\tilde{g}_{t,i} \sim \mathcal{N}(s_{t+c}-s_t, \sigma)$.
  2. Include the original $g_t$ and a goal corresponding to the difference $s_{t+c}-s_t$, so that the candidate set for $\tilde{g}_t$ has 10 members.
  3. Compute the log probability $\log \mu^{lo}(a_{t:t+c-1} \mid s_{t:t+c-1}, \tilde{g}_{t:t+c-1})$ for each candidate:
\[\log \mu^{lo}(a_{t:t+c-1} \mid s_{t:t+c-1}, \tilde{g}_{t:t+c-1}) \propto -\frac{1}{2} \sum_{i=t}^{t+c-1} \|a_i - \mu^{lo}(s_i, \tilde{g}_i)\|_{2}^{2}\]
  4. Choose the candidate with the maximal log probability and use it to re-label the experience for training the higher-level controller.
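
A minimal sketch of this relabelling step, assuming a deterministic lower-level policy `mu_lo(state, goal)` that returns an action array and the goal transition $h(s, g, s') = s + g - s'$ used elsewhere in these notes; the function name, `sigma`, and the argument layout are mine:

import numpy as np

def off_policy_correct(mu_lo, states, actions, orig_goal, sigma=1.0, n_samples=8):
    # states = [s_t, ..., s_{t+c}], actions = [a_t, ..., a_{t+c-1}]
    s_t, s_tc = states[0], states[-1]
    diff = s_tc - s_t
    # candidate set: 8 Gaussian samples + the original goal + (s_{t+c} - s_t)
    candidates = [diff + sigma * np.random.randn(*diff.shape) for _ in range(n_samples)]
    candidates += [orig_goal, diff]

    def log_prob(goal):
        g, total = goal, 0.0
        for s, a, s_next in zip(states[:-1], actions, states[1:]):
            total += -0.5 * np.sum((a - mu_lo(s, g)) ** 2)
            g = s + g - s_next  # goal transition h(s, g, s')
        return total

    # relabel with the candidate that best explains the observed actions
    return max(candidates, key=log_prob)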

Learning Policy

In the paper they used TD3 (a variant of DDPG) for learning the policy $\mu_{\phi}$. The behavior policy (used for exploration) is given by $a_t \sim \mathcal{N}(\mu_{\phi}(s_t), \sigma)$, where $\sigma$ is a fixed standard deviation.

\[Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ R(s_t, a_t) + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right]\]

where \(\mu_{\phi}(s_t) = \operatorname{argmax}_a Q(s_t, a)\), i.e. the actor is trained to output the action that maximizes the critic.
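
Since the action space is continuous, the $\max_a$ is approximated by the actor. A minimal DDPG-style sketch of the critic target (TD3 additionally uses twin critics, target-policy smoothing, and delayed actor updates); `critic_target` and `actor_target` are placeholder callables, not names from the paper:

def td_target(critic_target, actor_target, r, s_next, done, gamma=0.99):
    # max_a Q(s_{t+1}, a) is approximated by Q(s_{t+1}, mu_phi(s_{t+1}))
    a_next = actor_target(s_next)
    return r + gamma * (1.0 - done) * critic_target(s_next, a_next)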

Higher Level Controller has tuples of $(s_{t:t+c-1}, \tilde{g}_{t:t+c-1}, a_{t:t+c-1}, R_{t:t+c-1}, s_{t+c})$.

Lower Level Controller has tuples of $((s_t, g_t), a_t, r_t, (s_{t+1}, g_{t+1}))$, where $r_t$ is the intrinsic goal-reaching reward.

Put them into their respective replay buffers and optimize the two networks separately.
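
As a small sketch of the two buffer entries (field names are my own, not from the paper or the official code):

from collections import namedtuple

# one higher-level transition per c environment steps
HighTransition = namedtuple(
    "HighTransition", ["s_seq", "g_seq", "a_seq", "r_seq", "s_next"])

# one lower-level transition per environment step; the goal is part of the input
LowTransition = namedtuple(
    "LowTransition", ["s", "g", "a", "r", "s_next", "g_next"])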

Code

Pseudo Code

import numpy as np

# Pseudo code of the training loop. agent, env, rb (the two replay buffers),
# add_noise, num_episodes, c and high_level_update_freq are assumed to exist.
for e in range(num_episodes):
    s, goal_coord = env.reset()
    low_g = agent.high_policy(s)         # current sub-goal given to the lower level
    count, r_sum, done = 0, 0.0, False   # r_sum: episode return, for logging
    # higher-level transition: begin_s, subgoal, sum_rewards, [states], [actions]
    high_transition = [s, low_g, 0.0, [], []]

    while not done:
        a = agent.low_policy(s, low_g)
        a = add_noise(a)                 # Gaussian exploration noise N(0, sigma)
        n_s, high_r, done, info = env.step(a)

        low_r = low_level_reward(s, low_g, n_s)
        n_low_g = low_level_goal(s, low_g, n_s)   # goal transition h(s, g, s')

        rb.low.append((s, low_g), a, (n_s, n_low_g), low_r, done)

        # accumulate the c-step higher-level transition
        high_transition[2] += high_r
        high_transition[3].append(s)
        high_transition[4].append(a)

        count += 1
        r_sum += high_r

        if count % c == 0 or done:
            # close the c-step transition; the stored subgoal will later be
            # re-labelled by the off-policy correction
            high_transition.append(n_s)  # end_s = s_{t+c}
            rb.high.append(high_transition)
            n_low_g = agent.high_policy(n_s)
            high_transition = [n_s, n_low_g, 0.0, [], []]

        s, low_g = n_s, n_low_g

    # train the lower level every episode (TD3 / deterministic policy gradient)
    batch_low = rb.low.sample()
    agent.train_low_level(batch_low)

    # train the higher level less frequently; its subgoals are re-labelled
    # by the off-policy correction before the update
    if e % high_level_update_freq == 0:
        batch_high = rb.high.sample()
        batch_high = agent.off_policy_correction(batch_high)
        agent.train_high_level(batch_high)


def low_level_goal(st, gt, n_s):
    # goal transition h(s_t, g_t, s_{t+1}) = s_t + g_t - s_{t+1}
    return st + gt - n_s


def low_level_reward(st, gt, n_s):
    # intrinsic reward: negative distance between the reached state and the goal
    return -np.linalg.norm(st + gt - n_s)