Nachum O, Gu SS, Lee H, Levine S. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems 2018 (pp. 3303-3313). Code: https://github.com/tensorflow/models/tree/master/research/efficient-hrl

Abstract

Proposes a Hierarchical Reinforcement Learning method where a higher-level controller learns to propose goals and a lower-level controller learns to reach them. It also utilizes off-policy samples for training, so that far fewer environment interactions are needed, which matters in real-world scenarios.

Why is it good?

Previous methods [43, 10, 11, 18, 36] required careful task-specific design and on-policy training, which makes them difficult to apply to real-world problems.

The proposed method can be learned with a modest number of interaction samples, because it utilizes off-policy experience for both higher- and lower-level training.

Core Technology

Lower-level controllers are supervised with goals that are learned and proposed by the higher-level controllers. They also introduce an off-policy correction to handle the non-stationarity problem: higher-level transitions collected under an old lower-level policy become invalid once the lower-level policy changes, because the same goal no longer produces the same lower-level behavior.

Verification Method

Any Argument or Idea?

Next Paper to read

Details

Math

Structure of the Network

The higher level outputs a goal $g_t$ to the lower level, but only emits a new goal once every c steps. I think you can see it as a path planner giving the current sub-goal $g_t$.

\[g_t = \begin{cases} \mu^{hi}(s_t) & \text{if } t \equiv 0 \ (\mathrm{mod}\ c) \\ h(s_{t-1}, g_{t-1}, s_t) = s_{t-1} + g_{t-1} - s_t & \text{otherwise} \end{cases}\]
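
A minimal sketch of this goal update, assuming goals are relative state offsets stored as numpy arrays and `high_policy` is the higher-level actor; the function name and signature are mine:

def next_goal(t, c, s_t, high_policy, s_prev=None, g_prev=None):
    # re-sample a new goal from the higher level every c steps; otherwise
    # apply the goal transition h(s_{t-1}, g_{t-1}, s_t) = s_{t-1} + g_{t-1} - s_t
    if t % c == 0:
        return high_policy(s_t)
    return s_prev + g_prev - s_t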

Off-Policy Corrections for Higher-Level Training

Let $g_t$ be the goal stored in the higher-level experience, computed before the lower level was updated, and let $\tilde{g}_t$ be the re-labelled goal. We want to replace $g_t$ with a $\tilde{g}_t$ under which the current lower-level policy would most likely have produced the observed actions $a_{t:t+c-1}$.

Old             New
Higher          Higher
    |               |
    | g_t           | \tilde{g_t}
    v               v
Lower           Lower

What they did is the following (a code sketch is given after the list):

  1. Sample 8 candidate goals $\tilde{g}_{t,i} \sim \mathcal{N}(s_{t+c}-s_t, \sigma)$.
  2. Include the original $g_t$ and a goal corresponding to the difference $s_{t+c}-s_t$, so that the candidate set for $\tilde{g}_t$ has 10 members.
  3. Compute the log probability $\log \mu^{lo}(a_{t:t+c-1} \mid s_{t:t+c-1}, \tilde{g}_{t:t+c-1})$ for each candidate:
\[\log \mu^{lo}(a_{t:t+c-1} \mid s_{t:t+c-1}, \tilde{g}_{t:t+c-1}) \propto -\frac{1}{2} \sum_{i=t}^{t+c-1} \|a_i - \mu^{lo}(s_i, \tilde{g}_i)\|_{2}^{2}\]
  4. Choose the candidate with the maximal log probability and use it to re-label the experience for training the higher-level controller.
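
A minimal sketch of this relabelling step, assuming a deterministic lower-level policy `mu_lo(state, goal)` that returns an action array and the goal transition $h(s, g, s') = s + g - s'$ used elsewhere in these notes; the function name, `sigma`, and the argument layout are mine:

import numpy as np

def off_policy_correct(mu_lo, states, actions, orig_goal, sigma=1.0, n_samples=8):
    # states = [s_t, ..., s_{t+c}], actions = [a_t, ..., a_{t+c-1}]
    s_t, s_tc = states[0], states[-1]
    diff = s_tc - s_t
    # candidate set: 8 Gaussian samples + the original goal + (s_{t+c} - s_t)
    candidates = [diff + sigma * np.random.randn(*diff.shape) for _ in range(n_samples)]
    candidates += [orig_goal, diff]

    def log_prob(goal):
        g, total = goal, 0.0
        for s, a, s_next in zip(states[:-1], actions, states[1:]):
            total += -0.5 * np.sum((a - mu_lo(s, g)) ** 2)
            g = s + g - s_next  # goal transition h(s, g, s')
        return total

    # relabel with the candidate that best explains the observed actions
    return max(candidates, key=log_prob)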

Learning Policy

In the paper they used TD3 (a variant of DDPG) for learning the policy $\mu_{\phi}$. The behavior policy (used for exploration) is given by $a_t \sim \mathcal{N}(\mu_{\phi}(s_t), \sigma)$, where $\sigma$ is a fixed standard deviation.

\[Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ R(s_t, a_t) + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right]\]

where \(\mu_{\phi}(s_t) = \operatorname{argmax}_a Q(s_t, a)\), i.e. the actor is trained to output the action that maximizes the critic.
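
Since the action space is continuous, the $\max_a$ is approximated by the actor. A minimal DDPG-style sketch of the critic target (TD3 additionally uses twin critics, target-policy smoothing, and delayed actor updates); `critic_target` and `actor_target` are placeholder callables, not names from the paper:

def td_target(critic_target, actor_target, r, s_next, done, gamma=0.99):
    # max_a Q(s_{t+1}, a) is approximated by Q(s_{t+1}, mu_phi(s_{t+1}))
    a_next = actor_target(s_next)
    return r + gamma * (1.0 - done) * critic_target(s_next, a_next)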

Higher Level Controller has tuples of $(s_{t:t+c-1}, \tilde{g}_{t:t+c-1}, a_{t:t+c-1}, R_{t:t+c-1}, s_{t+c})$.

Lower Level Controller has tuples of $((s_t, g_t), a_t, r_t, (s_{t+1}, g_{t+1}))$, where $r_t$ is the intrinsic goal-reaching reward.

Put them into their respective replay buffers and optimize the two networks separately.
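
As a small sketch of the two buffer entries (field names are my own, not from the paper or the official code):

from collections import namedtuple

# one higher-level transition per c environment steps
HighTransition = namedtuple(
    "HighTransition", ["s_seq", "g_seq", "a_seq", "r_seq", "s_next"])

# one lower-level transition per environment step; the goal is part of the input
LowTransition = namedtuple(
    "LowTransition", ["s", "g", "a", "r", "s_next", "g_next"])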

Code

Pseudo Code

import numpy as np

# Pseudo code of the training loop. agent, env, rb (the two replay buffers),
# add_noise, num_episodes, c and high_level_update_freq are assumed to exist.
for e in range(num_episodes):
    s, goal_coord = env.reset()
    low_g = agent.high_policy(s)         # current sub-goal given to the lower level
    count, r_sum, done = 0, 0.0, False   # r_sum: episode return, for logging
    # higher-level transition: begin_s, subgoal, sum_rewards, [states], [actions]
    high_transition = [s, low_g, 0.0, [], []]

    while not done:
        a = agent.low_policy(s, low_g)
        a = add_noise(a)                 # Gaussian exploration noise N(0, sigma)
        n_s, high_r, done, info = env.step(a)

        low_r = low_level_reward(s, low_g, n_s)
        n_low_g = low_level_goal(s, low_g, n_s)   # goal transition h(s, g, s')

        rb.low.append((s, low_g), a, (n_s, n_low_g), low_r, done)

        # accumulate the c-step higher-level transition
        high_transition[2] += high_r
        high_transition[3].append(s)
        high_transition[4].append(a)

        count += 1
        r_sum += high_r

        if count % c == 0 or done:
            # close the c-step transition; the stored subgoal will later be
            # re-labelled by the off-policy correction
            high_transition.append(n_s)  # end_s = s_{t+c}
            rb.high.append(high_transition)
            n_low_g = agent.high_policy(n_s)
            high_transition = [n_s, n_low_g, 0.0, [], []]

        s, low_g = n_s, n_low_g

    # train the lower level every episode (TD3 / deterministic policy gradient)
    batch_low = rb.low.sample()
    agent.train_low_level(batch_low)

    # train the higher level less frequently; its subgoals are re-labelled
    # by the off-policy correction before the update
    if e % high_level_update_freq == 0:
        batch_high = rb.high.sample()
        batch_high = agent.off_policy_correction(batch_high)
        agent.train_high_level(batch_high)


def low_level_goal(st, gt, n_s):
    # goal transition h(s_t, g_t, s_{t+1}) = s_t + g_t - s_{t+1}
    return st + gt - n_s


def low_level_reward(st, gt, n_s):
    # intrinsic reward: negative distance between the reached state and the goal
    return -np.linalg.norm(st + gt - n_s)