Nachum O, Gu SS, Lee H, Levine S. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, 2018 (pp. 3303–3313). Code: https://github.com/tensorflow/models/tree/master/research/efficient-hrl
Proposed a Hierarchical Reinforcement Learning (HRL) method in which a higher-level controller learns to propose goals and a lower-level controller learns to reach them. It also uses off-policy samples for training, so fewer real-world interactions are needed.
Previous methods [43, 10, 11, 18, 36] required careful task-specific design and on-policy training => difficult to apply in real-world settings.
The proposed method can be learned from a modest number of interaction samples, because it uses off-policy experience for both higher- and lower-level training.
Lower-level controllers are supervised with goals that are learned and proposed by the higher-level controllers. The authors also introduce an off-policy correction to deal with old higher-level experience becoming stale as the lower-level policy changes.
Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pages 1726–1734, 2017.
Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017.
Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
The higher level outputs a goal $g_t$ to the lower level, but it only emits a new goal once every $c$ steps. I think you can see it as a path planner handing the lower level its current sub-goal $g_t$.
\[ g_t = \begin{cases} \mu^{\text{hi}}(s_t) & \text{if } t \equiv 0 \pmod{c} \\ h(s_{t-1}, g_{t-1}, s_t) = s_{t-1} + g_{t-1} - s_t & \text{otherwise} \end{cases} \]
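A tiny sketch of this goal-selection rule (function names are mine, not from the released code; the goal-transition step matches `calc_low_level_goal` defined at the bottom of these notes):

    def select_goal(t, c, s_prev, g_prev, s_t, high_policy):
        """Query the higher level every c steps; otherwise transform the previous
        sub-goal so it keeps pointing at the same absolute target state."""
        if t % c == 0:
            return high_policy(s_t)        # fresh sub-goal mu_hi(s_t)
        return s_prev + g_prev - s_t       # h(s_{t-1}, g_{t-1}, s_t)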
Let $g_t$ be the goal that was stored before the lower-level policy was updated, and let $\tilde{g}_t$ be a relabeled goal. We want to replace $g_t$ with a $\tilde{g}_t$ such that, given $\tilde{g}_t$, the current lower-level policy would still have produced the stored actions.

      Old                     New
    Higher                  Higher
      |                       |
      | g_t                   | \tilde{g}_t
      v                       v
    Lower                   Lower
What they did was relabel the high-level action: among a set of candidate goals, pick the $\tilde{g}_t$ that maximizes the (log-)probability of the stored lower-level actions $a_{t:t+c-1}$ under the current lower-level policy, and train the higher level as if $\tilde{g}_t$ had been the goal it issued.
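A rough sketch of that correction (my own reconstruction, not the authors' implementation): assume a deterministic lower-level policy whose action likelihood is Gaussian around $\mu^{\text{lo}}(s, g)$, so the log-likelihood is proportional to the negative squared error; candidates are the original goal, the state difference $s_{t+c} - s_t$, and Gaussian samples around it.

    import numpy as np

    def relabel_goal(states, actions, low_policy, orig_goal, num_candidates=8, sigma=0.5):
        """Pick the candidate sub-goal that makes the stored low-level actions most
        likely under the *current* low-level policy (hypothetical helper).
        states = s_t..s_{t+c}, actions = a_t..a_{t+c-1}."""
        diff = states[-1] - states[0]                       # s_{t+c} - s_t
        candidates = [orig_goal, diff] + [
            diff + sigma * np.random.randn(*diff.shape) for _ in range(num_candidates)
        ]
        best_g, best_logp = orig_goal, -np.inf
        for g0 in candidates:
            g, logp = g0, 0.0
            for s, a, s_next in zip(states[:-1], actions, states[1:]):
                # log N(a; mu_lo(s, g), const) is proportional to -||a - mu_lo(s, g)||^2
                logp += -np.sum((a - low_policy(s, g)) ** 2)
                g = s + g - s_next                          # goal transition h(s, g, s')
            if logp > best_logp:
                best_g, best_logp = g0, logp
        return best_g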
In the paper they used TD3 (a variant of DDPG) to learn the policy $\mu_{\phi}$. The behavior policy (used for exploration) is $a_t \sim \mathcal{N}(\mu_{\phi}(s_t), \sigma)$, where $\sigma$ is a fixed standard deviation.
\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \]
where \(\mu_{\phi}(s_t) = \arg\max_{a_t} Q(s_t, a_t)\).
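The tabular rule above is only schematic; with continuous actions the max is realized through a (target) actor. A minimal DDPG-style update sketch in PyTorch (my own simplification: single critic, no target-policy smoothing or delayed actor updates, which TD3 adds):

    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        """mu_phi(s): deterministic policy."""
        def __init__(self, s_dim, a_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(),
                                     nn.Linear(64, a_dim), nn.Tanh())
        def forward(self, s):
            return self.net(s)

    class Critic(nn.Module):
        """Q(s, a)."""
        def __init__(self, s_dim, a_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(),
                                     nn.Linear(64, 1))
        def forward(self, s, a):
            return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

    def ddpg_update(actor, critic, actor_targ, critic_targ,
                    actor_opt, critic_opt, batch, gamma=0.99):
        s, a, r, s_next, done = batch
        with torch.no_grad():
            # r + gamma * Q_targ(s', mu_targ(s')) plays the role of the max in the tabular rule
            target = r + gamma * (1.0 - done) * critic_targ(s_next, actor_targ(s_next))
        critic_loss = nn.functional.mse_loss(critic(s, a), target)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()
        # deterministic policy gradient: push mu_phi toward actions the critic scores highly
        actor_loss = -critic(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()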
Higher Level Controller has tuples of $(s_{t:t+c-1}, \tilde{g}_{t:t+c-1}, a_{t:t+c-1}, R_{t:t+c-1}, s_{t+c})$.
Lower Level Controller has tuples of $((s_t, g_t), a_t, R_t, (s_{t+1}, g_{t+1}))$, i.e. the goal is treated as part of the lower level's state input.
Put them in their own replay buffers and optimize the two networks separately.
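A minimal sketch of the two buffers with uniform sampling (the `rb.low` / `rb.high` naming is chosen to match the loop below, not taken from the released code):

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-capacity FIFO buffer; one instance per controller."""
        def __init__(self, capacity=100_000):
            self.buf = deque(maxlen=capacity)
        def append(self, transition):
            self.buf.append(transition)
        def sample(self, batch_size=128):
            return random.sample(self.buf, min(batch_size, len(self.buf)))

    class HierarchicalBuffers:
        """Groups the two buffers so they can be referenced as rb.low / rb.high."""
        def __init__(self):
            self.low = ReplayBuffer()
            self.high = ReplayBuffer()

    rb = HierarchicalBuffers()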
    # sketch of the training loop; assumes env, agent, add_noise, c and the update-frequency constants exist
    for e in range(num_epochs):
        s, goal_coord = env.reset()              # goal_coord: the environment's task goal
        done = False
        count = 0
        r_sum = 0.0                              # episode return (for logging)
        low_g = agent.high_policy(s)             # initial sub-goal from the higher level
        high_transition = [s, low_g, 0.0, [], []]  # begin_s, subgoal, sum_rewards, [states], [actions]
        while not done:
            a = agent.low_policy(s, low_g)       # lower level is conditioned on the sub-goal
            a = add_noise(a)                     # behavior policy: a ~ N(mu_phi(s, g), sigma)
            n_s, high_r, done, info = env.step(a)
            low_r = calc_low_level_reward(s, low_g, n_s)     # intrinsic reward -||s + g - s'||
            next_low_g = calc_low_level_goal(s, low_g, n_s)  # goal transition h(s, g, s')
            rb.low.append(((s, low_g), a, low_r, (n_s, next_low_g), done))
            high_transition[2] += high_r
            high_transition[3].append(s)
            high_transition[4].append(a)
            count += 1
            r_sum += high_r
            if count % c == 0 or done:
                high_transition.append(n_s)      # end_s, reached after c lower-level steps
                # the stored subgoal (the high-level action) is relabeled by the
                # off-policy correction before it is used for training
                rb.high.append(high_transition)
                low_g = agent.high_policy(n_s)   # fresh sub-goal from the higher level
                high_transition = [n_s, low_g, 0.0, [], []]
            else:
                low_g = next_low_g
            s = n_s
            if count % low_update_freq == 0:
                batch_low = rb.low.sample()
                agent.train_low_level(batch_low)   # TD3 update of mu_phi (deterministic policy gradient)
            if count % high_update_freq == 0:
                batch_high = rb.high.sample()      # subgoals in the batch are relabeled here
                agent.train_high_level(batch_high)
    import numpy as np

    def calc_low_level_goal(st, gt, n_s):
        """Goal transition h(s, g, s'): keep the sub-goal pointing at the same absolute target."""
        return st + gt - n_s

    def calc_low_level_reward(st, gt, n_s):
        """Intrinsic reward for the lower level: negative distance to the current sub-goal."""
        return -np.linalg.norm(st + gt - n_s)
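A quick check of the helpers with toy 2-D positions (my own example): if the agent ends up exactly at $s + g$, the remaining goal is zero and the intrinsic reward takes its maximal value of 0.

    s = np.array([0.0, 0.0])
    g = np.array([1.0, 2.0])
    s_next = np.array([1.0, 2.0])
    print(calc_low_level_goal(s, g, s_next))    # [0. 0.]  -- sub-goal fully reached
    print(calc_low_level_reward(s, g, s_next))  # -0.0     -- reward peaks at the sub-goal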