Proposes a hierarchical reinforcement learning method that learns atomic (low-level) behaviors via imitation learning and fine-tunes them via reinforcement learning, simplifying the long-horizon policy learning problem. How? Via a novel "data-relabeling algorithm" for learning goal-conditioned hierarchical policies (a sketch of the relabeling idea follows).
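A minimal sketch of what windowed goal relabeling can look like: slide a fixed window over an unsegmented demonstration and treat states reached within the window as goals, turning one trajectory into many goal-conditioned tuples. The window size, tuple format, and `relabel_demo` helper are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def relabel_demo(states, actions, window=30):
    """Relabel one unsegmented demonstration into goal-conditioned data.

    For each timestep t, every state reached within `window` steps is
    treated as a goal s_g, yielding (s_t, a_t, s_g) training tuples.
    """
    relabeled = []
    T = len(actions)
    for t in range(T):
        for k in range(t + 1, min(t + 1 + window, T + 1)):
            relabeled.append((states[t], actions[t], states[k]))
    return relabeled

# toy usage: a 1-D random-walk "demonstration"
states = np.cumsum(np.random.randn(11))   # T+1 states
actions = np.random.randn(10)             # T actions
data = relabel_demo(states, actions, window=5)
print(len(data), "relabeled goal-conditioned transitions")
```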
With no access to task-specific supervision, the method leverages unstructured and unsegmented demonstrations for imitation learning.
Recent RL methods are constrained to relatively simple, short-horizon skills, which motivates HRL. However, HRL struggles with exploration (h-DQN), skill segmentation (options), and reward definition (Diversity Is All You Need). The problem is simplified here by utilizing extra supervision in the form of unstructured human demonstrations.
Goal-conditioned RL learns $\pi(a \mid s, s_g)$ to maximize the goal-conditioned expected reward $\mathbb{E}_{s_g \sim G}\!\left[\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^t\, r(s_t, a_t, s_g)\right]\right]$.
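For concreteness, a small sketch of evaluating that objective for a single rollout under a fixed goal; `sparse_reward` is an assumed example reward form, not the paper's.

```python
import numpy as np

def goal_conditioned_return(states, actions, goal, reward_fn, gamma=0.99):
    """Discounted return sum_t gamma^t * r(s_t, a_t, s_g) for one rollout."""
    return sum(gamma ** t * reward_fn(s, a, goal)
               for t, (s, a) in enumerate(zip(states, actions)))

def sparse_reward(s, a, goal, eps=0.1):
    # illustrative sparse reward: 1 when the state is within eps of the goal
    return float(np.linalg.norm(s - goal) < eps)
```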
Goal-conditioned imitation learning learns $\pi(a \mid s, s_g)$ to maximize $\mathbb{E}_{(s,a,s_g) \sim D}\!\left[\log \pi(a \mid s, s_g)\right]$. (See Learning to Achieve Goals.)
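A minimal sketch of the corresponding goal-conditioned behavioral-cloning loss, assuming a unit-variance Gaussian policy so that maximizing log-likelihood reduces to mean squared error on the action mean; the `GoalConditionedPolicy` architecture is illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """pi(a | s, s_g): concatenate state and goal, output an action mean."""
    def __init__(self, state_dim, goal_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def forward(self, s, s_g):
        return self.net(torch.cat([s, s_g], dim=-1))

def bc_loss(policy, s, a, s_g):
    # maximizing log pi(a | s, s_g) under a unit-variance Gaussian
    # is equivalent to minimizing MSE between predicted and demo actions
    return ((policy(s, s_g) - a) ** 2).mean()
```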