Kober, J. and Peters, J.R., 2009. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems (pp. 849-856).

Abstract

Proposes Policy Learning by Weighting Exploration with the Returns (PoWER), an EM-inspired policy-search algorithm that learns from episodic rewards and is applicable to complex motor learning tasks.

Why is it good?

Policy search is an alternative to value-function-based RL and has been successful with high-dimensional continuous state and action spaces.

Core Technology

Policy Gradient

\[J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left[ R(\tau) \right] = \int p_{\theta}(\tau) R(\tau) d\tau \\ \log J(\theta') = \log \int p_{\theta}(\tau) R(\tau) \frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} d\tau \geq \log J(\theta) + \frac{1}{J(\theta)} \int p_{\theta}(\tau) R(\tau) \log \frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} d\tau \\\]

where the inequality is Jensen's inequality applied with the normalized density $q(\tau) = p_{\theta}(\tau)R(\tau)/J(\theta)$ (which requires $R(\tau) \geq 0$). The terms $\log J(\theta)$ and $1/J(\theta) > 0$ do not depend on $\theta'$, so maximizing the bound over $\theta'$ amounts to maximizing the lower-bound term $L_{\theta}(\theta') = \int p_{\theta}(\tau) R(\tau) \log \frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} d\tau$; this is the M-step of the EM-style procedure.
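A quick numerical sanity check of this bound on a toy problem (my own example, not from the paper): "trajectories" are just categories of a discrete distribution, returns are fixed non-negative numbers, and the bound is evaluated for two random parameter vectors.

import numpy as np

rng = np.random.default_rng(0)
n = 5
R = rng.uniform(0.1, 1.0, size=n)              # non-negative returns R(tau) for each "trajectory"

def softmax(z):
  z = z - z.max()
  e = np.exp(z)
  return e / e.sum()

theta, theta_new = rng.normal(size=n), rng.normal(size=n)
p, p_new = softmax(theta), softmax(theta_new)  # p_theta(tau) and p_theta'(tau)

J_old, J_new = p @ R, p_new @ R                # expected returns J(theta), J(theta')
lower = np.log(J_old) + (p * R) @ np.log(p_new / p) / J_old
print(np.log(J_new), ">=", lower)              # the Jensen lower bound holds
assert np.log(J_new) >= lower - 1e-12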

\[\nabla_{\theta'} L_{\theta}(\theta') = \int p_{\theta}(\tau) R(\tau) \nabla_{\theta'} \log p_{\theta'}(\tau) d\tau \\ \nabla_{\theta'} L_{\theta}(\theta') = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left[ \sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'} (a_t|s_t) Q^{\pi}(s_t, a_t) \right] \\\]

where $\quad Q^{\pi}(s_t, a_t) = \sum_{t'=t}^{T} r(s_{t'}, a_{t'})$, assuming a finite horizon.
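As a concrete reading of this reward-to-go term (and of the compute_Q helper in the pseudocode below), here is a minimal numpy sketch; the per-rollout reward array and the exact function signature are my assumptions, not from the paper.

import numpy as np

def compute_Q(rewards):
  # rewards: shape (T,), the per-step rewards r(s_t, a_t) of one rollout.
  # Returns Q of shape (T,) with Q[t] = sum_{t'=t}^{T} r(s_t', a_t') (finite horizon, no discount).
  return np.cumsum(rewards[::-1])[::-1]

print(compute_Q(np.array([1.0, 0.0, 2.0])))  # -> [3. 2. 2.]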

For the policy, the paper uses a stochastic policy built from a linear deterministic policy $\bar{a} = \theta^T \phi(s_t)$ plus Gaussian exploration noise $\epsilon_t \sim \mathcal{N}(0, \hat{\Sigma})$ added in parameter space (state-dependent exploration). As a result, the policy is

\[a = (\theta + \epsilon_t)^T \phi(s_t) = \theta^T \phi(s_t) + \epsilon_t^T \phi(s_t)\]
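A minimal numpy sketch of sampling an action with this parameter-space (state-dependent) exploration; the names phi_s and Sigma and the array shapes are placeholders I chose for illustration.

import numpy as np

rng = np.random.default_rng(0)

def sample_action(theta, phi_s, Sigma):
  # theta: (d,) policy parameters; phi_s: (d,) basis features phi(s_t);
  # Sigma: (d, d) exploration covariance. The noise perturbs the parameters, not the action.
  eps = rng.multivariate_normal(np.zeros(len(theta)), Sigma)
  return (theta + eps) @ phi_s, eps            # return the noise too; the update below needs it

d = 3
theta, phi_s = np.zeros(d), rng.normal(size=d)
a, eps = sample_action(theta, phi_s, 0.1 * np.eye(d))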

Substituting this policy into the first-order optimality condition for the lower bound,

\[\nabla_{\theta'} L_{\theta}(\theta') = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left[ \sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'} (a_t|s_t) Q^{\pi}(s_t, a_t) \right] = 0,\]

and solving for $\theta'$ gives the PoWER update

\[\theta' \leftarrow \theta + \mathbb{E} \left[ \sum_{t=1}^T Q^{\pi}(s_t, a_t) \mathbf{W}(s_t) \right]^{-1} \mathbb{E} \left[ \sum_{t=1}^T \epsilon_t Q^{\pi}(s_t, a_t) \mathbf{W}(s_t) \right]\]

where $\mathbf{W}(s_t)$ is a weighting matrix that depends only on the basis functions $\phi(s_t)$ and the exploration covariance $\hat{\Sigma}$, not on the returns.
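A minimal numpy sketch of this update over a batch of rollouts (my own reading, not the paper's implementation): it assumes per-step arrays of basis features, exploration noise, and returns, and it takes $\mathbf{W}(s_t) = \phi(s_t)\phi(s_t)^T / \left(\phi(s_t)^T \hat{\Sigma} \phi(s_t)\right)$ as the state-dependent-exploration weighting.

import numpy as np

def power_update(theta, rollouts, Sigma):
  # theta: (d,) current parameters.
  # rollouts: list of (phis, epss, Qs) with phis (T, d), epss (T, d), Qs (T,).
  # Sigma: (d, d) exploration covariance used when the noise eps was sampled.
  d = len(theta)
  num = np.zeros(d)        # estimates E[ sum_t W(s_t) eps_t Q(s_t, a_t) ]
  den = np.zeros((d, d))   # estimates E[ sum_t W(s_t) Q(s_t, a_t) ]
  for phis, epss, Qs in rollouts:
    for phi, eps, Q in zip(phis, epss, Qs):
      W = np.outer(phi, phi) / (phi @ Sigma @ phi)   # assumed weighting matrix
      den += W * Q
      num += (W @ eps) * Q
  # The 1/N normalization of both expectations cancels in the ratio; a small ridge keeps den invertible.
  return theta + np.linalg.solve(den + 1e-8 * np.eye(d), num)

Note that, unlike a gradient step, this EM-style update needs no learning rate; the inverse of the weighted-return matrix plays that role.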

Verification Method

Ball-in-a-cup with a real robotic manipulator; the motor primitives are initialized by imitation learning from a demonstration and then improved with PoWER.

Any Argument or Idea?

Next Paper to read

Details

Pseudo Code

def initialize_policy():
  # Initialize the motor primitive (DMP) by imitation learning from a demonstration.
  filename = ()  # demonstration file, left unspecified in these notes
  path = load_path(filename)
  dmp = DMP_discrete(n_bfs=10, n_dmps=2)  # discrete DMP: 2 dimensions, 10 basis functions each
  dmp.imitate_path(path)
  return dmp

def main():
  policy = initialize_policy()
  while True:
    # Roll out the current policy and compute the returns Q(s_t, a_t)
    data = run_simulation(policy)
    Qs = compute_Q(data)
    # Importance sampling: keep only the high-return rollouts
    Qs, data, importance = reject_low_importance_data(Qs, data)
    # Reward-weighted (PoWER) update of the policy parameters
    new_policy = update_policy(policy, Qs, data, importance)

    # Stop once the parameter change is small enough; otherwise keep improving
    if error(new_policy, policy) < THETA_THRESHOLD:
      break
    policy = new_policy
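As I read it, the paper reuses previous rollouts via importance sampling, keeping the best-seen rollouts for each update. Below is one simplified sketch of the reject_low_importance_data step that keeps the k highest-return rollouts; the signature, the value of k, and the use of normalized returns as "importance" are my assumptions, not the paper's exact scheme.

import numpy as np

def reject_low_importance_data(Qs, data, k=10):
  # Qs: list of per-rollout reward-to-go arrays (so Qs[i][0] is rollout i's total return).
  # data: list of the corresponding rollout trajectories.
  totals = np.array([Q[0] for Q in Qs])
  keep = np.argsort(totals)[-k:]                    # indices of the k best rollouts
  importance = totals[keep] / totals[keep].sum()    # normalized weights for the update
  return [Qs[i] for i in keep], [data[i] for i in keep], importance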