Lecture 1: Supervised Learning of Behaviors: Deep Learning, Dynamic Systems, and Behavior Cloning
□ Terminology and Notation
○ The state satisfies the Markov property, but the observation does not (see the formula below)
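Written out as a formula (a standard statement of this claim; x_t denotes the state, o_t the observation, and u_t the action, following the lecture's notation):

```latex
% Markov property of the state: the next state depends only on the current state and action
p(x_{t+1} \mid x_t, u_t) = p(x_{t+1} \mid x_1, u_1, \ldots, x_t, u_t)
% In general this fails for observations: the observation history still carries extra information
p(o_{t+1} \mid o_t, u_t) \neq p(o_{t+1} \mid o_1, u_1, \ldots, o_t, u_t)
```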
□ Imitation Learning
○ Supervised learning with observations as input and actions as output, taken from expert demonstrations (a minimal code sketch follows below)
○ Does it work? No!
- Supervised learning still makes small errors.
- A small error can drive the policy into a state not covered by the training data, where a wrong action makes the error larger, and these errors compound over time.
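As a concrete illustration, here is a minimal behavior-cloning sketch: a policy network is fit to (observation, action) pairs by ordinary supervised regression. The PyTorch architecture, the dimensions, and the random stand-in data are hypothetical choices, not the lecture's setup.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: flattened observations of size obs_dim, continuous actions of size act_dim.
obs_dim, act_dim = 64, 2

# A small policy network pi_theta(u_t | o_t); the architecture is an illustrative choice.
policy = nn.Sequential(
    nn.Linear(obs_dim, 128),
    nn.ReLU(),
    nn.Linear(128, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # regress expert actions for continuous control

# Stand-ins for recorded expert data sampled from p_data(o).
observations = torch.randn(1000, obs_dim)    # observations o_t
expert_actions = torch.randn(1000, act_dim)  # expert actions u_t

for epoch in range(100):
    pred_actions = policy(observations)          # pi_theta(o_t)
    loss = loss_fn(pred_actions, expert_actions) # supervised imitation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```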
○ Bojarski et al., End to End Learning for Self-Driving Cars, arXiv, 2016
- In addition to the front-facing camera, two more cameras are mounted on the left and right.
- The left/right camera views can be regarded as shifted and rotated versions of the front camera view, and they provide training data beyond the frontal view.
* e.g., when the car drives straight, from the left camera's viewpoint it looks as if the car should curve to the right.
- Data obtained from the front, left, and right cameras is further shifted and rotated at random and added to the training data (a hedged augmentation sketch follows below).
- This is a way to address the shortage of training data for self-driving cars.
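A hedged sketch of the side-camera idea. The fixed steering correction value and the random camera choice are simplifying assumptions; the paper derives the adjusted steering label from the viewpoint transformation rather than a constant offset.

```python
import random

# Hypothetical steering correction (in normalized steering units); stands in for the
# geometry-based label adjustment used in the paper.
CORRECTION = 0.2

def augment_sample(center_img, left_img, right_img, steering):
    """Return one (image, steering) training pair, possibly from a side camera.

    From the left camera the car appears shifted left, so the label is corrected to
    steer slightly right (+CORRECTION); the right camera is the mirror case.
    """
    choice = random.choice(["center", "left", "right"])
    if choice == "left":
        return left_img, steering + CORRECTION
    if choice == "right":
        return right_img, steering - CORRECTION
    return center_img, steering
```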
○ Learning from a stabilizing controller (※ covered in detail in the next lecture)
○ Can we make p_data(o_t) equal to p_pi_theta(o_t)?
- The observations in the training data and the observations obtained by running the learned policy are not i.i.d. (independent and identically distributed).
* A typical approach to imitation learning is to train a classifier or regressor to predict an expert’s behavior given training data of the encountered observations (input) and actions (output) performed by the expert. However since the learner’s prediction affects future input observations/states during execution of the learned policy, this violates the crucial i.i.d. assumption made by most statistical learning approaches. (Ross et al., AISTATS, 2011)
○ DAgger: Dataset Aggregation (Ross et al., AISTATS, 2011) ※ see the paper for details
- Goal: collect training data from p_pi_theta(o_t) instead of p_data(o_t)
- How? Just run pi_theta(u_t|o_t), but we need labels u_t
- Algorithm (simple version); a code sketch of this loop follows the note below
* 1. Train pi_theta(u_t|o_t) on human demonstration data D = {o_1, u_1, ..., o_N, u_N}
* 2. Run pi_theta(u_t|o_t) to get D_pi = {o_1, ..., o_M}
* 3. Ask a human to label D_pi with actions u_t
* 4. Aggregate: D <- D ∪ D_pi and repeat from step 1
※ Step 3 is the problem: it is hard for a human to specify the right action at a particular moment.
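A sketch of the simple DAgger loop above; train_policy, run_policy, and query_expert_labels are hypothetical callables standing in for supervised training, rolling out pi_theta, and the human labeling of step 3.

```python
def dagger(initial_demos, train_policy, run_policy, query_expert_labels,
           num_iterations=10):
    """Dataset Aggregation (simple version).

    initial_demos: list of (observation, action) pairs from the expert.
    train_policy(dataset) -> policy: supervised learning on the aggregated data.
    run_policy(policy) -> list of observations visited by pi_theta (D_pi).
    query_expert_labels(observations) -> expert actions u_t for D_pi (step 3).
    """
    dataset = list(initial_demos)          # D
    policy = None
    for _ in range(num_iterations):
        policy = train_policy(dataset)     # 1. train pi_theta(u_t|o_t) on D
        observations = run_policy(policy)  # 2. run pi_theta to collect D_pi
        labels = query_expert_labels(observations)  # 3. human labels for D_pi
        dataset += list(zip(observations, labels))  # 4. D <- D ∪ D_pi
    return policy
```

Over iterations the aggregated dataset comes to reflect the observations the learned policy actually visits, which is exactly the distribution-mismatch problem this algorithm targets.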
○ Imitation learning: recap
- Often (but not always) insufficient by itself
* Distribution mismatch problem
- Sometimes works well
* Hacks (e.g., left/right images)
* Sample from a stable trajectory distribution
* Add more on-policy data, e.g. using DAgger
○ Case study 1: trail following as classification (Giusti et al., IEEE Robotics and Automation Letters, 2015)
- Training data for the quadrotor is collected by mounting three cameras on a person's head.
○ Case study 2: DAgger & domain adaptation (Daftry et al., ISER, 2016)
○ Case study 3: Imitation with LSTMs (Rahmatizadeh et al., 2016)
○ Other topics in imitation learning
- Structured prediction (covered later)
- Interaction & active learning
- Inverse reinforcement learning
* instead of copying the demonstration, figure out the goal
○ Imitation learning: what's the problem?
- humans need to provide data, which is typically finite
* Deep learning works best when data is plentiful
- humans are not good at providing some kinds of actions
- humans can learn autonomously; can our machines do the same?
* unlimited data from own experience
* continuous self-improvement
□ Learning without humans
○ cost/reward function
○ A cost function for imitation?
- c(x, u) = -log p(u = pi^*(x) | x)
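A minimal sketch of this cost for a discrete action space; expert_action_probs is a hypothetical model of the expert's action distribution p(u | x) in the current state.

```python
import numpy as np

def imitation_cost(expert_action_probs, u, eps=1e-8):
    """Cost c(x, u) = -log p(u = pi*(x) | x) of taking action index u
    when the expert's action distribution in the current state is expert_action_probs."""
    return -np.log(expert_action_probs[u] + eps)

# Example: the expert prefers action 0 with probability 0.9
probs = np.array([0.9, 0.05, 0.05])
print(imitation_cost(probs, 0))  # low cost: agrees with the expert
print(imitation_cost(probs, 2))  # high cost: disagrees with the expert
```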
○ The trouble with cost & reward functions
- Cases where the reward is not clearly defined
- Rusu et al., Sim-to-Real Robot Learning from Pixels with Progressive Nets, 2016 (※ a DeepMind paper)
○ A note about terminology: the "R" word
- Minimum of expected total cost or maximum of expected total reward (written out below)
* Used as the problem definition of reinforcement learning
* Also often used to refer specifically to model-free reinforcement learning
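The objective written out (a standard finite-horizon form; the horizon T and the trajectory distribution p_pi are the usual assumptions):

```latex
\min_{\pi}\; \mathbb{E}_{\tau \sim p_\pi}\left[\sum_{t=1}^{T} c(x_t, u_t)\right]
\quad\Longleftrightarrow\quad
\max_{\pi}\; \mathbb{E}_{\tau \sim p_\pi}\left[\sum_{t=1}^{T} r(x_t, u_t)\right],
\qquad r(x_t, u_t) = -c(x_t, u_t)
```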