# Adaptive Intelligent Mobile Robotics

William D. Smart, Presenter. Leslie Pack Kaelbling, PI. Artificial Intelligence Laboratory, MIT.

##### Presentation Transcript

1. Adaptive Intelligent Mobile Robotics • William D. Smart, Presenter • Leslie Pack Kaelbling, PI • Artificial Intelligence Laboratory • MIT

2. Progress to Date • Fast bootstrapped reinforcement learning • algorithmic techniques • demo on robot • Optical-flow based navigation • flow algorithm implemented • pilot navigation experiments on robot • pilot navigation experiments in simulation testbed

3. Making RL Really Work • Typical RL methods require far too much data to be practical in an online setting. We address this problem by • using strong generalization techniques • using human input to bootstrap learning • Let humans do what they’re good at • Let learning algorithms do what they’re good at

4. JAQL • Learning a value function in a continuous state and action space • based on locally weighted regression (fancy version of nearest neighbor) • algorithm knows what it knows • use meta-knowledge to be conservative about dynamic-programming updates

5. Problems with Q-Learning on Robots • Huge state spaces/sparse data • Continuous states and actions • Slow to propagate values • Safety during exploration • Lack of initial knowledge

6. Value Function Approximation • Use a function approximator instead of a table • generalization • deals with continuous spaces and actions • Q-learning with VFA has been shown to diverge, even in benign cases • Which function approximator should we use to minimize these problems? • [Diagram: state s and action a are fed into a function approximator F, which outputs Q(s,a)]
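
A minimal sketch of the idea on this slide, assuming a regressor with illustrative `predict`/`add_example` methods (such as the LWR model sketched under the next slide): each Q-learning backup becomes a training example for the approximator rather than a write into a table.

```python
import numpy as np

GAMMA = 0.9  # discount factor (illustrative value, not from the slides)

def q_backup(approximator, s, a, r, s_next, candidate_actions):
    """One Q-learning backup against a function approximator.

    Because the action space is continuous, the max over next actions is
    approximated by evaluating a finite set of candidate actions.
    """
    next_q = max(approximator.predict(np.concatenate([s_next, a_next]))
                 for a_next in candidate_actions)
    target = r + GAMMA * next_q
    # Instead of overwriting a table cell, add a training example; errors in
    # `predict` feed back into `target`, which is how divergence can arise.
    approximator.add_example(np.concatenate([s, a]), target)
```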

7. Locally Weighted Regression • Store all previous data points • Given a query point, find k nearest points • Fit a locally linear model to these points, giving closer ones more weight • Use KD-trees to make lookups more efficient • Fast learning from a single data point
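
A sketch of the procedure on this slide in Python, using SciPy's kd-tree for the neighbor lookup; the class name, the default `k`, and the Gaussian kernel are illustrative choices, not details from the talk.

```python
import numpy as np
from scipy.spatial import cKDTree

class LocallyWeightedRegression:
    def __init__(self, k=20, bandwidth=0.1):
        self.k, self.h = k, bandwidth
        self.X, self.y = [], []
        self.tree = None

    def add_example(self, x, y):
        # "Fast learning from a single data point": storing it is learning.
        self.X.append(np.asarray(x, dtype=float))
        self.y.append(float(y))
        self.tree = cKDTree(np.vstack(self.X))  # naive rebuild; fine for a sketch

    def predict(self, query):
        query = np.asarray(query, dtype=float)
        k = min(self.k, len(self.X))
        dist, idx = self.tree.query(query, k=k)
        dist, idx = np.atleast_1d(dist), np.atleast_1d(idx)
        Xn = np.vstack([self.X[i] for i in idx])
        yn = np.array([self.y[i] for i in idx])
        w = np.exp(-(dist / self.h) ** 2)        # closer points weigh more
        Wsqrt = np.diag(np.sqrt(w))              # weighted least squares
        A = np.hstack([Xn, np.ones((k, 1))])     # locally linear model [x, 1]
        beta, *_ = np.linalg.lstsq(Wsqrt @ A, Wsqrt @ yn, rcond=None)
        return float(np.append(query, 1.0) @ beta)
```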

8. Locally Weighted Regression • Original function

9. Locally Weighted Regression • Bandwidth = 0.1, 500 training points

10. Problems with Approximate Q-Learning • Errors are amplified by backups

11. One Source of Errors

12. Independent Variable Hull • Interpolation is safe; extrapolation is not, so • construct hull around known points • do local regression if the query point is within the hull • give a default prediction if not
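
One way to make this concrete (a sketch following Cook's independent variable hull, which this line of work builds on; the function name is illustrative): treat a query as safe interpolation only if its leverage under the local design matrix does not exceed the largest leverage of any stored point.

```python
import numpy as np

def within_ivh(X_neighbors, x_query):
    """True if x_query lies inside the independent variable hull of the
    neighborhood points X_neighbors (an (n, d) array)."""
    A = np.hstack([X_neighbors, np.ones((len(X_neighbors), 1))])
    G = np.linalg.pinv(A.T @ A)                    # pseudo-inverse for rank safety
    leverages = np.einsum('ij,jk,ik->i', A, G, A)  # diag(A G A^T)
    q = np.append(x_query, 1.0)
    return float(q @ G @ q) <= leverages.max()

# In predict(): answer "don't know" (a default value) outside the hull, so a
# backup never trusts an extrapolated Q-value.
```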

13. Recap • Use LWR to represent the value function • generalization • continuous spaces • Use IVH and “don’t know” • conservative predictions • safer backups

14. Incorporating Human Input • Humans can help a lot, even if they can’t perform the task very well. • Provide some initial successful trajectories through the space • Trajectories are not used for supervised learning, but to guide the reinforcement-learning methods through useful parts of the space • Learn models of the dynamics of the world and of the reward structure • Once learned models are good, use them to update the value function and policy as well.

15. Give Some Trajectories • Supply an example policy • Need not be optimal and might be very wrong • Coded or human-controlled • Used to generate experience • Follow the example policy and record experiences • Shows the learner “interesting” parts of the space • “Bad” initial policies might even be better (they cover more of the space)

16. Two Learning Phases: Phase One • [Diagram: the environment, the supplied control policy, and the learning system, linked by reward (R), observation (O), and action (A) signals; in phase one the supplied policy chooses the actions while the learning system observes]

17. Two Learning Phases: Phase Two • [Diagram: the same loop, but in phase two the learning system chooses the actions itself; a sketch of both phases follows]
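
A sketch of the two phases, assuming an environment with `reset`/`step`, a `supplied_policy` function, and a learner exposing `backup` and `greedy_action` (all names illustrative):

```python
def run_phase(env, choose_action, learner, n_steps):
    """Run one phase: `choose_action` is in control; the learner always
    does Q-learning backups off the experience it sees."""
    s = env.reset()
    for _ in range(n_steps):
        a = choose_action(s)
        s_next, r = env.step(a)
        learner.backup(s, a, r, s_next)
        s = s_next

# Phase one: the supplied policy drives; the learner only observes.
# run_phase(env, supplied_policy, learner, n_steps=1000)
# Phase two: the learner's own greedy policy takes over.
# run_phase(env, learner.greedy_action, learner, n_steps=1000)
```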

18. What does this Give Us? • Natural way to insert human knowledge • Keeps robot safe in early stages of learning • Bootstraps information into the Q-function

19. Experimental Results: Corridor-Following

20. Corridor-Following • 3 continuous state dimensions • corridor angle • offset from middle • distance to end of corridor • 1 continuous action dimension • rotation velocity • Supplied example policy • Average 110 steps to goal

21. Corridor-Following • Experimental setup • Initial training runs start from roughly the middle of the corridor • Translation speed has a fixed policy • Evaluation on a number of set starting points • Reward • 10 at end of corridor • 0 everywhere else
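
The reward structure, written out as a minimal sketch (the goal threshold is an assumption; the slides give only the reward values):

```python
def corridor_reward(state, goal_threshold=0.1):
    """State is (corridor angle, offset from middle, distance to end);
    reward is 10 at the end of the corridor and 0 everywhere else."""
    angle, offset, dist_to_end = state
    return 10.0 if dist_to_end < goal_threshold else 0.0
```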

22. Corridor-Following • [Graph: performance (steps to goal) after Phase 1 and Phase 2, compared with the average training performance and the “best” possible]

23. Corridor Following: Initial Policy

24. Corridor Following: After Phase 1

25. Corridor Following: After Phase 1

26. Corridor Following: After Phase 2

27. Conclusions • VFA can be made more stable • Locally weighted regression • Independent variable hull • Conservative backups • Bootstrapping value function really helps • Initial supplied trajectories • Two learning phases

28. Optical Flow • Get range information visually by computing optical flow field • nearer objects cause flow of higher magnitude • expansion pattern means you’re going to hit • rate of expansion tells you when • elegant control laws based on center and rate of expansion (derived from human and fly behavior)
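
A sketch of the “rate of expansion tells you when” idea, assuming a dense flow field such as the one OpenCV's `cv2.calcOpticalFlowFarneback` returns (H x W x 2) and a known focus of expansion; the result is in frames until contact:

```python
import numpy as np

def time_to_contact(flow, cx, cy):
    """Estimate time to contact from the flow field's rate of expansion
    about the focus of expansion (cx, cy)."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    rx, ry = xs - cx, ys - cy
    r = np.hypot(rx, ry) + 1e-9
    # Radial component of the flow: speed of each pixel away from the FOE.
    radial = (flow[..., 0] * rx + flow[..., 1] * ry) / r
    # Pure expansion gives radial flow proportional to r; the constant of
    # proportionality is the expansion rate, and its inverse is the time
    # to contact.
    rate = np.median(radial / r)
    return 1.0 / rate if rate > 1e-9 else np.inf
```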

29. Approaching a Wall

30. Balance Strategy • Simple obstacle-avoidance strategy • compute flow field • compute average magnitude of flow in each hemi-field • turn away from the side with higher magnitude (because it has closer objects)
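
A minimal sketch of the balance strategy, assuming the same dense flow representation; the sign convention and normalization are illustrative:

```python
import numpy as np

def balance_turn(flow):
    """Turn away from the image half with the larger average flow magnitude
    (closer objects produce faster flow). Returns a command in [-1, 1],
    where positive means turn right."""
    mag = np.hypot(flow[..., 0], flow[..., 1])
    w = mag.shape[1]
    left = mag[:, : w // 2].mean()
    right = mag[:, w // 2 :].mean()
    return float(np.clip((left - right) / (left + right + 1e-9), -1.0, 1.0))
```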

31. Balance Strategy in Action

32. Crystal Space

33. Crystal Space

34. Crystal Space

35. Next Steps • Extend the RL architecture to include model-learning and planning • Apply RL techniques to tune optical-flow parameters • Build topological maps using visual information • Build a highly complex simulated environment • Integrate planning and learning in a multi-layer system