|Abstract (English)|| |
Chapter 1: Introduction. This chapter gives the research context of the thesis. It presents the motivation for developing learning capabilities for technical systems, especially robotic systems. The emphasis is on the need for higher flexibility and autonomy of such systems, in order to remove the barrier of highly specialized knowledge required when robots are applied in industry or in service robotics. The problem of task-oriented behavior is viewed from the programming and learning perspectives. Three common approaches for accomplishing such high-level behavior are addressed: learning from demonstration, reinforcement learning and task-oriented motion planning. Current accomplishments in all three fields are referenced, and the specific advantages and disadvantages of each approach are covered. Based on this, the motivation for the research direction taken in this thesis is pointed out. The research hypothesis and goals are defined. At the end of the chapter, the structure of the thesis is given.
Chapter 2: Learning from demonstration. At the beginning of this chapter, a broader overview of problems specific to learning from demonstration (LfD) is given. The correspondence problem is pointed out, together with the most frequent demonstration mechanisms. Methods for encoding demonstrated trajectories are covered, both those based on statistical modeling and those based on dynamical systems. The dynamic movement primitive (DMP) parametrization and its characteristics are covered in detail, as it is used for trajectory encoding in other parts of the thesis. Two main approaches for building generalization capability into learning from demonstration methods are covered: task parametrization and inverse reinforcement learning. A novel methodology for the analysis of demonstrations, based on trajectories obtained by kinesthetic teaching, is proposed. The method uses a novel classification mechanism to determine attracting points, non-attracting points and obstacle points in the working environment of the robot. Experimental results of this methodology are presented and discussed at the end of the chapter.
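For orientation, a commonly used one-dimensional discrete DMP formulation can be written as below. This is a sketch of the standard form from the literature and not necessarily the exact variant used in the thesis; all symbols (position y, scaled velocity z, phase x, goal g, start y_0, time constant \tau, gains \alpha_z, \beta_z, \alpha_x, weights w_i, basis centers c_i and widths h_i) are notation assumed here for illustration.

```latex
\begin{align}
\tau \dot{z} &= \alpha_z \bigl( \beta_z (g - y) - z \bigr) + f(x), \\
\tau \dot{y} &= z, \\
\tau \dot{x} &= -\alpha_x x, \\
f(x) &= \frac{\sum_{i=1}^{N} \psi_i(x)\, w_i}{\sum_{i=1}^{N} \psi_i(x)}\; x\, (g - y_0),
\qquad \psi_i(x) = \exp\bigl( -h_i (x - c_i)^2 \bigr).
\end{align}
```

In this form the phase x decays exponentially from 1 toward 0, so the learned forcing term f(x) vanishes over time and the spring-damper part pulls y to the goal g. The weights w_i are the parameters that encode the shape of a demonstrated trajectory.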
Chapter 3: Task-oriented trajectory planning. Demonstration sampling and analysis by the methodology from the previous chapter are performed in Cartesian space. In this chapter, task-oriented reproduction of trajectories in the same domain is performed. Common trajectory representations in robotics that are suitable for planning in both configuration space and operational space are covered, together with their parametrizations. As the thesis focuses on the application of primitive motions in task-oriented programming, the chapter also gives an overview of their use in task-oriented scenarios. A modified DMP representation is presented which is capable of explicitly using the information obtained by the demonstration analysis. It encodes variational information in the low-level DMP trajectory definition, and achieves this by introducing a modified time function in place of the standard exponential decay function. The methodology was originally presented in the conference paper: Task Dependent Trajectory Learning from Multiple Demonstrations Using Movement Primitives. After this, a Cartesian optimization-based path planning model is proposed, based on the following paper: Learning from Demonstration Based on a Classification of Task Parameters and Trajectory Optimization. The model encodes the information from the demonstration analysis by approximating the identified via-points and avoiding the identified obstacles. The planned path is then transferred into a DMP trajectory using the special DMP state representation presented earlier. The trajectory planning approach is verified on an experimental setup.
Chapter 4: Reinforcement learning in continuous environments. As models learned from demonstrations often fail to produce a completely accurate task solution in the extrapolation phase, this chapter considers the idea of local trajectory improvement through self-exploration. Reinforcement learning (RL) provides the general framework for achieving this. The chapter therefore gives a theoretical overview of RL, which provides the basis for explaining the methodology used in the continuous-space scenario. Policy search methods are identified as the most suitable for improvements at the trajectory level with continuous parametrization. Two main approaches to policy search are covered: critic-based approaches and "black-box" optimization (BBO). Both perform learning directly in the parameter space by observing and evaluating the agent's interactions with the environment. However, BBO approaches simplify the required evaluation mechanism while achieving performance comparable to that of critic-based approaches. Possible policy representations in the trajectory domain are covered in a separate subsection. The application of a BBO algorithm together with a DMP policy parametrization is demonstrated at the end of the chapter.
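To make the idea of a DMP policy parametrization concrete, the following minimal sketch integrates a one-dimensional discrete DMP whose basis-function weights act as the policy parameters. It assumes the standard formulation from the literature with an exponentially decaying phase; the function name `dmp_rollout`, the gain values and the basis-width heuristic are illustrative assumptions, not the thesis's implementation.

```python
import math

def dmp_rollout(weights, y0, goal, tau=1.0, dt=0.001,
                alpha_z=25.0, beta_z=6.25, alpha_x=4.0):
    """Integrate a 1-D discrete DMP with explicit Euler steps.

    `weights` are the policy parameters (one per Gaussian basis function).
    All gains are assumed, commonly used default values.
    """
    n = len(weights)
    if n < 2:
        raise ValueError("need at least two basis functions")
    # Basis centers: evenly spaced in time, mapped through the phase decay.
    centers = [math.exp(-alpha_x * i / (n - 1)) for i in range(n)]
    # Width heuristic (an assumption): inverse squared spacing of the centers.
    widths = [1.0 / (centers[i + 1] - centers[i]) ** 2 for i in range(n - 1)]
    widths.append(widths[-1])

    x, y, z = 1.0, y0, 0.0  # phase, position, scaled velocity
    trajectory = []
    for _ in range(int(round(tau / dt))):
        psi = [math.exp(-h * (x - c) ** 2) for h, c in zip(widths, centers)]
        total = sum(psi)
        # Forcing term: normalized weighted basis sum, gated by the phase.
        forcing = 0.0
        if total > 1e-12:
            forcing = sum(p * w for p, w in zip(psi, weights)) / total
            forcing *= x * (goal - y0)
        dz = (alpha_z * (beta_z * (goal - y) - z) + forcing) / tau
        dy = z / tau
        dx = -alpha_x * x / tau
        z, y, x = z + dz * dt, y + dy * dt, x + dx * dt
        trajectory.append(y)
    return trajectory

# With zero weights the DMP reduces to a critically damped reach to the goal.
plain = dmp_rollout([0.0] * 5, y0=0.0, goal=1.0)
shaped = dmp_rollout([50.0] * 5, y0=0.0, goal=1.0)
```

A black-box optimizer would treat the `weights` list as the search vector, execute the resulting trajectory, and use only the scalar task cost as feedback.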
Chapter 5: Iterative learning for stochastic tasks. The BBO policy search methodology presented in the previous chapter implies direct interaction of the agent (robot) with the environment. As searching the parameter space of the trajectory policy in a real environment is risky and can lead to physical damage to both the robot and the environment, a simulation setup suitable for robot learning is introduced here. The setup is based on Gazebo, a physics simulator integrated with ROS. On top of this, a task-oriented iterative learning setup is proposed. At its core is black-box optimization in the form of the evolutionary CMA-ES algorithm. The policy responsible for executing trajectories in the simulation environment is parametrized as a DMP. The CMA-ES algorithm updates the policy weight parameters with respect to a task-oriented cost function. This closes the policy search loop, which is run iteratively until the task learning converges. The methodology was tested on two tasks: a peg-in-hole task and a sweeping task. Since the tasks showed high stochasticity with respect to the goal-oriented cost functions, two criteria for evaluating such learning processes were proposed: a best-current solution metric and a current average metric. The former keeps track of the best solution achieved during the policy search, while the latter gives information about the overall quality of the learning process.
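The structure of such an iterative loop, and the two evaluation criteria, can be sketched as follows. This is deliberately not CMA-ES: it is a simplified elite-averaging evolution strategy with a fixed isotropic step size, used here only as a stand-in. The function `noisy_cost` is a hypothetical replacement for a simulated rollout, and the current average metric is interpreted as the mean cost of the current population, which is an assumption about the thesis's definition.

```python
import random

def noisy_cost(theta, rng):
    """Hypothetical stand-in for a simulated rollout: quadratic cost plus noise."""
    target = [1.0, -2.0, 0.5]
    clean = sum((t - g) ** 2 for t, g in zip(theta, target))
    return clean + rng.gauss(0.0, 0.05)

def policy_search(theta0, iterations=60, population=12, elite=4,
                  sigma=0.3, seed=0):
    """Simplified evolution strategy (no covariance or step-size adaptation)."""
    rng = random.Random(seed)
    mean = list(theta0)
    best_cost = float("inf")
    best_so_far, current_avg = [], []
    for _ in range(iterations):
        # Sample candidate policies around the current mean.
        samples = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                   for _ in range(population)]
        # One noisy evaluation per candidate, as with a stochastic task.
        costs = [noisy_cost(s, rng) for s in samples]
        ranked = sorted(zip(costs, samples), key=lambda pair: pair[0])
        # Move the mean to the average of the lowest-cost candidates.
        elites = [s for _, s in ranked[:elite]]
        mean = [sum(column) / elite for column in zip(*elites)]
        # Metric 1: best solution found so far (never increases).
        best_cost = min(best_cost, ranked[0][0])
        best_so_far.append(best_cost)
        # Metric 2: mean cost of the current population.
        current_avg.append(sum(costs) / population)
    return mean, best_so_far, current_avg

mean, best_so_far, current_avg = policy_search([0.0, 0.0, 0.0])
```

Because each evaluation is noisy, the best-so-far curve can be flattered by a single lucky rollout, whereas the population average reflects how reliably the current search distribution performs. Tracking both gives a more complete picture of the learning process.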
Chapter 6: LfD as a basis for iterative learning. The iterative learning algorithm presented in the previous chapter was initialized by an empirical strategy which used a linear trajectory. Previous research suggested that search in large parameter spaces depends strongly on the initial conditions and that exploration is mostly local. In this chapter, results of the iterative learning algorithm are given for the case where it is initialized from demonstrations. The demonstration methodology followed the one presented in Chapter 2, which involved kinesthetic guidance for demonstration collection and the coordinate frame classification methodology for extracting useful via-points. The initial Cartesian DMP trajectories were constructed using the optimization-based methodology from Chapter 3. The obtained results showed that the LfD initialization strategy led to significantly better results in terms of the quality of the solutions found, as well as faster convergence to applicable solutions. Findings presented in this chapter are based on the following paper: Accelerating Robot Trajectory Learning for Stochastic Tasks.
Chapter 7: Conclusion. This chapter summarizes the doctoral thesis and discusses its main achievements. The main contributions are: I) a novel learning from demonstration method for the analysis of trajectory-level demonstrations, based on the classification of coordinate frames, II) an optimization-based Cartesian trajectory planning algorithm with coordinate frame approximation and obstacle avoidance capabilities, III) a simulation-based iterative learning framework for task-oriented trajectory learning, compatible with the LfD methodology. Future research will focus on finding more efficient algorithms for policy search with sparse evaluation and on testing the applicability of different policy representations. The possibilities for automatic estimation of exploration rates will be explored, as well as the automatic extraction of end-result-oriented cost/reward functions, in order to remove the need for hand-crafted functions.