This chapter reviews approaches for task representation in robot programming by demonstration (PbD). Based on the level of task abstraction, the methods are categorized into high‐level task representation at the symbolic level of abstraction and low‐level task representation at the trajectory level of abstraction. Techniques for data preprocessing related to trajectory scaling and alignment are also discussed in the chapter.
The PbD framework aims at learning from multiple demonstrations of a skill performed under similar conditions. For a set of M demonstrations, the perception data are denoted by {X_m}, m = 1, 2, …, M, with X_m = {x_{m,t}}, t = 1, 2, …, T_m, where m is used for indexing the demonstrations, t is used for indexing the measurements within each demonstration, and T_m denotes the total number of measurements of the demonstration sequence X_m. Each measurement x_{m,t} represents a D‐dimensional vector. The form of the measurements depends on the data acquisition system(s) employed for perception of the demonstrations, and it can encompass the following:
The data structure for the task perception with optical sensors and vision cameras is presented in Sections 2.1 and 2.2.
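As a concrete illustration of this data structure, a minimal Python sketch is given below; the variable names and demonstration lengths are hypothetical, chosen only to mirror the notation above (M demonstrations, each a sequence of T_m measurements of dimension D):

```python
import numpy as np

# Hypothetical container: M demonstrations, each an array of shape (T_m, D).
# Here D = 3 (e.g., Cartesian x, y, z positions) and the lengths T_m differ
# across demonstrations, as is typical for human-recorded motions.
rng = np.random.default_rng(0)
demonstrations = [rng.standard_normal((T_m, 3)) for T_m in (95, 100, 110)]

M = len(demonstrations)                          # number of demonstrations
lengths = [X.shape[0] for X in demonstrations]   # T_1, ..., T_M
D = demonstrations[0].shape[1]                   # dimensionality of each measurement
```

Because the lengths T_m generally differ, the sequences cannot be stacked into a single array without the temporal scaling discussed later in the chapter.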
The objective of PbD is to map the demonstrated space into a sequence of low‐level command signals for a robot learner to reproduce the demonstrated task. As outlined in Chapter 2, this mapping is often highly nonlinear, and therefore represents a challenging problem. It usually consists of several steps. The task modeling and task analysis phases encode the recorded data into a compact and flexible representation of the demonstrated motions and extract the task features relevant for achieving the required robot performance. In the task planning step, a trajectory for task reproduction is derived by generalization over the dataset of demonstrated task examples. The generalized trajectory for task reproduction is denoted by X_gen = {x_t}, t = 1, 2, …, T_gen, where T_gen is the number of time measurements of the generalized trajectory. Afterward, the generalized trajectory is translated into robot commands and loaded onto the robot platform for task execution in the real environment.
Representation of observed tasks in robot PbD is often categorized based on the level of task abstraction into high‐level and low‐level representation (Schaal et al., 2003). As elaborated in Section 1.4.2, high‐level task representation refers to encoding tasks as a hierarchical sequence of high‐level behaviors (Dillmann, 2004; Saunders et al., 2006). This representation category is also known as symbolic task representation. In general, the elementary behaviors are predefined, and the observed tasks are initially segmented into sequences of behaviors. The task representation then involves defining rules and conditions on the state of the world that are required for each elementary action to occur, as well as rules and conditions required for each elementary action to end and the successor behavior to begin. The principal drawbacks of this representation approach are its reliance on a library of predefined elementary behaviors for encoding tasks, and its limited applicability for representing tasks that require continuous high accuracy along a three‐dimensional (3D) path or a velocity profile.
Low‐level task representation is also referred to as representation at the trajectory level, where the task demonstrations are encoded as continuous 3D trajectories in the Cartesian space or as continuous angle trajectories in the joint space (Section 1.4.2.2). The PbD approaches presented in the book employ this type of task abstraction, due to its ability to represent arbitrary motions and to define the spatial constraints of the demonstrations (Aleotti and Caselli, 2005; Calinon, 2009). Additionally, low‐level task representation makes it possible to specify the velocities and accelerations at different phases of the demonstrated task, and allows encoding of tasks with high‐accuracy demands.
Statistical methods have been widely used in robotics for representing uncertain information about the state of the environment. On one hand, processing of the robot's sensory data involves handling the uncertainties that originate from sensor noise, limitations of the sensors, and unpredictable changes in the environment. On the other hand, processing the control actions involves tackling the uncertainties regarding the way decisions are made to achieve the desired level of motor performance. Statistical frameworks represent the uncertainties of the robot's perception and action via probability distributions, instead of relying on a single best guess about the state of the world. As a result, the use of statistical models has contributed to improved robustness and performance in many robotic applications, such as localization and navigation of mobile robots, planning, and map generation from sensory data (Thrun, 2000).
Regarding robot learning from observation of human demonstrations, the theory of statistical modeling has been exploited for representing the uncertainties of the acquired perceptual data (Calinon, 2009). Indeed, one must bear in mind that a fundamental property of human movements is their random nature. Humans are not capable of drawing perfect straight lines or repeating identical movements, which is assumed to be due to the inherent stochastic nature of the neural information processing of the required actions (Clamann, 1969). Statistical algorithms provide a means of encapsulating the random variations in the observed demonstrations by deriving the probability distributions of the outcomes from several repeated measurements. Consequently, a model of a human skill is built from several examples of the same skill demonstrated in a similar fashion and under similar conditions. The underlying variability across the repeated demonstrations is utilized to probabilistically represent the different components of the task, and subsequently, to retrieve a generalized version of the demonstrated trajectories.
Within the published literature, the hidden Markov model (HMM) has been established as one of the most popular statistical methods for modeling human motions. Other approaches that have been used for probabilistic skill encoding include the following: Gaussian mixture models (GMM) (Calinon, 2009), support vector machines (Zollner et al., 2002; Martinez and Kragic, 2008), and Bayesian belief networks (Coates et al., 2008; Grimes and Rao, 2008).
Generalization from multiple task examples imposes the need to address the problem of temporal variability in the demonstrated data. Often, it is required to initially scale the demonstrated data to sequences with equal length before proceeding with the data analysis. The approaches of linear scaling and dynamic time warping (DTW) scaling have been employed in many works for this purpose (Pomplun and Mataric, 2000; Gribovskaya and Billard, 2008; Ijspeert et al., 2012).
Linear scaling refers to changing the number of measurements of a sequence through interpolation between the existing data points. For a set of collected demonstration trajectories with different numbers of time frames, scaling to a new set of sequences with an equal number of time frames is achieved by linearly scaling the time vector of each individual trajectory. This procedure is equivalent to extending or shortening the sequences to the required number of time measurements. Among the different interpolation techniques for time series data, polynomial interpolation is the most frequently used. More specifically, low‐order polynomials are locally fitted to create a smooth and continuous function, referred to as a spline.
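A minimal NumPy sketch of this scaling step is shown below. For brevity it uses piecewise‐linear interpolation via `numpy.interp`; a spline interpolant (e.g., `scipy.interpolate.CubicSpline`) could be substituted for the smoother low‐order polynomial fit described above. The function and variable names are ours, chosen for illustration:

```python
import numpy as np

def scale_to_length(trajectory, T_new):
    """Rescale a (T, D) trajectory to T_new samples.

    The original time vector is mapped onto [0, 1] and the trajectory is
    resampled at T_new evenly spaced points, dimension by dimension.
    """
    trajectory = np.asarray(trajectory, dtype=float)
    T, D = trajectory.shape
    t_old = np.linspace(0.0, 1.0, T)
    t_new = np.linspace(0.0, 1.0, T_new)
    return np.column_stack(
        [np.interp(t_new, t_old, trajectory[:, d]) for d in range(D)]
    )

# Example: extend a 5-sample 2D trajectory to 9 samples.
traj = np.stack([np.arange(5.0), np.arange(5.0) ** 2], axis=1)
scaled = scale_to_length(traj, 9)
```

Note that the start and end points of the trajectory are preserved exactly; only the sampling density along the (normalized) time axis changes.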
For illustration, Figure 3.1a shows two sample trajectories with different numbers of measurements, whereas Figure 3.1b displays their counterparts after linear time scaling. Accordingly, the length of the test trajectory is scaled to the number of measurements of the reference trajectory. For sequences that differ significantly in length, this method might not be very effective, since the temporal variations in the demonstrations can result in spatial misalignments across the set.
DTW is an algorithm for comparing and aligning time series, which finds an optimal alignment of sequences through nonlinear time warping computed with a dynamic programming technique. The DTW alignment for the two sequences from Figure 3.1a is illustrated in Figure 3.1c. The main advantage of DTW scaling over linear time scaling is thus the efficient alignment of the signals for handling spatial distortions.
The DTW sequence alignment is based on forming a matrix of distances between two time series, and finding an optimal path through the matrix that locally minimizes the distance between the sequences. For a given reference sequence x of length T_χ and a test sequence y of length T_γ, the distance matrix H is formed as follows:

H(χ, γ) = ‖x_χ − y_γ‖, for χ = 1, 2, …, T_χ and γ = 1, 2, …, T_γ. (3.1)

In (3.1), the notation ‖·‖ is used for denoting the Euclidean l2‐norm. Note that other norms can also be used as a distance measure, such as the l1‐norm (Holt et al., 2007). The optimal alignment path is a sequence of elements of the matrix H, w_k = (χ_k, γ_k) for k = 1, 2, …, K. The path is calculated by minimizing the cumulative sum of the distances H(χ, γ) and the minimum cumulative distance of the neighboring cells, i.e.,

S(χ, γ) = H(χ, γ) + min{S(χ − 1, γ), S(χ − 1, γ − 1), S(χ, γ − 1)}.
The procedure for implementing DTW includes the following steps: (1) form the distance matrix H between the reference and test sequences; (2) compute the cumulative distances, where each cell accumulates H(χ, γ) plus the minimum cumulative distance of its neighboring cells; and (3) trace back the optimal warping path from the cell (T_χ, T_γ) to the cell (1, 1) by following the neighboring cells with minimum cumulative distance.
The following constraints are enforced for the warping path: boundary conditions, requiring the path to start at the cell (1, 1) and end at the cell (T_χ, T_γ); continuity, requiring each step of the path to move to an adjacent cell of the matrix; and monotonicity, requiring the path indices to be nondecreasing in time.
The boundary conditions define the starting and ending points of the path, whereas the continuity and monotonicity conditions constrain the path to be continuous and monotonically spaced in time, respectively.
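The procedure above can be sketched compactly in Python for one‐dimensional sequences; the function and variable names are ours, and the boundary, continuity, and monotonicity constraints are enforced implicitly by the recursion and the backtracking over the three admissible neighbor cells:

```python
import numpy as np

def dtw_align(reference, test):
    """Align two 1-D sequences with DTW.

    Returns the cumulative distance matrix S and the optimal warping path
    as a list of (chi, gamma) index pairs (0-based) satisfying the
    boundary, continuity, and monotonicity constraints.
    """
    x, y = np.asarray(reference, float), np.asarray(test, float)
    T_x, T_y = len(x), len(y)
    H = np.abs(x[:, None] - y[None, :])      # distance matrix, as in (3.1)
    S = np.full((T_x, T_y), np.inf)
    S[0, 0] = H[0, 0]
    for chi in range(T_x):
        for gamma in range(T_y):
            if chi == 0 and gamma == 0:
                continue
            prev = min(
                S[chi - 1, gamma] if chi > 0 else np.inf,
                S[chi - 1, gamma - 1] if chi > 0 and gamma > 0 else np.inf,
                S[chi, gamma - 1] if gamma > 0 else np.inf,
            )
            S[chi, gamma] = H[chi, gamma] + prev
    # Backtrack from (T_x - 1, T_y - 1) to (0, 0), always moving to the
    # admissible neighbor with the minimum cumulative distance.
    path = [(T_x - 1, T_y - 1)]
    while path[-1] != (0, 0):
        chi, gamma = path[-1]
        candidates = [(chi - 1, gamma), (chi - 1, gamma - 1), (chi, gamma - 1)]
        candidates = [(i, j) for i, j in candidates if i >= 0 and j >= 0]
        path.append(min(candidates, key=lambda c: S[c]))
    return S, path[::-1]

# Example: the test sequence repeats its first sample, so DTW matches it
# to the reference at zero total cost by warping the first index.
S, path = dtw_align([0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 1.0, 2.0, 3.0])
```

The quadratic nested loop is written for clarity; production DTW implementations typically vectorize the recursion or restrict it to a band of cells, as discussed next.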
The DTW alignment can excessively distort the signals in the case of dissimilar sequences; therefore, shape‐preserving constraints are often imposed. Sakoe and Chiba (1978) proposed forcing the warped time waves to stay within a specified time‐frame window of their normalized time vector, and imposing a constraint on the slope of the warping path.
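A band constraint in this spirit can be sketched as an admissibility test on the cells of the distance matrix; cells outside the band are treated as infinitely costly in the cumulative‐distance recursion. The formulation below is our own simplified variant of the Sakoe–Chiba window, with r denoting the band half‐width as a fraction of the normalized time axis:

```python
def within_band(chi, gamma, T_x, T_y, r):
    """Return True if cell (chi, gamma) lies within a band of half-width r
    around the diagonal of the (T_x x T_y) grid, after normalizing both
    indices to a common time scale (0-based indices)."""
    return abs(chi * (T_y - 1) - gamma * (T_x - 1)) <= r * max(T_x - 1, T_y - 1)

# Example: a 100-sample reference against a 120-sample test, with a 10% band.
on_diagonal = within_band(50, 60, 100, 120, 0.1)    # near the normalized diagonal
off_diagonal = within_band(0, 119, 100, 120, 0.1)   # far corner, outside the band
```

Restricting the recursion to the band both prevents degenerate warpings and reduces the computation from all T_χ·T_γ cells to those near the diagonal.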
1D DTW refers to alignment of time series based on a single dimension of the data. When dealing with multidimensional data, it might result in suboptimal alignment across the remaining dimensions. A multidimensional DTW approach was proposed by Holt et al. (2007), where the alignment of the sequences involves the coordinates from all dimensions of the data. Hence, the distance matrix includes the Euclidean distance computed over all dimensions of the sequences at each time point, and has the following form:

H(χ, γ) = ( Σ_{i=1}^{d} (x_{χ,i} − y_{γ,i})² )^{1/2},

where d denotes the dimensionality of the sequences.
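The multidimensional distance matrix can be formed in a single vectorized step; the sketch below uses our own naming, with the reference and test sequences stored as (T_χ, d) and (T_γ, d) arrays:

```python
import numpy as np

def multidim_distance_matrix(reference, test):
    """Distance matrix for multidimensional DTW.

    reference: array of shape (T_x, d); test: array of shape (T_y, d).
    Entry (chi, gamma) is the Euclidean norm of the difference between the
    d-dimensional samples x_chi and y_gamma.
    """
    x = np.asarray(reference, float)
    y = np.asarray(test, float)
    diff = x[:, None, :] - y[None, :, :]          # shape (T_x, T_y, d)
    return np.sqrt((diff ** 2).sum(axis=-1))      # shape (T_x, T_y)

# Example with d = 2: the sample (3, 4) is at Euclidean distance 5 from (0, 0).
x = np.array([[0.0, 0.0], [3.0, 4.0]])
y = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]])
H = multidim_distance_matrix(x, y)
```

Once H is formed this way, the cumulative‐distance recursion and backtracking proceed exactly as in the one‐dimensional case.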
Chapter 3 begins with the formulation of the problem of learning trajectories in a PbD setting. Next, methods for task representation at a high and low level of abstraction are covered, followed by a brief overview of statistical methods for task representation. Their application in robot learning from multiple demonstrations is intuitive and natural, given the stochastic character of human motions. The chapter concludes with a discussion of two standard approaches for tackling the temporal variations of human‐demonstrated trajectories.