18
Object data and manifolds

This final chapter contains some extensions to shape analysis, especially applications to object data and more general manifolds.

18.1 Object oriented data analysis

The techniques of shape analysis have natural extensions to other application areas. The very broad field of object oriented data analysis (OODA) is concerned with analysing types of data objects that differ from conventional univariate or multivariate data. The concept of OODA was proposed by J.S. Marron. Examples of object data include functions, images, shapes, manifolds, dynamical systems, and trees. The main aims of shape analysis extend more generally to object data, for example defining a distance between objects, estimation of a mean, summarizing variability, reducing dimension to important components, specifying distributions of objects and carrying out hypothesis tests. Following Marron and Alonso (2014), an important consideration in any study is to decide what are the atoms (most basic parts) of the data. A key question is ‘what should be the data objects?’, and the answer will then lead to appropriate methodology for statistical analysis. The subject is fast developing following initial definitions in Wang and Marron (2007), and a recent summary with discussion is given by Marron and Alonso (2014) with applications to Spanish human mortality functional data, shapes, trees and medical images.

One of the key aspects of object data analysis is that registration of the objects must be considered as part of the analysis. In addition the identifiability of models, choice of regularization and whether to marginalize or optimize as part of the inference (see Chapters 14 and 16) are important aspects of object data analysis, as they are in shape analysis (Dryden 2014). We have already described many applications of object oriented data, for example functional outline data analysis in Section 16.4; averaging images in Section 17.4; and shape and size-and-shape analysis of landmark data.

18.2 Trees

A particularly challenging type of object data is that of trees, for example collections of phylogenetic trees (Holmes 2003) or geometric trees constructed from medical images of the brain (Skwerer et al. 2014).

Billera et al. (2001) introduced ground-breaking work on the study of the geometry of the space of phylogenetic trees, and in particular showed that the space has negative curvature. The space of trees can be thought of as an ‘open book’, where ‘pages’ of the book are Euclidean spaces which are all joined down a common spine. Each page of the open book represents trees with the same topological structure, but different lengths of branches. Trees on different pages have different topological structures. The geodesic path between two trees on the same page is just a straight line, but between trees on different pages the geodesic consists of straight lines that join at the spine. The space of trees is an example of a stratified space, where the Euclidean pages are the different strata. Phylogenetic trees have a fixed set of leaves. Kendall and Colijn (2015) provide a metric for comparing phylogenetic trees and fast computation of median trees.
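The open book picture can be made concrete in a toy setting. The Python sketch below assumes that every page is a two-dimensional half-plane, that a point is recorded by its page label, its coordinate along the spine and its distance from the spine, and that all names (`BookPoint`, `open_book_distance`) are hypothetical. It is only an illustration of the geodesic description above: it is not the algorithm of Owen (2011), and geodesics in a general tree space can pass through several strata.

```python
import math
from typing import NamedTuple


class BookPoint(NamedTuple):
    """A point in a toy open book: half-plane pages glued along one spine."""
    page: int      # label of the page (stratum) containing the point
    spine: float   # coordinate along the shared spine
    height: float  # distance from the spine, must be >= 0


def open_book_distance(a: BookPoint, b: BookPoint) -> float:
    """Geodesic distance between two points of the toy open book."""
    if a.height < 0 or b.height < 0:
        raise ValueError("height is a distance from the spine and must be >= 0")
    along = a.spine - b.spine
    if a.page == b.page:
        # Same page: the page is Euclidean, so the geodesic is a straight line.
        return math.hypot(along, a.height - b.height)
    # Different pages: unfold the two pages into one plane. The straight line
    # in the unfolded plane crosses the spine, so the geodesic is made of two
    # straight segments that join at the spine.
    return math.hypot(along, a.height + b.height)
```

In this toy model two points directly opposite each other on different pages are separated by the sum of their distances from the spine, which is the unfolding argument in its simplest form.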

Wang and Marron (2007) studied sets of trees and emphasized that different types of methods are required to analyse such object data, and they coined the phrase ‘object oriented data analysis’ from this work. However, working with trees is difficult. Owen (2011) provided a fast algorithm for computing geodesic distances in tree space which is required for statistical analysis, for example using the bootstrap (Holmes 2003), PCA (Aydin et al. 2009), a geodesic principal path (Nye 2011), and diffusion models (Nye and White 2014).

Feragen et al. (2010, 2013) introduce a distance for comparing geometric tree shapes, the quotient Euclidean distance, via quotienting out different representations of the same tree. The edges contain attributes, for example geometric points along a curve. Feragen et al. (2010, 2013) apply the methodology for averaging and comparing geometric trees to examples of trees from medical images of lung airways, and compare with the classical tree-edit distance.

Tree space contains strata that introduce curious effects, in particular the notion of ‘stickiness’ where a mean tree shape sticks to a particular lower dimensional stratum. For example, Hotz et al. (2013) develop sticky central limit theorems in tree space, and Barden et al. (2013) study central limit theorems for Fréchet means in the space of phylogenetic trees.

There are other types of stratified manifolds where it is of interest to develop statistical procedures. For example, the cone of symmetric positive semi-definite matrices contains different strata for the subspaces of different ranks [e.g. see Bhattacharya et al. (2013) and Hotz et al. (2013) for further discussion of stratified manifolds].

18.3 Topological data analysis

Topological data analysis is a relatively new area of study bringing together pure mathematical ideas and data analysis. Carlsson (2009) provides a summary and introduction to the area. The basic idea is that topological features (e.g. number of holes, Betti numbers) are recorded as a filtration of simplicial complexes is applied to the dataset. Particular types of filtration are the Čech complex, the Vietoris–Rips complex and the Delaunay complex. Sets of geometrical shapes/patterns (simplicial complexes) are obtained by applying the filtration with different parameters, and observing how the features persist over different scales is key to the concept of persistent homology. Persistence diagrams are constructed and then compared using suitable distance metrics, such as the Wasserstein metric. Fréchet means can be constructed (Munch et al. 2015) and general statistical procedures can be applied. For example, Gamble and Heo (2010) explore the uses of persistent homology for statistical analysis of landmark-based shape data and Heo et al. (2012) introduce a topological analysis of variance with applications in dentistry. Other work includes that by Bubenik et al. (2010), who consider a non-parametric regression problem on a compact Riemannian manifold to estimate persistent homology consistently; Turner et al. (2014), who introduce a persistent homology transform for modelling shapes and surfaces, which is a sufficient statistic for representing shape; and Kim et al. (2014), who discuss the potential for using persistent homology to study phylogenetic trees in the microbiome. Nakamura et al. (2015) use a kernel-based distance between persistence diagrams in order to describe differences between different types of materials.
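The simplest instance of persistence is the zero-dimensional case, which tracks connected components. The Python sketch below is a minimal illustration under stated assumptions: it uses a Vietoris–Rips style filtration in which two points are joined once their Euclidean distance is at most the scale ε, every component is born at scale 0, and the function names `h0_persistence` and `betti0` are hypothetical. Higher-dimensional features such as holes require a full simplicial complex and are not covered here.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Death scales of connected components as the scale parameter grows.

    Each point starts as its own component at scale 0. Edges are added in
    order of length, and a component dies when an edge merges it into
    another. One component never dies and is given death scale infinity.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # This edge joins two separate components, so one of them dies.
            parent[root_i] = root_j
            deaths.append(length)
    if n > 0:
        deaths.append(math.inf)
    return deaths


def betti0(points, eps):
    """Number of connected components still alive at scale eps."""
    return sum(1 for death in h0_persistence(points) if death > eps)
```

Plotting the pairs (0, death) gives the zero-dimensional persistence diagram under these assumptions; components with large death scales are the features that persist.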

18.4 General shape spaces and generalized Procrustes methods

18.4.1 Definitions

We can also study more general shape spaces. The work follows Carne (1990); see also Dryden (1999) and Dryden and Mardia (1991a).

Definition 18.1 Consider a point T in a differentiable manifold M. Define a group G which acts on M, and denote the transformation of T by g ∈ G by g(T). The general shape space Σ(M, G) ≡ M/G is defined as the orbit space of M under the action of G. The equivalence class in Σ(M, G) corresponding to T ∈ M is the set G(T), and we call this set the general shape of T.

A suitable choice of distance between T and Y in Σ(M, G) is given by:

$$d_\Sigma(T, Y) = \inf_{g \in G} \operatorname{dist}\big(h(T), g(Y)\big),$$

where h ∈ G is chosen so that d_Σ(T, Y) = d_Σ(Y, T), and dist( · ) is a suitable choice of distance in M. For example, dist( · ) could be the Euclidean metric if M were a Euclidean space.

The special case of the shape space of Chapter 3 that we have primarily considered is when M = ℝ^{km} \ 𝒞 with Euclidean metric (𝒞 is the set of coincident points) and G is the Euclidean similarity group of transformations. The size-and-shape space of Chapter 5 involves the isometry group of rotation and translation only. Ambartzumian (1990) considers the general affine shape, when M = ℝ^{km} \ 𝒟 (𝒟 is the set of deficient rank configurations) and G is the affine transformation group (see Section 12.2). Also, in Section 16.4 we had M = L², the Hilbert space of square integrable functions, and G is the space of diffeomorphic warpings.

For some choices of manifold M a non-Euclidean metric will be the natural choice. For example, if M = S^p (the unit sphere), then the great circle metric could be used. Le (1989a,b) has considered the shape space for points on a sphere, where the registration group is rotations only. Stoyan and Molchanov (1995) considered the manifold to be the system of non-empty compact sets in ℝ^m, and the registration group consisted of rotations and translations.

18.4.2 Two object matching

As we saw in Section 4.1.1, the calculation of distances in shape space can be regarded as finding the minimum Euclidean distance between configurations, by superimposing or registering them as closely as possible using the similarity transformations. Consider two configurations of k points in ℝ^m: {t_j : j = 1, …, k} and {y_j : j = 1, …, k}, which we write as k × m matrices:

$$T = [t_1, \ldots, t_k]^{\mathrm{T}}, \qquad Y = [y_1, \ldots, y_k]^{\mathrm{T}}.$$

We shall assume throughout this section that k ≥ m + 1. We wish to match T to Y, using

$$Y = h(T, B) + E, \qquad (18.1)$$

where h( · ) is a known smooth function, B are the parameters and E is the error matrix. We could consider estimating B by minimizing some objective function s²(E), a function of the error matrix E = Y − h(T, B).

Alternatively, we could formulate the situation in terms of a statistical model, where Y is perturbed from h(T, B) by the random zero mean matrix E from some specified distribution. The unknown parameters B, and any parameters in the distribution of E (e.g. a variance parameter), could then be estimated say by maximum likelihood or some other inferential procedure.

Our primary concern throughout has been the least squares case. Hence, in this case the objective function used is s²(E) = ‖E‖² = trace(EᵀE). Equivalently, the implicit model for the errors vec(E) is independent multivariate normal. If the functions h(T, B) are similarity transformations of T, then this special case reduces to the Procrustes analysis of Chapter 7. Some other choices for objective function are discussed in Section 13.6.
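For planar data (m = 2) the least squares match under similarity transformations has a simple closed form if each landmark is stored as a complex number, since a rotation and scaling is then multiplication by a single complex number b and a translation is addition of a complex number c. The Python sketch below is an illustration under that assumption only; the function name `fit_similarity_2d` and the representation are hypothetical choices, and the general m-dimensional case needs the matrix methods of Chapter 7.

```python
def fit_similarity_2d(t, y):
    """Least squares similarity match of planar landmarks t onto y.

    t, y: equal-length sequences of complex numbers, one per landmark.
    Returns (b, c, rss) where b * t_j + c approximates y_j, abs(b) is the
    fitted scale, the phase of b is the fitted rotation angle, and rss is
    the residual sum of squares ||E||^2.
    """
    k = len(t)
    if k != len(y) or k < 2:
        raise ValueError("need two configurations with the same k >= 2 landmarks")
    t_bar = sum(t) / k
    y_bar = sum(y) / k
    # Centring removes the translation from the problem.
    tc = [z - t_bar for z in t]
    yc = [w - y_bar for w in y]
    size2 = sum(abs(z) ** 2 for z in tc)
    if size2 == 0:
        raise ValueError("all landmarks of t coincide")
    # Complex least squares regression of the centred y on the centred t.
    b = sum(z.conjugate() * w for z, w in zip(tc, yc)) / size2
    c = y_bar - b * t_bar
    rss = sum(abs(w - b * z) ** 2 for z, w in zip(tc, yc))
    return b, c, rss
```

If y is an exact similarity transformation of t then rss is zero up to rounding error, and b and c recover the rotation, scale and translation that were applied.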

18.4.3 Generalized matching

If we have a random sample of n objects T_1, …, T_n (T_j ∈ M, k × m matrices) from a population, then it is of interest to obtain an estimate of the population mean shape μ (k × m matrix) and to explore the structure of variability, up to invariances in the set of transformations G.

An estimate of the population mean configuration μ up to invariances in G, denoted by μ̂, can be obtained by simultaneously matching each T_j to μ (j = 1, …, n) and choosing μ̂ as a suitable M-estimator (subject to certain constraints on μ). In particular, μ̂ is obtained from the constrained minimization

$$\hat{\mu} = \arg\inf_{\mu} \sum_{j=1}^{n} \phi\Big( \inf_{g_j \in G} s\big(g_j(T_j) - \mu\big) \Big), \qquad (18.2)$$

where ϕ(x) ≥ 0 is a penalty function (an increasing function) on ℝ⁺, ϕ(0) = 0, g_j(T_j) is a transformed version of T_j by g_j ∈ G, s(E) is the objective function for matching two configurations and general restrictions need to be imposed on μ to avoid degeneracies. A common choice of estimator has ϕ(x) = x², the least squares choice. In this special case, if the g_j are known then the minimizing value of μ is:

$$\hat{\mu} = \frac{1}{n} \sum_{j=1}^{n} g_j(T_j).$$

Example 18.1 For M = ℝ^{km} \ 𝒞 (configurations with the coincident points 𝒞 removed) and G the Euclidean similarity transformations, the full Procrustes mean shape is obtained by solving Equation (18.2) with ϕ(x) = x², s²(E) = ‖E‖² = trace(EᵀE) subject to the constraints that ‖μ‖ = 1, μᵀ1_k = 0 and μ is rotated to a fixed axis.   □

18.5 Other types of shape

There are many studies into different aspects of shape in several disciplines. For example, the Gestalt theory of shape looks at shape from physical, physiological and behavioural points of view, for example Zusne (1970). Nagy (1992) gave some general observations about shape theory.

There is a vast amount of literature on shape in computer graphics, computer science and geometry, including the work of Koenderink (). Mumford (1991) gives some important observations about theories of shape and questions whether or not they model human perception. As well as discussing curvature he describes various metrics for shape. Following the work of Tversky (1977) he argues that human perception about the similarity of objects A and B is often different from the perception of similarity between objects B and A, and may also be affected by contextual information. Hence a metric, which is symmetric in its arguments, may not always be the best way to measure ‘similarity’.

Further wide ranging reviews of notions of shape, particularly in image analysis, are given by Neale and Russ (2012).

18.6 Manifolds

The main ideas of shape analysis can be extended to many other types of manifolds too. Of fundamental importance is the need to define a distance, to define population and sample means, to describe variability, to define probability distributions and to carry out statistical inference.

Mardia and Jupp (2000) describe in detail the analysis of data on circles and spheres, which are the simplest types of non-Euclidean manifold. There is a great deal of relevant distributional work given by Watson (1983) and Chikuse (2003) on spheres and other symmetric spaces. The topic of manifold data analysis is too large and broad to summarize here, and so instead we give a few select examples. Turaga et al. (2011) study Stiefel manifolds and Grassmannians for image and video-based recognition; Rosenthal et al. (2014) develop spherical regression using projective transformations; pure states in quantum physics can be represented as points in complex projective space (Kent 2003; Brody 2004) and Guta et al. (2012) develop maximum likelihood estimation methods for the extension to mixed quantum states; Kume and Le (2000, 2003) consider Fréchet means in simplex shape spaces, which are hyperbolic spaces which have negative curvature; Mardia et al. (1996d), Mardia and Patrangenaru (2005) and Kent and Mardia (2012) consider cross-ratios and data analysis with projective invariance; Dryden et al. (2009b) and Zhou et al. (2013, 2016) study symmetric positive-definite (SPD) matrices (covariance matrices), and use a Procrustes procedure applied to diffusion tensors in medical image analysis; Jayasumana et al. (2013a) use kernel methods for studying SPD matrices; Arsigny et al. (2006) and Fillard et al. (2007) give many practical examples on the same manifolds using the log-Euclidean metric with applications to clinical diffusion tensor MRI estimation, smoothing and white matter fibre tracking; Yuan et al. (2013) study varying coefficient models for diffusion tensors; and Pigoli et al. (2014) extend the Procrustes approach for distances in Hilbert spaces between covariance functions, with applications to comparing sounds in Romance languages.

18.7 Reviews

A summary of such a broad and wide-ranging topic as shape analysis cannot be complete. We conclude by mentioning some other reviews and collections of material on shape analysis, which naturally have different emphases from our own.

Some books on the topic include those by Bookstein (1991), Lele and Richtsmeier (2001), Claude (2008), Weber and Bookstein (2011), Bookstein (2014) and Zelditch et al. (2012) with many biological examples; Kendall et al. (1999), Small (1996) and Younes (2010) from a mathematical perspective; Grenander and Miller (2007), Davies et al. (2008b) and Neale and Russ (2012) on deformation models and image analysis; da Fontoura Costa and Marcondes Cesar Jr (2009) on shape classification; Bhattacharya and Bhattacharya (2008), Brombin and Salmaso (2013) and Patrangenaru and Ellingson (2015) on non-parametric shape analysis; Stoyan and Stoyan (1994, Part II) on outline data; and Stoyan et al. (1995, Chapter 8) on unlabelled configurations.

The topic of shape analysis has entries in the Encyclopaedia of Statistical Sciences (Kotz and Johnson 1988) under the titles of ‘Shape Statistics’ (D.G. Kendall), ‘Size and Shape Analysis’ (Mosimann) and ‘Landmark Data’ (Mardia) and in the Encyclopaedia of Biostatistics (Dryden 2005a). Finally, some general reviews include Kendall (1989), Rohlf and Marcus (1993), Pavlidis (1995), Loncaric (1998), Adams et al. (2004, 2013), Heo and Small (2006), Slice (2007), Mitteroecker and Gunz (2009) and Kendall and Lee (2010); and some edited volumes involving shape analysis include those by Ying-Lie et al. (1994), MacLeod and Forey (2002), Slice (2005), Krim and Yezzi (2006), Evison and Bruegge (2010), Hamelryck et al. (2012), Breuß et al. (2013), Dryden and Kent (2015) and Turaga and Srivastava (2015).
