Distance calculations

As mentioned previously, Euclidean distance is commonly used to build the input for hierarchical clustering. Let's look at a simple example of how to calculate it with two observations and two variables/features.

Let's say that observation A costs $5.00 and weighs 3 pounds, and that observation B costs $3.00 and weighs 5 pounds. The Euclidean distance between A and B is the square root of the sum of the squared differences across the variables, which in our example would be as follows:

$$d(A, B) = \sqrt{(5 - 3)^2 + (3 - 5)^2} = \sqrt{8} \approx 2.83$$
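As a quick check, the same arithmetic can be written out by hand in R. This is only a minimal sketch: the names `a`, `b`, and `d_ab` are introduced here for illustration and are not part of the original example.

```r
# Observations as (cost, weight) vectors, using the values from the text
a <- c(cost = 5, weight = 3)
b <- c(cost = 3, weight = 5)

# Square root of the sum of the squared differences
d_ab <- sqrt(sum((a - b)^2))

round(d_ab, 2)  # 2.83
```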

The value of 2.83 is not meaningful in and of itself, but it is important in the context of the other pairwise distances. Euclidean distance is the default method in R's dist() function. You can specify other distance calculations (maximum, manhattan, canberra, binary, and minkowski) through the function's method argument. We will avoid going into detail on why or where you would choose these over Euclidean distance, as this can get rather domain-specific. For example, Euclidean distance may be inadequate when your data suffers from high dimensionality, such as in a genomic study. It will take domain knowledge and/or trial and error on your part to determine the proper distance measure.
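The following sketch shows how the same two observations might be passed to dist(), first with the default method and then with two of the alternatives. The matrix name `obs` and the result names are assumptions made for illustration; the method names are those accepted by dist().

```r
# Rows are observations, columns are variables (values from the text)
obs <- rbind(A = c(cost = 5, weight = 3),
             B = c(cost = 3, weight = 5))

# Default method is "euclidean"
d_euclidean <- dist(obs)

# Manhattan: sum of absolute differences, |5 - 3| + |3 - 5| = 4
d_manhattan <- dist(obs, method = "manhattan")

# Maximum: largest absolute difference, max(|5 - 3|, |3 - 5|) = 2
d_maximum <- dist(obs, method = "maximum")

d_euclidean
d_manhattan
d_maximum
```

With only two observations, each call returns a single pairwise distance. With more rows, dist() returns the lower triangle of the full distance matrix.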

One final note: scale your data so that each variable has a mean of zero and a standard deviation of one before calculating distances, so that the variables contribute on a comparable footing. If you do not, any variable measured on a larger scale will have a larger effect on the distances.
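To illustrate, here is a minimal sketch using base R's scale() function. The data are hypothetical: weight is recorded in ounces rather than pounds to exaggerate the difference in scale, and observation C is invented so that the standard deviations are not trivial.

```r
# Hypothetical data: cost in dollars, weight in ounces
# (A and B are the earlier observations converted to ounces; C is made up)
obs <- rbind(A = c(cost = 5, weight = 48),
             B = c(cost = 3, weight = 80),
             C = c(cost = 9, weight = 50))

# Unscaled distances are driven mostly by weight, the larger-scale variable
d_raw <- dist(obs)

# Center each column to mean 0 and rescale it to standard deviation 1
obs_scaled <- scale(obs)

colMeans(obs_scaled)      # approximately 0 for each column
apply(obs_scaled, 2, sd)  # 1 for each column

# Distances on the scaled data weigh cost and weight comparably
d_scaled <- dist(obs_scaled)

d_raw
d_scaled
```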
