So far, we've been limiting ourselves to two-dimensional data. After all, the human mind has a lot of trouble dealing with more than three dimensions, and even two-dimensional visualizations of three-dimensional space can be difficult to comprehend.
However, principal component analysis (PCA) can help. It projects higher-dimensional data onto a lower-dimensional space in a way that preserves the most significant relationships in the data: the new axes are chosen to capture as much of the variance as possible. This makes the data easier to visualize in two or three dimensions, and it also provides a way to select the most relevant features in a dataset.
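To make "captures the maximum amount of variance" concrete, here is the textbook formulation of the first principal component. The symbols are assumptions introduced for illustration and do not appear elsewhere in this recipe: $\Sigma$ is the covariance matrix of the centered data, $\mathbf{w}_1$ is the first principal component, $\lambda_1$ is the largest eigenvalue of $\Sigma$, and $\mathbf{x}$ is a single centered observation.

```latex
\mathbf{w}_1 = \arg\max_{\|\mathbf{w}\| = 1} \mathbf{w}^\top \Sigma\, \mathbf{w},
\qquad
\Sigma\, \mathbf{w}_1 = \lambda_1 \mathbf{w}_1,
\qquad
\operatorname{Var}\!\left(\mathbf{x}^\top \mathbf{w}_1\right) = \lambda_1
```

In words: among all unit-length directions, the first component is the one along which the projected data varies the most, and that direction is the leading eigenvector of the covariance matrix. Each later component solves the same problem while staying orthogonal to the earlier ones.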
In this recipe, we'll take the data from the US census by race that we've worked with in previous chapters, and create a two-dimensional scatter plot of it.
We'll use the same dependencies in our project.clj file as we did in Creating Scatter Plots with Incanter, and this set of imports in our script or REPL:
(require '[incanter.core :as i]
         '[incanter.charts :as c]
         '[incanter.io :as iio]
         '[incanter.stats :as s])
We'll use the aggregated census race data for all states. You can download this from http://www.ericrochester.com/clj-data-analysis/data/all_160.P3.csv. We'll assign it to the race-data variable:
(def race-data (iio/read-dataset "data/all_160.P3.csv" :header true))
We'll first summarize the data to make it more manageable and easier to visualize. Then we'll use PCA to project it on a two-dimensional space. We'll graph this view of the data:
;; Sum each race field by state, then join the per-field rollups on :STATE.
(def fields [:P003002 :P003003 :P003004 :P003005
             :P003006 :P003007 :P003008])
(def race-by-state
  (reduce #(i/$join [:STATE :STATE] %1 %2)
          (map #(i/$rollup :sum % :STATE race-data)
               fields)))

;; Convert the dataset to a matrix and select columns 1 through 7.
(def race-by-state-matrix (i/to-matrix race-by-state))
(def x (i/sel race-by-state-matrix :cols (range 1 8)))

;; Run PCA on the matrix.
(def pca (s/principal-components x))

;; Take the first two principal components and multiply the data by each.
(def components (:rotation pca))
(def pc1 (i/sel components :cols 0))
(def pc2 (i/sel components :cols 1))
(def x1 (i/mmult x pc1))
(def x2 (i/mmult x pc2))
Multiplying the data matrix by each of the first two principal components gives us x1 and x2. We'll use them to create a two-dimensional scatter plot:
(def pca-plot
  (c/scatter-plot x1 x2
                  :x-label "PC1", :y-label "PC2"
                  :title "Census Race Data by State"))
(i/view pca-plot)
This provides us with a graph expressing the most salient features of the dataset in two dimensions:
Conceptually, PCA projects the entire dataset onto a lower-dimensional space, rotated to the view that captures as much of the data's variability as that number of dimensions allows.
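The projection itself is just a matrix product. In the recipe, (i/mmult x pc1) takes the dot product of every row of the data with the loadings of one component. The following is a minimal sketch of that step in plain Clojure, without Incanter. The helper names dot and project, the toy rows, and the toy direction are all hypothetical and chosen only for illustration; the direction is not a real principal component of any dataset.

```clojure
;; Hypothetical helpers for illustration; these are not part of Incanter.
(defn dot
  "Dot product of two equal-length numeric sequences."
  [u v]
  (reduce + (map * u v)))

(defn project
  "Projects each row onto one component, returning one score per row."
  [rows component]
  (mapv #(dot % component) rows))

;; Toy data: three observations with two features each.
(def toy-rows [[1.0 2.0] [3.0 4.0] [5.0 6.0]])

;; A made-up unit-length direction standing in for a principal component.
(def toy-component [0.6 0.8])

;; One score per observation: its coordinate along the chosen direction.
(def toy-scores (project toy-rows toy-component))
```

Each score is a single number per observation, which is why two components are enough to give every state an x and a y coordinate on the scatter plot.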
In the preceding chart, we can see that most of the data points cluster around the origin. A few points trail off toward the larger values on the graph.
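A natural follow-up question is how much of the total variance the first two components actually capture. A minimal sketch in plain Clojure is shown here, assuming we have one standard deviation per component. The function names variance-explained and cumulative-variance are hypothetical, and the input values are made up for illustration rather than taken from the census data.

```clojure
;; Hypothetical helpers for illustration; these are not part of Incanter.
(defn variance-explained
  "Given per-component standard deviations, returns each component's
  share of the total variance."
  [std-devs]
  (let [variances (map #(* % %) std-devs)
        total     (reduce + variances)]
    (mapv #(/ % total) variances)))

(defn cumulative-variance
  "Running total of the variance shares, component by component."
  [std-devs]
  (vec (reductions + (variance-explained std-devs))))

;; Made-up standard deviations, not results from the census data.
(def example-shares (variance-explained [2.0 1.0 1.0]))
```

With the recipe's data, the call would presumably be (variance-explained (:std-dev pca)), assuming the map returned by s/principal-components carries a :std-dev entry with one value per component alongside :rotation. If the first two shares sum to a value close to 1, the two-dimensional plot loses little of the structure in the original seven columns.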