Kernel density estimation

In order to explain KDE, let us generate some one-dimensional data and build some histograms. Histograms are a good way to understand the underlying probability distribution of the data.

We can generate the data and its histogram using the following code block:

> data <- rnorm(1000, mean=25, sd=5)
> data.1 <- rnorm(1000, mean=10, sd=2)
> data <- c(data, data.1)
> hist(data)
> hist(data, plot = FALSE)
$breaks
[1] 0 5 10 15 20 25 30 35 40 45

$counts
[1] 8 489 531 130 361 324 134 22 1

$density
[1] 0.0008 0.0489 0.0531 0.0130 0.0361 0.0324 0.0134 0.0022 0.0001

$mids
[1] 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5

$xname
[1] "data"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

This code creates two artificial datasets and combines them. Both datasets are drawn from normal distributions; the first has a mean of 25 and a standard deviation of 5, and the second has a mean of 10 and a standard deviation of 2. Recall from basic statistics that a normal distribution is bell-shaped, centered at the mean, and that about 99.7% of the values lie between mean-(3*sd) and mean+(3*sd). By combining the data from these two distributions, the new dataset has two peaks and a small area of overlap.

Let us look at the first histogram generated:

The R output shows the bin sizes and locations: look at $breaks (the bin edges) and $mids (the bin centers) in the output above.
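For instance, you can store the result of hist and access these components directly (the variable name hist.info is just for illustration):

> hist.info <- hist(data, plot = FALSE)  # compute the histogram without drawing it
> hist.info$breaks                       # bin edges
> hist.info$mids                         # bin centers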

What if we change the bin sizes as follows:

> hist(data, breaks = seq(0,50,2))

We provide the hist function with the breaks parameter, which is a sequence from 0 to 50 with a step size of two. We want our bin size to be two rather than the default width of five that R selected (see $breaks in the earlier output).

Let us look at the histogram:

As you can see, varying the bin size changes how the data is represented. You can test the same effect by varying the bin locations (the $mids), as shown below. A histogram therefore depends on choosing the right bin size and bin locations.
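As a quick illustration (the one-unit shift below is an arbitrary choice), we can keep the bin width at two but move the bin edges, which moves the $mids and changes the bar heights:

> hist(data, breaks = seq(0, 50, 2))    # bins at 0-2, 2-4, ...
> hist(data, breaks = seq(-1, 51, 2))   # same bin width, edges shifted by one unit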

The alignment of the bins with the data points decides the height of each bar in the histogram. Ideally, this height should be based on the actual density of the nearby points. Can we stack the bars in such a way that the height is proportional to the points they represent? This is where KDE one-ups the histogram.

A discrete random variable takes on a finite (or countably infinite) number of possible values, and we can determine the probability of each possible value using a probability mass function. A continuous random variable takes on an infinite number of possible values, so we cannot find the probability of any particular value; instead, we find the probability that a value falls in some interval. A probability density function (PDF) is used to find the probability of a value falling in an interval for a continuous random variable.

Let us look at an example:

> mean <- 25
> sd <- 5
> 1 - (2 * pnorm(mean + (2 * sd), mean = mean, sd = sd, lower.tail = FALSE))
[1] 0.9544997

pnorm is the function for the cumulative distribution function of the normal distribution, i.e. the probability that a value is less than or equal to a given point. Here we calculate the probability that a value falls within 2 standard deviations of the mean, which we know from basic statistics is about 95%. We calculate it by taking the probability that a value is greater than mean+(2*sd) and multiplying it by 2 (to also account for the values less than mean-(2*sd)). This gives us the probability that the values lie in the "tails" of the distribution, so we subtract it from 1 to get the probability that a value lies between mean +/- (2*sd), giving the answer of 0.9544997. Note that even if you change the values of mean and sd, the result is always 0.9544997.
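Equivalently, as a quick check of the same calculation, we can subtract the CDF values at the two endpoints of the interval:

> pnorm(mean + 2 * sd, mean = mean, sd = sd) - pnorm(mean - 2 * sd, mean = mean, sd = sd)
[1] 0.9544997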

There is also an excellent plot here: https://en.wikipedia.org/wiki/Probability_density_function

For a set of N samples generated by a univariate (or multivariate) continuous random variable, the kernel density estimate tries to approximate that random variable's probability density function.

Let us see a univariate example using the density function to approximate the PDF of the given data:

# Kernel density estimation
> kde <- density(data)
> plot(kde)

Let us look at the density plot:

This smoothed-out plot from KDE gives us a better idea of the distribution of the underlying data. The density function in R uses a Gaussian kernel by default at each point in the data. The two main parameters of KDE are the kernel, which decides the shape of the contribution placed at each point, and the bandwidth, which defines the smoothness of the resulting distribution.
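To see the effect of these parameters, we can pass them to density explicitly; the bandwidth value of 0.5 below is an arbitrary choice for illustration:

> kde.narrow <- density(data, bw = 0.5)                  # smaller bandwidth: bumpier estimate
> kde.epan <- density(data, kernel = "epanechnikov")     # a different kernel shape
> plot(kde.narrow)
> lines(kde.epan, col = "red")                           # overlay for comparison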

Why use a kernel instead of simply calculating the density around each point and then summing it up?

We can calculate the density at each point x0 as follows:

f_hat(x0) = #{xi : |xi - x0| <= h} / (2 * N * h)

Where h is the distance threshold. This may give a bumpy plot, as each point xi in the neighborhood defined by h receives the same weight.
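A quick sketch of this naive count-based estimate (the helper name naive_density and the threshold h = 1 are just for illustration) shows the bumpiness:

# Naive density estimate: count of points within distance h of x0, scaled by 2*N*h
naive_density <- function(x0, data, h) {
  sapply(x0, function(p) sum(abs(data - p) <= h) / (2 * length(data) * h))
}

grid <- seq(min(data), max(data), length.out = 200)
plot(grid, naive_density(grid, data, h = 1), type = "l")  # note the jagged, step-like shape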

So instead, we use a kernel:

f_hat(x0) = (1 / (N * h)) * sum over i of K((x0 - xi) / h)

Where K is the kernel and h is the smoothing parameter (the bandwidth). The kernel gives weight to each point xi based on its proximity to x0.
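As a minimal sketch of this formula (assuming a Gaussian kernel, with dnorm playing the role of K; the function name kde_manual and the bandwidth value are our own choices for illustration), we can compute the estimate by hand and compare it with R's density function:

# Manual Gaussian KDE: f_hat(x0) = (1/(N*h)) * sum(K((x0 - xi)/h))
kde_manual <- function(x0, data, h) {
  sapply(x0, function(p) sum(dnorm((p - data) / h)) / (length(data) * h))
}

grid <- seq(min(data), max(data), length.out = 100)
plot(grid, kde_manual(grid, data, h = 1), type = "l")  # our hand-rolled estimate
lines(density(data, bw = 1), col = "red")              # R's built-in estimate for comparison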

Look at the R help function to learn more about the density function:

> help(density)

Hopefully, this section has given you an understanding of how to use kernel density estimation to approximate the probability density function of an underlying dataset. We will be using this PDF to build our sentiment classifier.

Refer to the Kernel Smoothing Methods section in Chapter 6 of The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, for a rigorous discussion of kernel density estimates.
