Exploratory analysis

Before building and training a neural network, we conduct an exploratory analysis to understand how the data is distributed and to extract preliminary knowledge from it.

We can begin our exploratory analysis by plotting the predictors against the target. Recall that in our analysis the predictors are the following variables: cylinders, displacement, horsepower, weight, acceleration, year, origin, and name. The target is the mpg variable, which contains the miles per gallon measured for 392 sample cars.
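Before plotting, it can help to check the shape and types of the variables. The following is a minimal sketch: the demo data frame is a hypothetical stand-in with made-up values, not the real dataset. With the real data, the same calls would be applied to the data object used in the text.

```r
# Hypothetical stand-in for `data` (made-up values, column names as in the text)
demo <- data.frame(
  mpg       = c(18, 15, 26, 24, 27, 31),
  cylinders = c(8, 8, 4, 4, 4, 4),
  weight    = c(3504, 3693, 1835, 2430, 2130, 1773),
  origin    = c(1, 1, 2, 2, 3, 3)
)

dim(demo)                          # number of rows (cars) and columns (variables)
str(demo)                          # storage type of each column
summary(demo)                      # min, quartiles, mean, and max of each column
missing <- colSums(is.na(demo))    # count of missing values per column
missing
```

Checking for missing values first is a design choice: plot() silently skips incomplete points, so gaps in the data would otherwise go unnoticed.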

Suppose we want to examine the weight and mileage of cars from three different origins, as shown in the next graph, using the following code:

plot(data$weight, data$mpg, pch=data$origin,cex=2)

To draw the chart, we used the plot() function, specifying what to plot on the x axis (weight), what to plot on the y axis (mpg), and, through the pch argument, which variable determines the plotting symbol of each point (origin). The cex argument scales the size of the symbols. The result is shown in the following graph:

Remember that the numbers in the origin column correspond to the following regions: 1 = America, 2 = Europe, and 3 = Japan. From the previous graph, we can see that fuel consumption increases as weight increases. Recall that the target measures miles per gallon, that is, how many miles a car travels on one gallon of fuel. It follows that the greater the value of mpg, the lower the fuel consumption.
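Since the symbols encode the origin, a legend makes the chart easier to read. The following is a minimal sketch, again using a hypothetical demo data frame with made-up values in place of the real data. The origin_labels vector is our own naming and follows the coding given above.

```r
# Hypothetical stand-in for `data` (made-up values)
demo <- data.frame(
  mpg    = c(18, 15, 26, 24, 27, 31),
  weight = c(3504, 3693, 1835, 2430, 2130, 1773),
  origin = c(1, 1, 2, 2, 3, 3)
)

# Labels in the order of the origin codes 1, 2, 3
origin_labels <- c("America", "Europe", "Japan")

# Same scatter plot as in the text, with explicit axis labels
plot(demo$weight, demo$mpg, pch = demo$origin, cex = 2,
     xlab = "weight", ylab = "mpg")

# pch = 1:3 matches the symbols that plot() drew for origin codes 1 to 3
legend("topright", legend = origin_labels, pch = 1:3)
```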

Another observation from the plot is that the heaviest cars are produced in America. In fact, in the right part of the chart (which corresponds to higher values of weight), there are only cars produced in that region.

Finally, if we focus on the upper-left part of the graph, which corresponds to the lowest fuel consumption, we find mostly Japanese and European cars. In conclusion, the cars with the lowest fuel consumption tend to be Japanese.
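Visual impressions of this kind can be checked numerically by averaging weight and mpg within each origin. The following sketch assumes a hypothetical demo data frame with made-up values. The origin_name column and the by_origin table are names introduced here for illustration.

```r
# Hypothetical stand-in for `data` (made-up values)
demo <- data.frame(
  mpg    = c(18, 15, 26, 24, 27, 31),
  weight = c(3504, 3693, 1835, 2430, 2130, 1773),
  origin = c(1, 1, 2, 2, 3, 3)
)

# Turn the numeric origin codes into a labelled factor
demo$origin_name <- factor(demo$origin, levels = 1:3,
                           labels = c("America", "Europe", "Japan"))

# Mean weight and mean mpg for each origin
by_origin <- aggregate(cbind(weight, mpg) ~ origin_name, data = demo, FUN = mean)
print(by_origin)
```

With the real data, a higher mean weight together with a lower mean mpg for one origin would support what the scatter plot suggests.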

Now, let's look at the other graphs, that is, what we get if we plot the remaining numeric predictors (cylinders, displacement, horsepower, and acceleration) against the target (mpg):

par(mfrow=c(2,2))
plot(data$cylinders, data$mpg, pch=data$origin,cex=1)
plot(data$displacement, data$mpg, pch=data$origin,cex=1)
plot(data$horsepower, data$mpg, pch=data$origin,cex=1)
plot(data$acceleration, data$mpg, pch=data$origin,cex=1)

For reasons of space, we placed the four charts in a single figure. R makes it easy to combine multiple plots into one overall graph with the par() function. Passing the option mfrow=c(nrows, ncols) to par() creates a matrix of nrows x ncols plots that is filled in by row. For example, the option mfrow=c(3,2) creates a matrix of plots with 3 rows and 2 columns. Alternatively, the option mfcol=c(nrows, ncols) fills in the matrix by column.
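The difference between the two options is easiest to see with numbered panels. The following sketch uses placeholder points (1:10) instead of the car data. It also saves the previous graphical settings, which par() returns, so that the layout can be restored afterwards. The old_par name is our own.

```r
# Fill a 2 x 2 layout row by row: panels 1 and 2 on top, 3 and 4 below
old_par <- par(mfrow = c(2, 2))   # par() returns the previous settings
for (i in 1:4) plot(1:10, main = paste("panel", i))
par(old_par)                      # restore one plot per figure

# Fill a 2 x 2 layout column by column: panels 1 and 2 on the left
old_par <- par(mfcol = c(2, 2))
for (i in 1:4) plot(1:10, main = paste("panel", i))
par(old_par)                      # restore one plot per figure
```

Restoring the settings matters because a layout set with par() stays active for every later plot on the same device.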

The following figure shows the four plots arranged in a matrix of two rows and two columns:

The previous figure confirms what we observed earlier. Cars with higher horsepower have higher fuel consumption. The same can be said of engine displacement: vehicles with higher displacement have higher fuel consumption. Again, the cars with the highest horsepower and displacement values are produced in America.

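Conversely, cars with higher acceleration values have lower fuel consumption. Note that in this dataset the acceleration variable is the time, in seconds, needed to go from 0 to 60 mph, so higher values indicate slower cars. These tend to be the lighter cars with smaller, less powerful engines, which explains their lower fuel consumption.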
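The direction of each relationship can also be summarized with a correlation coefficient. The following is a minimal sketch on a hypothetical demo data frame with made-up values. The predictors and cors names are our own, and the numbers produced here say nothing about the real data.

```r
# Hypothetical stand-in for `data` (made-up values)
demo <- data.frame(
  mpg          = c(18, 15, 26, 24, 27, 31),
  displacement = c(307, 350, 97, 107, 97, 71),
  horsepower   = c(130, 165, 46, 90, 88, 65),
  weight       = c(3504, 3693, 1835, 2430, 2130, 1773),
  acceleration = c(12, 11.5, 20.5, 14.5, 14.5, 19)
)

predictors <- c("displacement", "horsepower", "weight", "acceleration")

# Pearson correlation of each predictor with the target
cors <- sapply(demo[predictors], function(x) cor(x, demo$mpg))
round(cors, 2)
```

A negative coefficient means that mpg falls (fuel consumption rises) as the predictor grows, and a positive one means the opposite.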
