Chapter 4
Tukey’s Box Plot:
Exploratory Analysis
The Structure of the Box Plot
The box plot is among the simplest and most widely used data analysis techniques.
The origin of the box plot lies in the range plot. In the rudimentary version of
the range plot, a line stretches between the minimum and the maximum values [1].
We can include markers in this range line to indicate central tendency. We can also
annotate the line with markers to indicate standard deviation.
Box 4.1 Statistical Thinking
To achieve statistical thinking in engineering and management is our purpose.
This involves thinking with data, perceiving central tendency and dispersion,
and recognizing statistical outliers. These three aspects of statistical
thinking are facilitated by the box plot. The central line in the box indicates
central tendency, the median. Dispersion is shown in two levels of detail:
the length of the box is an indicator of dispersion in a broad business sense,
and the whiskers indicate dispersion with more rigor and confidence.
Outliers, if any, are identified and plotted as points beyond the whiskers. To
use the box plot is to practice statistical thinking. We can use the box plot
effectively in management and engineering situations.
54 Simple Statistical Methods for Software Engineering
In its early form developed by Mary Spear in 1952, the box plot displayed the
five-point summary of data [2]:
Median
Lower quartile
Upper quartile
Smallest data value
Largest data value
The box is drawn from the median and the quartiles; it contains 50% of the observations.
The quartiles are the edges of the box, called hinges. The whiskers are lines that begin
at the hinges and end at the smallest and largest data values. The graph is known as
the box-and-whisker plot, or simply the box plot.
The box plot has gone through several changes. A summary of the historical
developments is presented by Kristin [3].
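As a minimal sketch, the five-point summary can be computed with Python's standard statistics module. The function name five_point_summary and the sample values are our own illustrative assumptions, not part of Spear's work, and the quartile convention chosen here (the "inclusive" method) is only one of several in common use.

```python
import statistics


def five_point_summary(values):
    """Return (smallest, lower quartile, median, upper quartile, largest)."""
    data = sorted(values)
    # statistics.quantiles with n=4 returns the three quartile cut points.
    # method="inclusive" is an assumption; other quartile conventions
    # give slightly different hinges for small samples.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


# Hypothetical sample, not taken from the book.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12]
smallest, q1, median, q3, largest = five_point_summary(sample)
print(smallest, q1, median, q3, largest)
```

In this early form of the plot, the box would span q1 to q3 and the whiskers would run out to smallest and largest.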
A simple but effective improvement of the box plot came from John Wilder
Tukey, and it made the box plot a popular tool. Tukey modified the box plot and
published it in Exploratory Data Analysis [4] in 1977. In the modern version, data
fences are used. The whiskers do not stretch to the smallest or largest data values.
The whiskers stretch out from the box only up to trimming points (or fences) that
mark off outliers. The trimming rules have been empirically designed. The markers
are 1.5 times the interquartile range (IQR) away from the box. Whiskers end at the
data points farthest from the box that still lie inside these markers. The markers
provide a pragmatic way to find outliers. Aczel and Sundara Pandian, authors of an
Excel tool to plot the box plot, refer to these markers as fences [5]. Besides these
inner fences, the plot authors have introduced additional markers 3 IQR away from
the box. These are referred to as outer fences. Data that lie beyond the inner fences
but inside the outer fences can be suspected as possible outliers. Data that fall
outside the outer fences are definite outliers.
A typical box plot is shown in Figure 4.1. The following guidelines have been
used in the construction of the graph:
Box central line = Median
Lower hinge (edge) of box = Quartile 1
Upper hinge (edge) of box = Quartile 3
IQR = Quartile 3 − Quartile 1
Right inner fence = Quartile 3 + 1.5 IQR
Left inner fence = Quartile 1 − 1.5 IQR
Right outer fence = Quartile 3 + 3 IQR
Left outer fence = Quartile 1 − 3 IQR
Software productivity data (lines of code/person day) are analyzed by this plot.
The box is constructed from Quartile 1 (productivity = 8) to Quartile 3 (productivity =
34.5). Fifty percent of the data are inside the box. Hence, the core productivity is in
this range. The left whisker reaches zero, whereas the right whisker reaches 70. The
whisker ends represent a more complete range, beyond the core box. The whisker
range has good news as well as bad news. The lower whisker is a serious concern;
productivity values have dropped to zero. This could be a data error and might have
to be cleaned out. We are doing exploratory analysis with the box plot, and we just
make a note of this observation at the moment.
Then we find outliers. Those values between the inner and outer fences are suspected
outliers. Productivity values between 75 and 116 are perhaps not repeatable performance.
Those values beyond the outer fence are definite outliers. They are, in a purely
statistical sense, odd, untenable results. Perhaps those results might have had harmful
side effects; the damage might have been done, and only a root cause analysis can tell.
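The fence arithmetic for this example can be sketched in Python as follows. The function names tukey_fences and classify are hypothetical, and the probe values 70, 100, and 150 are chosen only for illustration. With the rounded quartiles quoted above, the right inner and outer fences come out at 74.25 and 114, close to the 75 and 116 read from the plot; the small difference presumably comes from rounding of the quoted quartiles.

```python
def tukey_fences(q1, q3):
    """Compute the IQR and the inner and outer fences from the two hinges."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        "left_inner": q1 - 1.5 * iqr,
        "right_inner": q3 + 1.5 * iqr,
        "left_outer": q1 - 3 * iqr,
        "right_outer": q3 + 3 * iqr,
    }


def classify(value, fences):
    """Label a value by the region of the box plot it falls in."""
    if value < fences["left_outer"] or value > fences["right_outer"]:
        return "definite outlier"
    if value < fences["left_inner"] or value > fences["right_inner"]:
        return "suspected outlier"
    return "within fences"


# Quartiles quoted in the text for the productivity data (LOC/person day).
fences = tukey_fences(q1=8, q3=34.5)
print(fences)

# Illustrative probe values, not actual data points from the book.
for productivity in (70, 100, 150):
    print(productivity, classify(productivity, fences))
```

Both left fences are negative for these quartiles, so no productivity value can fall below them; a zero reading cannot be flagged by the fences and has to be questioned on engineering grounds instead.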
Customer Satisfaction Data Analysis Using the Box Plot
In analyzing ordinal data, box plots are invaluable. Let us take the customer satisfaction
(CSAT) index data from a development project. The data are shown in Data 4.1.
Data have been collected on a 0–10 scale. This scale fares better than the
conventional 0–5 Likert scale. The 0–10 scale has more granularity and less subjective
error.
To understand the performance of the organization, analysts take the median
when the data are ordinal; taking the mean is a common but mistaken practice.
Figure 4.1 Box plot of software productivity. (Horizontal axis: productivity in LOC/PD, from −100 to 150; labeled elements: Q1, median, Q3, whisker, inner fence, outer fence, potential outliers, and outlier.)
In this case, the mean and the median give nearly the same result. However, we
prefer to use the median. The central tendency of the CSAT index is shown as follows:
Mean 6.8
Median 6.9
This is compared with the organization goal, which happens to be 8.0. The
obvious shortcoming is recognized, and future decisions are made to bridge the
gap. This is the routine analysis.
Let us now try a box plot to display the CSAT data as shown in Figure 4.2.
We are able to make the following additional observations in the box plot.
1. The entire box is below the goal. This is a serious matter. The core process
carrying 50% of performance results is below the mark.
2. There is an outlier with a value of CSAT index around 3.2. This is way down
the track. If we apply the Kano model of CSAT, this score will run into deep
dissatisfaction levels. Perhaps it is just short of customer fury.
3. Not a single event has reached the top score of 10. Customer delight seems to
be an unattainable goal. To balance the outlier, we need at least a few delighters.
To compensate for one negative impression, we need to create ten positive
impressions. The compensatory effort is missing.
Data 4.1 Customer Satisfaction Index Values
5.0 5.8 7.4
6.5 7.6 8.7
6.6 6.0 8.3
9.1 7.2 7.7
5.9 5.5 7.1
5.8 6.8 6.5
7.7 6.3 8.1
7.5 5.9 7.0
5.0 6.3 6.4
5.9 8.5 4.6
7.1 8.2 8.0
7.6 6.0 7.3
7.9 6.2 5.9
7.2 4.7 6.9
7.8 5.3 6.7
4.7 6.9 8.1
7.0 6.7 7.1
3.1 7.6 7.0
5.1 8.2 6.8
4.5 7.2 6.6
8.1 8.6 6.9
5.9 6.9 6.6
7.8 7.7
Certainly, thinking with the box plot enables us to reason much more than
working with the mean value of the CSAT index. Box plots make us see the problem
in its entirety; this is a very valuable support.
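The observations above can be checked directly against Data 4.1. The sketch below uses only Python's standard library; the variable names and the choice of the "inclusive" quartile method are our assumptions, and a different quartile convention would shift the hinges and fences slightly.

```python
import statistics

# Data 4.1: customer satisfaction index values (68 readings, 0-10 scale).
csat = [
    5.0, 5.8, 7.4, 6.5, 7.6, 8.7, 6.6, 6.0, 8.3, 9.1, 7.2, 7.7,
    5.9, 5.5, 7.1, 5.8, 6.8, 6.5, 7.7, 6.3, 8.1, 7.5, 5.9, 7.0,
    5.0, 6.3, 6.4, 5.9, 8.5, 4.6, 7.1, 8.2, 8.0, 7.6, 6.0, 7.3,
    7.9, 6.2, 5.9, 7.2, 4.7, 6.9, 7.8, 5.3, 6.7, 4.7, 6.9, 8.1,
    7.0, 6.7, 7.1, 3.1, 7.6, 7.0, 5.1, 8.2, 6.8, 4.5, 7.2, 6.6,
    8.1, 8.6, 6.9, 5.9, 6.9, 6.6, 7.8, 7.7,
]
GOAL = 8.0  # organization goal quoted in the text

mean_csat = statistics.mean(csat)
median_csat = statistics.median(csat)

# Hinges and inner fences; the "inclusive" method is an assumed convention.
q1, _, q3 = statistics.quantiles(csat, n=4, method="inclusive")
iqr = q3 - q1
lower_inner = q1 - 1.5 * iqr
upper_inner = q3 + 1.5 * iqr
outliers = [x for x in csat if x < lower_inner or x > upper_inner]

print(f"mean = {mean_csat:.1f}, median = {median_csat:.1f}")
print(f"box from {q1:.3f} to {q3:.3f}; entirely below goal: {q3 < GOAL}")
print(f"values beyond the inner fences: {outliers}")
print(f"largest value = {max(csat)}; any top score of 10: {max(csat) >= 10}")
```

Run as written, the script reports a mean of about 6.8 and a median of 6.9, flags the single reading of 3.1 as lying below the lower inner fence, and confirms that the upper hinge is below the goal of 8.0 and that no reading reaches 10.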
Tailoring the Box Plot
The box plot is being widely applied in real life. We have seen the insights brought
in by the box plot to R&D scientists, project managers, business managers, quality
managers, and data analysts in the software business. Even students in the middle
grades in the United States are taught the box plot [6].
The box plot is being continuously refined. Tukey himself published variations
in the box plot in 1978 [7]. Others have included frequency information in the box
plot. Bivariate box plots called bag plots have also been tried out. People have
proposed variants called bean plots. An analysis of the attempted improvements in the
box plot may be found in the paper by Choonpradub and McNeil [8].
Attempts to pack more information into the box plot have failed.
People prefer the simple, uncluttered, plain box plot.
Applications of Box Plot
Numerical quantities focus on expected values, graphical summaries
on unexpected values.
John W. Tukey
Figure 4.2 Box plot of customer satisfaction index. (Horizontal scale from 0 to 10.)