Chapter 4 - Tukey’s Box Plot: Exploratory Analysis (1/4)



The Structure of the Box Plot

The box plot is easily the simplest and most widely used data analysis technique. The origin of the box plot lies in the range plot. In the rudimentary version of the range plot, a line stretches between the minimum and the maximum values [1]. We can include markers on this range line to indicate central tendency. We can also annotate the line with markers to indicate standard deviation.

Box 4.1 Statistical Thinking

To achieve statistical thinking in engineering and management is our purpose. This involves thinking with data, perceiving central tendency and dispersion, and recognizing statistical outliers. These three aspects of statistical thinking are facilitated by the box plot. The central line in the box indicates central tendency, the median. Dispersion is shown in two levels of detail: the length of the box is an indicator of dispersion in a broad business sense, and the whiskers indicate dispersion with more rigor and confidence. Outliers, if any, are identified and plotted as points beyond the whiskers. To use the box plot is to practice statistical thinking. We can use the box plot effectively in management and engineering situations.

54 ◾ Simple Statistical Methods for Software Engineering

In its early form, developed by Mary Spear in 1952, the box plot displayed the five-point summary of data [2]:

Median

Lower quartile

Upper quartile

Smallest data value

Largest data value

The box is drawn from the quartiles, with the median marked inside it; the box contains the middle 50% of observations. The quartiles are the edges of the box, called hinges. The whiskers are lines that begin at the hinges and end at the smallest and largest data values. The graph is known as the box-and-whisker plot, or simply the box plot.
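As a minimal sketch of the five-point summary, the Python snippet below computes the five values for a small sample. The sample values and the function name `five_point_summary` are made up for illustration. The quartiles use the "inclusive" interpolation of Python's `statistics.quantiles`, which is only one of several quartile conventions and may differ slightly from hinges worked out by hand.

```python
import statistics

def five_point_summary(values):
    """Return (smallest, lower quartile, median, upper quartile, largest)."""
    data = sorted(values)
    # The 'inclusive' method interpolates between the sample minimum and
    # maximum; other quartile conventions can give slightly different hinges.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Hypothetical sample, for illustration only.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
smallest, q1, median, q3, largest = five_point_summary(sample)
print(smallest, q1, median, q3, largest)  # prints: 1 3.0 5.0 7.0 9
```

In this form of the plot, the box would span 3 to 7 with the median line at 5, and the whiskers would run out to 1 and 9.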

The box plot has gone through several changes. A summary of the historical developments is presented by Kristin [3].

A simple but effective improvement of the box plot came from John Wilder Tukey, and it made the box plot a popular tool. Tukey modified the box plot and published it in Exploratory Data Analysis [4] in 1977. In the modern version, data fences are used. The whiskers do not stretch to the smallest or largest data values. The whiskers stretch out from the box only up to trimming points (or fences) that mark off outliers. The trimming rules have been empirically designed. The markers are 1.5 times the interquartile range (IQR) away from the box. Whiskers end at the data points farthest from the box that still lie inside these markers. The markers provide a pragmatic way to find outliers. Aczel and Sundara Pandian, authors of an Excel tool to plot the box plot, refer to these markers as fences [5]. Besides these inner fences, the plot authors have introduced additional markers 3 times the IQR away from the box. These are referred to as outer fences. If data lie beyond the inner fences, they can be suspected as possible outliers. Data that fall outside the outer fences are definite outliers.
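The whisker rule can be sketched in a few lines of Python. This is an illustrative sketch, not the Excel tool cited above: the function name `whisker_ends` and the sample are hypothetical, and the quartiles come from the "inclusive" method of `statistics.quantiles`, which may differ slightly from other hinge conventions.

```python
import statistics

def whisker_ends(values):
    """Return the data points at which the left and right whiskers end."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    left_inner = q1 - 1.5 * iqr
    right_inner = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the inner fences,
    # not at the fences themselves.
    inside = [x for x in data if left_inner <= x <= right_inner]
    return inside[0], inside[-1]

# Hypothetical sample with one extreme value (30).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
print(whisker_ends(sample))  # prints: (1, 9)
```

Here the right whisker ends at 9 rather than at the largest value, because 30 lies beyond the right inner fence and is plotted as a separate point.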

A typical box plot is shown in Figure 4.1. The following guidelines have been used in the construction of the graph:

Box central line = Median

Lower hinge (edge) of box = Quartile 1

Upper hinge (edge) of box = Quartile 3

IQR = Quartile 3 − Quartile 1

Right inner fence = Quartile 3 + 1.5 IQR

Left inner fence = Quartile 1 − 1.5 IQR

Right outer fence = Quartile 3 + 3 IQR

Left outer fence = Quartile 1 − 3 IQR
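The guidelines above translate directly into arithmetic. The following Python sketch applies them to the hinges this chapter reports for the productivity data (Quartile 1 = 8, Quartile 3 = 34.5). The function names `tukey_fences` and `classify` and the three probe values are assumptions made for illustration.

```python
def tukey_fences(q1, q3):
    """Compute the IQR and the inner and outer fences from the two hinges."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        "left_inner": q1 - 1.5 * iqr,
        "right_inner": q3 + 1.5 * iqr,
        "left_outer": q1 - 3 * iqr,
        "right_outer": q3 + 3 * iqr,
    }

def classify(value, fences):
    """Label a value by where it falls relative to the fences."""
    if value < fences["left_outer"] or value > fences["right_outer"]:
        return "definite outlier"
    if value < fences["left_inner"] or value > fences["right_inner"]:
        return "suspected outlier"
    return "inside fences"

# Hinges quoted in this chapter for the productivity data.
fences = tukey_fences(8.0, 34.5)
print(fences)

# Hypothetical probe values, for illustration only.
for value in (70, 100, 120):
    print(value, classify(value, fences))
```

With these hinges the IQR is 26.5, the inner fences fall at −31.75 and 74.25, and the outer fences at −71.5 and 114. The right-hand fences are close to, though not identical with, the approximate 75 and 116 quoted in the discussion of Figure 4.1.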

Software productivity data (lines of code per person-day) are analyzed with this plot. The box is constructed from Quartile 1 (productivity = 8) to Quartile 3 (productivity = 34.5). Fifty percent of the data are inside the box. Hence, the core productivity is in this range. The left whisker reaches zero, whereas the right whisker reaches 70. The whisker ends represent a more complete range, beyond the core box. The whisker range has good news as well as bad news. The lower whisker is a serious concern; productivity values have dropped to zero. This could be a data error and might have to be cleaned out. We are doing exploratory analysis with the box plot, and we just make a note of this observation at the moment.

Then we find outliers. Those values between the inner and outer fences are suspected outliers. Productivity values between 75 and 116 are perhaps not repeatable performance. Those values beyond the outer fence are definite outliers. They are, in a purely statistical sense, odd, untenable results. Perhaps those results might have had harmful side effects; the damage might have been done, and only a root cause analysis can tell.

Customer Satisfaction Data Analysis Using the Box Plot

In analyzing ordinal data, box plots are invaluable. Let us take the customer satisfaction (CSAT) index data from a development project. The data are shown in Data 4.1.

Data have been collected on a 0–10 scale. This scale fares better than the conventional 0–5 Likert scale. The 0–10 scale has more granularity and less subjective error.

To understand the performance of the organization, analysts take the median if the data are ordinal, although taking the mean is a common but mistaken practice.

Figure 4.1 Box plot of software productivity. (Horizontal axis: Productivity, LOC/PD, from −100 to 150. Labeled elements: median, whisker, inner fence, outer fence, potential outliers, outlier.)


In this case, the mean and the median give nearly the same result. However, we prefer to use the median. The central tendency of the CSAT index is as follows:

Mean 6.8

Median 6.9

This is compared with the organization goal, which happens to be 8.0. The obvious shortcoming is recognized, and future decisions are made to bridge the gap. This is the routine analysis.

Let us now try a box plot to display the CSAT data as shown in Figure 4.2.

We are able to make the following additional observations in the box plot.

1. The entire box is below the goal. This is a serious matter. The core process, carrying 50% of performance results, is below the mark.

2. There is an outlier with a CSAT index value around 3.2. This is way down the track. If we apply the Kano model of CSAT, this score will run into deep dissatisfaction levels. Perhaps it is just short of customer fury.

3. Not a single event has reached the top score of 10. Customer delight seems to be an unattainable goal. To balance the outlier, we need at least a few delighters. To compensate for one negative impression, we need to create ten positive impressions. The compensatory effort is missing.

Data 4.1 Customer Satisfaction Index Values

5.0 5.8 7.4
6.5 7.6 8.7
6.6 6.0 8.3
9.1 7.2 7.7
5.9 5.5 7.1
5.8 6.8 6.5
7.7 6.3 8.1
7.5 5.9 7.0
5.0 6.3 6.4
5.9 8.5 4.6
7.1 8.2 8.0
7.6 6.0 7.3
7.9 6.2 5.9
7.2 4.7 6.9
7.8 5.3 6.7
4.7 6.9 8.1
7.0 6.7 7.1
3.1 7.6 7.0
5.1 8.2 6.8
4.5 7.2 6.6
8.1 8.6 6.9
5.9 6.9 6.6
7.8 7.7
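The routine analysis and the box plot observations can be checked against Data 4.1 with a short Python sketch. The list below is a transcription of the 68 values in Data 4.1, and the goal of 8.0 is the one stated in the text. The variable names and the choice of the "inclusive" quartile method in `statistics.quantiles` are assumptions; other quartile conventions would move the hinges slightly.

```python
import statistics

# CSAT index values transcribed from Data 4.1 (68 observations).
csat = [
    5.0, 5.8, 7.4, 6.5, 7.6, 8.7, 6.6, 6.0, 8.3, 9.1, 7.2, 7.7,
    5.9, 5.5, 7.1, 5.8, 6.8, 6.5, 7.7, 6.3, 8.1, 7.5, 5.9, 7.0,
    5.0, 6.3, 6.4, 5.9, 8.5, 4.6, 7.1, 8.2, 8.0, 7.6, 6.0, 7.3,
    7.9, 6.2, 5.9, 7.2, 4.7, 6.9, 7.8, 5.3, 6.7, 4.7, 6.9, 8.1,
    7.0, 6.7, 7.1, 3.1, 7.6, 7.0, 5.1, 8.2, 6.8, 4.5, 7.2, 6.6,
    8.1, 8.6, 6.9, 5.9, 6.9, 6.6, 7.8, 7.7,
]
GOAL = 8.0  # organization goal stated in the text

mean = statistics.mean(csat)
q1, median, q3 = statistics.quantiles(csat, n=4, method="inclusive")
iqr = q3 - q1
left_inner = q1 - 1.5 * iqr
right_inner = q3 + 1.5 * iqr

# Values beyond the inner fences are the suspected outliers.
suspected = [x for x in csat if x < left_inner or x > right_inner]

print(f"mean = {mean:.1f}, median = {median:.1f}")
print(f"box = [{q1:.3f}, {q3:.3f}], entirely below goal: {q3 < GOAL}")
print(f"values beyond the inner fences: {suspected}")
print(f"highest score: {max(csat)}")
```

Under these assumptions the sketch gives a mean of about 6.8 and a median of 6.9, an upper hinge below the goal of 8.0, a single value (3.1) beyond the lower inner fence, and a highest score of 9.1, short of the top score of 10.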


Certainly, thinking with the box plot enables us to reason much more than working with the mean value of the CSAT index. Box plots make us see the problem in its entirety; this is a very valuable support.

Tailoring the Box Plot

The box plot is being widely applied in real life. We have seen the insights brought by the box plot to R&D scientists, project managers, business managers, quality managers, and data analysts in the software business. Even middle-grade students in the United States are taught the box plot [6].

The box plot is being continuously refined. Tukey himself published variations of the box plot in 1978 [7]. Others have included frequency information in the box plot. Bivariate box plots called bag plots have also been tried out. People have proposed variants called bean plots. An analysis of the attempted improvements in the box plot may be found in the paper by Choonpradub and McNeil [8].

Attempts to pack more information into the box plot have failed. People prefer the simple, uncluttered, plain box plot.

Applications of Box Plot

Numerical quantities focus on expected values, graphical summaries on unexpected values.

John W. Tukey

Figure 4.2 Box plot of customer satisfaction index. (Horizontal axis: CSAT index, 0 to 10.)
