Chapter 17 - Software Size Growth: Log-Normal Distribution (1/4)



Log-Normal Processes

Software grows in the development cycle. Software metrics, namely, size, effort, defects, and reliability, all manifest this growth.

Growth is a multiplicative process. This sharply contrasts with the additive process of gambling.

Growth of sites on the Web, growth of organisms in biology and ecology, growth of fatigue cracks in semiconductors, growth of Web pages on the Internet, growth of pollutants in the atmosphere, growth of cancer in people, growth of corrosion in metals, growth of phone traffic in a communication network, and growth of words are examples of multiplicative processes.
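
The contrast between additive and multiplicative processes can be illustrated with a small simulation. This is only a sketch under assumed settings: the step count, the uniform ranges, and the helper names (`additive_outcome`, `multiplicative_outcome`, `skewness`) are illustrative choices, not taken from the text. Summed steps tend to give a symmetric spread, whereas multiplied factors give a right-skewed one.

```python
import random
import statistics


def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def additive_outcome(rng, steps):
    # Gambling-style process: independent gains and losses are summed.
    return sum(rng.uniform(-1.0, 1.0) for _ in range(steps))


def multiplicative_outcome(rng, steps):
    # Growth-style process: the current size is scaled by a random factor.
    size = 1.0
    for _ in range(steps):
        size *= rng.uniform(0.5, 1.5)
    return size


rng = random.Random(17)  # fixed seed so the run is repeatable
additive = [additive_outcome(rng, 30) for _ in range(5000)]
multiplicative = [multiplicative_outcome(rng, 30) for _ in range(5000)]

print(f"skewness of additive outcomes:       {skewness(additive):6.2f}")
print(f"skewness of multiplicative outcomes: {skewness(multiplicative):6.2f}")
```

The additive skewness should come out near zero and the multiplicative skewness clearly positive, which is the pattern the log-normal distribution is meant to capture.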

Growth is inadequately represented by the bell curve. A different curve, namely, the log-normal distribution, first used in 1836 (see Box 17.1), represents such growth events adequately. The log-normal distribution is built on a simple premise: logarithms of skewed data will be normally distributed without skew; taking logarithms removes skew. The logarithmic scale is often used to present complex nonlinear data in a simplified linear form (see Box 17.2). Logarithmic transformation of observations allows us to apply the familiar properties of the bell curve to the transformed data.
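
One way to see the premise at work is to compare the mean with the median before and after taking logarithms; for symmetric data the two nearly coincide. The sketch below uses synthetic data drawn with Python's `random.lognormvariate`, not the NASA data set; the sample size, the seed, and the helper name `center_gap` are assumptions made for illustration.

```python
import math
import random
import statistics


def center_gap(values):
    """Mean minus median; close to zero for symmetric data."""
    return statistics.fmean(values) - statistics.median(values)


rng = random.Random(505)  # fixed seed so the run is repeatable
# Synthetic right-skewed sample (log-scale mean 0.771, log-scale SD 0.896).
raw = [rng.lognormvariate(0.771, 0.896) for _ in range(2000)]
logged = [math.log(v) for v in raw]

# Express each gap in units of the sample standard deviation.
raw_gap = center_gap(raw) / statistics.stdev(raw)
log_gap = center_gap(logged) / statistics.stdev(logged)

print(f"raw data:    (mean - median) / SD = {raw_gap:.3f}")
print(f"logged data: (mean - median) / SD = {log_gap:.3f}")
```

A noticeably positive gap on the raw scale and a gap near zero on the log scale is what "taking logarithms removes skew" would look like in numbers.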

Consider software design complexity, which is relatively skewed when compared with the bell curve. We analyzed the NASA data [1] on the design complexities of 505 modules written in the C language.

268 ◾ Simple Statistical Methods for Software Engineering

NASA software defect data sets have been made publicly available and extensively used by researchers.

The data may be viewed in the box plot provided in Figure 17.1.

In the box plot, it may be seen that the design complexity data are right skewed, with several outliers. (A box plot is a good data visualizer; it produces a rich picture of data. Further information about reading a box plot is available in Chapter 4.)

Having seen the picture of the raw data, we can take logarithms of the data and examine how the data have been transformed, in particular, how well they have been unskewed by logarithms. We plot the logarithms of the data in a box plot in Figure 17.2.

The new box plot has noteworthy and serious differences. Comparing Figure 17.2 with Figure 17.1 reveals two consequences of the transformation:

1. The box has become symmetric.

2. There is a drastic reduction in the number of outliers.

The new box plot is a better fit. It is as if the data, after transformation, have found their destination pattern.
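
The reduction in outliers can be checked numerically with the usual box plot whisker rule, which flags points more than 1.5 times the interquartile range beyond the quartiles. The sketch below is an illustration on synthetic log-normal data, not on the NASA data set; the function name `tukey_outliers`, the seed, and the choice of the 1.5 multiplier are assumptions.

```python
import math
import random
import statistics


def tukey_outliers(values):
    """Return the values lying beyond 1.5 * IQR from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


rng = random.Random(4)  # fixed seed so the run is repeatable
# Synthetic stand-in for 505 right-skewed complexity values.
raw = [rng.lognormvariate(0.771, 0.896) for _ in range(505)]
logged = [math.log(v) for v in raw]

raw_outliers = tukey_outliers(raw)
log_outliers = tukey_outliers(logged)

print(f"outliers before log transform: {len(raw_outliers)}")
print(f"outliers after log transform:  {len(log_outliers)}")
```

On the raw scale every flagged point sits in the long right tail; after the transformation far fewer points fall outside the whiskers.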

Box 17.1 The First Appearance of Log-Normal

It began with the geometric mean. If a variable can be thought of as the multiplicative product of some positive independent random variables, then it could be modeled as log-normal.

The basic properties of the log-normal distribution were discussed long ago, in 1836, by Weber [2]. McAlister described the log-normal distribution around 1879. Kapteyn and Van Uven, in 1916, gave a graphical method of estimating the parameters; the log-normal distribution was found to represent accurately the distribution of critical dose for several drugs; this was also the first time that the log-normal distribution was applied in real life.

Figure 17.1 Design complexity data (box plot; horizontal axis: design complexity).


The two box plot patterns can be expressed as mathematical curves. First, we can fit the data to a normal distribution. The symmetrical normal distribution is rather a mechanically executed force fit. The data have the following normal parameters:

Mean = 3.592

Standard deviation = 5.447

To obtain the parameters for the log-normal distribution, in the most commonly used format, we must estimate the mean and standard deviation of the natural logarithms of the data. Thus, we obtain 0.771 and 0.896, respectively.
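
The two parameter sets can be computed in a few lines. The sketch below is a minimal illustration: the sample values are hypothetical (the NASA data are not reproduced here), the function names are assumptions, and the use of the sample (n − 1) standard deviation is also an assumption, since the text does not say which form was used.

```python
import math
import statistics


def normal_parameters(data):
    """Mean and sample standard deviation of the raw values."""
    return statistics.fmean(data), statistics.stdev(data)


def lognormal_parameters(data):
    """Mean and sample standard deviation of the natural logarithms."""
    if any(x <= 0 for x in data):
        raise ValueError("logarithms require strictly positive values")
    logs = [math.log(x) for x in data]
    return statistics.fmean(logs), statistics.stdev(logs)


# Hypothetical right-skewed complexity values, for illustration only.
sample = [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 12, 30]

mean, sd = normal_parameters(sample)
log_mean, log_sd = lognormal_parameters(sample)
print(f"normal:     mean = {mean:.3f}, SD = {sd:.3f}")
print(f"log-normal: mean of ln(x) = {log_mean:.3f}, SD of ln(x) = {log_sd:.3f}")
```

Values of zero or below cannot be log-transformed, which is why the helper rejects them rather than silently dropping them.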

Normal and log-normal curves generated by the Excel functions NORM.DIST and LOGNORM.DIST, based on the two sets of parameters above, are plotted in Figure 17.3.

The normal curve represents a traditional and habitual treatment, and the log-normal curve represents a modern and theory-driven treatment of design complexity data. It may be seen that the log-normal curve is in closer agreement with the box plot of data shown in Figure 17.1.
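
The two curves can also be evaluated outside a spreadsheet. The following sketch writes out the standard normal and log-normal density formulas and plugs in the parameters quoted above (3.592 and 5.447 for the normal curve, 0.771 and 0.896 for the log-normal curve). The function names and the evaluation points are illustrative assumptions; the densities are intended to correspond to the non-cumulative forms of NORM.DIST and LOGNORM.DIST.

```python
import math


def normal_pdf(x, mean, sd):
    """Gaussian density at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def lognormal_pdf(x, log_mean, log_sd):
    """Log-normal density at x; zero for x <= 0."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - log_mean) / log_sd
    return math.exp(-0.5 * z * z) / (x * log_sd * math.sqrt(2.0 * math.pi))


print(f"{'x':>6} {'normal':>10} {'log-normal':>12}")
for x in (-10, -5, 0.5, 1, 2, 5, 10, 20):
    n = normal_pdf(x, 3.592, 5.447)
    ln = lognormal_pdf(x, 0.771, 0.896)
    print(f"{x:>6} {n:>10.4f} {ln:>12.4f}")
```

The printed table makes the two complaints about the normal fit concrete: it assigns density to negative complexity values, and it is much flatter than the log-normal curve near small positive values.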

Figure 17.2 Natural logarithms of design complexity data (box plot; horizontal axis: Ln design complexity).

Figure 17.3 Normal versus log-normal distributions of design complexity (curves: Normal, Log-normal; horizontal axis: design complexity).


The log-normal curve represents an engineering truth missed by the bell curve. The normal curve is a misfit; it has an odd negative tail that is not practical, and it also presents a misleading peak. The log-normal curve does not go negative, has the right-skewed tail, and has a better-matched peak.

Software design is not a Gaussian process; it is a log-normal process.

Building a Log-Normal PDF for Software Design Complexity

As the saying goes, if we substitute natural logarithms for x in the Gaussian PDF equation (Equation 17.1), we obtain a log-normal PDF equation. The Gaussian PDF is as follows:

\[ F(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{17.1} \]

Accordingly, natural logarithms of the data must be taken first, and then, in Equation 17.1, x must be replaced by Ln(x), μ must be replaced by the average of Ln(x), and σ must be replaced by the standard deviation of Ln(x). However, the equation will then be in the logarithmic scale. Taking the exponential of the results will convert them to real-life units.
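
The substitution can be written out as a change of variable. The derivation below is a sketch of the standard textbook result rather than a formula quoted from this chapter; Y, Φ, and f_X are notation introduced here for illustration. It assumes Y = Ln(X) is normally distributed with mean μ and standard deviation σ.

```latex
% Assumption: Y = \ln X \sim N(\mu, \sigma^2), with X > 0.
\begin{align*}
P(X \le x) &= P(\ln X \le \ln x)
            = \Phi\!\left(\frac{\ln x - \mu}{\sigma}\right), \qquad x > 0, \\
f_X(x) &= \frac{d}{dx}\,\Phi\!\left(\frac{\ln x - \mu}{\sigma}\right)
        = \frac{1}{x}\cdot\frac{1}{\sigma\sqrt{2\pi}}\,
          e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}} .
\end{align*}
```

Expressed on the original scale, the density therefore carries an extra factor of 1/x in addition to the substitution of Ln(x) for x, which is what makes the curve right skewed.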

It may be noted that the Excel function LOGNORM.DIST takes its parameters (the mean and standard deviation of Ln(x)) in the logarithmic scale but accepts x and gives results in the real domain. We do not have to take exponentials and go through a separate conversion process, and this is a great practical convenience.

Users have built their own versions of the log-normal PDF. The first choice we need to make is the central value.

The log-normal PDF is built around the median, just as the Gaussian is built around the mean.

Some versions of the log-normal PDF use the geometric mean. In a typical log-normal process, the median and the geometric mean are nearly equal. Using the geometric mean is justified by the fact that log-normal numbers are multiplicative and tend to form a geometric series. In creating a log-normal PDF, the NIST Engineering Statistics Handbook [3] proposes the following structure:

\[ f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}} \, e^{-\frac{(\ln x - \beta)^2}{2\sigma^2}} \tag{17.2} \]

where β is the scale factor and σ is the shape factor.


The parameters β and σ can be extracted by (1) the method of moments (MOM), (2) the maximum likelihood method, and (3) the minimum χ² method. We pursue the MOM in this chapter; hence, we use the following two relationships:

β is the scale factor = mean of Ln(x) (17.3)

σ is the shape factor = standard deviation of Ln(x) (17.4)

These relationships are inherent in the Excel function LOGNORM.DIST, as we have seen while creating Figure 17.3. Methods 2 and 3 compute parameters by iteration, and it is a good idea to use Equations 17.3 and 17.4 to generate initial values that may help the subsequent iteration runs converge faster.

Even manual techniques of parameter extraction begin with Equations 17.3 and 17.4. If we apply them to the design complexity data, the scale and shape parameters become 0.771 and 0.896, the starting values.

The NIST suggestion becomes a valuable option: the scale parameter may be taken as Ln(Median(x)) instead of the mean of Ln(x). The scale parameter by the NIST option will be 0.693 instead of the standard 0.771. This is based on the logic that log-normal distributions are centered on the median, so we need not search for the scale parameter iteratively.
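
The two choices of scale parameter differ only in the order of the operations: average the logarithms, or take the logarithm of the median. The sketch below compares them on a hypothetical sample whose median is 2, chosen because Ln(2) is approximately 0.693; that the real data set has a median of 2 is an inference from the quoted value, not something stated in the text. The function names are also assumptions.

```python
import math
import statistics


def scale_mean_of_logs(data):
    """Scale parameter as the mean of Ln(x) (Equation 17.3)."""
    return statistics.fmean([math.log(x) for x in data])


def scale_log_of_median(data):
    """Scale parameter as Ln(Median(x)), the median-based option."""
    return math.log(statistics.median(data))


# Hypothetical right-skewed sample with a median of 2.
sample = [1, 1, 2, 2, 2, 3, 5, 9, 20]

print(f"mean of Ln(x):  {scale_mean_of_logs(sample):.3f}")
print(f"Ln(Median(x)):  {scale_log_of_median(sample):.3f}")
```

The median-based value depends only on the middle of the data, so a few very large modules cannot pull it upward the way they pull the mean of the logarithms.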

Working with a Pictorial Approach

Let us now consider a graphical way of connecting the data with a mathematical distribution. We can construct and use a histogram, known for its pattern extraction capabilities. Such a histogram of design complexity data is shown in Figure 17.4.
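
A histogram of this kind can be built with simple equal-width binning. The sketch below uses hypothetical complexity values and an assumed bin width of 2; neither is taken from Figure 17.4, and the function name `histogram` is illustrative.

```python
import math
from collections import Counter


def histogram(data, bin_width):
    """Map each bin's lower edge to the number of values falling in it."""
    counts = Counter(math.floor(x / bin_width) for x in data)
    return {index * bin_width: counts[index] for index in sorted(counts)}


# Hypothetical right-skewed complexity values, for illustration only.
sample = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 11, 17, 26]

bins = histogram(sample, bin_width=2)
for lower_edge, count in bins.items():
    # A row of '#' characters gives a quick text rendering of each bar.
    print(f"{lower_edge:>3}-{lower_edge + 2:<3} {'#' * count}")

peak_bin = max(bins, key=bins.get)
print(f"modal bin starts at {peak_bin}")
```

Even in text form, the three features discussed next are visible: a peak near the low end, a sharp drop after it, and a long thin tail to the right.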

The histogram has extracted a distinctive pattern, with well-defined and clearly discernible features: mode (peak), shape, and tail. These graphical features provide guidance in the choice of a sensitive log-normal parameter: the shape factor. Using graphical matching, we can select the most appropriate from a set of design complexity log-normal curves.

A set of log-normal curves is given in Figure 17.5, with four sets of log-normal parameters, as follows:

1. Shape 0.7, scale 0.5

2. Shape 0.896, scale 0.771 (obtained by MOM)

3. Shape 1.1, scale 1.0

4. Shape 1.3, scale 1.4

These curves have been obtained iteratively by perturbing the parameter values around the initial values given by the second pair of parameters (shape 0.896, scale 0.771), obtained using MOM.
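
To compare the four candidate curves against the features of a histogram, it helps to tabulate where each curve peaks and where its median lies. The sketch below uses the four parameter pairs listed above together with two standard log-normal results that are not stated in the text: the mode is exp(scale − shape²) and the median is exp(scale). The function names are illustrative assumptions.

```python
import math

# (shape, scale) pairs as listed for the four candidate curves.
PARAMETER_SETS = [(0.7, 0.5), (0.896, 0.771), (1.1, 1.0), (1.3, 1.4)]


def lognormal_pdf(x, scale, shape):
    """Log-normal density with log-scale mean `scale` and log-scale SD `shape`."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - scale) / shape
    return math.exp(-0.5 * z * z) / (x * shape * math.sqrt(2.0 * math.pi))


def lognormal_mode(scale, shape):
    """Location of the peak of the density."""
    return math.exp(scale - shape ** 2)


def lognormal_median(scale):
    """Value below which half of the distribution lies."""
    return math.exp(scale)


print(f"{'shape':>6} {'scale':>6} {'mode':>7} {'median':>7} {'peak height':>12}")
for shape, scale in PARAMETER_SETS:
    mode = lognormal_mode(scale, shape)
    median = lognormal_median(scale)
    height = lognormal_pdf(mode, scale, shape)
    print(f"{shape:>6} {scale:>6} {mode:>7.3f} {median:>7.3f} {height:>12.3f}")
```

Reading the table against a histogram's mode and tail gives a quick, non-iterative way to rule out parameter pairs whose peak sits in the wrong place.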
