Chapter 17
Software Size Growth:
Log-Normal Distribution
Log-Normal Processes
Software grows in the development cycle. Software metrics, namely, size, effort,
defects, and reliability, all manifest growth of software.
Growth is a multiplicative process. This contrasts sharply with the additive
process of gambling.
Growth of sites on the Web, growth of organisms in biology and ecology, growth
of fatigue cracks in semiconductors, growth of Web pages on the Internet, growth
of pollutants in the atmosphere, growth of cancer in people, growth of corrosion in
metals, growth of phone traffic in a communication network, and growth of words
are examples of multiplicative processes.
Growth is inadequately represented by the bell curve. A different curve, namely,
the log-normal distribution, first used in 1836 (see Box 17.1), represents such
growth events adequately. The log-normal distribution is built on a simple
premise: the logarithms of skewed data will be normally distributed without skew;
taking logarithms removes skew. The logarithmic scale is often used to present
complex nonlinear data in a simplified linear form (see Box 17.2). Logarithmic
transformation of observations allows us to apply the familiar properties of the bell
curve to the transformed data.
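The premise can be checked with a small simulation. The sketch below is illustrative only: the growth factor range, the number of steps, and the helper names `grow` and `skewness` are assumptions made here, not values from the chapter. It multiplies random positive factors together and compares the skewness of the resulting sizes with the skewness of their logarithms.

```python
import math
import random
import statistics


def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def grow(rng, steps=20):
    """One multiplicative growth path: start at 1, multiply by random factors."""
    size = 1.0
    for _ in range(steps):
        size *= rng.uniform(0.8, 1.3)  # hypothetical growth factor per step
    return size


rng = random.Random(17)  # fixed seed so the run is repeatable
sizes = [grow(rng) for _ in range(5000)]
logs = [math.log(s) for s in sizes]

raw_skew = skewness(sizes)
log_skew = skewness(logs)
print(f"skewness of raw sizes: {raw_skew:.2f}")
print(f"skewness of log sizes: {log_skew:.2f}")
```

Under these assumptions the raw sizes come out clearly right skewed, while the skewness of the logarithms stays close to zero, because the logarithm turns the product of factors into a sum.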
Consider software design complexity, which is relatively skewed when
compared with the bell curve. We analyzed the NASA data [1] on the design
complexities of 505 modules written in the C language.
268 Simple Statistical Methods for Software Engineering
NASA software defect data sets have been made publicly available and
extensively used by researchers.
The data may be viewed in the box plot provided in Figure 17.1.
In the box plot, it may be seen that the design complexity data are right skewed,
with several outliers. (A box plot is a good data visualizer; it produces a rich picture
of data. Further information about reading a box plot is available in Chapter 4.)
Having seen the picture of raw data, we can choose to take logarithms of data
and examine the result to see how data have been transformed, in particular, how
well data have been unskewed, by logarithms. We plot the logarithms of data in a
box plot in Figure 17.2.
The new box plot has noteworthy and serious differences. Comparing Figure
17.2 with Figure 17.1 reveals two consequences of the transformation:
1. The box has become symmetric.
2. There is a drastic reduction in the number of outliers.
The new box plot is a better fit. It is as if the data, after transformation, have
found their destination pattern.
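The reduction in outliers can be reproduced numerically. The sketch below does not use the NASA data; it draws a stand-in sample of 505 log-normal values (using the parameters this chapter reports for Ln(x), 0.771 and 0.896) and counts the points that fall outside the box plot whiskers. The whisker rule of 1.5 times the interquartile range is the common box plot convention and is an assumption here, as is the helper name `count_outliers`.

```python
import math
import random
import statistics


def count_outliers(values):
    """Count points beyond the whiskers (1.5 x IQR beyond the quartiles)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for v in values if v < low or v > high)


rng = random.Random(505)
# Stand-in sample, NOT the NASA data: 505 log-normal draws.
data = [rng.lognormvariate(0.771, 0.896) for _ in range(505)]
logged = [math.log(v) for v in data]

outliers_raw = count_outliers(data)
outliers_log = count_outliers(logged)
print("outliers before taking logarithms:", outliers_raw)
print("outliers after taking logarithms: ", outliers_log)
```

With skewed data of this kind, the count after the transformation is expected to be much smaller than the count before it.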
Box 17.1 The First Appearance of Log-Normal
It began with geometric mean. If a variable can be thought of as the multipli-
cative product of some positive independent random variables, then it could
be modeled as log-normal.
The basic properties of the log-normal distribution were discussed long ago,
in 1836, by Weber [2]. McAlister described the log-normal distribution around
1879. Kapteyn and Van Uven, in 1916, gave a graphical method of estimating
the parameters; the log-normal distribution was found to represent accurately
the distribution of critical dose for several drugs; this was also
the first time that the log-normal distribution was applied in real life.
Figure 17.1 Design complexity data. (Box plot; horizontal axis: design complexity.)
The two box plot patterns can be expressed as mathematical curves. First, we
can fit the data to a normal distribution. Fitting the symmetrical normal
distribution is a rather mechanical force fit. The data have the following normal
parameters:
Mean = 3.592
Standard deviation = 5.447
To obtain the parameters for the log-normal distribution, in the most commonly
used format, we must estimate the mean and standard deviation of the natural
logarithms of the data. Thus, we obtain 0.771 and 0.896, respectively.
Normal and log-normal curves generated by Excel functions NORM.DIST
and LOGNORM.DIST based on the two sets of above parameters are plotted in
Figure 17.3.
The normal curve represents a traditional and habitual treatment, and the
log-normal curve represents a modern and theory-driven treatment of design
complexity data. It may be seen that the log-normal curve is in closer agreement
with the box plot of data shown in Figure 17.1.
Figure 17.2 Natural logarithms of design complexity data. (Box plot; horizontal axis: Ln design complexity.)
Figure 17.3 Normal versus log-normal distributions of design complexity. (Curves: normal, log-normal; horizontal axis: design complexity.)
The log-normal curve represents an engineering truth missed by the bell curve.
The normal curve is a misfit; it has an odd negative tail that has no practical
meaning, and it presents a misleading peak. The log-normal curve does not go
negative, has the right-skewed tail, and places the peak correctly.
Software design is not a Gaussian process; it is a log-normal process.
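The misfit can be quantified from the parameters reported above. The sketch below uses only the Python standard library and the four numbers quoted in this chapter (3.592 and 5.447 for the raw data, 0.771 and 0.896 for Ln(data)); the helper name `lognormal_cdf` and the use of the mode as "the peak" are choices made here for illustration.

```python
import math
from statistics import NormalDist

# Parameters reported in the chapter for the design complexity data.
normal_fit = NormalDist(mu=3.592, sigma=5.447)  # fit to the raw data
log_fit = NormalDist(mu=0.771, sigma=0.896)     # fit to Ln(data)

# Share of the fitted normal curve that lies below zero complexity.
normal_mass_below_zero = normal_fit.cdf(0.0)


def lognormal_cdf(x):
    """P(X <= x) for the log-normal fit; no mass at or below zero."""
    return log_fit.cdf(math.log(x)) if x > 0 else 0.0


normal_peak = normal_fit.mean                               # mode of a normal = mean
lognormal_peak = math.exp(log_fit.mean - log_fit.variance)  # mode = exp(mu - sigma^2)

print(f"normal mass below zero:     {normal_mass_below_zero:.3f}")
print(f"log-normal mass below zero: {lognormal_cdf(0.0):.3f}")
print(f"normal peak at x = {normal_peak:.2f}")
print(f"log-normal peak at x = {lognormal_peak:.2f}")
```

With these parameters, roughly a quarter of the area under the fitted normal curve lies below zero, where design complexity cannot exist, while the log-normal curve places no area there and peaks at a much smaller complexity value.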
Building a Log-Normal PDF for
Software Design Complexity
If we substitute natural logarithms for x in the Gaussian PDF equation
(Equation 17.1), we obtain a log-normal PDF equation. The Gaussian PDF is:
\[
F(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{17.1}
\]
Accordingly, natural logarithms of data must be taken first, and then in
Equation 17.1, x must be replaced by Ln(x), μ must be replaced by the average of
Ln(x), and σ must be replaced by the standard deviation of Ln(x). However, the
equation will be in the logarithmic scale. Taking exponential of the results will
convert them to real-life units.
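The conversion back to real-life units can be made explicit with a change of variable. The following is a sketch, assuming that Ln(X) follows the Gaussian PDF of Equation 17.1; the symbol f_X, the density of X in real-life units, is introduced here only for illustration.

```latex
\begin{aligned}
P(X \le x) &= P(\ln X \le \ln x) = \int_{-\infty}^{\ln x} F(y, \mu, \sigma)\, dy \\
f_X(x) &= \frac{d}{dx} P(X \le x)
        = F(\ln x, \mu, \sigma)\, \frac{d \ln x}{dx}
        = \frac{1}{x}\, \frac{1}{\sqrt{2\pi}\,\sigma}\,
          e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}
\end{aligned}
```

The factor 1/x comes from the derivative of Ln(x); it rescales a density per unit of Ln(x) into a density per unit of x, which is why the same factor appears in front of the exponential in Equation 17.2.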
It may be noted that the Excel function LOGNORM.DIST takes its mean and
standard deviation arguments in the logarithmic scale but takes x and gives results
in the real domain. We do not have to take exponentials and go through a separate
conversion process, which is a great practical convenience.
Users have built their own versions of the log-normal PDF. The first choice we
need to make is the central value.
The log-normal PDF is built around the median, just as the Gaussian is built
around the mean.
Some versions of the log-normal use the geometric mean. In a typical log-normal
process, the median and the geometric mean are nearly equal. Using the geometric
mean is well justified by the fact that log-normal numbers are multiplicative
and tend to form a geometric series. In creating a log-normal PDF, the NIST
Engineering Statistics Handbook [3] proposes the following structure:
\[
f(x) = \frac{1}{x \sigma \sqrt{2\pi}}\, e^{-\frac{(\ln x - \beta)^2}{2\sigma^2}} \tag{17.2}
\]
where β is the scale factor and σ is the shape factor.
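Equation 17.2 can be coded directly. The sketch below is a minimal illustration, not the handbook's or Excel's implementation; the function name `lognormal_pdf` and the numerical area check are choices made here. It evaluates the density with the scale and shape values this chapter reports for the design complexity data and confirms that the area under the curve is close to 1.

```python
import math


def lognormal_pdf(x, beta, sigma):
    """Log-normal density; beta = scale (mean of Ln x), sigma = shape (sd of Ln x)."""
    if x <= 0:
        return 0.0  # the curve does not extend to zero or negative values
    z = (math.log(x) - beta) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


beta, sigma = 0.771, 0.896  # values reported in the chapter for Ln(design complexity)

# Numerical check: a simple Riemann sum of the density over (0, 200).
step = 0.001
area = sum(lognormal_pdf(i * step, beta, sigma) * step for i in range(1, 200_000))

for x in (0.5, 1, 2, 5, 10):
    print(f"f({x}) = {lognormal_pdf(x, beta, sigma):.4f}")
print(f"area under the curve on (0, 200): {area:.4f}")
```

For x greater than zero, this is the same quantity that the Excel call LOGNORM.DIST(x, mean, standard_dev, FALSE) is documented to return when mean and standard_dev are those of Ln(x).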
The parameters β and σ can be extracted by (1) the method of moments (MOM),
(2) the maximum likelihood method, and (3) the minimum χ² method. We
pursue the MOM in this chapter; hence, we use the following two relationships:
β is the scale factor = mean of Ln(x) (17.3)
σ is the shape factor = standard deviation of Ln(x) (17.4)
These relationships are inherent in the Excel function LOGNORM.DIST, as
we have seen while creating Figure 17.3. Methods 2 and 3 compute parameters by
iteration, and it is a good idea to use Equations 17.3 and 17.4 to generate initial
values that may help the subsequent iteration runs to converge faster.
Even manual techniques of parameter extraction begin with Equations 17.3 and
17.4. If we apply them to design complexity data, the scale and shape parameters
would become 0.771 and 0.896, the starting values.
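Equations 17.3 and 17.4 amount to two lines of computation. The sketch below is illustrative: the NASA data are not reproduced here, so a stand-in sample is drawn from a log-normal with known parameters, and the function name `lognormal_parameters` is an assumption made for this example.

```python
import math
import random
import statistics


def lognormal_parameters(data):
    """Return (scale, shape): the mean and standard deviation of Ln(x)."""
    logs = [math.log(x) for x in data]  # requires every x > 0
    return statistics.fmean(logs), statistics.stdev(logs)


# Stand-in sample, NOT the NASA data: drawn with known parameters so the
# estimates can be compared with the values used to generate it.
rng = random.Random(2024)
sample = [rng.lognormvariate(0.771, 0.896) for _ in range(20000)]

scale, shape = lognormal_parameters(sample)
print(f"scale (mean of Ln x) = {scale:.3f}")
print(f"shape (sd of Ln x)   = {shape:.3f}")
```

The estimates should land close to the generating values of 0.771 and 0.896. Note that the logarithm is undefined for zero or negative observations, so such values would have to be handled before applying this step.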
The NIST suggestion becomes a valuable option: the scale parameter may be
taken as Ln(Median(x)) instead of the mean of Ln(x). The scale parameter by the
NIST option will be 0.693 instead of the standard 0.771. This is based on the logic
that log-normal distributions are centered on the median, and we need not search
for the scale parameter iteratively.
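The two scale values can be related back to real-life units. The value 0.693 is numerically Ln(2), which suggests (an inference made here, not a statement from the text) that the median design complexity is 2. The sketch below uses a hypothetical toy sample with a median of 2 and the helper name `scale_from_median`, both introduced only for illustration.

```python
import math
import statistics


def scale_from_median(data):
    """Median-based scale parameter: Ln(Median(x))."""
    return math.log(statistics.median(data))


# Hypothetical toy sample whose median is 2 (not the NASA data).
toy = [1, 1, 1, 2, 2, 3, 4, 7, 19]
scale_nist = scale_from_median(toy)

# Central values implied by the two scale parameters quoted in the text.
median_implied = math.exp(0.693)          # from the median-based scale
geometric_mean_implied = math.exp(0.771)  # from the mean of Ln(x)

print(f"Ln(median) for the toy sample: {scale_nist:.3f}")
print(f"exp(0.693) = {median_implied:.3f}")
print(f"exp(0.771) = {geometric_mean_implied:.3f}")
```

The exponential of the mean of Ln(x) is the geometric mean of the data, so the gap between 0.693 and 0.771 reflects how far the sample median and the sample geometric mean differ.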
Working with a Pictorial Approach
Let us now consider a graphical way of connecting data with a mathematical
distribution. We can construct and use a histogram, known for its pattern
extraction capabilities. Such a histogram of design complexity data is shown in
Figure 17.4.
The histogram has extracted a distinctive pattern, with well-defined and clearly
discernible features: mode (peak), shape, and tail. These graphical features provide
guidance in the choice of a sensitive log-normal parameter: the shape factor. Using
graphical matching, we can select the most appropriate from a set of design
complexity log-normal curves.
A set of log-normal curves is given in Figure 17.5, with four sets of log-normal
parameters as follows:
1. Shape 0.7, scale 0.5
2. Shape 0.896, scale 0.771 (obtained by MOM)
3. Shape 1.1, scale 1.0
4. Shape 1.3, scale 1.4
These curves have been obtained iteratively by perturbing the parameter values
around an initial value: the second pair of parameters, with a shape of 0.896 and a
scale of 0.771, obtained using MOM.
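The features that graphical matching relies on can be tabulated for the four candidate curves. The sketch below is an illustration under the standard log-normal relationships (mode = exp(scale − shape²), median = exp(scale)); the function name `lognormal_pdf` and the layout of the `summary` list are choices made here, and no histogram data are used.

```python
import math

# (shape, scale) pairs listed in the text for the four candidate curves.
candidates = [(0.7, 0.5), (0.896, 0.771), (1.1, 1.0), (1.3, 1.4)]


def lognormal_pdf(x, scale, shape):
    """Log-normal density for x > 0; scale = mean of Ln(x), shape = sd of Ln(x)."""
    z = (math.log(x) - scale) / shape
    return math.exp(-0.5 * z * z) / (x * shape * math.sqrt(2.0 * math.pi))


summary = []
for shape, scale in candidates:
    mode = math.exp(scale - shape ** 2)  # x position of the peak
    median = math.exp(scale)             # half the area lies below this x
    peak = lognormal_pdf(mode, scale, shape)
    summary.append((shape, scale, mode, median, peak))
    print(f"shape={shape:<5} scale={scale:<5} mode={mode:.3f} "
          f"median={median:.3f} peak height={peak:.3f}")
```

Comparing each curve's peak position, peak height, and median with the corresponding features of the histogram gives a numerical counterpart to picking the best-matching curve by eye.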