150 Simple Statistical Methods for Software Engineering
Process Signature
Our humanity rests upon a series of learned behaviors, woven
together into patterns that are infinitely fragile and never directly
inherited.
Margaret Mead
Process performance data can be converted into histogram signatures. These
process signatures represent process characteristics. They are used to manage processes.
1. Process stability: A histogram can be used to test process stability. Stable
processes produce histograms with a single peak.
2. Mode: A histogram reveals the process mode.
3. Multiple peaks: If data come from a mixture of several processes, the
histogram shows multiple peaks.
4. Cluster analysis: If data have natural clusters, the histogram shows them.
5. Outliers: A histogram readily shows outliers.
6. Natural boundary: A histogram reveals natural process boundaries that can be
used in goal setting.
The histogram in Figure 10.5 is an example of a process signature. It captures the
way time to repair is managed in projects and presents a broad summary of historic
performance.
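The signature properties listed above can be read off a simple frequency count. The following is a minimal sketch, assuming fixed-width bins; the function names `histogram` and `mode_bin` and the sample repair times are hypothetical and are not the data behind Figure 10.5.

```python
from collections import Counter

def histogram(values, bin_width, start=0.0):
    """Count values into fixed-width bins; returns {bin_index: count}."""
    counts = Counter()
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return dict(counts)

def mode_bin(counts, bin_width, start=0.0):
    """Return (low_edge, high_edge) of the most frequent bin."""
    idx = max(counts, key=counts.get)
    low = start + idx * bin_width
    return (low, low + bin_width)

# Hypothetical repair times in hours (illustrative only).
repair_hours = [1.2, 1.5, 1.8, 2.1, 2.3, 2.4, 2.7, 2.9, 3.3, 3.6, 4.2, 6.8]
counts = histogram(repair_hours, bin_width=1.0)

# Print a text histogram: one '#' per observation in each 1-hour bin.
for idx in range(min(counts), max(counts) + 1):
    print(f"{idx:>2}-{idx + 1:<2} h | {'#' * counts.get(idx, 0)}")
print("mode bin:", mode_bin(counts, bin_width=1.0))
```

In this toy data set the single peak, the right tail, and the isolated value near 7 hours are all visible in the printed bars, which is the kind of reading the list describes.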
BOX 10.2 HISTORY OF HISTOGRAMS
The word “histogram” is of Greek origin, as it is a composite of the words
“istos” (= “mast”) and “gram-ma” (= “something written”). Hence, it should
be interpreted as a form of writing consisting of “masts,” i.e., long shapes
vertically standing, or something similar. The term “histogram” was coined by the
famous statistician Karl Pearson to refer to a common form of graphical
representation. Histograms were used long before they received their name, but their
birth date is unclear. It is clear that histograms were first conceived as a visual
aid to statistical approximations. Bar charts most likely predate histograms and
this helps us put a lower bound on the timing of their first appearance.
Yannis Ioannidis
Department of Informatics and Telecommunications, University of Athens
Pattern Extraction Using Histogram 151
This is a typical experience in fixing high-priority bugs without any SLA
constraint. The team treats high-priority bugs with the utmost earnestness and tries to ship
the fix as early as possible. The histogram is skewed and has a thick tail on the right side,
so there is a real probability that the repair time will be high.
Beaumont [3] shows a more disciplined histogram for time to repair. The
dispersion is far less. Barkman et al. [4] present histograms for 16 selected metrics in
open source projects. (Data have been collected from 150 distinct projects with
over 70,000 classes and over 11 million lines of code.) This is really a gallery of
histogram signatures; entries range from well-behaved symmetrical histograms to
extremely skewed ones. The signature structures typically represent the metrics.
These patterns are more or less the same across the entire IT industry.
Uniqueness of Histogram Signature
Histograms are true signatures.
Process histograms reflect people.
Product histograms reflect design.
People leave their signatures in their deliveries. The uniqueness of histograms can
be used to advantage. The Software Engineering Institute (SEI) has presented, in several
of its communications, a series of histograms that change with the maturity of the
organization from level 2 to level 5. A process histogram is a signature of an organization's
maturity. Well-constructed histograms with the right metrics can serve as more
precise signatures for comparing and predicting performance.
Figure 10.5 Process signature histogram of time to repair. (Horizontal axis: time to repair in hours, 1 to 7; vertical axis scaled 0 to 45.)
A brilliant case in point is the video signature of Liu et al. [5]:
The explosive growth of information technology and digital content
industry stimulates various video applications over the Internet. Duplicate
detection and measurement is essential to identify the excessive content
duplication. There are approximately two or three duplicate videos among
the ten results on the first web page. Finding visually similar content is
the central theme in the area of content-based image retrieval; histogram
distributions of similar videos are with much likeness, while the dissimilar
ones are completely different. The video histogram is used to represent the
distributions of videos’ feature vectors in the feature space. This approach
is both efficient and effective for web video duplicate detection.
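The idea that similar items have similar histograms can be illustrated with a generic similarity score. The sketch below uses histogram intersection, which is a common measure chosen here for illustration and is not claimed to be the method of Liu et al. [5]; the function names and the three toy histograms are hypothetical, and it assumes equal-length histograms with nonzero totals.

```python
def normalize(hist):
    """Scale bin counts so they sum to 1."""
    total = sum(hist)
    return [h / total for h in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: sum of bin-wise minima of two normalized histograms."""
    p, q = normalize(h1), normalize(h2)
    return sum(min(a, b) for a, b in zip(p, q))

# Toy feature histograms (illustrative counts only).
video_a = [10, 30, 40, 20]
video_b = [12, 28, 38, 22]   # near-duplicate of video_a
video_c = [60, 25, 10, 5]    # different content

print(round(histogram_intersection(video_a, video_b), 2))  # 0.96
print(round(histogram_intersection(video_a, video_c), 2))  # 0.5
```

A score near 1 flags a likely duplicate; the cutoff to use would have to be tuned on real data.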
Histogram Shapes
Histograms are empirical distributions (or density functions). They can be smoothed
by nonparametric methods, as is performed in machine intelligence algorithms.
Alternatively, they can be fitted to mathematical models.
BOX 10.3 DETECTING BRAIN TUMOR WITH HISTOGRAM
Brain cancer can be counted among the most deadly and intractable diseases.
Tumors may be embedded in regions of the brain forming more tumors too
small to detect using conventional imaging techniques. Malignant tumors
are typically called brain cancer. These tumors can spread outside of the brain.
Brain tumor detection is a serious issue in medical science. Imaging plays a
central role in the diagnosis and treatment planning of a brain tumor.
The image of the brain is acquired through MRI technique. If the
histograms of the images corresponding to the two halves of the brain are plotted,
a symmetry between the two histograms should be observed due to the
symmetrical nature of the brain along its central axis. On the other hand, if any
asymmetry is observed, the presence of the tumor is detected. After detection
of the presence of the tumor, thresholding can be done for segmentation of
the image. The differences of the two histograms are plotted and the peak of
the difference is chosen as the threshold point. Using this threshold point, the
whole image is converted into a binary image providing the boundary of the
tumor. The binary image is now cropped along the contour of the tumor to
calculate the physical dimension of the tumor. The whole of the work has
been implemented using MATLAB® 2010. (Kowar and Yadav [6])
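The steps described in the box (histograms of the two halves, their difference, the peak of the difference as threshold, then a binary image) can be sketched on a toy grid. This is an illustration only and not the MATLAB implementation of Kowar and Yadav [6]; the function names, the 8-level intensity scale, the toy image, and the choice of `>=` when binarizing are all assumptions.

```python
def intensity_histogram(pixels, levels=8):
    """Count how many pixels take each intensity 0..levels-1."""
    hist = [0] * levels
    for row in pixels:
        for p in row:
            hist[p] += 1
    return hist

def asymmetry_threshold(image, levels=8):
    """Return the intensity at which the left/right half histograms differ most."""
    mid = len(image[0]) // 2
    h_left = intensity_histogram([row[:mid] for row in image], levels)
    h_right = intensity_histogram([row[mid:] for row in image], levels)
    diff = [abs(a - b) for a, b in zip(h_left, h_right)]
    return max(range(levels), key=diff.__getitem__)

def binarize(image, threshold):
    """Mark pixels at or above the threshold with 1, all others with 0."""
    return [[1 if p >= threshold else 0 for p in row] for row in image]

# Toy 4x6 "scan" with intensities 0..7; a bright patch (6) sits only on the right.
image = [
    [1, 2, 1, 1, 2, 1],
    [2, 1, 2, 2, 6, 6],
    [1, 2, 1, 1, 6, 6],
    [2, 1, 2, 2, 1, 2],
]
threshold = asymmetry_threshold(image)
mask = binarize(image, threshold)
print("threshold:", threshold)
for row in mask:
    print(row)
```

On this toy input the two halves differ most at intensity 6, and the resulting mask isolates the 2x2 bright patch whose size could then be measured.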
The shape of a histogram can help in deciding suitable mathematical equations.
Skewed histograms suggest lognormal, exponential, or Pareto distributions.
The length of a histogram tail contains the final clues. Symmetrical histograms
suggest normal distribution. Left tails suggest Gumbel minimum distribution.
Abrupt right tails suggest Gumbel maximum distribution. (The previously mentioned
mathematical distributions are described in Section II of this book.) These
descriptions refer to the 16 histograms presented by Barkman [7], which can be
visually mapped to well-known probability distributions. A visual selection of the
best-suited equation is a valuable low-cost alternative to complex techniques
for model building.
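The visual rule of thumb above can be approximated numerically with the sample skewness. The following is a rough sketch, not a substitute for goodness-of-fit testing: the function names and the cutoff of 0.5 are assumptions, and the finer distinctions based on tail length (including the Gumbel maximum case) are not encoded.

```python
def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def candidate_distributions(values, tolerance=0.5):
    """Map skewness to the candidate families named in the text.

    The tolerance of 0.5 is an assumed cutoff, not a value from the source.
    """
    g1 = sample_skewness(values)
    if g1 > tolerance:       # long right tail
        return ["lognormal", "exponential", "Pareto"]
    if g1 < -tolerance:      # long left tail
        return ["Gumbel minimum"]
    return ["normal"]        # roughly symmetrical

print(candidate_distributions([1, 2, 3, 4, 5]))
print(candidate_distributions([1, 1, 1, 2, 2, 3, 10]))
print(candidate_distributions([-10, -3, -2, -2, -1, -1, -1]))
```

The shortlist only narrows the choice; the histogram itself should still be inspected before a model is fitted.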
Mixture
If a histogram has a second peak, it is known as bimodal. The second peak (or
cluster) may come from a mixture of data from two processes. For example, a
productivity histogram may exhibit two peaks. Each peak may correspond to one
programming language when the data mix several languages. Alternatively, the better peak
in the productivity histogram may come from a different team performing with
higher skill levels, and that is the case with the histogram shown in Figure 10.6.
The way histograms reveal mixtures is very helpful.
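A mixture can be flagged automatically by counting local maxima in the bin counts. This is a minimal sketch, assuming already-binned data; the function name `find_peaks`, the `min_height` parameter, and the counts are hypothetical and are not read from Figure 10.6. Flat-topped peaks are not detected, and noisy histograms would need smoothing first.

```python
def find_peaks(counts, min_height=1):
    """Indices of bins strictly higher than both neighbours (edges padded with 0)."""
    padded = [0] + list(counts) + [0]
    return [
        i - 1
        for i in range(1, len(padded) - 1)
        if padded[i] >= min_height
        and padded[i] > padded[i - 1]
        and padded[i] > padded[i + 1]
    ]

# Hypothetical productivity counts for 5-LOC-per-day bins starting at 30.
bin_start, bin_width = 30, 5
productivity_counts = [5, 20, 60, 35, 15, 30, 80, 40, 10]

peaks = find_peaks(productivity_counts)
print("number of peaks:", len(peaks))
for i in peaks:
    low = bin_start + i * bin_width
    print(f"peak in bin {low}-{low + bin_width} LOC per day")
```

Two reported peaks suggest that the data should be split by source (team, language, or process) before any single summary statistic is quoted.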
Process Capability Histogram
Character is expressed through our behavior patterns, or
natural responses to things.
Joyce Meyer
When process results are presented as a histogram, it is customary to mark
the upper specification limit (USL) and the lower specification limit (LSL) on it.
With the USL and LSL marked, the chart is called a process capability histogram.
It enables us to check whether the process peak is on target and whether the process
variation is within the limits. These two are the criteria for a capable process. Process capability
indices can be calculated along with process risk.
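The capability indices and the process risk can be computed from the sample mean, the standard deviation, and the two limits. The sketch below uses the standard textbook formulas and the process data reported in Figure 10.7 (LSL 0.08, USL 0.5, sample mean 0.146854, SD within 0.0870325, SD overall 0.104562); the function names are hypothetical, and the risk estimate assumes normally distributed data.

```python
import math

def capability_indices(mean, sd, lsl, usl):
    """Standard capability ratios: spread vs. limits and distance to nearer limit."""
    lower = (mean - lsl) / (3 * sd)
    upper = (usl - mean) / (3 * sd)
    return {
        "overall_ratio": (usl - lsl) / (6 * sd),  # Cp (or Pp with overall SD)
        "lower": lower,                            # CPL (or PPL)
        "upper": upper,                            # CPU (or PPU)
        "min": min(lower, upper),                  # Cpk (or Ppk)
    }

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_ppm(mean, sd, lsl, usl):
    """Expected parts per million outside each limit, assuming normality."""
    below = normal_cdf((lsl - mean) / sd) * 1e6
    above = (1.0 - normal_cdf((usl - mean) / sd)) * 1e6
    return below, above

mean, lsl, usl = 0.146854, 0.08, 0.5
within = capability_indices(mean, 0.0870325, lsl, usl)
overall = capability_indices(mean, 0.104562, lsl, usl)

print({k: f"{v:.2f}" for k, v in within.items()})   # Cp 0.80, CPL 0.26, CPU 1.35, Cpk 0.26
print({k: f"{v:.2f}" for k, v in overall.items()})  # Pp 0.67, PPL 0.21, PPU 1.13, Ppk 0.21
print(expected_ppm(mean, 0.104562, lsl, usl))
```

The low Cpk relative to Cp shows that the process is off target toward the lower limit rather than simply too variable, which matches the large share of expected results below LSL.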
Figure 10.6 Bimodal histogram of productivity. (Horizontal axis: productivity, LOC per day, 30 to 115; vertical axis: frequency, 0 to 120.)
(Histogram with LSL, Target, and USL marked; horizontal axis from -0.1 to 0.5.)

Process data
  LSL                 0.08
  Target              0.3
  USL                 0.5
  Sample mean         0.146854
  Sample N            42
  SD (within)         0.0870325
  SD (overall)        0.104562

Potential (within) capability
  Cp                  0.80
  CPL                 0.26
  CPU                 1.35
  Cpk                 0.26

Overall capability
  Pp                  0.67
  PPL                 0.21
  PPU                 1.13
  Ppk                 0.21
  Cpm                 0.36

Observed performance
  PPM < LSL           285,714.29
  PPM > USL           0.00
  PPM total           285,714.29

Exp. overall performance
  PPM < LSL           261,288.77
  PPM > USL           365.88
  PPM total           261,654.65

Figure 10.7 Process capability of defects/test case.