Chapter 16

Lessons Learned from Software Analytics in Practice

Ayse Bener*; Ayse Tosun Misirli†; Bora Caglayan*; Ekrem Kocaguneli‡; Gul Calikli§

* Mechanical and Industrial Engineering, Ryerson University, Toronto, ON, Canada
† Faculty of Computer and Informatics, Istanbul Technical University, Istanbul, Turkey
‡ Microsoft, Seattle, WA, USA
§ Department of Computing, Open University, Milton Keynes, UK

Abstract

In this chapter, we share our experience and views on software data analytics in practice with a review of our previous work. In more than 10 years of joint research projects with industry, we have encountered similar data analytics patterns in diverse organizations and in different problem cases. We discuss these patterns following a “software analytics” framework: problem identification, data collection, descriptive statistics, and decision making. In the discussion, our arguments and concepts are built around our experiences of the research process in six different industry research projects in four different organizations.

Methods: Spearman rank correlation, Pearson correlation, Kolmogorov-Smirnov test, chi-square goodness-of-fit test, t test, Mann-Whitney U test, Kruskal-Wallis analysis of variance, k-nearest neighbor, linear regression, logistic regression, naïve Bayes, neural networks, decision trees, ensembles, nearest-neighbor sampling, feature selection, normalization.

Keywords

Software analytics framework

Industry research projects

Data extraction

Descriptive statistics

Predictive analytics

Prescriptive analytics

16.1 Introduction

Modern organizations operate in an environment shaped by the advances of information and communication technologies. The economic growth in highly populated countries such as China, India, and Brazil forces organizations from traditionally strong economies in Europe, North America, and Japan to compete globally for resources as well as markets for their goods and services. This poses an unprecedented need for organizations to be efficient and agile [1]. Meanwhile, today’s ubiquitous communication and computing technologies generate vast amounts of data in many forms and at volumes orders of magnitude greater than what was available even a decade ago. The ability of organizations of all sizes to gain actionable insights from such data sources, not only to optimize their use of limited resources [2] but also to help shape their future, is now the key to survival. Those who can beat their competitors in the effectiveness with which they handle their informational resources are the ones likely to go beyond survival and enjoy sustained competitiveness. Software development organizations follow the same trend.

Most of the management decisions in software engineering are based on people’s perceptions of the state of the software and their estimations of its future states. Some of these decisions concern resource allocation, team building, budget estimation, and release planning. As the complexity of software systems and the interactions between increasing numbers of developers have grown, the need for data-driven decision making has emerged to solve common problems in the domain, such as completing projects on time, within budget, and with minimum errors. Software projects are inherently difficult to control. Managers must make many decisions under considerable uncertainty. They would like to be confident in the product, the team, and the processes. There are also many blind spots that may cause severe problems at any point in the project development life cycle. These concerns have drawn much attention to software measurement, software quality, and software cost/effort estimation—namely, descriptive and predictive analytics.

Data science is vital for software development organizations because of a paradigm shift around the many kinds of data generated by development teams. We essentially need to develop intuition about data, and this is not just the statistics (statistical modeling, fitting, simulation) that we learned in school. Data science spans analytics that use data to understand the past and present (descriptive analytics), to analyze past performance in order to anticipate future outcomes (predictive analytics), and to recommend actions through optimization techniques (prescriptive analytics).

Software analytics is one of those unique fields that lies at the intersection of academia and practice. Software analytics research is empirically driven. Unlike traditional research methods, researchers study and learn from the data to build useful analytics. To produce insightful results for industrial practice, researchers need to use industrial data and build analytics.

Software analytics must follow a process that starts with problem identification—that is, framing the business problem and the analytics problem. Throughout this process, stakeholder agreement needs to be obtained through effective communication. The end goal of a software analytics project should be to address a genuine problem in the industry. Therefore, the outcome of the analytics project should be transferred and embedded into the decision making process of the organization. Sometimes, software analytics provides additional insights that stakeholders do not expect or imagine it can provide, and in such cases, the results could influence more than one phase of the software development process.

Data collection in software organizations is a complicated procedure. It requires identification and prioritization of data needs and available sources depending on the context. Qualitative and quantitative data collection methods are different depending on the problem. In some cases, data acquisition requires tool development first, followed by harmonizing, rescaling, cleaning, and sharing data within the organization. In discussing data collection, we will cover potential complications and data scaling issues.

The simplest way to obtain insight from data is through descriptive statistics. Even a simple analysis or visualization may reveal hidden facts in the data. Later, predictive models utilizing data mining techniques may be built to aid the decision making process in an organization. Selection of the suitable data mining techniques depending on the problem is critical when building the predictive model. Factors such as clarity of the problem, maturity of the data collection process in a company, and expected benefits from the predictive model may also affect model construction. In addition, insights gained from descriptive statistics may be used in the construction of predictive models. Finally, new predictive models could be evaluated, which requires the definition of certain performance measures, appropriate statistical tests, and effect size analysis.

In our software analytics projects, we have used a method adapted from a typical data mining process (problem definition, data gathering and preparation, model building and evaluation, knowledge deployment [3]) to define the main steps we performed. Figure 16.1 depicts the five main phases, the results of which can be used directly by practitioners or fed into subsequent phases. The process does not necessarily end after a single cycle; it is preferable to conduct several iterations in a software analytics project, making adjustments depending on the outcome of each phase.

Figure 16.1 Our method in software analytics projects.

In this chapter, we define each of these five phases in Figure 16.1 on the basis of our experience in state-of-the-art case studies concerning various software organizations. We provide example techniques, tools, and charts that we used to collect, analyze, and model data, as well as to optimize and deploy the proposed models in different industry settings, in order to address two main challenges: software defect prediction and effort estimation. In the following sections, we define the objectives of each phase in a software analytics project, share our experience on how to conduct each phase, and provide practical tips to address potential issues. All examples and lessons learned come from our empirical studies rigorously conducted with industry partners. We do not provide the details of our past projects (e.g., the reasons behind our using a particular algorithm or a new metric, pseudocodes for model building, all performance measures, etc.) in this chapter because of space limitations. We refer the reader to our articles listed at the end of this chapter for such information.

16.2 Problem Selection

Problem selection in research differs depending on the nature of the research [4]. In the natural and social sciences, the researcher/scientist chooses a topic that he/she is simply curious about. In the natural sciences, the researcher aims to understand the natural phenomena to build the theoretical basis for prediction, and this becomes the basis for invention or engineering. In the social sciences, the researcher aims to understand human and societal phenomena to build the theoretical basis for prediction and intervention, and this becomes the basis for changing the world to create therapy, education, policy, motivation, etc. Both have a rigorous experimental basis. In the natural sciences, theories have to be testable, and testing is done in a physical world, which provides hard constraints on theories. In the social sciences, on the other hand, theories have to be testable, and testing is done in a behavioral world, which provides probabilistic constraints.

Empirical software engineering, compared with the natural sciences and social sciences, is still an immature field. It suffers from a lack of understanding of empirical problems, issues, possibilities, and research designs. Software engineering is the study and practice of building software systems. It is an experimental field that uses various forms of experiments to test the theory (i.e., requirements) and its model (i.e., implementation). These experiments include independent and dependent variables, manipulations, data collection, and data analysis.

Software engineering is a very rich field where we have systems that execute programs and processes, people who design and use programs, and people who use and execute processes. This richness enables software engineering research and practice to cut across other disciplines, such as anthropology to understand families of systems, sociology to understand systems in context (i.e., relationships among systems as centralized, distributed, networked, etc.), social psychology to understand individual systems (i.e., component interaction), and personal psychology to understand individual system characteristics. People and processes can also be mapped to these disciplines—for example, anthropology to understand projects and disciplines, sociology to understand interactions among teams or projects, social psychology to understand interactions of people and technology, and personal psychology to understand traits and dispositions of developers and managers. These intersections enable researchers to observe and abstract specific parts of the world and create a theory. From that theory they can create a usable model to represent that theory. It then becomes an iterative process to adjust both the theory and the model as they evolve [5]. When the researcher is satisfied, he/she integrates the model into the world—that is, the existing environment with processes, systems, and technologies. Integrating the model into the world changes the current world, and this leads to adjustments and extensions to the original theory. This then leads to further changes in the model and the world.

A good question on which to conduct research is one that is testable with the materials at hand. A good question in empirical software engineering research is not only the one that the researcher pragmatically investigates, but it is also a relevant one that addresses a genuine issue in the daily life of software organizations. Every day, many decisions need to be made in software organizations, such as determining what users want/need, assessing architecture and design choices, evaluating functional characteristics or nonfunctional properties, evaluating and comparing technologies (i.e., supporting tools, product support, process support), determining what went wrong, and allocating resources effectively. Some of these challenges have also been addressed by process improvement (e.g., [6]) and data management models (e.g., [7]).

In many software development organizations, there is little evidence to inform decisions. Empirical studies are key to showing software professionals the fundamental mechanisms of software tools, processes, and development techniques, and to eliminating alternative explanations in decision making. An empirical study is a study that reconciles theory and reality. In order to conduct more valid empirical studies, researchers need to establish principles that are causal (correlated, testable theory), actionable, and general. They need to answer important questions rather than focusing on a nice solution and trying to map that solution to a problem that does not exist.

In the problem selection step, we need to ask the right or important questions. It is important to incorporate the domain expert/practitioner into this step. We may ask different types of questions: existence, description and classification, composition, relationship, descriptive-comparative, causality, causality-comparative, and causality-comparative interaction questions [8]. We need to make sure that the hypotheses are consistent with the research questions, since these will be the basis of the statement of the problem. Failing to ask the right questions and inconsistencies in hypothesis formulation also lead to methodological pitfalls such as hypothesis testing errors, issues of statistical power, issues in the construction of dependent/independent variables, and reliability and validity issues [4].

Research, especially in software engineering, is all about balancing originality and relevance. Therefore, before starting the research, the researcher should always ask the following question: “What is the relevance of this research outcome for practice?” In one case study, we jointly identified the problem with the software development team during the project kickoff meeting [9]. The development team initially listed their business objectives (improve code quality, decrease defect rates, and measure the time to fix defects), while the researchers aligned the goals of the project with them (building a code measurement and version control repository, and a defect prediction model). To achieve these goals, the roles and responsibilities were defined for both sides (practitioners and researchers), and the expectations and potential outputs of the analytics project were discussed and agreed on. Furthermore, it was decided that any output produced during the project (e.g., metrics from data extraction, charts from descriptive analytics) would be assessed together before moving to the next step.

In some cases, the business need may be defined too generally or ambiguously, and therefore, the researchers should conduct further analysis to identify and frame the analytics problem. For instance, in a case study concerning a large software organization, the team leaders initially defined their problem as “measuring release readiness of a software product” [10]. The problem was too general for the researchers, since it may indicate building an analytics for (a) measuring the “readiness” of a software release in terms of the budget, schedule, or prerelease defect rates, or for (b) deciding which features could be deployed in the next release, or for (c) measuring the reliability of a software release in terms of residual (postrelease) defects. Through interviews with junior to senior developers and team leaders, the business need was redefined as “assessing the final reliability of a software product prior to release” in order to decide on the amount of resources that would be allocated for the maintenance (bug fixing) activities. We took this business need and transformed it into an analytics problem as follows: building a model that would estimate the residual (postrelease) defect density of the software product at any time during a release. At any given time during a release, the model could learn from the metrics that were available, and could predict the residual defects in the software product. To do that, we further decided on the input of the model as software metrics from requirements, design, development, and testing phases. The details of the model can be found in Ref. [10].

16.3 Data Collection

In our software analytics projects, we initially design the dataset required for a particular problem. Afterward, we extract the data on the basis of our initial data design through quantitative and qualitative techniques. In this section we describe these steps with guidelines for practitioners based on our past experience.

16.3.1 Datasets

16.3.1.1 Datasets for predictive analytics

Most organizations want to predict the number of defects or identify defect-prone modules in software systems before they are deployed. Numerous statistical methods and artificial intelligence (AI) techniques have been employed in order to predict defects that a software system will reveal in operation or during testing.

One part of the approach in empirical software engineering research requires one to focus on the relevance of the outcome for practice. Therefore, one of our main goals has so far been to find solutions to the problems of industry by improving the software development life cycle and hence software quality. We have used predictive analytics to catch defects for large companies and small- and medium-sized enterprises specialized in various domains, such as telecommunications, enterprise resource planning, banking systems, and embedded systems for white-goods manufacturing. We collected data from industry and used these datasets in our research in addition to publicly available datasets from the Metrics Data Program repository for NASA datasets and the PROMISE data repository. Moreover, we donated all datasets we collected from industry to the PROMISE data repository, which consists of publicly available data for reusable empirical software engineering experiments [11].

The data employed in our studies consist of measurable attributes of software development, which are the objects of measurement (e.g., products, processes, and resources) according to the goal-question-metric approach [12]. Refining software metrics and their interactions using the goal-question-metric approach is a good practice to ensure data quality, and it has been employed by various researchers who propose metrics sets to quantify aspects related to the software development life cycle [13]. According to the information quality framework, some of the features of data collection and analysis that ensure high information quality are data resolution, data structure, data integration, temporal relevance, and communication [14]. While collecting data from industrial projects and using publicly available datasets, we focused on these features. Data resolution refers to the measurement scale and the aggregation level of data. Regarding data resolution, we decided on the data aggregation levels relative to the goal we would like to achieve. For instance, we collected metrics (e.g., static code and churn) at method-level granularity or aggregated data to file-level granularity in order to eliminate noise whenever necessary. Data structure relates to the type(s) of data (e.g., numerical, nonnumerical) and data characteristics such as corrupted and missing values because of the study design or data collection mechanism. We employed various techniques to handle the missing data problem [15]. Data integration refers to the need to integrate multiple data sources and/or data types. In terms of data integration, we used multiple data types (e.g., static code metrics, churn metrics, social interaction metrics, people-related metrics) to achieve better defect prediction performance. Temporal relevance is related to the temporal durations of the “data collection,” “data analysis,” and “study deployment” processes, and the gaps between these three processes. To address the temporal relevance of data, we kept the data collection and data analysis periods as short as possible in order to avoid uncontrollable transitions that might be disruptive. Moreover, we tried not to leave time gaps between the data collection and data analysis periods. To achieve information quality, we also improved our communication with software practitioners and other researchers. During the interpretation of the data, which we collected from industry projects, we conducted interviews with software engineers to gain more insight into the data. This helped us to perform more meaningful data analyses. We also shared our findings with software practitioners and discussed with them how these analysis results might help to improve their software development process.

In our previous industry collaborations, we mostly used static code metrics to build predictive analytics. The static code metrics consist of the McCabe, lines of code, Halstead, and Chidamber-Kemerer object-oriented (CK OO) metrics, as shown in Table 16.1.

Table 16.1

Static Code Metrics

Attribute | Description
McCabe metrics
Cyclomatic complexity (v(G)) | Number of linearly independent paths
Cyclomatic density (vd(G)) | The ratio of the file’s cyclomatic complexity to its length
Decision density (dd(G)) | Condition/decision
Essential complexity (ev(G)) | The degree to which a file contains unstructured constructs
Essential density | (ev(G) − 1)/(v(G) − 1)
Maintenance severity | ev(G)/v(G)
Lines of code metrics
Total lines of code | Total number of lines in source code
Blank lines of code | Total number of blank lines in source code
Lines of commented code | Total number of lines consisting of code comments
Lines of code and comment | Total number of source code lines that include both executable statements and comments
Lines of executable code | Total number of the actual code statements that are executable
Halstead metrics
n1 | Unique operands count
n2 | Unique operators count
N1 | Total operands count
N2 | Total operators count
Level (L) | (2/n1)/(n2/N2)
Difficulty (D) | 1/L
Length (N) | N1 + N2
Volume (V) | N × log2(n), where n = n1 + n2
Programming effort (E) | D × V
Programming time (T) | E/18


McCabe metrics, which are one collection of static code metrics, provide a quantitative basis for estimating code complexity from the decision structure of a program [16]. The idea behind McCabe metrics is that the more structurally complex the code becomes, the more difficult it is to test and maintain, and hence the more likely it is to contain defects. Descriptions of McCabe metrics and the relationships between them are given in Table 16.1.
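For a rough intuition of what these counts involve, the following sketch approximates a McCabe-style cyclomatic complexity, v(G), for a Python function by counting decision points in its abstract syntax tree. It is a simplified illustration only; it is not the commercial measurement tooling used in our studies.

```python
# Rough approximation of McCabe cyclomatic complexity for Python source:
# v(G) = number of decision points + 1 for a single-entry, single-exit module.
import ast

DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                  ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    tree = ast.parse(source)
    decisions = sum(isinstance(node, DECISION_NODES) for node in ast.walk(tree))
    return decisions + 1

print(cyclomatic_complexity("def f(x):\n    return 1 if x > 0 else -1"))  # -> 2
```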

Lines of code metrics are simple measures that can be extracted from the code. These include, but are not limited to, total lines of code, blank lines of code, lines of commented code, lines of code and comment, and lines of executable code metrics. A description of these metrics is given in Table 16.1.

Halstead metrics, which are also listed in Table 16.1, measure a program module’s complexity directly from source code, with emphasis on computational complexity [17]. These metrics were developed as a means of determining complexity directly from the operators and operands in the module. The rationale behind Halstead metrics is that the harder the code is to read, the more defect-prone the modules are.

McCabe, lines of code, and Halstead metrics were developed with traditional (procedural) methods in mind; hence, they do not capture object-oriented notions such as classes, inheritance, encapsulation, and message passing. Chidamber and Kemerer developed object-oriented metrics (i.e., CK OO metrics) to measure unique aspects of the object-oriented approach [18]. CK OO metrics are listed in Table 16.2, together with their definitions.

Table 16.2

Chidamber-Kemerer Object-Oriented (CK OO) Metrics

Attribute | Description
Weighted methods per class | Sum of the complexity of the methods in a class
Depth of inheritance tree | The maximum length from the class node to the root of the tree of class inheritance
Number of children | Number of immediate subclasses subordinated in the class hierarchy
Coupling between object classes | Count of the number of other classes to which a class is coupled
Response for a class | The union of the set of methods called by each method in a class
Lack of cohesion in methods | The union of the set of instance variables used by each method in a class

Some researchers have criticized the use of static code metrics to learn about defect predictors because of their limited information content [19, 20]. However, static code metrics are easy to collect and interpret. In our early research, we used static code metrics to build defect prediction models for a local white-goods manufacturing company [21] and for software companies specialized in enterprise resource planning software and banking systems [22]. In the best case, the defect prediction rate and the false positive (FP) rate were 82% and 33%, respectively, whereas in the worst case, they were 82% and 47%, respectively. These results are better than the results obtained for manual code reviews, which are currently used in industry. Moreover, manual code inspections are quite labor intensive: depending on the review method, 8-20 lines of code can be inspected per minute per reviewer, and a review team typically consists of four or six members [23].

To improve prediction performance, we proposed a “weighted naïve Bayes” technique, which assigns relevance weights to static code attributes according to their importance and thereby improves defect prediction performance [24]. We employed eight different machine learning techniques, mostly derived from attribute ranking techniques, in order to estimate the weights for the static code attributes. The heuristics we used to assign relevance weights to the static code attributes are principal component analysis, information gain, the gain ratio, Kullback-Leibler divergence, the odds ratio, log probability, exponential probability, and cross entropy. Detailed information on these machine learning techniques can be found in Ref. [24]. Our proposed method yielded at least equivalent performance, and in some cases better performance, than the best defect predictor available at the time [25]. Furthermore, our proposed heuristics have linear-time computational complexity, whereas choosing the optimal subset of attributes requires an exhaustive search in the attribute space.
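To make the weighting idea concrete, the sketch below shows a Gaussian naïve Bayes classifier in which each static code attribute's log-likelihood is scaled by a relevance weight before summation. It is a minimal illustration, not the implementation from Ref. [24]; using scikit-learn's mutual information ranking as the weighting heuristic is an assumption made here for brevity.

```python
# Minimal sketch of a weighted naive Bayes defect predictor: per-attribute
# Gaussian log-likelihoods are multiplied by relevance weights (e.g., derived
# from an information-gain-style ranking) before they are summed.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

class WeightedGaussianNB:
    def fit(self, X, y, weights):
        self.classes_ = np.unique(y)
        self.weights_ = weights
        self.priors_, self.means_, self.vars_ = {}, {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = len(Xc) / len(X)
            self.means_[c] = Xc.mean(axis=0)
            self.vars_[c] = Xc.var(axis=0) + 1e-9   # avoid division by zero
        return self

    def predict(self, X):
        scores = []
        for c in self.classes_:
            log_lik = -0.5 * (np.log(2 * np.pi * self.vars_[c])
                              + (X - self.means_[c]) ** 2 / self.vars_[c])
            scores.append(np.log(self.priors_[c])
                          + (log_lik * self.weights_).sum(axis=1))
        return self.classes_[np.argmax(np.column_stack(scores), axis=1)]

# Usage sketch, assuming X holds static code metrics and y holds defect labels:
#   weights = mutual_info_classif(X, y)
#   model = WeightedGaussianNB().fit(X, y, weights / weights.max())
#   predictions = model.predict(X_new)
```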

In another study, we reduced the probability of false alarms by supplementing static code metrics with a call-graph-based ranking (CGBR) framework [26, 27]. Call graphs can be used to trace the software code module by module. Each node in a call graph represents a software module, and an edge (a, b) indicates that module a calls module b. Our CGBR framework is inspired by the PageRank algorithm of Page and Brin [28], which is used by Web search engines to compute the most relevant results of a search by ranking webpages. We adopted this ranking method for software modules. We hypothesize that if a module is frequently used and the developers/testers are aware of that, they will be more careful in implementing and testing that module, whereas defects in less frequently used modules are more likely to go undetected and can be caught only with thorough testing. In one of our studies, where we also used the CGBR framework in order to increase the information content of prediction models, defect prediction performance improved for large and complex systems, while for small systems, prediction models without the CGBR framework achieved the same prediction performance [29].
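The core of the CGBR idea can be sketched in a few lines: rank modules with a PageRank-style algorithm over the call graph and use the resulting rank as an additional attribute next to the static code metrics. The module names and the use of the networkx library below are illustrative assumptions, not our original tooling.

```python
# PageRank-style ranking of software modules over a call graph, in the spirit
# of the CGBR framework. An edge (a, b) means "module a calls module b".
import networkx as nx

call_graph = nx.DiGraph()
call_graph.add_edges_from([
    ("ui.login", "auth.check"),
    ("ui.login", "db.query"),
    ("auth.check", "db.query"),
    ("report.render", "db.query"),
])

module_rank = nx.pagerank(call_graph, alpha=0.85)
# Frequently called modules (here, db.query) receive higher ranks; the rank can
# be appended to each module's metric vector before training the predictor.
for module, rank in sorted(module_rank.items(), key=lambda kv: -kv[1]):
    print(f"{module}: {rank:.3f}")
```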

During the research project we conducted for Turkey’s largest GSM operator/telecommunications company, we built prediction models using 22 projects and 10 releases of one of the company’s major software products [9]. When the research project started, defects were not matched with files in the issue management system. Because of their workload and other business priorities, developers could not allocate extra time to record all the defects they fixed during the testing phase. In addition, matching those defects with the corresponding software files could not be handled automatically. Therefore, the prediction models could not be trained using company data. Instead, similar projects from cross-company data were selected by using nearest-neighbor sampling. Projects from the Metrics Data Program repository for NASA were selected as cross-company data. Following this study, the resulting defect prediction model was deployed into the company’s software development process.
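A minimal sketch of this kind of nearest-neighbor sampling is given below: for each local module, the k most similar cross-company modules are kept, and their union forms the training set. The scikit-learn implementation and k = 10 follow the spirit of Ref. [62] but are illustrative assumptions rather than the exact procedure used in the project.

```python
# Nearest-neighbor sampling of cross-company (CC) data for training a
# defect predictor when local defect labels are scarce.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_filter(cc_X, cc_y, local_X, k=10):
    nn = NearestNeighbors(n_neighbors=k).fit(cc_X)
    _, idx = nn.kneighbors(local_X)      # k nearest CC rows for every local row
    keep = np.unique(idx.ravel())        # union of all selected CC rows
    return cc_X[keep], cc_y[keep]        # reduced cross-company training set
```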

In another research project, we also used within-company data (i.e., data from the same company but from different projects) in order to build defect predictors for embedded systems software for white goods [30]. During this research project, we mixed within-company data with cross-company data to train prediction models whenever within-company data was limited. Complementing cross-company data with within-company data yielded better performance results.

There is still considerable room for improvement in prediction performance (i.e., obtaining lower false alarm rates and a higher probability of detection). Researchers are actively looking for better code metrics which, potentially, will yield “better” predictors [13, 31, 32]. For this purpose, in one of our studies we used churn metrics as well as static code metrics and the CGBR framework in order to build defect predictors for different defect categories [33]. According to the results we obtained, churn metrics performed best for predicting all types of defects. Code churn is a measure of the amount of code change taking place within a software unit over time. Churn often propagates across dependencies: if a component C1, which has dependencies with component C2, changes (churns) a lot between versions, we expect component C2 to undergo a certain amount of churn in order to keep in sync with component C1. Together, a high degree of dependence plus churn can cause errors that propagate through a system, reducing its reliability. A list of churn metrics is given in Table 16.3, and a small extraction sketch follows the table.

Table 16.3

Churn Metrics

Attribute | Description
Commits | Number of commits made for a file
Committers | Number of committers who committed a file
CommitsLast | Number of commits made for a file since the last release
CommittersLast | Number of developers who committed a file since the last release
rmlLast | Number of lines removed from a file since the last release
alLast | Number of lines added to a file since the last release
rml | Number of lines removed from a file
al | Number of lines added to a file
TopDevPercent | Percentage of top developers who committed a file
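As a minimal illustration, the sketch below mines two of the churn metrics in Table 16.3 (Commits and Committers per file) from a git log. The repository path and log format are assumptions for illustration; our studies used their own extraction tools against the companies' version management systems.

```python
# Count commits and distinct committers per file from a git history.
import subprocess
from collections import defaultdict

log = subprocess.run(
    ["git", "-C", "/path/to/repo", "log", "--name-only", "--pretty=format:@%an"],
    capture_output=True, text=True, check=True,
).stdout

commits = defaultdict(int)       # Commits: number of commits touching a file
committers = defaultdict(set)    # Committers: distinct authors touching a file
author = None
for line in log.splitlines():
    if line.startswith("@"):
        author = line[1:]        # a commit header produced by the format string
    elif line.strip():
        commits[line] += 1
        committers[line].add(author)

for path in sorted(commits, key=commits.get, reverse=True)[:10]:
    print(path, commits[path], len(committers[path]))
```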

The above prediction models ignore the causal effects of programmers and designers on software defects. In other words, the datasets used to learn about such defect predictors consist of product-related (static code) and process-related (churn) metrics, but not people-related metrics. On the other hand, people’s thought processes have a significant impact on software quality, as software is designed, implemented, and tested by people. In the literature, various people-related metrics have been used to build defect predictors, yet these are not directly related to people’s thought processes or their cognitive aspects [13, 34-37].

In our research, we focused on a specific human cognitive aspect—namely, confirmation bias [38-41]. On the basis of established theories in cognitive psychology, we defined a “confirmation bias metrics set.” Confirmation bias metrics quantify software engineers’ confirmatory behavior, which may result in overlooking defects, leading to an increase in software defect density [42] (i.e., the lower the confirmatory behavior of software engineers, the less likely they are to overlook defects in the software). We used confirmation bias metrics to learn about defect predictors [43]. The prediction performances of models learned from these metrics were comparable with those of models learned from static code and churn metrics. Confirmatory behavior is a single human aspect, yet the results obtained were quite promising. Our findings support the view that we should study human aspects further to improve the performance of defect prediction models.

In addition to individual metrics, metrics that quantify social interactions among software engineers (e.g., communication among developers in issue repositories through commenting on the same set of issues) provide datasets with enhanced information content. In the literature, there are other studies in which using social interaction networks to learn about defect predictors yielded promising performance results [31, 32, 44, 45]. In addition to these empirical studies, we also investigated social interaction among developers [46], and adapted various metrics from the complex network literature [47]. We also used social network metrics for defect prediction with two open source datasets—namely, development data from IBM Rational Team Concert and Drupal [46]. The results revealed that, compared with other sets of metrics such as churn metrics, using social network metrics on these datasets either considerably decreases the high false alarm rates without compromising the detection rates or considerably increases the low prediction rates without compromising the low false alarm rates.
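As an illustration of how such social interaction metrics can be derived, the sketch below links developers who comment on the same issues and computes two standard network measures per developer. The issue and developer names are made up, and the use of networkx is an assumption for brevity.

```python
# Build a developer interaction network from shared issue comments and compute
# simple social network metrics (degree and betweenness centrality).
import itertools
import networkx as nx

issue_commenters = {
    "ISSUE-1": {"alice", "bob"},
    "ISSUE-2": {"alice", "carol", "dave"},
    "ISSUE-3": {"bob", "carol"},
}

g = nx.Graph()
for commenters in issue_commenters.values():
    g.add_edges_from(itertools.combinations(sorted(commenters), 2))

degree = dict(g.degree())                     # number of peers a developer interacts with
betweenness = nx.betweenness_centrality(g)    # how often a developer bridges other pairs
print(degree)
print(betweenness)
```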

We have been building defect prediction models for large software development companies and small- to medium-sized enterprises specialized in domains such as enterprise resource planning, finance, online banking, telecommunications, and embedded systems. On the basis of our experience, we recommend the following road map to decide on the content of the dataset that will be collected to build defect prediction models:

1. Start with static code metrics. Although static code metrics (e.g., McCabe, lines of code, and Halstead metrics) have limited information content, they are easy to collect and use. They can be automatically and cheaply collected even for very large systems. Moreover, they give an idea about the quality of the code, and they can be used to decide which modules are worthy of manual inspection.

2. Try to enhance defect predictors that are learned from static code metrics. Try methods such as weighting static code attributes or employing the CGBR framework in order to improve the prediction performance of the models that are built using static code attributes.

3. Do not give up on static code metrics even if you do not have defect data. In some software development companies, defects may not be stored during development activities. There might not even be a process to match the defects with the files in order to keep track of the reasons for any change in the software system. Matching defects manually is not feasible because of the heavy workload of developers. One possible way forward is to call emergency meetings with the software engineers and senior management in order to convince them to change their existing code development process. As a change in the development process, developers might be required to check source code in to the version management system with a unique ID for the test defect or requirement request. Using these unique IDs in the commit logs, one can then identify which changes in the source code are intended to fix defects and which implement new requirement requests. Adoption of the process change will take time; moreover, collection of adequate defect data will also take some time. While defect data is being collected, cross-company data can be used to build defect predictors. Within-company data that has been collected so far can be used in combination with cross-company data until enough local data has been collected. The methods mentioned in step 2 can be used with cross-company data, and with cross-company and within-company data as well.

4. Enhance defect prediction using churn metrics and/or social network metrics. Churn metrics can be automatically collected from the logs of version management systems. Automated collection of social interaction metrics is also possible through mining issue management systems, version management systems, and developers’ e-mails, which they send to each other to discuss software issues and new features. After static code, churn, and social interaction metrics have been extracted, defect prediction models can be built by employing all combinations of these three different types of metrics (i.e., static code metrics; churn metrics; social interaction metrics; static code and churn metrics; static code and social interaction metrics; churn and social interaction metrics; and static code, churn, and social interaction metrics). On the basis of the performance comparison results of these models, it can be decided which models will be used to identify defect-prone parts of the software.

5. Include metrics related to individual human aspects. It is much more challenging to employ individual human metrics to build defect prediction models. Forming the metrics set and defining a method to collect metric values requires interdisciplinary research, spanning fields such as cognitive and behavioral psychology in addition to traditional software engineering. Moreover, one needs to face the challenges of qualitative empirical studies, which are discussed in detail in Section 16.3.2. Because of these challenges, we employed confirmation bias metrics to build prediction models only at later stages of the field studies we conducted at software companies.

16.3.1.2 Datasets for effort estimation models

While building models for effort estimation, we used both datasets that we prepared by collecting data from various local software companies in Turkey and publicly available datasets such as the COCOMO database. Among the datasets we collected are the SDR datasets [48-51]. We used the COCOMO II Data Collection Questionnaire in order to collect the SDR datasets, which consist of 24 projects implemented in the first decade of this century. An exemplary dataset is shown in Table 16.4 in order to give an idea about the content and format of the datasets we used for software effort estimation; a small calculation sketch follows the table. Each row in Table 16.4 corresponds to a different project. These projects are represented by the nominal attributes from the COCOMO II model along with their size in terms of lines of code and the actual effort expended to complete the projects. We also collected datasets from two large Turkish software companies, which are specialized in the telecommunications domain and the online banking domain, respectively [52, 53].

Table 16.4

An Example Dataset for Effort Estimation

Project | Nominal Attributes (as Defined in COCOMO II) | Lines of Code | Effort
P1 | 1.00, 1.08, 1.30, 1.00, 1.00, 0.87, 1.00, 0.86, 1.00, 0.70, 1.21, 1.00, 0.91, 1.00, 1.08 | 70 | 278
P2 | 1.40, 1.08, 1.15, 1.30, 1.21, 1.00, 1.00, 0.71, 0.82, 0.70, 1.00, 0.95, 0.91, 0.91, 1.08 | 227 | 1181
P3 | 1.00, 1.08, 1.15, 1.30, 1.06, 0.87, 1.07, 0.86, 1.00, 0.86, 1.10, 0.95, 0.91, 1.00, 1.08 | 177.9 | 1248
P4 | 1.15, 0.94, 1.15, 1.00, 1.00, 0.87, 0.87, 1.00, 1.00, 1.00, 1.00, 0.95, 0.91, 1.00, 1.08 | 115.8 | 480
P5 | 1.15, 0.94, 1.15, 1.00, 1.00, 0.87, 0.87, 1.00, 1.00, 1.00, 1.00, 0.95, 0.91, 1.00, 1.08 | 29.5 | 120
P6 | 1.15, 0.94, 1.15, 1.00, 1.00, 0.87, 0.87, 1.00, 1.00, 1.00, 1.00, 0.95, 0.91, 1.00, 1.08 | 19.7 | 60
P7 | 1.15, 0.94, 1.15, 1.00, 1.00, 0.87, 0.87, 1.00, 1.00, 1.00, 1.00, 0.95, 0.91, 1.00, 1.08 | 66.6 | 300
P8 | 1.15, 0.94, 1.15, 1.00, 1.00, 0.87, 0.87, 1.00, 1.00, 1.00, 1.00, 0.95, 0.91, 1.00, 1.08 | 5.5 | 18
P9 | 1.15, 0.94, 1.15, 1.00, 1.00, 0.87, 0.87, 1.00, 1.00, 1.00, 1.00, 0.95, 0.91, 1.00, 1.08 | 10.4 | 50
P10 | 1.15, 0.94, 1.15, 1.00, 1.00, 0.87, 0.87, 1.00, 1.00, 1.00, 1.00, 0.95, 0.91, 1.00, 1.08 | 14.6 | 0
P11 | 1.00, 0.00, 1.15, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00 | 16 | 114
P12 | 1.15, 0.00, 1.15, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00 | 6.5 | 42
P13 | 1.00, 0.00, 1.15, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00 | 13 | 60
P14 | 1.00, 0.00, 1.15, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00 | 8 | 42

t0025

The publicly available datasets which we used in our research projects were obtained from the PROMISE data repository [11] and the International Software Benchmarking Standards Group, which is a nonprofit organization [54].

Project managers can benefit from the learning-based effort estimation models we have used when allocating resources among different software projects as well as among different phases of the software development life cycle. Moreover, the outputs of such effort estimation models can guide project managers when they are deciding whether a new software development project should be launched. Using our own datasets together with publicly available datasets helped us form cross-domain datasets (i.e., datasets from different application domains) in addition to within-domain datasets (i.e., datasets from a similar application domain). From our experience, we recommend that practitioners build effort estimation models by using projects from different application domains rather than only projects from a similar application domain [55]. Analogy-based models, which are widely used for effort estimation, assume the availability of project data that are similar to the project data at hand, which can be difficult to obtain. In other words, there may be no similar projects in-house, and obtaining data from other companies may be restricted because of confidentiality. Our proposed framework, which uses cross-domain data, suggests that it is not necessary to account for particular characteristics of a project (i.e., its similarity with other projects) while constructing effort estimation models [55].
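To illustrate the analogy-based idea mentioned above, the sketch below estimates a new project's effort as the mean effort of its k most similar historical projects. The feature values are placeholders, and the scikit-learn k-nearest-neighbor regressor is an assumption for brevity; in practice the attributes should be normalized so that no single column dominates the distance computation.

```python
# Analogy-based effort estimation: predict effort from the k nearest
# historical projects in the attribute space.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# rows: historical projects; columns: normalized project attributes (placeholders)
X_hist = np.array([[1.00, 1.08, 1.30, 0.70],
                   [1.15, 0.94, 1.15, 0.30],
                   [1.15, 0.94, 1.15, 0.66],
                   [1.00, 0.95, 1.15, 0.16]])
effort_hist = np.array([278.0, 120.0, 300.0, 114.0])   # person-months

model = KNeighborsRegressor(n_neighbors=2).fit(X_hist, effort_hist)
new_project = np.array([[1.10, 1.00, 1.15, 0.40]])
print(model.predict(new_project))   # mean effort of the two closest analogies
```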

16.3.2 Data Extraction

The validity of the results obtained in software analytics projects is limited by the quality of the data extracted. For this reason, accurate data extraction is one of the most important phases in software analytics projects. In every research project, it is important to design the data extraction phase carefully.

The first step during data extraction is choosing the right requirements for the data that will be extracted. If data requirements change after the data extraction phase, all the results of the study might change. Similarly, rescheduling extensions to surveys done with industry might be time-consuming.

Automation of the data extraction steps, when possible, reduces the extraction effort in the long term. In addition, tracking the history of the data extraction scripts helps in repeating older experiments when necessary. Over the years, we have employed several techniques for quantitative and qualitative data extraction for a given problem [56, 57]. In certain scenarios, we partly automated the qualitative techniques for efficiency. In this section, we describe these techniques and how we customized them for our problems.

16.3.2.1 Quantitative data extraction

Quantitative data can be either continuous data or discrete data. Quantitative data can be extracted from different software artifacts, including the source code and issue repositories. Similarly, quantitative techniques can be applied to postprocess the qualitative data extracted in surveys and questionnaires.

Source code repositories are arguably the primary data sources in software analytics projects. They can be used to extract the state of a project at a given time snapshot or to track the evolution of the software project over time. The evolution of the software in the source code repositories may be linear or may resemble a directed acyclic graph, depending on the project's development and branching practices. For projects with multiple development branches, we have usually focused on the main development branch to keep the project history linear and to allow easier cross-project comparison [58].

For defect prediction, source code repositories may be mined to extract static code, churn, and collaboration metrics. Extraction of the static code metrics is dependent on the programming language. For extraction and storage of static code metrics, we have built tools over the years to avoid rework [57, 59]. In the case of effort estimation, software repositories may be mined to extract attributes related to the size of the software.

Issue management software is used to track the issues related to software. It is the main repository of the quality assurance operations in a software project. We have used issue management software to map the defects to the underlying source code modules and to analyze the issue handling processes of organizations [60, 61].

Defect data may be used to label the defect-prone modules for defect prediction or to assess software quality in terms of defect density or defect count. The ideal way to map defects to source code is to associate source code changes with the corresponding issues. By mapping the issues to the source code directly, we can confidently label the defective source code modules; for some of our partners, this strategy has been used efficiently. Where such explicit links are not recorded, text mining is necessary to map issues to changes in the source code as accurately as possible. In our studies, we have frequently used the change set (commit) messages to identify the defect-prone modules [56]. For projects in which even the commit messages are not available, manual inspection of the changes with a team member may be attempted as a last resort for defect matching.
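A minimal sketch of this kind of commit message mining is shown below: a change set is labeled as defect fixing when its message matches simple fix-related keywords, and any referenced issue IDs are extracted. The patterns are illustrative; real projects require project-specific commit conventions.

```python
# Label defect-fixing change sets and extract issue IDs from commit messages.
import re

FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|defect|bug)\b", re.IGNORECASE)
ISSUE_ID_PATTERN = re.compile(r"\b[A-Z]+-\d+\b")   # e.g., "PROJ-1234" (hypothetical ID scheme)

def label_commit(message):
    issue_ids = ISSUE_ID_PATTERN.findall(message)
    is_bug_fix = bool(FIX_PATTERN.search(message))
    return is_bug_fix, issue_ids

print(label_commit("Fixed null check in parser, closes PROJ-512"))
# -> (True, ['PROJ-512'])
```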

These defect data extraction methods can be used with within-company defect data to train defect prediction models. In addition, Turhan et al. [62] proposed methods for using cross-company data sources in cases where defect data is completely unavailable for individual software projects.

A few questions to be answered when choosing the time span for data extraction are as follows:

1. What is the specific software development method employed by the organization for the given project? The practical implementation of a particular method is always different from its theoretical definition.

2. Which releases and which projects should be used in the analytics project? Data quality may differ among projects or releases. In addition, project characteristics such as size may change the outcome of experiments.

3. What is the estimated quality of the data? It is usually beneficial to check the quality of the data with the quality assurance teams. Identifying problematic parts in the data early may help researchers avoid rework.

Evaluation of qualitative results from surveys and questionnaires can also be done with quantitative methods. For extraction of metrics from qualitative data, digitizing the survey results to programmer-friendly formats (e.g., plain text, JSON, and XML) as early as possible would be useful.

Storage of metrics is a complicated task, especially for datasets that may be expanded in the future. Such datasets can be reused in future work with minimal effort if they are stored properly. The simplest method is to store the data as a plain text file. For multidimensional data, we suggest storing the data in a lightweight database such as SQLite. An SQLite database is stored as a single file, which avoids the trouble of setting up a full database server and makes it easy to track the history of the database. Every major language frequently used by software analysts, such as R, Python, and MATLAB, has native clients for SQLite. SQLite scales easily to data sizes of 1-10 GB in total [63].
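A minimal sketch of this storage approach with Python's built-in sqlite3 module follows; the table and column names are illustrative.

```python
# Store extracted file-level metrics in a single-file SQLite database.
import sqlite3

conn = sqlite3.connect("metrics.sqlite")
conn.execute("""CREATE TABLE IF NOT EXISTS file_metrics (
                    release TEXT, file_path TEXT, loc INTEGER,
                    cyclomatic REAL, commits INTEGER, defective INTEGER)""")
conn.execute("INSERT INTO file_metrics VALUES (?, ?, ?, ?, ?, ?)",
             ("2.1", "src/auth/login.c", 412, 9.0, 17, 1))
conn.commit()

for row in conn.execute("SELECT file_path, loc FROM file_metrics WHERE defective = 1"):
    print(row)
conn.close()
```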

Data quality, missing data, and data sparsity are three common problems we have encountered frequently. We had to modify our data extraction method for each of these problems, and over the years we have used several techniques to reduce the noise in our data and to address them. For example, in one project we had trouble extracting cognitive bias metrics from all of the developers; we used a linear algebra approximation method, matrix factorization, to complete the missing values in that case [42]. One of our partners had several problems in keeping the defect data for its projects because of the lack of maturity of its measurement process. In its case, we had to train our model with cross-company data sources. Finding the right training data for this particular project was a serious problem for us, so we evaluated sampling strategies to pick the right amount of data in our experiments [64].
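As a toy illustration of the matrix completion idea, the sketch below fits a low-rank factorization to the observed entries of a small metric matrix by gradient descent and fills the missing cells with the reconstructed values. It is only meant to convey the concept; it is not the method used in Ref. [42].

```python
# Complete missing metric values with a rank-k matrix factorization fitted
# only on the observed entries (NaN marks a missing value).
import numpy as np

rng = np.random.default_rng(0)
M = np.array([[3.0, 1.0, np.nan],
              [2.0, np.nan, 4.0],
              [np.nan, 1.5, 3.5]])
mask = ~np.isnan(M)

k, lr, reg = 2, 0.02, 0.01
U = rng.normal(scale=0.1, size=(M.shape[0], k))
V = rng.normal(scale=0.1, size=(M.shape[1], k))

for _ in range(5000):
    err = np.where(mask, M - U @ V.T, 0.0)   # only observed cells drive the update
    U += lr * (err @ V - reg * U)
    V += lr * (err.T @ U - reg * V)

completed = np.where(mask, M, U @ V.T)       # missing cells filled with predictions
print(completed.round(2))
```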

16.3.2.2 Qualitative data extraction

Qualitative data is any form of feature set that cannot be expressed numerically [65]. Qualitative data can be extracted through questionnaires and surveys. We refer the reader to social science books for a thorough introduction to this area [65].

Some input for software analytics models cannot be extracted directly from software repositories. We have extracted qualitative data for our models in many cases. For example, for modeling people, we have used standardized tests to measure the cognitive biases of developers [38, 43]. In these tests, we interpreted the free-text answers manually. We have also used questionnaires to assess process maturity in projects for reliability assessment models. In addition, we have used postmortem surveys to check the actual benefit of our analytics projects to software companies [56]. In these surveys, we found that, even for successful projects, within-organization adoption of the analytics model may change over time unless there are vocal advocates.

It is costly to change qualitative data extraction methods after the initiation of data extraction. Therefore, the design of the qualitative data extraction methods is more time-consuming than that of the quantitative data extraction methods.

16.3.2.3 Patterns in data extraction

Over the years, we have internally developed certain practices for extracting the relevant software data. Here is a short list of practices we recommend for software analysts on the basis of our experience:

1. Do not use proprietary data formats such as Microsoft Excel spreadsheets to store data since manual access to the data is time-consuming. Although these formats can be read through libraries, long-term availability of these formats is not certain. In addition, binary formats make tracking the difference between versions very hard.

2. Track the version of the extracted data and the scripts.

3. Store your parameters for data extraction for each run. It is possible to forget a set of parameters that provided a particular outcome. Keeping the parameters stored in addition to the analytics output would help to overcome this problem. For this goal, separating the parameters from the source code would be helpful.

4. Look for possible factors that may introduce noise into the data. Both internal and external factors might introduce noise into the data. You can control internal factors easily by checking your data model and scripts. On the other hand, tracking the external factors is trickier. Certain parts of the projects might be missing.

5. Cross-check possible problems in the extracted data with a quality assurance team member as early as possible.

6. All of the data extraction and analysis tasks and code should be executable with a single script. Redoing some operations may be impractical in some cases, but if there is some problem with the data extraction method, the researcher saves a lot of time using this practice. Remembering all the custom parameters and script names on every reevaluation is time-consuming.

16.4 Descriptive Analytics

The simplest way to gain insight into data is through descriptive statistics. Even a simple correlation analysis or visualization of multiple metrics may reveal hidden facts about the development process, development team, or data collection process. For example, a graphical representation of commits extracted from a version control system would reveal the distribution of workload among developers—that is, what percentage of developers actively develop software on a daily basis [53], or it would highlight which components of a software system are frequently changed. On the other hand, a statistical test between metrics characterizing issues that were previously fixed and stored in an issue repository may identify the reasons for reopened bugs [60] or reveal the issue workload among software developers [61]. Depending on the questions that are investigated, we can collect various types of metrics from software repositories; but it is inappropriate to use any statistical technique or visualization approach without considering the data characteristics (e.g., types and scales of variables, distributions). In this section, we present a sample of statistical techniques and visualizations that were used in our previous work and explain how they were selected.

16.4.1 Data Visualization

The easiest way to gain an understanding of the data collected from rich and complex software repositories is through visualization. Data visualization is a mature domain, with significant amounts of literature on charts, tables, and tools that can be used for visually displaying quantitative information [66]. In this section, we provide examples of the use of basic charts such as box plots, scatter diagrams, and line charts in our previous analytics projects to better understand the software data.

Box plots are helpful for visualization as they show the shape of the distribution, the central value (median), the minimum and maximum values, and the quartiles, and give clues about whether the data is skewed or contains noise—that is, outliers. Figure 16.2 illustrates a box plot used to visualize the distribution of the number of active issues owned by developers across different categories (defects found during functional testing, during system testing, and in the field) in a case study concerning a large software organization [61]. During this study, we used box plots to inform developers that the functional testing category dominates the issues owned by developers, and that it has a slightly higher median than the system testing and field categories. Furthermore, outliers in each set (e.g., 48 functional testing issues are owned by a single developer) highlight potential noisy instances, or the dominance of certain developers regarding issue ownership (Figure 16.3).

Figure 16.2 A box plot of issues owned by developers and their categories for a commercial software system. FT functional testing, ST system testing (from Ref. [61]).
Figure 16.3 A line chart of activity in the issue repository of the Android system (from Ref. [61]).
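A box plot like the one in Figure 16.2 can be produced with a few lines of matplotlib; the issue counts per developer below are made-up values used only to show the plotting pattern.

```python
# Box plot of active issues owned per developer, grouped by defect category.
import matplotlib.pyplot as plt

issues_per_developer = {
    "Functional testing": [3, 5, 7, 2, 9, 48, 4, 6],
    "System testing":     [2, 4, 3, 5, 1, 6, 2, 3],
    "Field":              [1, 2, 2, 4, 1, 3, 5, 2],
}

fig, ax = plt.subplots()
ax.boxplot(list(issues_per_developer.values()),
           labels=list(issues_per_developer.keys()))
ax.set_ylabel("Active issues owned per developer")
ax.set_title("Issue ownership by defect category")
plt.show()
```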

A scatter diagram is another visualization technique used to explore the relationship between two variables—that is, two metrics that you want to monitor. During an exploratory study on Eclipse releases [67], we used scatter diagrams to observe the trend between the number of edits of source files (first variable) and the number of days between edits (second variable) for three different file sets—that is, files with beta-release bugs, files with other types of bugs, and files with no bugs. Figure 16.4 shows the scatter diagrams for three file sets of an Eclipse release [67]. It is seen that there is no clear trend, such as a positive monotonic relationship, between the two variables. However, it is clear that files with beta-release bugs are concentrated in smaller regions, with relatively few edits made within short time periods. On the basis of these visualizations, we suggested that the developers should concentrate on those files that are not edited too frequently in order to catch beta-release bugs [67].

Figure 16.4 Three scatter diagrams for the number of edits and the time (in days) between edits for Eclipse source files containing beta-release bugs, other bugs, and no bugs (from Ref. [67]).

Other types of charts are also useful for visualizing software development, such as the effort distribution of developers [53], issue ownership, or developer collaboration [61]. For example, we identified the collaboration network of developers and the factors that affected the stability of the team working on a large, globally developed commercial software product [68]. We used line charts to visualize developer collaboration on the basis of the code that developers worked on together. In another study, we also used line charts to depict the number of issues owned and being fixed by developers on a monthly basis (see Figure 16.3). These charts are very easy to plot, and yet they are informative if they are monitored periodically—that is, every week/month or every sprint in agile practices, or per development team.

16.4.2 Reporting via Statistics

Descriptive statistics such as minimum and maximum data points and the mean, median, or variance of software metrics are computed prior to building analytics, since these statistics are informative about the distributional characteristics of metrics, their central values, and their variability. During an industry collaboration, it is common practice to compute these descriptive statistics (e.g., [56, 61, 69]) or use visualizations at the beginning of a software analytics project, whereas other techniques such as statistical tests are more useful in the next steps to reveal existing relationships between software metrics or to identify differences in the distributions of two or more software metrics.

Below, we summarize a set of statistical analysis techniques that were reported in our previous projects conducted with our industry partners. Detailed descriptions of these techniques and their computations can be found in fundamental books on statistics (e.g., [70, 71]); here we explain the benefits of using these methods on software metrics to identify unique patterns.

Correlation analysis. This type of analysis identifies whether there are statistical relationships between variables, and in turn, it helps to decide on a set of independent variables that would be the input of a predictive analytics. Correlations can be computed among software metrics or between metrics and a dependent variable, such as the number of production defects, the defect category, the number of person-months (to indicate development effort), and issue reopening likelihood. Correlations between two variables can be measured using two popular approaches: Pearson’s product-moment correlation coefficient and Spearman’s rank correlation coefficient.

Pearson’s correlation coefficient is a measure of the linear relationship between two variables, taking a value between 1 (strong positive linear relation) and − 1 (strong negative linear relation) [70]. The calculation of this coefficient is based on the covariance of the two variables, and there are different guidelines for interpreting the size of the coefficient depending on the problem being studied. For example, in empirical software engineering, a correlation coefficient above 75% may be interpreted as a strong relation between the variables, whereas in empirical studies in psychology a much smaller correlation coefficient of 37% may indicate a strong relation [72].

In a case study concerning a large software organization operating in the telecommunications industry, we studied the linear relationship between confirmation bias metrics representing developers’ thought processes and defect rates using Pearson’s correlation coefficient and found that there is a medium (21%) to strong (53%) relationship between the variables [43]. On the basis of correlation analysis, we subsequently filtered the metrics that were significantly correlated and used the resulting metric set to build a predictive analytics that estimates the number of defects of a software system.

Spearman’s rank correlation coefficient is a nonparametric measure of the statistical relationship between two variables [70]. Like Pearson’s correlation coefficient, it takes values between 1 and − 1, and it is appropriate for both continuous and ordinal variables. Unlike Pearson’s coefficient, Spearman’s coefficient is computed from the ranks of the raw values of the variables and is independent of the distributional characteristics of the two variables. Thus, the correlation can indicate any monotonic relationship, in contrast to the linear relationship captured by the Pearson correlation.
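Both coefficients are straightforward to compute with standard statistical libraries. The sketch below, on hypothetical paired observations of a code metric and defect counts, contrasts the two:

    from scipy import stats

    # Hypothetical paired observations: a code metric and post-release defect counts.
    metric = [12, 7, 30, 22, 15, 9, 41, 18]
    defects = [1, 0, 4, 3, 2, 1, 6, 2]

    pearson_r, pearson_p = stats.pearsonr(metric, defects)      # linear association
    spearman_r, spearman_p = stats.spearmanr(metric, defects)   # monotonic (rank-based)

    print(f"Pearson  r   = {pearson_r:.2f} (p = {pearson_p:.3f})")
    print(f"Spearman rho = {spearman_r:.2f} (p = {spearman_p:.3f})")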

In a study that investigated the experience of software developers and testers, company size, and the reasoning skills of software developers and testers in four medium-sized to large software organizations [41], we computed Spearman’s rank correlation coefficient to observe the relationship between the number of bugs reported by testers and production defects, and the relationship between developers’ reasoning skills and the defect-proneness of the code. The analyses showed that there is a significant positive relation (rank coefficient 0.82) between prerelease and production defects. We concluded that testers may report more bugs than developers can fix before each release, and hence, as more bugs are reported, the number of production defects increases.

In another case study concerning a large software development organization, we used Spearman correlation to analyze the relationship between metrics that characterize reopened issues (issues closed and opened again during an issue life cycle) [60]. We found a strong statistical relationship between the lines of code changed to fix a reopened issue and the dependencies of a reopened issue on other issues—that is, the higher the proximity of an issue to other issues, the more lines of code are affected during its fix.

In summary, although these results may sometimes look trivial, they are very influential for software teams and development leads, since they reveal hidden facts about the development process in an organization and/or confirm assumptions that developers usually make in their daily decision making. On the other hand, these types of correlation analysis show only whether there is a statistical relationship between two variables and the strength of that relationship. To further analyze what type of relationship (e.g., linear or nonlinear) exists between the two variables, scatter diagrams can be used or other statistical tests (e.g., hypothesis testing) may be applied. Below, we provide examples of some of these tests that were used in our previous work.

Goodness-of-fit tests. These tests can be used to compare a sample (distribution) against another distribution or another sample that is (assumed to be) generated from the same distribution. Similarly to correlation computation, there are several tests measuring the goodness of fit, and the appropriate one should be chosen depending on the number of instances in a sample, the number of datasets that will be compared, and the distributional characteristics of the sample. For example, in a comparative study between test-driven development and test-last programming, we used the Kolmogorov-Smirnov test to compare the code complexity and design metrics of a software system that is developed using test-driven development against a normal distribution [73]. The Kolmogorov-Smirnov test is a nonparametric test for checking the equality of two continuous probability distributions or for comparing two samples [71]. Our results [73] show that none of the metrics are normally distributed, and in turn, we applied analytics that are appropriate for samples generated from any type of distribution. In other studies, we applied the same goodness-of-fit test to compare development activities between experienced and novice team members of a software organization [40] or between developers and project managers [39].
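As an illustration of such a normality check (on hypothetical complexity values, not the data from Ref. [73]), the one-sample Kolmogorov-Smirnov test in scipy can be used as follows:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Hypothetical cyclomatic-complexity values for the modules of one system.
    complexity = rng.lognormal(mean=1.5, sigma=0.8, size=200)

    # One-sample K-S test against a normal distribution parameterized by the
    # sample's own mean and standard deviation (a common, if approximate, check).
    stat, p_value = stats.kstest(complexity, "norm",
                                 args=(complexity.mean(), complexity.std()))
    print(f"K-S statistic = {stat:.3f}, p = {p_value:.3f}")
    if p_value < 0.05:
        print("Reject normality; prefer nonparametric analytics.")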

In some of our studies, software metrics took categorical values, and hence other goodness-of-fit tests such as the chi-square test were more appropriate to compare the categorical metric values from two or more groups of samples. For example, during a case study concerning a software organization [42], the chi-square test was used to compare the distributions of confirmation bias metrics, which characterize developers’ thought processes and reasoning skills, among developers, testers, and project managers. On the basis of the test results, we stated that reasoning skills are significantly different among the three development roles. Hence, we suggested using these metrics in deciding the assignment of roles and in forming a more balanced structure in the software organization.
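A minimal sketch of such a comparison, assuming a hypothetical contingency table of categorical bias levels per role, is shown below:

    from scipy.stats import chi2_contingency

    # Hypothetical contingency table: rows are roles, columns are (low, medium, high)
    # levels of a categorical confirmation bias metric.
    observed = [
        [20, 35, 10],   # developers
        [12, 18, 15],   # testers
        [5, 10, 12],    # project managers
    ]

    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
    if p_value < 0.05:
        print("The metric's distribution differs significantly across roles.")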

Differences between populations. Though goodness-of-fit tests can be used to compare the differences between two or more populations—that is, samples—a better approach is to form a null hypothesis (H0) and use statistical hypothesis tests to check if the null hypothesis is supported. For example, if we aim to check whether developers write more tests using test-driven development compared with test-last programming, we need to collect two samples that include the number of tests collected from a development activity using test-driven development and test-last programming. It is important that the samples are independently drawn from different populations. Later, we can form a null hypothesis as follows: The difference between the number of tests written by developers using test-driven development and the test-last approach has a mean value of 0.

A test (e.g., the t test or the Mann-Whitney U test, also called the Wilcoxon rank-sum test) can reject the null hypothesis, meaning that the mean values are significantly different between the two development practices, or it can fail to reject the null hypothesis, meaning that the difference of the means is not large enough to make a statistical claim [71]. Hypothesis tests cannot actually prove or disprove anything; instead they report a significance level—for example, a p value of 0.05 tells us that, if the null hypothesis were true, a difference at least as large as the observed one would occur by chance only 5% of the time. As this p value increases, it becomes more likely that the observed difference happened by chance.

We have often used hypothesis tests to compare two or more samples representing different software systems, development methods (e.g., Mann-Whitney U test [73]) and/or software teams (e.g., Mann-Whitney U test [41], and Kruskal-Wallis analysis of variance with more than two teams [43]). Alternatively, these tests can be used to compare the performance of two predictive analytics in order to decide which one to choose—for example, t test [26, 49].
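As a small illustration with hypothetical samples (number of tests written under each practice), both the parametric and the nonparametric test can be run as follows:

    from scipy import stats

    # Hypothetical number of unit tests written per developer under two practices.
    tdd = [14, 18, 11, 20, 16, 22, 13, 17]
    test_last = [9, 12, 7, 15, 10, 11, 8, 13]

    # Parametric comparison (assumes roughly normal samples).
    t_stat, t_p = stats.ttest_ind(tdd, test_last)

    # Nonparametric alternative (Wilcoxon rank-sum / Mann-Whitney U).
    u_stat, u_p = stats.mannwhitneyu(tdd, test_last, alternative="two-sided")

    print(f"t test:       t = {t_stat:.2f}, p = {t_p:.3f}")
    print(f"Mann-Whitney: U = {u_stat:.1f}, p = {u_p:.3f}")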

In summary, hypothesis testing techniques should be selected carefully by considering information about the sample distributions (e.g., the t test for normally distributed samples, and nonparametric tests such as the Mann-Whitney U test otherwise) during a descriptive analytics process. We also suggest that descriptive statistics, reported via visualization or statistical tests, can guide researchers or data analysts in software organizations in cases where there is rich, multivariate, and often noisy data.

16.5 Predictive Analytics

16.5.1 A Predictive Model for All Conditions

In this section, we present our experiences from multiple industry collaborative projects in terms of predictive analytics. Throughout this section we review multiple algorithms, from simpler alternatives to more complex ones. However, our main intention is not to repeat the content of commonly used machine learning algorithms, since there are more than enough machine learning books [74–76]. We rather intend to share the lessons learned from applying predictive analytics to industry-academia collaborative projects for more than a decade.

Often, we will see that knowing the learner (aka the prediction algorithm) is only part of the story, and the practitioner needs to be flexible and creative in terms of how he/she uses and alters the algorithm depending on the problem at hand. So, before going any further, and just to set the right expectations, we briefly quote a conversation with an industry practitioner. Following the presentation of a predictive analytics model, a practitioner from a large software company asked the following question: “Would your predictive model work under all conditions?” (“Conditions” meaning different datasets and problem types.) The short answer to that question is: “No.” There is no predictive model that yields high performance under all conditions. Therefore, if you are looking for a silver-bullet predictive model, this section will disappoint you. On the other hand, we can talk about a certain approach to predictive models as well as some likely steps to be followed, which the authors of this chapter have applied in a number of real-life industry projects.

Terminology. A predictive model is a specific use of a learner that is often aided by preprocessing and postprocessing steps to improve predictive performance [30]. We will talk about improving the performance in Section 16.5.3. For now, let us focus our attention on the learners, which are machine learning algorithms that learn from the known instances and provide a prediction for an unknown instance. The known instances are instances for which we know the dependent variable information—for example, the defect information of software components. These instances are also referred to as the “training set.” The unknown instances are the ones for which we lack the dependent variable information—for example, software components that have just been released and hence do not yet have defect information. These instances are referred to as the “test set.” In the example of defective and nondefective software components, a predictive model is supposed to use the training set and learn the relationship between the metrics defining the software components and the defects. Using the relationship learned, the predictive model is expected to provide accurate estimates for the test set—that is, the newly released components.

Go from simple to complex. One misleading approach to predictive modeling is to bluntly use whichever learner we come across. It may be possible to get decent estimation accuracy with a randomly selected learner, but it is unlikely that practitioners will adopt such a learner. For example, in a software effort estimation project for an international bank, every month we presented the results of the experiments to the management as well as the software developers [9, 52]. The focus of these presentations was never merely the performance, but how and why the algorithm presented could achieve that performance. In other words, merely using a complex algorithm, without a high-level explanation of how and why it applies to the prediction problem at hand, is unlikely to lead to adoption by product groups. Therefore, it is a good idea to start the investigation with an initial set of algorithms that are easy to apply and explain, such as linear regression [38], logistic regression [60, 77], and k-nearest neighbors (k-NN) [62, 78]. The application of such algorithms also serves the purpose of providing a performance baseline, so that we can see how much added value the more complex additions will bring. These algorithms are relatively easy to understand, and in most of the industry projects in which the authors participated, they proved to have quite good performance [38, 60, 62, 77, 78]. On the other hand, simplicity just for the sake of choosing algorithms that nontechnical audiences can easily understand is also misleading; the simplicity of the algorithm should have a minimal effect on the decision making process. If the best performing algorithm turns out to be a complex ensemble of relatively harder-to-understand learners, then so be it. But to make the case for deciding on a complex alternative, we should start simple and make sure that value is added with the complexity.

Linear regression is a predictive model that assumes a linear relationship between the dependent and independent variables [74]:

y = Xβ + ε.  (16.1)

The independent variables (defined by X) are multiplied by coefficients (β), and we want to set up the coefficients such that the error (ε) is minimized.

We have used linear regression in various industry settings for regression problems (problems in which the dependent variable is a continuous value). The focus of a project in which we collaborated with a large telecommunications company was the effect of confirmation bias on defect density (measured in terms of the defect count, which is a continuous value, and hence a regression problem). We opted to use linear regression in this project as the confirmation bias metrics were observed to be linearly related to defect counts [38]. The presence of a linear relationship can be checked with the R2 value (aka the coefficient of determination) as well as by plotting the metric value against the defect count. The R2 value measures how much of the response variable's variation (around its mean) is explained by a linear model. It can take values from 0 to 1, where 0 means none of the response variable's variation is explained by the linear model, whereas 1 means all of it is explained. Therefore, values close to 1 indicate a better model fit. In the confirmation bias project we defined the defect density for each developer group as the ratio of the total number of defective files created/updated by that group to the total number of files that group created/updated. To model the effect of confirmation bias on software defect density, we constructed a predictive model based on linear regression, with confirmation bias metrics as the predictor (independent) variables and defect density as the response variable. Our results showed that 42.4% of the variability in defect density could be explained by our linear regression model (R2 = 0.4243).
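A minimal sketch of fitting such a model and reading off R2 is shown below; the metric values and defect densities are hypothetical, not the data from Ref. [38]:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical data: rows are developer groups, columns are confirmation bias
    # metrics; the response is each group's defect density.
    X = np.array([[0.2, 0.5, 0.1],
                  [0.4, 0.6, 0.3],
                  [0.1, 0.2, 0.2],
                  [0.5, 0.7, 0.4],
                  [0.3, 0.4, 0.2],
                  [0.6, 0.8, 0.5]])
    y = np.array([0.10, 0.22, 0.08, 0.30, 0.15, 0.35])

    model = LinearRegression().fit(X, y)
    r_squared = model.score(X, y)   # coefficient of determination on the fitted data
    print("coefficients:", model.coef_)
    print(f"R^2 = {r_squared:.3f}")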

Note that in the telecommunications company project, we used a linear regression model to predict a continuous value (the defect count) [38]. However, not all the prediction problems that we face in real-life predictive analytics involve continuous variables. In the case of discrete dependent variables, we can make use of logistic regression [60, 77]. For example, if we know only whether a software module is defective or not (but not the exact defect count), we can define the discrete classes “defective” and “nondefective.” Such a grouping gives us a two-class (aka binary) problem instead of a continuous-variable prediction problem. In the case of two-class classification problems (yi ∈ {0, 1}), logistic regression is a frequently used prediction algorithm. The authors have employed logistic regression in various defect prediction projects [60]. The general logistic regression formula is as follows:

Pr(yi = 1) = logit−1(Xiβ),  (16.2)

where Pr(yi = 1) denotes the probability of yi belonging to class 1, Xi is the vector of independent variable values, and β is the coefficient vector. One benefit of this predictive method is that—provided the logistic regression model achieves high accuracy—one can use the coefficient values to gauge the importance of the different input variables. An example application of this approach can be found in one of our projects [60], where we used logistic regression to analyze the possible factors behind issues being reopened in software development. For this purpose, we fit a logistic regression model to the collected issue data, but our main aim was to understand which issue factors are most important (through the use of the coefficient values). Therefore, for the analysis of factors leading to reopened issues, logistic regression was an appropriate choice.
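As a sketch of this use of coefficients (on hypothetical issue records, with made-up features such as lines changed and linked-issue counts), a logistic regression can be fit and inspected as follows:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical issue records: [lines changed to fix, number of linked issues,
    # comment count]; label 1 means the issue was later reopened.
    X = np.array([[120, 4, 10],
                  [15, 0, 2],
                  [300, 6, 25],
                  [40, 1, 4],
                  [210, 5, 18],
                  [10, 0, 1],
                  [95, 3, 9],
                  [25, 1, 3]])
    y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # With comparable feature scales, a larger |coefficient| suggests a stronger
    # association with reopening; in practice, standardize the features first.
    print("coefficients:", clf.coef_[0])
    print("P(reopened) for a new issue:", clf.predict_proba([[150, 2, 8]])[0, 1])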

Another simple, yet quite successful learner for classification problems is naïve Bayes [33, 62, 78]. Particularly for software defect prediction studies, the authors have observed that naïve Bayes proved to be better than some more complicated rule-based learners (such as decision trees) [62]. As the name of the learner suggests, the naïve Bayes classifier is based on the Bayes theorem, which states that our next observation depends on how new evidence will affect old beliefs:

P(H|E) = (P(H)/P(E)) ∏i P(Ei|H).  (16.3)

In Equation 16.3, given that we have old evidence Ei and a prior probability for a class H, it is possible to calculate its next (posterior) probability. For example, for a classifier trying to detect defective modules in a piece of software, class H would represent the class of defective modules. Then the posterior probability of an instance being defective (P(H|E)) is the product of the fraction of defective instances P(H)/P(E) and the probability of each observation P(Ei|H).

The success of this learner for defect prediction tasks also paved the way for us to use it in conjunction with other learners. For example, we have employed the k-NN learner as a cross-company filter for naïve Bayes [62]. Before going into how we made use of the k-NN learner as a filter, we briefly explain how it works: k-NN finds the k labeled instances most similar to the test instance, where similarity is calculated via a distance function (e.g., Euclidean distance or Hamming distance).

k-NN can be used for both classification [78] and regression [79] problems. For a classification problem, the majority vote (i.e., the majority class among the k nearest neighbors) is usually given as the predicted value. For a regression problem, the mean or median of the dependent variable values of the k nearest neighbors is given. Our use of k-NN as a filter in the defect prediction domain—that is, a classification problem [62]—addresses whether organizations without data of their own (so-called within-company data) can use data from other organizations (so-called cross-company data). In our initial experiments we used naïve Bayes as the predictive method and compared the performance when an organization uses within-company data versus cross-company data. The performance results showed that the use of cross-company data yields poor performance. This is understandable, as the context of another organization may differ considerably. We then used k-NN to filter instances from the cross-company data—that is, instead of using all the instances of cross-company data in a naïve Bayes classifier, we first find the cross-company instances closest to the test instances (using k-NN). Filtering down to only the closest instances improves the performance of cross-company data so that it is very close to that of within-company data. The ability to use cross-company data has been quite important in a number of our projects, because our observation is that initially the within-company data will be quite limited at best. In other words, “a practitioner will have to start the analysis with the data at hand.” In such cases, the ability to use cross-company data has helped us provide initial predictions for the within-company test data [52].
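The sketch below illustrates the filtering idea on synthetic data; it is an approximation of the approach in Ref. [62], not its exact implementation, and the function name nn_filtered_nb is ours:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import NearestNeighbors

    def nn_filtered_nb(cross_X, cross_y, test_X, k=10):
        """Train naive Bayes only on the cross-company instances closest
        (in Euclidean distance) to the test instances, as in NN filtering."""
        nn = NearestNeighbors(n_neighbors=k).fit(cross_X)
        _, idx = nn.kneighbors(test_X)        # k neighbors per test instance
        selected = np.unique(idx.ravel())     # union of the selected neighbors
        clf = GaussianNB().fit(cross_X[selected], cross_y[selected])
        return clf.predict(test_X)

    # Synthetic data: 200 cross-company modules described by 5 metrics each.
    rng = np.random.default_rng(1)
    cross_X = rng.normal(size=(200, 5))
    cross_y = (cross_X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
    test_X = rng.normal(size=(10, 5))

    print(nn_filtered_nb(cross_X, cross_y, test_X))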

Other relatively more complex learners are also frequently employed in software engineering research as predictive methods. Neural networks [49] and decision trees [50, 80] are examples of such learners that were used by the authors. We will not go deeply into the mechanics of these learners; however, we will provide the general idea and possible inherent biases because having an idea of how these algorithms work and their biases aids a practitioner in choosing the right learner for datasets of different characteristics.

Neural networks are known to be universal approximators [81]—that is, they can learn almost any function. Neural networks are defined as a group of connected nodes, where the training instances are fed into the nodes that form the input layer (see Figure 16.5) and the information is transmitted to the hidden layer nodes. During transmission, each connecting edge applies a weight to the number it receives from the previous node. Although there is one hidden layer in Figure 16.5, there may be multiple hidden layers to model more complex functions. Finally, the values are fed into the output layer, where a final estimate for each dependent variable is obtained. Note that instances can be fed into a neural network one by one—that is, the model can update itself one instance at a time as the training instances become available. However, a problem with neural networks is that they are sensitive to small changes in the dataset [82]. Also, overtraining a neural network has a negative impact on predictions for future test instances. We addressed such issues by arranging the neural networks in the form of an ensemble [49]. Application of our neural network ensemble to software effort estimation data revealed that, unlike a single neural network, an ensemble of neural networks provides more stable and improved accuracy values [49]. We will discuss ensembles of other learners further in the next section as a way to improve the predictive power of learners.

Figure 16.5 A sample neural network with a single hidden layer.

A capable and commonly used learner type is the decision tree [50, 80]. Unlike neural networks, which can take instances one by one, decision trees require all the data to be available before they can start learning. Decision trees work by recursively splitting the data into smaller and smaller subsets on the basis of the values of the independent features [83]. At each split, the instances that are more coherent with respect to the selected feature are grouped together into a node. Figure 16.6 presents a hypothetical decision tree that uses two features, F1 and F2. The way this decision tree would work for a test instance is as follows: If F1 of this instance is smaller than or equal to x, then the decision tree will return the prediction A. Otherwise, we go down the F1 > x branch and look at F2. If F2 > y is true for our test instance, then we predict B; otherwise we predict C.

Figure 16.6 A sample decision tree that uses two features (F1 and F2).

Particularly for datasets where instances form interrelated clusters, it is a good idea to try decision tree learners, as their assumptions overlap with the structure of the data. Software effort estimation data is a good example of this case, where instances (projects) form local groups composed of similar projects. Therefore, we made use of decision trees in different industry projects focusing on estimating the effort of software projects [50, 52, 53]. One of our uses of decision tree learners on software effort data [50] includes the idea of grouping similar instances so as to convert the software effort estimation problem into a classification problem (recall that software effort estimation is originally a regression problem). By using decision trees, we form software effort intervals, where an interval is identified by the training projects falling into a leaf node of the decision tree. The test instance is then fed into the decision tree and traverses it to a leaf node, where the estimate is provided as an interval instead of a single value. A minimal sketch of this idea is shown below.
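The sketch uses hypothetical project data and the leaves of a regression tree to report an interval rather than a point estimate; it illustrates the idea behind Ref. [50], not its exact implementation:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Hypothetical project data: [size in KLOC, team size] and effort (person-months).
    X = np.array([[10, 3], [12, 4], [50, 8], [55, 9], [90, 15], [95, 14],
                  [11, 3], [52, 7], [92, 16], [13, 5]])
    effort = np.array([6, 8, 30, 34, 80, 85, 7, 31, 90, 9])

    tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=2).fit(X, effort)

    # Instead of the leaf's mean, report the interval spanned by the training
    # projects that fall into the same leaf as the new project.
    new_project = np.array([[48, 8]])
    leaf = tree.apply(new_project)[0]        # leaf id of the new project
    train_leaves = tree.apply(X)
    in_same_leaf = effort[train_leaves == leaf]
    print(f"Estimated effort interval: {in_same_leaf.min()}-{in_same_leaf.max()} person-months")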

Things to keep in mind:

1. The learner is only part of the picture. Until now we have reviewed some of the learners that we successfully applied on different real-life software engineering data. We have also seen how these learners can be altered for different tasks—for example, our use of k-NN as an instance-filtering method [62] or the use of decision trees to convert a regression problem into a classification problem [50]. However, the application of the learner on the data is only one part of the picture. There are possible pitfalls we identified over the course of different projects that a practitioner should be aware of when building a predictive model.

2. Building a predictive model is iterative. As you are applying different learners to the data, you should consider including the domain experts (or customers) of the data in the loop. In other words, we recommend discussing the initial results and your concerns with the domain experts. The initial benefit of having the customer in the loop is that you can get early feedback about the quality of the data. For instances on which your learner yields poor performance, the domain experts can often give a clear explanation of the cause. Another benefit is identifying potential data issues early on. If there is a feature that looks suspicious, the domain experts may give you a clear answer as to whether the data is wrong. For example, in one of the industry projects, the bug count for some projects was surprisingly low, whereas the number of test cases for these projects was quite high. A domain expert who worked on such a project can tell you whether this is indeed the case, or whether bugs were simply not tracked consistently in that particular project. You may then decide not to include this unhealthy information in your model, as we did in that particular project.

3. Automation is the key. During any part of the project, you may need to rerun all your experiments. In one industry project, we learned halfway through that one of the assumptions we had been given for a group of projects was invalid for another group of projects developed within the preceding year. So we had to repeat all the analyses we had done up to that point for the projects developed within the preceding year. If we had not coded our experiments, it would have taken a great deal of time to repeat them. Therefore, it is beneficial to use machine learning toolkits such as WEKA [76], but it is also critically important to have your experiments coded so that they can be rerun. Another reason why it is important to code your predictive models is customization. You may want to combine or alter different learners in a specific way, which may be unavailable through the user interface of a toolkit. Coding your experiments gives you the flexibility of customization.

16.5.2 Performance Evaluation

Depending on the problem being studied, the performance of predictive analytics can be evaluated differently. Naturally, the output of a predictive analytics—that is, a prediction model—is compared against the data points in the test set, or against the actual data points after they have been retrieved and stored in the data repositories of software organizations. The output variables can be categorical or continuous, and this determines the set of performance evaluation measures. For instance, in a defect prediction problem, the output of a predictive analytics can take a categorical value (e.g., the defect-proneness of a software code module (i.e., package, file, class, or method) being defect-free (0) or defect-prone/defective (1)) or a continuous value ranging from zero to infinity (e.g., the number of defect-prone code modules in a software product). For a model with a categorical output, a typical confusion matrix can be used for computing a set of performance measures. Table 16.5 presents a confusion matrix in which the rows represent the actual information about whether the software module was fixed because of a defect, whereas the columns represent the output of a predictive analytics.

Table 16.5

A Typical Confusion Matrix Used for Performance Assessment of Prediction Models

                 Predicted
                 True    False
Actual   True    TP      FN
         False   FP      TN


All performance measures can be calculated on the basis of the four base measures presented in Table 16.5—namely, true positives (TPs), false negatives (FNs), false positives (FPs), and true negatives (TNs). For example, in the context of defect prediction, TPs are the number of defective modules that are correctly classified as defective by the model, FNs are the number of defective modules that are incorrectly marked as defect-free, FPs are the number of defect-free modules that are incorrectly classified as defective, and TNs are the number of defect-free modules that are correctly marked as defect-free by the model. A predictive analytics ideally aims to classify all defective and defect-free modules accurately—that is, no FNs and no FPs. However, achieving this ideal case is very challenging in practice.

In our empirical studies, we have used six popular performance measures to assess the performance of a classifier: accuracy; recall (also called the TP rate or hit rate, or, specifically in defect prediction, the probability of detection); the FP rate (also called the false alarm rate in defect prediction); precision; the F measure; and balance. The calculations of these six measures are shown below [30, 51, 77]:

ACC = (TP + TN)/(TP + TN + FN + FP),  (16.4)

REC = TP/(TP + FN),  (16.5)

PREC = TP/(TP + FP),  (16.6)

FPR = FP/(FP + TN),  (16.7)

F measure = 2(PREC × REC)/(PREC + REC),  (16.8)

BAL = 1 − √((1 − REC)² + (0 − FPR)²)/√2,  (16.9)

where ACC is accuracy, REC is recall, PREC is precision, FPR is the FP rate, and BAL is balance. ACC, REC, PREC, the F measure, and BAL should be as close to 1 as possible, while FPR should be close to 0. However, there is typically an inverse relationship between REC and PREC—that is, it is usually possible to increase one (REC) only at the cost of the other (lower PREC). If these two are prioritized over all other performance measures, it is useful to compute the F measure, which is the harmonic mean of REC and PREC.

Furthermore, there is a positive relationship between REC and FPR—that is, an algorithm tends to increase REC at the cost of a higher false alarm rate. BAL is a measure that incorporates the relationship between REC and FPR; it is based on the distance between the ideal case (REC = 1, FPR = 0) and the performance of a predictive analytics.
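The six measures can be computed directly from the four confusion-matrix cells; the helper below follows Equations 16.4-16.9 (the counts are hypothetical):

    import math

    def classification_measures(tp, fn, fp, tn):
        """Derived measures from the four confusion-matrix cells (Equations 16.4-16.9)."""
        acc = (tp + tn) / (tp + tn + fn + fp)
        rec = tp / (tp + fn)
        prec = tp / (tp + fp)
        fpr = fp / (fp + tn)
        f_measure = 2 * prec * rec / (prec + rec)
        # Balance: normalized distance from the ideal point (REC = 1, FPR = 0).
        bal = 1 - math.sqrt((1 - rec) ** 2 + (0 - fpr) ** 2) / math.sqrt(2)
        return dict(ACC=acc, REC=rec, PREC=prec, FPR=fpr, F=f_measure, BAL=bal)

    # Hypothetical counts from a defect prediction experiment.
    print(classification_measures(tp=80, fn=20, fp=30, tn=170))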

Achieving high REC and PREC while lowering FPR, and in turn achieving high BAL and F measure, has been a real challenge for us when building predictive analytics in software organizations. A good balance between these measures needs to be targeted according to the business strategies of these organizations. For example, in software organizations operating in mission-critical domains (e.g., embedded systems), software teams aim to catch and fix as many defects as possible before deploying the production code. Thus, they prioritize achieving a high REC over a low FPR from a predictive analytics. Conversely, in software organizations operating in competitive domains (e.g., telecommunications), reducing costs is the primary concern. Accordingly, they would like to reduce the additional costs caused by a high FPR and achieve a balance between REC, PREC, and FPR.

We have preferred to use REC, FPR, and BAL in most of our work, since reducing the cost of predictive analytics by lowering FPR has always been the focus of our industry partners. For example, in a case study concerning a software organization operating in the telecommunications domain, we deployed a software analytics for predicting code defects [9]. On the basis of user feedback, we calibrated the analytics to reduce FPR while keeping REC as high as possible. The model that was later deployed in the company produced 87% recall and 26% false alarms [84]. In a study concerning a company operating in embedded systems, we built a predictive analytics for catching the majority of code defects at the cost of higher false alarm rates [85]; an ensemble of classifiers was able to achieve 82% recall with 35% false alarms. In other case studies, we have studied different techniques for lowering false alarm rates in defect prediction research (e.g., algorithm threshold optimization [29], increased information content of training data [24, 26, 30], and missing data imputation [15]).

For localizing the source of software faults, we built a predictive analytics that extracts unique patterns stored in previous fault messages of a software application and assigns a newly arrived fault message to the component where it belongs [51]. In the context of fault localization, both REC and PREC are equally important, since misclassification has equal costs for both classes. In other words, localizing a fault to component A even though it belongs to component B can be as costly as localizing a fault to component B even though it belongs to component A. The proposed model in Ref. [51] achieved 99% REC and 98% PREC in the first application, and 76% REC and 71% PREC in the second application.

Another predictive analytics was built for a software organization to estimate software classes that need to be refactored on the basis of code complexity and dependency metrics [86]. The proposed model was able to predict 82% of classes that need refactoring (TP rate) using 13% inspection effort by developers.

If a predictive analytics produces a continuous output (e.g., the total number of defects, or project effort in person-months), evaluating its performance requires either a transformation of the output to categorical values or the use of other performance measures that are more appropriate for a continuous model output. For example, we built a software effort estimation model that dynamically predicts the interval of a project's effort—that is, a categorical value that indicates which effort interval is the most suitable for a new project—rather than the number of person-months [80]. Since the output was categorical, the performance measures used in defect prediction studies—that is, REC, PREC, FPR, and ACC—were applied to assess the proposed effort estimation model.

A more convenient approach is to use other performance measures that are more suitable for evaluating the performance of a model that produces continuous output. Some of these measures that we have extensively used are the magnitude of relative error (MRE), the mean MRE (MMRE), and predictions at level k (PRED(k)). Calculation of each of these measures is shown below:

MREi = |xi − x̂i| / xi,  (16.10)

MMRE = mean(all MREi),  (16.11)

PRED(k) = (100/N) Σi=1..N { 1 if MREi ≤ k/100; 0 otherwise }.  (16.12)

Ideally, MRE and MMRE should be as close to 0 as possible, whereas PRED(k) should be as close to 100% as possible. Instead of MMRE, the median magnitude of relative error (MdMRE) can be used, since the mean of a sample is more sensitive to outliers than the median. In the context of software effort estimation, reducing MRE indicates that effort values are estimated with low error rates. As the error rate decreases, PRED increases. The value of k in the PRED calculation is usually chosen as 25 or 30, meaning that PRED reports the percentage of estimations whose error rate is lower than 25% or 30%. PRED might be a better assessment criterion for software practitioners, since it shows the variation of the prediction error—that is, what percentage of predictions achieve an error level that was set by practitioners. Hence, it implicitly presents both the error threshold (k) and the variation of this error among all predictions made by the analytics.
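These error measures are simple to compute; the sketch below, on hypothetical actual and estimated effort values, reports MMRE, MdMRE, and PRED(25):

    import numpy as np

    def mre(actual, predicted):
        actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
        return np.abs(actual - predicted) / actual

    def pred_k(actual, predicted, k=25):
        """Percentage of estimates whose MRE is at most k percent."""
        return 100.0 * np.mean(mre(actual, predicted) <= k / 100.0)

    # Hypothetical effort values in person-months (actual vs. estimated).
    actual = [12, 40, 7, 25, 60, 18]
    predicted = [10, 46, 9, 24, 75, 17]

    errors = mre(actual, predicted)
    print(f"MMRE     = {errors.mean():.2%}")
    print(f"MdMRE    = {np.median(errors):.2%}")
    print(f"PRED(25) = {pred_k(actual, predicted, 25):.0f}%")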

We have used MMRE and PRED(25) measures during our empirical studies in the context of software effort estimation, and we observed that there is not an optimal value for these measures that fits best for all software projects and teams. For example, in one study, one of our colleagues from our research laboratory built different predictive analytics to estimate the project effort in person-months [87]. Two datasets—that is, one public dataset and another dataset collected from different software organizations—were used to train the predictive analytics. The results show that the best model applied to the public dataset produced 29% MMRE, 19% MdMRE, and 73% PRED(30), whereas it produced 49% MMRE, 28% MdMRE, and 51% PRED(30) when applied to the commercial dataset.

In other studies, we have proposed relatively more complex analytics to estimate software projects’ effort in the case of a limited amount of training data. For example, the intelligence in our predictive analytics proposed in Ref. [49] works as follows: it has an associative memory that estimates and corrects the error of a prediction on the basis of the algorithm’s performances over past projects. The results obtained using this associative memory show that we could achieve 40% MMRE, 29% MdMRE, and 55% PRED(25), compared with a simple classifier that predicts 434% MMRE, 205% MdMRE, and 10% PRED(35) [49].

On the basis of the increase/decrease in performance measures, the best performing algorithm can be selected when building predictive analytics. A raw comparison between the values of a performance measure may sometimes be problematic—that is, the change in a measure might happen because of the sample used for training the algorithm and the improvements may not be statistically significant. So, even though an algorithm produces higher values in terms of a performance measure—say, 30% over 27%—statistical tests should be used to confirm that the change does not happen by chance.

Statistical tests, as explained in Section 16.4.2, can be used to identify patterns of software data as well as to compare the performance of predictive analytics that are built with different algorithms (e.g., the Mann-Whitney U test [73], and Nemenyi’s multiple comparison test and the Friedman test [52]). Box plots are also useful visualization techniques to depict the differences between the performance measures of two or more predictive analytics [30, 62]. Other performance measures we used for predictive analytics that produce continuous output can be found in Ref. [38].

Finally, we calculated context-specific performance evaluations, such as a cost-benefit analysis (in terms of the reduction, in lines of code, of the inspection effort for finding software defects [84, 85]) in empirical studies of software defect prediction. These types of analyses are also easy for software practitioners to understand and interpret, as they represent the tangible benefits—that is, effort reduction—of using predictive analytics.

16.5.3 Prescriptive Analytics

Once you decide on an algorithm, it is possible to improve the performance via some simple methods such as the following:

 forming ensembles [49, 85, 88],

 applying normalization [50, 80],

 feature selection through methods such as information gain [62],

 instance selection via k-NN-based sampling.

Ensembles are a powerful way to combine multiple weak learners into a stronger learner [88]. The idea behind ensembles is that different learners have different biases (recall how neural networks and decision trees differ from one another); hence, they learn different parts of the data. When their predictions are combined, they may complement one another and provide a better prediction. For example, in one of our studies, we employed a large number of prediction methods and combined them in simple ways, such as taking the mean or median of the predictions coming from the single learners [88]. However, before combining single learners, we paid attention to choosing the successful ones: we ranked the learners and combined only those with high performance. Such a selection of only the successful learners provided us with ensembles that are consistently more successful than all the single learners. In other words, when forming an ensemble, it is better not to include single learners whose performance is already low when they are run alone. A sketch of this rank-and-combine idea is given below.
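The sketch, on synthetic effort data, ranks a few single learners by cross-validated error and averages the predictions of the best ones; the learners and data are illustrative only:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    # Synthetic effort data: 60 projects, 4 metrics each.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(60, 4))
    y = 10 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=1.0, size=60)

    learners = {
        "linear": LinearRegression(),
        "knn": KNeighborsRegressor(n_neighbors=3),
        "tree": DecisionTreeRegressor(max_depth=3, random_state=0),
    }

    # Rank single learners by cross-validated error and keep only the best two.
    scores = {name: -cross_val_score(m, X, y, cv=5,
                                     scoring="neg_mean_absolute_error").mean()
              for name, m in learners.items()}
    best = sorted(scores, key=scores.get)[:2]

    # Combine the selected learners by averaging their predictions.
    X_new = rng.normal(size=(5, 4))
    preds = np.column_stack([learners[name].fit(X, y).predict(X_new) for name in best])
    print("selected learners:", best)
    print("ensemble (mean) predictions:", preds.mean(axis=1))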

Normalizing the values of features is “a must” to improve learner performance if some of the features have very high values compared with others. For example, a feature keeping the number of classes defined in a project will have much smaller values compared with another feature that keeps the number of lines of code in this project. If you are using a k-NN learner, the impact of lines of code will dominate over the impact of the number of classes in a Euclidean distance calculation. A very common method we use for a number of different datasets is min-max normalization [50, 80]:

(xi − min(X)) / (max(X) − min(X)),  (16.13)

where X represents the feature vector and xi represents a single value in this vector.

Although a very large number of features may be available for inclusion in a dataset, it is usually the case that some of these features are less important than others. We have observed in different scenarios that it is in fact possible to improve the performance of a learner by selecting a subset of all the features [62, 88]. There are multiple ways to select features, such as information gain [62], stepwise regression [88], and linear discriminant analysis, to name just a few. A minimal sketch combining normalization and feature selection is given below.
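The sketch, on hypothetical module metrics, applies min-max scaling and then keeps the features with the highest mutual information with the defect label (a mutual-information criterion standing in for information gain):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    # Hypothetical module metrics with very different scales:
    # [lines of code, number of classes, cyclomatic complexity].
    rng = np.random.default_rng(3)
    X = np.column_stack([rng.integers(100, 20000, size=80),   # LOC
                         rng.integers(1, 40, size=80),        # classes
                         rng.integers(1, 60, size=80)])       # complexity
    y = (X[:, 2] + rng.normal(scale=10, size=80) > 30).astype(int)  # defect label

    # Min-max normalization (Equation 16.13) so that no feature dominates distances.
    X_scaled = MinMaxScaler().fit_transform(X)

    # Keep the two features most informative about the defect label.
    selector = SelectKBest(mutual_info_classif, k=2).fit(X_scaled, y)
    print("selected feature indices:", selector.get_support(indices=True))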

Similarly to the idea that not all features are helpful for a prediction model, not all instances are beneficial for a prediction problem either. There may be various reasons why certain instances should be filtered out of the data. For example, an instance may contain erroneous data, or the data stored for the instance may be correct but so different from all the rest of the instances that it turns out to be an outlier. In either case, we may want to filter out such instances from the dataset. We have already discussed our use of k-NN-based instance filters prior to the use of a naïve Bayes classifier [62], which is a good example of instance-based filtering: the k-NN-based filter selects only the training instances closest to the test instances, so the learner is trained only on the selected instances. Another algorithm that was inspired by k-NN-based filters is filtering by variance [79], where we selected only the instances that form groups with low variance of the dependent variable value.

The methods above—ensembles, feature and instance selection, and feature normalization—are a subset of the possible ways to improve the performance of a learner. However, their application should not be interpreted as a must. Depending on the problem and dataset at hand, a practitioner should experiment with them separately and in combination (e.g., feature and instance selection applied together). The combination that yields the best performance should be preferred.

In summary, predictive analytics requires a clear definition of business challenges from a statistical point of view: from understanding the data through visualizations and statistical tests to defining the inputs and outputs of a predictive analytics; from the selection of an algorithm to the performance assessment criteria; and finally, to improving the performance of the algorithm. On the basis of our previous experience, we provide Table 16.6 to guide software data analysts in building a software analytics framework. Note that Table 16.6 is by no means an exhaustive list of all possible algorithms for predictive purposes. However, it covers the selected algorithms that were employed in the multiple industrial case studies we have discussed so far.

Table 16.6

A Classification and Application of Some of the Statistical Methods for Building a Software Analytics Framework

Type of Response Variable (Output) | Type of Learning | Predictive Analytics: Algorithms | Predictive Analytics: Performance Measures | Prescriptive Analytics
Categorical (e.g., defect-proneness, issue reopening) | Classification | Logistic regression, naïve Bayes, k-NN, decision trees | Base measures from the confusion matrix (TP, TN, FP, FN); other derived measures: ACC, REC, PREC, false alarms, F measure, BAL | Normalization, feature and instance selection, ensembles
Continuous (e.g., number of defects, defect density, project effort in person-months) | Regression | Linear regression, k-NN, neural networks, decision trees | MRE, MdMRE, PRED(k), R2 | Normalization, feature and instance selection, ensembles


16.6 Road Ahead

In many real-world problems, numerous random factors affect the outcomes of a decision making process. It is usually impossible to consider all these factors and their possible interactions. Under such uncertainty, AI methods are helpful tools for generalizing from past experience in order to produce solutions for previously unseen instances of the problem. These past experiences are extracted from available data, which represents the characteristics of the problem. Many data mining applications deal with large amounts of data, and their challenge is to reduce this large search space. On the other hand, there exist domains with very limited amounts of available data. In this case, the challenge is making generalizations from limited amounts of data.

In this context, software engineering is a domain with many random factors and relatively limited data. Nevertheless, in the software domain, remarkably effective predictors for software products have been generated using data mining methods. The success of these models seems unlikely considering all the factors involved in software development. For example, organizations can work in different domains, have different processes, and define/measure defects and other aspects of their products and processes in different ways. Furthermore, most organizations do not precisely define their processes, products, measurements, etc. Nevertheless, very simple models suffice for generating approximately correct predictions for software development time and the location of software defects.

One candidate explanation for the strange predictability in software development is that despite all the seemingly random factors influencing software construction, the net result follows very tight statistical patterns. Building oracles to predict defects and/or effort via data mining is also an inductive generalization over past experience. All data miners hit a performance ceiling effect when they cannot find additional information that better relates software metrics to defect occurrence, or effort intervals. What we observe from our past results is that the research paradigm, which relied on relatively straightforward application of machine learning tools, has reached its limits.

To overcome these limits, researchers use combinations of metric features from different software artifacts, which we call information sources, in order to enrich the information content in the search space. However, these features from different sources come at a considerable collection cost, and are not available in all cases. Another way to avoid these limits is to use domain knowledge.

So far in our research we have combined the most basic type of these features—that is, source code measurements—with domain knowledge, and we propose novel ways of increasing the information content using these information sources. Using domain knowledge, we have shown that, for example, data miners for defect prediction can be easily constructed with limited or no data.

Research in AI, programming languages, and software engineering shares many common goals and high-level concepts, tools, and techniques: abstraction, modeling, etc. But there are also significant differences in problem scope, the nature of the solution, and the intended audience. The AI community is interested in finding solutions to problems; the software engineering community tries to find efficient solutions, and as a result it needs to tackle simpler, more focused problems.

When we define intelligence, we have to be precise. What do we mean by intelligence? Which systems are considered intelligent? These questions are important if we are to create “smart” oracles. Building such systems becomes extremely important when we have large and distributed software development teams. Users want an environment in which they can easily switch between subsystems. For example, in the defect prediction domain, it is clear that cooperative use of oracles with test benches is essential to support business decisions.

AI needs to be thought of as a large-scale engineering project. Researchers should build systems and design approaches that merge theory with empirical data, that merge science with large-scale engineering, and that merge methods with expert knowledge, business rules, and intuition.

In our previous work we saw that static code attributes have limited information content. Descriptions of software modules only in terms of static code attributes can overlook some important aspects of software, including the type of application domain, the skill level of the individual programmers involved in system development, contractor development practices, the variation in measurement practices, and the validation of the measurements and instruments used to collect the data. For this reason we have started augmenting and replacing static code measures with repository metrics such as past faults, changes to code, or the number of developers who have worked on the code. In building oracles we have successfully modeled product attributes (static code metrics, repository metrics, etc.) and process attributes (organizational factors, experience of people, etc.). In software development projects, however, people (developers, testers, analysts) are the most important pillar, yet they are very difficult to model. It is inevitable that we should move to a model that considers the product, the process, and the people.

We believe that in defect and effort estimation, more value will come from better understanding of developer characteristics, such as a grasp of how social networks are formed and how they affect the defect-proneness and/or effort allocation. Therefore, research in this area will include input from other disciplines, such as the social sciences, cognitive science, economics, and statistics.

References

[1] Kenett RS. Implementing SCRUM using business process management and pattern analysis methodologies. Dyn Relationships Manage J. 2013;29–48.

[2] Powner DA. Software development: effective practices and federal challenges in applying agile methods. Technical report. US Office of Public Affairs, GAO-12-681; 2012.

[3] Shearer C. The CRISP-DM model: the new blueprint for data mining. J Data Warehousing. 2000;5:13–22.

[4] Creswell JW. The selection of a research design. In: Research design: qualitative, quantitative, and mixed methods approaches. Thousand Oaks: Sage; 2008:3–21.

[5] Easterbrook S, Neves B. Empirical research methods for computer scientists. 2004. URL: http://www.cs.toronto.edu/sme/CSC2130/01-intro.pdf.

[6] Chrissis MB, Konrad M, Shrum S. CMMI for development: guidelines for process integration and product improvement. 3rd ed Reading, MA: Addison Wesley; 2011.

[7] Why is measurements of data management maturity important. CMMI Institute; 2014.

[8] Easterbrook S, Singer J, Storey MA, Damian D. Selecting empirical methods for software engineering research. In: Guide to advanced empirical software engineering. Berlin: Springer; 2008:285–311.

[9] Tosun A, Bener AB, Turhan B, Menzies T. Practical considerations in deploying statistical methods for defect prediction: A case study within the Turkish telecommunications industry. Inform Software Technol. 2010;52(11):1242–1257.

[10] Misirli AT, Bener A. Bayesian networks for evidence-based decision-making in software engineering. IEEE Trans Softw Eng. 2014;40:533–554.

[11] Menzies T, Caglayan B, He Z, Kocaguneli E, Krall J, Peters F, Turhan B. The PROMISE repository of empirical software engineering data. 2012. URL: http://promisedata.googlecode.com.

[12] Basili VR. Software modeling and measurement: the goal/question/metric paradigm. Technical report. College Park, MD, USA; 1992.

[13] Nagappan N, Murphy B, Basili V. The influence of organizational structure on software quality: An empirical case study. In: 30th international conference on software engineering (ICSE 2008). 2008.

[14] Shmueli G, Kenett R. An information quality (InfoQ) framework for ex-ante and ex-post evaluation of empirical studies. In: 3rd international workshop on intelligent data analysis and management (IDAM). 2013.

[15] Calikli G, Bener A. An algorithmic approach to missing data problem in modeling human aspects in software development. In: Predictive models in software engineering conference (PROMISE). 2013.

[16] McCabe TJ. A complexity measure. IEEE Trans Software Eng. 1976;SE-2.

[17] Halstead MH. Elements of software science. New York: Elsevier North-Holland; 1977, vol. 1.

[18] Chidamber SR, Kemerer CF. A metrics suite for object oriented design. IEEE Trans Software Eng. 1994;20(6):476–493.

[19] Fenton NE, Neil M. A critique of software defect prediction models. IEEE Trans Software Eng. 1999;25(3):1–15.

[20] Fenton NE, Ohlsson N. Quantitative analysis of faults and failures in a complex software system. IEEE Trans Software Eng. 2000;26(8):797–814.

[21] Oral AD, Bener A. Defect prediction for embedded software. In: 22nd international symposium on computer and information sciences (ISCIS 2007). 2007.

[22] Ceylan E, Kutlubay O, Bener A. Software defect identification using machine learning techniques. In: EUROMICRO SEAA 2006. 2006.

[23] Shull F, Basili V, Boehm B, Brown AW, Costa P, Lindvall M, Port D, Rus I, Tesoriero R, Zelkowitz M. What we have learnt about fighting defects. In: 8th international software metrics symposium. 2002.

[24] Turhan B, Bener A. Weighted static code attributes for software defect prediction. In: 20th international conference on software engineering and knowledge engineering. 2008.

[25] Menzies T, Greenwald J, Frank A. Data mining static code attributes to learn defect predictors. IEEE Trans Software Eng. 2007;33(1):1–13.

[26] Turhan B, Kocak G, Bener A. Software defect prediction using call graph based ranking (CGBR) framework. In: 34th EUROMICRO software engineering and advanced applications (EUROMICRO-SEAA 2008). 2008.

[27] Kocak G, Turhan B, Bener A. Predicting defects in a large telecommunication system. In: ICSOFT. 2008:284–288.

[28] Brin S, Page L. The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst. 1998;30(1–7):107–117.

[29] Tosun A, Bener AB. Reducing false alarms in software defect prediction by decision threshold optimization. In: ESEM. 2009:477–480.

[30] Turhan B, Misirli AT, Bener A. Empirical evaluation of the effects of mixed project data on learning defect predictors. Inform Software Technol. 2013;55(6):1101–1118.

[31] Pinzger M, Nagappan N, Murphy B. Can developer-module networks predict failures? In: 16th ACM SIGSOFT international symposium on foundations of software engineering, (FSE 2008). 2008.

[32] Meneely A, Williams L, Snipes W, Osborne J. Predicting failures with developer networks and social network analysis. In: 16th ACM SIGSOFT international symposium on foundations of software engineering (FSE 2008). 2008.

[33] Misirli AT, Caglayan B, Miranskyy AV, Bener A, Ruffolo N. Different strokes for different folks: A case study on software metrics for different defect categories. In: Proceedings of the 2nd international workshop on emerging trends in software metrics. 2011:45–51.

[34] Graves TL, Karr AF, Marron JS, Siy H. Predicting fault incidence using software change history. IEEE Trans Software Eng. 2000;26(7):653–661.

[35] Weyuker EJ, Ostrand TJ, Bell RM. Do too many cooks spoil the broth? using the number of developers to enhance defect prediction models. Empirical Software Eng. 2008;13(5):539–559.

[36] Mockus A, Weiss DM. Predicting risk of software changes. Bell Labs Tech J. 2000;5(2):169–180.

[37] Weyuker EJ, Ostrand TJ, Bell RM. Using developer information as a factor for fault prediction. In: Proceedings of the third international workshop on predictor models in software engineering. IEEE Computer Society; 2007:8.

[38] Calikli G, Bener AB. Preliminary analysis of the effects of confirmation bias on software defect density. In: ESEM. 2010.

[39] Calikli G, Bener A. Empirical analyses of the factors affecting confirmation bias and the effects of confirmation bias on software developer/tester performance. In: PROMISE. 2010.

[40] Calikli G, Bener A, Arslan B. An analysis of the effects of company culture, education and experience on confirmation bias levels of software developers and testers. In: ICSE. 2010.

[41] Calikli G, Arslan B, Bener A. Confirmation bias in software development and testing: An analysis of the effects of company size, experience and reasoning skills. In: 22nd annual psychology of programming interest group workshop. 2010.

[42] Calikli G, Bener A, Aytac T, Bozcan O. Towards a metric suite proposal to quantify confirmation biases of developers. In: ESEM. 2013.

[43] Calikli G, Bener A. Influence of confirmation biases of developers on software quality: an empirical study. Software Qual J. 2013;21:377–416.

[44] Bird C, Nagappan N, Gall H, Murphy B, Devanbu P. Putting it all together: Using socio-technical networks to predict failures. In: Proceedings of the 2009 20th international symposium on software reliability engineering. Washington, DC, USA: IEEE Computer Society; 2009:109–119.

[45] Zimmermann T, Nagappan N. Predicting subsystem failures using dependency graph complexities. In: Proceedings of the 18th IEEE international symposium on software reliability. IEEE Computer Society; 2007:227–236.

[46] Bicer S, Caglayan B, Bener A. Defect prediction using social network analysis on issue repositories. In: International conference on software systems and process (ICSSP 2011). 2011:63–71.

[47] Easley D, Kleinberg J. Networks, crowds, and markets: reasoning about a highly connected world. Cambridge University Press; 2010.

[48] Kultur Y, Turhan B, Bener A. Enna: Software effort estimation using ensemble of neural networks with associative memory. In: 16th international symposium on foundations of software engineering (ACM SIGSOFT FSE 2008). 2008.

[49] Kultur Y, Turhan B, Bener AB. Ensemble of neural networks with associative memory (enna) for estimating software development costs. Knowl-Based Syst. 2009;22(6):395–402.

[50] Bakir A, Turhan B, Bener AB. A comparative study for estimating software development effort intervals. Software Qual J. 2011;19(3):537–552.

[51] Bakir A, Kocaguneli E, Tosun A, Bener A, Turhan B. Xiruxe: an intelligent fault tracking tool. In: International conference on artificial intelligence and pattern recognition. 2009.

[52] Kocaguneli E, Misirli AT, Bener A, Caglayan B. Experience on developer participation and effort estimation. In: Euromicro SEAA conference. 2011.

[53] Kocaguneli E, Tosun A, Bener AB. AI-based models for software effort estimation. In: EUROMICRO-SEAA. 2010:323–326.

[54] Lokan C, Wright T, Hill PR, Stringer M. Organizational benchmarking using the ISBSG data repository. IEEE Software. 2001;18(5):26–32.

[55] Bakir A, Turhan B, Bener AB. A new perspective on data homogeneity in software cost estimation: a study in the embedded systems domain. Software Qual J. 2010;18(3):57–80.

[56] Misirli AT, Caglayan B, Bener A, Turhan B. A retrospective study of software analytics projects: in-depth interviews with practitioners. IEEE Software. 2013;30(5):54–61.

[57] Caglayan B, Misirli AT, Calikli G, Bener A, Aytac T, Turhan B. Dione: an integrated measurement and defect prediction solution. In: Proceedings of the ACM SIGSOFT 20th international symposium on the foundations of software engineering. New York, NY, USA: ACM; 2012:20:1–20:2.

[58] Caglayan B, Tosun A, Miranskyy AV, Bener AB, Ruffolo N. Usage of multiple prediction models based on defect categories. In: PROMISE. 2010:8.

[59] Kocaguneli E, Tosun A, Bener AB, Turhan B, Caglayan B. Prest: an intelligent software metrics extraction, analysis and defect prediction tool. In: SEKE. 2009:637–642.

[60] Caglayan B, Misirli AT, Miranskyy AV, Turhan B, Bener A. Factors characterizing reopened issues: a case study. In: PROMISE. 2012:1–10.

[61] Caglayan B, Bener A. Issue ownership activity in two large software projects. In: 9th international workshop on software quality, collocated with FSE. 2012.

[62] Turhan B, Menzies T, Bener AB, Stefano JSD. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Eng. 2009;14(5):540–578.

[63] Owens M, Allen G. The definitive guide to SQLite. Springer; 2006, vol. 1.

[64] Turhan B, Kutlubay FO, Bener AB. Evaluation of feature extraction methods on software cost estimation. In: ESEM. 2007:497.

[65] Yin RK. Case study research: design and methods. Sage; 2009, vol. 5.

[66] Tufte E. The visual display of quantitative information. 2nd ed. Cheshire, CT: Graphics Press; 2001.

[67] Misirli AT, Murphy B, Zimmermann T, Bener A. An explanatory analysis on eclipse beta-release bugs through in-process metrics. In: 8th international workshop on software quality. ACM; 2011:26–33.

[68] Caglayan B, Bener A, Miranskyy AV. Emergence of developer teams in the collaboration network. In: CHASE workshop colocated with ICSE. 2013.

[69] Kocak S, Miranskyy AV, Alptekin G, Bener A, Cialini E. The impact of improving software functionality on environmental sustainability. In: ICT for sustainability. 2013.

[70] Hocking RR. Methods and applications of linear models: regression and the analysis of variance. 3rd ed. Wiley Series in Probability and Statistics; 2013.

[71] Hollander M, Wolfe DA, Chicken E. Nonparametric statistical methods. 3rd ed. Wiley Series in Probability and Statistics; 2014.

[72] Cohen J. Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates Publishers; 1988.

[73] Turhan B, Bener A, Kuvaja P, Oivo M. A quantitative comparison of test-first and test-last code in an industrial project. In: 11th international conference on agile software development (XP). 2010.

[74] Alpaydin E. Introduction to machine learning. Cambridge, MA: MIT Press; 2004.

[75] Bishop CM. Pattern recognition and machine learning. New York: Springer; 2006, vol. 1.

[76] Witten IH, Frank E. Data mining: practical machine learning tools and techniques. Morgan Kaufmann; 2005.

[77] Tosun A, Turhan B, Bener AB. Validation of network measures as indicators of defective modules in software systems. In: PROMISE. 2009:5.

[78] Turhan B, Koçak G, Bener AB. Data mining source code for locating software bugs: A case study in telecommunication industry. Expert Syst Appl. 2009;36(6):9986–9990.

[79] Kocaguneli E, Menzies T, Bener A, Keung JW. Exploiting the essential assumptions of analogy-based effort estimation. IEEE Trans Software Eng. 2012;38(2):425–438.

[80] Bakir A, Turhan B, Bener AB. Software effort estimation as a classification problem. In: ICSOFT (SE/MUSE/GSDCA). 2008:274–277.

[81] Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural Networks. 1989;2(5):359–366.

[82] Geman S, Bienenstock E, Doursat R. Neural networks and the bias/variance dilemma. Neural Comput. 1992;4(1):1–58.

[83] Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. CRC Press; 1984.

[84] Misirli AT, Bener AB, Kale R. AI-based software defect predictors: applications and benefits in a case study. AI Magazine. 2011;32(2):57–68.

[85] Tosun A, Turhan B, Bener AB. Ensemble of software defect predictors: a case study. In: ESEM. 2008:318–320.

[86] Kosker Y, Turhan B, Bener A. An expert system for determining candidate software classes for refactoring. Expert Syst Appl. 2009;36(6):10000–10003.

[87] Baskeles B, Turhan B, Bener A. Software effort estimation using machine learning methods. In: 22nd international symposium on computer and information sciences (ISCIS). 2007:126–131.

[88] Kocaguneli E, Menzies T, Keung JW. On the value of ensemble effort estimation. IEEE Trans Software Eng. 2012;38(6):1403–1416.
