Which machine learning method do you need?

L.L. Minku    University of Leicester, Leicester, United Kingdom

Abstract

Machine learning can be used for several different software data analytics tasks, providing useful insights into software processes and products. For example, it can reveal what software modules are most likely to contain bugs, what amount of effort is likely to be required to develop new software projects, what commits are most likely to induce crashes, how the productivity of a company changes over time, how to improve productivity, etc. The right machine learning algorithm depends on the data and the environment being modeled. Therefore, in order to create good data models, it is important to investigate the data analytics problem at hand before choosing the type of machine learning algorithm to be used. This chapter discusses questions that software engineers can ask about their data analytics problem in order to choose an appropriate machine learning algorithm.

Keywords

Machine learning; Machine learning styles; Software effort estimation; Software bug prediction; Crash-inducing commits

Learning Styles

An unarguable fact in teaching is that people do not all learn in the same way. Some of us spend hours reading instructions, whereas others just start trying things out to learn from the outcome. Some of us quickly adjust to new situations and technologies, whereas others tend to stick to tradition. People also tend to use different approaches to learn different tasks. If we try to use an approach that is unsuitable for us, we will not be able to learn well or will take a much longer time to learn.

This is no different in machine learning for software data analytics. There is a plethora of learning algorithms that can be used for several different purposes. Learning algorithms can give us insights into software processes and products, such as what software modules are most likely to contain bugs [1], what amount of effort is likely to be required to develop new software projects [2], what commits are most likely to induce crashes [3], how the productivity of a company changes over time [4], how to improve productivity [4], etc.

The right learning algorithm depends on the data and the environment being modeled. In order to create good data models, it is important to investigate the data analytics problem at hand before choosing the type of learning algorithm to be used. Here are a few useful questions to ask.

Do Additional Data Arrive Over Time?

Databases containing software project data may not have a static size—they may grow over time. For example, consider the task of predicting whether commits are likely to induce crashes [3]. New commits and new information on whether they have induced crashes may become available over time. When additional data are produced over time, it is desirable to use such incoming data to improve data models. Online learning algorithms are able to update data models with incoming data continuously. Chunk-based learning algorithms wait for a new chunk of data to arrive, and then use it to update the data models. Unlike offline learning algorithms, online and chunk-based learning algorithms do not need to reprocess old data or completely rebuild the data model once new data become available [5]. In this way, they are able to update data models faster. Given that rebuilding data models several times can be painfully slow when data sets are not small, this is particularly useful for larger data sets.
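As a minimal sketch of the online style, assuming scikit-learn and entirely hypothetical commit features and labels, a classifier can be updated one labeled commit at a time via partial_fit instead of being rebuilt on all of the historical data:

```python
# Minimal online-learning sketch (hypothetical commit features and labels):
# the model is updated as each newly labeled commit arrives, rather than
# being rebuilt from scratch on all of the historical data.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=42)
classes = np.array([0, 1])  # 0 = clean commit, 1 = crash-inducing commit

# First labeled commit: the set of classes must be declared on the first call.
model.partial_fit(np.array([[120, 4, 23]]), np.array([1]), classes=classes)

# Simulated stream of later commits: (lines changed, files touched, hour of day).
stream = [([10, 1, 10], 0), ([300, 12, 2], 1), ([25, 2, 14], 0)]
for features, label in stream:
    model.partial_fit(np.array([features]), np.array([label]))

# Predict whether an incoming, not-yet-labeled commit looks risky.
print(model.predict(np.array([[200, 8, 3]])))
```

A chunk-based variant would simply call partial_fit on batches of newly labeled commits rather than on single observations.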

Are Changes Likely to Happen Over Time?

Environments that produce data may undergo changes over time. For example, consider the data analytics tasks of software effort estimation [6] and software bug (defect) prediction [7]. Software companies may hire new employees, may change their development process, may adopt new programming languages, etc. Such changes may cause old data to become obsolete, which in turn would cause old software effort estimation and software bug prediction data models to also become obsolete. Such changes may also bring back situations that used to occur in the past, but were not occurring recently. Therefore, simply using all available data together can lead to contradictory and misleading data models. When temporal information about the data is available, change detection techniques can be used in combination with online or chunk-based learning algorithms [5] to handle changes. For instance, they can be used to (1) identify when a change that affects the adequacy of the current data model is occurring [8], (2) determine which existing data models best represent the current situation [6], and (3) decide how to update data models to the new situation [4,6,8]. When an environment has the potential to change, it is essential to collect additional data over time to be able to identify such changes and adapt data models accordingly.
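As an illustration of point (1), the sketch below is a simple, hand-rolled change detector (not a reproduction of any specific published technique): it monitors the prediction errors of an online model and flags a change when the recent error rate rises well above the long-run error rate, at which point the model can be reset, re-weighted, or swapped for a better-suited one.

```python
# Simple, hand-rolled change-detection sketch: flag a change when the
# recent error rate becomes much worse than the long-run error rate.
from collections import deque

class ErrorRateChangeDetector:
    def __init__(self, window=50, factor=1.5):
        self.recent = deque(maxlen=window)  # errors on the most recent observations
        self.total_errors = 0
        self.n = 0
        self.factor = factor

    def update(self, error):
        """error is 1 if the latest prediction was wrong, 0 otherwise."""
        self.recent.append(error)
        self.total_errors += error
        self.n += 1
        long_run_rate = self.total_errors / self.n
        recent_rate = sum(self.recent) / len(self.recent)
        # Signal a change only once the window is full and recent errors spike.
        return (len(self.recent) == self.recent.maxlen
                and recent_rate > self.factor * long_run_rate)

detector = ErrorRateChangeDetector()
# Inside an online-learning loop, one might do something like:
#   error = int(model.predict([x])[0] != y)
#   if detector.update(error):
#       ...  # adapt: e.g., reset the model or switch to another model [6]
```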

If You Have a Prediction Problem, What Do You Really Need to Predict?

In prediction/estimation problems, we wish to predict a certain category or quantity based on features describing an observation. It is important to decide what the target to be predicted really is, because this may affect both the accuracy and the usefulness of the predictive data models. For example, we may wish to estimate the number of bugs in a software module based on features such as its size, complexity, number of commented lines, etc. For that, we should use a regression learning algorithm. Or, we may wish to predict whether or not a certain software module contains bugs. For that, we should use a classification learning algorithm. Or, we may wish to estimate the ranking of modules based on their bug-proneness. For that, we should use a rank learning algorithm. Depending on data availability, estimating the precise number of bugs may be more difficult than simply predicting whether or not a module is likely to contain bugs, even though it is potentially more informative. If we do not really need to know the exact number of bugs, rank learning algorithms [9] may be able to provide a balance between predictive accuracy and informativeness.
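The sketch below, using scikit-learn and hypothetical module data, shows how the same features can be paired with a regression target (bug counts) or a classification target (buggy or not), and how predicted scores can serve as a simple pointwise stand-in for rank learning; dedicated rank learners [9] optimize the ordering directly.

```python
# Framing the same (hypothetical) module data as regression, classification,
# or ranking. Features: size, complexity, number of commented lines.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

X = np.array([[500, 12, 40], [120, 3, 10], [900, 25, 5], [60, 1, 20]])
bug_counts = np.array([3, 0, 7, 0])          # regression target: number of bugs
is_buggy = (bug_counts > 0).astype(int)      # classification target: buggy or not

reg = DecisionTreeRegressor(random_state=0).fit(X, bug_counts)
clf = DecisionTreeClassifier(random_state=0).fit(X, is_buggy)

X_new = np.array([[700, 20, 8], [100, 2, 15]])
print(reg.predict(X_new))   # estimated number of bugs per module
print(clf.predict(X_new))   # predicted buggy (1) / not buggy (0)

# Simple ranking: order modules by predicted bug-proneness, most bug-prone first.
print(np.argsort(-reg.predict(X_new)))
```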

Do You Have a Prediction Problem Where Unlabeled Data Are Abundant and Labeled Data Are Expensive?

In order to create predictive data models, supervised learning algorithms require data for which the quantity/category to be estimated is known (labeled data). These labels act as “teachers.” Even though the existence of a teacher can help us create good predictive data models, it is sometimes expensive to hire such a teacher, ie, it is sometimes expensive to collect labels. This may result in few labeled data despite the existence of abundant unlabeled data, potentially causing supervised learning algorithms to perform poorly. An example of a problem suffering from this issue is software effort estimation, where the actual effort required to develop software projects is costly to collect. However, other features describing software projects may be collected in an automated manner [10,11], which is less costly.

Semisupervised learning algorithms are able to use unlabeled data in combination with labeled data in order to improve predictive accuracy [10,12]. They typically learn the structure underlying the data based on the unlabeled data, and then combine this structure with the available labels in order to build predictive data models. If it is possible to request specific observations to be labeled, one may also opt for active learning algorithms. These learning algorithms are able to determine which observations are most valuable if labeled, instead of requiring all data to be labeled [11].
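As a minimal sketch, assuming scikit-learn and an entirely synthetic effort data set framed as a classification task (e.g., "high effort" vs. "low effort" projects), the code below shows self-training as one semisupervised approach and uncertainty sampling as one simple active-learning query strategy:

```python
# Semisupervised and active learning on synthetic data: only a few projects
# are labeled; the rest are marked with -1 (unlabeled).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # project features (hypothetical)
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)     # 1 = high effort, 0 = low effort

y = np.full(100, -1)                             # -1 marks unlabeled projects
labeled = np.concatenate([np.where(y_true == 0)[0][:5], np.where(y_true == 1)[0][:5]])
y[labeled] = y_true[labeled]

# Semisupervised: self-training adds unlabeled projects it is confident about.
semi = SelfTrainingClassifier(LogisticRegression()).fit(X, y)

# Active learning (uncertainty sampling): ask for labels where a model trained
# on the few labeled projects is least confident.
base = LogisticRegression().fit(X[labeled], y_true[labeled])
uncertainty = np.abs(base.predict_proba(X)[:, 1] - 0.5)
print(np.argsort(uncertainty)[:5])               # 5 projects worth labeling next
```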

Are Your Data Imbalanced?

In some cases, there may be abundant data representing certain aspects of an environment, but little data representing other aspects. Software bug prediction is an example of a problem suffering from class imbalance, where there are typically far fewer buggy software modules than nonbuggy ones. When the data are imbalanced, learning algorithms tend to be biased toward the more common aspects and may completely fail to model the less common ones. Learning algorithms specifically designed for class imbalance learning should be used to deal with this [13].
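One common family of such techniques is cost-sensitive learning, which penalizes mistakes on the rare class more heavily; resampling the training data (e.g., oversampling the minority class) is another. The sketch below, using scikit-learn and synthetic data, contrasts a plain classifier with a cost-sensitive one via class_weight="balanced":

```python
# Cost-sensitive learning sketch on synthetic, imbalanced bug data:
# class_weight="balanced" makes errors on the rare (buggy) class cost more.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                                       # module features
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 1.3).astype(int)   # ~10% buggy

plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight="balanced").fit(X, y)

# The balanced model typically flags more modules as buggy (higher recall on
# the rare class) at the cost of more false alarms.
print((plain.predict(X) == 1).sum(), (balanced.predict(X) == 1).sum())
```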

Do You Need to Use Data From Different Sources?

Even though there may be little data from within the targeted environment, there may be more data available from other environments. Such data may be useful to improve data models for the targeted environment. For example, in software bug prediction, there may be little data indicating whether the modules of a given software system are buggy, despite plenty of such data being available for other software systems. Learning algorithms able to transfer knowledge among environments can be used in these cases [4,6,14,15].
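As one simple illustration in the spirit of relevancy filtering [15] (not a faithful reproduction of any particular approach), the sketch below keeps only the cross-company examples that are nearest neighbors of the few within-company examples, then trains a model on the filtered cross-company data plus the local data; all data and feature choices are hypothetical.

```python
# Sketch of reusing cross-company (CC) data for a company with scarce
# within-company (WC) data: keep only CC modules similar to WC modules.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X_cc = rng.normal(size=(500, 5))             # many cross-company modules
y_cc = (X_cc[:, 0] > 0).astype(int)
X_wc = rng.normal(loc=0.3, size=(20, 5))     # few within-company modules
y_wc = (X_wc[:, 0] > 0).astype(int)

# For each within-company module, keep its 10 nearest cross-company modules.
nn = NearestNeighbors(n_neighbors=10).fit(X_cc)
_, idx = nn.kneighbors(X_wc)
keep = np.unique(idx.ravel())

X_train = np.vstack([X_cc[keep], X_wc])
y_train = np.concatenate([y_cc[keep], y_wc])
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
```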

Do You Have Big Data?

Certain data analytics tasks may need to process large quantities of potentially complex data, causing typical learning algorithms to struggle in terms of computational time. This may be the case, for example, when modeling developers' behavior based on software repositories hosting hundreds of thousands of projects. Online learning algorithms able to process each observation only once can help to build data models faster [5,8].
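A minimal sketch of this single-pass style, assuming scikit-learn and pandas and a hypothetical commits.csv file with hypothetical column names, streams the data in chunks so the whole data set never needs to fit in memory:

```python
# Single-pass learning over a data set too large to load at once:
# stream the (hypothetical) file in chunks and update the model incrementally.
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # 0 = clean commit, 1 = crash-inducing commit

for chunk in pd.read_csv("commits.csv", chunksize=100_000):
    X = chunk[["lines_changed", "files_touched", "hour"]].to_numpy()
    y = chunk["crash_inducing"].to_numpy()
    model.partial_fit(X, y, classes=classes)  # each observation is processed only once
```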

Do You Have Little Data?

When there is not so much data to learn from, learning algorithms tend to struggle to create accurate data models. This is because the available data are not enough to represent the whole environment well. This is typically the case for software effort estimation data, but other software engineering problems may suffer from similar issues. When there is not much data, simpler learning algorithms that do not have too many parameters to be learned tend to perform better [16].
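For instance, k-nearest neighbors regression, the core of analogy-based effort estimation [16], has very few parameters to set and can work reasonably well on small data sets. A minimal sketch with hypothetical project data:

```python
# Analogy-style effort estimation sketch on a tiny, hypothetical data set:
# a new project's effort is estimated from its most similar past projects.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Features: functional size, team size, team experience (all hypothetical).
X = np.array([[120, 5, 3], [300, 10, 1], [80, 3, 4], [200, 7, 2], [150, 6, 3]])
effort = np.array([400, 1200, 250, 700, 520])    # effort in person-hours

model = KNeighborsRegressor(n_neighbors=2).fit(X, effort)
print(model.predict([[180, 6, 2]]))  # mean effort of the 2 closest analogies
```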

In Summary…

…examine your data analytics problem first, then choose the type of learning algorithm to consider! Given a set of learning algorithms to consider, it is also advisable to run experiments in order to find out which of them is best suited to your data.

References

[1] Hall T., Beecham S., Bowes D., Gray D., Counsell S. A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng. 2012;38(6):1276–1304.

[2] Dejaeger K., Verbeke W., Martens D., Baesens B. Data mining techniques for software effort estimation: a comparative study. IEEE Trans Softw Eng. 2012;38(2):375–397.

[3] An L., Khomh F. An empirical study of crash-inducing commits in Mozilla Firefox. In: Proc. 11th international conference on predictive models and data analytics in software engineering; 2015 [article no. 5, 10 pp].

[4] Minku L., Yao X. How to make best use of cross-company data in software effort estimation? In: Proc. 36th international conference on software engineering; 2014:446–456.

[5] Gama J., Gaber M. Learning from data streams. Berlin: Springer-Verlag; 2007.

[6] Minku L., Yao X. DDD: a new ensemble approach for dealing with concept drift. IEEE Trans Knowl Data Eng. 2012;24(4):619–633.

[7] Ekanayake J., Tappolet J., Gall H.C., Bernstein A. Tracking concept drift of software projects using defect prediction quality. In: Proc. 6th IEEE international working conference on mining software repositories; 2009:51–60.

[8] Minku L., Yao X. Can cross-company data improve performance in software effort estimation? In: Proc. 8th international conference on predictive models in software engineering; 2012:69–78.

[9] Yang X., Tang K., Yao X. A learning-to-rank approach to software defect prediction. IEEE Trans Reliab. 2015;64(1):234–246.

[10] Kocaguneli E., Cukic B., Menzies T., Lu H. Building a second opinion: learning cross-company data. In: Proc. 9th international conference on predictive models in software engineering; 2013 [article no. 12].

[11] Kocaguneli E., Menzies T., Keung J., Cok D., Madachy R. Active learning and effort estimation: finding the essential content of software effort estimation data. IEEE Trans Softw Eng. 2013;39(2):1040–1053.

[12] Chapelle O., Scholkopf B., Zien A. Semi-supervised learning. Cambridge, MA: MIT Press; 2006.

[13] Wang S., Yao X. Using class imbalance learning for software defect prediction. IEEE Trans Reliab. 2012;62(2):434–443.

[14] Nam J., Pan S., Kim S. Transfer defect learning. In: Proc. international conference on software engineering; 2013:382–391.

[15] Turhan B., Menzies T., Bener A., Distefano J. On the relative value of cross-company and within-company data for defect prediction. Empir Softw Eng. 2009;14(5):540–578.

[16] Kocaguneli E., Menzies T., Bener A., Keung J. Exploiting the essential assumptions of analogy-based effort estimation. IEEE Trans Softw Eng. 2012;38(2):425–438.
