Limitations and context of research

B. Murphy    Microsoft Research, Cambridge, United Kingdom

Abstract

The software industry has existed for over 60 years and is now a massive multinational, multi-billion dollar industry. During this time, the number of people working in the software industry has expanded significantly, as has the number of software engineering researchers. In theory, these software engineering researchers should be driving the future of software development, but even the most optimistic software engineering researcher would not make this claim. In reality the number of practitioners attending Software Engineering conferences is decreasing. This chapter examines research limitations that may explain its apparent disconnect with the software engineering industry.

Keywords

Data cleaning; Research projects; Novel research techniques; Open source repositories; Academia; Industry practitioners; Software engineering; Data analytics

The software industry has existed for over 60 years and is now a massive multinational, multi-billion dollar industry. Software controls a vast amount of our daily lives: it provides entertainment, controls household goods and transport systems, and delivers worldwide services such as internet searching. The objectives and attributes of software range from small game applications that focus on fast feature deployment to software controlling planes, whose focus is fault tolerance and safety. The products range in scale from software that runs on individual mobile phones to a search engine running across tens of thousands of machines in multiple server farms around the world. Over these 60 years, the number of people working in the software industry has expanded significantly, as has the number of software engineering researchers. In theory, these software engineering researchers should be driving the future of software development, but even the most optimistic software engineering researcher would not make this claim. The increase in the number of people developing software should have resulted in an increase in the numbers attending research conferences to learn what is happening in the field of research. In reality, there is concern within the research community over the lack of practitioners attending software engineering conferences.

This section examines limitations of software engineering research that may explain its apparent disconnect with the software engineering industry. A number of these issues are interrelated and are detailed herein.

Small Research Projects

As discussed previously, there are great variations in the types of software being developed, and many factors that can influence development. The accuracy of any study is dependent on the ability of the researchers to interpret results, and a major factor influencing this is how clean the data is. Data cleaning can take significant domain knowledge, enabling researchers to interpret whether exceptions are valid or invalid. Often, research will be performed on small projects due to the lack of available data from the larger projects. These small projects can be controlled by the researchers, but are often contrived projects using students as developers. In these projects, researchers focus on a small number of development attributes on a single or small number of products. The results of these studies are unique to the data set and product and are rarely reproducible on any other data set for any other product. Open Source and repositories like GitHub have helped researchers access software project developments, but as discussed later in the chapter, these repositories have issues of data quality. Additionally, with the exception of Eclipse, little data from large software developments is available for research.

Nevertheless, out of necessity, papers often claim that their results are generic, which often undermines the research in the eyes of practitioners. The potential consumers of research are commercial developers who would like to see results in their own context, often large proprietary software developments. Often, they struggle to understand how to apply research to their projects. This differs from research disciplines such as systems, where the Symposium on Operating Systems Principles (SOSP) conference offers papers on large projects based on the work of many researchers.

The Importance of Being the First Author on a Publication

A significant driver of researchers wishing to publish at conferences is that students must meet publication requirements in order to graduate. For the students, it is very important that they are the first author of any paper so they can claim ownership of their work; another factor is the importance of the conferences that publish their work. This inevitably results in two limiting factors:

 The research projects are small, because limited resources can be applied to them: other students in the group would rather focus on their own projects than on those of their colleagues.

 Each student and tenure-seeking faculty member is supposed to be carving out their own unique research area, which is shown through being the first author on the paper, but this also detracts from the work of the collaborators on the paper.

Other research disciplines that try to promote larger research projects will place authors in alphabetic order, and credit the work to all authors.

Requirements for Novel Research Techniques

One of the major criteria for a paper's publication in both journals and at conferences is the novelty of the technique. As discussed herein, there are great variations in the types of software products, their attributes, and the development methods applied. Additionally, software development is a human activity, and as such, has great variations between different development teams, even if they are applying the same development techniques. To fully understand the universality of a specific finding, the same research experiment should be applied to numerous products developed by different development groups. But replicated studies are implicitly discouraged in software engineering because top software engineering venues reject papers that are not considered to be novel research areas.

Most other engineering disciplines welcome the publication of papers that reflect the application of the same engineering techniques within different environments. For instance, medical journals encourage replicated trials, applied to different sets of patients, which often produce different results. The medical journals value the ability of the researchers to interpret the differences between the two trials; this enhances the general knowledge of the discipline. The lack of replicated studies in software engineering research means that the scope of any proposed technique is not fully understood and, more importantly, that its limitations remain unknown. Because practitioners understand that most techniques have limitations, they will not apply any new technique whose limitations are unknown.

The quality of papers within the research community is measured by the number of citations in other research papers. Conferences (Foundations of Software Engineering (FSE), International Conference on Software Engineering (ICSE)) define influential papers over time based on the “novelty” of the idea and the number of citations. Other engineering disciplines, logically, measure a paper's influence based on its impact on the industry.

Data Quality of Open Source Repositories

Over recent years there has been a significant increase in the amount of data available to software engineering researchers due to the increase in developers using open repositories such as GitHub. This has opened up a vast resource of software development activity for research. But there are various data quality issues that hinder interpretation of results derived from this data, specifically:

1. Git provides tremendous flexibility in how code can be moved between branches, but that flexibility can result in the loss of many data attributes, such as the branch on which a change was made and the order in which changes occurred.

2. Triaging of bug data: A bug submitted to a repository may not be reproducible due to lack of data, or insufficient description of the defect. Additionally, a submitted bug may be a duplicate of an already existing bug in the repository. In theory, engineers, when correcting or discarding a bug, will properly document the reasons for their actions; in practice, they rarely do, as it is far too time consuming. As such, it is difficult to relate the bugs in the repository to defects in the software.

3. Definition of the complete product: Often the code repository will contain a master branch, with the natural assumption being that all code in that branch forms the released product. In reality, the code in the master branch will be a mixture of product, test, and process code, often developed using different criteria, complicating the interpretation of any results.

4. Source code from industry is often limited to post-release information (e.g., Android): A much richer data set would cover the development period, but that is rarely made available.

5. Limited access to, and identification of, developers: A major issue with the analysis of software engineering data is the management of exceptions. The simplest way of interpreting data is to talk to the engineers responsible for the development; without access to the developers, the researchers manage outliers using statistical techniques that may not necessarily be accurate.
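The third issue above, separating product code from test and process code, can be illustrated with a minimal sketch. It assumes a simple path-based heuristic; the directory names, file names, and the functions `classify` and `summarize` are hypothetical choices made for illustration, not a method described in this chapter, and real projects would need rules tuned with domain knowledge.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical heuristics: directory and file names that commonly hold
# test or process (build/CI) code. Real repositories vary widely.
TEST_DIRS = {"test", "tests", "testing"}
PROCESS_DIRS = {".github", "ci", "build", "scripts", "tools"}
PROCESS_FILES = {"makefile", "dockerfile"}


def classify(path: str) -> str:
    """Label a repository path as 'test', 'process', or 'product' code."""
    p = PurePosixPath(path)
    # Directory components only (everything except the file name).
    dirs = {part.lower() for part in p.parts[:-1]}
    name = p.name.lower()
    stem = p.stem.lower()
    if dirs & TEST_DIRS or stem.startswith("test_") or stem.endswith("_test"):
        return "test"
    if dirs & PROCESS_DIRS or name in PROCESS_FILES:
        return "process"
    # Anything not matched is assumed to be product code, which may be wrong.
    return "product"


def summarize(paths):
    """Count how many paths fall into each category."""
    return Counter(classify(p) for p in paths)


if __name__ == "__main__":
    sample = [
        "src/app/main.py",
        "src/app/tests/test_main.py",
        ".github/workflows/ci.yml",
        "Makefile",
    ]
    print(summarize(sample))
```

Even a rough split like this makes the point of the list item concrete: metrics computed over the whole master branch mix code written to different standards, so a study would normally report each category separately and state which heuristic was used.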

Lack of Industrial Representatives at Conferences

The steering committees, chairs, and program committees (PCs) are dominated by academics, and most of the few industrial representatives come from research departments within industry. While academics personally benefit from working at conferences, industrial practitioners get little recognition within their companies for working at conferences, and may even need to perform the work on their own time. Additionally, as these conferences grow, the amount of work required from people on the committee grows: the recent ICSE conference required each PC member to review 18 papers, to respond to authors' comments, and to discuss the papers with other reviewers. As the number of industrial representatives at conferences drops, it is inevitable that these engineering conferences will increasingly focus on academics' interests, without any regard to whether those interests are relevant to the software industry.

Research From Industry

As previously noted, industry involvement in software engineering conferences is limited. A factor in this is the decrease in the number of research labs in industry, with companies closing down or limiting their research departments, and newer companies not creating new labs, although a number of these companies do employ researchers who submit research material for publication. A major source of publications from industry is papers written by interns, who study at universities but work in industry for a short period of time.

Industrial researchers have access to a much richer data set than academic researchers who use open source data. This is due to product teams often triaging bug data to identify duplicate bugs and to discard bugs that are non-reproducible. Additionally, industrial bug data will often contain a reference to any resulting code changes. Industrial researchers often have full access to developers and product information, allowing them to better interpret their results. They also have the flexibility to apply the same techniques to different data sets for the same or equivalent products developed by different product groups within the same company, resulting in much larger experiments, and a greater ability to identify limitations in any technique.

A major issue is that industrial researchers limit access to the raw, or interpreted, data. They also rarely provide absolute values in their papers, leaving the y-axis values in graphs blank. There are many reasons for this: firms may fear for their reputation if the absolute values of bug reports are published. Another frequent reason is that the data may contain information on products from other companies, which they do not have the right to release. For instance, Windows crash data contains a large amount of non-Microsoft failure data, including the names of third-party drivers and applications. Withholding data in this way can produce negative reactions from academic reviewers of papers written by industry practitioners.

There are also practical issues regarding getting industry people to contribute to conferences or journals. The most important factor is that the majority of industry practitioners will not personally benefit from getting work published, as that is not their primary job. To attend a conference, they will need permission, so their company will have to view attending a conference as beneficial. Ironically, the greater the amount of industrial participation at a conference, the more attractive the conference is to industry. Industrial practitioners are also at a disadvantage because the structure of academic papers is different from that of documents written in industry, so practitioners often struggle to ensure a paper meets all the required criteria.

Summary

There has been a growing disconnect between software engineering research and software practitioners, and based on the underlying factors driving academia discussed in this chapter, this disconnect is only going to increase. There are numerous people within academia who work closely with industry, but perversely, publishing the results of these joint collaborations is more difficult than focusing on purely theoretical areas. Improving the impact of software engineering research, in terms of its applicability in the software industry, will require universities to change the way they measure the contribution of their students, decreasing the focus on published papers and placing more value on their contributions to the field. Conferences should actively recruit people from industry to help drive their direction, which requires the conferences to lower the burden of participating in the conference. Additionally, conferences and technical journals need to focus less on the quality of the paper, in terms of its structure and technical content, and more on how much the paper adds to the general knowledge of software engineering.
