Surveying Software

The first step in this study was to select a representative sample of software, much as surveys in the social sciences select a sample of people. When the size of a population is known, statistical methods give us the minimum sample size needed to draw conclusions with a given uncertainty. For instance, the U.S. population is about 300 million people at the moment of writing this text; exit polls can accurately predict the results of elections if the polled sample is large enough, say, 30,000 people.
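As a rough illustration of how such a minimum sample size can be computed, the sketch below uses Cochran's formula with a finite population correction. This is a standard textbook technique chosen here for illustration, not a method named in this chapter; the function name `min_sample_size` and the default margin of error (1%), z-score (1.96, roughly 95% confidence), and proportion (0.5, the most conservative choice) are all assumptions.

```python
import math


def min_sample_size(population, margin=0.01, z=1.96, p=0.5):
    """Minimum sample size for estimating a proportion.

    Illustrative sketch (assumed technique): Cochran's formula with a
    finite population correction. `population` must be known in advance.
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink it to account for the finite size of the population.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    # A population of about 300 million, as in the example above.
    print(min_sample_size(300_000_000))
```

Under these assumed parameters the function returns 9,604 for a population of 300 million, so a poll of 30,000 people comfortably exceeds that minimum. Note that the population size is a required input of the calculation.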

The problem with this approach when carried over to software engineering is that we do not know the size of the world’s software. So we cannot determine a minimum sample that can answer our questions with a given uncertainty, and the classic “survey approach” to the whole population of software becomes unfeasible.

However, even though the whole population of software is indeterminable, a portion of that population is open, accessible for research, and willing to share its source code with the world: open source software. Strictly speaking, restricting our population to this kind of software means that the answer to our initial question applies only to open source software. But when all is said and done, the only difference between open source and closed source software is the license. Although open source software is often associated with particular development practices (community-driven projects, publicly available source code, etc.), the open source software population is very heterogeneous, ranging from projects that are completely community-driven to projects that remain under the close control of companies. The only feature that all open source software has in common is the set of licenses under which it is released, which make its source code publicly available. Therefore, we can assume that the best complexity metrics for source code obtained from open source projects are also the best complexity metrics for any other source code, whether open source or not.

Open source software also has other properties that make it interesting for research. Open source software repositories are usually completely available to researchers. These repositories contain all the artifacts produced by the software project (source code releases, version control systems, issue tracking systems, mailing list archives, etc.) and often all the previous versions of those artifacts. All that information is available to anyone interested in studying it. Thus, open source not only offers a huge amount of code, allowing us to study samples as large as we might want, but also makes repeatable and verifiable studies possible. Repeatability and verifiability are fundamental, minimal requirements for trusting the conclusions of empirical studies.

There are some perils and pitfalls when using open source software, though. It is hard to obtain data for a large sample of open source software projects because of the heterogeneity of the data sources. Not all projects use the same tools for the different repositories, or even the same kinds of repositories, and often those repositories are dispersed across different sites. Some efforts, such as the FLOSSMetrics (http://flossmetrics.org) and FLOSSMole (http://flossmole.org) projects, deliver databases containing metrics and facts about open source projects, which alleviate the heterogeneity of data when mining open source software repositories. The problem is also partially solved by the open source software community itself in the form of software distributions, such as the well-known Ubuntu and Debian distributions. Distributions gather source code from open source projects, adapt it to the rest of the distribution, and make it available as compiled binary and source code packages. These packages are tagged with meta-information that helps classify them and retrieve the dependencies needed to install them. Some of these distributions are huge, encompassing thousands of packages and millions of lines of source code. So they are ideal for any study that needs to gather a large amount of source code, like the one in this chapter.
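To give a feel for the meta-information attached to packages, the following minimal sketch parses a single stanza written in the style of a Debian control file and extracts the package's section and dependency names. The field names `Package`, `Section`, and `Depends` follow the Debian control file format, but the sample stanza, the package names in it, and the helper functions `parse_stanza` and `dependency_names` are invented for illustration; this is not the tooling used in the study.

```python
def parse_stanza(text):
    """Parse one control-style stanza into a dict mapping field -> value."""
    fields = {}
    current = None
    for line in text.strip().splitlines():
        if line[:1] in (" ", "\t") and current is not None:
            # A line starting with whitespace continues the previous field.
            fields[current] += " " + line.strip()
        elif ":" in line:
            name, _, value = line.partition(":")
            current = name.strip()
            fields[current] = value.strip()
    return fields


def dependency_names(fields, key="Depends"):
    """List package names in a dependency field.

    Version constraints such as "(>= 2.14)" are dropped, and only the
    first option of an "a | b" alternative is kept.
    """
    names = []
    for item in fields.get(key, "").split(","):
        item = item.strip()
        if not item:
            continue
        first_alternative = item.split("|")[0].strip()
        names.append(first_alternative.split()[0])
    return names


# Hypothetical stanza, made up for this example.
SAMPLE = """\
Package: example-tool
Section: devel
Depends: libc6 (>= 2.14), python3,
 libexample1 | libexample2
"""

if __name__ == "__main__":
    fields = parse_stanza(SAMPLE)
    print(fields["Section"], dependency_names(fields))
```

Because every package in a distribution carries metadata in the same format, a small parser like this can be run uniformly over thousands of packages, which is what makes distributions convenient for large-scale studies.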
