Our Study

The categorization we have presented here came out of our experiences analyzing code clones that we found in several open source software systems, including the Linux kernel, the Apache httpd web server, the PostgreSQL relational database system, the Gnumeric spreadsheet application, the Columba email client, nine text editors, including vim and emacs, and eight X11 window managers. Since we had spotted each of these patterns multiple times across several systems, we were pretty confident that the patterns were “real” and not peculiar to a particular system or application domain. And although we were also pretty sure that cloning was often used as a principled practice, we lacked any concrete quantitative evidence. So we set out to examine two large open source systems from different domains, and tried to measure just how commonly cloning is used as a principled design tool, at least in those systems.

To crawl and categorize these large systems, we used our own clone detection tool, called CLICS (CLoning Interpretation and Categorization System). CLICS tokenizes the source code input and then employs suffix arrays to perform parameterized string matching; this is similar to the approach of other tools such as CCFinder. In order to detect clones where variable names might have been changed, this technique maps all identifiers to a single proxy token; that is, all identifiers will match each other, and the remaining tokens in the input stream (keywords, operators, and separators) play an enhanced role in identifying cloned segments of code. However, this approach also leads to a lot of false positives, since we lose information by ignoring the actual identifiers. Because of this, we performed aggressive filtering on the initial results to remove likely false positives. For example, switch statements in C-like languages often look similar at this level, even when they perform quite different tasks. Where we felt that false positives were more likely, we applied stronger matching criteria before accepting candidates as clones.
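To make the identifier-normalization idea concrete, here is a minimal sketch in Python. It is not CLICS itself: the keyword subset, the tokenizing regular expression, and the names `normalize` and `longest_repeated_run` are all assumptions made for illustration, and the suffix array is built naively by sorting rather than with an efficient construction algorithm.

```python
import re

# Illustrative subset of C keywords (an assumption; a real tool would use the full set).
C_KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void",
              "switch", "case", "break", "struct"}

# Words, integer literals, a few two-character operators,
# then any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|&&|\|\||\S")


def normalize(source):
    """Tokenize C-like text, replacing every identifier with the proxy token 'ID'."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if re.match(r"[A-Za-z_]", tok) and tok not in C_KEYWORDS:
            tokens.append("ID")   # all identifiers match each other
        else:
            tokens.append(tok)    # keywords, operators, separators kept as-is
    return tokens


def longest_repeated_run(tokens):
    """Return (length, i, j): the longest token run that starts at two positions i < j.

    Uses a naive suffix array: the longest repeat is always shared by two
    suffixes that are adjacent in sorted order. Overlapping runs are allowed.
    """
    n = len(tokens)
    suffixes = sorted(range(n), key=lambda i: tokens[i:])
    best = (0, -1, -1)
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while a + k < n and b + k < n and tokens[a + k] == tokens[b + k]:
            k += 1
        if k > best[0]:
            best = (k, min(a, b), max(a, b))
    return best


# Two functions that differ only in their identifier names.
a = normalize("int sum(int a, int b) { return a + b; }")
b = normalize("int add(int x, int y) { return x + y; }")
stream = a + ["<EOF>"] + b   # separator token keeps matches from spanning files
print(a == b)
print(longest_repeated_run(stream))
```

After normalization the two functions produce identical token streams, so the whole body is reported as one repeated run. A real detector would report every run at or above a minimum length rather than only the longest one, and would then apply filtering of the kind described above.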

If two chunks of code passed our matching criteria and automated filtering, we considered them to be a clone pair. We then looked at regions of code within source files to make better sense of the results. The precise definitions can be found in our paper, but in brief, we divided each file into functions and other “regions,” such as struct definitions, and then considered the cloning relationships between those regions. We split the previously found clones at region boundaries and looked for regions that had cloning relationships between them; we called these Regional Groups of Clones (RGCs). The RGCs represent a coarser view of cloning within a software system, at a level that is likely to make more sense to developers. We have done most of our studies using RGCs, as they seem more intuitive to us as a measure of significant cloning. Thus, for each system, we report both the number of clone pairs found and the number of RGCs.
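The following Python sketch shows one plausible reading of that simplified description; the precise definition is the one in the paper, not this code. It assumes a single token stream, clones given as hypothetical `(a_start, b_start, length)` triples, and regions given as non-overlapping half-open `(start, end, name)` ranges. The function names are invented for illustration.

```python
from collections import defaultdict


def region_of(pos, regions):
    """Return the name of the region containing token position pos, or None."""
    for start, end, name in regions:
        if start <= pos < end:
            return name
    return None


def group_by_region(clones, regions):
    """Split each clone at region boundaries and group the pieces by region pair."""
    boundaries = sorted({edge for start, end, _ in regions for edge in (start, end)})
    groups = defaultdict(list)
    for a, b, length in clones:
        # Offsets inside the clone where either side crosses a region boundary.
        cuts = {0, length}
        for edge in boundaries:
            for side_start in (a, b):
                offset = edge - side_start
                if 0 < offset < length:
                    cuts.add(offset)
        cuts = sorted(cuts)
        # Each piece between consecutive cuts lies in one region on each side.
        for lo, hi in zip(cuts, cuts[1:]):
            region_a = region_of(a + lo, regions)
            region_b = region_of(b + lo, regions)
            if region_a is not None and region_b is not None:
                groups[(region_a, region_b)].append((a + lo, b + lo, hi - lo))
    return dict(groups)


# Four hypothetical regions (say, functions f, g, h, k) and two clone pairs.
regions = [(0, 10, "f"), (10, 20, "g"), (20, 30, "h"), (30, 40, "k")]
clones = [(5, 25, 10), (12, 32, 3)]
result = group_by_region(clones, regions)
for pair, pieces in result.items():
    print(pair, pieces)
```

In this toy input, the first clone straddles a boundary on both sides, so it is split into an f/h piece and a g/k piece; the second clone falls entirely within g and k and joins the same g/k group. Each key of `result` plays the role of one regional group.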

For our case study, we decided to use minimum thresholds of 30 and 60 consecutive matching tokens for candidates to be considered clones of each other, thresholds that were suggested by our previous work in the area.

We picked two large open source systems that we had studied before: Apache and Gnumeric. We thought they were good choices because they were both successful, long-lived systems of similar size but from different application domains. In particular, we looked at Apache version 2.2.4, which has more than 300,000 lines of code across 783 files, and Gnumeric version 1.6.3, which has more than 300,000 lines of code across 530 files.

Using the settings we described, CLICS identified 21,270 clones comprising 1,580 RGCs for Apache, and 11,400 clones comprising 3,437 RGCs for Gnumeric. That sounds like a lot, doesn’t it? It is a lot, yet it’s not at all atypical of the systems we have examined. That is, cloning is pretty common in big systems, and probably a lot more common than you might have imagined.

We randomly selected 100 RGCs from each of the systems. Then, we manually examined each one and asked the question: is this use of cloning a good idea, a bad idea, or simply unavoidable? Good clones, in our estimation, were those that represented an improvement over any alternative design that might reasonably have been picked. Bad clones were those for which we could see an obviously better design, and were likely due to developer laziness, design drift, or some other unfortunate circumstance. Unavoidable clones were those that were either too trivial to bother refactoring, or for which there was no reasonable alternative; API protocols were most commonly considered unavoidable, as they are hard to abstract and the client typically does not “own” the code anyway. We also found some false positives: 7% of the Apache RGCs and 29% of the Gnumeric RGCs that we examined manually were judged to not be real clones; consequently, the totals in the following table don’t sum to 100 for each system.

The following table summarizes what we found. We show only the 60-token results, since longer clones are more likely to be interesting and less likely to be false positives. In general, we judged about 35–40% of the clones to be good, a little less than that to be bad, and 15–20% to be simply unavoidable. To us, this was pretty good evidence that open source developers often use cloning as a principled development tool.

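| Category | Pattern | Apache: Good | Apache: Unavoidable | Apache: Harmful | Gnumeric: Good | Gnumeric: Unavoidable | Gnumeric: Harmful |
|---|---|---|---|---|---|---|---|
| Forking | Hardware variation | 0 | 0 | 0 | 0 | 0 | 0 |
| Forking | Platform variation | 10 | 0 | 0 | 0 | 0 | 0 |
| Forking | Experimental variation | 4 | 0 | 0 | 0 | 0 | 0 |
| Templating | Boilerplating | 5 | 0 | 0 | 6 | 0 | 1 |
| Templating | API protocols | 0 | 17 | 0 | 0 | 8 | 1 |
| Templating | Programming idioms | 0 | 0 | 12 | 1 | 0 | 0 |
| Templating | Parameterized code | 5 | 1 | 12 | 10 | 0 | 24 |
| Customizing | Bug workarounds | 0 | 0 | 0 | 0 | 0 | 0 |
| Customizing | Replicate and specialize | 12 | 0 | 4 | 15 | 0 | 1 |
| Other | | 3 | 0 | 8 | 1 | 0 | 3 |
| Total | | 39 | 18 | 36 | 33 | 8 | 30 |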
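As a quick cross-check, the per-pattern counts can be summed and compared with the Total row and with the false-positive rates reported earlier (7% for Apache, 29% for Gnumeric, out of 100 sampled RGCs each). The short Python sketch below does only that arithmetic; the variable names are ours, and the counts are transcribed from the table.

```python
# Counts per pattern, in the table's column order:
# (Apache good, unavoidable, harmful, Gnumeric good, unavoidable, harmful)
rows = {
    "Hardware variation":       (0, 0, 0, 0, 0, 0),
    "Platform variation":       (10, 0, 0, 0, 0, 0),
    "Experimental variation":   (4, 0, 0, 0, 0, 0),
    "Boilerplating":            (5, 0, 0, 6, 0, 1),
    "API protocols":            (0, 17, 0, 0, 8, 1),
    "Programming idioms":       (0, 0, 12, 1, 0, 0),
    "Parameterized code":       (5, 1, 12, 10, 0, 24),
    "Bug workarounds":          (0, 0, 0, 0, 0, 0),
    "Replicate and specialize": (12, 0, 4, 15, 0, 1),
    "Other":                    (3, 0, 8, 1, 0, 3),
}

SAMPLE_SIZE = 100  # RGCs examined per system

# Column sums should reproduce the Total row.
totals = tuple(sum(column) for column in zip(*rows.values()))
apache_real = sum(totals[:3])      # sampled Apache RGCs judged to be real clones
gnumeric_real = sum(totals[3:])    # sampled Gnumeric RGCs judged to be real clones

print("Column totals:", totals)
print("Apache false positives:", SAMPLE_SIZE - apache_real)
print("Gnumeric false positives:", SAMPLE_SIZE - gnumeric_real)
```

The column sums match the Total row (39, 18, 36, 33, 8, 30), and the shortfalls from 100 are 7 and 29, which agree with the false-positive rates given in the text.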

To be clear, we certainly don’t think that this is the last word on the relative harmfulness of cloning. Our categorization is our current best effort, but certainly other categories and even organizing principles are possible. Furthermore, no empirical study is perfect. For one thing, we designed and executed the study ourselves; employing a neutral third party to make the judgment calls about harmfulness of clones would have reduced the risk of bias, but at the cost of less expertise in the decision-making process. Also, we studied only two systems, which is neither statistically significant nor representative of the innumerable possible application domains in the known universe. And both systems were open source, which may bias the results further. Finally, we have no data on, say, how cloning affects long-term code quality or if the risks of inconsistent maintenance are significant, although some studies are now beginning to appear [Krinke 2007]. But we think that the evidence that cloning can be used as a principled development tool is pretty strong.

Although you can see our paper for details about our results, it’s probably worth pointing out here a couple of the interesting differences we found between the two systems. For one thing, we found that Apache made significant (and principled, in our opinion) use of forking, whereas Gnumeric did not. We didn’t find this too surprising, since Apache provides a large set of fairly low-level services; it’s meant to run directly on top of a variety of platforms, so the kind of virtualization provided by the Apache Portable Runtime, which we knew uses the platform variation pattern, makes good sense. Gnumeric, on the other hand, achieves portability largely by relying on the GTK widget set, which is not actually part of the Gnumeric codebase. We also noticed that Gnumeric appeared to make more use of the parameterized code pattern than Apache; manual inspection suggested that many of the features implemented as GUI-based operations were indeed highly similar to each other, hence the cloning. Finally, it’s worth pointing out that these numbers represent a random sample of 100 RGCs from each system, which amounts to only about 6% of the Apache RGCs, and about 3% of the Gnumeric RGCs.
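The sampling fractions follow directly from the RGC counts reported earlier (1,580 for Apache and 3,437 for Gnumeric). A few lines of Python, with variable names of our own choosing, reproduce them:

```python
SAMPLE_SIZE = 100  # RGCs examined manually per system
rgc_counts = {"Apache": 1580, "Gnumeric": 3437}  # totals reported by CLICS

# Fraction of each system's RGCs covered by the manual sample.
fractions = {system: SAMPLE_SIZE / total for system, total in rgc_counts.items()}
for system, fraction in fractions.items():
    print(f"{system}: {fraction:.1%} of RGCs sampled")
```

This gives about 6.3% for Apache and 2.9% for Gnumeric, in line with the rounded figures above.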
