What Is There to Mine?

During software development, programmers routinely produce and collect lots of data, all of which can be accessed and analyzed automatically:

  • The source code for your product. This is the most important input to your analysis, as it provides you with locations (files, units, classes, components, etc.) that can be associated with various product or process factors.

  • Collecting data on the execution of the software provides you with profiles, telling you which parts are frequently used and which parts are not.

  • Your product may come with additional documentation, such as design documents or requirements documents; these may also provide important features that explain why code looks the way it does.

  • The resulting software can be analyzed statically, providing features such as complexity metrics or dependencies.

  • Version archives record the changes made to the product, including who made each change, when, where, and why. Version archives can tell a lot about a project’s history, provided that the stored changes are logically separated and the stored rationales are recorded systematically and consistently.

  • To map problems to locations, it is important to have a problem database that describes all the problems that ever occurred and tracks their life cycles.

  • Finally, you may have social data: a partitioning of developers into projects or groups, emails or other messages between developers, and even billing or effort data. With such data, you can, for instance, determine how effort maps to individual tasks or locations, or how individual groups contribute to changes—and to errors. Before you rank groups by their error density, though, keep in mind that the most difficult tasks produce the most errors. Making mistakes might be just a side effect of being assigned the most difficult tasks—which developers get by being the most experienced and trusted programmers.
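
To make the idea of mapping problems to locations concrete, here is a minimal sketch that links a version archive to a problem database. It assumes, purely for illustration, that commit messages cite problem IDs in the form "#12"; the commit records, file names, problem IDs, and the function problems_per_file are all hypothetical and not taken from any particular tool.

```python
import re
from collections import Counter

# Hypothetical commit records: (message, files touched). A real study would
# read these from the version archive instead of hard-coding them.
COMMITS = [
    ("Fix #12: null check in parser", ["src/parser.c"]),
    ("Add logging", ["src/log.c", "src/parser.c"]),
    ("Fixes #15 and #12 again", ["src/parser.c", "src/lexer.c"]),
    ("Refactor lexer", ["src/lexer.c"]),
]

# Hypothetical problem database, reduced to the set of known problem IDs.
KNOWN_PROBLEMS = {12, 15}

# Assumed convention: a commit cites a problem as "#<number>".
ISSUE_REF = re.compile(r"#(\d+)")


def problems_per_file(commits, known_problems):
    """Count, per file, the distinct known problems whose fixes touched it."""
    seen = {}
    for message, files in commits:
        # Keep only references that exist in the problem database.
        ids = {int(n) for n in ISSUE_REF.findall(message)} & known_problems
        for path in files:
            seen.setdefault(path, set()).update(ids)
    return Counter({path: len(ids) for path, ids in seen.items()})


counts = problems_per_file(COMMITS, KNOWN_PROBLEMS)
for path, n in sorted(counts.items()):
    print(f"{path}: {n} problem(s)")
```

A count like this is only as reliable as the commit messages behind it: a fix committed without a problem reference, or a commit that bundles unrelated changes, will distort the mapping.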

From a researcher’s standpoint, the advantage of accessing these data sources is that they are unbiased: they record changes, problems, and other events at the moment they happen, with a realistic perspective that directly reflects the activities of the developers dealing with them. On the other hand, the data may also be noisy and incomplete, which is why special steps are required before analysis.
