The First Experiment: Testing Pattern Documentation

The first question we asked was whether merely documenting design patterns would improve programmer performance. The idea was to give two sets of subjects the same program to modify, but one group would get slightly extended program documentation pointing out the patterns present; the other group would get normal documentation, but no pattern information. This experiment would thus test only the presence or absence of pattern documentation, not the effect of patterns. We chose this approach partly because when designing the experiment in 1996, we hadn’t figured out how to construct two versions of a program that were equivalent, but with only one containing patterns. In a later experiment we were able to overcome this hurdle. But testing pattern documentation alone still proved useful because it gave us a first indication of whether patterns could be effective. If documenting patterns alone increased performance, even bigger effects could be expected when comparing programs with patterns against those without.

Design of the Experiment

The experiment question was this: does it help the software maintainer if the design patterns in the program code are documented explicitly (using source code comments), compared to a program without explicit reference to design patterns? This question was refined into the following two hypotheses:

Hypothesis 1

Documentation of design patterns speeds up pattern-relevant maintenance tasks.

Hypothesis 2

Documentation of design patterns reduces errors in pattern-relevant maintenance tasks.

The experiment measures work time and errors when programmers perform maintenance tasks that involve design patterns. We actually conducted two experiments. The first experiment was performed in January 1997 at the University of Karlsruhe (UKA), and the second in May 1997 at Washington University in St. Louis (WUSTL). In Karlsruhe, 64 graduate and 10 undergraduate students participated; in St. Louis, 22 undergraduate students participated. All students had been trained in Java (UKA) or C++ (WUSTL) and had written programs with design patterns prior to the experiment; a pretest made sure participants knew about the relevant patterns. Details are available in [Prechelt et al. 2002].

An oft-repeated complaint is that one should use professionals rather than students, but at that early time in design pattern history, professionals with pattern experience were extremely difficult to find. Even with students, the experiment would be useful, though. If patterns made no difference for students, then there was scant hope professionals would show a benefit from patterns. The reasons for this are several. First, professionals have experience in dealing with large systems, so they might need less help from design patterns, which reduces the effect size, i.e., the difference between pattern and no-pattern measurements. Second, professionals have more diverse backgrounds: many of them do not have formal training in computer science, and their programming experience varies from several years to several decades. In contrast, students from a university computer science program have all seen the same material for the same amount of time and have less diverse programming experience. So one needs to expect a lot more noise in experiments with professionals. More noise combined with a reduced effect size would make an already insignificant difference all but disappear. In that case, one can save the expense of running an experiment with professionals—the result would be inconclusive. The student experiment can be seen as a preliminary test that tells the experimenter whether to pursue a hypothesis with professionals.

Although the two experiments were similar, there were some variations that turned out to complement the experiments rather nicely. In Karlsruhe, students wrote their solutions on paper using Java, whereas the WUSTL subjects produced running programs in C++. Working on paper avoids many problems that are unrelated to the experiment question, such as difficulties with the programming environment or programming language. Producing running programs, however, provides firmer evidence that programmers actually understood the design patterns. Using both scenarios and comparing the results allowed us to conclude that working on paper alone was good enough for later experiments. Also, using two different object-oriented languages supports the claim that results do not depend on the choice of object-oriented language.

Each of the experiments was performed in lieu of a final exam. Participants needed between two and four hours, but a number of WUSTL students gave up because of time constraints (they wanted to catch transportation home after the final). Experiments are conducted in the real world, and the real world tends to intrude in unplanned ways. We learned this lesson the hard way.

Since we knew that we had only a few participants, we planned to let each participant work once with design pattern documentation and once without. This way, we would get two data points from each participant. Obviously, it would not be appropriate to let participants solve the same maintenance problem twice. Hence, we needed two different programs, similar in size and complexity. Since we were going to combine the responses from both programs, they did not need to be identical in complexity, but they couldn’t be vastly different, because in that case differences in response could be due to size and complexity rather than design pattern documentation. The first sample program is called Phonebook. It manages an address database and displays first and last names plus phone numbers in several different formats. It consists of 11 classes and 565 lines, 197 lines of which were comments. Phonebook contains the patterns Observer and Template Method. The second sample program implements an And-Or-Tree, a recursive data structure. With 7 classes and 362 lines (133 of which were comments), it is shorter, but somewhat more difficult to understand. It uses the patterns Composite and Visitor. When documenting the patterns, we added 14 lines to Phonebook and 18 lines to And-Or-Tree. (The line counts shown are for the Java versions; the counts for C++ are slightly different.) Here are two examples of the extra documentation:

*** DESIGN PATTERN: ***
The two TupleDisplays are registered as observers at the Tupleset.

*** DESIGN PATTERN: ***
newTuple together with its auxiliary method mergeIn() forms a
*** Template Method ***. The hooks are the methods select(), format(), and compare().
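
To make these comments concrete, here is a minimal Java sketch of the kind of structure they describe. Only the identifiers named in the comments (TupleDisplay, Tupleset, newTuple(), mergeIn(), select(), format(), compare()) come from the experiment material; all signatures, the placement of newTuple() in the display class, and the Tuple class are assumed for illustration and are not the actual Phonebook code.

// Hypothetical sketch; only identifiers from the comments above are real.
abstract class TupleDisplay {                      // Observer role: notified by the Tupleset
    abstract void update();                        // assumed notification hook

    // newTuple() plus its auxiliary mergeIn() form a Template Method;
    // select(), format(), and compare() are the hooks subclasses override.
    final void newTuple(Tuple t) {
        if (select(t)) {                           // hook: which tuples this display shows
            mergeIn(format(t));                    // invariant step of the template
        }
    }
    abstract boolean select(Tuple t);              // hook
    abstract String format(Tuple t);               // hook
    abstract int compare(Tuple a, Tuple b);        // hook: ordering within the display
    void mergeIn(String line) { /* insert the formatted line, ordered via compare() */ }
}

class Tupleset {                                   // Subject role of the Observer pattern
    private final java.util.List<TupleDisplay> observers = new java.util.ArrayList<>();
    void register(TupleDisplay d) { observers.add(d); }   // "registered as observers"
    void notifyObservers() { for (TupleDisplay d : observers) d.update(); }
}

class Tuple { /* first name, last name, phone number */ }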

Note that the programs were well documented, even without the design pattern information. Maintenance tasks were entirely manageable without this extra information. Thus, the experiment was designed extremely conservatively, almost against showing any effect. If any reduction in error count or work time in response to pattern documentation showed up, it would probably be even more pronounced in real programs, because of the often scant documentation of “professional” programs.

For Phonebook, the solution involved declaring and instantiating two new observers, one with and one without a template method. For And-Or-Tree, participants needed to declare and instantiate a new visitor. The description of the maintenance tasks did not mention patterns.
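
For illustration, here is a minimal, hypothetical Java sketch of the And-Or-Tree task: a Composite tree of nodes that accept visitors, plus a new visitor that the participant declares and instantiates. All class and method names are assumed; they are not taken from the experiment’s sample program.

// Composite: uniform interface for inner nodes and leaves
abstract class Node {
    abstract void accept(Visitor v);
}

interface Visitor {                                // Visitor: one visit method per node type
    void visitAndNode(AndNode n);
    void visitOrNode(OrNode n);
    void visitLeaf(Leaf n);
}

class AndNode extends Node {
    final java.util.List<Node> children = new java.util.ArrayList<>();
    void accept(Visitor v) { v.visitAndNode(this); }
}

class OrNode extends Node {
    final java.util.List<Node> children = new java.util.ArrayList<>();
    void accept(Visitor v) { v.visitOrNode(this); }
}

class Leaf extends Node {
    void accept(Visitor v) { v.visitLeaf(this); }
}

// The maintenance task: declare a new visitor ...
class CountLeaves implements Visitor {
    int count = 0;
    public void visitAndNode(AndNode n) { for (Node c : n.children) c.accept(this); }
    public void visitOrNode(OrNode n)  { for (Node c : n.children) c.accept(this); }
    public void visitLeaf(Leaf n)      { count++; }
}

// ... and instantiate it on an existing tree:
//   CountLeaves counter = new CountLeaves();
//   root.accept(counter);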

Because we used two different sample programs, a potential threat to validity needed to be addressed. Suppose participants who are given pattern information in the first round start to look for patterns in the second round? Then the data points for the second round would be useless, because by hunting for patterns, participants would behave as if they had pattern information available. Similarly, suppose that by starting with Phonebook, participants learn something they can use on the other program? Again, the effect of pattern documentation would then be confounded with something else. This threat to validity is called a sequence or learning effect. The answer to this threat is a counter-balanced experiment design. In this design, we divide participants into four groups that differ in the order in which they receive the programs and the order in which they receive the treatment (with or without design pattern documentation).

Figure 22-2 shows the counter-balanced experiment design. Group 1, for instance, works first on Phonebook with pattern documentation, and then on And-Or-Tree without pattern documentation. By comparing results cross-wise, one can check whether it matters in which round a program with pattern documentation is encountered. For instance, Group 1 and Group 4 both work on Phonebook with pattern documentation, but in a different order; Group 2 and Group 3 do the same for And-Or-Tree. The experimenter checks whether the pooled results of the left circles differ noticeably from the pooled results of the right circles. If so, a learning effect is present. By comparing the results for the diamonds in a similar way, one checks for a learning effect in the absence of pattern documentation. In our experiments, these checks showed no learning effects. (To be safe, we also asked in a post-questionnaire whether participants had looked for design patterns on their own, and none of them had.) Since there were no noticeable learning effects, it is permissible to compare the results from the circles with the results from the diamonds directly, without correction for learning.

Figure 22-2. Counter-balanced design of the pattern documentation experiment (PH is Phonebook, AOT is And-Or-Tree)
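
The design in Figure 22-2 can be restated compactly as data. Group 1’s assignment is stated in the text above; the numbering of Groups 2 through 4 is an assumption here, chosen to satisfy the constraints described (every group sees each program once and each treatment once, and both orders occur).

// Hypothetical restatement of the counter-balanced design as a Java table.
final class CounterBalancedDesign {
    // { group, first round, second round }
    static final String[][] GROUPS = {
        { "Group 1", "PH  with pattern doc",    "AOT without pattern doc" },
        { "Group 2", "AOT with pattern doc",    "PH  without pattern doc" },
        { "Group 3", "PH  without pattern doc", "AOT with pattern doc"    },
        { "Group 4", "AOT without pattern doc", "PH  with pattern doc"    },
    };
}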

Results

Solutions handed in by participants were graded on a point scale, and time was measured in minutes between assigning and handing in a task. Furthermore, we counted solutions that were completely correct. It turned out that the point scale did not show any significant differences. We therefore compare only time and completely correct solutions. The following table shows the results for And-Or-Tree.

Variable                  With pattern documentation   Without pattern documentation   Significance (p-value)
UKA, And-Or-Tree
  correct solutions       15 of 38                     7 of 36                         0.077
  time (min), mean        58.0                         52.2                            0.094
  time (min) of 7 best    38.6                         45.4                            0.13
WUSTL, And-Or-Tree
  correct solutions       4 of 8                       3 of 8
  time (min), mean        52.1                         67.5                            0.046

There are several things to note about this table. First, with design pattern documentation, UKA participants produced more than twice as many completely correct solutions (15 of 38 versus 7 of 36), a sizeable effect. The WUSTL students showed a smaller difference. Surprisingly, the average completion time with pattern documentation is longer for UKA (58 versus 52 minutes). However, this observation is misleading, because the number of correct solutions is much lower for the group without pattern information. Recall that UKA handed in solutions on paper. In a real maintenance environment, incorrect solutions would be detected and corrected, taking additional time not observed in this experiment. The work on paper made it difficult for participants to check their solutions; obviously, the time spent on incorrect work cannot be sensibly compared to time spent on correct work. We therefore reduced the sample: since the group without pattern documentation had only seven correct solutions, we compared these against the seven best solutions from the other group and found that the time spent with pattern documentation tends to be less, though with low statistical significance. The time difference is significant for the WUSTL group at the 0.05 level, presumably because this group not only designed, but also tested and corrected solutions, and thus produced more homogeneous quality. By comparing work time and solution quality, we also found that without pattern documentation, the slower (less capable) subjects produced solutions of much lower quality, whereas with pattern documentation, quality is independent of the time required.

The discussion of time versus correctness reveals a flaw in the experimental design: quality and productivity are confounded, i.e., they depend on each other. Obviously, good quality takes time, so taking longer does not necessarily mean that patterns are worthless. The implication is that comparing time makes sense only if solution quality is the same. We corrected for this problem by comparing only totally correct solutions, but that cost us half of the data points and statistical significance. In retrospect, we could have avoided this problem entirely with an acceptance test that everyone had to pass. There will be more about this technique in the conclusions.
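
As a sketch of that acceptance-test idea (a suggestion, not something used in this experiment): a small JUnit test that every submission must pass before its work time is recorded, so that only solutions of equal, correct quality enter the time comparison. The test below reuses the hypothetical And-Or-Tree classes sketched earlier.

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class AndOrTreeAcceptanceTest {
    @Test
    void newVisitorCountsLeavesCorrectly() {
        // Build a small tree: an AND node with one leaf and an OR node holding two leaves.
        AndNode root = new AndNode();
        root.children.add(new Leaf());
        OrNode or = new OrNode();
        or.children.add(new Leaf());
        or.children.add(new Leaf());
        root.children.add(or);

        // The participant's new visitor must traverse the tree correctly.
        CountLeaves counter = new CountLeaves();
        root.accept(counter);
        assertEquals(3, counter.count);
    }
}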

Overall, documenting the patterns in And-Or-Tree saves maintenance time and produces better solutions, and even less capable programmers produce good solutions. For Phonebook, results (not shown) also suggest that pattern documentation saves time; results for quality are not available due to lack of data (students quitting).

We conclude that if maintenance tasks are performed on design patterns, including pattern documentation may reduce the time for implementing a program change or may help improve the quality of the change. We therefore recommend always documenting design patterns in source code. As stated before, this experiment does not test the presence or absence of patterns, only the presence or absence of pattern documentation. Encouraged by the results, however, we started planning the second experiment. This time, we were going to check the more fundamental question and use professionals from the start.
