Lessons Learned

This series of experiments was a tremendous learning experience. The experiments were difficult to construct. Adding pattern documentation was easy enough once we thought of it, but constructing “fair” comparisons for pattern programs was hard, because the alternative solutions always looked very much like the pattern solutions. The experimenters were already locked into thinking in patterns! We made progress only after we accepted that the alternative solutions could be simpler and less flexible, and that this difference was exactly what should be tested. Finding ways to capture communication stymied us for a long time. After a lot of reading about experimental design, we decided to use protocol analysis. But what to do with the protocols? The idea of defining an ideal communication line and comparing actual ones against it was a stroke of genius by Barbara Unger (see the Acknowledgments section). Preparing, running, and analyzing the experiments was unexpectedly time-consuming (three PhD dissertations’ worth), but we did it at the right time, when the topic was fresh and subjects were easy to find. In the end, we were thrilled with the results.

The book by Christensen [Christensen 2007] about experimental design was a godsend. It is in its 10th edition and covers all the major topics of experimental methodology.

I recommend always using a counterbalanced design if subjects do more than one task. Counterbalancing costs nothing except a bit of organization, and the ability to check for sequence effects provides peace of mind. Counterbalancing also reduces the chance of copying if one arranges for people sitting next to each other to be in different groups. (Copying can cost valuable data points and ruin an experiment.) For analyzing results, the free statistics package R was wonderful. It is available at www.r-project.org.
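As an illustration of the seating arrangement, here is a minimal Python sketch (not the procedure we actually used) that hands out task orders by cycling through them along a row of seats, so that direct neighbors always end up in different groups. The function name, the subject labels, and the task names are hypothetical, and the sketch assumes a single row of seats.

```python
from itertools import cycle


def assign_task_orders(subjects_in_seat_order, task_orders):
    """Pair each subject with a task order, cycling through the orders.

    Subjects are listed in the order they sit in one row. Because the
    orders are handed out in rotation, two direct neighbors never receive
    the same order, provided at least two distinct orders are given.
    """
    if len(set(task_orders)) < 2:
        raise ValueError("counterbalancing needs at least two distinct task orders")
    return list(zip(subjects_in_seat_order, cycle(task_orders)))


# Hypothetical example: six subjects in one row, two tasks, two possible orders.
seating = ["S1", "S2", "S3", "S4", "S5", "S6"]
orders = [("task A", "task B"), ("task B", "task A")]
assignment = assign_task_orders(seating, orders)

for subject, order in assignment:
    print(subject, "->", " then ".join(order))
```

The groups come out equal in size when the number of subjects is a multiple of the number of task orders. With several rows of seats, one would additionally have to check the neighbors in front and behind.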

Here is what we would do differently. Statistical tests have the concept of power: the probability of finding a difference among treatments if there is one. One wants to keep this probability high (say, at 80%) because otherwise one might end up with an inconclusive result (the experiment shows nothing because the chance of detecting the difference is too low). Power analysis helps here. The process is briefly as follows: determine the approximate effect size with a few subjects in a pre-test. With this information, one can estimate the number of subjects needed for a given power and significance level. We were lucky that effect sizes in our experiments were large enough so we could get by with the number of subjects we had. For small effect sizes, the number of subjects can easily go to 80 or 120, and getting this many subjects is a major undertaking. Power analysis is described in statistics books, for instance, in Chapter 15 of [Howell 1999]. Statistics packages such as R provide packaged solutions for power analysis. Be sure to have enough participants before starting because otherwise you get into a never-ending search for additional subjects!
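To make the estimation step concrete, here is a rough Python sketch of the sample-size calculation. It assumes a two-sided comparison of the means of two independent groups, a standardized effect size (the difference in means divided by the standard deviation), and the normal approximation to the t-test. The function name and the example effect sizes are illustrative only. An exact calculation based on the t distribution, as statistics packages perform it, yields slightly larger numbers.

```python
import math
from statistics import NormalDist


def subjects_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group for a two-group comparison.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2,
    where d is the standardized effect size estimated from a pre-test.
    """
    if effect_size <= 0:
        raise ValueError("effect size must be positive")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = standard_normal.inv_cdf(power)          # desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Illustrative effect sizes only; these are not values from our experiments.
for d in (0.8, 0.5, 0.3):
    print(f"effect size {d}: about {subjects_per_group(d)} subjects per group")
```

The required number grows with the inverse square of the effect size, which is why halving the effect size roughly quadruples the number of subjects.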

Another problem we did not recognize at first is that correctness of solutions and the time subjects take cannot be analyzed independently. Clearly, producing poor solutions takes hardly any time, whereas good solutions take a while. It seems obvious in hindsight, but blindly comparing work times is not acceptable; instead, one must compare work times on the basis of similar quality. For instance, one could devise three levels of quality, and then compare work times for each of those levels. The problem with that is the number of subjects needed. Each of the three correctness levels requires enough subjects to satisfy significance and power levels by itself, which in this case would triple the number of subjects. An alternative is to normalize quality by providing an acceptance test. Every participant continues working until the solution handed in passes a predefined acceptance test. Thus, all solutions meet a minimum level of quality. The acceptance test can be an automatically executing test suite, and it can be run by the participants themselves or by the experimenters. It is also possible, but less objective, for the experimenters to check solutions by hand (during the experiment!) and hand them back if unacceptable errors are present. Once the acceptance test is passed, one can measure time or other aspects. After the acceptance test, one can also apply a much larger test suite to differentiate quality in more detail. For one of the first experiments in which an acceptance test is used, see [Müller 2004].
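The acceptance-test idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a solution is a callable, the acceptance test is a small fixed set of input/output cases, and each submission is recorded with the minutes elapsed since the start. All names, test cases, and times below are hypothetical.

```python
def acceptance_test(solution):
    """Return True if the solution passes every predefined case."""
    cases = [((2, 3), 5), ((-1, 1), 0)]
    try:
        return all(solution(*args) == expected for args, expected in cases)
    except Exception:
        # A crashing solution counts as a failed acceptance test.
        return False


def time_to_acceptance(submissions, test):
    """Return the elapsed minutes of the first submission that passes.

    `submissions` holds (elapsed_minutes, solution) pairs. Returns None
    if no submission passes, so that subject yields no work-time data point.
    """
    for minutes, solution in sorted(submissions, key=lambda s: s[0]):
        if test(solution):
            return minutes
    return None


# Hypothetical subject: first hand-in fails and is handed back, second passes.
submissions = [
    (35.0, lambda a, b: a - b),  # rejected by the acceptance test
    (52.5, lambda a, b: a + b),  # accepted; this is the measured work time
]
work_time = time_to_acceptance(submissions, acceptance_test)
print(work_time)
```

Only the time of the first accepted hand-in enters the analysis, so work times are compared among solutions that all meet the same minimum quality.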
