The Contribulyzer

Like most open source projects, Subversion [21] is continually trying to identify potential new core maintainers. Indeed, one of the primary jobs of the current core group is to watch incoming code contributions from new people and figure out who should be invited to take on the responsibilities of core maintainership. In order to honestly discuss the strengths and weaknesses of candidates, we (the core maintainers) set up a private mailing list, one of the few non-public lists in the project. When someone thinks a contributor is ready, she proposes the candidate on this mailing list, and sees what others' reactions are. We give each other enough time to do some background checking, since we want a comfortable consensus before we extend the offer; revoking maintainership would be awkward, and we try to avoid ever being in the position of having to do it.

This behind-the-scenes background checking is harder than it sounds. Often, patches [22] from the same contributor have been handled by different maintainers on different occasions, meaning that no one maintainer has a good overview of that contributor's activities. Even when the same maintainer tends to handle patches from the same contributor (which can happen either deliberately or by accident), the contributor's patches may have come in irregularly over a period of months or years, making it hard for the maintainer to monitor the overall quality of the contributor's code, bug reports, design suggestions, and so forth.

I first began to think we had a problem when I noticed that names floated on the private mailing list were getting either an extremely delayed reaction or, sometimes, no reaction at all. That didn't seem right: after all, the candidates being proposed had been actively involved in the project, usually quite recently, and generally had had several of their patches accepted, often after several iterations of review and discussion on the public lists. However, it soon became clear what was going on: the maintainers were hesitant to rely solely on their memories of what that candidate had done (no one wants to champion someone who later turns out to be a dud), but at the same time were daunted by the sheer effort of digging back through the list archives and the code change history to jog their memories. I sensed that many of us were falling into a classic wishful postponement pattern when it came to evaluations: "Oh, so-and-so is being proposed as a new maintainer. Well, I'll save my response for this weekend, when I'll have a couple of hours to go through the archives and see what they've done." Of course, for whatever reason the "couple of hours" don't materialize that weekend, so the task is put off again, and again. Meanwhile, the candidate has no idea any of this is going on, and just continues posting patches instead of committing [23] directly. This means continued extra work for the maintainers, who have to process those patches, whereas if the candidate could be made a maintainer himself, it would be a double win: he wouldn't require assistance to get his patches into the code, and he'd be available to help process other people's patches.

My own self-observation was consistent with this hypothesis: a familiar sense of mild dread would come over me whenever a new name came up for consideration—not because I didn't want a new maintainer, but because I didn't know where I'd find the time to do the research needed to reply responsibly to the proposal.

Finally, one night I set aside my regular work to look for a solution. What I came up with was far from ideal, and does not completely automate the task of gathering the information we need to evaluate a contributor. But even a partial automation greatly reduced the time it takes to evaluate someone, and that was enough to get the wheels out of the mud, so to speak. Since the system has been up, proposals of candidates are almost always met with timely responses that draw on the information in the new system, because people don't feel bogged down by time-consuming digging around in archives. The new system took identical chores that until then had been redundantly performed by each evaluator individually, and instead performed them once, storing the results for everyone to use forever after.

The system is called the Contribulyzer (http://www.red-bean.com/svnproject/contribulyzer/): it keeps track of what contributors are doing, and records each contributor's activity on one web page. When the maintainers want to know whether a given contributor is ready for the keys to the car, they just look at the relevant Contribulyzer page for that contributor, first scanning an overview of his activities, and then focusing in on details as necessary.

But how does a computer program "keep track of" what a contributor is doing? That sounds suspiciously like magic, you might be thinking to yourself. It isn't magic: it requires some human assistance, and we'll look more closely at exactly how in a moment. First, though, let's see the results. Figure 21-1 shows the front page of our Contribulyzer site.

If you click on a contributor's name, it takes you to a page showing the details of what that contributor has done. Figure 21-2 shows the top of the detail page for Madan U S.

The four categories across the top indicate the kinds of contributions Madan has had a role in. Each individual contribution is represented by a revision number—a number (prefixed with r) that uniquely identifies that particular change. Given a revision number, one can ask the central repository to show the details (the exact lines changed and how they changed) for that contribution. For r22756 and r18324, Madan found the bugs that were fixed in those revisions. For the largest block of revision numbers—the "Patches"—he wrote the change that some maintainer eventually committed. For the remaining revisions, he either reviewed a patch that someone else committed, or suggested the fix but (for whatever reason) was not the one who implemented it.

Figure 21-1. The Contribulyzer (main page).

Figure 21-2. The Contribulyzer (contributor page).

Those four sections at the top of the page already give a high-level overview of Madan's activity. Furthermore, each revision number links to a brief description of the corresponding change. This description is known as a log message: a short bit of prose submitted along with a code change, explaining what the change does. The repository records this message along with the change; it is a crucial resource for anyone who comes along later wanting to understand the change.

If you click on "r20727", the top of the next screen will show the log message for that revision, as shown in Figure 21-3.

Figure 21-3. The Contribulyzer (revision entry).

The revision number here is a link, too, but this time to a page showing in detail what changed in that revision, using the repository browser ViewVC (http://www.viewvc.org/); see Figure 21-4.

From here, you can see the exact files that changed, and if you click on "modified", you can see the code diff itself, as shown in Figure 21-5.

As you can see, the layout of information in the Contribulyzer matches what we'd need to jog our memories of a contributor's work. There's a broad overview of what kinds of contributions that person has made, then a high-level summary of each contribution, and finally, detailed descriptions for those who want to go all the way to the code level.
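To make that layering concrete, here is a minimal Python sketch of the kind of index that could sit behind the overview page. It is an illustration only, not the Contribulyzer's actual code: the function names, the tuple layout, the contributor name, and the revision numbers are all made up for the example.

```python
from collections import defaultdict


def build_index(credits):
    """Group credits by contributor, then by category.

    credits: iterable of (revision, category, contributor) tuples
             (a hypothetical layout chosen for this sketch).
    Returns {contributor: {category: [revisions in ascending order]}}.
    """
    index = defaultdict(lambda: defaultdict(set))
    for revision, category, contributor in credits:
        index[contributor][category].add(revision)
    return {
        person: {cat: sorted(revs) for cat, revs in cats.items()}
        for person, cats in index.items()
    }


def overview(index, contributor):
    """Return one summary line per category, with categories in alphabetical order."""
    cats = index.get(contributor, {})
    return [
        f"{cat}: " + " ".join(f"r{rev}" for rev in revs)
        for cat, revs in sorted(cats.items())
    ]


# Made-up data, purely for illustration.
sample = [
    (250, "Found", "Example Contributor"),
    (100, "Found", "Example Contributor"),
    (175, "Patch", "Example Contributor"),
]
for line in overview(build_index(sample), "Example Contributor"):
    print(line)
```

Each revision number in such an overview would then link to the log message, and from there to the full diff in the repository browser.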

But how did all this information get into the Contribulyzer?

Figure 21-4. ViewVC revision page.

Figure 21-5. ViewVC file diff page.

The Catch

Unfortunately, the Contribulyzer is not some miraculous artificial intelligence program. The only reason it knows who has made what types of contributions is that we tell it. And the trick to getting everyone to tell it faithfully is twofold:

  • Make the overhead as low as possible.

  • Give people concrete evidence that the overhead will be worth it.

Meeting the first condition was easy. The Contribulyzer takes its data from Subversion's per-revision log messages. We've always had certain conventions for writing these, such as naming every code symbol affected by a change. Supporting the Contribulyzer merely meant adding one new convention: a standard way of attributing changes that came from a source other than the maintainer who shepherded the change into the repository.

The standard is simple. We use one of four verbs (for the expected types of contributions), followed by the word by: and then the names of the contributors who made that type of contribution to that change. Most changes have only one contributor, but if there are multiple contributors, they can be listed on continuation lines:

Patch by: name_1_maybe_with_email_address
          name_2_maybe_with_email_address
Found by: name_3_maybe_with_email_address
Review by: etc...
Suggested by: etc...

(These conventions are described in detail at http://subversion.tigris.org/hacking.html#crediting.)
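Because the format is so regular, extracting the credits takes very little code. The following Python sketch shows one way it could be done; it is not the Contribulyzer's own parser. The function name, the regular expression, and the rule that any indented non-empty line after a field is a continuation line are assumptions made for this example.

```python
import re
from collections import defaultdict

# The four verbs named in the text, each followed by "by:".
FIELD_RE = re.compile(r"^(Patch|Found|Review|Suggested) by:\s*(.*)$")


def parse_attributions(log_message):
    """Return {verb: [contributor, ...]} for a single log message."""
    credits = defaultdict(list)
    current = None  # verb whose continuation lines we are collecting
    for line in log_message.splitlines():
        match = FIELD_RE.match(line)
        if match:
            current = match.group(1)
            name = match.group(2).strip()
            if name:
                credits[current].append(name)
        elif current and line[:1].isspace() and line.strip():
            # Assumption: an indented, non-empty line right after a field
            # names one more contributor for that same field.
            credits[current].append(line.strip())
        else:
            # Blank or unindented line: the field (if any) has ended.
            current = None
    return dict(credits)


# Made-up log message, for illustration only.
message = (
    "Fix bug in baton handoff.\n"
    "\n"
    "Patch by: name_1 <name_1@example.com>\n"
    "          name_2\n"
    "Found by: name_3\n"
)
print(parse_attributions(message))
```

The parser deliberately ignores free-form credits such as "Thanks to so-and-so", which is exactly why a single agreed convention matters.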

One reason it was easy to persuade people to abide by the new standard is that, in a way, it actually made writing log messages easier. We'd been crediting people before, but in various ad hoc manners, which meant that each time we committed a contributor's code, we had to think about how to express the contribution. One time it might be like this:

Remove redundant code introduced in r20091.  This came from a patch by
name_1_maybe_with_email_address.

And another time like this:

Fix bug in baton handoff.  (Thanks to so-and-so for sending in the patch.)

While the new convention was one more thing for people to learn, once learned it actually saved effort: now no one had to spend time thinking of how to phrase things, because we'd all agreed on One Standard Way to do it.

Still, introducing a new standard into a project isn't always easy. The path will be greatly smoothed if you can meet the second condition as well, that is, show the benefits before asking people to make the sacrifices. Fortunately, we were able to do so. Subversion's log messages are editable (unlike some version control systems in which they are effectively immutable). This meant that, after writing the Contribulyzer code to process log messages formatted according to the new standard, we could go back and fix up all of the project's existing logs to conform to that standard, and then generate a post-facto Contribulyzer page covering the entire history of the project. This we did in two steps: first, we found all the "@" signs in the log messages, to detect places where we mentioned someone's email address (since we often used people's email addresses when crediting them), and then, we searched again for just the names—without the email addresses—harvested from the first search. The resultant list of log messages numbered about one thousand, and with the help of a few volunteers (plus some rather labyrinthine editing macros), we were able to get them all into the new format in about one night.
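The two-pass search can be sketched in a few lines of Python. This is a reconstruction under stated assumptions, not the procedure we actually ran: it assumes the log messages are already loaded into a dictionary keyed by revision number, and that credits with addresses look like "Some Name <user@example.com>" with a capitalized multi-word name. Real logs are messier, which is why the results still needed human editing.

```python
import re

# Assumed credit style: two or more capitalized words followed by an
# address in angle brackets. This is a heuristic for the sketch only.
NAME_EMAIL_RE = re.compile(
    r"([A-Z][\w.'-]*(?: [A-Z][\w.'-]*)+)\s*<[^<>\s@]+@[^<>\s]+>"
)


def find_candidate_logs(logs):
    """logs: {revision: message}.

    Returns (sorted revisions worth reviewing, sorted harvested names).
    """
    # Pass 1: any message containing "@" probably mentions an address.
    with_address = {rev for rev, msg in logs.items() if "@" in msg}
    names = set()
    for rev in with_address:
        names.update(NAME_EMAIL_RE.findall(logs[rev]))
    # Pass 2: search the remaining messages for those names on their own.
    by_name = {
        rev
        for rev, msg in logs.items()
        if rev not in with_address and any(name in msg for name in names)
    }
    return sorted(with_address | by_name), sorted(names)


# Made-up log messages, for illustration only.
logs = {
    1: "Fix typo. Patch from Jane Doe <jane@example.com>.",
    2: "Tweak parser. Thanks to Jane Doe for the patch.",
    3: "Bump version.",
}
revisions, names = find_candidate_logs(logs)
print(revisions, names)
```

The output of a search like this is only a work list: each flagged message still has to be read and rewritten into the standard format by a person.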

Thus, the proposal of the new standard coincided with a demonstration of what it could do for us: we had the full Contribulyzer pages up and running from the moment the Contribulyzer was announced to the team. This made its benefits immediately apparent, and made the new log message formatting requirements seem like a small price to pay by comparison.

The Limits of the Contribulyzer

There is a famous saying, much used in open source projects, but surely long predating them:

The perfect must not be the enemy of the good.

The Contribulyzer could do much more than it currently does. It's really the beginnings of a complete activity-tracking system. In an ideal world, it would gather information from the mailing list archives and bug tracker as well as from the revision control system. We would be able to jump from a log message that mentions a contributor to the mailing list thread where that contributor discusses the change with other developers—and vice versa; that is, jump from the mailing list thread to the commit. Similarly, we would be able to gather statistics on what percentage of a given person's tickets in the bug tracker resulted in commits or in non-trivial discussion threads (thus telling us that this person's bug reports should get comparatively more weight in the future, since she seems to be an effective reporter).

The point would not be to make a ratings system; that would be useless, and maybe even destructive, since it would suffer from inflationary pressures and tempt people into reductively quantitative comparisons of participants. Rather, the point would be to make it easy to find out more about a person once you already know you're interested in her.

Everyone who participates in an open source project leaves a trail. Even asking a question on a mailing list leaves a trail of at least one message, and possibly more if a thread develops. But right now these trails are implicit: one must trawl through archives and databases and revision control histories by hand in order to put together a reasonably complete picture of a given person's activity.

The Contribulyzer is a small step in the direction of automating the discovery of these trails. I included it in this chapter as an example of how even a minor bit of automation can make a noticeable difference in a team's ability to collaborate. Although the Contribulyzer covers only revision control logs, it still saves us a lot of time and mental energy, especially because the log messages often contain links to the relevant bug-tracker tickets and mailing list threads—so, if we can just get to the right log messages quickly, the battle is already half-won.

I don't want to claim too much for the Contribulyzer; certainly, there are many aspects of running an open source project that it doesn't touch. But it significantly reduces our workload when evaluating potential new maintainers, and therefore makes us more likely to do such evaluations in the first place. For one day's investment in coding, that's not a bad payoff.

Writing metacode rarely feels as productive as writing code, but it's usually worth it. If you've correctly identified a problem and have seen a clear technical solution, then a one-time effort now can bring steady returns over the life of the project.



[21] Subversion is an open source version control system; see http://subversion.tigris.org/.

[22] A patch is a code contribution, such as a bugfix, sent in a special format known as "patch format." The details of that format don't matter here; just think of a patch as being a proposed modification to the software, submitted in extremely detailed form, right down to which precise lines of code to change and how.

[23] To commit means to send a code change directly into the project's repository, which is where the central copy of the project's code lives (see http://en.wikipedia.org/wiki/Revision_Control for more). In general, only core maintainers are able to commit directly; all others find a core maintainer to shepherd their changes into the repository. See http://producingoss.com/en/committers.html#ftn.id304827 for more on the concept of "commit access."
