An empirical experiment to assess the usefulness of a software metrics tool

This document is a participant's-eye view of the experiment which will be conducted as part of the final phase of Tim Littlefair's PhD research project on the use of software metrics.

The experiment consists of an exercise comparable to the preparation required for a software code review. Each participant will be supplied with a sample of source code and a checklist of quality issues to be assessed across the code supplied. The sample will consist of Java classes, and the checklist will ask 5 simple yes/no questions, with an answer required to each question for each class. The questions are phrased so that a 'yes' answer implies the presence of some kind of risk to the project, and implies a possible need for consideration of corrective action. A 'no' answer implies the absence of risk, and that the issue does not need to be addressed in relation to the class under consideration.

The participants in the experiment will be divided into three groups. One group will be asked to take as much time as they feel useful to perform the exercise (it is suggested that one hour is the maximum time participants can reasonably be asked to spend on the exercise). The responses of this group will be treated as representative of a consensus of informed professional opinion on the code sample in relation to the issues discussed. Two other groups will be formed, each of which will be asked to perform the same task under a very tight time constraint (15 minutes). It is expected that the responses of these two groups will be more erratic than those of the first group, due to the time pressure. One of these two groups will be supplied with the output of a metric analysis tool in addition to the code sample.

The purpose of the experiment is to determine whether the time-constrained group working with the support of the metric tool can be shown to perform 'better' on the exercise than the time-constrained group without that support. The performance of the two groups will be judged in terms of statistical measures of agreement between the degraded responses provided by each of the time-constrained groups and the supposedly authoritative response from the non-time-constrained group.
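As a rough illustration of what such an agreement measure could look like, the sketch below takes the majority answer of the unconstrained group as the reference and scores each time-constrained response sheet by the fraction of answers that match it. This is only a minimal sketch under stated assumptions: the majority-vote rule, the tie-breaking rule, the simple proportion-of-agreement score, the function names and all of the answers shown are invented for the example and are not the statistical measures actually used in the experiment.

```python
from collections import Counter
from statistics import mean

# One item per (class, question) pair. The real exercise uses more classes
# and five questions; these hypothetical items just keep the example short.
ITEMS = [("ClassA", 1), ("ClassA", 2), ("ClassB", 1), ("ClassB", 2)]


def sheet(answers):
    """Build a response sheet: item -> True for 'yes', False for 'no'."""
    return dict(zip(ITEMS, answers))


def consensus(sheets):
    """Majority answer per item across the reference group's sheets."""
    result = {}
    for item in ITEMS:
        votes = Counter(s[item] for s in sheets)
        # Assumption: a tie counts as 'yes' (risk present). Arbitrary choice.
        result[item] = votes[True] >= votes[False]
    return result


def agreement(response, reference):
    """Fraction of items on which a sheet matches the reference answer."""
    matches = sum(response[item] == reference[item] for item in ITEMS)
    return matches / len(ITEMS)


def group_agreement(sheets, reference):
    """Mean agreement with the reference over all sheets in a group."""
    return mean(agreement(s, reference) for s in sheets)


# Invented answers, for illustration only (not experimental data).
unconstrained = [
    sheet([True, False, False, True]),
    sheet([True, False, True, True]),
    sheet([True, True, False, True]),
]
with_tool = [
    sheet([True, False, False, True]),
    sheet([True, False, True, True]),
]
without_tool = [
    sheet([True, True, True, True]),
    sheet([False, False, False, True]),
]

reference = consensus(unconstrained)
print(group_agreement(with_tool, reference))     # 0.875
print(group_agreement(without_tool, reference))  # 0.625
```

A raw proportion of matching answers is the simplest possible score. A chance-corrected statistic would be a natural refinement, and whether any observed difference between the two groups is meaningful would still need a significance test.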

A few important points about the investigation:

The experiment has now been completed. In summary, only 15 volunteers participated in the experiment. The treatment group was observed to perform marginally better than the control group, but statistical analysis showed that the observed difference was insufficient to support a finding of proven benefit. Despite the negative finding, I feel that the design of the experiment was adequately validated, and that useful work was done on establishing a viable framework for future work in this and related fields.

The results are described in detail in my PhD thesis. I have also prepared a PowerPoint presentation summarising the project, including the findings of this experiment. For the convenience of the audience on the web, selected slides from this presentation appear below.