AN INVESTIGATION INTO THE USE OF SOFTWARE CODE METRICS IN THE INDUSTRIAL SOFTWARE RESEARCH ENVIRONMENT
Supplementary Research Proposal
Tim Littlefair, May 1999
This document is a proposal for supplementary research to be conducted by the candidate in order to extend an existing project, originally conducted as a part of a program for the degree of Master of Science, to meet the requirements for the degree of Doctor of Philosophy.
The candidate is grateful to his supervisor and other reviewers for their observations on the M.Sc. draft thesis which have been helpful in identifying some of the issues raised below. These issues may form a basis for further work to reach the standard required for submission at PhD level.
Review of the existing project
The research work for the existing project was conducted over the period July 1994 to December 1998, and has been written up as an M.Sc. thesis, and submitted to the University. At the request of the candidate, the assessment of this thesis has been deferred pending a decision on the upgrade path described in this document.
The original research proposal envisioned a sequence of activities beginning with the recruitment, as a partner in the research, of a software development organisation prepared to experiment with metrics on at least one substantial team project. The candidate would work with members of the partner team to identify ways in which source code based software metrics could be used to feed back into the development process and, it was hoped, enhance development outcomes. This process was to lead to the implementation of an automated tool to calculate and report selected software metrics from source code. It was decided to seek a team using the language C++ (Ellis & Stroustrup, 1990), as this was at the time of proposal development (as it remains now) one of the most important languages across a wide variety of development domains. The tool was to be exercised over a period of time by the industry partner team, and at the end of the period the team members were to be surveyed. The survey was intended to assess the value or otherwise of the tool and (more importantly) of the concepts about the use of metrics which it embodied.
The research questions defined in the proposal were as follows:
The research work as executed followed the pattern laid down by the proposal, although the recruitment of a suitable industry partner presented some problems. Ultimately the candidate obtained agreement from his employers that the project on which he was then engaged could be used as the partner team. While this agreement was willingly given, the candidate regarded the arrangement as not completely satisfactory with respect to the original project model for a number of reasons, including:
These problems were dealt with by the decision, agreed by the candidate and his supervisor in late 1997, to use the industry partner team as a sounding board and test site for the development of the analyser tool but not for its evaluation. At that time, the tool had already been released for use in the wider software engineering community via the Internet: it was decided to attempt to perform the evaluation by surveying users and potential users in that wider community.
Over the duration of the project, a number of releases of the analyser tool were made via the Internet, initially by uploading to FTP sites outside the University, and later by creating a project home page on the faculty web site. The first such release was made in July 1995, the project home page was established in early 1997, and there was a succession of releases between April and September 1997. The tool was named CCCC, which stands for ‘C and C++ Code Counter’.
The initial release supported a set of procedurally-oriented metrics including lines of code (Conte, Dunsmore & Shen, 1986, p. 34), lines of comment, and McCabe’s cyclomatic complexity (McCabe, 1976). Later releases added support for a set of structurally-oriented metrics based on work on information flow metrics by Henry and Kafura (Henry & Kafura, 1981; Henry & Kafura, 1984), and a set of metrics based on the object-oriented design/programming paradigm proposed by Chidamber and Kemerer (Chidamber & Kemerer, 1994). The structural metrics supported included modified versions of the information flow metrics designed to incorporate weighting policies to favour some modern C++ coding idioms (Coplien, 1992; Carroll & Ellis, 1995). These releases also fixed bugs reported from the field and added a number of new features including modified metrics and support for the languages Java and Ada in addition to C and C++.
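For readers unfamiliar with two of the metrics named above, the following minimal sketch shows their standard textbook definitions. It is not taken from CCCC and does not reproduce the modified, weighted versions mentioned above; the function names and the example figures are illustrative assumptions.

```python
def cyclomatic_complexity(edges: int, nodes: int, components: int = 1) -> int:
    """McCabe's cyclomatic number v(G) = E - N + 2P for a control-flow graph."""
    return edges - nodes + 2 * components


def information_flow_complexity(length: int, fan_in: int, fan_out: int) -> int:
    """Henry and Kafura's measure: length * (fan_in * fan_out) ** 2."""
    return length * (fan_in * fan_out) ** 2


# A single if/else has 4 nodes (test, then-branch, else-branch, join)
# and 4 edges, giving two independent paths through the code.
print(cyclomatic_complexity(edges=4, nodes=4))  # 2

# A hypothetical 10-line procedure with 2 inward and 3 outward flows.
print(information_flow_complexity(length=10, fan_in=2, fan_out=3))  # 360
```

The squared term in the second measure means that a procedure's score is dominated by its connections to the rest of the system rather than by its size.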
FTP logs for the period November 1997 to November 1998 show more than 1000 downloads of version 2.1.4 of CCCC. There may be some double counting in this figure, as three different distributions were provided (source plus MS-DOS binaries, source plus Linux binaries, source only). There is no way of knowing how many sites are presently actively using the tool, although a steady trickle of email on the subject is still coming in. It has been used to support at least two published research projects in related areas (Judge & Mistry, 1998; Judge & Williams, 1997; Kim & Boldyreff, 1997).
On completion of the development process for the CCCC tool, sections were written for the thesis covering the work to date. This included documentation of the grounds for choosing the metrics implemented using the Goal/Question/Metric format (Basili & Rombach, 1988). The metrics were also examined in the light of theoretical frameworks for considering the validity of software metrics suggested by Fenton (1991) and Weyuker (1988).
A questionnaire implemented as an HTML form was posted on the project web site, and publicized in a number of USENET news postings. Over the period January – September 1998, 24 responses were received. The outcomes of the survey are described at length in the M.Sc. thesis draft, but some significant points are:
Critique of the existing project
It has been observed that the current thesis accepts existing literature in the subject area with less critical evaluation than is appropriate at the level of a PhD thesis. There is a need to seek out literature which conveys negative sentiments about the use of source code based metrics in addition to that which broadly supports such use.
The evaluation survey conducted as the end point of the initial work has some valuable attributes but others are more dubious. The sample size is respectable, and the demographic questions included in the survey itself indicate that (assuming truthful responses) the response group is representative in some ways of an interesting cross-section of the software engineering profession. However, the administration of the survey using the Internet gives us a response group which was self-selecting and which we would expect to be biased in favour of respondents with a degree of time on their hands. It is also difficult to avoid a presumption that the respondents would be biased toward at least some positive sentiments towards some aspects of code-based software metrics. The administration mode of the questionnaire also restricted the ability of the candidate to ensure that respondents had been exposed to the issues under investigation by use of the CCCC tool over a period of time before responding. In the light of these issues, the survey results reported above (particularly those relating to the value of individual metrics) cannot be seen as a ringing endorsement of any of the concepts the project explores.
Proposal for further research
As noted above, one of the weakest points of the current project is the concluding evaluation phase, and it would seem appropriate to attempt to find more satisfactory ways to address the issues this phase was intended to explore.
There are relatively few examples in the literature on the subject of attempts to perform empirical evaluation of the value of the use of software metrics in real industrial situations (the papers cited by Kafura and Henry are honourable exceptions to this generalisation). Of the work of this kind which has been seen to date, the majority attempts to validate a metric as a predictor of specific quality problems by correlating it with some available attribute which can be interpreted as a proxy for the existence of the predicted problems.
While the approach above has some appeal for the evaluation of metrics, it adopts a simple-minded view that the usefulness of a metric is determined solely by its ability to provide predictions of quality problems independent of human intervention. This view cannot be seen as wrong, but it is not necessarily adequate to capture the subtleties of interaction between a body of software, a metrics tool and the human mind in the expected context of use. In this context, the metric analysis tool operates as a kind of decision support tool, and its value is directly related to the improvement in the performance of a human operator when such a tool is available. The role of the tool as a component in a human-centred system is implicit in much of the material on the subject, but is considered more explicitly by Shepperd and Ince (1989).
The candidate will develop a design and realistic materials for a controlled experiment which tests the effect of the presence or absence, and of the features, of a metric analysis tool on the performance of an individual in an exercise structured broadly like a software code review executed under severe time constraints. Review techniques (Fagan, 1976) are widely used in the software industry and are an appropriate forum in which to examine the interaction of metric ideas with professional judgement. As part of the experimental design, consideration will be given to the likely number of samples in each control group required to generate meaningful results, and an estimate of the labour cost associated with each sample will be prepared.
An attempt will be made to execute the test design over at least a pilot sample of volunteers using the developed material. If sufficient industry cooperation is forthcoming the experiment may be executed on a realistic scale and results and conclusions reported. Given the labour cost associated with the execution of the experiment, it would be unethical to proceed with execution on more than a pilot sample unless theoretical projections indicate that there is a strong chance of statistically significant results emerging from the available volunteer population.
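The kind of theoretical projection referred to above might be approached as in the following minimal sketch, which uses the usual normal approximation for comparing two proportions with a two-sided test. The function name, the proportions and the hours-per-subject figure are hypothetical, and the calculation treats each observation as an independent yes/no outcome, which the real design (several answers per subject) would satisfy only approximately.

```python
from math import ceil, sqrt
from statistics import NormalDist


def subjects_per_group(p_control: float, p_treated: float,
                       alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate group size needed to detect p_control vs p_treated."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p_control + p_treated) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_control * (1 - p_control)
                                 + p_treated * (1 - p_treated))) ** 2
    return ceil(numerator / (p_control - p_treated) ** 2)


# Hypothetical figures: 60% agreement with the reference response without
# the tool, 75% with it, and two hours of volunteer time per subject.
n = subjects_per_group(p_control=0.60, p_treated=0.75)
hours_per_subject = 2.0  # assumed labour cost per sample
print(n, "subjects per group,", 3 * n * hours_per_subject, "hours for 3 groups")
```

Halving the expected difference between groups roughly quadruples the required group size, which is why the projection needs to be made before volunteers are committed to a full-scale run.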
Literature on the evaluation of decision support systems will be considered as part of the experimental design process: while there may be a lack of such materials in the software engineering area, the medical field is seen as an analogous area where useful theoretical insights may be found. Raiffa (1968) presents a mathematical framework for the integration of subjective judgements with statistical analysis to arrive at optimal decisions.
The remainder of this section of the proposal presents a more detailed description of the proposed experimental design.
The main part of the experiment will consist of administering a standard exercise to a large number of individual subjects. The exercise itself will take the form of a simulated software code review in which each subject is asked to examine a number of source code components and is asked to answer the same set of questions about each component. The questions will be the kind of generic questions which might occur in a checklist for code reviews, e.g. "Does the component contain appropriate comments relating to each interface?". Each question will require a simple yes/no answer, with the affirmative response indicating the relative absence of perceived risk in relation to the underlying code, and a negative response indicating a perception of the presence of risk. In a real-life situation, the negative responses would be the triggers for a requirement on the developer responsible for the reviewed code to address the issue raised by the question.
All subjects will be posed an identical exercise of this type with the same set of subject components, but they will be divided into a number of groups which will be treated differently from the point of view of provision of access to metric tools and ideas. The different treatments to be applied will be considered in detail as the experimental design is developed, but the following are a proposed starting point:
The three groups above are intended to support separation of the effect of thinking about metrics issues from the effect of access to the metrics tool. The overall relative performance of these groups will be compared by use of a technique known as Receiver Operating Characteristic (ROC) analysis, which is discussed in the literature relating to the evaluation of medical diagnostic techniques (Metz, 1996). This technique focusses on statistical analysis of the performance of a diagnostic indicator by correlating the indicator outcome with the independently determined state of nature. For this analysis, the entire system exhibited by each group (i.e. code review by unmodified intellect for group 0, code review by metrics-sensitized intellect for group 1, code review by metrics-sensitized intellect supported by tool for group 2) will be evaluated, and an ROC curve for each group derived. The curves for these groups should give a good indication of relative performance in terms of projected false positive and false negative review outcomes.
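One way in which a curve could be derived for a group is sketched below; it is offered as an illustration under stated assumptions, not as the analysis the study will necessarily adopt. Each review item is scored by the share of the group's subjects who gave the negative (risk perceived) answer, and the threshold on that share is swept to trace the curve. The scoring rule, the function names and all data values are hypothetical.

```python
def roc_points(scores, truth):
    """Return ordered (false positive rate, true positive rate) pairs.

    scores[i] is the share of a group's subjects who flagged item i as risky;
    truth[i] is True when the independent reference says the risk is present.
    Both classes must occur at least once in truth.
    """
    positives = sum(truth)
    negatives = len(truth) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, t in zip(scores, truth) if s >= threshold and t)
        fp = sum(1 for s, t in zip(scores, truth) if s >= threshold and not t)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (fpr, tpr) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical data for six review items.
truth = [True, True, False, False, True, False]
group0_scores = [0.6, 0.4, 0.5, 0.2, 0.7, 0.3]  # unaided reviewers
group2_scores = [0.9, 0.7, 0.3, 0.1, 0.8, 0.2]  # reviewers with the tool

print(round(area_under_curve(roc_points(group0_scores, truth)), 3))  # 0.889
print(round(area_under_curve(roc_points(group2_scores, truth)), 3))  # 1.0
```

An area of 0.5 corresponds to chance performance and 1.0 to perfect separation of risky from non-risky items, so the areas give a single figure on which the groups can be compared alongside the curves themselves.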
The ROC technique that will be applied to the data from the primary investigation requires the provision of an independently obtained reference response to the set of questions, which must be taken by the ROC analysis to be the correct ‘state of nature’ for the questions posed. It is anticipated that this will be obtained by recruiting a fourth group of volunteers to undertake the exercise without access to the tool and analysing their responses to identify items on which there is a reasonable consensus. This fourth group will operate under conditions very similar to group 0, but we wish to treat their response as being more representative of a consensus on best professional practice. It is for this reason that we artificially degrade the responses of groups 0, 1 and 2 by subjecting all of them to an identical artificially short time constraint in the execution of the exercise. The members of the fourth group, on the other hand, are invited to take as much time as they feel useful to complete the exercise. It may be appropriate to attempt to recruit individuals of a significantly higher skill level to form the fourth group, as its purpose is to yield an authoritative response profile. This group will be known as group G (the G standing for ‘guru’).
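The step of identifying items on which group G reaches a reasonable consensus could be made concrete as in the following minimal sketch. The agreement threshold of 75%, the function name and the item identifiers are illustrative assumptions; the proposal does not fix any of them.

```python
def consensus_reference(responses, min_agreement=0.75):
    """Map each item to the agreed answer where group G largely agrees.

    responses holds one dict per group G member, mapping an item identifier
    to True ('yes', no risk perceived) or False ('no', risk perceived).
    Items on which neither answer reaches min_agreement are left out.
    """
    reference = {}
    for item in responses[0]:
        yes_share = sum(r[item] for r in responses) / len(responses)
        if yes_share >= min_agreement:
            reference[item] = True
        elif 1 - yes_share >= min_agreement:
            reference[item] = False
    return reference


# Hypothetical answers from five group G members on three review items.
guru_responses = [
    {"c1-q1": True,  "c1-q2": False, "c2-q1": True},
    {"c1-q1": True,  "c1-q2": False, "c2-q1": False},
    {"c1-q1": True,  "c1-q2": False, "c2-q1": True},
    {"c1-q1": True,  "c1-q2": False, "c2-q1": False},
    {"c1-q1": False, "c1-q2": False, "c2-q1": True},
]
reference = consensus_reference(guru_responses)
print(reference)  # {'c1-q1': True, 'c1-q2': False}
```

Items such as the third one here, on which opinion is split, would be excluded from the reference and so from the ROC analysis, which is why the size of the surviving item set needs to be checked before the main study proceeds.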
Recruitment of subjects will be attempted using USENET news groups on the Internet. As one of the aims of the experiment is to separate the effect of thinking about metrics from the effect of having them reported, it is important that the recruitment be done in forums which are not oriented to metrics topics. The initial recruitment material must not identify the exact nature of the investigation: a vague description like ‘evaluation of decision support techniques for software code reviews’ should be used. The group G volunteers may be recruited by seeking peer nominations from an authoritative source (e.g. specialist object oriented training consultancies) rather than via self-selection. The administration of the exercise will probably require the cooperation of a non-subject invigilator at each site where volunteers are found, so that the exposure of preparatory materials and the exercise itself can be controlled.
The identification of a substantial set of responses over which a professional consensus exists is a prerequisite for the main study to be meaningful, and it cannot be assumed that such a consensus will be found at the first attempt. For this reason, the group G exercise should be conducted before the main study. Following the group G exercise, pilot exercises should be conducted on a number of subjects from the other groups to verify that the emerging response is, as expected, an attenuated version of the signal emerging from group G. As a result of these pilot exercises, adjustments may be made to the experimental materials or the time constraints. Finally, the full scale group 0-2 studies should be executed. If volunteer numbers are insufficient to execute the full scale study, a scaled-down version of the group G and group 0-2 pilot studies should be executed and described for publication, in the hope that the design (if it proves workable) may be executed in full at a later date.
Projected Schedule
It is not possible at this stage to lay down a firm timetable for the execution of the study, but the following tasks are likely to be required. Each task has been allocated a nominal duration.
List of References
Basili, V.R., & Rombach, H.D. (1988, June). The TAME project: Towards improvement-oriented software environments. IEEE Transactions on Software Engineering, 14(6), 758-773.
Carroll, M., & Ellis, M. (1995). Designing and coding reusable C++. Reading, MA: Addison-Wesley.
Chidamber, S.R., & Kemerer, C.F. (1991). Towards a metric suite for object-oriented design. In OOPSLA '91: Conference on object-oriented programming, systems, languages and applications. (pp. 184-196). New York: ACM Press.
Chidamber, S.R., & Kemerer, C.F. (1994). A metrics suite for object-oriented design. IEEE Transactions on Software Engineering, 20(6), 476-493.
Conte, S.D., Dunsmore, H.E., & Shen, V.Y. (1986). Software engineering metrics and models. Menlo Park, CA: Benjamin/Cummings.
Coplien, J.O. (1992). Advanced C++ programming styles and idioms. Reading, MA: Addison-Wesley.
Ellis, M.A., & Stroustrup, B. (1990). The annotated C++ reference manual. Reading, MA: Addison-Wesley.
Fagan, M.E. (1976). Design and code inspections to reduce errors in program development. IBM Systems Journal, 15(3), 182-211.
Fenton, N.E. (1991). Software metrics: A rigorous approach. London: Chapman and Hall.
Henry, S., & Kafura, D. (1981). Software structure metrics based on information flow. IEEE Transactions on Software Engineering, SE-7(5), 510-518.
Henry, S., & Kafura, D. (1984). The evaluation of software systems structure using quantitative software metrics. Software Practice and Experience, 14(6), 561-573.
Judge, T.R., & Mistry, N.S. (1998). Metrics for estimation. Available email: tom.judge@parallax.co.uk.
Judge, T.R., & Williams, A. (1997). OO estimation – an investigation of the predictive object points (POP) sizing metric in an industrial setting. Available email:
Kim, H., & Boldyreff, C. (1997). Recovering design patterns using object-oriented metrics. Available email: hyoseob.kim@durham.uk.ac.
McCabe, T.J. (1976, December). A complexity measure. IEEE Transactions on Software Engineering, SE-2(4), 308-320.
Metz, C.E. (1996). ROC analysis and design of observer performance studies. Available WWW: http://www.radonc.uchicago.edu/IWDM/IWDMa85.html
Raiffa, H. (1968). Decision analysis: Introductory lectures on choices under uncertainty. Reading, MA: Addison-Wesley.
Shepperd, M.J., & Ince, D. (1989). Metrics, outlier analysis and the design process. Information and Software Technology, 31(2), 91-98.
Weyuker, E.J. (1988, September). Evaluating software complexity metrics. IEEE Transactions on Software Engineering, 14(9), 1357-1365.