Deploy and utilize tracking systems based upon facts and data
A team should measure progress by the successful completion of explicit integration objectives, including demonstration of the desired product quality, rather than by schedule alone. Focusing on schedule alone ultimately creates a bow wave of deferred work and an overly complex troubleshooting environment: when nothing works, everything must be treated as suspect in every situation.
Unfortunately, as integration proceeds, schedule pressures naturally increase, and retaining a parallel focus on quality becomes very difficult. A measurement system that tracks and reports the total quality picture, in light of the state of all testing, is crucial to giving decision-makers the information they need to pick the right path forward.
During integration, several different kinds of verification activities are performed:
- Verification activities driven by requirements (sometimes called 'requirements verification'), whose primary purpose is to assure that all requirements have been incorporated into the development processes of individual components, and to make the requirements set as stable and high quality as possible going into the design effort.
- Verification activities used to assess development processes and the resulting products of those processes, and thus assure that the requirements are being properly elaborated into robust designs that will be consistently produced, perform as expected, and not exhibit undesirable behavior. Such activities are often called 'design verification'; they focus on defects introduced during the development process and attempt to detect those defects as early as possible. Since these development processes often span many different groups, and since the components being developed often interact in unanticipated ways, the primary purpose of this verification is to produce information about these development processes and the associated component maturity.
- Verification activities driven by manufacturing, deployment, and installation provisions. Such activities are sometimes called 'build verification', and are intended to detect defects that are unique to the installation of a component into the corresponding system and that may be injected by that process (for example, improper installation procedures or incomplete execution of installation steps).
Validation activities are typically performed as well. They should be driven by realistic, end-to-end operational usage scenarios that tie the components together into a cohesive whole and confirm that the collection of development and verification processes is producing the desired result. Such validation activities are typically defined and performed by members of an independent but representative user community, rather than by the original development team, and thus provide a degree of independent assurance that things are proceeding per plan. All of these activities require careful process management and oversight to ensure the desired results are achieved; see descriptions of verification practices and validation practices which are typically utilized to perform such oversight.
These verification and validation activities are usually performed in a variety of planned test sites, facilities, and analysis environments, and are sequenced in a pipelined (but overlapping) fashion so that defect detection and removal are as efficient and effective as possible. Each of these environments will typically have its own set of stakeholders, perspectives, and challenges. Generally, these activities are performed according to a pre-planned, overall verification strategy, which provides a means of looking across all of them to develop an overall assessment of both the effectiveness of each activity and its likely impact on what will be produced at the end.
Metrics are typically used to assess progress against these objectives, and often are deployed over time, according to the business goals which are important to consider at different points in the various integration cycles. An organized approach to defining such metrics is to work through a 'Goals - Questions - Metrics' (or GQM) framework, to ensure that the business purpose for the metrics is clear and understood, and that the metrics themselves are designed to meet those business goals.
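
As a minimal sketch of the GQM idea, the goal, question, and metric hierarchy can be held in a small data structure and checked in both directions: metrics collected with no question behind them, and questions with no metric to answer them. The class names, function names, and metric identifiers below are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    text: str
    metrics: list[str] = field(default_factory=list)  # metric identifiers (assumed naming)


@dataclass
class Goal:
    name: str
    questions: list[Question] = field(default_factory=list)


def unjustified_metrics(goals: list[Goal], collected: set[str]) -> set[str]:
    """Collected metrics that no question under any goal asks for."""
    needed = {m for g in goals for q in g.questions for m in q.metrics}
    return collected - needed


def unanswered_questions(goals: list[Goal], collected: set[str]) -> list[str]:
    """Questions for which none of the listed metrics is being collected."""
    return [
        q.text
        for g in goals
        for q in g.questions
        if not any(m in collected for m in q.metrics)
    ]


# Illustrative data only; the metric identifiers are made up for this sketch.
goals = [
    Goal(
        "Ramp-up testing",
        [
            Question(
                "What percentage of planned test procedures have been written?",
                ["pct_procedures_written_by_week"],
            )
        ],
    )
]
collected = {"pct_procedures_written_by_week", "total_lines_of_code"}
print(unjustified_metrics(goals, collected))
print(unanswered_questions(goals, collected))
```

A check like this is one way to keep the business purpose of each metric visible as the metric set grows over time.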
Below is an example structure and representative metrics set for a large software-intensive integration effort. Not all of these metrics will be relevant to every phase of such an effort, and some may not be appropriate to other development efforts, but they are offered to stimulate thought about how overall activities might be instrumented for decision-making. A detailed discussion of how metrics can be effectively used in a data-rich environment can be found here.
| Goals | Questions | Metrics |
|---|---|---|
| Ramp-up testing | What percentage of planned test procedures have been written? | Percentage of planned test procedures written by week (tests to be written vs actually written) by test site over time |
| | What percentage of planned testing objectives are being achieved in a given reporting period? | Percentage of planned test procedures accomplished by week (tests actually run vs planned to be run) by test site over time |
| | How frequently are scheduled test procedures completed successfully? | Percentage of scheduled tests run to completion by test site by week vs plans |
| | How much time is it taking to accomplish testing procedures? | Average time + range by week to run tests to completion (whether pass or fail) by test site over time |
| Stabilize the components to enable efficient integration | How many components are under test across all test sites vs plans? | S-curve of planned vs actual components under test by standalone test site by week |
| | What percentage of total functionality is under test across all standalone test sites vs plans? | S-curve of total functionality under test (planned vs actual) by standalone site by week |
| | What is the rate of discovery of defects per hour of standalone testing across all test sites? | Actual defects discovered per hour of standalone testing by week across standalone test sites |
| | How long does it take from when a problem is identified until it is isolated to the actual component which is non-compliant with specifications? | Average troubleshooting time from problem discovery until a problem report is written against the specific area in which a change must be incorporated, by site by week |
| | What is the projected and actual backlog of defects which will require regression testing over time by each standalone test site? | Planned and actual count of open defects which must be regression tested across standalone testing sites by week |
| | How much of test available time has been required and is projected to be required (vs planned to be available) to support regression testing? | Planned, 'demand', and actual time required to perform regression testing across standalone test sites by week |
| Stabilize the integration elements | How many integration baselines are currently under test across all sites vs plans? | S-curve of planned vs actual integration baselines under test by integration test site by week |
| | What percentage of all integration elements are under test across all integration sites vs plans? | S-curve of total functionality under test (planned vs actual) by integration test site by week |
| | What is the rate of discovery of defects per hour of integration testing across all test sites? | Actual defects discovered per hour of integration testing by week across integration test sites |
| | How long does it take from when a problem is identified until it is isolated to the actual component which is non-compliant with specifications? | Average troubleshooting time from problem discovery until a problem report is written against the specific area in which a change must be incorporated, by site by week |
| | What is the projected and actual backlog of defects which will require regression testing over time by integration test site? | Planned and actual count of open defects which must be regression tested across integration sites by week |
| | How much of test available time is required to support regression testing (planned vs actual)? | Planned, 'demand', and actual time required to perform regression testing across integration test sites by week |
| Monitor the efficiency and utilization of resources | What are the planned (both original and current plan) test hours over time to achieve all test objectives? | S-curve of planned vs actual test hours for all baselines under test by test site by week |
| | What is the forecast demand for test hours per week vs current plan? | S-curve of planned vs demand test hours for all baselines under test by test site by week |
| | What percentage of defects identified during integration could potentially have been detected earlier with more thorough testing? | Percentage of defects by week by test site that could have been detected in preceding test sites and phases |
| | How frequently are duplicate defects detected at multiple test sites during testing? | Count of defects by week discovered but, on analysis, determined to have been previously found at another test site |
| | How much time is spent troubleshooting and resolving problems uncovered at each verification stage? | Mean time to isolate problems to a defect which is then scheduled in a future build by week |
| Forecast realistic projections of release dates | How stable are the requirements for releases? | New and changed requirements per release (planned vs actual) |
| | When will the discovery of defects have reached a diminishing return? | Defects discovered per hour of new testing by week; defects discovered per hour of regression testing by week |
| | How many hours of critical test procedures remain to be run by week in the future? | Burn-down S-curve of critical test procedures, by site |
| | How frequently do test procedures successfully complete to determine pass vs fail? | Percentage of test procedures completed by week, by pass/fail, by test site |
| | How long will it take to run required regression tests? | Average time per regression test; frequency of injection of new defects during repair; quantity of regression tests required to cover areas of new and changed code |
| | How long will it take to fix critical problems? | Mean flow time & range to repair problems by component, by problem severity, by week resolved |
| Forecast product quality at a point in the future | What quantity of defects is leaking into subsequent testing sites? | Defects detected per thousand lines of new or changed code by verification phase; defects detected per thousand lines of code in validation (after verification completed for a given baseline) |
| | How many of planned tests are forecast to be completed by a future date? | Burn-down S-curve of critical test procedures, by site |
| | What percentage of critical defects to total defects are being uncovered during testing? | Percentage of critical to total defects found, by week, by test site |
| | How many of current and forecast critical problems are anticipated to be repaired and regression tested over time? | Burn-down S-curve of hours remaining, critical tests only (new and regression) |
| Determine if sufficient testing has been planned | How much additional testing would it take to reveal more defects? | Defects discovered per hour of new testing by week |
| | How many of the system states have been and will be exercised? | S-curve of critical system states exercised (planned vs actual) by week, by test site |
| | How many of the input / output pathways of the system have been traversed? | S-curve of unique i/o pathways exercised (planned vs actual) by week, by test site |
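
Two metric shapes recur throughout the table: a rate (defects discovered per hour of testing, by week and site) and an S-curve (cumulative planned vs actual). The following is a minimal sketch of how both might be computed from simple weekly records. The record layout `(week, site, test_hours, defects_found)` and the function names are assumptions made for illustration, not a prescribed format.

```python
from collections import defaultdict
from itertools import accumulate


def defects_per_test_hour(records):
    """records: iterable of (week, site, test_hours, defects_found).

    Returns {(week, site): defects per test hour}, omitting any
    week/site combination in which no test hours were logged.
    """
    hours = defaultdict(float)
    defects = defaultdict(int)
    for week, site, test_hours, found in records:
        hours[(week, site)] += test_hours
        defects[(week, site)] += found
    return {key: defects[key] / hours[key] for key in hours if hours[key] > 0}


def s_curve(weekly_counts):
    """Cumulative totals by week: the values plotted on an S-curve."""
    return list(accumulate(weekly_counts))


def s_curve_gap(planned_weekly, actual_weekly):
    """Planned minus actual cumulative totals, week by week."""
    planned = s_curve(planned_weekly)
    actual = s_curve(actual_weekly)
    return [p - a for p, a in zip(planned, actual)]


# Illustrative numbers only.
records = [(1, "site_a", 10.0, 5), (1, "site_a", 10.0, 3), (2, "site_a", 20.0, 2)]
print(defects_per_test_hour(records))
print(s_curve_gap([2, 3, 5], [1, 3, 4]))
```

A positive gap means actual progress is behind plan at that week; a falling defect discovery rate over successive weeks is the kind of signal the 'diminishing return' question is asking about.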
While the number of measurements in the table above may seem overwhelming, the goals and questions can all be satisfied with a relatively small set of information about each defect and how and when it was detected, isolated, and resolved, namely:
- Defect characterization - when and where it was detected, and how serious it is
- Information about what it took to find and repair these defects - time and effort
- Information about the component in which the defect was introduced - how big, how complex, and against which version (so statistics like injection and removal rates of defects per thousand lines of code can be calculated)
- Information to relate the defect to the individual development and verification processes which were used to uncover the defect, to assess their efficiency and effectiveness
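
The four kinds of information listed above can be sketched as a single per-defect record, from which several of the table's metrics fall out as simple aggregations. The field names, the severity scale, and the helper functions here are hypothetical; a real tracking system will have its own schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DefectRecord:
    defect_id: str
    # Characterization: when and where detected, how serious
    detected_at: datetime
    test_site: str
    severity: str                         # assumed scale, e.g. "critical" / "major" / "minor"
    # What it took to find and repair
    isolated_at: Optional[datetime]
    resolved_at: Optional[datetime]
    # Component in which the defect was introduced
    component: str
    component_version: str
    # Process that uncovered the defect
    detecting_activity: str               # e.g. "design verification", "validation"


def defects_per_kloc(defects, kloc_by_component):
    """Defects per thousand lines of code for each (component, version).

    kloc_by_component maps (component, version) to size in thousands of lines.
    """
    counts = Counter((d.component, d.component_version) for d in defects)
    return {
        key: counts[key] / kloc
        for key, kloc in kloc_by_component.items()
        if kloc > 0
    }


def mean_hours_to_isolate(defects):
    """Mean time from detection to isolation in hours, over isolated defects only."""
    spans = [
        (d.isolated_at - d.detected_at).total_seconds() / 3600
        for d in defects
        if d.isolated_at is not None
    ]
    return sum(spans) / len(spans) if spans else None
```

Grouping the same records by `test_site`, `severity`, or `detecting_activity` would give the per-site, per-severity, and per-phase views that the table asks for.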
Often a single capture of data can thus be aggregated to answer multiple questions and contribute insights toward multiple goals. This can only be achieved with careful planning, however, as even the minimal set above can vary significantly in the cost of collection, the accuracy and uniformity of the information, and the resulting usability of what is produced. For example, the cost of rework is often difficult to track over time without recording the per-defect effort (in hours or minutes) of every person who participated. For this reason, such measures are typically best captured as subjective assessments, such as 'easy' (<1 person-hr), 'typical' (1-8 person-hrs), 'substantial' (8-40 person-hrs), and 'major' (>40 person-hrs), rather than as absolute measures; this eases the collection burden while preserving the ability to meet the original business goal. However, even with bucketing, such assessments, while very important, can be difficult to analyze across organizations, tracking systems, and processes without considerable planning.
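
As a rough sketch of how bucketed assessments can still serve the business goal, the bucket labels can be turned back into a bounded range of total rework effort. The bucket limits are taken from the text; the dictionary layout and function name are assumptions, and the 'major' bucket is treated as open-ended because the text gives it no upper limit.

```python
# (low, high) person-hours per bucket; None means no stated upper limit.
EFFORT_BUCKETS = {
    "easy": (0, 1),
    "typical": (1, 8),
    "substantial": (8, 40),
    "major": (40, None),
}


def rework_range(assessments):
    """Return (low, high) total person-hours implied by a list of bucket labels.

    high is None as soon as any open-ended bucket ('major') is present,
    since no upper bound can then be stated.
    """
    low = 0
    high = 0
    for label in assessments:
        bucket_low, bucket_high = EFFORT_BUCKETS[label]
        low += bucket_low
        if high is None or bucket_high is None:
            high = None
        else:
            high += bucket_high
    return low, high


print(rework_range(["easy", "typical", "typical"]))
print(rework_range(["major", "easy"]))
```

The result is a range rather than a point estimate, which is the trade being made: less collection burden in exchange for coarser, but still usable, rework totals.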
Many groups prefer to collect and count only the minimum set of information (the first item in the list above). While this may be attractive, it makes it difficult to use the information to manage product quality proactively or to make accurate predictions. All you will be able to do is report on the past.
Of course, you can always start with this basic information and expand what you collect over time. If you make this choice, though, you may not be able to go back and analyze earlier phases for which the additional data was never collected, and so the reliability of your predictions will take more time to validate and improve. Always collect at least three points before presuming that a trend has been established.
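
The three-point rule can be enforced directly in whatever code reports trends. The sketch below fits an ordinary least-squares slope to equally spaced weekly values and declines to report one when there are too few points. The function name and the `min_points` parameter are illustrative assumptions, and three points is treated only as the minimum stated above, not as evidence that the trend is statistically reliable.

```python
def trend_slope(values, min_points=3):
    """Least-squares slope of values against their index (0, 1, 2, ...).

    Returns None when there are fewer than min_points values, so that
    callers cannot report a trend from one or two observations.
    """
    n = len(values)
    if n < max(min_points, 2):  # a slope needs at least two points in any case
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


# Illustrative weekly defect discovery rates (defects per test hour).
print(trend_slope([0.5, 0.4]))        # too few points: no trend reported
print(trend_slope([0.6, 0.4, 0.2]))   # negative slope: discovery rate is falling
```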
