How to burn down problems before they burn you
A problem is a questionable property observed during some evaluation or operation of a system. It may be reclassified as an anomaly if it's determined to be a nuisance, as a failure if it affects operation of the system in an unacceptable manner, or as a request for enhancement if desired functionality is in fact missing. Whatever its classification, it is always a distraction if its root cause is difficult to isolate, if it cannot be duplicated, or if the problem is with the user, not the system. The problem may even become a feature if it turns out to be useful in some circumstances. Which of these paths is actually pursued depends heavily on how well the problem is written up in the first place - and if you think that is always done well, read this.
Much of project management, systems or software engineering, and verification is about solving problems. If you don't think there are problems to solve in the projects or systems you manage, you are very lucky, are clueless, or have happened upon a trivial situation in which robust project management and engineering techniques are not required (which means spending time here is probably not very useful).
Some groups prefer to use the term 'issues' rather than problems, and attempt to distinguish between two types of issues - changes (or making things different than originally intended) and corrective actions (or fixing things to work as originally intended), though in practice this is often a subjective assessment that is quite prone to bias. Others may call them squawks, trouble tickets, or even kaizens. Regardless of the name you choose, what is underway here is an attempt to close a gap between what is needed and what exists.
Classic internet resources associated with such problem solving include:
- Basic Guidelines to Problem Solving and Decision Making
- Problem solving strategies
- Problem solving tools
Each of these requires an early and clear definition of what the problem is in the first place.
The time it takes to resolve problems is highly dependent upon the maturity of your issue management and troubleshooting approaches. Problems are only likely to be solved effectively when you have the discipline to track their status to closure, typically through some kind of issue tracking system or a similar means of Getting Things Done. For an operational system, a problem will usually also be analyzed in terms of how it may have been injected in the first place, the conditions under which its symptoms were detected, and the information that will allow you to locate the defect(s) behind those symptoms. A call center may also be useful to centralize the capture and management of problems overall.
Whether under development or in operation, the steps and flow required to implement repairs and reverify the solution must be considered; these items collectively can be considered a verification strategy. The importance of using an effective verification strategy, and its impact on the flow time for solving problems, is highlighted here.
A problem's lifecycle typically consists of the following steps:
- A problem detector informs a problem collector about some problem, typically by writing an 'issue report'
- An aggregated collection of problems is reviewed on a periodic basis to track progress and coordinate action across responsible organizations
- An owner for the problem is assigned. The owner monitors and evangelizes the following actions.
  - Reproduction of the problem
  - Isolation of the symptoms to an originating component or components
  - Locating the specific defect(s) which caused the problem in the component, and developing a candidate solution
  - Confirming that this solution works with affected users
  - Delivering a revised component that contains the repaired solution to the users
- The problem is closed and archived
Some of the challenges associated with managing and solving problems include:
- How to organize and manage problems and action across multiple tracking systems and organizations?
- How to prioritize actions on the most critical problems?
- How to analyze whether similar problems have occurred in the past, and if so, how to track the related set as one issue?
An effective issue report is well-structured, is as simple and general as possible while assuring effective communications and efficient resolution of the problem over the above lifecycle, and is neutral and stays with the facts. In a group setting, meeting these simple requirements can often become quite complex. A useful workflow to track progress of problem-solving efforts over time, and which addresses the above suggestions, might include the following states:
- Proposed - an initial, unanalyzed state, in which a problem is submitted by a point of contact requesting its resolution
- Hold - a state in which activities associated with the problem are suspended until some defined trigger occurs
- Active - the state in which the problem is being worked
- Resolved - a decision has been reached that further action is no longer required. The problem is typically kept in this state, for reporting purposes, for a defined period of time pending transition (typically automatic) to the next state.
- Closed - Such problems typically are not included in reports, except for 'closed problem' reports.
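As a rough illustration, the five states above could be modeled as a small state machine. This is only a sketch: the text names the states but not which moves between them are legal, so the `ALLOWED` table and the `transition` helper below are assumptions, not a prescribed workflow.

```python
from enum import Enum


class State(Enum):
    PROPOSED = "Proposed"
    HOLD = "Hold"
    ACTIVE = "Active"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Assumed transition table: the states come from the list above,
# but the permitted moves are illustrative guesses.
ALLOWED = {
    State.PROPOSED: {State.ACTIVE, State.HOLD, State.RESOLVED},
    State.HOLD: {State.ACTIVE, State.RESOLVED},
    State.ACTIVE: {State.HOLD, State.RESOLVED},
    State.RESOLVED: {State.ACTIVE, State.CLOSED},
    State.CLOSED: set(),
}


def transition(current: State, target: State) -> State:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Rejecting illegal moves in one place keeps every report in a state that the periodic reviews can rely on; your own tracking system may well permit a different set of moves.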
Tracking the flow of work through such states can be done effectively, and visually, through kanbans. Additionally, while in the active state, problem resolution protocols typically require further sub-states, to track the problem through many other stages, and involve collecting answers to many different questions. These questions aren't always answered serially, but the aggregate information which is collected in this processing is an indicator of the overall status of resolving the problem, and should highlight any blocking issues which potentially will hold up resolution of the immediate fix 'to the field'. The sooner these questions can be answered, the more rapidly a solution can be found:
- Has the problem been acknowledged as one worthy of attention?
- Are the symptom attributes of the problem sufficiently well defined to allow diagnosis and isolation to occur and determine what needs to change?
- Have the solution attributes been established (priority, need dates, etc)?
- Has an owner been assigned who has committed to planning and tracking progress towards a solution?
- Has the owner reproduced the problem and confirmed that it is well written and can be solved?
- Is the diagnosis complete, solution options analyzed, and action steps necessary to resolve the problem identified (including who needs to do that work)?
- Are resources and steps necessary to close all gaps committed to with firm dates?
- Are all necessary steps for the solution implemented and ready for verification?
- Was the verification successful so that resolution can be communicated?
- Should the problem be withdrawn from consideration because it's no longer important or relevant? (This decision can be difficult to make properly but, once made, needs to be communicated in all related problem logs.)
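One way to turn these questions into a status indicator is to record a yes/no answer for each and report what is still open. The sketch below is an assumption-laden illustration: the short keys in `CHECKLIST` are hypothetical names for the questions above, and the withdrawal question is left out because it is an exit from the workflow rather than a step toward a fix.

```python
# Hypothetical short keys for the questions above, in the same order.
CHECKLIST = [
    "acknowledged",
    "symptoms_defined",
    "solution_attributes_set",
    "owner_assigned",
    "reproduced",
    "diagnosis_complete",
    "resources_committed",
    "solution_implemented",
    "verification_successful",
]


def progress(answers: dict[str, bool]) -> tuple[int, list[str]]:
    """Return (number of questions answered yes, keys still open)."""
    # A question with no recorded answer counts as still open.
    open_items = [key for key in CHECKLIST if not answers.get(key, False)]
    return len(CHECKLIST) - len(open_items), open_items
```

Because the questions aren't always answered serially, the function reports every open item rather than stopping at the first one.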
Tracking and understanding where any particular problem is with respect to the above workflow is important in order to understand what progress has been made, to coordinate all the actions required to implement a solution, and to be realistic about what it will really take (in time and resources) to resolve issues. Beyond this workflow state and sub-state information, an issue report should also include:
- A one-line summary of the problem used as a short-hand descriptor (a 'title')
- A unique identifier that allows this problem to be cross-referenced with others
- A succinct and accurate description of the problem's symptoms, which are typically described as a contrast between what is happening and what should be happening (the 'gap').
- A complete description of the environment in which these symptoms were observed. This should fully describe the steps necessary to recreate the symptoms, including input test conditions, configuration version descriptions of the components under test (whether a formal baseline or unique configuration) and the test fixture itself, and instructions on setting up the test.
- Linkages to related problems
- A log of actions taken to attempt to isolate and diagnose the underlying problem
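The fields above map naturally onto a simple record. The following is a minimal sketch, assuming Python 3.9 or later; the class name `IssueReport`, its field names, and the `missing_fields` helper are all hypothetical and not taken from any particular issue tracking system.

```python
from dataclasses import dataclass, field


@dataclass
class IssueReport:
    identifier: str   # unique ID used for cross-referencing
    title: str        # one-line summary
    observed: str     # what is happening
    expected: str     # what should be happening
    environment: str  # configuration, test fixture, setup and steps to recreate
    related: list[str] = field(default_factory=list)     # IDs of related problems
    action_log: list[str] = field(default_factory=list)  # isolation and diagnosis attempts

    def missing_fields(self) -> list[str]:
        """Names of required text fields that are empty or whitespace only."""
        required = ("identifier", "title", "observed", "expected", "environment")
        return [name for name in required if not getattr(self, name).strip()]
```

Splitting the symptom description into `observed` and `expected` makes the 'gap' explicit, and a check like `missing_fields` could be run before a report is handed off to another party.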
Such information is crucial when handing off an issue from one responsible party to another (such as from a problem detector to a component producer), as it helps to avoid traversing troubleshooting ground already covered, and helps to ensure that the problem can in fact be re-created. If it cannot, whoever was assigned the problem is likely to just return it as something that 'cannot be reproduced', even though the latent defect may remain. Ideally, the issue tracking system in use will allow groups to 'subscribe' and be notified via email when any of the above information changes.
Every new problem should be assigned a priority, which should be relative to other problems already under review.
- The higher the priority, the sooner the problem should be addressed, given available resources
- A problem's priority should not be confused with the problem's severity, which is the impact on stakeholders
- Prioritizing problems should be the primary method of controlling development and problem solving
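The distinction between priority and severity can be sketched by keeping them as separate fields and ordering work by priority alone. The numbering convention (1 means first or worst) and the use of severity as a tie-breaker are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    identifier: str
    priority: int  # 1 = work on this first (assumed convention)
    severity: int  # 1 = greatest impact on stakeholders (assumed convention)


def work_queue(problems: list[Problem]) -> list[Problem]:
    """Order problems by priority; severity only breaks ties (an assumption)."""
    return sorted(problems, key=lambda p: (p.priority, p.severity))
```

With this ordering, a severe problem that has been given a low priority (for example, because no resources are available yet) still sits behind less severe problems that were ranked higher.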
Policies on the aggregate collection of issues should address the following questions:
- Who enters issues into the issue tracking system, and by when?
- Who gets access to what issues (for protection of intellectual property, tracking of status, etc)?
- How will assignments of who is to investigate the problem be made and recorded?
- How will issues be classified once they are entered?
- Who will set priorities?
- How will capacity limitations and capabilities be managed against demand?
- How will need dates and realistic solution dates be balanced?
- Who will have authority to close problems, and what actions are required at problem closure (how do you know it should be closed)?
All of this is a lot of work, which is motivation to minimize the creation of problems in the first place, to establish a verification strategy to find them early before lots of people are involved, and to give adequate attention to planning so that the management of these activities and information can be done smoothly and efficiently.
- Bryan Pflug's blog
