Pflogging

the never-ending quest for pragmatic solutions, useful plans, flawless execution, and designs that endure
The overhead of switching contexts

Submitted by Bryan Pflug on Sat, 02/16/2008 - 07:23
  • Process-based improvements
  • Diagnosing

For a few minutes, let's think about the data collection, decision-making, and status reporting involved as a person or work group performs their work, and as they switch from performing one set of tasks to another. Let's call this monitoring and control the 'Work Operating System' (WOS) for their production (as an analog of an operating system for a computer). Like a computer system, when a person or work group switches contexts, there is overhead (in both time and energy) involved.

Psychologists describe these context switches as disrupting an individual's psychological flow, and for this reason, such changes can significantly reduce the overall throughput of work teams, whether small or large. Context switches need to occur when one set of tasks completes, or when the Work Operating System has to respond to high-priority jobs (such as emergent work, which may be more important than anything else and deserve expedited attention). When requests for expedited work are authorized, the system must have sufficient resources and operating performance to accommodate this emergent work. After it is accomplished, the system must continue to process both normal-priority jobs (time-critical in the short term) and lower-priority jobs (batch runs that are not immediately time-critical, but are still important over the longer term), and have sufficient throughput capacity to 'catch up' with the demand over time.

In such a processing system, work can arrive at rates far different from the capacity of the system to process that work, and in various states of 'readiness' for execution. There may need to be separate input job queues for different types of work (for example, for different customers, or different service levels, or different stages of 'readiness') and these job queues can each change length quite quickly. The jobs in these queues themselves may also need to be regularly reprioritized or shuffled, to respond to external conditions. To further complicate things, these job queues may be 'serviced' by multiple, concurrent processing agents that themselves have different characteristics.

Planning work so that a desired aggregate throughput is achieved through such a system is very challenging, and is only possible when the underlying system itself is operating deterministically. Over time, once a system (and its resulting performance and variation) has been calibrated as achieving a given rate for a known set of inputs, improvements can be made in several areas to gradually increase that rate:

  • characteristics of the system itself (queue size, etc),
  • improvements to bottlenecks and constraints,
  • management and controls on the input job stream.

But achieving determinism of the underlying system itself is far more difficult than it sounds, and making the right decisions about what to change can involve considerable risk, until that determinism is achieved. Similarly, attempting to operate a system at a higher rate than it is capable of, for long durations, is also a risky venture.

Consider the rule set that is used when switching from one task to another (the WOS scheduling algorithm). This rule set comes into play in many different circumstances, such as when a roadblock is encountered in performing one task - waiting on resources, information, or decisions - or when overall priorities need to be reevaluated so that focus can be redirected. The need to apply this rule set can arise while in the middle of processing another task, when a task completes, or when a critical event (time, or some external stimulus) has transpired that necessitates the allocation of attention and resources to a new set of actions. Just determining which task to run next takes time, as some effort is required to analyze and characterize the options. A scheduling algorithm can make such decisions if it indeed has 'control', but if it's not even clear which component is responsible for scheduling and decision-making, the throughput of the system will be unpredictable.

In order to determine what to do next, it is usually also useful to have some idea of how long things in the job queue will take before they begin execution, and to complete execution once they have begun. Preference is usually given at some level to completing things which are already started and close to completion (to keep the list of things that have to be regularly re-evaluated shorter). But sometimes, it may make more sense to start something else, at the expense of completing work in process.

Tracking these decisions and their consequences for overall performance is important to improving throughput. This allows adjustments to be implemented over time, as patterns develop (an example might be correcting for chronic underestimation of how long things will take). Selecting the right thing to improve, in the absence of such data, can be frustrating, because you may end up working on the wrong things, pouring a lot of energy into them, and not seeing the results that you desire.
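As a rough illustration of correcting for chronic underestimation, the sketch below compares estimated and actual durations for completed jobs and derives a single scaling factor. The function name, the (estimate, actual) pair layout, and the sample figures are all assumptions made for this example, not part of any particular tracking tool.

```python
def estimate_correction_factor(history):
    """Return total actual time divided by total estimated time.

    `history` is a list of (estimated_hours, actual_hours) pairs for
    completed jobs (a hypothetical record layout). A result above 1.0
    suggests that estimates have been running low.
    """
    total_estimated = sum(estimated for estimated, _ in history)
    total_actual = sum(actual for _, actual in history)
    if total_estimated <= 0:
        raise ValueError("history must contain positive estimates")
    return total_actual / total_estimated


def adjusted_estimate(raw_estimate_hours, history):
    """Scale a new estimate by the factor observed in past work."""
    return raw_estimate_hours * estimate_correction_factor(history)
```

With invented history such as `[(4, 6), (2, 3), (10, 15)]`, the factor is 1.5, so a new 8-hour estimate would be planned as 12 hours. A single ratio is a blunt instrument; a group with enough data might prefer separate factors per type of work.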

The scheduling algorithm to optimize performance of this Work Operating System for maximum effective throughput, when resources are constrained, is conceptually simple. The first scheduling heuristic is to give attention to high-priority tasks as they arise, by suspending lower-priority tasks. If no high-priority tasks exist, then the second heuristic comes into play. Work should be performed according to the priorities of the remaining tasks first, and the amount of remaining work to do second (so that things that are close to completion are completed, and there are thus fewer context switches over time). This favors working on tasks for which the minimum requirements have not yet been achieved, and completion is within striking distance. A walkthrough of a group's scheduling algorithm (from documented processes, and with all involved factors and decision-makers) under various scenarios is an effective way of verifying the fitness of the algorithm to these scenarios. In practice, though, using a tool such as a kanban can also be very effective, as it allows a visual indicator of overall system health to emerge, and enables the group to dynamically tune its scheduling algorithm and dashboard over time, until an effective approach 'settles in'.
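The two heuristics described above can be sketched in a few lines. This is a minimal illustration, not a description of any real scheduler: the `Task` fields, the convention that a lower number means higher priority, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int            # assumed convention: 1 is most urgent
    remaining_hours: float   # estimated work left to finish the task


def should_preempt(current, candidate):
    """First heuristic: suspend current work only for a strictly
    higher-priority task, to avoid needless context switches."""
    return candidate.priority < current.priority


def pick_next(tasks):
    """Second heuristic: choose by priority first, then by least
    remaining work, so nearly finished tasks get completed."""
    if not tasks:
        return None
    return min(tasks, key=lambda t: (t.priority, t.remaining_hours))
```

Given two tasks of equal priority, `pick_next` returns the one with less work remaining; `should_preempt` returns false for an equal-priority arrival, which keeps the number of switches down.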

Tracking the time overhead associated with these activities and decision-making (the evaluation of what to change to, and the process of reconfiguration for the change) is important, especially when overall system characteristics - responsiveness to arrivals of new work, throughput, etc. - are key success criteria for the system overall. But accounting for (and reducing) this overhead is also a key factor in assuring that the overall system behavior is predictable, especially as the system approaches capacity limits.

If you schedule a computer at 90% of capacity or more, it can begin thrashing, a condition in which overall throughput drops significantly, due to resource constraints and constant context switches. This can happen with people or work teams, too, if they are not able to focus attention on individual tasks, but are constantly interrupted. Meetings, email, phone calls, and other disruptors can all become a source of these interruptions. Of course, if all jobs complete, a computer's operating system essentially 'spins' in an idle loop, waiting for more work. Spending much time in this state is also not an efficient use of resources. So if overall utilization is to be maximized while avoiding thrashing, there should be a way of minimizing interrupts, feeding in the right mix of medium-priority tasks, and keeping low-priority tasks on hand for those times when there is nothing else to do.
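One way to see why a heavily loaded system becomes sluggish is a simple queueing model. The sketch below assumes an M/M/1 queue (a single server with random arrivals and random service times), which is a textbook simplification brought in for illustration and not something the discussion above specifies. It models waiting delay only, not thrashing itself, and the function name is hypothetical.

```python
def mean_time_in_system(service_time, utilization):
    """Average time a job spends waiting plus being served in an
    M/M/1 queue: service_time / (1 - utilization).

    `utilization` is the fraction of capacity in use, 0 <= u < 1.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1 - utilization)
```

Under this model, a job that needs one hour of service spends about two hours in the system at 50% utilization and about ten hours at 90%. The delay grows without bound as utilization approaches 100%, which is one reason leaving some slack tends to make behavior more predictable.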

To stretch our analogy a bit further, consider the perspective of the computer center's leadership team in this WOS analogy (the parent organization, or the leader of a work team). They may not want to buy more computers, even when there are backlogs of tasks to do, until they get insight into how effectively the existing computers are being used. Of course, this decision should be based on risk-based assessments of the cost of delaying jobs vs the cost of acquiring new computers, and the time it takes to bring them on-line. While there is a common belief that adding more computers will provide corresponding linear increases in capacity, it should be recognized that while such changes are being made, and for a period of transition after, the change itself takes time from other resources, at least until the aggregate is fully operational; even then, there will be losses incurred. Simply put, 8 parallel computers generally do not deliver 8 times the throughput of one computer, due to communications overhead, latency, and the complexity of scheduling.
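The diminishing return from adding machines can be illustrated with Amdahl's law, a standard model brought in here for illustration rather than one named above. It assumes that some fraction of the work cannot be spread across machines; it does not model communications overhead or latency directly, so real results may be worse. The function and parameter names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup predicted by Amdahl's law:
    1 / ((1 - p) + p / n), where p is the fraction of work that can
    be shared and n is the number of workers."""
    if not 0 <= parallel_fraction <= 1:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1 - parallel_fraction
    return 1 / (serial_fraction + parallel_fraction / workers)
```

If, say, 90% of the work could be shared (an invented figure), 8 workers would yield a speedup of only about 4.7 under this model, well short of 8.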

One of the values of the analogy of a WOS is that it helps to emphasize the value of conducting periodic reviews on data collected against jobs run in a given period, and making decisions to tune the job queue, scheduling algorithms, and associated priorities accordingly. Such is the focus of computer operators for major data centers around the world, and for operations managers of production centers. Such operating system tweaks are not easy, as they can be disruptive to overall performance, until they are just right. And when the platform itself changes, the tuning must begin again.

This WOS analogy is just a mental model - but it points out that one way of evaluating performance is to review parameters about job status and throughput, through consideration of issues such as the following:

  • What is the overall utilization (relative to capacity), and what might be done to improve that?
  • How long are benchmark jobs taking to complete, and how does that compare to the last time they ran?
  • Which resources are most frequently the constraint on processing?
  • How frequently have high priority tasks been injected into the queue, how much time is spent working them, and how disruptive is that?
  • What is the estimated time remaining to complete tasks in the job queues, at current performance levels?
  • Which jobs have had to be (or may need to be) re-started because of interdependencies with other jobs that themselves have had to be rescheduled?
  • Are we servicing the most important tasks at the level of responsiveness that they require?

You can see how designing a system that allows questions like these to be answered is important to improving the performance of an operating system, whether that operating system runs a computer, or is the internalized rule set used to decide what people work on from hour to hour. You can also see why it's important to hold the discussion in terms of 'machine time' as well as 'calendar time', because some jobs may take a long period to complete, but not take much effort; this might be due to the workload of other tasks that are concurrently executing.
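A few of these review questions can be answered from very simple job records. The sketch below assumes each completed job is logged with its effort ('machine time'), its elapsed duration ('calendar time'), and whether it was expedited. The record layout and all of the names are hypothetical, and "flow efficiency" is a label borrowed for this example rather than a term used above.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    name: str
    effort_hours: float     # 'machine time' actually spent on the job
    elapsed_hours: float    # 'calendar time' from start to finish
    expedited: bool = False


def utilization(jobs, capacity_hours):
    """Fraction of available capacity spent working on jobs."""
    return sum(j.effort_hours for j in jobs) / capacity_hours


def expedited_share(jobs):
    """Fraction of total effort consumed by expedited work."""
    total = sum(j.effort_hours for j in jobs)
    if total == 0:
        return 0.0
    return sum(j.effort_hours for j in jobs if j.expedited) / total


def flow_efficiency(job):
    """Effort divided by elapsed time; low values mean the job
    mostly sat waiting while other work was in progress."""
    return job.effort_hours / job.elapsed_hours
```

A job with 10 hours of effort spread over 40 elapsed hours has a flow efficiency of 0.25, which is the kind of gap between 'machine time' and 'calendar time' that a periodic review would want to surface.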

However, unlike computers, people are not robots. They have strengths and weaknesses on different days or in different situations. They have emotional needs and desires, and those influence the outcomes that are achieved. They are the key in determining the extent to which the system can be made deterministic, and how quickly it can be evolved or tuned.
