
Risk Management Theatre: On Show At An Organization Near You

Translations: Korean

One of the concepts that will feature in the new book I am working on is “risk management theatre”. This is the name I coined for the commonly encountered control apparatus, imposed in a top-down way, which makes life painful for the innocent but can be circumvented by the guilty (the name comes by analogy with security theatre). Risk management theatre is the outcome of optimizing processes for the case that somebody will do something stupid or bad, because (to quote Bjarte Bogsnes talking about management) “there might be someone who cannot be trusted. The strategy seems to be preventative control on everybody instead of damage control on those few.”

Unfortunately risk management theatre is everywhere in large organizations, and reflects the continuing dominance of the Theory X management paradigm. The alternative to the top-down control approach is what I have called adaptive risk management, informed by human-centred management theories (for example the work of Ohno, Deming, Drucker, Denning and Dweck) and the study of how complex systems behave, particularly when they drift into failure. Adaptive risk management is based on systems thinking, transparency, experimentation, and fast feedback loops.

Here are some examples of the differences between the two approaches.

Adaptive risk management: people work to detect problems through improving transparency and feedback, and solve them through improvisation and experimentation.
Risk management theatre: management imposes controls and processes which make life painful for the innocent but can be circumvented by the guilty.

Adaptive: Continuous code review, in which engineers ask a colleague to look over their changes before check-in, technical leads review all check-ins made by their team, and code review tools allow people to comment on each other’s work once it is in trunk.
Theatre: Mandatory code review enforced by check-in gates, where a tool requires changes to be signed off by somebody else before they can be merged into trunk. This is inefficient and delays feedback on non-trivial regressions (including performance regressions).

Adaptive: Fast, automated unit and acceptance tests which inform engineers within minutes (for unit tests) or tens of minutes (for acceptance tests) if they have introduced a known regression into trunk, and which can be run on workstations before commit.
Theatre: Manual testing as a precondition for integration, especially when performed by a different team or in a different location. Like mandatory code review, this delays feedback on the effect of the change on the system as a whole.

Adaptive: A deployment pipeline which provides complete traceability of all changes from check-in to release, and which detects and rejects risky changes automatically through a combination of automated tests and manual validations (a minimal sketch of such a commit-stage gate follows this table).
Theatre: A comprehensive documentation trail so that, in the event of a failure, we can find the human error that is the root cause, in keeping with the mechanistic, Cartesian paradigm that applies only to systems which are not complex.

Adaptive: Situational awareness created through tools which make it easy to monitor, analyze and correlate relevant data. This includes process, business and systems-level metrics as well as the discussion threads around events.
Theatre: Segregation of duties, which acts as a barrier to knowledge sharing, feedback and collaboration, and reduces the situational awareness that is essential to an effective response in the event of an incident.
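To make the “adaptive” column concrete, here is a minimal sketch of the kind of commit-stage gate a deployment pipeline might run. It assumes a git repository, a pytest suite split into tests/unit and tests/acceptance, and a simple JSON log for traceability; these names and the layout are illustrative, not a prescription.

```python
#!/usr/bin/env python3
"""Minimal sketch of a commit-stage pipeline gate (illustrative only)."""

import json
import subprocess
import sys
import time


def run(cmd):
    """Run a command and return True if it exited successfully."""
    return subprocess.run(cmd).returncode == 0


def main():
    # Identify the change being validated, so every result is traceable
    # back to a specific check-in.
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()

    started = time.time()
    # Fast unit tests first: feedback within minutes.
    unit_ok = run(["pytest", "tests/unit", "-q"])
    # Slower acceptance tests only run if the unit tests pass.
    acceptance_ok = unit_ok and run(["pytest", "tests/acceptance", "-q"])

    # Append the outcome to a log so the audit trail is a by-product of
    # normal work rather than a separate documentation exercise.
    record = {
        "commit": commit,
        "unit_tests_passed": unit_ok,
        "acceptance_tests_passed": acceptance_ok,
        "duration_seconds": round(time.time() - started, 1),
    }
    with open("pipeline-log.json", "a") as log:
        log.write(json.dumps(record) + "\n")

    # Reject risky changes automatically: a failing change goes no further.
    sys.exit(0 if acceptance_ok else 1)


if __name__ == "__main__":
    main()
```

The point is that feedback is fast and the evidence trail falls out of doing the work, rather than being produced after the fact to satisfy an auditor.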

It’s important to emphasize that there are circumstances in which the countermeasures on the right are appropriate. If your delivery and operational processes are chaotic and undisciplined, imposing controls can be an effective way to improve – so long as we understand they are a temporary countermeasure rather than an end in themselves, and provided they are applied with the consent of the people who must work within them.

Here are some more general differences between the two approaches:

Adaptive risk management: people work to detect problems through improving transparency and feedback, and solve them through improvisation and experimentation.
Risk management theatre: management imposes controls and processes which make life painful for the innocent but can be circumvented by the guilty.

Adaptive: Principle-based and dynamic: principles can be applied to situations that were not envisaged when the principles were created.
Theatre: Rule-based and static: when we encounter new technologies and processes (for example, cloud computing), we need to rewrite the rules.

Adaptive: Uses transparency to prevent accidents and bad behaviour. When it’s easy for anybody to see what anybody else is doing, people are more careful. As Louis Brandeis said, “Publicity is justly commended as a remedy for social and industrial diseases. Sunlight is said to be the best of disinfectants; electric light the most efficient policeman.”
Theatre: Uses controls to prevent accidents and bad behaviour. This approach is the default for legislators as a way to prove they have taken action in response to a disaster. But controls limit our ability to adapt quickly to unexpected problems. This introduces a new class of risks, for example over-reliance on emergency change processes because the standard change process is too slow and bureaucratic.

Adaptive: Accepts that systems drift into failure. Our systems and the environment are constantly changing, and there will never be sufficient information to make globally rational decisions. Humans solve our problems, and we must rely on them to make judgement calls.
Theatre: Assumes humans are the problem. If people always follow the processes correctly, nothing bad can happen, so controls are put in place to manage the “bad apples”. This ignores the fact that process specifications always require interpretation and adaptation in reality.

Adaptive: Rewards people for collaboration, experimentation, and system-level improvements. People collaborate to improve system-level metrics such as lead time and time to restore service. There are no rewards for “productivity” at the individual or functional level. Accepts that locally rational decisions can lead to system-level failures.
Theatre: Rewards people based on personal “productivity” and local optimization: for example, operations people optimizing for stability at the expense of throughput, or developers optimizing for velocity at the expense of quality (even though these are false dichotomies).

Adaptive: Creates a culture of continuous learning and experimentation. People openly discuss mistakes in order to learn from them, and conduct blameless post-mortems after outages or customer service problems with the goal of improving the system. People are encouraged to try things out and experiment (with the expectation that many hypotheses will be invalidated) in order to get better.
Theatre: Creates a culture of fear and mistrust. Encourages finger-pointing and a lack of ownership for errors, omissions and failures to get things done. As in: “If I don’t do anything unless someone tells me to, I won’t be held responsible for any resulting failure.”

Adaptive: Failures are a learning opportunity. They occur in controlled circumstances, their effects are appropriately mitigated, and they are encouraged as an opportunity to learn how to improve.
Theatre: Failures are caused by human error (usually a failure to follow some process correctly), and the primary response is to find the person responsible and punish them, then to rely on further controls and processes as the main strategy to prevent future problems.

Risk management theatre is not just painful and a barrier to the adoption of continuous delivery (and indeed to continuous improvement in general). It is actually dangerous, primarily because it creates a culture of fear and mistrust. As Bogsnes says, “if the entire management model reeks of mistrust and control mechanisms against unwanted behavior, the result might actually be more, not less, of what we try to prevent. The more people are treated as criminals, the more we risk that they will behave as such.”

This kind of organizational culture is a major factor whenever we see people who are scared of losing their jobs, or engage in activities designed to protect themselves in the case that something goes wrong, or attempt to make themselves indispensable through hoarding information.

I’m certainly not suggesting that controls, IT governance frameworks, and oversight are bad in and of themselves. Indeed, applied correctly, they are essential for effective risk management. ITIL, for example, allows for a lightweight change management process that is completely compatible with an adaptive approach to risk management. What’s decisive is how these frameworks are implemented. The way such frameworks are used and applied is determined by, and in turn perpetuates, organizational culture.

  • HV

    One interesting thing I just thought about is how typical implementations of regulations like DO-178B might fit in here. This is an extreme example, but certainly one where strict rules and process enforcement are very effective in helping to produce high quality software and manage risk (although at a very high cost, and certainly not very agile). There are examples of processes and tools from implementations of DO-178B that would fit on both sides of the table, so maybe there’s another criterion, other than “people work to detect problems through improving transparency and feedback, and solve them through improvisation and experimentation” vs. “management imposes controls and processes which make life painful for the innocent but can be circumvented by the guilty”, that defines “risk management theatre”…

    I have no idea what this criterion might be… This just came to my mind.

    • http://continuousdelivery.com/ Jez Humble

      Thanks for your comment HV. I see you’re going straight for the jugular by bringing up flight management software.

      I’m not saying it’s impossible to create high quality software and manage risk using top-down control – just that it’s very hard and painful. It can work in situations (such as flight management systems) where the system doesn’t have to change very much once it’s released.

      The problem comes if you have to make frequent changes to the system, and particularly if you have to do so rapidly in response to problems discovered in the field. In this case, much of the control apparatus makes it very difficult to get patches out quickly in a low-risk way – exactly the class of risk I talk about in the section “Uses controls to prevent accidents and bad behaviour.”

      This is a particular problem in medical devices, as mentioned in an excellent talk by Nancy Leveson excerpted by John Allspaw here: http://www.kitchensoap.com/2013/05/31/prevention-versus-governance-versus-adaptive-capacities/

      PS One of the many services Craig Larman has done is to show that iterative, incremental methods have a long history of use in safety-critical systems: http://blog.jezhumble.net/?p=5

      • HV

        I didn’t want to imply that it is a good idea to use these tools for business software, but maybe there are some aspects to learn from. I agree that this works best in mostly static environments. An airplane design does not change as much as, say, an HFT application – there are not that many new opportunities to save fuel by making a fast software delivery to the flight controls (and even if there were, the safety risks would be too high).

        I think one thing that gets overlooked in business software development – and I’m not sure if this is a risk management thing or just basic development itself – is the fact that the world keeps spinning. So a decision that was right two weeks ago might still produce the wrong result down the line – not because the requirements were wrong (or captured wrong), but just because they don’t fit the new world any more. I don’t know if there can be some kind of risk management method to deal with that other than “being aware” and reacting fast if stuff like this happens. Detection of the cause of “something went wrong” might be the most important part here.

        • http://continuousdelivery.com/ Jez Humble

          I don’t know if there can be some kind of risk management method to deal with that other than “being aware” and reacting fast if stuff like this happens. Detection of the cause of “something went wrong” might be the most important part here.

          I think this is exactly right. Situational awareness is actually something you can work towards through training, process, and building good tools. John Allspaw talks about it here (and see his slides)

          The capability to create resilient systems and to react fast is really important, which is why (as Allspaw discusses) I think TTR (time to restore service) is an essential metric to track; a minimal sketch of how it might be computed appears at the end of this comment. One of my main problems with risk management theatre is that it gets in the way of reducing TTR.

          Once you accept that change and failures are going to happen, you have to consider that one of the biggest risks is how fast you can safely respond to them.

          One of the goals of adaptive risk management is to be able to create situational awareness and respond quickly, effectively and safely when things go wrong.
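          As an aside, here is a minimal sketch of how TTR might be computed from incident records; the timestamps and field names below are hypothetical, and in practice they would come from a monitoring or incident-tracking system.

          ```python
          from datetime import datetime
          from statistics import mean

          # Hypothetical incident records: when service was impaired and
          # when it was restored.
          incidents = [
              {"down": datetime(2013, 8, 1, 9, 15), "restored": datetime(2013, 8, 1, 9, 42)},
              {"down": datetime(2013, 8, 7, 22, 3), "restored": datetime(2013, 8, 8, 1, 10)},
              {"down": datetime(2013, 8, 19, 14, 0), "restored": datetime(2013, 8, 19, 14, 25)},
          ]

          # Time to restore service per incident, in minutes, and the mean (MTTR).
          ttrs = [(i["restored"] - i["down"]).total_seconds() / 60 for i in incidents]
          print("TTR per incident (minutes):", [round(t) for t in ttrs])
          print("Mean time to restore (minutes):", round(mean(ttrs)))
          ```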

  • Pingback: Rules Or Principles? | Michael Lutton's Blog

  • Pingback: There’s No Such Thing as a “Devops Team” | Continuous Delivery

  • Pingback: On Antifragility in Systems and Organizational Architecture | Continuous Delivery

  • Pingback: Adaptive Risk Management instead of Risk Management Theatre | Technological Musings

  • Elizabeth Geno

    I see this exact pattern of ‘risk management theatre’ where I work in higher education. At one point I was in a division where the pointy-haired honcho made the team leads do stupid chants like ‘variation is the enemy!’. Our number one objective was to get ‘ding-free’ reports from the auditors by any means necessary. I taught a mini-course in forgery. ;) I thought the man was barking, but someone was pleased with his output: all of the highly-paid original members of the unit quit within 18 months (the useless ones are still there 8 years later). He smugly believed that his unit would be the model for ‘reforming’ the rest of the university’s administration, and by Cthulhu’s slimy tentacles, it’s starting to happen.

    I found a nearly perfect example of the better way working in a famous artisanal production bakery. We had no time or margin for bullsh*t.

    So why does this sacrifice of quality for control go on, when research consistently shows poor outcomes?

    I don’t think it’s a simple will to dominate; it’s a more subtle and heartless thing: to dehumanize people so as to be able to treat them as liabilities rather than assets. Those who enable such an outcome are rewarded; those who won’t are passed over or leave. There are few if any consequences for the beneficiaries of classic rent-seeking behavior.

  • Pingback: Always Agile · Application Pattern: Consumer Driven Contracts

  • Pingback: Always Agile · Organisation Antipattern: Release Testing

  • Pingback: Always Agile Consulting · Organisation Antipattern: Release Testing

  • Pingback: Always Agile Consulting · Application Pattern: Consumer Driven Contracts

  • Pingback: Always Agile Consulting · Organisational Antipattern: Consumer Release Testing