Safety Engineering: A Journey
We engineer our systems to be reliable. The profession has maintained a keen focus on testing approaches that best ensure everything goes right. We expect our people to think through potential failures and architect systems to be resilient to them. But despite our best efforts, unexpected failures will always occur.
This talk discusses ‘safety engineering’: the design of IT control systems for when unforeseen circumstances arise. Drawing on the experience of implementing a safety program within a high-frequency trading (HFT) company, the talk will cover where safety systems sit in the overall architecture, what they do, and when to invest in them. Finally, the talk will cover the practical aspects of implementing a safety regime within a company.
Outline/Structure of the Talk
- A background anecdote of where our company's safety journey began, the Knight Capital meltdown of August 2012.
- The incident wiped out a large company in a morning.
- The "cause" was a combination of deployment error and a repurposed feature toggle.
- Management asked what had prevented the same thing happening to their company. The answer: luck.
- A physical world example of safety engineering, the recent failed Soyuz launch. Contrast with an older example, the Challenger disaster.
- Introduce the idea of safety engineering: the design of systems to manage unexpected failure. Contrast:
- Reliability engineering. Ensuring that a system works as expected and does not fail. Dealing with "known knowns" through practices including testing (backed by appropriate operational practices).
- Resilience engineering. Accepting that there are various predictable failure modes within a system, and designing and architecting the system to be resilient to those "known unknown" failures.
- Safety engineering. Accepting that not all circumstances are foreseeable ("unknown unknowns"), but that the system, while it may no longer be able to fulfil its mission, still needs to behave "safely": to protect the physical, financial, legal, informational, and/or reputational security of the organisation.
- Explain that safety engineering requires a different approach, one unfamiliar to many engineers, because it involves finding solutions without working out how the underlying system works.
- Key point. Emphasise: The enemy of safety is hubris.
- Anecdote. Consider the Titanic; engineered to the highest reliability and resilience standards, but a failure in the safety systems: not enough lifeboats! So obvious!
- Follow-up anecdote: the SS Eastland, which became top-heavy with the lifeboats made mandatory after the Titanic, capsized and killed more passengers than the Titanic sinking.
- Safety engineering is hard because it's needed in exceptional circumstances, when the consequences of getting it wrong are highest.
- The enemy of safety is hubris.
- Safety engineering involves the design of controls. Broadly, controls fill three roles:
- Validate the state of the system.
- Validate the (outward-facing) behaviour of the system.
- Contain the damage when something does go wrong.
- Design constraints on controls:
- Independent of the main system.
- Orthogonal to the main system (i.e., not just an independent replication of the main system).
- System-oriented rather than component-oriented.
- Minimal.
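To make the roles and constraints above concrete, here is a minimal sketch of one possible control for a trading system. It is an illustration only, not the controls described in the talk: the class name `OrderFlowControl`, its limits, and the one-second window are all hypothetical. The control sees only outbound orders (outward-facing behaviour), knows nothing about the strategy that produced them (independent and orthogonal), and halts all flow once a limit is breached (containment).

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Tuple


@dataclass
class OrderFlowControl:
    """Hypothetical control that watches outward-facing behaviour only."""

    max_orders_per_window: int       # assumed limit, chosen by the risk owner
    max_notional_per_window: float   # assumed limit, in account currency
    window_seconds: float = 1.0      # assumed sliding-window length
    tripped: bool = False
    _recent: Deque[Tuple[float, float]] = field(default_factory=deque)

    def observe(self, timestamp: float, notional: float) -> bool:
        """Record one outbound order; return True if flow may continue."""
        if self.tripped:
            return False  # containment: stay halted until a human resets
        self._recent.append((timestamp, abs(notional)))
        # Drop observations that have aged out of the sliding window.
        cutoff = timestamp - self.window_seconds
        while self._recent and self._recent[0][0] <= cutoff:
            self._recent.popleft()
        count = len(self._recent)
        total = sum(n for _, n in self._recent)
        if count > self.max_orders_per_window or total > self.max_notional_per_window:
            self.tripped = True
        return not self.tripped

    def reset(self) -> None:
        """Manual reset, intended for use only after an incident review."""
        self.tripped = False
        self._recent.clear()


control = OrderFlowControl(max_orders_per_window=3, max_notional_per_window=1_000_000.0)
# Four orders inside one second exceed the assumed limit of three.
allowed = [control.observe(t, 1_000.0) for t in (0.0, 0.1, 0.2, 0.3, 2.0)]
print(allowed)  # [True, True, True, False, False]
```

Note that the control does not try to decide whether any individual order is correct. It only asserts that the system as a whole is behaving within bounds, and it stays tripped until someone deliberately resets it.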
- Discuss when to invest in safety engineering.
- It's not a typical investment, and you can't calculate its value in the same way. It's more like insurance: the expected value is negative, but you gain protection against catastrophic loss.
- Investment in safety must be the decision of the client since it's for them to decide what their risk profile is.
- Different companies will have different desires to invest. In some cases (particularly where physical safety is involved, but also in other regulated industries) it will likely be a legal requirement.
- The same company may have a different desire to invest over its lifetime. Consider a start-up versus the same company 10 years later, which may by then be worth more than $1B.
- General rule: if you're expecting positive value from safety engineering, you probably should be spending more time on reliability and resilience.
- Our management was surprised by our answer to the Knight Capital question because we'd never really talked about it. We found a mismatch between their "low risk" expectations and our "be in the market" assumptions. It required a big cultural shift.
- Optional aside: the finance industry approach to quantifying operational loss, the Loss Distribution Approach (LDA) from Basel II. It's problematic for this kind of case: it requires lots of good data, it is backward-looking rather than predictive, and it struggles with unlikely events (black swans).
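The insurance comparison in the investment discussion above can be illustrated with a small worked example. Every figure below is hypothetical and chosen only to make the arithmetic easy to follow; none comes from the talk or from any real incident. The sketch compares the expected yearly cost and the worst-case cost with and without controls, using exact fractions to avoid floating-point rounding.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Scenario:
    """All figures are hypothetical, per year, in dollars."""

    p_incident: Fraction     # chance of a catastrophic incident in a year
    uncontrolled_loss: int   # loss if the incident runs unchecked
    contained_loss: int      # loss if controls halt the incident early
    control_cost: int        # yearly cost of building and running controls


def expected_annual_cost(s: Scenario, with_controls: bool) -> Fraction:
    """Probability-weighted yearly cost, including the cost of the controls."""
    if with_controls:
        return s.control_cost + s.p_incident * s.contained_loss
    return s.p_incident * s.uncontrolled_loss


def worst_case_cost(s: Scenario, with_controls: bool) -> int:
    """Cost in a year where the incident actually happens."""
    if with_controls:
        return s.control_cost + s.contained_loss
    return s.uncontrolled_loss


def expected_value_of_controls(s: Scenario) -> Fraction:
    """Expected saving from investing; negative means an expected net cost."""
    return expected_annual_cost(s, False) - expected_annual_cost(s, True)


example = Scenario(Fraction(1, 1000), 400_000_000, 10_000_000, 1_000_000)
print(expected_value_of_controls(example))  # -610000
print(worst_case_cost(example, False), worst_case_cost(example, True))  # 400000000 11000000
```

Under these assumed numbers the controls cost $610,000 a year in expectation, yet they cut the worst-case loss from $400M to $11M. Whether that trade is worth making depends on how much catastrophic loss the organisation can survive, which is why the decision belongs to the client.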
Part 2: Implementing a safety program
- Talk through some of the actions we took to establish a safety program.
- A very clear top-down directive
- Deep engagement from senior people
- Focus on risk culture
- An approach to incident reviews which forced deep engagement
- Explicit education (or indoctrination). The idea of stories as organisational memory of why safety is required, and some thoughts on building these kinds of narratives.
- Talk through some of the challenges we found
- We treated it as a software development problem and massively failed. The human processes are a huge component and require constant attention. (Physical world example: annual fire evacuation drills are done for a reason.)
- Developing controls isn't "fun" for most people. For a trading company, with remuneration tied to company profits, working on negative-expected-value projects runs against people's incentives. We made some attempts at getting the financial incentives right ("punishing" risky behaviour, appropriately rewarding risk-minded work), but it wasn't a complete success.
- Not cheap.
- Talk through some of the outcomes.
- Much improved safety (beware hubris!).
- Much improved efficiency at managing incidents: earlier detection, better tools for resolving, faster resolution.
- We have a much improved detector of things being "not quite right", which feeds back into our reliability and resilience work, though its value is hard to quantify.
- There was a dream that, by having a great set of controls, we could be more "pragmatic" about the engineering quality of our primary system: it may go down (for us, not the end of the world), but we would be "safe". This went back and forth a bit. My take: relying more on the safety systems is a potential "moral hazard". The intent is to reduce the overall risk profile. Consequently, bumping into the controls is treated as a serious incident.
- Reliability vs resilience vs safety engineering
- Controls: Validate state, validate action, containment
- Controls: Independent, orthogonal, system focus, minimal
- Choice to invest sits with the client, but requires an open conversation
- The enemy of safety is hubris.
- Safety as a distinct property from reliability and resilience.
- Different types of controls.
- When to invest in safety engineering.
- Practical considerations for instigating a safety program: cultural and technical.
CTOs and Engineering Managers of scale-up and larger companies