It's 2017, and everybody now knows how important it is to monitor their services. But less attention gets paid to whether we're monitoring them *well*. In this talk we'll cover several extremely common failures that can make your monitoring ineffective, or even worse than no monitoring at all.

  • Psychic monitoring. Does your system only let you check for things you predicted might break, or can you ask new questions?
  • Gatekeepers. Do you have to understand an entire massive infra stack to add a single metric?
  • Fractured reality. Do your teams share tools, or does everyone have a different source of truth?
  • Brittle, cobwebby checks. Is it too hard to make changes?
  • Intuition. Do you have the information you need to make data-driven debugging decisions, or do you rely on intuition and smells?
  • Meta-monitoring. Is anyone keeping an eye on how often people are getting paged or woken up, and noticing when it changes?

All of these problems can be overcome without a terrible amount of effort, but most of us don't learn to avoid them until we've suffered many outages and earned the scars. Let's try to accelerate your experience substantially!


Outline/Structure of the Talk

Coming Soon...

Learning Outcome

Coming Soon...

Target Audience


Submitted 4 years ago