It's 2017, and everybody now knows how important it is to monitor their services. But less attention gets paid to whether we're monitoring them *well*. In this talk we'll cover several extremely common failures that can make your monitoring ineffective, or even worse than no monitoring at all.

  • Psychic monitoring. Does your system only let you check for things you predicted might break, or can you ask new questions?
  • Gatekeepers. Do you have to understand an entire massive infra stack to add a single metric?
  • Fractured reality. Do your teams share tools, or does everyone have a different source of truth?
  • Brittle, cobwebby checks. Is it too hard to make changes?
  • Intuition. Do you have the information you need to make data-driven debugging decisions, or do you rely on intuition and smells?
  • Meta-monitoring. Is anyone keeping an eye on how often people are getting paged or woken up, and noticing when it changes?

All of these problems can be overcome without a terrible amount of effort, but most of us don't learn to avoid them until we've suffered many outages and earned the scars. Let's try to accelerate your experience substantially!

 

Target Audience

DevOps

Submitted 5 months ago
