Alerting and escalation: it's almost a case of plan, do, check, act.
Monitoring and alerting form part of the critical operational services for any modern, technology-savvy business. Whether these services are derived from internal tools or from offerings such as Logentries and Upguard, the common factor is that an IT operations group needs to put some thought into how to get the best out of them.
Ultimately, monitoring and alerting are only as good as you make them. Every organisation will need some specific tuning to suit the software and systems being monitored, as well as the team or individuals who will receive the alerts and log messages.
If you need to get the ball rolling with your colleagues, here are some easy entry points to the discussion:
- When (i.e. what hours)?
- How many repetitions?
- How can we silence alarms?
- When does my C-suite get woken up?
- Using alerts to carry ‘informational’ messages dilutes the impact of genuine warnings.
- Overly broad audiences, i.e. don’t alert EVERYBODY at once unless you really mean it.
- Unstoppable alerting – obtain, derive, or cajole the ability to silence an alert while dealing with an issue, to prevent unnecessary escalation – and ensure this functionality is not misused.
- Tier your alerts, i.e. “first responder”, “escalation level 1”, “escalation level 2”, “wake up the CTO”; or whatever roles and responsibilities fit your organisation.
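As a starting point, the tiers in the last bullet could be captured in a small configuration structure. This is only an illustrative sketch – the tier names, groups, channels, and timings are made up for the example, not taken from any particular tool:

```python
# Hypothetical alert-tier configuration: who is notified, via which
# channels, and how long before an unacknowledged alert escalates.
ALERT_TIERS = [
    {"name": "first responder",    "notify": ["frontline-oncall"],
     "channels": ["email", "dashboard"], "escalate_after_min": 15},
    {"name": "escalation level 1", "notify": ["ops-engineers", "frontline-manager"],
     "channels": ["email", "sms"],       "escalate_after_min": 60},
    {"name": "escalation level 2", "notify": ["senior-engineers", "product-manager"],
     "channels": ["sms", "phone"],       "escalate_after_min": 30},
    {"name": "wake up the CTO",    "notify": ["cto"],
     "channels": ["phone"],              "escalate_after_min": None},
]

def next_tier(current_index):
    """Return the next tier in the chain, or None if we're at the top."""
    if current_index + 1 < len(ALERT_TIERS):
        return ALERT_TIERS[current_index + 1]
    return None
```

Even a toy structure like this forces the useful questions: who is in each tier, how they are contacted, and how long each tier gets before the problem moves up the chain.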
In short, we can’t expect to just use ‘defaults’; we need to plan how alerting will work.
Simplistic example, spot the holes!
Below I’ve created a fictitious ‘escalation map’ that shows how alerts might be handled in a company:
Level 1: Front-line support
Incident detected at time T; the clock starts, and an email and dashboard alarm are generated. Level 1 has 15 minutes to acknowledge and resolve; if not, an escalation alarm goes to level 2 and the incident moves to the state ‘escalated – level 2’.
Level 2: Operations engineers / Application engineers + Front-line Team Manager
The clock keeps ticking until T + 60 minutes, when an escalation alarm is sent to level 3 if the issue is not acknowledged and resolved; if escalated, the incident state is now ‘escalated – level 3’.
Level 3: Senior engineers / Product Manager(s)
The clock continues; there is a further 30 minutes to acknowledge, resolve, and suppress; if escalated, the incident state is modified to ‘escalated – level 4’.
Level 4: Developers + Dev Manager
Get the developers on the line, use text messages, email, telephone, semaphore, whatever works, our hair is on fire and we’re drawing straws to see who calls the boss.
Level 5: CTO / IT Director
We don’t ever really want to be here.
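The escalation map above can be sketched as a tiny state machine. The function below is illustrative only – the level names and cumulative deadlines (15 minutes, then T + 60, then a further 30) are taken from the fictitious map, and `incident_state` is a made-up helper, not part of any real tool:

```python
# Cumulative deadlines (minutes after detection) at which an
# unacknowledged incident escalates, per the fictitious map:
# 15 min at level 1, until T+60 at level 2, a further 30 min at level 3.
ESCALATION_DEADLINES = [
    (15, "level 1: front-line support"),
    (60, "escalated - level 2: ops/app engineers + front-line manager"),
    (90, "escalated - level 3: senior engineers / product manager(s)"),
]

def incident_state(minutes_since_detection, acknowledged=False):
    """Return the escalation state for an incident at T + minutes."""
    if acknowledged:
        return "acknowledged"  # the clock stops once someone owns the incident
    for deadline, state in ESCALATION_DEADLINES:
        if minutes_since_detection < deadline:
            return state
    # Past every deadline: developers get called, and level 5 looms.
    return "escalated - level 4: developers + dev manager"
```

Walking through it: at T + 10 the incident is still with front-line support, at T + 30 it has escalated to level 2, and past T + 90 the developers are on the line – exactly the kind of timeline worth agreeing on before an incident, not during one.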
Even scribbling out a short set of escalations like this can help shape the thinking around how we deal with incidents, and also give some focus on classifying them.
I see many people who think the first thing to do is shove all their systems into a shiny new monitoring and alerting mechanism without thinking about how it will be used.
It’s extremely useful to ‘have a plan’ for when things go wrong – not if, because they will go wrong. Knowing how long we’ve got to fix a problem before the CTO is called, or before n% of customers will see its manifestation, sharply focuses the mind.