Page only if actionable
Have oncall available and ready to respond if a customer is affected. If an issue can wait until business hours, don’t allow it to page outside business hours.
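The rule above can be sketched as a small routing function. This is a minimal illustration, not a real alerting system's API: the Alert fields, the 09:00 to 17:00 weekday window, and the "page"/"queue" outcomes are all hypothetical, and it assumes that customer impact takes precedence over everything else.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Hypothetical business-hours window; adjust to the team's own schedule.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


@dataclass
class Alert:
    name: str
    customer_affected: bool  # assumption: the alert carries this signal
    can_wait: bool           # True if the issue can wait until business hours


def is_business_hours(now: datetime) -> bool:
    # weekday(): Monday is 0, Friday is 4
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def route(alert: Alert, now: datetime) -> str:
    """Return "page" to wake the oncall, or "queue" to hold until business hours."""
    if alert.customer_affected:
        # Assumption: customer impact always pages, regardless of the clock.
        return "page"
    if alert.can_wait and not is_business_hours(now):
        return "queue"
    return "page"
```

Under these assumptions, an alert that can wait is queued at 02:00 on a weekday and pages at 10:00, while a customer-affecting alert pages at any hour.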
Invest in runbooks
Make it as easy as possible for an oncall engineer to make the right decision when responding to an incident. Incidents are often high-stress, high-pressure events in which we are more prone to mistakes. The oncall engineer should not need to be a hero to mitigate an issue; they should be an effective executor of a predetermined plan.
Escalate when it will make an impact
On the best teams, members have each other’s backs. If an oncall engineer needs help mitigating an issue they were paged for, they should feel comfortable escalating to another engineer who can help. This approach requires a delicate balance. Engineers, particularly more junior ones, need to feel comfortable escalating so that issues are mitigated effectively and in a timely manner. However, if escalations happen too frequently, they put too much strain on more experienced team members. In that case, you may need to increase your focus on training all team members to perform oncall duties effectively.
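One way to notice when that balance is slipping is to track how often pages are escalated and who absorbs them. The sketch below is only an illustration: the Page record and its fields are hypothetical, and what counts as "too frequently" is left for each team to decide.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    responder: str               # engineer who was paged
    escalated_to: Optional[str]  # engineer pulled in to help, or None


def escalation_rate(pages: List[Page]) -> float:
    """Fraction of pages that were escalated to another engineer."""
    if not pages:
        return 0.0
    escalated = sum(1 for p in pages if p.escalated_to is not None)
    return escalated / len(pages)


def escalation_load(pages: List[Page]) -> Counter:
    """Count how many escalations each engineer received."""
    return Counter(p.escalated_to for p in pages if p.escalated_to is not None)
```

A rising rate, or a load concentrated on one or two experienced engineers, suggests that training deserves more attention.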
For larger incidents, appoint an incident coordinator
If an issue is high-impact enough to declare a broader incident, have a pre-trained incident coordinator who is responsible for reporting status and pulling together the needed group of stakeholders. The coordinator is directly responsible for determining whether an incident is ongoing and, based on mitigation needs, who is participating. The coordinator should rely on the mitigating engineers to communicate who and what they need, then make it happen. The coordinator should also ensure an incident does not drag on past the point at which mitigation is still occurring. Knowing how and when to close out an incident is as important as knowing when to open one.