My Experience in Improving On-Call Processes
Since working at startups often means being responsible for systems at all hours of the day and night, I have been on-call in some capacity at every company I have worked for. There are good and bad approaches to setting up an on-call process, and I hope that the experiences shared here will be of some help in lessening the burden.
I believe it is the responsibility of each team member to make the next on-call shift easier by fine-tuning thresholds, adding more coverage, and acting on lessons learned from postmortems (on this, Atlassian have a good process for when a postmortem should be completed, what it should cover, and what the outcomes should be).
These are the principles I use to make the on-call shift process easier for everyone involved:
1. Have some documentation in place for when an alert goes off
This was by far the most useful thing I found when starting on-call at a new company. When an alert goes off, there should be a runbook telling the on-call engineer what the likely cause is and what action should be taken. So that any engineer can resolve an issue (regardless of how well they know the system), the runbook should describe what the alert is capturing, how to assess the severity from logs or other resources, and who within the company should be kept informed until the issue is resolved. For example, if a third-party platform has an outage, who does the on-call engineer contact to get it fixed? And if the outage affects customers, does the customer operations team need to be notified so they can answer incoming questions about it?
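As a sketch of what such a runbook might look like, here is an illustrative template (the alert name, links, and contacts are invented placeholders, not from any real system):

```markdown
# Alert: payments-queue-backlog (placeholder name)

## What this alert captures
Messages have been sitting unprocessed on the payments queue
for longer than the configured threshold.

## Assessing severity
- Check the queue-depth dashboard: <link>
- Check the consumer logs for crash loops or poison messages: <link>
- A short spike that is already draining is a warning;
  a flat or growing backlog is critical.

## Actions
1. Restart the consumer if it has crashed.
2. If a single malformed message is blocking the queue,
   move it to the dead-letter queue.

## Who to inform
- Third-party outage: contact the vendor's support channel.
- Customer impact: notify the customer operations team
  so they can field incoming questions.
```

The exact sections matter less than the fact that every alert answers the same three questions: what does this mean, how bad is it, and who needs to know.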
2. Measure what is important to the system
Alerts should be based on useful metrics. This could mean a set of purely quantitative metrics alongside metrics that encode the development team's expectations of how the system should behave. For example, for a process that runs throughout the day, such as a queue processor, messages sitting unprocessed for 1 minute should raise an alert, because 5 minutes without processing events could indicate a catastrophic failure. Getting the balance right between alerts that require action and ones that merely make engineers aware of the current state of the system can be difficult, so it will take some adjusting over time.
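The queue-processor example above can be sketched as a small severity check. This is illustrative only; the thresholds and function name are assumptions, not taken from any particular monitoring tool:

```python
# Hypothetical thresholds matching the example above: raise a warning after
# 1 minute of unprocessed messages, and treat 5 minutes as critical.
WARN_AFTER_SECONDS = 60
CRITICAL_AFTER_SECONDS = 300

def classify_queue_lag(oldest_unprocessed_age_seconds: float) -> str:
    """Return an alert level based on how long the oldest message has waited."""
    if oldest_unprocessed_age_seconds >= CRITICAL_AFTER_SECONDS:
        return "critical"  # page the on-call engineer immediately
    if oldest_unprocessed_age_seconds >= WARN_AFTER_SECONDS:
        return "warning"   # surface it, but no page yet
    return "ok"

print(classify_queue_lag(30))   # ok
print(classify_queue_lag(90))   # warning
print(classify_queue_lag(600))  # critical
```

The warning tier is what gives the balance the paragraph describes: it makes engineers aware of the state of the system without demanding action.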
3. Gauge thresholds and don’t be afraid to change them as you scale
As the company and its systems grow, the expected thresholds for your alerts will change. Take, for instance, an alert based on the number of clients connected at once; having an upper limit is sensible in order not to overload the database. As the system scales, the number of connected clients will increase. If the system could only handle 50 clients when the alert was set up, then once it can handle 500 clients, an alert going off at 51 connections is no longer useful. Part of being on-call should be fine-tuning these thresholds whenever an alert fires but the system is no longer struggling in that state. Critical and non-critical levels can be useful here too: sometimes a warning is sufficient to flag a potential issue, whereas a critical alert means immediate action is required.
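One way to make thresholds scale with the system is to express them as a fraction of current capacity rather than as hard-coded absolute numbers. A minimal sketch, in which the `capacity` parameter and the 0.8/0.95 ratios are assumptions for illustration:

```python
def connection_alert_level(connected: int, capacity: int) -> str:
    """Classify connected-client count relative to current capacity."""
    usage = connected / capacity
    if usage >= 0.95:
        return "critical"  # close to overloading the database
    if usage >= 0.80:
        return "warning"   # worth a look, headroom is shrinking
    return "ok"

# At a capacity of 50 clients, 51 connections is past the limit; once the
# system can handle 500, the same absolute number is nothing to worry about.
print(connection_alert_level(51, 50))   # critical
print(connection_alert_level(51, 500))  # ok
```

With this shape, scaling the system up means updating one capacity figure instead of hunting down every stale absolute threshold.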
4. Put out fires immediately and leave anything that can wait to be fixed properly
Part of being on-call is triaging when an alert goes off: checking whether the system is in a good state and how wide the impact is. This could be something simple like a single user in a bad state, or it could be that hundreds of users are affected and require an immediate solution, or that a third-party dependency is broken and the entire system has ground to a halt. When an issue arises but there is no urgency to fix it immediately, I think it is better to wait until you have time to assess whether something more can be done, whether that is adding a patch or refactoring a whole feature. Being too quick to push out a fix during a crisis can cause worse issues down the line.
5. Keep adding alerts to cover future regressions
This can be part of the bug process: when a serious bug is found that wasn't caught by an alert, one should be added to ensure the issue cannot recur without warning. When running a post-mortem on an incident or bug, a good outcome is discovering what could have prevented the issue and, while patching it, also ensuring that similar scenarios are covered by an alert.
Making on-call easier is certainly a challenge, but I've found that the more time you put into improving the process, the more time on-call engineers have to work on things other than putting out fires.
Overall, if it's your first time on-call, just remember not to worry about reaching out to the second line for help. Everyone goes through the same teething problems when joining a new company, and being on-call exposes you to far more of the system than you have seen before.
Every team member who joins the on-call rota will bring their own experiences to the process and make it better. I've found that a good measure of a company is how seriously it takes the on-call process and how much emphasis it puts on improving it for everyone involved.