The first time I went on call as a software engineer, it was exciting—and ultimately traumatic.
Since then, I’ve had on-call experiences at multiple other jobs and have grown to really appreciate it as part of the role.
As I’ve progressed through my career, I’ve gotten to help establish on-call processes and run some related trainings.
Here is some of what I wish I’d known when I started my first on-call shift, and what I try to tell each engineer before theirs.
It’s natural to feel a lot of pressure with on-call responsibilities.
You have a production application that real people need to use!
When that pager goes off, you want to go in and fix the problem yourself.
That’s the job, right?
But it’s not.
It’s not your job to fix every issue by yourself.
It is your job to see that issues get addressed.
The difference can be subtle, but important.
When you get that page, your job is to assess what’s going on.
A few questions I like to ask are:
What systems are affected?
How badly are they impacted?
Does this affect users?
With answers to those questions, you can figure out what a good course of action is.
For simple things, you might just fix it yourself!
If it’s a big outage, you’re putting on your incident commander hat and paging other engineers to help out.
And if it’s a false alarm, then you’re putting in a fix for the noisy alert!
(You’re going to fix it, not just ignore it, right?)
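To make that concrete, here’s a sketch of how those assessment questions might play out for a made-up page; the systems and numbers are invented for illustration.

```
Page: p95 latency alert on the checkout API at 02:10.

What systems are affected?   Checkout API, plus the payments worker queue.
How badly are they impacted? p95 up from 300ms to 4s; queue depth climbing.
Does this affect users?      Yes — some checkouts are timing out.

Course of action: user-facing and getting worse, so page the payments
on-call for backup and open an incident channel.
```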
Just remember not to be a hero.
You don’t need to fix it alone, you just need to figure out what’s going on and get a plan.
Related to the previous one, you aren’t going it alone.
Your main job in holding the pager is to assess and make sure things get addressed.
Sometimes you can do that alone, but often you can’t!
Don’t be afraid to call for backup.
People want to be helpful to their teammates, and they want that support available to them, too.
And it’s better to wake me up a little too much than to let me sleep through times when I’m truly needed.
If people are getting woken up a lot, the issue isn’t that you’re calling for backup; it’s that you’re having too many true emergencies.
It’s best to figure out that you need backup early, like 10 minutes in, to limit the damage of the incident.
The faster you figure out other people are needed, the faster you can get the situation under control.
In any incident, adrenaline runs high and people are stressed out.
The key to good incident response is communication in spite of the adrenaline.
Communicating under pressure is a skill, and it’s one you can learn.
Here are a few of the times and ways of communicating that I think are critical:
- When you get on and respond to an alert, say that you’re there and that you’re assessing the situation
- Once you’ve assessed it, post an update; if the assessment is taking a while, post updates every 15 minutes while you do so (and call for backup)
- Once the situation is being handled, update key stakeholders at least every 30 minutes for the first few hours, and then slow down to hourly
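For a rough idea of what that cadence can look like in practice, here’s an invented example of updates posted to a shared channel (the timestamps and details are made up):

```
02:12  Ack'd the page for elevated 5xx on the API. Assessing now.
02:27  Errors trace back to the search service after the 01:50 deploy.
       Paging the search on-call; starting a rollback.
02:55  Rollback complete, error rate back to baseline. Monitoring.
03:30  No recurrence. Closing the incident; notes and post-mortem to follow.
```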
You are also going to have to communicate within the response team!
There might be one standing incident channel, or a new channel for each incident.
Either way, try to over-communicate about what you’re working on and what you’ve learned.
When you’re debugging weird production stuff at 3am, that’s the time you really need to externalize your memory and thought processes into a notes document.
This helps you keep track of what you’re doing, so you know which experiments you’ve run and which things you’ve ruled out as possibilities or determined as contributing factors.
It also helps when someone else comes up to speed!
That person will be able to use your notes to figure out what has happened, instead of you having to repeat it every time someone gets on.
Plus, the notes doc won’t forget things, but you will.
You will also need these notes later to do a post-mortem.
What was tried, what was found, and how it was fixed are all crucial for the discussion.
Timestamps are also critical for understanding the timeline of the incident and the response!
This document should be in a shared place, since people will use it when they join the response.
It doesn’t need to be shared outside of the engineering organization, though, and likely should not be.
It may contain details that lead to more questions than they answer; sometimes, normal engineering things can seem really scary to external stakeholders!
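As one illustration (not a required format, and the details are invented), a running notes doc can be as simple as timestamped bullets:

```
Incident notes — elevated 5xx on the API

02:12  Paged. Error rate ~3%, p95 latency ~4s.
02:20  Hypothesis: the 01:50 deploy regressed something. Reading the diff.
02:31  Ruled out: database CPU and connection counts look normal.
02:40  Confirmed errors started right after the 01:50 deploy. Rolling back.
02:55  Rollback done, error rate recovering. Likely contributing factor:
       missing timeout on the new search client call.
```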
When you’re on call, you get to see things break in weird and unexpected ways.
And you get to see how other people handle those things!
Both of these are great ways to learn a lot.
You’ll also just get exposure to things you’re not used to seeing.
Some of this will be areas that you don’t usually work in, like ops if you’re a developer, or application code if you’re on the ops side.
Some more of it will be the business side of things, like the impact of incidents.
And some will be about the psychology of humans, as you see the logs of a user clicking a button fifteen hundred times (get that person an esports sponsorship, geez).
My time on call has led to a lot of my professional growth as a software engineer.
It has dramatically changed how I work on systems.
I don’t want to wake up at 3am to fix my bad code, and I don’t want it to wake you up, either.
Having to respond to pages and fix things will teach you all the ways they can break, so you’ll write more resilient software that doesn’t break.
And it will teach you a lot about your engineering team, good or bad: how it’s structured and who’s responding to which things.
No one is born skilled at handling production alerts.
You gain these skills by doing, so get out there and do it—but first, watch someone else do it.
No matter how much experience you have writing code (or responding to incidents), you’ll learn a lot by watching a skilled coworker handle incoming issues.
Before you’re the primary for an on-call shift, you should shadow someone for theirs.
This will let you see how they handle things and what the general vibe is.
This isn’t easy to do!
It means that they’ll have to make sure to loop you in even when blood is pumping, so you may have to remind them periodically.
You’ll probably miss out on some things, but you’ll see a lot, too.
When we get paged, it usually feels like a crisis.
If not to us, it sure does to the person who’s clicking that button in frustration, generating a ton of errors, and somehow causing my pager to go off.
But not all alerts are created equal.
If you assess something and figure out that it’s only affecting one or two customers, it’s not time-sensitive, and it’s currently 4am on a Saturday?
Let people know your assessment (and how to reach you if you’re wrong, which you could be) and go back to bed.
Real critical incidents have to be fixed right away, but some things really need to wait.
You want to let them go until later for two reasons.
First is just the quality of the fix.
You’re going to fix things more completely if you’re rested when you’re doing so!
Second, and more important, is your health.
It’s wrong to sacrifice your health (by being up at 4am fixing things) for something non-critical.
Many of us have had bad on-call experiences.
I sure have.
One regret is that I didn’t quit that on-call experience sooner.
I don’t even necessarily mean quitting the job, but pushing back on it.
If I’d stood up for myself and said “hey, we have five engineers, it should be more than just me on call,” and held firm, maybe I’d have gotten that!
Or maybe I’d have gotten a new job.
What I wouldn’t have gotten is the knowledge that you can develop a rash from being too stressed.
If you’re in a bad on-call situation, please try to get out of it!
And if you can’t get out of it, try to be kind to yourself and protect yourself however you can (you deserve better).
Along with taking great notes, you should make sure that you test hypotheses.
What could be causing this issue?
And before that, what even is the problem?
And how do we make it happen?
Write down your answers to these!
Then go ahead and try to reproduce the issue.
After reproducing it, you can try to go through your hypotheses and test them out to see what’s actually contributing to the issue.
This way, you can bisect problem spaces instead of just eliminating one thing at a time.
And since you know how to reproduce the issue now, you can be confident that you do have a fix at the end of it all!
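Here’s a made-up example of what writing that down might look like, to show how one test can split the problem space:

```
Problem:    uploads over ~10 MB fail with a 502; smaller uploads succeed.
Reproduce:  upload a 15 MB file through the web app — fails every time.
Hypotheses: (a) proxy body-size limit, (b) app-side timeout,
            (c) storage permissions.
Bisect:     send the same 15 MB upload directly to the app, skipping the
            proxy — it succeeds, which rules out (b) and (c) together
            and points at (a).
Fix check:  raise the proxy limit, re-run the 15 MB upload, confirm it
            succeeds end to end.
```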
Above all, the thing I want people new to on-call to do?
Just have fun.
I know this might sound odd, because being on call is a big job responsibility!
But I really do think it can be fun.
There’s a certain kind of joy in going through the on-call response together.
And there’s a fun exhilaration to it all.
And there’s the joy of fixing things and really being the competent engineer who handled it with grace under pressure.
Try to make some jokes (at an appropriate moment!) and remember that whatever happens, it’s going to be okay.
Probably.
If this post was enjoyable or useful for you, please share it!
If you have comments, questions, or feedback, you can send them to my personal email.
To get new posts and support my work, subscribe to the newsletter. There is also an RSS feed.