What Being On Call for a Blockchain System Really Taught Me
I used to think on-call was about reacting quickly.
Production taught me it’s about thinking clearly when clarity is scarce.
When Alerts Don’t Explain Anything
The first few incidents were frustrating. Alerts fired, metrics spiked, logs filled up. None of them explained what users were actually experiencing.
I realized something uncomfortable: I didn’t fully understand my own system under stress.
Pressure Changes Decision-Making
Under pressure:
- You avoid risky fixes
- You repeat familiar actions
- You delay structural changes
None of that is irrational. It’s human.
But systems designed without this reality in mind quietly fail their operators.
The Moment That Changed My Approach
During one incident, we stabilized the system without ever understanding the root cause.
That felt like success. It wasn’t.
It meant we had postponed learning.
What I Do Differently Now
After enough nights like that, I changed how I build:
- I design for diagnosis, not just uptime
- I reduce moving parts during incidents
- I assume confusion will exist and plan around it
On-call doesn’t just test systems.
It exposes design assumptions.
Why I’m Writing This
Most production lessons aren’t visible from dashboards.
They live in the moments when someone has to decide what to do next.
This was one of mine.
— Peesh Chopra
Being on call exposes problems, but incident response is where systems are truly tested. In a separate write-up, I break down how real production blockchain incidents unfold, what actually matters during outages, and why most response playbooks fail under pressure.
Read the full incident response perspective here: Peesh Chopra on Incident Response in Production Blockchain Systems