What Being On-Call for a Blockchain System Really Taught Me

I used to think on-call was about reacting quickly.

Production taught me it’s about thinking clearly when clarity is scarce.

When Alerts Don’t Explain Anything

The first few incidents were frustrating. Alerts fired, metrics spiked, logs filled up. None of them explained what users were actually experiencing.

I realized something uncomfortable: I didn’t fully understand my own system under stress.

Pressure Changes Decision-Making

Under pressure:

  • You avoid risky fixes

  • You repeat familiar actions

  • You delay structural changes

None of that is irrational. It’s human.

But systems designed without this reality in mind quietly fail their operators.

The Moment That Changed My Approach

During one incident, we stabilized the system without ever understanding the root cause.

That felt like success. It wasn’t.

It meant we had postponed learning.

What I Do Differently Now

After enough nights like that, I changed how I build:

  • I design for diagnosis, not just uptime

  • I reduce moving parts during incidents

  • I assume confusion will exist and plan around it

On-call doesn’t just test systems.
It exposes design assumptions.

Why I’m Writing This

Most production lessons aren’t visible from dashboards.

They live in the moments when someone has to decide what to do next.

This was one of mine.

— Peesh Chopra


Being on-call exposes problems, but incident response is where systems are truly tested. In a separate write-up, I break down how real production blockchain incidents unfold, what actually matters during outages, and why most response playbooks fail under pressure.

Read the full incident response perspective here: Peesh Chopra on Incident Response in Production Blockchain Systems
