High Level Goal

To understand how big distributed systems break, how we detect it, and how we recover.


Network failures vs. Business symptoms

Not all failures look the same.

Fault TypeBusiness SymptomsWhy This Matters
Packet loss (50%)latency ↑, occasional failuresReal-world networks degrade gradually, not all-at-once
Latency injection (+1s)latency ↑, no failuresHigh latency doesn’t always mean outages; business logic resilience matters (e.g., retry / timeout)
Machine unreachable / NIC down10–30s of errors during fail + recoveryLoad balancer + client connection pooling behavior becomes visible
Graceful shutdown (supervisorctl stop)No user impactShows healthy service shutdown patterns
Machine restartErrors during shutdown onlyExpected IF connections aren’t drained properly

Observability Metrics

  • Detects sudden network degradation
increase(takumi_auto_cross_ping_failure_total[5m]) > 0
  • Detects slowness even when not failing
takumi_auto_cross_ping_latency_ms > 300
  • Ideal mediary latency values:
    • same region: 70–80ms
    • cross region: 100–200ms

Useful Tips

Graceful Shutdown Behavior

supervisorctl stop / pkill

  • sends SIGTERM, stops accepting new requests (i.e., graceful shutdown)
  • allows in-flight requests to finish
  • usually safe and clean → no business impact

kill -9

  • sends SIGKILL
  • forcefully kills process → partial outages

Use Graceful Shutdown to enable:

  • zero downtime deployment
  • smooth blue/green rollouts
  • safe restarts during production incidents

Network Issue Simulation Tool

tc command: simulate network issues in production-like environments

  • e.g., 50% packet loss: tc qdisc add dev eth0 root netem loss 50%
  • 1s latency: tc qdisc add dev eth0 root netem delay 1000ms

Pro-tip

date; <command> — show precise timing for correlating with monitoring graphs


Infra Intuition Cheatsheet

SymptomLikely Root Cause
High latency, no errorscongestion, latency injection, slow path
High latency + random errorspacket loss / jitter
Errors only at start/endnode leaving/joining, LB failover
Continuous errorsfull outage / route blackhole
No errors during deploygraceful shutdown working

Interview Talking Points