Mediary Fault Drill

High-Level Goal

Understand how large distributed systems fail, how we detect the failure, and how we recover from it.

Network Failures vs. Business Symptoms

Not all failures look the same.

Fault Type	Business Symptoms	Why This Matters
Packet loss (50%)	Latency ↑, occasional failures	Real-world networks usually degrade gradually rather than failing all at once
Latency injection (+1 s)	Latency ↑, no failures	High latency does not always mean an outage; business-logic resilience (e.g., retries and timeouts) determines the impact
Machine unreachable / NIC down	10–30 s of errors during failure and recovery	Makes load balancer and client connection-pooling behavior visible
Graceful shutdown (supervisorctl stop)	No user impact	Shows a healthy service shutdown pattern
Machine restart	Errors during shutdown only	Expected if connections are not drained before the machine goes down
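The table names retry / timeout as the resilience mechanism that decides whether injected latency turns into failures. A minimal sketch of that pattern follows. The function name and argument order are hypothetical, and it assumes GNU coreutils `timeout` is installed; it is not the retry logic used by any real Mediary client.

```bash
# retry_with_timeout ATTEMPTS PER_TRY_SECONDS BACKOFF_SECONDS CMD [ARGS...]
# Hypothetical helper: runs CMD up to ATTEMPTS times, bounding each try.
# CMD must be an external program, because timeout(1) cannot run shell functions.
retry_with_timeout() {
  local attempts=$1 per_try=$2 backoff=$3
  shift 3
  local i
  for (( i = 1; i <= attempts; i++ )); do
    # timeout(1) terminates the command if it runs longer than per_try seconds
    if timeout "$per_try" "$@"; then
      return 0
    fi
    # Pause between attempts, but not after the final one
    if (( i < attempts )); then
      sleep "$backoff"
    fi
  done
  return 1
}
```

Example: `retry_with_timeout 3 2 1 curl -fsS "$URL"`, where `$URL` stands in for a health endpoint. With +1 s of injected latency, a 2 s per-try limit still succeeds; a per-try limit below 1 s would turn the same fault into errors.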

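Detection queries (PromQL):

Any cross-ping failure in the last 5 minutes:
increase(takumi_auto_cross_ping_failure_total[5m]) > 0

Cross-ping latency above 300 ms:
takumi_auto_cross_ping_latency_ms > 300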

Ideal mediary latency values:
- same region: 70–80ms
- cross region: 100–200ms
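As a rough sketch, an observed latency can be compared against the ideal ranges above and the 300 ms alert threshold. The function name and the "elevated" label are hypothetical, and values below the ideal range are simply reported as ideal.

```bash
# latency_status SCOPE MS  -> prints ideal | elevated | alert
# SCOPE is "same" or "cross" (region); MS is an integer latency in milliseconds.
latency_status() {
  local scope=$1 ms=$2 ideal_max
  case "$scope" in
    same)  ideal_max=80 ;;    # upper bound of the same-region ideal range
    cross) ideal_max=200 ;;   # upper bound of the cross-region ideal range
    *)     echo "usage: latency_status same|cross MS" >&2; return 2 ;;
  esac
  if (( ms > 300 )); then
    echo alert                # matches the "> 300" alert expression
  elif (( ms > ideal_max )); then
    echo elevated             # hypothetical label: above ideal, below the alert line
  else
    echo ideal
  fi
}
```

For example, `latency_status cross 250` prints `elevated`: worse than the ideal cross-region range, but not yet high enough to fire the latency alert.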

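Graceful stop (sends SIGTERM by default): supervisorctl stop / pkill

Forced kill (sends SIGKILL, which the process cannot catch): kill -9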

Use graceful shutdown to enable:
- zero-downtime deployment
- smooth blue/green rollouts
- safe restarts during production incidents
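To make the difference concrete, here is a minimal sketch of a worker that shuts down gracefully. `run_worker` is a hypothetical stand-in for a real service loop, not the actual Mediary process. SIGTERM only sets a flag, so the unit of work in flight finishes before the process exits, whereas SIGKILL would end it immediately.

```bash
# Hypothetical worker loop illustrating graceful shutdown on SIGTERM.
run_worker() {
  stopping=0
  # SIGTERM only sets a flag. SIGKILL (kill -9) cannot be trapped at all.
  trap 'stopping=1' TERM
  while (( stopping == 0 )); do
    sleep 0.1                  # stand-in for handling one request
  done
  # Reached only after the current unit of work has completed
  echo "drained: no work in flight, exiting"
}
```

Running `run_worker &` and then `kill -TERM` on its PID prints the "drained" line; `kill -9` on the same PID ends the loop without it, which is the case where clients see errors.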

tc command: simulate network issues in production-like environments
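A sketch of the `tc` netem invocations that correspond to the packet-loss and latency rows of the fault table. The helper only prints the commands so they can be reviewed first. The function name is hypothetical and the interface name `eth0` is an assumption; check `ip link` for the real one.

```bash
# Interface to shape; eth0 is an assumed default.
IFACE="${IFACE:-eth0}"

# netem_cmd loss|latency|clear
# Prints (does not run) the tc command for one fault from the drill.
netem_cmd() {
  case "$1" in
    loss)    echo "tc qdisc add dev $IFACE root netem loss 50%" ;;      # 50% packet loss
    latency) echo "tc qdisc add dev $IFACE root netem delay 1000ms" ;;  # +1 s latency
    clear)   echo "tc qdisc del dev $IFACE root" ;;                     # remove the fault
    *)       echo "usage: netem_cmd loss|latency|clear" >&2; return 1 ;;
  esac
}
```

The printed commands need root to run. Always finish a drill with the `clear` command, since the netem qdisc otherwise stays in place until the interface is reset or the machine reboots.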

Pro-tip

date; <command>: prints the current time just before the command runs, giving a precise timestamp to correlate with monitoring graphs
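One possible extension of the same idea is a small wrapper that stamps both the start and the end of a command in UTC, so the whole fault window can be marked on a graph. The name `stamped` is hypothetical.

```bash
# stamped CMD [ARGS...]
# Prints a UTC timestamp before and after running CMD, and returns CMD's exit status.
stamped() {
  local rc=0
  echo "start $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  "$@" || rc=$?
  echo "end   $(date -u +%Y-%m-%dT%H:%M:%SZ) rc=$rc"
  return "$rc"
}
```

Example: `stamped sleep 30` marks a 30-second window. Dashboards often display local time, so check the graph's timezone setting before comparing it with the UTC stamps.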

Detailed usage: Fault Injection in Distributed Systems

Symptom	Likely Root Cause
High latency, no errors	congestion, latency injection, slow path
High latency + random errors	packet loss / jitter
Errors only at start/end	node leaving/joining, LB failover
Continuous errors	full outage / route blackhole
No errors during deploy	graceful shutdown working
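The symptom table can be read as a lookup. The sketch below encodes it as a function. The function name and the argument vocabulary (`high`/`normal`, `none`/`random`/`edges`/`continuous`) are hypothetical labels chosen for illustration, where `edges` means errors only at the start or end of the fault window.

```bash
# diagnose LATENCY ERRORS
#   LATENCY: high | normal
#   ERRORS:  none | random | edges | continuous
# Prints the likely root cause from the symptom table.
diagnose() {
  case "$1:$2" in
    high:none)    echo "congestion, latency injection, slow path" ;;
    high:random)  echo "packet loss / jitter" ;;
    *:edges)      echo "node leaving/joining, LB failover" ;;
    *:continuous) echo "full outage / route blackhole" ;;
    normal:none)  echo "no fault signature; during a deploy, graceful shutdown working" ;;
    *)            echo "unclassified" ;;   # combination not covered by the table
  esac
}
```

For example, `diagnose high random` prints `packet loss / jitter`. The output is a starting hypothesis for the investigation, not a conclusion.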

Interview Talking Points

Mediary Fault Drill Interview Points