Step 1: Clarify & Gather Context
“Before diving into debugging, I’d want to understand the scope and context:”
Questions to ask:
- “How high is the latency? What’s normal vs current? (e.g., 100ms → 2s?)”
- “When did this start? Sudden spike or gradual increase?”
- “Is it affecting all users or specific segments? (region, device, user type)”
- “Is it all endpoints or specific ones?”
- “Any recent deployments or infrastructure changes?”
- “What’s the traffic pattern? Normal load or spike?”
Step 2: Check Monitoring & Metrics
“I’d start by looking at our observability stack to identify WHERE the latency is coming from:”
Application Metrics
- Request latency percentiles (p50, p95, p99) - which percentile is affected?
- Error rates - are errors correlated with high latency?
- Request throughput - has QPS increased?
- Response size - are we returning more data?
Infrastructure Metrics
- CPU usage - are we CPU bound?
- Memory usage - memory leak? GC pressure?
- Network I/O - bandwidth saturation?
- Disk I/O - slow reads/writes?
- Connection pool exhaustion - are we out of DB connections?
Dependency Latency
- Database query times - slow queries?
- External API calls - third-party service degradation?
- Cache hit rates - is cache working?
- Message queue lag - backlog building up?
Tools to mention: “I’d use tools like Grafana/Prometheus for metrics, Jaeger/Zipkin for distributed tracing, and application logs.”
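To make the percentile checks concrete, here is a minimal sketch of exposing per-endpoint latency histograms from a Go service, assuming the standard prometheus/client_golang library; the metric name, endpoint label, and port are illustrative:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Per-endpoint latency histogram; dashboards derive p50/p95/p99 from it
// (e.g. histogram_quantile over the bucket series in Grafana).
var requestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Request latency by endpoint.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"endpoint"},
)

// instrument wraps a handler and records how long each request took.
func instrument(endpoint string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next(w, r)
		requestDuration.WithLabelValues(endpoint).Observe(time.Since(start).Seconds())
	}
}

func main() {
	prometheus.MustRegister(requestDuration)
	http.HandleFunc("/orders", instrument("/orders", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus
	http.ListenAndServe(":8080", nil)
}
```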
Step 3: Analyze Distributed Traces
“Next, I’d look at distributed tracing to see the request flow:”
Example trace breakdown:
Total latency: 2000ms
├─ API Gateway: 5ms
├─ Auth Service: 10ms
├─ Main Service: 1850ms
│ ├─ Business Logic: 45ms
│ ├─ Database Query: 1800ms ← BOTTLENECK FOUND
│ └─ Cache Check: 5ms
└─ Response Serialization: 135ms
“This tells me exactly where time is spent. If database queries are taking 1800ms when they should take 50ms, that’s my smoking gun.”
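As a rough illustration of how a trace like that gets produced, here is a minimal OpenTelemetry sketch in Go. It assumes a tracer provider and exporter (e.g. to Jaeger or Zipkin) are configured elsewhere; the span names and sleeps are illustrative stand-ins for real work:

```go
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

var tracer = otel.Tracer("main-service")

// handleRequest opens one span per stage, so the trace viewer shows exactly
// how the total latency breaks down across child spans.
func handleRequest(ctx context.Context) {
	ctx, span := tracer.Start(ctx, "handle_request")
	defer span.End()

	businessLogic(ctx)
	queryOrders(ctx)
}

func businessLogic(ctx context.Context) {
	_, span := tracer.Start(ctx, "business_logic")
	defer span.End()
	time.Sleep(45 * time.Millisecond) // placeholder work
}

func queryOrders(ctx context.Context) {
	_, span := tracer.Start(ctx, "db_query")
	defer span.End()
	span.SetAttributes(attribute.String("db.statement", "SELECT ... FROM orders"))
	time.Sleep(1800 * time.Millisecond) // the bottleneck shows up as one long child span
}

func main() {
	handleRequest(context.Background())
}
```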
Step 4: Investigate the Root Cause
Scenario A: Database is Slow
Investigation steps:
- Check slow query logs
  - Which queries are slow?
  - Look at EXPLAIN plans - missing indexes? Full table scans?
- Database metrics
  - Connection pool utilization - running out of connections?
  - Lock contention - queries waiting on locks?
  - Replication lag - read replicas behind?
- Data growth
  - Did table size explode? (e.g., 1M rows → 100M rows)
  - Are indexes still effective?
Example answer
I’d run EXPLAIN on the slow queries. Let’s say I find a query doing a full table scan on an orders table that grew from 1M to 50M rows. The query is:

SELECT * FROM orders
WHERE user_id = ? AND status = 'pending'
ORDER BY created_at DESC
LIMIT 20

I’d check whether there’s a composite index on (user_id, status, created_at). If not, that’s likely the issue.
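A minimal sketch of checking that plan from code, assuming PostgreSQL and the lib/pq driver; the connection string and the hard-coded user_id are illustrative:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // PostgreSQL driver
)

func main() {
	db, err := sql.Open("postgres", "postgres://app:secret@localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Print the plan: a "Seq Scan on orders" here would confirm the missing index.
	rows, err := db.Query(`EXPLAIN ANALYZE
		SELECT * FROM orders
		WHERE user_id = 42 AND status = 'pending'
		ORDER BY created_at DESC
		LIMIT 20`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var line string
		if err := rows.Scan(&line); err != nil {
			log.Fatal(err)
		}
		fmt.Println(line)
	}

	// If the plan shows a full scan, the likely fix (run as a migration):
	// CREATE INDEX CONCURRENTLY idx_orders_user_status_created
	//     ON orders (user_id, status, created_at DESC);
}
```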
Scenario B: High CPU/Memory
Investigation steps:
- Profile the application
  - Use pprof (Go), JProfiler (Java), etc.
  - Identify hot paths in code
- Check for resource leaks
  - Memory leak - growing heap, GC thrashing
  - Goroutine/thread leak - too many concurrent operations
- Look for inefficient code
  - N+1 query problem
  - Inefficient algorithms
  - Serialization/deserialization bottlenecks
Example answer
If CPU is at 100%, I’d take a CPU profile. Let’s say I find 80% of CPU time is spent in JSON serialization. I’d investigate:
- Are we serializing huge objects?
- Can we use more efficient serialization (protobuf)?
- Can we paginate the response?
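A minimal sketch of wiring up that CPU profile with Go’s built-in net/http/pprof; the /report handler, ports, and payload size are illustrative:

```go
package main

import (
	"encoding/json"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
)

func main() {
	// Expose profiling on a separate, non-public port.
	go http.ListenAndServe("localhost:6060", nil)

	mux := http.NewServeMux()
	mux.HandleFunc("/report", func(w http.ResponseWriter, r *http.Request) {
		// If serialization dominates the CPU profile, it shows up right here.
		big := make([]map[string]any, 10000)
		json.NewEncoder(w).Encode(big)
	})
	http.ListenAndServe(":8080", mux)
}

// Capture a 30-second CPU profile and open the interactive viewer:
//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// Then `top` and `web` reveal the hot functions, e.g. encoding/json Encode calls.
```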
Scenario C: External Dependency Issues
Investigation steps:
- Check dependency health
  - Is the third-party API slow or down?
  - Circuit breaker triggered?
- Network issues
  - DNS resolution slow?
  - Network latency increased?
- Rate limiting
  - Are we being rate limited?
Example answer
If I see our payment gateway calls taking 5 seconds when they normally take 200ms, I’d:
- Check their status page
- Implement/verify timeouts (don’t wait forever)
- Check circuit breaker status
- Consider fallback strategies
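A minimal sketch of the timeout piece in Go, assuming the payment gateway is called over HTTP; the URL and the durations are illustrative:

```go
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"time"
)

// The client-level timeout caps every call; the per-request context deadline is
// tighter still, so one slow dependency cannot hold our workers for 5+ seconds.
var paymentClient = &http.Client{Timeout: 2 * time.Second}

func chargeCustomer(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 800*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, "https://payments.example.com/charge", nil)
	if err != nil {
		return err
	}
	resp, err := paymentClient.Do(req)
	if err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			// Fail fast: queue the charge, return "pending", or trip a circuit breaker.
			return errors.New("payment gateway timed out")
		}
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	if err := chargeCustomer(context.Background()); err != nil {
		log.Println("charge failed:", err)
	}
}
```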
Scenario D: Traffic Spike
Investigation steps:
- Analyze traffic patterns
  - Marketing campaign? Flash sale? Bot attack?
  - Which endpoints are hot?
- Resource saturation
  - Need to scale horizontally?
  - Auto-scaling working?
Example answer
If traffic increased 10x due to a flash sale, I’d check:
- Is auto-scaling keeping up?
- Are we hitting rate limits?
- Is the database the bottleneck? (can scale app servers but DB is fixed)
- Do we need to enable caching or queue requests?
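A minimal load-shedding sketch for that situation, assuming the golang.org/x/time/rate token-bucket package; the 500 req/s limit and burst size are illustrative:

```go
package main

import (
	"net/http"

	"golang.org/x/time/rate"
)

// Allow a sustained 500 req/s with bursts of 100; beyond that, return 429s
// so the fixed-size database behind us is not driven into the ground.
var limiter = rate.NewLimiter(rate.Limit(500), 100)

func withLoadShedding(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/flash-sale", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", withLoadShedding(mux))
}
```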
Step 5: Immediate Mitigation
“While investigating root cause, I’d implement quick mitigations:”
- Scale up/out - Add more instances if resource-bound
- Enable caching - Cache expensive queries/API calls
- Rate limiting - Protect the system from overload
- Circuit breakers - Fail fast on slow dependencies (a minimal sketch follows this list)
- Rollback - If recent deployment caused it
- Query optimization - Add missing indexes
- Increase timeouts - If safe to do so (with caution)
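As a rough illustration of the circuit-breaker mitigation above, here is a minimal hand-rolled sketch in Go (in practice you would likely reach for an existing resilience library); the failure threshold and cooldown are illustrative:

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// Breaker opens after maxFailures consecutive errors and rejects calls until
// the cooldown has passed, so callers fail fast instead of queueing behind a
// slow dependency.
type Breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
}

var ErrOpen = errors.New("circuit open: failing fast")

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen // still open: reject without touching the dependency
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now() // (re)open the breaker
		}
		return err
	}
	b.failures = 0 // a success closes the breaker again
	return nil
}

func main() {
	b := NewBreaker(5, 30*time.Second)
	_ = b.Call(func() error {
		// call the slow dependency here
		return nil
	})
}
```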
Step 6: Long-term Fixes
“After stabilizing, I’d implement permanent solutions:”
Database optimization
- Add proper indexes
- Partition large tables
- Implement read replicas
- Cache frequently accessed data
Code optimization
- Fix N+1 queries
- Batch API calls
- Use async processing for heavy tasks
- Implement pagination
Architecture changes
- Move heavy operations to background jobs (see the worker-pool sketch at the end of this step)
- Implement CQRS (separate read/write paths)
- Use CDN for static content
- Implement proper caching layers (Redis, CDN)
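A minimal sketch of moving heavy work to background jobs with an in-process worker pool; in a real system the channel would usually be replaced by a durable queue (SQS, Kafka, etc.), and the Job payload is illustrative:

```go
package main

import (
	"log"
	"sync"
	"time"
)

// The request path only enqueues a Job and returns immediately; a small pool
// of workers drains the queue in the background.
type Job struct{ OrderID int }

func worker(id int, jobs <-chan Job, wg *sync.WaitGroup) {
	defer wg.Done()
	for job := range jobs {
		// e.g. render an invoice PDF, send email, recalculate a report
		time.Sleep(100 * time.Millisecond)
		log.Printf("worker %d processed order %d", id, job.OrderID)
	}
}

func main() {
	jobs := make(chan Job, 1000) // buffered queue
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go worker(i, jobs, &wg)
	}

	// Simulate request handlers enqueueing work and responding fast.
	for orderID := 1; orderID <= 10; orderID++ {
		jobs <- Job{OrderID: orderID}
	}
	close(jobs)
	wg.Wait()
}
```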
Complete Answer
Example
“First, I’d gather context: When did this start? Is it all users or specific segments? Any recent changes? What’s the actual latency - are we talking 2 seconds instead of 200ms?
Second, I’d check our monitoring. I’d look at application metrics like p95/p99 latency, error rates, and throughput. Then infrastructure metrics - CPU, memory, disk I/O. And finally dependency latency - database, cache, external APIs.
Third, I’d use distributed tracing to see exactly where time is spent in the request flow. For example, if I see a request taking 2 seconds and the trace shows 1.8 seconds in database queries, that’s my bottleneck.
Fourth, I’d drill down. Let’s say database is the issue. I’d check slow query logs, run EXPLAIN on suspicious queries, and check for missing indexes. Maybe I find a query doing a full table scan on a table that grew from 1 million to 50 million rows because we’re missing a composite index.
Fifth, immediate mitigation. While investigating, I’d add the missing index, scale up database resources if needed, or enable query caching to reduce load.
Finally, long-term fixes. I’d review our indexing strategy, implement proper monitoring alerts for slow queries, and maybe consider sharding the table if it’s growing too large.
Throughout this, I’d document findings and communicate with the team, especially if it’s affecting users.”
The Underlying Logic: What They’re Evaluating
- Systematic approach - not random guessing
- Understanding of full stack - app, database, network, infrastructure
- Use of tools - monitoring, tracing, profiling
- Prioritization - quick wins vs long-term fixes
- Communication - can you explain your process clearly?
- Real-world experience - have you actually done this before?
Further Exploration
If they probe deeper, be ready to discuss:
Database Debugging
- EXPLAIN plan analysis
- Index selection (B-tree vs Hash, composite indexes)
- Lock contention (row-level vs table-level)
- Connection pool tuning
- Query plan cache invalidation
Memory Issues
- Heap dump analysis
- GC logs (Stop-the-world pauses?)
- Memory leak detection
- Object retention analysis
Network Issues
- TCP connection exhaustion
- DNS resolution caching
- Keep-alive vs short-lived connections
- Network latency vs application latency