Common Follow-ups
Q: “How would you handle a flash sale with 100x normal traffic?”
Answer: “For flash sales, I’d implement several strategies:
Pre-warm cache
- Load popular items into Redis before sale starts
- Reduce cache misses during traffic spike
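A minimal sketch of the pre-warm step, assuming redis-py and a hypothetical list of flash-sale items fetched ahead of time:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def prewarm_flash_sale_cache(items, ttl_seconds=3600):
    """Load item details into Redis before the sale starts.

    `items` is assumed to be a list of dicts with an "item_id" key,
    e.g. the flash-sale catalog pulled from the product service.
    """
    pipe = r.pipeline()
    for item in items:
        pipe.set(f"item:{item['item_id']}", json.dumps(item), ex=ttl_seconds)
    pipe.execute()  # single round trip for the whole batch
```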
Aggressive rate limiting
- Normal: 1000 req/min per user
- Flash sale: 100 req/min per user
- Protects backend from overload
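One simple way to enforce the tighter limit is a fixed-window counter per user in Redis; a sketch assuming redis-py:

```python
import redis

r = redis.Redis()

def allow_request(user_id: str, limit: int, window_seconds: int = 60) -> bool:
    """Fixed-window rate limiter: at most `limit` requests per window."""
    key = f"ratelimit:{user_id}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_seconds)  # start the window on the first request
    return count <= limit

# Normal traffic: allow_request(user_id, limit=1000)
# Flash sale:     allow_request(user_id, limit=100)
```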
Queue system
- Add items to queue for async processing
- Process requests in order instead of real-time
- User sees queue position and estimated wait time
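A rough sketch of the waiting-room queue using a Redis list (redis-py assumed); the list length returned by the push doubles as the user's position:

```python
import redis

r = redis.Redis()
QUEUE_KEY = "flashsale:checkout_queue"

def enqueue_checkout(user_id: str) -> int:
    """Add the user to the queue; RPUSH returns their 1-based position."""
    return r.rpush(QUEUE_KEY, user_id)

def estimated_wait_seconds(position: int, throughput_per_second: int = 50) -> int:
    """Rough wait estimate to show the user (the throughput figure is an assumption)."""
    return position // throughput_per_second

# A worker process drains the queue in order:
# next_user = r.lpop(QUEUE_KEY)
```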
Auto-scaling
- Scale API servers from 10 to 50 based on CPU metrics
- Use Kubernetes HPA or AWS Auto Scaling
Read-only mode
- Temporarily disable non-critical writes
- Focus capacity on cart operations
CDN for static content
- Serve product images and details from CDN
- Reduces load on application servers
Most importantly, communicate with users: show queue position and estimated wait time, and set expectations.”
Q: “What if two users try to add the last item in stock simultaneously?”
Answer: “This is a classic race condition. Solutions:
Option 1: Reserve inventory first (before adding to cart)
- Try to reserve inventory via inventory service
- If reservation fails, return out of stock error
- If succeeds, add to cart with reservation
- Release reservation after 10 minutes if not checked out
- Prevents overselling at cart level
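A sketch of the reserve-then-add flow, assuming a hypothetical inventory_client with a reserve() call and a Redis-backed cart:

```python
import redis

r = redis.Redis()
RESERVATION_TTL = 600  # release after 10 minutes if not checked out

def add_to_cart(user_id: str, item_id: str, inventory_client) -> bool:
    # Hypothetical inventory-service call; returns a reservation id or None.
    reservation_id = inventory_client.reserve(item_id, quantity=1, ttl=RESERVATION_TTL)
    if reservation_id is None:
        return False  # reservation failed: item is out of stock

    # Store the item alongside its reservation so checkout can confirm it later.
    r.hset(f"cart:{user_id}", item_id, reservation_id)
    return True
```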
Option 2: Atomic conditional update in inventory service
- UPDATE inventory SET quantity = quantity - 1 WHERE item_id = ? AND quantity >= 1
- Only one update succeeds: the conditional decrement is atomic, and the row lock serializes concurrent requests
- Second request gets 0 rows affected
- Return out of stock to second user
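The same conditional UPDATE executed from application code; a sketch assuming a psycopg2-style connection to the inventory database:

```python
def decrement_stock(conn, item_id: str) -> bool:
    """Atomically take one unit; returns False if another request got the last one."""
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE inventory SET quantity = quantity - 1 "
            "WHERE item_id = %s AND quantity >= 1",
            (item_id,),
        )
        conn.commit()
        return cur.rowcount == 1  # 0 rows affected means the other request won
```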
Option 3: Accept over-selling, handle at checkout
- Let both users add to cart
- Don’t reserve inventory at cart level
- At checkout, first to pay wins
- Second gets out of stock error with voucher compensation
For Shopee, I’d use Option 3: it gives the better user experience, and the edge case is handled at checkout with compensation for disappointed users.”
Q: “How do you prevent bots from adding items to cart and not buying?”
Answer: “Bot protection strategy:
Rate limiting per IP
- Block IPs making >1000 requests/hour
- Use progressive rate limiting (slow down, then block)
CAPTCHA
- Show CAPTCHA after suspicious behavior detected
- Don’t show on every request (bad UX)
Behavioral analysis
- Bots add items in milliseconds
- Real users browse for seconds
- Track time between actions
- Flag accounts with bot-like patterns
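A very rough sketch of the time-between-actions signal, assuming per-account action timestamps are already being collected:

```python
from statistics import median

def looks_like_bot(action_timestamps: list[float], threshold_seconds: float = 0.5) -> bool:
    """Flag accounts whose median gap between actions is implausibly fast.

    `action_timestamps` is assumed to be a sorted list of epoch seconds for one
    account's recent add-to-cart actions; the threshold is an assumption to tune.
    """
    if len(action_timestamps) < 5:
        return False  # not enough data to judge
    gaps = [later - earlier for earlier, later in zip(action_timestamps, action_timestamps[1:])]
    return median(gaps) < threshold_seconds
```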
Temporary cart holds
- Release cart items after 30 minutes of inactivity
- Run cleanup job to delete old cart items
- Prevents inventory hoarding
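A sketch of the cleanup job, assuming PostgreSQL and a cart_items table with a last_updated_at column (table and column names are assumptions):

```python
def release_stale_carts(conn, batch_size: int = 1000):
    """Delete cart items untouched for 30+ minutes, in small batches."""
    with conn.cursor() as cur:
        while True:
            cur.execute(
                "DELETE FROM cart_items WHERE id IN ("
                "  SELECT id FROM cart_items"
                "  WHERE last_updated_at < NOW() - INTERVAL '30 minutes'"
                "  LIMIT %s)",
                (batch_size,),
            )
            conn.commit()
            if cur.rowcount == 0:
                break  # nothing stale left to release
```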
Device fingerprinting
- Track device fingerprint, not just IP
- Bots often use same device signature
- Block by fingerprint, not just IP
Require login for high-demand items
- For limited releases, require authentication
- Harder for bots to create accounts at scale”
Q: “Your cache is showing 60% hit rate instead of 95%. How do you debug?”
Answer: “Systematic debugging approach:
Check cache TTL
- Run: redis-cli TTL cart:user_123
- If the TTL is very short, entries expire before they are re-read
- A result of -2 means the key is already gone; -1 means no expiry is set
- Fix: Adjust TTL to appropriate value
Check cache eviction
- Run: redis-cli INFO stats | grep evicted_keys
- A high eviction count means Redis is running out of memory
- Fix: Add more Redis nodes or increase memory
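The same INFO stats section also exposes hit/miss counters, which makes it easy to confirm the 60% figure from code; a sketch assuming redis-py:

```python
import redis

r = redis.Redis()

def cache_stats() -> dict:
    """Read hit rate and eviction count from Redis' own counters."""
    stats = r.info("stats")
    hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
    total = hits + misses
    return {
        "hit_rate": hits / total if total else 0.0,  # ~0.60 in this scenario
        "evicted_keys": stats["evicted_keys"],       # high value points to memory pressure
    }
```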
Check access patterns
- Query: SELECT user_id, COUNT(*) FROM access_logs GROUP BY user_id
- Are we caching the right users? (80/20 rule)
- Fix: Cache only active users, not all users
Check invalidation logic
- Are we invalidating too aggressively?
- Every write = cache delete = next read is miss
- Fix: Reduce unnecessary invalidations
Check cache warming
- For popular users, pre-load cache
- Morning traffic spike = cache cold start
- Fix: Warm cache before peak hours
Monitor cache key distribution
- Run: redis-cli --bigkeys
- Are some keys huge, causing memory issues?
- Fix: Split large keys into smaller ones
Solution depends on root cause:
- Low memory: Add more Redis nodes
- Poor eviction policy: Switch to an LRU policy (e.g. allkeys-lru in Redis)
- Wrong caching strategy: Rethink what to cache”
Q: “How would you migrate 10M carts from old schema to new schema with zero downtime?”
Answer: “Zero-downtime migration strategy:
Phase 1: Dual writes (Week 1-2)
- Write to both old and new schema simultaneously
- Old schema remains source of truth
- New schema receives copies for testing
- Monitor for errors in new schema writes
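A sketch of the dual-write wrapper, with the new-schema write kept non-fatal while the old schema stays the source of truth (the store objects and transform are hypothetical):

```python
import logging

log = logging.getLogger("cart.migration")

def save_cart(cart, old_store, new_store):
    """Phase 1: old schema is authoritative; new schema gets a best-effort copy."""
    old_store.save(cart)  # must succeed: still the source of truth
    try:
        new_store.save(transform(cart))  # hypothetical old -> new mapping
    except Exception:
        # Never fail the user request because the new-schema write broke;
        # log it so the error rate can be monitored during Phase 1.
        log.exception("dual-write to new schema failed for cart %s", cart.id)

def transform(cart):
    """Placeholder for the old -> new schema transformation."""
    return cart
```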
Phase 2: Backfill (Week 2-3)
- Background job migrates old data in batches
- Process 1000 carts at a time to avoid overload
- Transform old schema format to new schema
- Mark migrated records in old schema
- Continue until all old data migrated
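A sketch of the batch backfill job, assuming PostgreSQL, a migrated flag on the old table, and a unique cart_id in the new table (all names are assumptions):

```python
import time

def backfill_carts(conn, batch_size: int = 1000, pause_seconds: float = 0.5):
    """Copy unmigrated carts from the old schema to the new one, batch by batch."""
    with conn.cursor() as cur:
        while True:
            cur.execute(
                "SELECT cart_id, payload FROM old_carts WHERE migrated = FALSE LIMIT %s",
                (batch_size,),
            )
            rows = cur.fetchall()
            if not rows:
                break  # everything has been migrated
            for cart_id, payload in rows:
                cur.execute(
                    "INSERT INTO new_carts (cart_id, payload) VALUES (%s, %s) "
                    "ON CONFLICT (cart_id) DO NOTHING",  # dual writes may have gotten here first
                    (cart_id, payload),
                )
                cur.execute(
                    "UPDATE old_carts SET migrated = TRUE WHERE cart_id = %s",
                    (cart_id,),
                )
            conn.commit()
            time.sleep(pause_seconds)  # throttle so the backfill doesn't overload the primary
```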
Phase 3: Dual reads (Week 3-4)
- Try reading from new schema first
- If data found, return it
- If not found, fallback to old schema
- Verify data consistency between schemas
- Monitor error rates closely
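A sketch of the dual-read path with fallback and a consistency check (the store helpers are hypothetical):

```python
import logging

log = logging.getLogger("cart.migration")

def load_cart(user_id, old_store, new_store):
    """Phase 3: prefer the new schema, fall back to the old one."""
    cart = new_store.get(user_id)
    if cart is not None:
        old_cart = old_store.get(user_id)
        if old_cart is not None and cart != old_cart:
            # Surface mismatches so consistency can be verified before the final switch.
            log.warning("schema mismatch for user %s", user_id)
        return cart
    return old_store.get(user_id)  # not yet in the new schema
```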
Phase 4: Switch (Week 4)
- All reads and writes go to new schema only
- Old schema kept as backup
- Monitor for 1 week for any issues
- Have rollback plan ready
Phase 5: Cleanup (Week 5)
- After verifying everything works
- Drop old schema tables
- Remove dual-write code
- Update documentation
Key principles:
- Always maintain backward compatibility
- Feature flags to enable instant rollback
- Monitor error rates at each phase
- Have rollback plan ready at every step
- Test migration on staging environment first”