Background

The sample answer below was generated by Claude; it is kept here as a chat reference.


Step 1: What the System Does (The 3 Core Functions)

Our system needs to:

  1. Accept messages - User sends a message with a delay (e.g., “send this in 2 hours”)
  2. Wait and deliver - System waits exactly that long, then sends the message
  3. Allow cancellation - User can cancel before it’s sent

Step 2: High-Level Architecture (The Big Picture)

┌─────────┐
│  User   │
└────┬────┘
     │
     ▼
┌─────────────────┐
│   API Server    │ ◄── Handles schedule/cancel requests
└────┬────────────┘
     │
     ├────────────┬───────────┐
     ▼            ▼           ▼
┌──────────┐ ┌─────────┐ ┌─────────┐
│ Database │ │  Redis  │ │ Workers │ ◄── Deliver messages
│(Postgres)│ │ (Queue) │ │  Pool   │
└──────────┘ └─────────┘ └─────────┘
                              │
                              ▼
                    ┌──────────────────┐
                    │ 3rd Party API    │
                    │ (Final delivery) │
                    └──────────────────┘

What each component does:

  • API Server: Receives HTTP requests (schedule/cancel)
  • Database: Stores all message data permanently
  • Redis: Quick lookup of “what messages need to be sent NOW”
  • Workers: Background processes that check Redis and deliver messages

Step 3: The Flow - Schedule a Message

User Request:
POST /schedule
{
  "content": "Happy Birthday!",
  "delay_seconds": 7200  // 2 hours
}

What happens:

1. API Server receives request
   ↓
2. Calculate delivery time = now + 7200 seconds (if now is 12:00 PM, that's 2:00 PM)
   ↓
3. Save to Database:
   {
     id: "msg-123",
     content: "Happy Birthday!",
     scheduled_time: "2:00 PM",
     status: "PENDING"
   }
   ↓
4. Add to Redis sorted set:
   ZADD pending_messages 1699891200 "msg-123"
   │    │                │          │
   │    │                │          └─ member (the message ID)
   │    │                └─ score (the delivery timestamp)
   │    └─ key (the sorted set)
   └─ command
   ↓
5. Return to user: {"message_id": "msg-123"}

Why Redis Sorted Set? A Redis sorted set stores each item with a “score” and keeps items ordered by it. We use the delivery timestamp as the score. This lets us quickly ask: “Which messages have a timestamp ≤ now?” - exactly the range query the worker runs every second.
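
Putting steps 1-5 together, here is a minimal sketch of the schedule path. It assumes the go-redis v9 client (github.com/redis/go-redis/v9); SaveMessage and the msg-<nanos> ID scheme are placeholders, not part of the design above:

package scheduler

import (
    "context"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

// SaveMessage is a hypothetical stand-in for the database layer
// (e.g., INSERT INTO messages (id, content, scheduled_time, status)).
func SaveMessage(ctx context.Context, id, content string, deliverAt int64) error {
    return nil
}

// ScheduleMessage walks steps 1-5: persist first, then index in Redis.
func ScheduleMessage(ctx context.Context, rdb *redis.Client, content string, delaySeconds int64) (string, error) {
    messageID := fmt.Sprintf("msg-%d", time.Now().UnixNano()) // placeholder ID scheme
    deliverAt := time.Now().Unix() + delaySeconds             // step 2

    // Step 3: the database is the source of truth.
    if err := SaveMessage(ctx, messageID, content, deliverAt); err != nil {
        return "", err
    }

    // Step 4: ZADD pending_messages <timestamp> <message_id>
    err := rdb.ZAdd(ctx, "pending_messages", redis.Z{
        Score:  float64(deliverAt),
        Member: messageID,
    }).Err()
    if err != nil {
        return "", err
    }

    return messageID, nil // step 5: hand the ID back to the user
}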


Step 4: The Flow - Worker Delivers Message

Worker runs every 1 second:

1. Get current time: 2:00:00 PM
   ↓
2. Ask Redis: "Give me messages with timestamp <= 2:00:00 PM"
   
   ZRANGEBYSCORE pending_messages 0 1699891200 LIMIT 0 100
   
   Returns: ["msg-123", "msg-456", ...]
   ↓
3. For each message ID:
   
   a. Get full message from Database
   b. Check status is still "PENDING"
   c. Call 3rd party API to deliver
   d. If success: update status to "DELIVERED"
   e. If fail: retry (up to 5 times)
   ↓
4. Remove from Redis:
   ZREM pending_messages "msg-123"
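
The worker side, as a matching sketch (same client and imports as the previous sketch, plus "strconv"). DeliverMessage here is the function defined in Step 6:

// PollAndDeliver runs the loop above: every second, pull due message IDs
// from the sorted set and hand each one to DeliverMessage (Step 6).
func PollAndDeliver(ctx context.Context, rdb *redis.Client) {
    ticker := time.NewTicker(1 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        now := strconv.FormatInt(time.Now().Unix(), 10)

        // Step 2: ZRANGEBYSCORE pending_messages 0 <now> LIMIT 0 100
        ids, err := rdb.ZRangeByScore(ctx, "pending_messages", &redis.ZRangeBy{
            Min: "0", Max: now, Offset: 0, Count: 100,
        }).Result()
        if err != nil {
            continue // transient Redis error - try again next tick
        }

        for _, id := range ids {
            DeliverMessage(id)                    // step 3 (a-e)
            rdb.ZRem(ctx, "pending_messages", id) // step 4
        }
    }
}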

Step 5: The Flow - Cancel a Message

User Request:
DELETE /cancel/msg-123

What happens:

1. Check Database: What's the current status?
   ↓
2. If status = "DELIVERED": Return error "Already sent"
   ↓
3. If status = "PENDING":
   a. Update Database: status = "CANCELLED"
   b. Remove from Redis: ZREM pending_messages "msg-123"
   ↓
4. Return: {"success": true}

Step 6: The Tricky Part - Race Conditions

The Problem: What if cancel and delivery happen at the SAME TIME?

Time: 2:00:00 PM

Thread 1 (User):           Thread 2 (Worker):
Cancel msg-123             Get msg-123 from Redis
  ↓                          ↓
Check status (PENDING)     Check status (PENDING)
  ↓                          ↓
Mark as CANCELLED          Start delivering...
  ↓                          ↓
CONFLICT! Both think they can proceed!

The Solution: Locks

We use a “lock” - only ONE thread can hold the lock at a time.

// Cancel function
func CancelMessage(messageID string) {
    lock := GetLock("lock:" + messageID)  // One lock per message
    lock.Acquire()                        // Wait until we have it
    defer lock.Release()                  // Let others use the lock when we're done

    msg := db.GetMessage(messageID)

    if msg.Status == "PENDING" {
        db.UpdateStatus(messageID, "CANCELLED")
        redis.ZRem("pending_messages", messageID)
    }
}
 
// Delivery function
func DeliverMessage(messageID string) {
    lock := GetLock("lock:" + messageID)  // SAME lock key as cancel
    lock.Acquire()

    msg := db.GetMessage(messageID)

    if msg.Status != "PENDING" {
        lock.Release()
        return  // Was cancelled (or already picked up)!
    }

    db.UpdateStatus(messageID, "PROCESSING")

    // Release BEFORE the slow network call - the lock only needs to
    // protect the status check-and-set, not the delivery itself.
    lock.Release()

    CallThirdPartyAPI(msg)
}

Now they can’t conflict:

  • If cancel gets lock first → marks CANCELLED → delivery sees CANCELLED and stops
  • If delivery gets lock first → marks PROCESSING → cancel sees PROCESSING and fails
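
The pseudocode above leaves GetLock abstract. With several workers these are separate processes, so an in-process mutex isn't enough; the lock has to live somewhere shared. One common pattern - shown here as an assumption, since the design above doesn't commit to one - is a Redis key set with NX and a TTL:

import (
    "context"
    "time"

    "github.com/redis/go-redis/v9"
)

// AcquireLock tries to take the per-message lock:
//   SET lock:<messageID> <token> NX EX 10
// NX = "only set if the key doesn't exist yet", so exactly one caller wins.
// The 10s TTL ensures a crashed holder can't block the message forever.
func AcquireLock(ctx context.Context, rdb *redis.Client, messageID, token string) (bool, error) {
    return rdb.SetNX(ctx, "lock:"+messageID, token, 10*time.Second).Result()
}

// ReleaseLock deletes the lock key. A production version should verify the
// token first (typically via a small Lua script) so one process can't
// release a lock that another process holds.
func ReleaseLock(ctx context.Context, rdb *redis.Client, messageID string) error {
    return rdb.Del(ctx, "lock:"+messageID).Err()
}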

Step 7: Handling Errors

What if the 3rd party API fails?

func DeliverWithRetry(msg Message) bool {
    for attempt := 0; attempt < 5; attempt++ {

        response := CallThirdPartyAPI(msg.Content)

        if response.Success {
            return true  // Delivered!
        }

        if response.StatusCode >= 500 {
            // Server error - wait and retry
            // (2^attempt seconds; in Go that's 1<<attempt, since ^ is XOR)
            time.Sleep(time.Duration(1<<attempt) * time.Second)  // 1s, 2s, 4s, 8s, 16s
            continue
        }

        // Client error (e.g., 400 Bad Request) - retrying won't help
        return false
    }

    // Failed after 5 tries
    return false
}

Exponential backoff: Wait 1s, then 2s, then 4s, then 8s, then 16s between retries.


Step 8: Scaling to 10,000 Messages/Second

Problem: One Redis sorted set + one worker = bottleneck

Solution: Partition the work

Instead of:
  pending_messages  (1 sorted set)
  
Use:
  pending_messages:0   (partition 0)
  pending_messages:1   (partition 1)
  pending_messages:2   (partition 2)
  ...
  pending_messages:9   (partition 9)

Schedule message:
  partition = hash(message_id) % 10
  ZADD pending_messages:{partition} {timestamp} {message_id}

Workers:
  Worker 1 → polls partitions 0, 1
  Worker 2 → polls partitions 2, 3
  Worker 3 → polls partitions 4, 5
  Worker 4 → polls partitions 6, 7
  Worker 5 → polls partitions 8, 9

Now 5 workers can work in parallel without conflicting!
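
Picking the partition is just a stable hash of the message ID. A sketch using Go's standard hash/fnv - the exact hash function is an arbitrary choice here; all that matters is that schedule and cancel compute the same partition for the same ID:

import "hash/fnv"

// partitionFor maps a message ID to one of numPartitions sorted sets.
func partitionFor(messageID string, numPartitions uint32) uint32 {
    h := fnv.New32a()
    h.Write([]byte(messageID)) // fnv's Write never returns an error
    return h.Sum32() % numPartitions
}

// Scheduling then targets the partitioned key, e.g.:
//   key := fmt.Sprintf("pending_messages:%d", partitionFor(messageID, 10))
//   rdb.ZAdd(ctx, key, redis.Z{Score: float64(deliverAt), Member: messageID})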



Summary - The Key Concepts

  1. Redis Sorted Set - Fast time-based indexing
  2. Database - Permanent storage with status tracking
  3. Workers - Background processes that poll and deliver
  4. Locks - Prevent race conditions between cancel and deliver
  5. Retry Logic - Exponential backoff for failures
  6. Partitioning - Split work across multiple workers for scale