Background

The sample answer below was generated by Claude; it is kept here as a chat reference.


Step 1: What the System Does (The 3 Core Functions)

Our system needs to:

  1. Accept messages - User sends a message with a delay (e.g., “send this in 2 hours”)
  2. Wait and deliver - System waits exactly that long, then sends the message
  3. Allow cancellation - User can cancel before it’s sent

Step 2: High-Level Architecture (The Big Picture)

┌─────────┐
│  User   │
└────┬────┘
     │
     ▼
┌─────────────────┐
│   API Server    │ ◄── Handles schedule/cancel requests
└────┬────────────┘
     │
     ├────────────┬───────────┐
     ▼            ▼           ▼
┌──────────┐ ┌─────────┐ ┌─────────┐
│ Database │ │  Redis  │ │ Workers │ ◄── Deliver messages
│(Postgres)│ │ (Queue) │ │  Pool   │
└──────────┘ └─────────┘ └─────────┘
                              │
                              ▼
                    ┌──────────────────┐
                    │ 3rd Party API    │
                    │ (Final delivery) │
                    └──────────────────┘

What each component does:

  • API Server: Receives HTTP requests (schedule/cancel)
  • Database: Stores all message data permanently
  • Redis: Quick lookup of “what messages need to be sent NOW”
  • Workers: Background processes that check Redis and deliver messages

Step 3: The Flow - Schedule a Message

User Request:
POST /schedule
{
  "content": "Happy Birthday!",
  "delay_seconds": 7200  // 2 hours
}

What happens:

1. API Server receives request
   ↓
2. Calculate delivery time = now + 7200 seconds (if now is 12:00 PM, that's 2:00 PM)
   ↓
3. Save to Database:
   {
     id: "msg-123",
     content: "Happy Birthday!",
     scheduled_time: "2:00 PM",
     status: "PENDING"
   }
   ↓
4. Add to Redis sorted set:
   ZADD pending_messages 1699891200 "msg-123"
   │    │                │          │
   │    │                │          └─ member (the message ID)
   │    │                └─ score (the delivery timestamp)
   │    └─ key (the sorted set)
   └─ command
   ↓
5. Return to user: {"message_id": "msg-123"}

Why Redis Sorted Set? A Redis sorted set stores each item with a “score” and keeps items ordered by it. We use the delivery timestamp as the score. This lets us quickly ask: “Which messages have a timestamp ≤ now?” - exactly the range query the worker runs every second.
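
Putting steps 1-5 together, here is a minimal sketch of the schedule path. It assumes the go-redis v9 client (github.com/redis/go-redis/v9); SaveMessage and the msg-<nanos> ID scheme are placeholders, not part of the design above:

package scheduler

import (
    "context"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

// SaveMessage is a hypothetical stand-in for the database layer
// (e.g., INSERT INTO messages (id, content, scheduled_time, status)).
func SaveMessage(ctx context.Context, id, content string, deliverAt int64) error {
    return nil
}

// ScheduleMessage walks steps 1-5: persist first, then index in Redis.
func ScheduleMessage(ctx context.Context, rdb *redis.Client, content string, delaySeconds int64) (string, error) {
    messageID := fmt.Sprintf("msg-%d", time.Now().UnixNano()) // placeholder ID scheme
    deliverAt := time.Now().Unix() + delaySeconds             // step 2

    // Step 3: the database is the source of truth.
    if err := SaveMessage(ctx, messageID, content, deliverAt); err != nil {
        return "", err
    }

    // Step 4: ZADD pending_messages <timestamp> <message_id>
    err := rdb.ZAdd(ctx, "pending_messages", redis.Z{
        Score:  float64(deliverAt),
        Member: messageID,
    }).Err()
    if err != nil {
        return "", err
    }

    return messageID, nil // step 5: hand the ID back to the user
}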


Step 4: The Flow - Worker Delivers Message

Worker runs every 1 second:

1. Get current time: 2:00:00 PM
   ↓
2. Ask Redis: "Give me messages with timestamp <= 2:00:00 PM"
   
   ZRANGEBYSCORE pending_messages 0 1699891200 LIMIT 0 100
   
   Returns: ["msg-123", "msg-456", ...]
   ↓
3. For each message ID:
   
   a. Get full message from Database
   b. Check status is still "PENDING"
   c. Call 3rd party API to deliver
   d. If success: update status to "DELIVERED"
   e. If fail: retry (up to 5 times)
   ↓
4. Remove from Redis:
   ZREM pending_messages "msg-123"
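
The worker side, as a matching sketch (same client and imports as the previous sketch, plus "strconv"). DeliverMessage here is the function defined in Step 6:

// PollAndDeliver runs the loop above: every second, pull due message IDs
// from the sorted set and hand each one to DeliverMessage (Step 6).
func PollAndDeliver(ctx context.Context, rdb *redis.Client) {
    ticker := time.NewTicker(1 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        now := strconv.FormatInt(time.Now().Unix(), 10)

        // Step 2: ZRANGEBYSCORE pending_messages 0 <now> LIMIT 0 100
        ids, err := rdb.ZRangeByScore(ctx, "pending_messages", &redis.ZRangeBy{
            Min: "0", Max: now, Offset: 0, Count: 100,
        }).Result()
        if err != nil {
            continue // transient Redis error - try again next tick
        }

        for _, id := range ids {
            DeliverMessage(id)                    // step 3 (a-e)
            rdb.ZRem(ctx, "pending_messages", id) // step 4
        }
    }
}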

Step 5: The Flow - Cancel a Message

User Request:
DELETE /cancel/msg-123

What happens:

1. Check Database: What's the current status?
   ↓
2. If status = "DELIVERED": Return error "Already sent"
   ↓
3. If status = "PENDING":
   a. Update Database: status = "CANCELLED"
   b. Remove from Redis: ZREM pending_messages "msg-123"
   ↓
4. Return: {"success": true}

Step 6: The Tricky Part - Race Conditions

The Problem: What if cancel and delivery happen at the SAME TIME?

Time: 2:00:00 PM

Thread 1 (User):           Thread 2 (Worker):
Cancel msg-123             Get msg-123 from Redis
  ↓                          ↓
Check status (PENDING)     Check status (PENDING)
  ↓                          ↓
Mark as CANCELLED          Start delivering...
  ↓                          ↓
CONFLICT! Both think they can proceed!

The Solution: Locks

We use a “lock” - only ONE thread can hold the lock at a time.

// Cancel function
func CancelMessage(messageID string) {
    lock := GetLock("lock:" + messageID)  // One lock per message
    lock.Acquire()                        // Wait until we have it
    defer lock.Release()                  // Let others use the lock when we're done

    msg := db.GetMessage(messageID)

    if msg.Status == "PENDING" {
        db.UpdateStatus(messageID, "CANCELLED")
        redis.ZRem("pending_messages", messageID)
    }
}
 
// Delivery function
func DeliverMessage(messageID string) {
    lock := GetLock("lock:" + messageID)  // SAME lock key as cancel
    lock.Acquire()

    msg := db.GetMessage(messageID)

    if msg.Status != "PENDING" {
        lock.Release()
        return  // Was cancelled (or already picked up)!
    }

    db.UpdateStatus(messageID, "PROCESSING")

    // Release BEFORE the slow network call - the lock only needs to
    // protect the status check-and-set, not the delivery itself.
    lock.Release()

    CallThirdPartyAPI(msg)
}

Now they can’t conflict:

  • If cancel gets lock first → marks CANCELLED → delivery sees CANCELLED and stops
  • If delivery gets lock first → marks PROCESSING → cancel sees PROCESSING and fails
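
The pseudocode above leaves GetLock abstract. With several workers these are separate processes, so an in-process mutex isn't enough; the lock has to live somewhere shared. One common pattern - shown here as an assumption, since the design above doesn't commit to one - is a Redis key set with NX and a TTL:

import (
    "context"
    "time"

    "github.com/redis/go-redis/v9"
)

// AcquireLock tries to take the per-message lock:
//   SET lock:<messageID> <token> NX EX 10
// NX = "only set if the key doesn't exist yet", so exactly one caller wins.
// The 10s TTL ensures a crashed holder can't block the message forever.
func AcquireLock(ctx context.Context, rdb *redis.Client, messageID, token string) (bool, error) {
    return rdb.SetNX(ctx, "lock:"+messageID, token, 10*time.Second).Result()
}

// ReleaseLock deletes the lock key. A production version should verify the
// token first (typically via a small Lua script) so one process can't
// release a lock that another process holds.
func ReleaseLock(ctx context.Context, rdb *redis.Client, messageID string) error {
    return rdb.Del(ctx, "lock:"+messageID).Err()
}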

Step 7: Handling Errors

What if the 3rd party API fails?

func DeliverWithRetry(msg Message) bool {
    for attempt := 0; attempt < 5; attempt++ {

        response := CallThirdPartyAPI(msg.Content)

        if response.Success {
            return true  // Delivered!
        }

        if response.StatusCode >= 500 {
            // Server error - wait and retry
            // (2^attempt seconds; in Go that's 1<<attempt, since ^ is XOR)
            time.Sleep(time.Duration(1<<attempt) * time.Second)  // 1s, 2s, 4s, 8s, 16s
            continue
        }

        // Client error (e.g., 400 Bad Request) - retrying won't help
        return false
    }

    // Failed after 5 tries
    return false
}

Exponential backoff: Wait 1s, then 2s, then 4s, then 8s, then 16s between retries.


Step 8: Scaling to 10,000 Messages/Second

Problem: One Redis sorted set + one worker = bottleneck

Solution: Partition the work

Instead of:
  pending_messages  (1 sorted set)
  
Use:
  pending_messages:0   (partition 0)
  pending_messages:1   (partition 1)
  pending_messages:2   (partition 2)
  ...
  pending_messages:9   (partition 9)

Schedule message:
  partition = hash(message_id) % 10
  ZADD pending_messages:{partition} {timestamp} {message_id}

Workers:
  Worker 1 → polls partitions 0, 1
  Worker 2 → polls partitions 2, 3
  Worker 3 → polls partitions 4, 5
  Worker 4 → polls partitions 6, 7
  Worker 5 → polls partitions 8, 9

Now 5 workers can work in parallel without conflicting!
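
Picking the partition is just a stable hash of the message ID. A sketch using Go's standard hash/fnv - the exact hash function is an arbitrary choice here; all that matters is that schedule and cancel compute the same partition for the same ID:

import "hash/fnv"

// partitionFor maps a message ID to one of numPartitions sorted sets.
func partitionFor(messageID string, numPartitions uint32) uint32 {
    h := fnv.New32a()
    h.Write([]byte(messageID)) // fnv's Write never returns an error
    return h.Sum32() % numPartitions
}

// Scheduling then targets the partitioned key, e.g.:
//   key := fmt.Sprintf("pending_messages:%d", partitionFor(messageID, 10))
//   rdb.ZAdd(ctx, key, redis.Z{Score: float64(deliverAt), Member: messageID})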



Summary - The Key Concepts

  1. Redis Sorted Set - Fast time-based indexing
  2. Database - Permanent storage with status tracking
  3. Workers - Background processes that poll and deliver
  4. Locks - Prevent race conditions between cancel and deliver
  5. Retry Logic - Exponential backoff for failures
  6. Partitioning - Split work across multiple workers for scale