# PLAYBOOK-001: Transactional Outbox Pattern
Pattern Category: Data Consistency
Complexity: Medium
Maturity: Proven
Last Updated: 2024-01-25
## 1. Pattern Name
Transactional Outbox Pattern (also known as "Application Events Pattern")
## 2. Problem
How do you reliably publish events to a message broker (e.g., Kafka, RabbitMQ) after updating a database, ensuring atomicity between the database write and the event publish?
### Common Failure Scenarios
❌ Scenario 1: Publish-First
1. Publish event to Kafka ✓
2. Write to database ✗ (fails)
Result: Event published but state not persisted → inconsistency
❌ Scenario 2: Write-First, Synchronous Publish
1. Write to database ✓
2. Publish event to Kafka ✗ (Kafka down)
Result: State persisted but event never published → downstream services miss update
❌ Scenario 3: Distributed Transaction (2PC)
1. Prepare: Database + Kafka
2. Commit: Both or neither
Result: High latency, complex coordination, single point of failure
### Core Challenge

You cannot use a distributed transaction across a database and a message broker because:

- Kafka, RabbitMQ, and most brokers don't support two-phase commit (2PC)
- Distributed transactions have high latency and scale poorly
- They introduce tight coupling and a single point of failure
## 3. Context

### When This Problem Occurs
- Microservices Architecture: Services communicate via events/messages
- Event-Driven Systems: State changes must trigger downstream actions
- CQRS: Command side publishes events to update read models
- Domain Events: Aggregate state changes must notify other bounded contexts
### Example Use Cases
✅ Order placed → Notify inventory service
✅ User registered → Send welcome email
✅ Payment processed → Update analytics
✅ Booking cancelled → Send cancellation notification
### Requirements That Make This Pattern Necessary
- Atomicity: Either both database write and event publish succeed, or neither
- Reliability: No event loss even during broker downtime
- Performance: Low latency for user-facing operations
- Scalability: Handle high throughput
## 4. Solution

### Pattern Overview
Store events in an outbox table within the same database as your primary data. A separate relay worker polls the outbox and publishes events to the message broker.
```mermaid
sequenceDiagram
    participant API as Service API
    participant DB as Database
    participant Outbox as Outbox Table
    participant Worker as Relay Worker
    participant Broker as Message Broker
    participant Consumer as Downstream Service

    API->>API: Business logic

    rect rgb(200, 220, 250)
        note right of API: Single ACID Transaction
        API->>DB: Write business data
        API->>Outbox: Write event
        API->>DB: COMMIT
    end

    API-->>API: Return success to caller

    Worker->>Outbox: Poll for unprocessed events
    Outbox-->>Worker: Event(s)
    Worker->>Broker: Publish event
    Broker-->>Worker: ACK
    Worker->>Outbox: Mark as processed

    Broker->>Consumer: Event
    Consumer->>Consumer: Process event
```
### Key Components
- Outbox Table: Database table that stores events to be published
- Transaction: Single database transaction writes both business data and outbox event
- Relay Worker: Background process that polls outbox and publishes events
- Message Broker: Kafka, RabbitMQ, etc. (downstream event distribution)
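The Go snippets in section 5 assume roughly the following types. This is an illustrative sketch of the shapes those snippets rely on, not a prescribed API; the UUID library and interface names are assumptions.

```go
import (
	"context"
	"database/sql"

	"github.com/google/uuid" // assumed UUID library
)

// OutboxEntry mirrors one row of the outbox table (see 5.1).
type OutboxEntry struct {
	ID           int64
	AggregateID  uuid.UUID
	EventType    string
	EventPayload []byte
	RetryCount   int
}

// OutboxRepository is the persistence interface used by the service layer
// and the relay worker.
type OutboxRepository interface {
	Insert(ctx context.Context, tx *sql.Tx, e OutboxEntry) error
	FetchUnprocessed(ctx context.Context, limit int) ([]OutboxEntry, error)
	MarkProcessed(ctx context.Context, id int64) error
	IncrementRetry(ctx context.Context, id int64, lastError string) error
	MoveToDLQ(ctx context.Context, id int64) error
}

// Message and MessageBroker abstract the broker (Kafka, RabbitMQ, ...).
type Message struct {
	Topic   string
	Key     string
	Value   []byte
	Headers map[string]string
}

type MessageBroker interface {
	Publish(ctx context.Context, m Message) error
}

// Logger is whatever structured logger the service already uses.
type Logger interface {
	Info(msg string, kv ...any)
	Error(msg string, kv ...any)
}
```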
## 5. Implementation Details

### 5.1 Database Schema

#### Outbox Table Structure
```sql
CREATE TABLE outbox (
    id              BIGSERIAL PRIMARY KEY,

    -- Event identification
    aggregate_id    UUID NOT NULL,          -- ID of the aggregate that changed
    event_type      VARCHAR(100) NOT NULL,  -- Type of event (e.g., "OrderPlaced")

    -- Event data
    event_payload   JSONB NOT NULL,         -- Full event data as JSON

    -- Metadata
    created_at      TIMESTAMP NOT NULL DEFAULT NOW(),
    processed_at    TIMESTAMP NULL,         -- NULL = unprocessed

    -- Retry handling
    retry_count     INTEGER NOT NULL DEFAULT 0,
    last_error      TEXT NULL
);

-- Index for efficient polling (only unprocessed events)
CREATE INDEX idx_outbox_unprocessed
    ON outbox (created_at)
    WHERE processed_at IS NULL;

-- Index for cleanup queries (processed events)
CREATE INDEX idx_outbox_processed
    ON outbox (processed_at)
    WHERE processed_at IS NOT NULL;
```
#### Example: Booking Cancellation

```sql
-- Business table
CREATE TABLE bookings (
    id           UUID PRIMARY KEY,
    user_id      UUID NOT NULL,
    status       VARCHAR(20) NOT NULL,
    cancelled_at TIMESTAMP NULL
    -- other fields...
);

-- Outbox table (one per service, shared by all booking events)
CREATE TABLE booking_outbox (
    id            BIGSERIAL PRIMARY KEY,
    booking_id    UUID NOT NULL,
    event_type    VARCHAR(100) NOT NULL,
    event_payload JSONB NOT NULL,
    created_at    TIMESTAMP NOT NULL DEFAULT NOW(),
    processed_at  TIMESTAMP NULL
);
```
### 5.2 Writing Events (Producer Side)

#### Go Implementation

```go
// Service method with transaction
func (s *BookingService) CancelBooking(
	ctx context.Context,
	bookingID uuid.UUID,
	userID uuid.UUID,
	reason string,
) (*Booking, error) {
	// Start database transaction
	tx, err := s.db.BeginTx(ctx, nil)
	if err != nil {
		return nil, err
	}
	defer tx.Rollback() // Rollback if not committed (no-op after Commit)

	// 1. Load and update business entity
	booking, err := s.repo.FindByID(ctx, tx, bookingID)
	if err != nil {
		return nil, err
	}
	booking.Cancel(reason)
	if err := s.repo.Update(ctx, tx, booking); err != nil {
		return nil, err
	}

	// 2. Write event to outbox (same transaction)
	event := BookingCancelledEvent{
		EventID:     uuid.New(),
		BookingID:   bookingID,
		UserID:      userID,
		CancelledAt: time.Now(),
		Reason:      reason,
	}
	if err := s.outbox.Insert(ctx, tx, OutboxEntry{
		AggregateID:  bookingID,
		EventType:    "BookingCancelled",
		EventPayload: event.ToJSON(),
	}); err != nil {
		return nil, err
	}

	// 3. Commit transaction (atomic write)
	if err := tx.Commit(); err != nil {
		return nil, err
	}

	// Success - event will be published by the relay worker
	return booking, nil
}
```
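The `s.outbox.Insert` call above boils down to a single `INSERT` executed on the caller's transaction. A minimal sketch against the 5.1 schema; the `PostgresOutbox` name is illustrative:

```go
// PostgresOutbox implements OutboxRepository on top of database/sql.
type PostgresOutbox struct{}

// Insert writes the event row using the caller's transaction, so it commits
// or rolls back together with the business data.
func (PostgresOutbox) Insert(ctx context.Context, tx *sql.Tx, e OutboxEntry) error {
	_, err := tx.ExecContext(ctx, `
		INSERT INTO outbox (aggregate_id, event_type, event_payload)
		VALUES ($1, $2, $3)`,
		e.AggregateID, e.EventType, e.EventPayload)
	return err
}
```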
#### Python Implementation

```python
def cancel_booking(
    booking_id: UUID,
    user_id: UUID,
    reason: str,
    db: Session,
) -> Booking:
    # Single transaction: the context manager commits on clean exit
    # and rolls back on exception (no explicit db.commit() needed)
    with db.begin():
        # 1. Load and update business entity
        booking = db.query(Booking).filter_by(id=booking_id).first()
        if not booking:
            raise BookingNotFound()
        booking.cancel(reason)
        db.add(booking)

        # 2. Write event to outbox (same transaction)
        event = BookingCancelledEvent(
            event_id=uuid4(),
            booking_id=booking_id,
            user_id=user_id,
            cancelled_at=datetime.now(),
            reason=reason,
        )
        db.add(OutboxEntry(
            aggregate_id=booking_id,
            event_type="BookingCancelled",
            event_payload=event.to_json(),
        ))

    # 3. Transaction committed atomically by the context manager
    return booking
```
### 5.3 Publishing Events (Relay Worker)

#### Polling-Based Relay Worker (Go)
```go
type OutboxRelayWorker struct {
	outbox    OutboxRepository
	broker    MessageBroker
	logger    Logger
	pollRate  time.Duration // e.g., 5 seconds
	batchSize int           // e.g., 100 events
}

const maxRetries = 5

func (w *OutboxRelayWorker) Run(ctx context.Context) error {
	ticker := time.NewTicker(w.pollRate)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := w.processOutbox(ctx); err != nil {
				w.logger.Error("outbox processing failed", err)
			}
		}
	}
}

func (w *OutboxRelayWorker) processOutbox(ctx context.Context) error {
	// Fetch oldest unprocessed events (FIFO order)
	events, err := w.outbox.FetchUnprocessed(ctx, w.batchSize)
	if err != nil {
		return err
	}
	if len(events) == 0 {
		return nil // No work to do
	}

	for _, event := range events {
		if err := w.publishEvent(ctx, event); err != nil {
			w.logger.Error("failed to publish event",
				"event_id", event.ID,
				"error", err,
			)

			// Record the failure (retry_count was read before this
			// attempt, hence the +1 below)
			if rErr := w.outbox.IncrementRetry(ctx, event.ID, err.Error()); rErr != nil {
				w.logger.Error("failed to record retry", "event_id", event.ID, "error", rErr)
			}

			// Move to dead-letter queue after max retries
			if event.RetryCount+1 >= maxRetries {
				if dErr := w.outbox.MoveToDLQ(ctx, event.ID); dErr != nil {
					w.logger.Error("failed to move event to DLQ", "event_id", event.ID, "error", dErr)
				}
			}
			continue // Continue processing other events
		}

		// Mark as processed (idempotent); if this fails, the event is
		// re-published later (at-least-once delivery)
		if mErr := w.outbox.MarkProcessed(ctx, event.ID); mErr != nil {
			w.logger.Error("failed to mark event processed", "event_id", event.ID, "error", mErr)
		}
	}
	return nil
}

func (w *OutboxRelayWorker) publishEvent(
	ctx context.Context,
	event OutboxEntry,
) error {
	message := Message{
		Topic: "events",
		Key:   event.AggregateID.String(),
		Value: event.EventPayload,
		Headers: map[string]string{
			"event_type": event.EventType,
			// Use the outbox row ID: it is stable across retries, so
			// consumers can deduplicate (see Edge Case 1)
			"event_id": strconv.FormatInt(event.ID, 10),
		},
	}

	// Publish with timeout
	publishCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	return w.broker.Publish(publishCtx, message)
}
```
#### SQL Query for Fetching Unprocessed Events

```sql
-- Fetch the oldest unprocessed events (FIFO order).
-- FOR UPDATE SKIP LOCKED lets concurrent workers claim disjoint rows.
-- The locks only coordinate while the claiming transaction is open, so
-- run this inside the transaction that also marks the rows processed
-- (see the sketch below).
SELECT id, aggregate_id, event_type, event_payload, retry_count
FROM outbox
WHERE processed_at IS NULL
ORDER BY created_at ASC
LIMIT 100
FOR UPDATE SKIP LOCKED;
```
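Because row locks only live as long as the transaction that took them, fetch, publish, and mark-processed must share one transaction for `SKIP LOCKED` to coordinate workers. A sketch of a `processOutbox` variant structured that way, assuming `database/sql` and the types sketched in section 4:

```go
// processBatch claims a batch with FOR UPDATE SKIP LOCKED, publishes it, and
// marks it processed, all in one transaction. Concurrent workers skip each
// other's locked rows. If the commit fails after publishing, the events are
// re-published on the next poll (at-least-once delivery).
func (w *OutboxRelayWorker) processBatch(ctx context.Context, db *sql.DB) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once committed

	rows, err := tx.QueryContext(ctx, `
		SELECT id, aggregate_id, event_type, event_payload
		FROM outbox
		WHERE processed_at IS NULL
		ORDER BY created_at ASC
		LIMIT $1
		FOR UPDATE SKIP LOCKED`, w.batchSize)
	if err != nil {
		return err
	}
	var batch []OutboxEntry
	for rows.Next() {
		var e OutboxEntry
		if err := rows.Scan(&e.ID, &e.AggregateID, &e.EventType, &e.EventPayload); err != nil {
			rows.Close()
			return err
		}
		batch = append(batch, e)
	}
	rows.Close()
	if err := rows.Err(); err != nil {
		return err
	}

	for _, e := range batch {
		if err := w.publishEvent(ctx, e); err != nil {
			return err // roll back: rows stay unprocessed and are retried
		}
		if _, err := tx.ExecContext(ctx,
			`UPDATE outbox SET processed_at = NOW() WHERE id = $1`, e.ID); err != nil {
			return err
		}
	}
	return tx.Commit()
}
```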
### 5.4 Alternative: Change Data Capture (CDC)

Instead of polling, use database change streams (CDC) to detect new outbox entries.

#### Using Debezium (Kafka Connect)
```yaml
# Debezium connector configuration
name: outbox-connector
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  database.hostname: postgres
  database.port: 5432
  database.user: user
  database.password: pass
  database.dbname: bookings
  table.include.list: public.outbox
  transforms: outbox
  transforms.outbox.type: io.debezium.transforms.outbox.EventRouter
```
Pros:

- Lower latency (near real-time)
- No polling overhead
- Built-in at-least-once delivery

Cons:

- Additional infrastructure (Debezium, Kafka Connect)
- More complex setup
- Database-specific (requires WAL for PostgreSQL)
## 6. When to Use

### ✅ Use This Pattern When
- Event-driven architecture: Services communicate via events
- Microservices: Need to publish domain events to other services
- Reliability required: Cannot tolerate event loss
- Database + broker: Using both transactional DB and message broker
- Asynchronous processing: Downstream consumers don't need immediate processing
### ❌ Do NOT Use This Pattern When

- Synchronous responses needed: If the caller needs an immediate downstream result
  - Alternative: Use request/response (HTTP, gRPC)
- No database: If you're not using a transactional database
  - Alternative: Use the broker's built-in transactions (if available)
- Simple pub/sub: If you don't need transactional guarantees
  - Alternative: Publish directly to the broker (accept the risk of event loss)
- Low complexity tolerance: If the team can't maintain additional infrastructure
  - Alternative: Accept eventual consistency gaps or use simpler patterns
## 7. Trade-offs

### Advantages
✅ Atomicity: The business write and the outbox event commit in a single transaction; once committed, publication is guaranteed to follow
✅ Reliability: Zero event loss (events persisted in database)
✅ Performance: Fast API response (async event publishing)
✅ Resilience: System continues working even if broker is down
✅ Ordering: Events processed in FIFO order per aggregate
✅ Retry logic: Built-in retry mechanism for failed publishes
### Disadvantages
⚠️ Eventual consistency: Slight delay between database write and event delivery
⚠️ Complexity: Additional infrastructure (outbox table + relay worker)
⚠️ Operational overhead: Need to monitor outbox backlog and DLQ
⚠️ Storage growth: Outbox table grows (requires cleanup job)
⚠️ At-least-once delivery: Events may be published multiple times (consumers must be idempotent)
## 8. Edge Cases and Gotchas

### Edge Case 1: Duplicate Event Publishing
Problem: Relay worker crashes after publishing event but before marking as processed.
Impact: Event published twice to broker → downstream consumers receive duplicate.
Solution:
- Add unique event_id to each event
- Consumers implement idempotency (deduplicate by event_id)
```go
// Consumer-side deduplication
func (c *Consumer) HandleEvent(event Event) error {
	// Check if event already processed
	if c.cache.Contains(event.EventID) {
		log.Info("duplicate event detected", "event_id", event.EventID)
		return nil // Skip processing
	}

	// Process event
	if err := c.processEvent(event); err != nil {
		return err
	}

	// Mark as processed
	c.cache.Set(event.EventID, true, 24*time.Hour)
	return nil
}
```
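The cache-based check above is lost on restart, and the TTL bounds the dedup window. Where the consumer's side effects land in its own database, a unique constraint gives durable idempotency. A sketch, assuming a `processed_events` table with a unique `event_id` column and `processEventTx` as a transactional variant of `processEvent`:

```go
// HandleEventOnce records the event ID and processes the event in one
// transaction: the unique index makes the duplicate check atomic, and a
// processing failure rolls back the marker so the retry is not skipped.
func (c *Consumer) HandleEventOnce(ctx context.Context, db *sql.DB, event Event) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once committed

	res, err := tx.ExecContext(ctx, `
		INSERT INTO processed_events (event_id) VALUES ($1)
		ON CONFLICT (event_id) DO NOTHING`, event.EventID)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return nil // duplicate: already processed, skip
	}

	if err := c.processEventTx(ctx, tx, event); err != nil {
		return err // rolls back marker and side effects together
	}
	return tx.Commit()
}
```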
### Edge Case 2: Outbox Table Growth
Problem: Outbox table grows indefinitely if processed events aren't deleted.
Impact: Slower queries, increased storage costs.
Solution: Periodic cleanup job
```sql
-- Delete processed events older than 30 days
DELETE FROM outbox
WHERE processed_at IS NOT NULL
  AND processed_at < NOW() - INTERVAL '30 days';
```

```go
// Cleanup job (cron)
func (j *OutboxCleanupJob) Run(ctx context.Context) error {
	result, err := j.db.ExecContext(ctx, `
		DELETE FROM outbox
		WHERE processed_at IS NOT NULL
		  AND processed_at < NOW() - INTERVAL '30 days'
	`)
	if err != nil {
		return err
	}

	deleted, _ := result.RowsAffected()
	j.logger.Info("outbox cleanup completed", "rows_deleted", deleted)
	return nil
}
```
### Edge Case 3: Poisoned Events (Malformed Data)
Problem: Event payload is malformed and fails to publish repeatedly.
Impact: Blocks outbox processing, causes backlog.
Solution: Dead-letter queue (DLQ)
```go
// Move to DLQ after 5 failed retries
if event.RetryCount >= 5 {
	// Move to separate DLQ table
	if err := w.outbox.MoveToDLQ(ctx, event.ID); err != nil {
		return err
	}
	// Alert operations team
	w.alerting.Send("Outbox event moved to DLQ", event)
}
```

```sql
-- Dead-letter queue table
CREATE TABLE outbox_dlq (
    id             BIGINT PRIMARY KEY,
    original_event JSONB NOT NULL,
    error_message  TEXT NOT NULL,
    moved_at       TIMESTAMP NOT NULL DEFAULT NOW()
);
```
### Edge Case 4: Relay Worker Crashes
Problem: Relay worker process crashes or is killed.
Impact: Events not published until worker restarts.
Solution: Deploy multiple worker instances (redundancy)
```yaml
# Kubernetes deployment with 3 replicas
apiVersion: apps/v1
kind: Deployment
metadata:
  name: outbox-relay-worker
spec:
  replicas: 3  # Multiple workers for redundancy
  selector:
    matchLabels:
      app: outbox-relay-worker
  template:
    metadata:
      labels:
        app: outbox-relay-worker
    spec:
      containers:
        - name: worker
          image: outbox-relay:v1.0
```
Note: Multiple workers are safe because:

- FOR UPDATE SKIP LOCKED (see 5.3) makes each worker claim a disjoint batch of events
- processed_at prevents reprocessing of completed events
- Row-level database locks prevent race conditions
### Edge Case 5: Broker Unavailability

Problem: Kafka/RabbitMQ is down for an extended period.

Impact: Outbox backlog grows.

Solution:

- Monitoring: Alert when outbox size > threshold
- Auto-scaling: Scale up worker replicas to handle the backlog
- Graceful degradation: API continues working (events queued)

```yaml
# Prometheus alert
- alert: OutboxBacklogHigh
  expr: outbox_unprocessed_count > 10000
  for: 10m
  annotations:
    summary: "Outbox backlog > 10k events - check broker health"
```
## 9. Checklist for Implementation

### Before Writing Code
- Identify aggregates that publish events
- Define event schemas (payload structure)
- Choose event serialization format (JSON, Protobuf, Avro)
- Design outbox table schema
- Plan cleanup strategy for processed events
### During Implementation
- Create outbox table with indexes
- Update service layer to write to outbox in transactions
- Implement relay worker with retry logic
- Add dead-letter queue for poisoned events
- Implement idempotency in consumers (deduplicate by event_id)
- Add unique event_id to each event
### After Implementation
- Add monitoring metrics (outbox size, processing lag)
- Set up alerts (backlog, DLQ growth)
- Test failure scenarios (broker down, worker crash)
- Load test with high throughput
- Document operational runbooks
### Production Readiness
- Deploy multiple relay worker instances (redundancy)
- Configure auto-scaling for workers
- Set up outbox cleanup job (cron)
- Verify at-least-once delivery guarantees
- Test rollback procedure
## 10. Monitoring and Observability

### Key Metrics

```text
# Prometheus metrics
outbox_unprocessed_count                               - Gauge (current backlog)
outbox_processing_lag_seconds                          - Gauge (age of oldest unprocessed event)
outbox_events_published_total{status="success|error"}  - Counter
outbox_retry_count                                     - Histogram (distribution of retry counts)
outbox_dlq_size                                        - Gauge (dead-letter queue size)
```
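A sketch of exposing these from the relay worker with the Prometheus Go client (github.com/prometheus/client_golang); the metric names match the list above, while the wiring shown in the trailing comment is illustrative:

```go
import "github.com/prometheus/client_golang/prometheus"

var (
	outboxUnprocessed = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "outbox_unprocessed_count",
		Help: "Current number of unprocessed outbox events.",
	})
	outboxLag = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "outbox_processing_lag_seconds",
		Help: "Age in seconds of the oldest unprocessed outbox event.",
	})
	outboxPublished = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "outbox_events_published_total",
		Help: "Outbox events published, labeled by outcome.",
	}, []string{"status"})
)

func init() {
	prometheus.MustRegister(outboxUnprocessed, outboxLag, outboxPublished)
}

// Inside the relay worker's poll loop (illustrative):
//   outboxUnprocessed.Set(float64(backlog))
//   outboxLag.Set(time.Since(oldest.CreatedAt).Seconds())
//   outboxPublished.WithLabelValues("success").Inc() // or "error"
```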
### Recommended Alerts

```yaml
# High backlog
- alert: OutboxBacklogHigh
  expr: outbox_unprocessed_count > 10000
  for: 10m

# Processing lag
- alert: OutboxProcessingLag
  expr: outbox_processing_lag_seconds > 60
  for: 5m

# DLQ growth
- alert: OutboxDLQGrowing
  expr: increase(outbox_dlq_size[1h]) > 100
```
### Observability Best Practices
- Correlation IDs: Include request_id in logs for tracing (see the sketch after this list)
- Distributed tracing: Trace event flow (API → Outbox → Broker → Consumer)
- Dashboard: Grafana dashboard showing outbox health metrics
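A sketch of the correlation-ID idea: carry the caller's request ID into the event headers so logs can be joined across the whole flow. The context key, helper names, and header names are assumptions, and the middleware that sets the request ID is assumed to exist:

```go
import (
	"context"
	"strconv"
)

type ctxKey string

const requestIDKey ctxKey = "request_id"

// requestIDFromContext extracts the request ID placed in the context by
// (assumed) HTTP middleware.
func requestIDFromContext(ctx context.Context) (string, bool) {
	id, ok := ctx.Value(requestIDKey).(string)
	return id, ok
}

// outboxHeaders builds broker headers that carry the correlation ID along
// with the stable event ID, so a log line in the consumer can be tied back
// to the originating API request.
func outboxHeaders(ctx context.Context, eventType string, outboxID int64) map[string]string {
	h := map[string]string{
		"event_type": eventType,
		"event_id":   strconv.FormatInt(outboxID, 10),
	}
	if reqID, ok := requestIDFromContext(ctx); ok {
		h["request_id"] = reqID // correlation ID for tracing/logging
	}
	return h
}
```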
## 11. Example References

### Real-World Implementations

- ADR-023: Booking Cancellation (this codebase)
  - See /docs/examples/example-adr.md
  - Cancels a booking and publishes a `BookingCancelled` event
- Order Service: Order placement
  - Writes the order to the `orders` table
  - Writes an `OrderPlaced` event to the outbox
  - Triggers inventory reservation and payment processing
- User Service: User registration
  - Writes the user to the `users` table
  - Writes a `UserRegistered` event to the outbox
  - Triggers welcome email and analytics tracking
## 12. Alternatives Considered

### Alternative 1: Direct Broker Publish (No Outbox)
Approach: Publish to broker immediately after database write
Why Rejected: No atomicity guarantee - event can be lost if broker is down
### Alternative 2: Two-Phase Commit (2PC)

Approach: Distributed transaction across database and broker

Why Rejected:

- Most brokers don't support 2PC
- High latency and complexity
- Single point of failure
### Alternative 3: Saga Pattern
Approach: Compensating transactions for rollback
Why Rejected: Overkill for simple event publishing (Saga better for multi-step workflows)
## 13. Further Reading

### Academic Papers
- "Making Reliable Distributed Systems in the Presence of Software Errors" - Joe Armstrong (Erlang)
### Articles
- Transactional Outbox Pattern - Chris Richardson
- The Outbox Pattern - Debezium Blog
- Achieving Exactly-Once Delivery - Confluent
### Tools
- Debezium - CDC-based outbox pattern
- Kafka Connect - Connector framework
- Outbox Pattern Library (Java) - Reference implementation
## 14. Version History
| Version | Date | Changes | Author |
|---|---|---|---|
| 1.0 | 2024-01-25 | Initial playbook entry | Jane Smith |
Pattern Status: Proven in Production
Recommended: Yes
Review Date: 2025-01-25