
Production Readiness Explanation

This document explains the comprehensive considerations for deploying ComponentActor systems to production environments. It provides context and rationale for production deployment decisions, monitoring strategies, performance tuning, and operational best practices.

Production Deployment Considerations

Why Production Deployment Differs from Development

Development Environment:

  • Single node, local execution
  • Limited concurrency (10-100 operations)
  • Forgiving error handling (panics visible in console)
  • Manual restarts acceptable
  • No performance requirements

Production Environment:

  • Distributed deployment (multiple nodes)
  • High concurrency (1000+ components, millions of messages)
  • Resilient error handling (automatic recovery required)
  • Updates applied during runtime without system restart
  • Strict performance SLAs (P99 latency < 100ms)

Key Differences:

  1. Scale: Production handles 10-100x more load
  2. Reliability: Production requires 99.9%+ uptime
  3. Observability: Production needs comprehensive monitoring
  4. Security: Production enforces strict capability-based security
  5. Operations: Production requires deployment automation and rollback capability

Architecture for Production

Production ComponentActor System
├─ Load Balancer (traffic distribution)
├─ ActorSystem Cluster (multiple nodes)
│   ├─ Node 1: 100 components
│   ├─ Node 2: 100 components
│   └─ Node 3: 100 components
├─ Shared Registry (component discovery)
├─ Monitoring System (Prometheus + Grafana)
├─ Logging Aggregation (structured logs)
└─ Distributed Tracing (request flow analysis)

Design Rationale:

  • Multiple Nodes: Horizontal scalability and fault tolerance
  • Shared Registry: cross-node component discovery with O(1) local lookup (36ns measured in Task 6.2)
  • Monitoring: Real-time performance tracking against baselines
  • Logging: Centralized debugging and audit trails
  • Tracing: End-to-end request flow visibility
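
A sketch of how this topology might be captured as per-node configuration; every name below is illustrative rather than an existing airssys-wasm API:

// Illustrative per-node configuration mirroring the topology above
// (hypothetical fields, not an airssys-wasm type)
struct NodeConfig {
    node_id: String,
    max_components: usize,         // 100 per node in the diagram above
    registry_endpoint: String,     // shared registry for cross-node discovery
    metrics_listen_addr: String,   // scraped by Prometheus
    trace_collector_addr: String,  // Jaeger/Zipkin collector
}

fn node_1() -> NodeConfig {
    NodeConfig {
        node_id: "node-1".to_string(),
        max_components: 100,
        registry_endpoint: "registry.internal:7000".to_string(),
        metrics_listen_addr: "0.0.0.0:9090".to_string(),
        trace_collector_addr: "jaeger.internal:6831".to_string(),
    }
}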

Monitoring and Observability

Why Comprehensive Monitoring is Critical

Without Monitoring:

  • Performance degradation unnoticed until user complaints
  • Errors silently accumulate, causing cascading failures
  • Resource leaks go undetected, leading to crashes
  • No data for capacity planning or optimization

With Monitoring:

  • Early detection of performance regressions (P99 latency trending up)
  • Proactive alerting before user impact (component spawn > 1ms)
  • Data-driven capacity planning (current utilization vs capacity)
  • Evidence-based optimization (measure before/after improvements)

Three Pillars of Observability

1. Metrics (What is happening?)

Track quantitative performance data:

use prometheus::{register_histogram_with_registry, Registry};

// Application-wide metrics registry (created once at startup)
let registry = Registry::new();

// Component lifecycle metrics
let spawn_duration = register_histogram_with_registry!(
    "component_spawn_duration_seconds",
    "Time to spawn component",
    vec![0.0001, 0.0005, 0.001, 0.005, 0.01],  // Buckets: 100µs to 10ms
    registry
)?;

// Baseline: 286ns (Task 6.2 actor_lifecycle_benchmarks.rs)
// Alert: > 1ms P99 (3,496x degradation from baseline)

Key Metrics to Track:

| Category  | Metric               | Baseline      | Alert Threshold        | Source                        |
|-----------|----------------------|---------------|------------------------|-------------------------------|
| Lifecycle | Component spawn      | 286ns         | >1ms P99               | actor_lifecycle_benchmarks.rs |
| Lifecycle | Full lifecycle       | 1.49µs        | >10µs P99              | actor_lifecycle_benchmarks.rs |
| Messaging | Message routing      | 1.05µs        | >100µs P99             | messaging_benchmarks.rs       |
| Messaging | Throughput           | 6.12M msg/sec | <100k msg/sec          | messaging_benchmarks.rs       |
| Messaging | Request-response     | 3.18µs        | >1ms P99               | messaging_benchmarks.rs       |
| Messaging | Pub-sub fanout (100) | 85.2µs        | >1ms P99               | messaging_benchmarks.rs       |
| Registry  | Lookup time          | 36ns O(1)     | >1µs P99               | scalability_benchmarks.rs     |
| System    | Active components    | -             | >1000 (capacity limit) | -                             |
| System    | Memory usage         | -             | >80% of limit          | -                             |
| System    | CPU usage            | -             | >80% of cores          | -                             |
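
Usage sketch for the first row: time each spawn and feed the spawn_duration histogram defined earlier so Prometheus can derive P99 from the buckets (actor_system.spawn_component is an assumed API, shown only to place the measurement):

use std::time::Instant;

let start = Instant::now();
let component = actor_system.spawn_component(component_config).await?;  // assumed API
spawn_duration.observe(start.elapsed().as_secs_f64());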

2. Logging (What happened?)

Capture structured event logs:

use tracing::{info, warn, error};

// Lifecycle events
info!(
    component_id = %component_id,
    duration_ns = spawn_duration.as_nanos(),
    "Component spawned"
);

// Error events
error!(
    component_id = %component_id,
    error = %err,
    "Component spawn failed"
);

// Security events
warn!(
    component_id = %component_id,
    capability = %requested_capability,
    "Capability violation detected"
);

Log Levels in Production:

  • ERROR: Failures requiring attention (spawn failures, capability violations)
  • WARN: Degraded conditions (slow spawns, high error rates)
  • INFO: Normal operations (component started, message sent)
  • DEBUG: Detailed troubleshooting (disabled in production by default)
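
One way to apply these levels in production is at subscriber initialization; a minimal sketch assuming tracing-subscriber with the json and env-filter features enabled:

use tracing_subscriber::EnvFilter;

// JSON-formatted structured logs, INFO and above by default;
// DEBUG can be enabled temporarily via RUST_LOG=debug
tracing_subscriber::fmt()
    .json()
    .with_env_filter(
        EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")),
    )
    .init();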

3. Tracing (How did it happen?)

Track request flow across components:

use tracing::instrument;

// #[instrument] creates a span per call, records the listed fields, and
// handles enter/exit automatically (including across await points).
#[instrument(
    skip(self, context),
    fields(
        component_id = %context.component_id,
        message_type = std::any::type_name::<Self::Message>(),
    )
)]
async fn handle_message(
    &mut self,
    message: Self::Message,
    context: &ActorContext,
) -> Result<(), Self::Error> {
    // Message processing (automatically traced)
    Ok(())
}

Tracing Benefits:

  • Identify bottlenecks in multi-component pipelines
  • Measure end-to-end latency (ingress → processing → egress)
  • Correlate errors across component boundaries
  • Visualize request flow (Jaeger, Zipkin)
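
To illustrate end-to-end visibility, a parent span can wrap a multi-component pipeline so each stage's span appears as a child in Jaeger or Zipkin; the stage functions below are hypothetical stand-ins for component calls:

use tracing::{info_span, instrument, Instrument};

#[instrument]
async fn ingress(request_id: u64) { /* parse and validate */ }

#[instrument]
async fn processing(request_id: u64) { /* business logic */ }

#[instrument]
async fn egress(request_id: u64) { /* write response */ }

async fn run_pipeline(request_id: u64) {
    // Parent span: the three stage spans above become its children
    let pipeline = info_span!("request_pipeline", request_id);

    async {
        ingress(request_id).await;
        processing(request_id).await;
        egress(request_id).await;
    }
    .instrument(pipeline)
    .await;
}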

Performance Tuning

Understanding Performance Baselines (Task 6.2)

Baseline Performance (macOS M1, 100 samples, 95% CI, measured in Task 6.2):

Lifecycle Operations:

  • Component construction: 286ns (2.65 million/sec capacity)
  • Full lifecycle (start+stop): 1.49µs
  • State access (read): 37ns
  • State access (write): 39ns

Messaging Operations:

  • Message routing: 1.05µs (952k msg/sec per component)
  • Request-response cycle: 3.18µs (314k req/sec per component)
  • Message throughput: 6.12 million msg/sec (system-wide)
  • Pub-sub fanout (100): 85.2µs (11,737 fanouts/sec)

Scalability:

  • Registry lookup: 36ns O(1) (constant from 10-1,000 components)
  • Component spawn rate: 2.65 million/sec
  • Concurrent operations (100): 120µs (833k ops/sec)

Implications for Production:

  • Single node can handle 1000+ components with O(1) lookup
  • Message throughput supports 6M msg/sec before bottleneck
  • Component spawn is nearly instantaneous (286ns)

Optimization Strategies

1. Component Spawn Optimization

Target: <500ns P99 (current: 286ns baseline)

Already Optimal - No optimization needed. Current performance exceeds target by 1.7x.

If degradation occurs (>500ns):

// Pre-allocate component pools (reduce allocation overhead)
pub struct ComponentPool {
    available: Vec<ComponentInstance>,
}

impl ComponentPool {
    pub async fn acquire(&mut self) -> ComponentInstance {
        // Reuse pre-allocated instance (avoids 286ns spawn)
        self.available.pop().unwrap_or_else(|| {
            ComponentInstance::new()  // Fallback to new allocation
        })
    }
}

2. Message Throughput Optimization

Target: >5M msg/sec (current: 6.12M baseline, exceeds target)

Optimization: Batch Message Processing

// Instead of processing one message at a time
for message in messages {
    process_message(message).await;  // Await per message
}

// Batch processing (reduce async overhead)
let futures: Vec<_> = messages.into_iter()
    .map(|msg| process_message(msg))
    .collect();

futures::future::join_all(futures).await;  // Parallel execution

Measured Impact:

  • Single message: 1.05µs per message
  • Batch of 100: ~105µs total (1.05µs per message maintained)
  • Benefit: Lower latency variance, higher throughput consistency

3. Registry Lookup Optimization

Target: <500ns (current: 36ns, already 13.8x better)

Already Optimal - HashMap-based registry achieves O(1) constant time (36ns from 10-1,000 components, validated in Task 6.2 scalability_benchmarks.rs).

Why it's fast:

use std::sync::Arc;
use dashmap::DashMap;

pub struct ComponentRegistry {
    components: Arc<DashMap<ComponentId, ComponentInstance>>,
}

impl ComponentRegistry {
    pub fn lookup(&self, component_id: &ComponentId) -> Option<ComponentInstance> {
        // O(1) HashMap lookup: ~36ns
        self.components.get(component_id).map(|entry| entry.value().clone())
    }
}

No optimization needed - Performance already exceptional.

When to Optimize (Data-Driven Approach)

Step 1: Measure Current Performance

# Run production benchmarks
cargo bench --bench actor_lifecycle_benchmarks
cargo bench --bench messaging_benchmarks
cargo bench --bench scalability_benchmarks

Step 2: Compare Against Baselines

  • Component spawn: Current vs 286ns baseline
  • Message throughput: Current vs 6.12M msg/sec baseline
  • Registry lookup: Current vs 36ns baseline

Step 3: Optimize Only If:

  • Current performance < 50% of baseline (e.g., spawn > 572ns)
  • Performance degrading over time (trending analysis)
  • SLA violations occurring (P99 latency > threshold)
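
A concrete reading of the first criterion: for a latency metric, "below 50% of baseline" means the operation now takes more than twice as long. A small helper makes the rule explicit (names are illustrative):

/// True when a latency measurement has fallen below 50% of baseline
/// performance, i.e. the operation takes more than twice as long.
fn below_half_of_baseline(baseline_ns: f64, current_ns: f64) -> bool {
    current_ns > baseline_ns * 2.0
}

fn main() {
    assert!(below_half_of_baseline(286.0, 600.0));   // spawn at 600ns (> 572ns): optimize
    assert!(!below_half_of_baseline(286.0, 400.0));  // 400ns: within tolerance
}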

Step 4: Validate Optimization

# Save a named baseline before optimizing
cargo bench --bench actor_lifecycle_benchmarks -- --save-baseline before_optimization

# ...apply the optimization, then re-run and compare against the saved baseline
cargo bench --bench actor_lifecycle_benchmarks -- --baseline before_optimization

# Compare results
# Expected: Performance improvement without regression in other areas

Troubleshooting Common Production Issues

Issue 1: High Lock Contention (State Access Bottlenecks)

Symptom:

  • State access latency > 100ns (baseline: 37-39ns)
  • Component message handling slowing down
  • CPU utilization low despite high load

Cause: Multiple components holding state locks for extended periods:

// ❌ BAD: Lock held across await point
let mut state = self.state.write().await;
let result = expensive_computation(&state).await;  // Lock held during await
state.update(result);

Solution: Minimize lock duration:

// ✅ GOOD: Lock held briefly
let data = {
    let state = self.state.read().await;
    state.data.clone()  // Clone needed data
}; // Lock released

let result = expensive_computation(&data).await;  // Await outside lock

{
    let mut state = self.state.write().await;
    state.update(result);
} // Lock released immediately

Validation:

  • State access returns to 37-39ns baseline
  • Message throughput returns to expected rate

Issue 2: Memory Leaks (Component Cleanup Issues)

Symptom:

  • Memory usage grows over time (never decreases)
  • Eventually OOM (Out of Memory) crash
  • Component count correct but memory usage high

Cause: Components not properly cleaned up on stop:

// ❌ BAD: Resources not released
impl Child for LeakyComponent {
    fn post_stop(&mut self, _context: &ChildContext) -> Result<(), ChildError> {
        // File handles, network connections not closed
        Ok(())
    }
}

Solution: Explicit cleanup in post_stop:

// ✅ GOOD: Explicit resource cleanup
impl Child for CleanComponent {
    fn post_stop(&mut self, context: &ChildContext) -> Result<(), ChildError> {
        // Close file handles
        if let Some(file) = self.file_handle.take() {
            drop(file);
        }

        // Close network connections (block_in_place requires the multi-threaded
        // Tokio runtime; it lets this synchronous hook drive the async close)
        if let Some(conn) = self.network_connection.take() {
            tokio::task::block_in_place(|| {
                let runtime = tokio::runtime::Handle::current();
                runtime.block_on(async {
                    conn.close().await.ok();
                });
            });
        }

        // Clear large data structures
        self.buffer.clear();
        self.buffer.shrink_to_fit();

        Ok(())
    }
}

Validation:

  • Memory usage stable over time
  • Memory drops after component stop
  • Use tools: heaptrack, valgrind --tool=massif

Issue 3: Message Queue Growth (Backpressure Handling)

Symptom:

  • Message queues growing unbounded
  • Latency increasing over time
  • Eventually OOM or timeout failures

Cause: Components receiving messages faster than processing:

// Message rate: 10k msg/sec
// Processing rate: 5k msg/sec
// Queue growth: +5k msg/sec (unbounded)

Solution: Implement backpressure:

use tokio::sync::mpsc;

// Bounded channel (backpressure via channel capacity)
let (tx, rx) = mpsc::channel(1000);  // Max 1000 queued messages

// Sender waits when the queue is full (backpressure applied)
tx.send(message).await?;  // Suspends until capacity frees up (does not block the thread)

// Alternative: Drop messages when overloaded
match tx.try_send(message) {
    Ok(()) => { /* Message queued */ }
    Err(mpsc::error::TrySendError::Full(_)) => {
        // Queue full - drop message and log
        tracing::warn!("Message dropped due to queue full");
    }
    Err(mpsc::error::TrySendError::Closed(_)) => { /* Receiver dropped; channel closed */ }
}

Validation:

  • Queue size bounded (monitored via metrics)
  • Latency stable under load
  • No OOM failures
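
To make the first validation point concrete, queue depth can be exported as a gauge; a sketch assuming the prometheus crate and the bounded channel above (the metric name and registry handle are illustrative):

use prometheus::{register_int_gauge_with_registry, Registry};

let registry = Registry::new();

// Gauge tracking how many of the channel's 1000 slots are occupied
let queue_depth = register_int_gauge_with_registry!(
    "component_message_queue_depth",
    "Messages currently queued for a component",
    registry
)?;

// Update periodically: used slots = bounded capacity - currently available permits
queue_depth.set((1000 - tx.capacity()) as i64);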

Security Considerations

WASM Sandboxing

ComponentActor leverages WebAssembly sandboxing for security:

Memory Isolation:

  • Each component has separate linear memory
  • Components cannot access host memory directly
  • Memory bounds checked by WASM runtime

Capability-Based Security:

  • Components granted explicit capabilities (file:read, network:outbound)
  • All system calls require capability check
  • Deny-by-default security model

Example:

use airssys_wasm::security::CapabilitySet;

// Component granted minimal capabilities
let capabilities = CapabilitySet::new()
    .with_file_read("/data/input")      // Only read from /data/input
    .with_network_outbound("api.example.com:443");  // Only call specific API

// Component attempts unauthorized access
let denied = component.read_file("/etc/passwd").await;  // ❌ Err: no matching capability

// Component attempts authorized access
let contents = component.read_file("/data/input/data.json").await?;  // ✅ Ok: capability granted

Threat Model:

  • Malicious Components: Assume components may be adversarial
  • Resource Exhaustion: Limit CPU time, memory, I/O per component
  • Data Exfiltration: Prevent unauthorized data access via capabilities
  • Privilege Escalation: Components cannot gain additional capabilities at runtime
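
For the resource-exhaustion and data-exfiltration items, the host needs explicit per-component budgets; the struct below is purely illustrative of what such limits cover and is not an existing airssys-wasm type:

// Hypothetical per-component resource budget enforced by the host runtime
struct ResourceBudget {
    max_linear_memory_bytes: u64,    // bounds WASM linear memory growth
    max_cpu_millis_per_message: u64, // bounds CPU time per handled message
    max_outbound_bytes_per_sec: u64, // throttles data leaving the component
}

let budget = ResourceBudget {
    max_linear_memory_bytes: 256 * 1024 * 1024,
    max_cpu_millis_per_message: 50,
    max_outbound_bytes_per_sec: 1 << 20,
};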

Audit Logging

Comprehensive audit logging for security and compliance:

use tracing::{info, warn};

// Successful operations
info!(
    component_id = %component_id,
    operation = "file_read",
    path = %path,
    timestamp = %chrono::Utc::now(),
    "File access granted"
);

// Capability violations
warn!(
    component_id = %component_id,
    operation = "network_outbound",
    attempted_host = %host,
    granted_capabilities = ?capabilities,
    timestamp = %chrono::Utc::now(),
    "Capability violation detected"
);

// Component lifecycle
info!(
    component_id = %component_id,
    event = "component_spawned",
    capabilities = ?capabilities,
    timestamp = %chrono::Utc::now(),
    "Component spawned with capabilities"
);

Audit Log Storage:

  • Structured logs (JSON format)
  • Centralized storage (Elasticsearch, Splunk)
  • Immutable (append-only)
  • Retention policy (e.g., 90 days for compliance)

Operational Best Practices

Deployment Patterns

Blue-Green Deployment:

Step 1: Deploy new version (Green) alongside old (Blue)
Step 2: Smoke test Green environment
Step 3: Switch traffic to Green
Step 4: Monitor metrics for 10 minutes
Step 5: Decommission Blue (or rollback if issues)

Benefits:

  • Deployment without system restart
  • Instant rollback capability (switch traffic back to Blue)
  • Parallel testing (smoke test before user traffic)

Canary Deployment:

Step 1: Deploy new version to 5% of nodes
Step 2: Monitor error rates and latency
Step 3: Gradually increase to 25%, 50%, 100%
Step 4: Rollback if metrics degrade

Benefits:

  • Gradual rollout minimizes blast radius
  • Early detection of issues (only 5% of users affected initially)
  • Data-driven rollout (metrics-based decision making)

Rollback Strategies

Automatic Rollback Triggers:

  • Error rate > 5% (baseline: <1%)
  • P99 latency > 100ms (baseline: 1-10µs)
  • Component crash rate > 1/minute
  • Health check failures > 50%
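
These triggers translate directly into a predicate over live metrics; a sketch with illustrative field names:

// Illustrative snapshot of the metrics the triggers above reference
struct HealthSnapshot {
    error_rate: f64,                 // fraction of failed requests (0.05 = 5%)
    p99_latency_ms: f64,
    component_crashes_per_min: f64,
    health_check_failure_rate: f64,  // fraction of failed health checks
}

fn should_rollback(h: &HealthSnapshot) -> bool {
    h.error_rate > 0.05
        || h.p99_latency_ms > 100.0
        || h.component_crashes_per_min > 1.0
        || h.health_check_failure_rate > 0.5
}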

Manual Rollback Process:

# Step 1: Revert to previous version
git checkout previous-release-tag

# Step 2: Rebuild binary
cargo build --release

# Step 3: Deploy previous version
kubectl apply -f deployment-previous.yaml

# Step 4: Verify health
curl http://production/health
# Expected: 200 OK with "status": "healthy"

# Step 5: Monitor metrics for 10 minutes
# Verify error rate, latency back to normal
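
The curl check above assumes the service exposes a health endpoint; a minimal sketch of one, assuming axum 0.7 (route path and payload mirror the expected response):

use axum::{routing::get, Json, Router};
use serde_json::json;

async fn health() -> Json<serde_json::Value> {
    Json(json!({ "status": "healthy" }))
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/health", get(health));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}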

Capacity Planning

Resource Requirements per Component

Small Component (Stateless):

  • Memory: 64-128 MB
  • CPU: 0.1-0.5 cores (10-50% of one core)
  • Message rate: 1k-10k msg/sec
  • Example: JSON parser, data transformer

Medium Component (Stateful):

  • Memory: 128-512 MB
  • CPU: 0.5-2 cores
  • Message rate: 10k-100k msg/sec
  • Example: Request handler, cache manager

Large Component (Data Processing):

  • Memory: 512 MB - 2 GB
  • CPU: 2-8 cores
  • Message rate: 100k-1M msg/sec
  • Example: Machine learning inference, video encoding

Node Capacity Calculation

Example: 16 GB RAM, 8 cores

Memory Capacity:

Total RAM: 16 GB
System overhead: 2 GB (ActorSystem, OS)
Available: 14 GB

Small components (128 MB avg): 14 GB / 0.128 GB = ~109 components
Medium components (256 MB avg): 14 GB / 0.256 GB = ~55 components
Large components (1 GB avg): 14 GB / 1 GB = ~14 components

CPU Capacity:

Total cores: 8
System overhead: 1 core (monitoring, logging)
Available: 7 cores

Small components (0.3 core avg): 7 / 0.3 = ~23 components
Medium components (1 core avg): 7 / 1 = ~7 components
Large components (4 cores avg): 7 / 4 = ~1 component
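
The same arithmetic as a reusable helper; the per-component figures are the averages assumed above:

/// Components a node can host, taking the tighter of the memory and CPU limits.
fn node_capacity(
    ram_gb: f64,
    cores: f64,
    ram_overhead_gb: f64,
    core_overhead: f64,
    gb_per_component: f64,
    cores_per_component: f64,
) -> usize {
    let by_memory = ((ram_gb - ram_overhead_gb) / gb_per_component).floor();
    let by_cpu = ((cores - core_overhead) / cores_per_component).floor();
    by_memory.min(by_cpu) as usize
}

fn main() {
    // Medium components on a 16 GB / 8-core node: min(54 by memory, 7 by CPU) = 7
    assert_eq!(node_capacity(16.0, 8.0, 2.0, 1.0, 0.256, 1.0), 7);
}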

Recommended Capacity:

  • Conservative: Use minimum of memory or CPU limit (avoid overcommitment)
  • Monitor utilization: Stay below 80% of capacity (headroom for bursts)

Horizontal Scaling Triggers

Scale Out (Add Nodes) When:

  • CPU utilization > 80% sustained for 5 minutes
  • Memory utilization > 80%
  • Message queue depth > 10,000
  • Component spawn latency > 1ms P99

Scale In (Remove Nodes) When:

  • CPU utilization < 40% sustained for 15 minutes
  • Memory utilization < 40%
  • Spare capacity > 50%

Summary

Production readiness requires comprehensive attention to:

  1. Monitoring: Track lifecycle, messaging, and system metrics against baselines
  2. Performance: Tune based on Task 6.2 benchmarks (6.12M msg/sec, 286ns spawn)
  3. Troubleshooting: Address lock contention, memory leaks, queue growth
  4. Security: Enforce capability-based security and audit logging
  5. Operations: Use blue-green or canary deployments with automatic rollback
  6. Capacity Planning: Calculate node capacity based on component resource needs

Production Readiness Validation:

  • ✅ Monitoring configured (metrics, logs, traces)
  • ✅ Performance meets SLAs (P99 < 100ms, throughput > 100k msg/sec)
  • ✅ Security enforced (capability-based, audit logging)
  • ✅ Deployment automated (blue-green or canary)
  • ✅ Rollback tested (automatic triggers configured)
  • ✅ Capacity planned (resource limits set, scaling triggers defined)

Performance Baseline: Task 6.2 benchmarks establish production baseline (6.12M msg/sec throughput, 286ns spawn, 36ns O(1) registry lookup). Monitor for degradation beyond 2x baseline.

Next Steps