Production Readiness Explanation¶
This document explains the considerations for deploying ComponentActor systems to production environments. It provides the context and rationale behind deployment decisions, monitoring strategies, performance tuning, and operational best practices.
Production Deployment Considerations¶
Why Production Deployment Differs from Development¶
Development Environment:
- Single node, local execution
- Limited concurrency (10-100 operations)
- Forgiving error handling (panics visible in console)
- Manual restarts acceptable
- No performance requirements
Production Environment:
- Distributed deployment (multiple nodes)
- High concurrency (1000+ components, millions of messages)
- Resilient error handling (automatic recovery required)
- Updates applied during runtime without system restart
- Strict performance SLAs (P99 latency < 100ms)
Key Differences:
1. Scale: Production handles 10-100x more load
2. Reliability: Production requires 99.9%+ uptime
3. Observability: Production needs comprehensive monitoring
4. Security: Production enforces strict capability-based security
5. Operations: Production requires deployment automation and rollback capability
Architecture for Production¶
Production ComponentActor System
├─ Load Balancer (traffic distribution)
├─ ActorSystem Cluster (multiple nodes)
│ ├─ Node 1: 100 components
│ ├─ Node 2: 100 components
│ └─ Node 3: 100 components
├─ Shared Registry (component discovery)
├─ Monitoring System (Prometheus + Grafana)
├─ Logging Aggregation (structured logs)
└─ Distributed Tracing (request flow analysis)
Design Rationale:
- Multiple Nodes: Horizontal scalability and fault tolerance
- Shared Registry: O(1) component lookup across nodes (36ns measured in Task 6.2)
- Monitoring: Real-time performance tracking against baselines
- Logging: Centralized debugging and audit trails
- Tracing: End-to-end request flow visibility
Monitoring and Observability¶
Why Comprehensive Monitoring is Critical¶
Without Monitoring:
- Performance degradation unnoticed until user complaints
- Errors silently accumulate, causing cascading failures
- Resource leaks go undetected, leading to crashes
- No data for capacity planning or optimization
With Monitoring:
- Early detection of performance regressions (P99 latency trending up)
- Proactive alerting before user impact (component spawn > 1ms)
- Data-driven capacity planning (current utilization vs capacity)
- Evidence-based optimization (measure before/after improvements)
Three Pillars of Observability¶
1. Metrics (What is happening?)
Track quantitative performance data:
use prometheus::{Registry, register_histogram_with_registry};

// Shared metrics registry (exposed to Prometheus via a /metrics endpoint)
let registry = Registry::new();

// Component lifecycle metrics
let spawn_duration = register_histogram_with_registry!(
    "component_spawn_duration_seconds",
    "Time to spawn component",
    vec![0.0001, 0.0005, 0.001, 0.005, 0.01], // Buckets: 100µs to 10ms
    registry
)?;
// Baseline: 286ns (Task 6.2 actor_lifecycle_benchmarks.rs)
// Alert: > 1ms P99 (3,496x degradation from baseline)
Key Metrics to Track:
| Category | Metric | Baseline | Alert Threshold | Source |
|---|---|---|---|---|
| Lifecycle | Component spawn | 286ns | >1ms P99 | actor_lifecycle_benchmarks.rs |
| Lifecycle | Full lifecycle | 1.49µs | >10µs P99 | actor_lifecycle_benchmarks.rs |
| Messaging | Message routing | 1.05µs | >100µs P99 | messaging_benchmarks.rs |
| Messaging | Throughput | 6.12M msg/sec | <100k msg/sec | messaging_benchmarks.rs |
| Messaging | Request-response | 3.18µs | >1ms P99 | messaging_benchmarks.rs |
| Messaging | Pub-sub fanout (100) | 85.2µs | >1ms P99 | messaging_benchmarks.rs |
| Registry | Lookup time | 36ns O(1) | >1µs P99 | scalability_benchmarks.rs |
| System | Active components | - | >1000 (capacity limit) | - |
| System | Memory usage | - | >80% of limit | - |
| System | CPU usage | - | >80% of cores | - |
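As a sketch of how the spawn metric above gets populated, the spawn path can time itself and record into the histogram registered earlier; the P99 comparison against the 1ms alert threshold then happens in the monitoring system (e.g., a Prometheus alerting rule over histogram_quantile(0.99, ...)), not in application code. The spawn_with_metrics wrapper below is illustrative, not part of the ComponentActor API.

```rust
use std::future::Future;
use std::time::Instant;
use prometheus::Histogram;

// Illustrative wrapper: times any spawn routine and records the duration
// (in seconds, matching the bucket units) into the spawn histogram.
async fn spawn_with_metrics<F, T, E>(spawn_duration: &Histogram, spawn: F) -> Result<T, E>
where
    F: Future<Output = Result<T, E>>,
{
    let start = Instant::now();
    let result = spawn.await;
    spawn_duration.observe(start.elapsed().as_secs_f64());
    result
}
```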
2. Logging (What happened?)
Capture structured event logs:
use tracing::{info, warn, error};
// Lifecycle events
info!(
component_id = %component_id,
duration_ns = spawn_duration.as_nanos(),
"Component spawned"
);
// Error events
error!(
component_id = %component_id,
error = %err,
"Component spawn failed"
);
// Security events
warn!(
component_id = %component_id,
capability = %requested_capability,
"Capability violation detected"
);
Log Levels in Production:
- ERROR: Failures requiring attention (spawn failures, capability violations)
- WARN: Degraded conditions (slow spawns, high error rates)
- INFO: Normal operations (component started, message sent)
- DEBUG: Detailed troubleshooting (disabled in production by default)
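A minimal subscriber setup that enforces these levels is sketched below, assuming the tracing-subscriber crate with its env-filter feature; defaulting to info keeps DEBUG disabled in production unless an operator overrides it via RUST_LOG.

```rust
use tracing_subscriber::EnvFilter;

fn init_production_logging() {
    // Default to INFO; operators can enable DEBUG per module via RUST_LOG
    // (e.g., RUST_LOG=airssys_wasm=debug) without rebuilding.
    let filter = EnvFilter::try_from_default_env()
        .unwrap_or_else(|_| EnvFilter::new("info"));

    tracing_subscriber::fmt()
        .with_env_filter(filter)
        .with_target(true) // include the module path for filtering
        .init();
}
```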
3. Tracing (How did it happen?)
Track request flow across components:
use tracing::instrument;

// #[instrument] creates and enters a span for the entire async fn.
// (Manually creating a span and holding its enter() guard across .await
// points is incorrect in async code, so the attribute form is preferred.
// skip_all avoids logging message payloads and requires no Debug bounds.)
#[instrument(
    skip_all,
    fields(
        component_id = %context.component_id,
        message_type = std::any::type_name::<Self::Message>(),
    )
)]
async fn handle_message(
    &mut self,
    message: Self::Message,
    context: &ActorContext,
) -> Result<(), Self::Error> {
    // Message processing (every event emitted here is attached to the span)
    Ok(())
}
Tracing Benefits:
- Identify bottlenecks in multi-component pipelines
- Measure end-to-end latency (ingress → processing → egress)
- Correlate errors across component boundaries
- Visualize request flow (Jaeger, Zipkin)
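One practical detail behind these benefits: when a component hands work to a spawned task, the current span does not follow automatically. A sketch using tracing::Instrument to carry the span across that boundary (forward_to_worker and its payload are illustrative names):

```rust
use tracing::{info_span, Instrument};

async fn forward_to_worker(payload: Vec<u8>) {
    // Attach the downstream work to an explicit span so the hop appears
    // as part of one contiguous request flow in Jaeger/Zipkin views.
    let span = info_span!("worker_processing", bytes = payload.len());
    tokio::spawn(
        async move {
            tracing::info!(len = payload.len(), "processing payload");
            // ... actual processing ...
        }
        .instrument(span),
    );
}
```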
Performance Tuning¶
Understanding Performance Baselines (Task 6.2)¶
Baseline Performance (macOS M1, 100 samples, 95% CI, measured in Task 6.2):
Lifecycle Operations:
- Component construction: 286ns (2.65 million/sec capacity)
- Full lifecycle (start+stop): 1.49µs
- State access (read): 37ns
- State access (write): 39ns
Messaging Operations:
- Message routing: 1.05µs (952k msg/sec per component)
- Request-response cycle: 3.18µs (314k req/sec per component)
- Message throughput: 6.12 million msg/sec (system-wide)
- Pub-sub fanout (100): 85.2µs (11,737 fanouts/sec)
Scalability:
- Registry lookup: 36ns O(1) (constant from 10-1,000 components)
- Component spawn rate: 2.65 million/sec
- Concurrent operations (100): 120µs (833k ops/sec)
Implications for Production:
- Single node can handle 1000+ components with O(1) lookup
- Message throughput supports 6M msg/sec before bottleneck
- Component spawn is nearly instantaneous (286ns)
Optimization Strategies¶
1. Component Spawn Optimization
Target: <500ns P99 (current: 286ns baseline)
Already Optimal - No optimization needed. Current performance exceeds target by 1.7x.
If degradation occurs (>500ns):
// Pre-allocate a component pool (reduces allocation overhead)
pub struct ComponentPool {
    available: Vec<ComponentInstance>,
}

impl ComponentPool {
    pub fn acquire(&mut self) -> ComponentInstance {
        // Reuse a pre-allocated instance (skips the 286ns spawn path);
        // fall back to a fresh allocation when the pool is empty.
        self.available.pop().unwrap_or_else(ComponentInstance::new)
    }

    pub fn release(&mut self, instance: ComponentInstance) {
        // Return the instance to the pool for reuse.
        self.available.push(instance);
    }
}
2. Message Throughput Optimization
Target: >5M msg/sec (current: 6.12M baseline, exceeds target)
Optimization: Batch Message Processing
// ❌ Sequential: one await per message
for message in messages {
    process_message(message).await;
}

// ✅ Batched: collect the futures, then drive them concurrently
let futures: Vec<_> = messages.into_iter()
    .map(|msg| process_message(msg))
    .collect();
futures::future::join_all(futures).await; // Concurrent execution (futures crate)
Measured Impact:
- Single message: 1.05µs per message
- Batch of 100: ~105µs total (1.05µs per message maintained)
- Benefit: Lower latency variance, higher throughput consistency
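If unbounded join_all fan-out becomes a concern (very large batches hold every future in flight at once), a bounded variant using buffer_unordered from the futures crate caps concurrency. This sketch reuses the illustrative process_message and Message names from the snippet above; the limit of 64 is an arbitrary starting point to tune against your own benchmarks.

```rust
use futures::stream::{self, StreamExt};

async fn process_batch(messages: Vec<Message>) {
    // At most 64 messages are in flight at once; each completion
    // frees a slot for the next message in the batch.
    stream::iter(messages)
        .map(|msg| process_message(msg))
        .buffer_unordered(64)
        .collect::<Vec<_>>()
        .await;
}
```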
3. Registry Lookup Optimization
Target: <500ns (current: 36ns, already 13.8x better)
Already Optimal - HashMap-based registry achieves O(1) constant time (36ns from 10-1,000 components, validated in Task 6.2 scalability_benchmarks.rs).
Why it's fast:
use std::sync::Arc;
use dashmap::DashMap;

pub struct ComponentRegistry {
    components: Arc<DashMap<ComponentId, ComponentInstance>>,
}

impl ComponentRegistry {
    pub fn lookup(&self, component_id: &ComponentId) -> Option<ComponentInstance> {
        // O(1) lookup in a sharded concurrent HashMap: ~36ns
        self.components.get(component_id).map(|entry| entry.value().clone())
    }
}
No optimization needed - Performance already exceptional.
When to Optimize (Data-Driven Approach)¶
Step 1: Measure Current Performance
# Run production benchmarks
cargo bench --bench actor_lifecycle_benchmarks
cargo bench --bench messaging_benchmarks
cargo bench --bench scalability_benchmarks
Step 2: Compare Against Baselines
- Component spawn: current vs 286ns baseline
- Message throughput: current vs 6.12M msg/sec baseline
- Registry lookup: current vs 36ns baseline
Step 3: Optimize Only If:
- Current performance is below 50% of baseline (e.g., spawn time > 572ns, i.e., more than 2x the 286ns baseline)
- Performance degrading over time (trending analysis)
- SLA violations occurring (P99 latency > threshold)
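The 50% rule is easy to encode as a regression gate run after the benchmarks, for example in CI. The check below is an illustrative sketch (not existing project tooling) using the Task 6.2 baselines:

```rust
/// True when the measured value has degraded beyond 2x baseline
/// (i.e., current performance is below 50% of baseline).
fn regression_detected(current_ns: f64, baseline_ns: f64) -> bool {
    current_ns > baseline_ns * 2.0
}

fn any_baseline_regressed(spawn_p99_ns: f64, routing_p99_ns: f64, lookup_p99_ns: f64) -> bool {
    // Task 6.2 baselines: 286ns spawn, 1.05µs routing, 36ns registry lookup
    regression_detected(spawn_p99_ns, 286.0)
        || regression_detected(routing_p99_ns, 1_050.0)
        || regression_detected(lookup_p99_ns, 36.0)
}
```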
Step 4: Validate Optimization
# Re-run benchmarks after optimization
cargo bench --bench actor_lifecycle_benchmarks -- --baseline before_optimization
# Compare results
# Expected: Performance improvement without regression in other areas
Troubleshooting Common Production Issues¶
Issue 1: High Lock Contention (State Access Bottlenecks)¶
Symptom:
- State access latency > 100ns (baseline: 37-39ns)
- Component message handling slowing down
- CPU utilization low despite high load
Cause: Multiple components holding state locks for extended periods:
// ❌ BAD: Lock held across await point
let mut state = self.state.write().await;
let result = expensive_computation(&state).await; // Lock held during await
state.update(result);
Solution: Minimize lock duration:
// ✅ GOOD: Lock held briefly
let data = {
let state = self.state.read().await;
state.data.clone() // Clone needed data
}; // Lock released
let result = expensive_computation(&data).await; // Await outside lock
{
let mut state = self.state.write().await;
state.update(result);
} // Lock released immediately
Validation:
- State access returns to 37-39ns baseline
- Message throughput returns to expected rate
Issue 2: Memory Leaks (Component Cleanup Issues)¶
Symptom:
- Memory usage grows over time (never decreases)
- Eventually OOM (Out of Memory) crash
- Component count correct but memory usage high
Cause: Components not properly cleaned up on stop:
// ❌ BAD: Resources not released
impl Child for LeakyComponent {
fn post_stop(&mut self, _context: &ChildContext) -> Result<(), ChildError> {
// File handles, network connections not closed
Ok(())
}
}
Solution: Explicit cleanup in post_stop:
// ✅ GOOD: Explicit resource cleanup
impl Child for CleanComponent {
fn post_stop(&mut self, context: &ChildContext) -> Result<(), ChildError> {
// Close file handles
if let Some(file) = self.file_handle.take() {
drop(file);
}
// Close network connections
if let Some(conn) = self.network_connection.take() {
// post_stop is synchronous, so drive the async close to completion here.
// Note: block_in_place requires the multi-threaded Tokio runtime.
tokio::task::block_in_place(|| {
let runtime = tokio::runtime::Handle::current();
runtime.block_on(async {
conn.close().await.ok();
});
});
}
// Clear large data structures
self.buffer.clear();
self.buffer.shrink_to_fit();
Ok(())
}
}
Validation:
- Memory usage stable over time
- Memory drops after component stop
- Use memory profiling tools: heaptrack, valgrind --tool=massif
Issue 3: Message Queue Growth (Backpressure Handling)¶
Symptom:
- Message queues growing unbounded
- Latency increasing over time
- Eventually OOM or timeout failures
Cause: Components receiving messages faster than processing:
// Message rate: 10k msg/sec
// Processing rate: 5k msg/sec
// Queue growth: +5k msg/sec (unbounded)
Solution: Implement backpressure:
use tokio::sync::mpsc;
// Bounded channel (backpressure via channel capacity)
let (tx, _rx) = mpsc::channel(1000); // At most 1000 queued messages (_rx feeds the component's receive loop)

// Sender waits when the queue is full (backpressure applied)
tx.send(message).await?; // Suspends the task (not the thread) until capacity frees up

// Alternative: shed load instead of waiting when overloaded
match tx.try_send(message) {
    Ok(()) => { /* Message queued */ }
    Err(mpsc::error::TrySendError::Full(_)) => {
        // Queue full - drop the message and log
        tracing::warn!("Message dropped: queue at capacity");
    }
    Err(mpsc::error::TrySendError::Closed(_)) => { /* Receiver dropped; channel closed */ }
}
Validation:
- Queue size bounded (monitored via metrics; see the gauge sketch below)
- Latency stable under load
- No OOM failures
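To make "queue size bounded" observable, a gauge can track the bounded channel's occupancy: tokio's Sender::capacity() reports remaining slots and max_capacity() the configured bound, so depth is the difference. The metric name and Message type below are illustrative.

```rust
use prometheus::{register_int_gauge, IntGauge};
use tokio::sync::mpsc;

fn register_depth_gauge() -> Result<IntGauge, prometheus::Error> {
    // Registered once at startup against the default registry.
    register_int_gauge!(
        "component_mailbox_depth",
        "Messages queued and awaiting processing"
    )
}

fn record_queue_depth(tx: &mpsc::Sender<Message>, depth_gauge: &IntGauge) {
    // Depth = configured bound minus currently available permits.
    let depth = tx.max_capacity() - tx.capacity();
    depth_gauge.set(depth as i64);
}
```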
Security Considerations¶
WASM Sandboxing¶
ComponentActor leverages WebAssembly sandboxing for security:
Memory Isolation:
- Each component has separate linear memory
- Components cannot access host memory directly
- Memory bounds checked by WASM runtime
Capability-Based Security:
- Components granted explicit capabilities (file:read, network:outbound)
- All system calls require capability check
- Deny-by-default security model
Example:
use airssys_wasm::security::CapabilitySet;
// Component granted minimal capabilities
let capabilities = CapabilitySet::new()
.with_file_read("/data/input") // Only read from /data/input
.with_network_outbound("api.example.com:443"); // Only call specific API
// Unauthorized access is rejected by the capability check
assert!(component.read_file("/etc/passwd").await.is_err()); // ❌ Denied - no capability for this path
// Authorized access succeeds
component.read_file("/data/input/data.json").await?; // ✅ Allowed
Threat Model:
- Malicious Components: Assume components may be adversarial
- Resource Exhaustion: Limit CPU time, memory, I/O per component (see the limits sketch after this list)
- Data Exfiltration: Prevent unauthorized data access via capabilities
- Privilege Escalation: Components cannot gain additional capabilities at runtime
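In practice, the resource-exhaustion item maps to per-component limits enforced at spawn time. The ResourceLimits struct below is a hypothetical illustration of that shape; the field names and defaults are not the actual airssys-wasm API.

```rust
use std::time::Duration;

// Hypothetical per-component resource budget; names and values are illustrative.
pub struct ResourceLimits {
    pub max_memory_bytes: usize,   // hard cap on WASM linear memory
    pub max_cpu_time: Duration,    // cumulative execution budget
    pub max_messages_per_sec: u32, // messaging / I/O rate limit
}

impl Default for ResourceLimits {
    fn default() -> Self {
        Self {
            max_memory_bytes: 128 * 1024 * 1024, // 128 MB (small component profile)
            max_cpu_time: Duration::from_secs(30),
            max_messages_per_sec: 10_000,
        }
    }
}
```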
Audit Logging¶
Comprehensive audit logging for security and compliance:
use tracing::{info, warn};
// Successful operations
info!(
component_id = %component_id,
operation = "file_read",
path = %path,
timestamp = %chrono::Utc::now(),
"File access granted"
);
// Capability violations
warn!(
component_id = %component_id,
operation = "network_outbound",
attempted_host = %host,
granted_capabilities = ?capabilities,
timestamp = %chrono::Utc::now(),
"Capability violation detected"
);
// Component lifecycle
info!(
component_id = %component_id,
event = "component_spawned",
capabilities = ?capabilities,
timestamp = %chrono::Utc::now(),
"Component spawned with capabilities"
);
Audit Log Storage:
- Structured logs (JSON format; see the subscriber sketch below)
- Centralized storage (Elasticsearch, Splunk)
- Immutable (append-only)
- Retention policy (e.g., 90 days for compliance)
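A sketch of producing the structured JSON logs listed above, assuming tracing-subscriber with its json feature (this replaces the plain formatter shown earlier, since only one global subscriber can be installed); shipping to Elasticsearch or Splunk is then the log forwarder's job, not application code:

```rust
fn init_audit_logging() {
    // JSON-formatted events are what centralized stores index, search,
    // and retain under the append-only policy.
    tracing_subscriber::fmt()
        .json()
        .with_current_span(true) // include the active span's fields
        .with_target(true)
        .init();
}
```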
Operational Best Practices¶
Deployment Patterns¶
Blue-Green Deployment:
Step 1: Deploy new version (Green) alongside old (Blue)
Step 2: Smoke test Green environment
Step 3: Switch traffic to Green
Step 4: Monitor metrics for 10 minutes
Step 5: Decommission Blue (or rollback if issues)
Benefits:
- Deployment without system restart
- Instant rollback capability (switch traffic back to Blue)
- Parallel testing (smoke test before user traffic)
Canary Deployment:
Step 1: Deploy new version to 5% of nodes
Step 2: Monitor error rates and latency
Step 3: Gradually increase to 25%, 50%, 100%
Step 4: Rollback if metrics degrade
Benefits:
- Gradual rollout minimizes blast radius
- Early detection of issues (only 5% of users affected initially)
- Data-driven rollout (metrics-based decision making)
Rollback Strategies¶
Automatic Rollback Triggers:
- Error rate > 5% (baseline: <1%)
- P99 latency > 100ms (baseline: 1-10µs)
- Component crash rate > 1/minute
- Health check failures > 50%
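In code form, the automatic triggers reduce to a predicate the deployment controller evaluates against a metrics snapshot; the HealthSnapshot struct is an illustrative stand-in for whatever your metrics query returns.

```rust
// Illustrative snapshot of the metrics behind the triggers above.
pub struct HealthSnapshot {
    pub error_rate: f64,                // fraction of failed requests (0.0-1.0)
    pub p99_latency_ms: f64,
    pub crashes_per_minute: f64,
    pub health_check_failure_rate: f64, // fraction of failing probes (0.0-1.0)
}

pub fn should_rollback(s: &HealthSnapshot) -> bool {
    s.error_rate > 0.05                      // error rate > 5%
        || s.p99_latency_ms > 100.0          // P99 latency > 100ms
        || s.crashes_per_minute > 1.0        // crash rate > 1/minute
        || s.health_check_failure_rate > 0.5 // health check failures > 50%
}
```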
Manual Rollback Process:
# Step 1: Revert to previous version
git checkout previous-release-tag
# Step 2: Rebuild binary
cargo build --release
# Step 3: Deploy previous version
kubectl apply -f deployment-previous.yaml
# Step 4: Verify health
curl http://production/health
# Expected: 200 OK with "status": "healthy"
# Step 5: Monitor metrics for 10 minutes
# Verify error rate, latency back to normal
Capacity Planning¶
Resource Requirements per Component¶
Small Component (Stateless):
- Memory: 64-128 MB
- CPU: 0.1-0.5 cores (10-50% of one core)
- Message rate: 1k-10k msg/sec
- Example: JSON parser, data transformer
Medium Component (Stateful):
- Memory: 128-512 MB
- CPU: 0.5-2 cores
- Message rate: 10k-100k msg/sec
- Example: Request handler, cache manager
Large Component (Data Processing):
- Memory: 512 MB - 2 GB
- CPU: 2-8 cores
- Message rate: 100k-1M msg/sec
- Example: Machine learning inference, video encoding
Node Capacity Calculation¶
Example: 16 GB RAM, 8 cores
Memory Capacity:
Total RAM: 16 GB
System overhead: 2 GB (ActorSystem, OS)
Available: 14 GB
Small components (128 MB avg): 14 GB / 0.128 GB = ~109 components
Medium components (256 MB avg): 14 GB / 0.256 GB = ~55 components
Large components (1 GB avg): 14 GB / 1 GB = ~14 components
CPU Capacity:
Total cores: 8
System overhead: 1 core (monitoring, logging)
Available: 7 cores
Small components (0.3 core avg): 7 / 0.3 = ~23 components
Medium components (1 core avg): 7 / 1 = ~7 components
Large components (4 cores avg): 7 / 4 = ~1 component
Recommended Capacity:
- Conservative: Use minimum of memory or CPU limit (avoid overcommitment)
- Monitor utilization: Stay below 80% of capacity (headroom for bursts)
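The same arithmetic as the worked example, as a small helper; the conservative rule takes the minimum of the memory-bound and CPU-bound counts, and operators should then stay below 80% of the result per the guidance above.

```rust
/// Components a node can host for a given per-component resource profile.
fn node_capacity(
    total_ram_gb: f64,
    total_cores: f64,
    component_ram_gb: f64,
    component_cores: f64,
) -> usize {
    let available_ram = total_ram_gb - 2.0;  // 2 GB system overhead (ActorSystem, OS)
    let available_cores = total_cores - 1.0; // 1 core for monitoring/logging
    let by_memory = available_ram / component_ram_gb;
    let by_cpu = available_cores / component_cores;
    // Conservative: whichever resource runs out first sets the limit.
    by_memory.min(by_cpu).floor() as usize
}

// Example: 16 GB / 8 core node, medium components (256 MB, 1 core):
// node_capacity(16.0, 8.0, 0.256, 1.0) -> min(54.7, 7.0) -> 7 components
```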
Horizontal Scaling Triggers¶
Scale Out (Add Nodes) When:
- CPU utilization > 80% sustained for 5 minutes
- Memory utilization > 80%
- Message queue depth > 10,000
- Component spawn latency > 1ms P99
Scale In (Remove Nodes) When:
- CPU utilization < 40% sustained for 15 minutes
- Memory utilization < 40%
- Spare capacity > 50%
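The triggers above can be expressed as a single decision function that an autoscaler evaluates; tracking the sustained windows (5 and 15 minutes) is assumed to happen wherever utilization samples are aggregated. The types below are illustrative.

```rust
pub enum ScaleAction {
    ScaleOut,
    ScaleIn,
    Hold,
}

pub struct NodeUtilization {
    pub cpu: f64,    // 0.0-1.0, sustained average over the trigger window
    pub memory: f64, // 0.0-1.0
    pub queue_depth: usize,
    pub spawn_p99_ms: f64,
}

pub fn scaling_decision(u: &NodeUtilization) -> ScaleAction {
    if u.cpu > 0.8 || u.memory > 0.8 || u.queue_depth > 10_000 || u.spawn_p99_ms > 1.0 {
        ScaleAction::ScaleOut
    } else if u.cpu < 0.4 && u.memory < 0.4 {
        // Both utilizations under 40% implies spare capacity above 50%.
        ScaleAction::ScaleIn
    } else {
        ScaleAction::Hold
    }
}
```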
Summary¶
Production readiness requires comprehensive attention to:
- Monitoring: Track lifecycle, messaging, and system metrics against baselines
- Performance: Tune based on Task 6.2 benchmarks (6.12M msg/sec, 286ns spawn)
- Troubleshooting: Address lock contention, memory leaks, queue growth
- Security: Enforce capability-based security and audit logging
- Operations: Use blue-green or canary deployments with automatic rollback
- Capacity Planning: Calculate node capacity based on component resource needs
Production Readiness Validation:
- ✅ Monitoring configured (metrics, logs, traces)
- ✅ Performance meets SLAs (P99 < 100ms, throughput > 100k msg/sec)
- ✅ Security enforced (capability-based, audit logging)
- ✅ Deployment automated (blue-green or canary)
- ✅ Rollback tested (automatic triggers configured)
- ✅ Capacity planned (resource limits set, scaling triggers defined)
Performance Baseline: Task 6.2 benchmarks establish production baseline (6.12M msg/sec throughput, 286ns spawn, 36ns O(1) registry lookup). Monitor for degradation beyond 2x baseline.
Next Steps¶
- Production Deployment Guide - Step-by-step deployment
- Troubleshooting Guide - Common issues and solutions
- Best Practices - Production-tested patterns