
Supervisor Trees

The supervision system in airssys-rt provides Erlang/OTP-inspired fault tolerance with builder-based configuration and automatic health monitoring.

Note: All code examples are drawn from the actual implementation (RT-TASK-013). See the examples/ directory for complete working code.

Architecture Overview

Design Principles

The supervision system is built on three core concepts:

  1. Child Trait - Supervised entity contract
  2. Restart Strategies - Failure propagation control (OneForOne, OneForAll, RestForOne)
  3. Builder Pattern - Type-safe supervisor configuration (RT-TASK-013)

Performance Characteristics (from BENCHMARKING.md):

  • Child spawn: 5-20 µs per child (builder API)
  • OneForOne restart: 10-50 µs (stop → start cycle)
  • OneForAll restart: 30-150 µs for 3 children (~3x OneForOne)
  • Supervision tree creation: 20-100 µs for supervisor + 3 children

Child Trait

Definition

Any entity can be supervised by implementing the Child trait (from src/supervisor/child.rs):

use std::error::Error;

use async_trait::async_trait;

#[async_trait]
pub trait Child: Send + Sync {
    async fn start(&mut self) -> Result<(), Box<dyn Error + Send + Sync>>;
    async fn stop(&mut self) -> Result<(), Box<dyn Error + Send + Sync>>;
    async fn health_check(&self) -> ChildHealth;
}

Design Rationale:

  • start/stop: Lifecycle control for supervisor restarts
  • health_check: Automatic monitoring integration
  • Send + Sync: Multi-threaded supervision support
  • Actors automatically implement Child via a blanket implementation

Health Status

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ChildHealth {
    Healthy,
    Unhealthy(String),  // Reason for unhealthy state
    Unknown,            // Health check failed or not implemented
}
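
Consumers, whether a supervisor or a custom monitoring loop, typically branch on the reported status. A minimal sketch, assuming child is any Child implementation:

match child.health_check().await {
    ChildHealth::Healthy => {}
    ChildHealth::Unhealthy(reason) => eprintln!("child unhealthy: {reason}"),
    ChildHealth::Unknown => eprintln!("health status could not be determined"),
}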

Example Implementation

use std::error::Error;

use async_trait::async_trait;
use airssys_rt::supervisor::{Child, ChildHealth};

// `Connection` stands in for any async database client exposing
// `new`, `close`, and `is_alive`
struct DatabaseWorker {
    connection: Option<Connection>,
}

#[async_trait]
impl Child for DatabaseWorker {
    async fn start(&mut self) -> Result<(), Box<dyn Error + Send + Sync>> {
        println!("DatabaseWorker starting...");
        self.connection = Some(Connection::new().await?);
        Ok(())
    }

    async fn stop(&mut self) -> Result<(), Box<dyn Error + Send + Sync>> {
        println!("DatabaseWorker stopping...");
        if let Some(conn) = self.connection.take() {
            conn.close().await?;
        }
        Ok(())
    }

    async fn health_check(&self) -> ChildHealth {
        match &self.connection {
            Some(conn) if conn.is_alive() => ChildHealth::Healthy,
            Some(_) => ChildHealth::Unhealthy("Connection lost".to_string()),
            None => ChildHealth::Unhealthy("Not connected".to_string()),
        }
    }
}

Restart Strategies

Strategy Types

Three BEAM-inspired strategies (from src/supervisor/strategy.rs). Each is a concrete type implementing the RestartStrategy trait, which lets supervisors hold any of them as Box<dyn RestartStrategy>:

  • OneForOne - Restart only the failed child
  • OneForAll - Restart all children when one fails
  • RestForOne - Restart failed child and those started after it

OneForOne Strategy

Behavior: Only the failed child is restarted.

Use case: Independent children where one failure doesn't affect others.

Before failure:        After restart:
┌─────────────┐       ┌─────────────┐
│ Supervisor  │       │ Supervisor  │
└──┬──┬──┬────┘       └──┬──┬──┬────┘
   │  │  │               │  │  │
   A  B  C               A  B' C
      ✗                     ↻

Example: Worker pool where each worker processes independent tasks.

use airssys_rt::supervisor::{SupervisorNode, OneForOne};

let mut supervisor = SupervisorNode::builder()
    .with_strategy(OneForOne::new())
    .build()?;

See examples/worker_pool.rs for complete implementation.

OneForAll Strategy

Behavior: All children are restarted when any child fails.

Use case: Dependent children where all must be in consistent state.

Before failure:        After restart:
┌─────────────┐       ┌─────────────┐
│ Supervisor  │       │ Supervisor  │
└──┬──┬──┬────┘       └──┬──┬──┬────┘
   │  │  │               │  │  │
   A  B  C               A' B' C'
      ✗                  ↻  ↻  ↻

Example: Cache, database connection pool, configuration loader that must stay synchronized.

let mut supervisor = SupervisorNode::builder()
    .with_strategy(OneForAll::new())
    .build()?;

RestForOne Strategy

Behavior: The failed child and all children started after it are restarted.

Use case: Pipeline where later stages depend on earlier stages.

Before failure:        After restart:
┌─────────────┐       ┌─────────────┐
│ Supervisor  │       │ Supervisor  │
└──┬──┬──┬────┘       └──┬──┬──┬────┘
   │  │  │               │  │  │
   A  B  C               A  B' C'
      ✗                     ↻  ↻

Example: Event processing pipeline: Collector → Transformer → Writer.

let mut supervisor = SupervisorNode::builder()
    .with_strategy(RestForOne::new())
    .build()?;

See examples/event_pipeline.rs for complete RestForOne implementation.

Performance Comparison

From benches/supervisor_benchmarks.rs:

Strategy     Benchmark            Latency     Notes
OneForOne    1 child restart      10-50 µs    Baseline
OneForAll    3 children restart   30-150 µs   ~3x OneForOne
RestForOne   2 children restart   20-100 µs   Between OneForOne and OneForAll

Overhead hierarchy: OneForOne < RestForOne < OneForAll

Restart Policies

Policy Types

Control when children should be restarted (from src/supervisor/child.rs):

pub enum RestartPolicy {
    Permanent,   // Always restart on failure
    Transient,   // Restart only on abnormal termination
    Temporary,   // Never restart
}

Policy Selection Guide

Policy      Restart Conditions                    Use Case
Permanent   Always restart                        Core services that must run continuously
Transient   Only on error (not normal shutdown)   Optional services, retryable operations
Temporary   Never restart                         One-shot tasks, event handlers

Example Configuration

use airssys_rt::supervisor::{ChildId, ChildSpec, RestartPolicy, ShutdownPolicy};

// Critical worker - must always run
let spec_permanent = ChildSpec {
    id: ChildId::new(),
    restart_policy: RestartPolicy::Permanent,
    shutdown_policy: ShutdownPolicy::default(),
    significant: true,  // Failure affects supervisor
};

// Optional background task
let spec_transient = ChildSpec {
    id: ChildId::new(),
    restart_policy: RestartPolicy::Transient,
    shutdown_policy: ShutdownPolicy::default(),
    significant: false,  // Failure doesn't affect supervisor
};

// One-shot initialization
let spec_temporary = ChildSpec {
    id: ChildId::new(),
    restart_policy: RestartPolicy::Temporary,
    shutdown_policy: ShutdownPolicy::default(),
    significant: false,
};

Child Specification

ChildSpec Structure

Configure supervised children (from src/supervisor/child.rs):

pub struct ChildSpec {
    pub id: ChildId,
    pub restart_policy: RestartPolicy,
    pub shutdown_policy: ShutdownPolicy,
    pub significant: bool,  // Does failure affect supervisor?
}

Fields:

  • id: Unique child identifier
  • restart_policy: When to restart (Permanent/Transient/Temporary)
  • shutdown_policy: How to stop gracefully
  • significant: If true, child failure propagates to supervisor

Significance Flag

The significant flag controls fault propagation:

// Significant child - failure propagates up
ChildSpec {
    significant: true,  // Supervisor health depends on this child
    // ...
}

// Non-significant child - failure contained
ChildSpec {
    significant: false,  // Supervisor stays healthy even if child fails
    // ...
}

Use cases:

  • significant: true - Core services (database, auth, cache)
  • significant: false - Optional services (metrics, logs, monitoring)

Builder Pattern (RT-TASK-013)

Supervisor Builder

Type-safe supervisor construction with builder API:

use airssys_rt::supervisor::{ChildId, ChildSpec, OneForOne, RestartPolicy, ShutdownPolicy, SupervisorNode};

// Create supervisor with builder; MyWorker and MonitorWorker are
// Child implementations like the DatabaseWorker shown earlier
let supervisor = SupervisorNode::builder()
    .with_strategy(OneForOne::new())
    .add_child(
        ChildSpec {
            id: ChildId::new(),
            restart_policy: RestartPolicy::Permanent,
            shutdown_policy: ShutdownPolicy::default(),
            significant: true,
        },
        Box::new(MyWorker::new()),
    )
    .add_child(
        ChildSpec {
            id: ChildId::new(),
            restart_policy: RestartPolicy::Transient,
            shutdown_policy: ShutdownPolicy::default(),
            significant: false,
        },
        Box::new(MonitorWorker::new()),
    )
    .build()?;

Builder Methods

impl SupervisorBuilder {
    pub fn new() -> Self;

    pub fn with_strategy<S: RestartStrategy + 'static>(
        mut self, 
        strategy: S
    ) -> Self;

    pub fn add_child(
        mut self,
        spec: ChildSpec,
        child: Box<dyn Child>,
    ) -> Self;

    pub fn build(self) -> Result<SupervisorNode, SupervisorError>;
}

Benefits:

  • Type-safe configuration (compile-time checks)
  • Fluent API (method chaining)
  • Minimal overhead (5-20 µs spawn latency)
  • Clear child declaration

See examples/supervisor_builder_phase1.rs and examples/supervisor_builder_phase2.rs for complete examples.

Supervisor Node

Structure

Core supervisor implementation (from src/supervisor/supervisor.rs):

pub struct SupervisorNode {
    id: SupervisorId,
    strategy: Box<dyn RestartStrategy>,
    children: Vec<(ChildSpec, Box<dyn Child>)>,
    lifecycle: ActorLifecycle,
}

Key Methods

impl SupervisorNode {
    // Create with builder (recommended)
    pub fn builder() -> SupervisorBuilder;

    // Legacy constructor (still supported)
    pub fn new<S: RestartStrategy + 'static>(
        id: SupervisorId,
        strategy: S,
    ) -> Self;

    // Add child at runtime
    pub async fn add_child(
        &mut self,
        spec: ChildSpec,
        child: Box<dyn Child>,
    ) -> Result<(), SupervisorError>;

    // Start all children
    pub async fn start_all(&mut self) -> Result<(), SupervisorError>;

    // Stop all children
    pub async fn stop_all(&mut self) -> Result<(), SupervisorError>;

    // Handle child failure
    pub async fn handle_child_failure(
        &mut self,
        child_id: &ChildId,
    ) -> Result<(), SupervisorError>;
}
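
A brief usage sketch tying these methods together; spec and MyWorker stand in for a real ChildSpec and Child implementation:

let mut supervisor = SupervisorNode::builder()
    .with_strategy(OneForOne::new())
    .build()?;

// Children can also be added after construction
supervisor.add_child(spec, Box::new(MyWorker::new())).await?;

supervisor.start_all().await?;
// ... supervised work runs ...
supervisor.stop_all().await?;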

Automatic Health Monitoring

Health Monitoring System

Supervisors can automatically monitor child health (from RT-TASK-009):

use std::time::Duration;

use airssys_rt::monitoring::{HealthConfig, HealthMonitor};

// Create supervisor with health monitoring
let mut supervisor = SupervisorNode::builder()
    .with_strategy(OneForOne::new())
    .build()?;

// Add health monitor
let monitor = HealthMonitor::new(
    HealthConfig {
        check_interval: Duration::from_secs(5),
        unhealthy_threshold: 3,  // Restart after 3 consecutive failures
        auto_restart: true,
    }
);

monitor.monitor_supervisor(&mut supervisor).await?;

Features:

  • Periodic health checks via Child::health_check()
  • Configurable thresholds
  • Automatic restart on unhealthy children
  • Health status aggregation
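
Conceptually, the monitor drives a loop like the following sketch. This is illustrative only and the real HealthMonitor internals may differ; children() is the accessor used in the testing section below:

loop {
    tokio::time::sleep(config.check_interval).await;

    // Collect unhealthy children first, so the borrow from children()
    // is released before any restart call
    let mut unhealthy = Vec::new();
    for (spec, child) in supervisor.children() {
        if let ChildHealth::Unhealthy(_) = child.health_check().await {
            unhealthy.push(spec.id.clone());
        }
    }

    // Delegate restarts to the supervisor (threshold counting
    // omitted for brevity)
    for id in unhealthy {
        supervisor.handle_child_failure(&id).await?;
    }
}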

See examples/supervisor_automatic_health.rs for complete implementation.

Supervision Trees

Hierarchical Supervisors

Supervisors can supervise other supervisors:

┌─────────────────────────┐
│  Root Supervisor        │
│  (OneForAll)            │
└──┬──────────────────┬───┘
   │                  │
┌──▼─────────┐   ┌────▼─────────┐
│ Worker Pool│   │ Cache Manager│
│ Supervisor │   │ Supervisor   │
│ (OneForOne)│   │ (OneForAll)  │
└─┬───┬───┬──┘   └──┬────────┬──┘
  │   │   │         │        │
  W1  W2  W3      Cache  Persistence

Example:

// Create worker pool supervisor
let worker_supervisor = SupervisorNode::builder()
    .with_strategy(OneForOne::new())
    .add_child(spec1, Box::new(Worker1::new()))
    .add_child(spec2, Box::new(Worker2::new()))
    .build()?;

// Create cache supervisor
let cache_supervisor = SupervisorNode::builder()
    .with_strategy(OneForAll::new())
    .add_child(cache_spec, Box::new(Cache::new()))
    .add_child(db_spec, Box::new(Database::new()))
    .build()?;

// Create root supervisor (supervisors implement the Child trait, so
// they can be supervised like any other child)
let root_supervisor = SupervisorNode::builder()
    .with_strategy(OneForAll::new())
    .add_child(
        worker_sup_spec,  // fresh ChildSpec for the nested supervisor
        Box::new(worker_supervisor),
    )
    .add_child(
        cache_sup_spec,   // distinct from `cache_spec`, which was moved above
        Box::new(cache_supervisor),
    )
    .build()?;

Error Handling

Supervisor Errors

#[derive(Debug)]
pub enum SupervisorError {
    ChildStartFailed(ChildId, Box<dyn Error + Send + Sync>),
    ChildStopFailed(ChildId, Box<dyn Error + Send + Sync>),
    ChildNotFound(ChildId),
    StrategyError(String),
}

Error Flow

  1. Child fails during operation
  2. Child's start() or stop() returns Err
  3. Supervisor catches error
  4. Supervisor applies restart strategy
  5. On restart failure, the error propagates to the parent supervisor (if the child is significant); see the call-site sketch below
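
A sketch of reacting to step 5 at the call site, matching on the SupervisorError variants defined above:

match supervisor.handle_child_failure(&child_id).await {
    Ok(()) => { /* child handled according to its restart policy */ }
    Err(SupervisorError::ChildStartFailed(id, err)) => {
        // The restart itself failed: alert, or let the parent
        // supervisor take over if this child is significant
        eprintln!("restart of {id:?} failed: {err}");
    }
    Err(other) => eprintln!("supervision error: {other:?}"),
}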

Recovery Strategies

A simplified sketch of the supervisor's internal logic. The restart policy is read from the failed child's ChildSpec; the child_spec lookup helper here is illustrative:

async fn handle_child_failure(
    &mut self,
    child_id: &ChildId,
) -> Result<(), SupervisorError> {
    // Per-child policy lookup (illustrative helper)
    let policy = self.child_spec(child_id).restart_policy;
    match policy {
        RestartPolicy::Permanent => {
            // Always restart
            self.restart_child(child_id).await?;
        }
        RestartPolicy::Transient => {
            // Restart only on abnormal termination
            if self.is_abnormal_termination(child_id) {
                self.restart_child(child_id).await?;
            }
        }
        RestartPolicy::Temporary => {
            // Clean up only; never restart
            self.remove_child(child_id).await?;
        }
    }
    Ok(())
}

Best Practices

Strategy Selection

OneForOne:

  • ✅ Use for: Independent workers, pools, parallel tasks
  • ✅ Example: HTTP request handlers, background jobs
  • ❌ Avoid for: Stateful, interdependent services

OneForAll:

  • ✅ Use for: Tightly coupled services, consistent state requirements
  • ✅ Example: Cache + database, auth + session store
  • ❌ Avoid for: Large numbers of independent children (restart overhead)

RestForOne:

  • ✅ Use for: Pipelines, sequential dependencies
  • ✅ Example: Data ingestion → transform → storage
  • ❌ Avoid for: Parallel, independent processing

Child Design

DO:

  • ✅ Implement proper start()/stop() lifecycle
  • ✅ Make health_check() fast and accurate (<1ms)
  • ✅ Use significant flag appropriately
  • ✅ Keep child state minimal for fast restarts

DON'T:

  • ❌ Block in start()/stop() (use async waits; see the sketch below)
  • ❌ Ignore errors in lifecycle methods
  • ❌ Make all children significant (limits fault isolation)
  • ❌ Store unrecoverable state in children
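
For the first DON'T above, a minimal contrast sketch (assumes a Tokio runtime; the sleep stands in for real initialization work):

use std::time::Duration;

// Inside `impl Child for MyWorker` (sketch):
async fn start(&mut self) -> Result<(), Box<dyn Error + Send + Sync>> {
    // GOOD: awaiting yields the executor thread to other tasks
    tokio::time::sleep(Duration::from_millis(100)).await;

    // BAD: std::thread::sleep(Duration::from_millis(100)) would block
    // the executor thread and starve every sibling scheduled on it
    Ok(())
}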

Performance Tuning

Minimize restart latency:

  • Keep start() logic simple (<10ms ideal)
  • Preallocate resources where possible
  • Use connection pools instead of per-child connections (see the sketch below)
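
One common shape for the pooling advice, as a hedged sketch; Pool is a hypothetical connection pool type:

use std::sync::Arc;

// The pool lives outside the supervision tree, so restarting this
// child only re-borrows a handle instead of rebuilding connections.
struct PooledWorker {
    pool: Arc<Pool>,
}

impl PooledWorker {
    fn new(pool: Arc<Pool>) -> Self {
        Self { pool }
    }
}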

Monitor restart patterns:

  • Track restart counts (via ActorLifecycle)
  • Alert on excessive restarts (possible underlying issue)
  • Use HealthMonitor for automatic detection

Testing Patterns

Unit Testing Children

#[tokio::test]
async fn test_child_lifecycle() {
    let mut child = MyWorker::new();

    // Test start
    child.start().await.expect("start failed");
    assert_eq!(child.health_check().await, ChildHealth::Healthy);

    // Test stop
    child.stop().await.expect("stop failed");
}

Integration Testing Supervisors

#[tokio::test]
async fn test_one_for_one_restart() {
    // `spec` is a ChildSpec as shown earlier; FailingWorker is a test
    // Child whose start/stop can be made to fail on demand
    let mut supervisor = SupervisorNode::builder()
        .with_strategy(OneForOne::new())
        .add_child(spec, Box::new(FailingWorker::new()))
        .build()
        .unwrap();

    supervisor.start_all().await.unwrap();

    // Trigger failure and verify restart
    let child_id = supervisor.children()[0].0.id.clone();
    supervisor.handle_child_failure(&child_id).await.unwrap();

    // Verify child was restarted
    assert!(supervisor.is_child_running(&child_id));
}

Working Examples

Explore supervision in these examples:

Example                          Demonstrates                Command
supervisor_basic.rs              Basic supervision setup     cargo run --example supervisor_basic
supervisor_strategies.rs         All three strategies        cargo run --example supervisor_strategies
supervisor_builder_phase1.rs     Builder API basics          cargo run --example supervisor_builder_phase1
supervisor_builder_phase2.rs     Advanced builder patterns   cargo run --example supervisor_builder_phase2
supervisor_automatic_health.rs   Health monitoring           cargo run --example supervisor_automatic_health
worker_pool.rs                   Production worker pool      cargo run --example worker_pool
event_pipeline.rs                RestForOne pipeline         cargo run --example event_pipeline

All examples are in the examples/ directory with complete, runnable implementations.