Supervisor Trees¶
The supervision system in airssys-rt provides Erlang/OTP-inspired fault tolerance with builder-based configuration and automatic health monitoring.
Note: All code examples are from actual implementation (RT-TASK-013). See examples directory for complete working code.
Architecture Overview¶
Design Principles¶
The supervision system is built on three core concepts:
- Child Trait - Supervised entity contract
- Restart Strategies - Failure propagation control (OneForOne, OneForAll, RestForOne)
- Builder Pattern - Type-safe supervisor configuration (RT-TASK-013)
Performance Characteristics (from BENCHMARKING.md): - Child spawn: 5-20 µs per child (builder API) - OneForOne restart: 10-50 µs (stop → start cycle) - OneForAll restart: 30-150 µs for 3 children (~3x OneForOne) - Supervision tree creation: 20-100 µs for supervisor + 3 children
Child Trait¶
Definition¶
Any entity can be supervised by implementing the Child trait (from src/supervisor/child.rs):
#[async_trait]
pub trait Child: Send + Sync {
async fn start(&mut self) -> Result<(), Box<dyn Error + Send + Sync>>;
async fn stop(&mut self) -> Result<(), Box<dyn Error + Send + Sync>>;
async fn health_check(&self) -> ChildHealth;
}
Design Rationale:
start/stop: Lifecycle control for supervisor restartshealth_check: Automatic monitoring integrationSend + Sync: Multi-threaded supervision support- Actors automatically implement
Childvia blanket implementation
Health Status¶
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ChildHealth {
Healthy,
Unhealthy(String), // Reason for unhealthy state
Unknown, // Health check failed or not implemented
}
Example Implementation¶
use airssys_rt::supervisor::{Child, ChildHealth};
struct DatabaseWorker {
connection: Option<Connection>,
}
#[async_trait]
impl Child for DatabaseWorker {
async fn start(&mut self) -> Result<(), Box<dyn Error + Send + Sync>> {
println!("DatabaseWorker starting...");
self.connection = Some(Connection::new().await?);
Ok(())
}
async fn stop(&mut self) -> Result<(), Box<dyn Error + Send + Sync>> {
println!("DatabaseWorker stopping...");
if let Some(conn) = self.connection.take() {
conn.close().await?;
}
Ok(())
}
async fn health_check(&self) -> ChildHealth {
match &self.connection {
Some(conn) if conn.is_alive() => ChildHealth::Healthy,
Some(_) => ChildHealth::Unhealthy("Connection lost".to_string()),
None => ChildHealth::Unhealthy("Not connected".to_string()),
}
}
}
Restart Strategies¶
Strategy Types¶
Three BEAM-inspired strategies (from src/supervisor/strategy.rs):
pub enum RestartStrategy {
OneForOne, // Restart only the failed child
OneForAll, // Restart all children when one fails
RestForOne, // Restart failed child and those started after it
}
OneForOne Strategy¶
Behavior: Only the failed child is restarted.
Use case: Independent children where one failure doesn't affect others.
Before failure: After restart:
┌─────────────┐ ┌─────────────┐
│ Supervisor │ │ Supervisor │
└──┬──┬──┬───┘ └──┬──┬──┬───┘
│ │ │ │ │ │
A B C A B' C
✗ ↻
Example: Worker pool where each worker processes independent tasks.
use airssys_rt::supervisor::{SupervisorNode, OneForOne};
let mut supervisor = SupervisorNode::builder()
.with_strategy(OneForOne::new())
.build()?;
See examples/worker_pool.rs for complete implementation.
OneForAll Strategy¶
Behavior: All children are restarted when any child fails.
Use case: Dependent children where all must be in consistent state.
Before failure: After restart:
┌─────────────┐ ┌─────────────┐
│ Supervisor │ │ Supervisor │
└──┬──┬──┬───┘ └──┬──┬──┬───┘
│ │ │ │ │ │
A B C A' B' C'
✗ ↻ ↻ ↻
Example: Cache, database connection pool, configuration loader that must stay synchronized.
RestForOne Strategy¶
Behavior: Restart failed child and all children started after it.
Use case: Pipeline where later stages depend on earlier stages.
Before failure: After restart:
┌─────────────┐ ┌─────────────┐
│ Supervisor │ │ Supervisor │
└──┬──┬──┬───┘ └──┬──┬──┬───┘
│ │ │ │ │ │
A B C A B' C'
✗ ↻ ↻
Example: Event processing pipeline: Collector → Transformer → Writer.
See examples/event_pipeline.rs for complete RestForOne implementation.
Performance Comparison¶
From benches/supervisor_benchmarks.rs:
| Strategy | Benchmark | Latency | Notes |
|---|---|---|---|
| OneForOne | 1 child restart | 10-50 µs | Baseline |
| OneForAll | 3 children restart | 30-150 µs | ~3x OneForOne |
| RestForOne | 2 children restart | 20-100 µs | Between OneForOne and OneForAll |
Overhead hierarchy: OneForOne < RestForOne < OneForAll
Restart Policies¶
Policy Types¶
Control when children should be restarted (from src/supervisor/child.rs):
pub enum RestartPolicy {
Permanent, // Always restart on failure
Transient, // Restart only on abnormal termination
Temporary, // Never restart
}
Policy Selection Guide¶
| Policy | Restart Conditions | Use Case |
|---|---|---|
| Permanent | Always restart | Core services that must run continuously |
| Transient | Only on error (not normal shutdown) | Optional services, retryable operations |
| Temporary | Never restart | One-shot tasks, event handlers |
Example Configuration¶
use airssys_rt::supervisor::{ChildSpec, RestartPolicy, ShutdownPolicy};
// Critical worker - must always run
let spec_permanent = ChildSpec {
id: ChildId::new(),
restart_policy: RestartPolicy::Permanent,
shutdown_policy: ShutdownPolicy::default(),
significant: true, // Failure affects supervisor
};
// Optional background task
let spec_transient = ChildSpec {
id: ChildId::new(),
restart_policy: RestartPolicy::Transient,
shutdown_policy: ShutdownPolicy::default(),
significant: false, // Failure doesn't affect supervisor
};
// One-shot initialization
let spec_temporary = ChildSpec {
id: ChildId::new(),
restart_policy: RestartPolicy::Temporary,
shutdown_policy: ShutdownPolicy::default(),
significant: false,
};
Child Specification¶
ChildSpec Structure¶
Configure supervised children (from src/supervisor/child.rs):
pub struct ChildSpec {
pub id: ChildId,
pub restart_policy: RestartPolicy,
pub shutdown_policy: ShutdownPolicy,
pub significant: bool, // Does failure affect supervisor?
}
Fields:
id: Unique child identifierrestart_policy: When to restart (Permanent/Transient/Temporary)shutdown_policy: How to stop gracefullysignificant: Iftrue, child failure propagates to supervisor
Significance Flag¶
The significant flag controls fault propagation:
// Significant child - failure propagates up
ChildSpec {
significant: true, // Supervisor health depends on this child
// ...
}
// Non-significant child - failure contained
ChildSpec {
significant: false, // Supervisor stays healthy even if child fails
// ...
}
Use cases:
significant: true- Core services (database, auth, cache)significant: false- Optional services (metrics, logs, monitoring)
Builder Pattern (RT-TASK-013)¶
Supervisor Builder¶
Type-safe supervisor construction with builder API:
use airssys_rt::supervisor::{SupervisorNode, SupervisorBuilder, OneForOne};
// Create supervisor with builder
let supervisor = SupervisorNode::builder()
.with_strategy(OneForOne::new())
.add_child(
ChildSpec {
id: ChildId::new(),
restart_policy: RestartPolicy::Permanent,
shutdown_policy: ShutdownPolicy::default(),
significant: true,
},
Box::new(MyWorker::new()),
)
.add_child(
ChildSpec {
id: ChildId::new(),
restart_policy: RestartPolicy::Transient,
shutdown_policy: ShutdownPolicy::default(),
significant: false,
},
Box::new(MonitorWorker::new()),
)
.build()?;
Builder Methods¶
impl SupervisorBuilder {
pub fn new() -> Self;
pub fn with_strategy<S: RestartStrategy + 'static>(
mut self,
strategy: S
) -> Self;
pub fn add_child(
mut self,
spec: ChildSpec,
child: Box<dyn Child>,
) -> Self;
pub fn build(self) -> Result<SupervisorNode, SupervisorError>;
}
Benefits:
- Type-safe configuration (compile-time checks)
- Fluent API (method chaining)
- Minimal overhead (5-20 µs spawn latency)
- Clear child declaration
See examples/supervisor_builder_phase1.rs and examples/supervisor_builder_phase2.rs for complete examples.
Supervisor Node¶
Structure¶
Core supervisor implementation (from src/supervisor/supervisor.rs):
pub struct SupervisorNode {
id: SupervisorId,
strategy: Box<dyn RestartStrategy>,
children: Vec<(ChildSpec, Box<dyn Child>)>,
lifecycle: ActorLifecycle,
}
Key Methods¶
impl SupervisorNode {
// Create with builder (recommended)
pub fn builder() -> SupervisorBuilder;
// Legacy constructor (still supported)
pub fn new<S: RestartStrategy + 'static>(
id: SupervisorId,
strategy: S,
) -> Self;
// Add child at runtime
pub async fn add_child(
&mut self,
spec: ChildSpec,
child: Box<dyn Child>,
) -> Result<(), SupervisorError>;
// Start all children
pub async fn start_all(&mut self) -> Result<(), SupervisorError>;
// Stop all children
pub async fn stop_all(&mut self) -> Result<(), SupervisorError>;
// Handle child failure
pub async fn handle_child_failure(
&mut self,
child_id: &ChildId,
) -> Result<(), SupervisorError>;
}
Automatic Health Monitoring¶
Health Monitoring System¶
Supervisors can automatically monitor child health (from RT-TASK-009):
use airssys_rt::monitoring::{HealthMonitor, HealthConfig};
// Create supervisor with health monitoring
let mut supervisor = SupervisorNode::builder()
.with_strategy(OneForOne::new())
.build()?;
// Add health monitor
let monitor = HealthMonitor::new(
HealthConfig {
check_interval: Duration::from_secs(5),
unhealthy_threshold: 3, // Restart after 3 consecutive failures
auto_restart: true,
}
);
monitor.monitor_supervisor(&mut supervisor).await?;
Features:
- Periodic health checks via
Child::health_check() - Configurable thresholds
- Automatic restart on unhealthy children
- Health status aggregation
See examples/supervisor_automatic_health.rs for complete implementation.
Supervision Trees¶
Hierarchical Supervisors¶
Supervisors can supervise other supervisors:
┌─────────────────────────┐
│ Root Supervisor │
│ (OneForAll) │
└──┬──────────────────┬───┘
│ │
│ │
┌──▼─────────┐ ┌───▼──────────┐
│ Worker Pool│ │ Cache Manager│
│ Supervisor │ │ Supervisor │
│ (OneForOne)│ │ (OneForAll) │
└──┬──┬──┬──┘ └──┬───────┬───┘
│ │ │ │ │
W1 W2 W3 Cache Persistence
Example:
// Create worker pool supervisor
let worker_supervisor = SupervisorNode::builder()
.with_strategy(OneForOne::new())
.add_child(spec1, Box::new(Worker1::new()))
.add_child(spec2, Box::new(Worker2::new()))
.build()?;
// Create cache supervisor
let cache_supervisor = SupervisorNode::builder()
.with_strategy(OneForAll::new())
.add_child(cache_spec, Box::new(Cache::new()))
.add_child(db_spec, Box::new(Database::new()))
.build()?;
// Create root supervisor (supervisors implement Child trait)
let root_supervisor = SupervisorNode::builder()
.with_strategy(OneForAll::new())
.add_child(
worker_spec,
Box::new(worker_supervisor),
)
.add_child(
cache_spec,
Box::new(cache_supervisor),
)
.build()?;
Error Handling¶
Supervisor Errors¶
#[derive(Debug)]
pub enum SupervisorError {
ChildStartFailed(ChildId, Box<dyn Error + Send + Sync>),
ChildStopFailed(ChildId, Box<dyn Error + Send + Sync>),
ChildNotFound(ChildId),
StrategyError(String),
}
Error Flow¶
- Child fails during operation
- Child's
start()orstop()returnsErr - Supervisor catches error
- Supervisor applies restart strategy
- On restart failure, error propagates to parent supervisor (if significant child)
Recovery Strategies¶
async fn handle_child_failure(
&mut self,
child_id: &ChildId,
) -> Result<(), SupervisorError> {
match self.restart_policy {
RestartPolicy::Permanent => {
// Always restart
self.restart_child(child_id).await?;
}
RestartPolicy::Transient => {
// Check if abnormal termination
if self.is_abnormal_termination(child_id) {
self.restart_child(child_id).await?;
}
}
RestartPolicy::Temporary => {
// Just cleanup, no restart
self.remove_child(child_id).await?;
}
}
Ok(())
}
Best Practices¶
Strategy Selection¶
OneForOne:
- ✅ Use for: Independent workers, pools, parallel tasks
- ✅ Example: HTTP request handlers, background jobs
- ❌ Avoid for: Stateful, interdependent services
OneForAll:
- ✅ Use for: Tightly coupled services, consistent state requirements
- ✅ Example: Cache + database, auth + session store
- ❌ Avoid for: Large numbers of independent children (restart overhead)
RestForOne:
- ✅ Use for: Pipelines, sequential dependencies
- ✅ Example: Data ingestion → transform → storage
- ❌ Avoid for: Parallel, independent processing
Child Design¶
DO:
- ✅ Implement proper
start()/stop()lifecycle - ✅ Make
health_check()fast and accurate (<1ms) - ✅ Use
significantflag appropriately - ✅ Keep child state minimal for fast restarts
DON'T:
- ❌ Block in
start()/stop()(use async properly) - ❌ Ignore errors in lifecycle methods
- ❌ Make all children significant (limits fault isolation)
- ❌ Store unrecoverable state in children
Performance Tuning¶
Minimize restart latency:
- Keep
start()logic simple (<10ms ideal) - Preallocate resources where possible
- Use connection pools instead of per-child connections
Monitor restart patterns:
- Track restart counts (via
ActorLifecycle) - Alert on excessive restarts (possible underlying issue)
- Use
HealthMonitorfor automatic detection
Testing Patterns¶
Unit Testing Children¶
#[tokio::test]
async fn test_child_lifecycle() {
let mut child = MyWorker::new();
// Test start
child.start().await.expect("start failed");
assert_eq!(child.health_check().await, ChildHealth::Healthy);
// Test stop
child.stop().await.expect("stop failed");
}
Integration Testing Supervisors¶
#[tokio::test]
async fn test_one_for_one_restart() {
let mut supervisor = SupervisorNode::builder()
.with_strategy(OneForOne::new())
.add_child(spec, Box::new(FailingWorker::new()))
.build()
.unwrap();
supervisor.start_all().await.unwrap();
// Trigger failure and verify restart
let child_id = supervisor.children()[0].0.id.clone();
supervisor.handle_child_failure(&child_id).await.unwrap();
// Verify child was restarted
assert!(supervisor.is_child_running(&child_id));
}
Working Examples¶
Explore supervision in these examples:
| Example | Demonstrates | Command |
|---|---|---|
supervisor_basic.rs |
Basic supervision setup | cargo run --example supervisor_basic |
supervisor_strategies.rs |
All three strategies | cargo run --example supervisor_strategies |
supervisor_builder_phase1.rs |
Builder API basics | cargo run --example supervisor_builder_phase1 |
supervisor_builder_phase2.rs |
Advanced builder patterns | cargo run --example supervisor_builder_phase2 |
supervisor_automatic_health.rs |
Health monitoring | cargo run --example supervisor_automatic_health |
worker_pool.rs |
Production worker pool | cargo run --example worker_pool |
event_pipeline.rs |
RestForOne pipeline | cargo run --example event_pipeline |
All examples are in the examples/ directory with complete, runnable implementations.