Monitoring API Reference¶
This reference documents the monitoring and health check system for actors and supervisors.
Module: monitoring¶
Health monitoring and metrics collection for actors.
Trait: HealthCheck¶
pub trait HealthCheck: Send + Sync {
    async fn check_health(&self, actor_id: ActorId) -> HealthStatus;
    fn health_check_interval(&self) -> Duration;
    fn health_check_timeout(&self) -> Duration;
}
Trait for implementing health checks on actors.
Required Methods:
- check_health(): Performs the health check and returns the status
- health_check_interval(): How often to perform checks
- health_check_timeout(): Maximum time to wait for a response
Trait Bounds:
Send + Sync: Can be safely shared across threads
Example:
use airssys_rt::monitoring::{HealthCheck, HealthStatus};
use airssys_rt::util::ActorId;
use std::time::Duration;
struct PingHealthCheck;
#[async_trait::async_trait]
impl HealthCheck for PingHealthCheck {
    async fn check_health(&self, actor_id: ActorId) -> HealthStatus {
        // Send a ping message to the actor.
        // `send_ping` is an application-defined helper (not shown).
        match send_ping(actor_id).await {
            Ok(()) => HealthStatus::Healthy,
            Err(e) => HealthStatus::Unhealthy {
                reason: format!("Ping failed: {}", e),
            },
        }
    }
    fn health_check_interval(&self) -> Duration {
        Duration::from_secs(30)
    }
    fn health_check_timeout(&self) -> Duration {
        Duration::from_secs(5)
    }
}
Enum: HealthStatus¶
pub enum HealthStatus {
    Healthy,
    Degraded {
        reason: String,
    },
    Unhealthy {
        reason: String,
    },
    Unknown,
}
Represents the health state of an actor.
Variants:
- Healthy: Actor is functioning normally
- Degraded { reason }: Actor is operational but with reduced performance
- Unhealthy { reason }: Actor is not functioning correctly
- Unknown: Health status cannot be determined
State Transitions:
Unknown -> Healthy (successful health check)
Unknown -> Unhealthy (failed health check)
Healthy -> Degraded (performance issue detected)
Healthy -> Unhealthy (critical failure)
Degraded -> Healthy (issue resolved)
Degraded -> Unhealthy (issue worsened)
Unhealthy -> Degraded (partial recovery)
Unhealthy -> Healthy (full recovery)
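The transition list above can be encoded as a small lookup. The sketch below is a minimal illustration, not the crate's implementation: `Health` is a hypothetical payload-free stand-in for `HealthStatus`, and `is_documented_transition` is an assumed helper name. Transitions that do not appear in the list (for example `Unknown -> Degraded`) are reported as undocumented, which is not the same as forbidden.

```rust
/// Hypothetical payload-free stand-in for `HealthStatus` (reasons omitted).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Healthy,
    Degraded,
    Unhealthy,
    Unknown,
}

/// Returns true if `from -> to` appears in the documented transition list.
/// Assumed helper for illustration; not part of the documented API.
pub fn is_documented_transition(from: Health, to: Health) -> bool {
    use Health::*;
    matches!(
        (from, to),
        (Unknown, Healthy)
            | (Unknown, Unhealthy)
            | (Healthy, Degraded)
            | (Healthy, Unhealthy)
            | (Degraded, Healthy)
            | (Degraded, Unhealthy)
            | (Unhealthy, Degraded)
            | (Unhealthy, Healthy)
    )
}
```

A check like this could be useful in tests that assert a monitor never reports a status change outside the documented set.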
Example:
use airssys_rt::monitoring::HealthStatus;
fn handle_health_status(status: HealthStatus) {
    match status {
        HealthStatus::Healthy => {
            println!("✓ Actor is healthy");
        }
        HealthStatus::Degraded { reason } => {
            eprintln!("⚠ Actor degraded: {}", reason);
            // Maybe reduce load
        }
        HealthStatus::Unhealthy { reason } => {
            eprintln!("✗ Actor unhealthy: {}", reason);
            // Trigger restart or failover
        }
        HealthStatus::Unknown => {
            eprintln!("? Actor health unknown");
            // Retry health check
        }
    }
}
Health Monitoring¶
Struct: HealthMonitor¶
Monitors actor health and triggers recovery actions.
Features:
- Periodic health checks
- Automatic unhealthy actor detection
- Integration with supervisor for restarts
- Configurable check intervals and timeouts
- Health history tracking
Constructors¶
new()¶
Creates a new health monitor.
Parameters:
system: The actor system to monitor
Example:
use airssys_rt::monitoring::HealthMonitor;
use std::sync::Arc;
let monitor = HealthMonitor::new(Arc::clone(&system));
with_config()¶
Creates a health monitor with custom configuration.
Parameters:
- system: Actor system to monitor
- config: Health monitor configuration
Example:
use airssys_rt::monitoring::{HealthMonitor, HealthMonitorConfig};
use std::sync::Arc;
use std::time::Duration;
let config = HealthMonitorConfig {
    check_interval: Duration::from_secs(10),
    check_timeout: Duration::from_secs(2),
    failure_threshold: 3,
    recovery_threshold: 2,
};
let monitor = HealthMonitor::with_config(Arc::clone(&system), config);
Methods¶
monitor_actor()¶
pub async fn monitor_actor<H>(&self, actor_id: ActorId, health_check: H)
where
    H: HealthCheck + 'static,
Starts monitoring an actor with a custom health check.
Type Parameters:
H: The health check implementation
Parameters:
- actor_id: Actor to monitor
- health_check: Health check implementation
Behavior:
- Spawns background task for periodic checks
- Continues until actor stops or monitor is stopped
- Reports status changes to supervisor (if supervised)
Example:
use airssys_rt::monitoring::PingHealthCheck;
monitor.monitor_actor(actor_id, PingHealthCheck).await;
stop_monitoring()¶
Stops monitoring an actor.
Parameters:
actor_id: Actor to stop monitoring
Returns:
- Ok(()): Monitoring stopped successfully
- Err(MonitoringError::NotMonitored): Actor was not being monitored
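Example:
if let Err(e) = monitor.stop_monitoring(actor_id) {
    eprintln!("Failed to stop monitoring: {:?}", e);
}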
get_health_status()¶
Gets the current health status of an actor.
Parameters:
actor_id: Actor to query
Returns:
- Some(HealthStatus): Current health status
- None: Actor not being monitored
Example:
if let Some(status) = monitor.get_health_status(actor_id) {
    println!("Actor health: {:?}", status);
}
get_health_history()¶
Gets the health check history for an actor.
Parameters:
- actor_id: Actor to query
- limit: Maximum number of records to return
Returns:
Vec<HealthRecord>: Health check history (most recent first)
Example:
let history = monitor.get_health_history(actor_id, 10);
for record in history {
    println!("{}: {:?}", record.timestamp, record.status);
}
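To illustrate the "most recent first" ordering and the `limit` parameter, here is a minimal sketch of a bounded history buffer. It is not the crate's implementation: `History` and its methods are hypothetical names, the generic `T` stands in for `HealthRecord`, and the eviction policy (drop the oldest record when full) is an assumption.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded history; `T` stands in for `HealthRecord`.
pub struct History<T> {
    capacity: usize,
    records: VecDeque<T>,
}

impl<T: Clone> History<T> {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity,
            records: VecDeque::with_capacity(capacity),
        }
    }

    /// Appends a record, evicting the oldest one once the buffer is full.
    pub fn push(&mut self, record: T) {
        if self.capacity == 0 {
            return;
        }
        if self.records.len() == self.capacity {
            self.records.pop_front();
        }
        self.records.push_back(record);
    }

    /// Returns up to `limit` records, most recent first.
    pub fn recent(&self, limit: usize) -> Vec<T> {
        self.records.iter().rev().take(limit).cloned().collect()
    }
}
```

With a capacity of 3, pushing the values 1 through 5 leaves 3, 4, and 5 in the buffer, and `recent(2)` yields 5 followed by 4.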
Health Monitoring Configuration¶
Struct: HealthMonitorConfig¶
pub struct HealthMonitorConfig {
    pub check_interval: Duration,
    pub check_timeout: Duration,
    pub failure_threshold: u32,
    pub recovery_threshold: u32,
}
Configuration for health monitoring behavior.
Fields:
- check_interval: Time between health checks
- check_timeout: Maximum time to wait for a health check response
- failure_threshold: Number of consecutive failures before marking the actor unhealthy
- recovery_threshold: Number of consecutive successes before marking the actor healthy
Default Values:
impl Default for HealthMonitorConfig {
    fn default() -> Self {
        Self {
            check_interval: Duration::from_secs(30),
            check_timeout: Duration::from_secs(5),
            failure_threshold: 3,
            recovery_threshold: 2,
        }
    }
}
Example:
use airssys_rt::monitoring::HealthMonitorConfig;
use std::time::Duration;
// Aggressive monitoring for critical service
let critical_config = HealthMonitorConfig {
    check_interval: Duration::from_secs(5),
    check_timeout: Duration::from_secs(1),
    failure_threshold: 2,
    recovery_threshold: 3,
};
// Relaxed monitoring for background worker
let worker_config = HealthMonitorConfig {
    check_interval: Duration::from_secs(60),
    check_timeout: Duration::from_secs(10),
    failure_threshold: 5,
    recovery_threshold: 2,
};
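The interaction between `failure_threshold` and `recovery_threshold` is easier to see in code. The following is a rough sketch of consecutive-count hysteresis, not the monitor's actual logic: `ThresholdTracker` is a hypothetical type, each check is reduced to pass or fail, and starting in the healthy state is an assumption.

```rust
/// Hypothetical tracker that applies `failure_threshold` and
/// `recovery_threshold` to a stream of pass/fail check results.
pub struct ThresholdTracker {
    failure_threshold: u32,
    recovery_threshold: u32,
    consecutive_failures: u32,
    consecutive_successes: u32,
    healthy: bool,
}

impl ThresholdTracker {
    /// Starts in the healthy state (an assumption made for this sketch).
    pub fn new(failure_threshold: u32, recovery_threshold: u32) -> Self {
        Self {
            failure_threshold,
            recovery_threshold,
            consecutive_failures: 0,
            consecutive_successes: 0,
            healthy: true,
        }
    }

    /// Records one check result and returns whether the actor is
    /// currently considered healthy.
    pub fn record(&mut self, check_passed: bool) -> bool {
        if check_passed {
            self.consecutive_failures = 0;
            self.consecutive_successes = self.consecutive_successes.saturating_add(1);
            if !self.healthy && self.consecutive_successes >= self.recovery_threshold {
                self.healthy = true;
            }
        } else {
            self.consecutive_successes = 0;
            self.consecutive_failures = self.consecutive_failures.saturating_add(1);
            if self.healthy && self.consecutive_failures >= self.failure_threshold {
                self.healthy = false;
            }
        }
        self.healthy
    }
}
```

With thresholds of 3 and 2, a single transient failure never flips the state, and an unhealthy actor needs two clean checks in a row before it counts as recovered.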
Built-in Health Checks¶
Struct: PingHealthCheck¶
Simple ping-based health check.
Behavior:
- Sends ping message to actor
- Expects pong response within timeout
- Marks healthy if response received
Example:
use airssys_rt::monitoring::PingHealthCheck;
monitor.monitor_actor(actor_id, PingHealthCheck).await;
Struct: MessageRateHealthCheck¶
Health check based on message processing rate.
Fields:
min_messages_per_sec: Minimum expected message processing rate
Behavior:
- Tracks actor's message processing rate
- Marks degraded if below minimum rate
- Marks unhealthy if processing stopped
Example:
use airssys_rt::monitoring::MessageRateHealthCheck;
let health_check = MessageRateHealthCheck {
    min_messages_per_sec: 100.0,
};
monitor.monitor_actor(worker_id, health_check).await;
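The degraded and unhealthy rules described above amount to a three-way classification. The sketch below shows one way it might look; `Status` is a hypothetical stand-in for `HealthStatus`, `classify_rate` is an assumed helper name, and treating a rate of zero as "processing stopped" is an assumption rather than documented behavior.

```rust
/// Hypothetical stand-in for `HealthStatus` (the `Unknown` variant is omitted).
#[derive(Debug, Clone, PartialEq)]
pub enum Status {
    Healthy,
    Degraded { reason: String },
    Unhealthy { reason: String },
}

/// Classifies an observed processing rate against `min_messages_per_sec`.
/// Assumption: a rate of zero (or below) means processing has stopped.
pub fn classify_rate(observed_per_sec: f64, min_messages_per_sec: f64) -> Status {
    if observed_per_sec <= 0.0 {
        Status::Unhealthy {
            reason: "message processing stopped".to_string(),
        }
    } else if observed_per_sec < min_messages_per_sec {
        Status::Degraded {
            reason: format!(
                "rate {:.1}/s is below minimum {:.1}/s",
                observed_per_sec, min_messages_per_sec
            ),
        }
    } else {
        Status::Healthy
    }
}
```

How the real check measures the rate (window length, smoothing) is not specified in this reference, so the function takes the observed rate as an input.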
Struct: MemoryHealthCheck¶
Health check based on actor memory usage.
Fields:
max_memory_mb: Maximum acceptable memory usage in MB
Behavior:
- Monitors actor's memory footprint
- Marks degraded if approaching limit (>80%)
- Marks unhealthy if exceeding limit
Example:
use airssys_rt::monitoring::MemoryHealthCheck;
let health_check = MemoryHealthCheck {
    max_memory_mb: 100, // 100 MB limit
};
monitor.monitor_actor(actor_id, health_check).await;
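As a worked illustration of the 80% rule, the sketch below classifies a memory reading against a limit. It assumes `max_memory_mb` is an unsigned integer and that "approaching limit" means strictly more than 80% of it; `Level` and `classify_memory` are hypothetical names, not part of the documented API.

```rust
/// Hypothetical payload-free stand-in for the health outcome.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Healthy,
    Degraded,
    Unhealthy,
}

/// Classifies `used_mb` against `max_memory_mb`:
/// above the limit is unhealthy, above 80% of it is degraded.
pub fn classify_memory(used_mb: u64, max_memory_mb: u64) -> Level {
    if used_mb > max_memory_mb {
        Level::Unhealthy
    } else if (used_mb as u128) * 100 > (max_memory_mb as u128) * 80 {
        // Integer comparison avoids floating-point rounding at the boundary;
        // widening to u128 rules out overflow.
        Level::Degraded
    } else {
        Level::Healthy
    }
}
```

With a 100 MB limit, 80 MB is still healthy, 81 MB through 100 MB is degraded, and 101 MB is unhealthy under these assumptions.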
Struct: CompositeHealthCheck¶
Combines multiple health checks with AND logic.
Behavior:
- Runs all health checks in parallel
- Healthy only if all checks are healthy
- Degraded if any check is degraded
- Unhealthy if any check is unhealthy
Example:
use airssys_rt::monitoring::{
    CompositeHealthCheck, MemoryHealthCheck, MessageRateHealthCheck, PingHealthCheck,
};
let composite = CompositeHealthCheck::new()
    .add_check(PingHealthCheck)
    .add_check(MessageRateHealthCheck { min_messages_per_sec: 50.0 })
    .add_check(MemoryHealthCheck { max_memory_mb: 200 });
monitor.monitor_actor(actor_id, composite).await;
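The behavior list does not say which result wins when one check is degraded and another is unhealthy. A natural reading is "worst status wins", which the sketch below encodes. This is an assumption, not confirmed behavior: `Level` and `combine` are hypothetical names, and the sketch only covers how results are combined, not how the checks are run in parallel.

```rust
/// Hypothetical payload-free outcome, declared from best to worst so the
/// derived ordering gives Healthy < Degraded < Unhealthy.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Healthy,
    Degraded,
    Unhealthy,
}

/// Combines individual results by taking the worst one.
/// An empty slice is treated as healthy (AND over no checks); this is an assumption.
pub fn combine(results: &[Level]) -> Level {
    results.iter().copied().max().unwrap_or(Level::Healthy)
}
```

Under this reading, the composite is healthy only if every check is healthy, which matches the AND semantics described above.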
Health Records¶
Struct: HealthRecord¶
pub struct HealthRecord {
    pub timestamp: DateTime<Utc>,
    pub status: HealthStatus,
    pub check_duration: Duration,
}
Record of a single health check execution.
Fields:
- timestamp: When the health check was performed (UTC)
- status: The health status result
- check_duration: How long the health check took
Example:
use chrono::{DateTime, Utc};
let history = monitor.get_health_history(actor_id, 5);
for record in history {
    println!(
        "[{}] {:?} (took {:?})",
        record.timestamp.format("%Y-%m-%d %H:%M:%S"),
        record.status,
        record.check_duration
    );
}
Supervisor Integration¶
Automatic Health Monitoring¶
Supervisors can automatically monitor child actors.
use airssys_rt::{Supervisor, ChildSpec};
use airssys_rt::monitoring::{HealthMonitor, PingHealthCheck};
use std::time::Duration;
impl Supervisor for MonitoredSupervisor {
    fn child_specs(&self) -> Vec<ChildSpec> {
        vec![
            ChildSpec::new("worker", || Worker::new())
                .with_health_check(PingHealthCheck)
                .with_health_interval(Duration::from_secs(30))
                .with_restart_on_unhealthy(true),
        ]
    }
    fn restart_strategy(&self) -> RestartStrategy {
        RestartStrategy::OneForOne
    }
}
Health-Based Restart Policy¶
use airssys_rt::monitoring::HealthBasedRestartPolicy;
use std::time::Duration;
let policy = HealthBasedRestartPolicy {
    restart_on_unhealthy: true,
    restart_on_degraded: false,
    max_restarts: 3,
    restart_window: Duration::from_secs(60),
};
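The `max_restarts` and `restart_window` fields suggest a sliding-window limit on restarts. The reference does not spell out the exact semantics, so the following is only a sketch under that assumption: `RestartLimiter` is a hypothetical type, and it allows a restart when fewer than `max_restarts` restarts were granted within the preceding `restart_window`.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical sliding-window restart limiter.
pub struct RestartLimiter {
    max_restarts: usize,
    restart_window: Duration,
    recent: VecDeque<Instant>,
}

impl RestartLimiter {
    pub fn new(max_restarts: usize, restart_window: Duration) -> Self {
        Self {
            max_restarts,
            restart_window,
            recent: VecDeque::new(),
        }
    }

    /// Returns true (and records the restart) if the limit has not been
    /// reached within the window ending at `now`.
    pub fn try_restart(&mut self, now: Instant) -> bool {
        // Drop restarts that have aged out of the window.
        while let Some(&oldest) = self.recent.front() {
            if now.duration_since(oldest) >= self.restart_window {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        if self.recent.len() < self.max_restarts {
            self.recent.push_back(now);
            true
        } else {
            false
        }
    }
}
```

Passing `now` explicitly keeps the limiter deterministic in tests. With the values above (3 restarts per 60 seconds), a fourth restart inside the window is refused, and capacity returns as older restarts age out.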
Metrics and Reporting¶
Struct: HealthMetrics¶
pub struct HealthMetrics {
    pub total_checks: u64,
    pub healthy_checks: u64,
    pub degraded_checks: u64,
    pub unhealthy_checks: u64,
    pub avg_check_duration: Duration,
}
Aggregated health check metrics.
Fields:
- total_checks: Total number of health checks performed
- healthy_checks: Number of healthy results
- degraded_checks: Number of degraded results
- unhealthy_checks: Number of unhealthy results
- avg_check_duration: Average time per health check
Methods¶
health_percentage()¶
Calculates the percentage of health checks that returned Healthy.
Example:
let metrics = monitor.get_metrics(actor_id);
println!("Health: {:.1}%", metrics.health_percentage());
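The computation behind `health_percentage()` is presumably `healthy_checks / total_checks * 100`. The sketch below shows that formula on a local stand-in type. `MetricsSketch` is a hypothetical name that mirrors only two of the documented fields, and both the `f64` return type and the result of 0.0 for zero checks are assumptions.

```rust
/// Hypothetical stand-in mirroring two fields of `HealthMetrics`.
pub struct MetricsSketch {
    pub total_checks: u64,
    pub healthy_checks: u64,
}

impl MetricsSketch {
    /// Percentage of checks that returned a healthy result.
    /// Returns 0.0 when no checks have run (an assumption for this sketch).
    pub fn health_percentage(&self) -> f64 {
        if self.total_checks == 0 {
            return 0.0;
        }
        self.healthy_checks as f64 / self.total_checks as f64 * 100.0
    }
}
```

For example, 190 healthy results out of 200 checks gives 95%.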
Performance Characteristics¶
Health Check Overhead¶
| Check Type | Latency | Frequency | Overhead |
|---|---|---|---|
| Ping | 0.5-2ms | 30s | Negligible |
| MessageRate | 50-100µs | 30s | <0.01% |
| Memory | 100-500µs | 60s | <0.01% |
| Composite (3 checks) | 1-3ms | 30s | <0.1% |
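One way to sanity-check the Overhead column is to treat it as the share of wall-clock time spent checking, that is, latency divided by interval. The table does not state how overhead was measured, so this reading is an assumption, and `overhead_percent` is a hypothetical helper.

```rust
use std::time::Duration;

/// Share of wall-clock time spent in a check, as a percentage:
/// check latency divided by check interval, times 100.
pub fn overhead_percent(check_latency: Duration, check_interval: Duration) -> f64 {
    if check_interval.is_zero() {
        // A zero interval would mean checking continuously.
        return f64::INFINITY;
    }
    check_latency.as_secs_f64() / check_interval.as_secs_f64() * 100.0
}
```

Under this reading, a 3 ms composite check every 30 s costs about 0.01% and a 100 µs message-rate check every 30 s costs about 0.0003%. Both are consistent with the bounds in the table.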
Memory Usage¶
| Component | Size | Per Actor | Notes |
|---|---|---|---|
| HealthMonitor | ~512 bytes | - | Base structure |
| Per-actor state | ~256 bytes | Yes | Status + history |
| Health history (10 records) | ~480 bytes | Yes | Circular buffer |
Recommended Check Intervals¶
| Actor Type | Check Interval | Timeout | Failure Threshold |
|---|---|---|---|
| Critical service | 10s | 2s | 2 |
| Standard actor | 30s | 5s | 3 |
| Background worker | 60s | 10s | 5 |
| Batch processor | 120s | 30s | 3 |
Error Types¶
Enum: MonitoringError¶
Errors specific to monitoring operations.
Variants:
- NotMonitored: Actor is not being monitored
- CheckFailed(String): Health check execution failed
- Timeout: Health check exceeded timeout
- SystemError(String): System-level monitoring error
Example:
use airssys_rt::monitoring::MonitoringError;
match monitor.stop_monitoring(actor_id) {
    Ok(()) => println!("Stopped monitoring"),
    Err(MonitoringError::NotMonitored) => {
        println!("Actor wasn't being monitored");
    }
    Err(e) => eprintln!("Error: {:?}", e),
}
Testing Utilities¶
Struct: MockHealthCheck¶
Mock health check for testing.
Available in: Test builds only (#[cfg(test)])
Methods¶
new()¶
Creates a new mock health check.
set_status()¶
Sets the status this health check will return.
Example:
#[cfg(test)]
mod tests {
    use super::*;
    use airssys_rt::monitoring::{HealthStatus, MockHealthCheck};
    use std::time::Duration;
    #[tokio::test]
    async fn test_unhealthy_actor_restart() {
        // `monitor`, `actor_id`, and `supervisor_probe` come from test setup (not shown).
        let mut health_check = MockHealthCheck::new();
        health_check.set_status(HealthStatus::Unhealthy {
            reason: "Test failure".to_string(),
        });
        monitor.monitor_actor(actor_id, health_check).await;
        // Wait for health check
        tokio::time::sleep(Duration::from_secs(1)).await;
        // Verify actor was restarted
        assert!(supervisor_probe.was_restarted(actor_id));
    }
}
Best Practices¶
Health Check Design¶
// ✅ Good - Lightweight and focused
struct QuickHealthCheck;
#[async_trait::async_trait]
impl HealthCheck for QuickHealthCheck {
    async fn check_health(&self, actor_id: ActorId) -> HealthStatus {
        // Simple ping, returns quickly
        ping_actor(actor_id).await
    }
    // health_check_interval() and health_check_timeout() omitted for brevity
}
// ❌ Bad - Expensive operations
struct SlowHealthCheck;
#[async_trait::async_trait]
impl HealthCheck for SlowHealthCheck {
    async fn check_health(&self, actor_id: ActorId) -> HealthStatus {
        // Complex database query (too slow)
        database.complex_query().await;
        HealthStatus::Healthy
    }
    // health_check_interval() and health_check_timeout() omitted for brevity
}
Monitoring Configuration¶
// ✅ Good - Reasonable intervals and thresholds
let config = HealthMonitorConfig {
    check_interval: Duration::from_secs(30),
    check_timeout: Duration::from_secs(5),
    failure_threshold: 3, // Avoid false positives
    recovery_threshold: 2,
};
// ❌ Bad - Too aggressive, overhead too high
let bad_config = HealthMonitorConfig {
    check_interval: Duration::from_millis(100), // Too frequent!
    check_timeout: Duration::from_secs(30),     // Timeout > interval!
    failure_threshold: 1,                       // No tolerance for transients
    recovery_threshold: 10,                     // Takes too long to recover
};
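Mistakes like a timeout longer than the interval can be caught before the monitor starts. The sketch below is one possible sanity check, not something the crate is documented to provide: `ConfigSketch` is a local stand-in that mirrors the `HealthMonitorConfig` fields, `validate` is a hypothetical helper, and rejecting zero thresholds is an assumption.

```rust
use std::time::Duration;

/// Local stand-in mirroring the documented `HealthMonitorConfig` fields.
pub struct ConfigSketch {
    pub check_interval: Duration,
    pub check_timeout: Duration,
    pub failure_threshold: u32,
    pub recovery_threshold: u32,
}

/// Rejects configurations with the problems called out above.
pub fn validate(config: &ConfigSketch) -> Result<(), String> {
    if config.check_interval.is_zero() {
        return Err("check_interval must be greater than zero".to_string());
    }
    if config.check_timeout > config.check_interval {
        return Err("check_timeout must not exceed check_interval".to_string());
    }
    // Assumption: a threshold of zero has no sensible meaning.
    if config.failure_threshold == 0 || config.recovery_threshold == 0 {
        return Err("thresholds must be at least 1".to_string());
    }
    Ok(())
}
```

The "good" configuration above passes this check. The "bad" one is rejected because its 30 s timeout exceeds its 100 ms interval.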
See Also¶
- Core API Reference - Core types and system
- Actors API Reference - Actor lifecycle
- Supervisors API Reference - Supervision integration
- Architecture: Supervision - Design overview
- How-To: Supervisor Patterns - Usage patterns