AirsDSP Product Differentiation Strategy

Document Type: Knowledge Base - Product Strategy
Created: 2025-10-20
Last Updated: 2025-10-20
Confidence Level: High
Source: Product analysis, market differentiation, and feature planning
Purpose: Define features that make AirsDSP more valuable than DSPy for production use

Overview

This document outlines the strategic features that differentiate AirsDSP from DSPy, focusing on production-readiness, explicit control, and developer experience. While DSPy focuses on automated optimization, AirsDSP targets developers who value transparency, flexibility, and comprehensive tooling.

Core Differentiation Philosophy

AirsDSP Value Proposition

Target Users: Developers and organizations who prioritize:

  • ✅ Explicit Control: Full visibility and control over pipeline behavior
  • ✅ Production Readiness: Comprehensive tooling for real-world deployment
  • ✅ Flexibility: Not locked into specific models or providers
  • ✅ Debuggability: Ability to understand and fix issues quickly
  • ✅ Cost Management: Tools to track and optimize expenses
  • ✅ Quality Assurance: Comprehensive evaluation and testing

Strategic Positioning

DSPy Focus:
- Automated optimization
- Declarative programming
- Compiler-driven improvements
- Black-box optimization

AirsDSP Focus:
- Explicit architecture
- Manual optimization
- Full transparency
- Production tooling

Critical Differentiation Features

1. Multiple Model Support

Vision

Provide true model flexibility: no lock-in to any single provider or model, with support for hybrid architectures that mix models within one pipeline.

Core Capabilities

A. Multiple Provider Support

pub enum LanguageModelProvider {
    // Commercial APIs
    OpenAI(OpenAIConfig),
    Anthropic(AnthropicConfig),
    Cohere(CohereConfig),
    GoogleVertexAI(VertexConfig),
    Azure(AzureConfig),

    // Open Source / Local
    Ollama(OllamaConfig),
    LlamaCpp(LlamaCppConfig),
    HuggingFace(HuggingFaceConfig),
    VLLM(VLLMConfig),

    // Custom
    Custom(Box<dyn LanguageModel>),
}

pub struct ModelConfig {
    pub provider: LanguageModelProvider,
    pub model_name: String,
    pub temperature: Option<f32>,
    pub max_tokens: Option<usize>,
    pub timeout: Option<Duration>,
}

B. Hybrid Pipeline (Different Models per Stage)

// Use expensive smart model for reasoning, cheap fast model for formatting
let hybrid_pipeline = Pipeline::new()
    .demonstrate(examples)
    .predict_with_model(
        PredictStage::new("complex_reasoning"),
        LanguageModel::from_config(ModelConfig {
            provider: LanguageModelProvider::OpenAI(OpenAIConfig::default()),
            model_name: "gpt-4".to_string(),
            temperature: Some(0.7),
            ..Default::default()
        })
    )
    .search(SearchStage::new("retrieval"))
    .predict_with_model(
        PredictStage::new("format_output"),
        LanguageModel::from_config(ModelConfig {
            provider: LanguageModelProvider::Ollama(OllamaConfig::default()),
            model_name: "llama3".to_string(),
            temperature: Some(0.3),
            ..Default::default()
        })
    );

C. Model Fallback Chain

pub struct ModelFallbackChain {
    primary: Box<dyn LanguageModel>,
    fallbacks: Vec<Box<dyn LanguageModel>>,
    retry_config: RetryConfig,
}

impl ModelFallbackChain {
    pub async fn generate(&self, prompt: &str) -> Result<String> {
        // Try primary model
        match timeout(
            self.retry_config.timeout,
            self.primary.generate(prompt)
        ).await {
            Ok(Ok(result)) => return Ok(result),
            Ok(Err(e)) => log::warn!("Primary model failed: {}", e),
            Err(_) => log::warn!("Primary model timeout"),
        }

        // Try fallbacks
        for (i, fallback) in self.fallbacks.iter().enumerate() {
            match fallback.generate(prompt).await {
                Ok(result) => {
                    log::info!("Fallback {} succeeded", i);
                    return Ok(result);
                }
                Err(e) => log::warn!("Fallback {} failed: {}", i, e),
            }
        }

        Err(Error::AllModelsFailed)
    }
}
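
RetryConfig is referenced above but not defined in this document; a minimal sketch of the struct plus example usage of the chain follows. The field names, the boxed return type of LanguageModel::from_config, and the 30-second timeout are illustrative assumptions rather than a fixed API.

// Hypothetical retry configuration for the fallback chain (fields are assumptions).
pub struct RetryConfig {
    pub timeout: Duration,   // per-attempt timeout applied to the primary model
    pub max_attempts: usize, // reserved for per-model retries
}

// Example: GPT-4 as primary, a local Ollama model as the fallback.
let chain = ModelFallbackChain {
    primary: LanguageModel::from_config(ModelConfig {
        provider: LanguageModelProvider::OpenAI(OpenAIConfig::default()),
        model_name: "gpt-4".to_string(),
        ..Default::default()
    }),
    fallbacks: vec![LanguageModel::from_config(ModelConfig {
        provider: LanguageModelProvider::Ollama(OllamaConfig::default()),
        model_name: "llama3".to_string(),
        ..Default::default()
    })],
    retry_config: RetryConfig {
        timeout: Duration::from_secs(30),
        max_attempts: 1,
    },
};

let answer = chain.generate("Summarize the incident report.").await?;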

D. Model Ensemble (Multi-Model Voting)

pub struct ModelEnsemble {
    models: Vec<Box<dyn LanguageModel>>,
    aggregation: AggregationStrategy,
}

pub enum AggregationStrategy {
    MajorityVote,
    WeightedVote(HashMap<String, f32>),
    BestOfN { judge: Box<dyn LanguageModel> },
    Consensus { threshold: f32 },
}

impl ModelEnsemble {
    pub async fn predict_with_ensemble(
        &self,
        prompt: &str
    ) -> Result<EnsembleResult> {
        // Get predictions from all models
        let predictions = join_all(
            self.models.iter()
                .map(|m| m.generate(prompt))
        ).await;

        // Aggregate based on strategy
        let final_result = match &self.aggregation {
            AggregationStrategy::MajorityVote => {
                self.majority_vote(&predictions)
            }
            AggregationStrategy::BestOfN { judge } => {
                self.judge_best(&predictions, judge).await?
            }
            // ... other strategies
        };

        Ok(EnsembleResult {
            final_answer: final_result,
            individual_predictions: predictions,
            agreement_score: self.compute_agreement(&predictions),
        })
    }
}
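
The helpers referenced above (majority_vote, compute_agreement) are not shown; one possible sketch, which treats predictions as case-insensitive strings and ignores failed calls, follows. The normalization and tie-breaking rules are assumptions.

use std::collections::HashMap;

impl ModelEnsemble {
    // Pick the answer returned by the most models (successful calls only).
    fn majority_vote(&self, predictions: &[Result<String>]) -> String {
        let mut counts: HashMap<String, usize> = HashMap::new();
        for p in predictions.iter().flatten() {
            *counts.entry(p.trim().to_lowercase()).or_insert(0) += 1;
        }
        counts
            .into_iter()
            .max_by_key(|(_, n)| *n)
            .map(|(answer, _)| answer)
            .unwrap_or_default()
    }

    // Fraction of successful predictions that agree with the majority answer.
    fn compute_agreement(&self, predictions: &[Result<String>]) -> f32 {
        let answers: Vec<String> = predictions
            .iter()
            .flatten()
            .map(|p| p.trim().to_lowercase())
            .collect();
        if answers.is_empty() {
            return 0.0;
        }
        let winner = self.majority_vote(predictions);
        let agreeing = answers.iter().filter(|a| **a == winner).count();
        agreeing as f32 / answers.len() as f32
    }
}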

E. Intelligent Model Routing

pub struct ModelRouter {
    classifier: Box<dyn LanguageModel>,  // Small, fast model
    task_models: HashMap<TaskType, Box<dyn LanguageModel>>,
    cost_optimizer: CostOptimizer,
}

impl ModelRouter {
    pub async fn route_and_execute(&self, input: &str) -> Result<String> {
        // Classify task with small model
        let classification = self.classifier.classify_task(input).await?;

        // Select appropriate model based on task
        let model = self.select_optimal_model(&classification)?;

        // Execute with selected model
        model.generate(input).await
    }

    fn select_optimal_model(
        &self,
        classification: &TaskClassification
    ) -> Result<&Box<dyn LanguageModel>> {
        // Consider task complexity, cost constraints, latency requirements
        let model = self.cost_optimizer.optimize_selection(
            classification,
            &self.task_models,
        )?;

        Ok(model)
    }
}
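
TaskType, TaskClassification, and the cost-aware selection are left abstract above. A minimal sketch of the routing data and a naive selection policy is shown below; the variants, fields, and the Error::NoModelAvailable variant are illustrative assumptions.

use std::collections::HashMap;

// Illustrative task taxonomy for routing (variants are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum TaskType {
    SimpleExtraction,  // a cheap, fast model is usually enough
    Summarization,
    ComplexReasoning,  // route to the strongest model
    CodeGeneration,
}

pub struct TaskClassification {
    pub task_type: TaskType,
    pub estimated_complexity: f32, // 0.0 = trivial, 1.0 = hardest
    pub latency_sensitive: bool,
}

impl CostOptimizer {
    // Naive policy: honor the task-type mapping, otherwise fall back to any model.
    pub fn optimize_selection<'a>(
        &self,
        classification: &TaskClassification,
        task_models: &'a HashMap<TaskType, Box<dyn LanguageModel>>,
    ) -> Result<&'a Box<dyn LanguageModel>> {
        task_models
            .get(&classification.task_type)
            .or_else(|| task_models.values().next())
            .ok_or(Error::NoModelAvailable) // assumed error variant
    }
}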

Why This Matters

| Benefit | Description | Business Impact |
|---|---|---|
| Cost Optimization | Use cheap models for simple tasks | 50-80% cost reduction |
| Performance | Use fast models for latency-sensitive operations | Better UX |
| Reliability | Fallback when primary fails | Higher uptime |
| Quality | Ensemble for critical decisions | Better accuracy |
| Flexibility | Not vendor-locked | Negotiation power |
| Privacy | Local models for sensitive data | Compliance |

Implementation Priority

Phase 1 (Months 1-2):

  • ✅ Basic provider support (OpenAI, Anthropic, local)
  • ✅ Single model per pipeline
  • ✅ Simple model configuration

Phase 2 (Months 3-4):

  • ✅ Hybrid pipelines (different models per stage)
  • ✅ Model fallback chain
  • ✅ Basic cost tracking

Phase 3 (Months 5-6):

  • ✅ Model ensemble
  • ✅ Intelligent routing
  • ✅ Advanced cost optimization

2. AI-Powered Evaluation (AI-as-a-Judge)

Vision

Provide comprehensive, automated evaluation following the G-Eval methodology, similar to DeepEval but integrated directly with DSP pipelines.

G-Eval Methodology

G-Eval (GPT-based Evaluation) uses LLMs to evaluate LLM outputs based on specific criteria.

Core Principles:

  1. Use LLM as judge with explicit criteria
  2. Chain-of-thought evaluation reasoning
  3. Normalized scoring (0-1 or 1-5 scale)
  4. Multiple evaluation dimensions

Core Evaluation Framework

A. Base Judge Interface

pub trait LLMJudge {
    async fn evaluate(
        &self,
        input: &EvaluationInput,
        criteria: &EvaluationCriteria,
    ) -> Result<JudgmentResult>;
}

pub struct EvaluationInput {
    pub task_description: String,
    pub input: String,
    pub output: String,
    pub reference: Option<String>,  // Ground truth if available
    pub context: Option<String>,    // Retrieved context if applicable
}

pub struct EvaluationCriteria {
    pub name: String,
    pub description: String,
    pub scale: ScaleType,
    pub instructions: String,
}

pub enum ScaleType {
    Binary,           // 0 or 1
    Likert5,          // 1-5
    Percentage,       // 0-100
    Continuous,       // 0.0-1.0
}

pub struct JudgmentResult {
    pub score: f32,
    pub reasoning: String,
    pub criterion: String,
    pub confidence: Option<f32>,
}
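
The "normalized scoring" principle implies mapping every ScaleType onto a common 0.0-1.0 range so criteria can be compared and averaged. A small helper sketch (not part of the trait above) might look like this:

impl ScaleType {
    // Map a raw judge score onto 0.0-1.0 so different criteria are comparable.
    pub fn normalize(&self, raw: f32) -> f32 {
        let normalized = match self {
            ScaleType::Binary => raw,                // already 0 or 1
            ScaleType::Likert5 => (raw - 1.0) / 4.0, // 1-5 -> 0.0-1.0
            ScaleType::Percentage => raw / 100.0,    // 0-100 -> 0.0-1.0
            ScaleType::Continuous => raw,            // already 0.0-1.0
        };
        normalized.clamp(0.0, 1.0)
    }
}

// Example: a Likert score of 4 normalizes to 0.75.
let score = ScaleType::Likert5.normalize(4.0);
assert!((score - 0.75).abs() < f32::EPSILON);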

B. G-Eval Implementation

pub struct GEvalJudge {
    judge_model: Box<dyn LanguageModel>,
    criteria: Vec<EvaluationCriteria>,
    use_cot: bool,  // Chain-of-thought reasoning
}

impl GEvalJudge {
    pub fn new(judge_model: Box<dyn LanguageModel>) -> Self {
        Self {
            judge_model,
            criteria: Vec::new(),
            use_cot: true,
        }
    }

    pub fn with_criterion(mut self, criterion: EvaluationCriteria) -> Self {
        self.criteria.push(criterion);
        self
    }

    async fn evaluate_single_criterion(
        &self,
        input: &EvaluationInput,
        criterion: &EvaluationCriteria,
    ) -> Result<JudgmentResult> {
        let prompt = self.build_evaluation_prompt(input, criterion);
        let response = self.judge_model.generate(&prompt).await?;
        self.parse_judgment(response, criterion)
    }

    fn build_evaluation_prompt(
        &self,
        input: &EvaluationInput,
        criterion: &EvaluationCriteria,
    ) -> String {
        format!(
            "Task: {task_description}\n\n\
             Evaluation Criterion: {criterion_name}\n\
             Description: {criterion_desc}\n\
             Scale: {scale}\n\n\
             Input: {input}\n\
             Output: {output}\n\
             {reference}\n\
             {context}\n\n\
             Instructions:\n\
             {instructions}\n\n\
             {cot_instruction}\n\n\
             Please provide your evaluation:",
            task_description = input.task_description,
            criterion_name = criterion.name,
            criterion_desc = criterion.description,
            scale = self.format_scale(&criterion.scale),
            input = input.input,
            output = input.output,
            reference = input.reference.as_ref()
                .map(|r| format!("Reference Answer: {}", r))
                .unwrap_or_default(),
            context = input.context.as_ref()
                .map(|c| format!("Context: {}", c))
                .unwrap_or_default(),
            instructions = criterion.instructions,
            cot_instruction = if self.use_cot {
                "First, explain your reasoning step by step. Then provide your score."
            } else {
                "Provide your score with brief justification."
            },
        )
    }
}

impl LLMJudge for GEvalJudge {
    async fn evaluate(
        &self,
        input: &EvaluationInput,
        criteria: &EvaluationCriteria,
    ) -> Result<JudgmentResult> {
        self.evaluate_single_criterion(input, criteria).await
    }
}
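
parse_judgment is called above but never shown. A simple sketch that extracts the last number in the judge's response and normalizes it with the hypothetical ScaleType::normalize helper from earlier is given below; a production implementation would more likely request structured JSON output from the judge. Error::JudgmentParseFailed is an assumed variant.

impl GEvalJudge {
    fn parse_judgment(
        &self,
        response: String,
        criterion: &EvaluationCriteria,
    ) -> Result<JudgmentResult> {
        // With chain-of-thought enabled, the score usually appears after the
        // reasoning, so take the last numeric token in the response.
        let raw_score = response
            .split(|c: char| !(c.is_ascii_digit() || c == '.'))
            .filter_map(|token| token.parse::<f32>().ok())
            .last()
            .ok_or(Error::JudgmentParseFailed)?; // assumed error variant

        Ok(JudgmentResult {
            score: criterion.scale.normalize(raw_score),
            reasoning: response,
            criterion: criterion.name.clone(),
            confidence: None,
        })
    }
}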

C. Common Evaluation Criteria

pub mod criteria {
    use super::*;

    // Coherence: How well does the output flow logically?
    pub fn coherence() -> EvaluationCriteria {
        EvaluationCriteria {
            name: "Coherence".to_string(),
            description: "Logical flow and organization of the response".to_string(),
            scale: ScaleType::Likert5,
            instructions: "\
                Score 1: Completely incoherent, random statements\n\
                Score 2: Mostly incoherent with some related ideas\n\
                Score 3: Somewhat coherent but disorganized\n\
                Score 4: Mostly coherent with good flow\n\
                Score 5: Perfectly coherent and well-organized".to_string(),
        }
    }

    // Relevance: How relevant is the output to the input?
    pub fn relevance() -> EvaluationCriteria {
        EvaluationCriteria {
            name: "Relevance".to_string(),
            description: "How well the output addresses the input question".to_string(),
            scale: ScaleType::Likert5,
            instructions: "\
                Score 1: Completely irrelevant\n\
                Score 2: Mostly irrelevant, tangentially related\n\
                Score 3: Somewhat relevant but missing key points\n\
                Score 4: Mostly relevant with minor issues\n\
                Score 5: Perfectly relevant and on-topic".to_string(),
        }
    }

    // Faithfulness: Is the output faithful to the provided context?
    pub fn faithfulness() -> EvaluationCriteria {
        EvaluationCriteria {
            name: "Faithfulness".to_string(),
            description: "Whether the output is grounded in provided context".to_string(),
            scale: ScaleType::Likert5,
            instructions: "\
                Score 1: Contains hallucinations, contradicts context\n\
                Score 2: Mostly unfaithful with some accurate info\n\
                Score 3: Mix of faithful and unfaithful statements\n\
                Score 4: Mostly faithful with minor extrapolations\n\
                Score 5: Completely faithful to context".to_string(),
        }
    }

    // Correctness: Is the output factually correct?
    pub fn correctness() -> EvaluationCriteria {
        EvaluationCriteria {
            name: "Correctness".to_string(),
            description: "Factual accuracy of the output".to_string(),
            scale: ScaleType::Likert5,
            instructions: "\
                Compare output against reference answer.\n\
                Score 1: Completely incorrect\n\
                Score 2: Mostly incorrect\n\
                Score 3: Partially correct\n\
                Score 4: Mostly correct with minor errors\n\
                Score 5: Completely correct".to_string(),
        }
    }

    // Conciseness: Is the output appropriately concise?
    pub fn conciseness() -> EvaluationCriteria {
        EvaluationCriteria {
            name: "Conciseness".to_string(),
            description: "Whether output is appropriately brief".to_string(),
            scale: ScaleType::Likert5,
            instructions: "\
                Score 1: Extremely verbose, excessive repetition\n\
                Score 2: Too verbose with unnecessary details\n\
                Score 3: Acceptable length but could be more concise\n\
                Score 4: Mostly concise with minor verbosity\n\
                Score 5: Perfectly concise, every word adds value".to_string(),
        }
    }

    // Helpfulness: How helpful is the output to the user?
    pub fn helpfulness() -> EvaluationCriteria {
        EvaluationCriteria {
            name: "Helpfulness".to_string(),
            description: "Overall usefulness of the response".to_string(),
            scale: ScaleType::Likert5,
            instructions: "\
                Score 1: Not helpful at all\n\
                Score 2: Minimally helpful\n\
                Score 3: Somewhat helpful but incomplete\n\
                Score 4: Very helpful with minor gaps\n\
                Score 5: Extremely helpful and comprehensive".to_string(),
        }
    }
}

D. Pipeline Evaluator

pub struct PipelineEvaluator {
    pipeline: Pipeline,
    judge: Box<dyn LLMJudge>,
    test_dataset: Vec<TestCase>,
}

pub struct TestCase {
    pub task_description: String,
    pub input: String,
    pub expected_output: Option<String>,
    pub metadata: HashMap<String, Value>,
}

impl PipelineEvaluator {
    pub async fn evaluate(&self, criteria: Vec<EvaluationCriteria>) -> Result<EvaluationReport> {
        let mut results = Vec::new();

        for test_case in &self.test_dataset {
            // Execute pipeline
            let output = self.pipeline.execute(&test_case.input).await?;

            // Evaluate across all criteria
            let mut scores = HashMap::new();
            for criterion in &criteria {
                let eval_input = EvaluationInput {
                    task_description: test_case.task_description.clone(),
                    input: test_case.input.clone(),
                    output: output.clone(),
                    reference: test_case.expected_output.clone(),
                    context: None,
                };

                let judgment = self.judge.evaluate(&eval_input, criterion).await?;
                scores.insert(criterion.name.clone(), judgment);
            }

            results.push(TestResult {
                input: test_case.input.clone(),
                output,
                expected: test_case.expected_output.clone(),
                scores,
            });
        }

        Ok(EvaluationReport::new(results, criteria))
    }
}

E. Evaluation Report

pub struct EvaluationReport {
    pub results: Vec<TestResult>,
    pub summary: EvaluationSummary,
    pub criteria_used: Vec<EvaluationCriteria>,
}

pub struct EvaluationSummary {
    pub total_tests: usize,
    pub average_scores: HashMap<String, f32>,
    pub score_distributions: HashMap<String, ScoreDistribution>,
    pub pass_rate: f32,  // If pass threshold is defined
}

impl EvaluationReport {
    pub fn to_json(&self) -> Result<String>;
    pub fn to_markdown(&self) -> String;
    pub fn to_html(&self) -> String;

    pub fn filter_by_score(&self, criterion: &str, min_score: f32) -> Vec<&TestResult>;
    pub fn get_worst_cases(&self, criterion: &str, n: usize) -> Vec<&TestResult>;
    pub fn get_best_cases(&self, criterion: &str, n: usize) -> Vec<&TestResult>;
}

Pre-built Evaluation Suites

pub mod eval_suites {
    use super::*;

    // RAG System Evaluation
    pub fn rag_evaluation_suite() -> Vec<EvaluationCriteria> {
        vec![
            criteria::faithfulness(),
            criteria::relevance(),
            criteria::correctness(),
        ]
    }

    // Question Answering Evaluation
    pub fn qa_evaluation_suite() -> Vec<EvaluationCriteria> {
        vec![
            criteria::correctness(),
            criteria::relevance(),
            criteria::conciseness(),
        ]
    }

    // Conversational Agent Evaluation
    pub fn conversational_suite() -> Vec<EvaluationCriteria> {
        vec![
            criteria::helpfulness(),
            criteria::relevance(),
            criteria::coherence(),
        ]
    }

    // Code Generation Evaluation
    pub fn code_generation_suite() -> Vec<EvaluationCriteria> {
        vec![
            criteria::correctness(),
            criteria::conciseness(),
            // Could add code-specific criteria
        ]
    }
}
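
Putting the pieces together, a hedged end-to-end usage example could look like the following. It assumes LanguageModel::from_config returns a boxed model, that rag_pipeline is already constructed, and that load_test_cases is a hypothetical loader for the test dataset.

// A strong model acts as the judge; criteria come from the pre-built RAG suite.
let judge = GEvalJudge::new(LanguageModel::from_config(ModelConfig {
    provider: LanguageModelProvider::OpenAI(OpenAIConfig::default()),
    model_name: "gpt-4".to_string(),
    ..Default::default()
}));

let evaluator = PipelineEvaluator {
    pipeline: rag_pipeline,                                 // built elsewhere
    judge: Box::new(judge),
    test_dataset: load_test_cases("tests/rag_cases.json")?, // hypothetical loader
};

let report = evaluator.evaluate(eval_suites::rag_evaluation_suite()).await?;

println!("{}", report.to_markdown());
for case in report.get_worst_cases("Faithfulness", 5) {
    println!("Low faithfulness on input: {}", case.input);
}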

Why This Matters

| Benefit | Description | Impact |
|---|---|---|
| Quality Assurance | Automated quality checking | Catch issues early |
| Regression Detection | Track performance over time | Prevent degradation |
| Objective Metrics | LLM-based scoring | Reduce human eval cost |
| Multi-Dimensional | Evaluate multiple aspects | Comprehensive view |
| Production Confidence | Deploy with measured quality | Risk mitigation |

Implementation Priority

Phase 1 (Months 3-4):

  • ✅ G-Eval judge implementation
  • ✅ Common criteria (faithfulness, relevance, correctness)
  • ✅ Basic evaluation report

Phase 2 (Months 5-6):

  • ✅ Pre-built evaluation suites
  • ✅ Pipeline evaluator integration
  • ✅ Report visualization

Phase 3 (Months 7-8):

  • ✅ Custom criteria builder
  • ✅ Continuous evaluation
  • ✅ A/B testing framework

3. DSP Debugging Capabilities

Vision

Provide comprehensive debugging tools that leverage DSP's explicit architecture for deep introspection and troubleshooting.

Core Debugging Features

A. Stage-by-Stage Execution Inspector

pub struct ExecutionInspector {
    pipeline: Pipeline,
    breakpoints: Vec<Breakpoint>,
    capture_level: CaptureLevel,
}

pub enum CaptureLevel {
    Minimal,     // Only stage outputs
    Standard,    // Outputs + timings
    Detailed,    // Outputs + timings + context
    Verbose,     // Everything including prompts
}

pub struct Breakpoint {
    pub stage_index: usize,
    pub condition: Option<BreakCondition>,
}

impl ExecutionInspector {
    pub fn execute_with_inspection(&self, input: &str) -> Result<InspectionReport> {
        let mut trace = ExecutionTrace::new();

        for (i, stage) in self.pipeline.stages.iter().enumerate() {
            // Pre-execution capture
            let pre_state = self.capture_state(i, &trace)?;
            trace.add_pre_state(i, pre_state.clone());

            // Execute stage
            let start = Instant::now();
            let output = stage.execute(&trace.context())?;
            let duration = start.elapsed();

            // Post-execution capture
            let post_state = self.capture_state(i, &trace)?;
            trace.add_stage_result(StageResult {
                index: i,
                name: stage.name().to_string(),
                input: trace.get_input_for_stage(i),
                output: output.clone(),
                duration,
                pre_state,
                post_state,
            });

            // Check breakpoint
            if self.should_break(i, &output) {
                return Ok(InspectionReport::Paused {
                    trace,
                    paused_at: i,
                });
            }
        }

        Ok(InspectionReport::Completed(trace))
    }
}
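
A short usage sketch of the inspector (assuming qa_pipeline is built elsewhere and that a breakpoint with no condition always triggers):

// Inspect a pipeline run with full prompt capture, pausing after stage 2.
let inspector = ExecutionInspector {
    pipeline: qa_pipeline, // built elsewhere
    breakpoints: vec![Breakpoint { stage_index: 2, condition: None }],
    capture_level: CaptureLevel::Verbose,
};

match inspector.execute_with_inspection("What changed in release 1.4?")? {
    InspectionReport::Completed(trace) => {
        println!("All {} stages completed", trace.stages.len());
    }
    InspectionReport::Paused { trace, paused_at } => {
        println!(
            "Paused after stage {paused_at}; {} stage results captured so far",
            trace.stages.len()
        );
    }
}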

B. Prompt Inspector

pub struct PromptInspector {
    capture_prompts: bool,
}

pub struct PromptCapture {
    pub stage_name: String,
    pub stage_index: usize,

    // Prompt construction steps
    pub base_template: String,
    pub with_demonstrations: String,
    pub with_context: String,
    pub final_prompt: String,

    // Model interaction
    pub model_used: String,
    pub model_config: ModelConfig,
    pub model_response: String,
    pub tokens_used: TokenUsage,

    // Timing
    pub prompt_construction_time: Duration,
    pub model_call_time: Duration,
}

impl PromptInspector {
    pub fn capture_predict_stage(
        &mut self,
        stage: &PredictStage,
        context: &Context,
    ) -> PromptCapture {
        let start = Instant::now();

        let base = stage.get_base_template();
        let with_demos = stage.add_demonstrations(base.clone(), context);
        let with_context = stage.add_context(with_demos.clone(), context);
        let final_prompt = stage.finalize_prompt(with_context.clone());

        let construction_time = start.elapsed();

        PromptCapture {
            stage_name: stage.name().to_string(),
            stage_index: stage.index(),
            base_template: base,
            with_demonstrations: with_demos,
            with_context: with_context,
            final_prompt: final_prompt,
            prompt_construction_time: construction_time,
            // model_response filled after execution
            ..Default::default()
        }
    }

    pub fn export_to_markdown(&self, captures: &[PromptCapture]) -> String {
        // Generate markdown report with all prompts
    }
}

C. Context Visualizer

pub struct ContextVisualizer;

impl ContextVisualizer {
    pub fn visualize_at_stage(
        context: &Context,
        stage_index: usize,
    ) -> String {
        format!(
            "📍 Context at Stage {}\n\
             \n\
             📚 Demonstrations:\n{}\n\
             \n\
             📜 History:\n{}\n\
             \n\
             🏷️  Metadata:\n{}\n",
            stage_index,
            Self::format_demonstrations(&context.demonstrations),
            Self::format_history(&context.history),
            Self::format_metadata(&context.metadata),
        )
    }

    pub fn visualize_diff(
        before: &Context,
        after: &Context,
    ) -> String {
        // Show what changed in context after stage execution
    }
}

D. Execution Trace Visualization

pub struct TraceVisualizer;

impl TraceVisualizer {
    pub fn visualize_as_graph(trace: &ExecutionTrace) -> String {
        // ASCII art or Mermaid diagram of execution flow
        format!(
            "Pipeline Execution Trace:\n\
             \n\
             Input\n\
             ↓\n\
             {}\n\
             ↓\n\
             Output",
            trace.stages.iter()
                .map(|s| format!(
                    "[{}] {} ({}ms)",
                    s.index,
                    s.name,
                    s.duration.as_millis()
                ))
                .collect::<Vec<_>>()
                .join("\n\n")
        )
    }

    pub fn export_to_html(trace: &ExecutionTrace) -> String {
        // Interactive HTML visualization
    }

    pub fn export_to_json(trace: &ExecutionTrace) -> String {
        // JSON format for external tools
    }
}

E. Performance Profiler

pub struct PerformanceProfiler {
    enable_profiling: bool,
}

pub struct ProfileReport {
    pub total_duration: Duration,
    pub stage_durations: Vec<StageDuration>,
    pub model_call_times: Vec<ModelCallProfile>,
    pub token_usage: TokenUsageStats,
    pub bottlenecks: Vec<Bottleneck>,
    pub optimization_suggestions: Vec<OptimizationSuggestion>,
}

pub struct Bottleneck {
    pub stage_index: usize,
    pub stage_name: String,
    pub duration: Duration,
    pub percentage_of_total: f32,
    pub reason: BottleneckReason,
}

impl PerformanceProfiler {
    pub fn profile(
        &self,
        pipeline: &Pipeline,
        input: &str,
    ) -> Result<ProfileReport> {
        // Profile execution and identify bottlenecks
        let trace = pipeline.execute_with_profiling(input)?;

        let bottlenecks = self.identify_bottlenecks(&trace);
        let suggestions = self.generate_suggestions(&bottlenecks);

        Ok(ProfileReport {
            total_duration: trace.total_duration,
            stage_durations: trace.stage_durations,
            model_call_times: trace.model_calls,
            token_usage: trace.token_stats,
            bottlenecks,
            optimization_suggestions: suggestions,
        })
    }
}

F. Interactive Debugger (REPL-style)

pub struct InteractiveDebugger {
    pipeline: Pipeline,
    current_stage: usize,
    execution_state: ExecutionState,
    command_history: Vec<String>,
}

impl InteractiveDebugger {
    pub fn new(pipeline: Pipeline) -> Self {
        Self {
            pipeline,
            current_stage: 0,
            execution_state: ExecutionState::NotStarted,
            command_history: Vec::new(),
        }
    }

    // Debugger commands
    pub fn step(&mut self) -> Result<StepResult>;
    pub fn continue_execution(&mut self) -> Result<ExecutionResult>;
    pub fn step_back(&mut self) -> Result<()>;  // If history maintained
    pub fn goto_stage(&mut self, index: usize) -> Result<()>;

    // Inspection commands
    pub fn inspect_context(&self) -> Context;
    pub fn inspect_stage(&self, index: usize) -> StageInfo;
    pub fn show_prompt(&self, stage_index: usize) -> String;

    // Modification commands (for experimentation)
    pub fn modify_context(&mut self, modifications: ContextMods) -> Result<()>;
    pub fn modify_stage_output(&mut self, stage: usize, new_output: String);

    // Breakpoint commands
    pub fn set_breakpoint(&mut self, stage: usize);
    pub fn remove_breakpoint(&mut self, stage: usize);
    pub fn list_breakpoints(&self) -> Vec<usize>;

    // Evaluation commands
    pub fn evaluate_expression(&self, expr: &str) -> Result<Value>;
}

Debugging Workflow Example

// Create debugger
let mut debugger = InteractiveDebugger::new(pipeline);

// Set breakpoints
debugger.set_breakpoint(2);  // Break after stage 2

// Start execution
debugger.step()?;  // Execute first stage

// Inspect what happened
let context = debugger.inspect_context();
println!("Context: {}", context);

let prompt = debugger.show_prompt(0);
println!("Prompt used: {}", prompt);

// Continue to breakpoint
debugger.continue_execution()?;  // Stops at stage 2

// Inspect intermediate state
let stage_2_info = debugger.inspect_stage(2);
println!("Stage 2 output: {}", stage_2_info);

// Modify and re-run (for experimentation)
debugger.modify_stage_output(1, "Modified output".to_string());
debugger.goto_stage(2)?;
debugger.continue_execution()?;

Why This Matters

| Benefit | Description | Impact |
|---|---|---|
| Fast Troubleshooting | Quickly identify issues | Reduced debug time |
| Understanding | See exactly what happens | Better intuition |
| Optimization | Identify bottlenecks | Performance gains |
| Verification | Ensure expected behavior | Quality assurance |
| Experimentation | Try modifications easily | Faster iteration |

Implementation Priority

Phase 1 (Months 2-3):

  • ✅ Execution inspector
  • ✅ Basic trace visualization
  • ✅ Prompt inspector

Phase 2 (Months 4-5):

  • ✅ Context visualizer
  • ✅ Performance profiler
  • ✅ HTML/JSON export

Phase 3 (Months 6-7):

  • ✅ Interactive debugger
  • ✅ Breakpoint system
  • ✅ Modification capabilities

Additional High-Value Features

4. Pipeline Versioning and Serialization

Why Critical: Production systems need reproducibility and version control.

pub struct PipelineVersion {
    pub id: PipelineId,
    pub version: semver::Version,
    pub pipeline: Pipeline,
    pub metadata: VersionMetadata,
    pub created_at: DateTime<Utc>,
    pub created_by: String,
}

impl Pipeline {
    pub fn to_json(&self) -> Result<String>;
    pub fn from_json(json: &str) -> Result<Self>;
    pub fn to_rust_code(&self) -> Result<String>;
    pub fn compute_content_hash(&self) -> String;
}
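
A hedged usage sketch of versioning: snapshot the pipeline, persist it next to its content hash, and verify the round trip. PipelineId::new, the Default impl on VersionMetadata, Clone on Pipeline, and the storage path are assumptions.

// Snapshot the current pipeline as a new version before deploying a change.
let version = PipelineVersion {
    id: PipelineId::new("support-triage"), // assumed constructor
    version: semver::Version::parse("1.3.0")?,
    pipeline: pipeline.clone(),            // assumes Pipeline: Clone
    metadata: VersionMetadata::default(),  // assumed Default impl
    created_at: Utc::now(),
    created_by: "deploy-bot".to_string(),
};

// Persist the exact definition alongside its content hash so a production
// incident can later be reproduced against the same version.
let hash = version.pipeline.compute_content_hash();
let path = format!("pipelines/support-triage-1.3.0-{hash}.json");
std::fs::write(&path, version.pipeline.to_json()?)?;

// Reload and verify the round trip.
let restored = Pipeline::from_json(&std::fs::read_to_string(&path)?)?;
assert_eq!(restored.compute_content_hash(), hash);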

5. Caching System

Why Critical: Reduce costs and latency.

pub struct CachedPipeline {
    pipeline: Pipeline,
    cache: Box<dyn Cache>,
    cache_strategy: CacheStrategy,
}

pub enum CacheStrategy {
    Stage(Vec<usize>),  // Cache specific stages
    Full,                // Cache full pipeline
    Adaptive,           // Smart caching
}
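
The Cache trait itself is not specified here; a minimal in-memory sketch, with hedged usage that caches the two most expensive stages, might look like this. The method names and key scheme are assumptions.

use std::collections::HashMap;
use std::sync::Mutex;

// Minimal cache abstraction (method names are assumptions).
pub trait Cache: Send + Sync {
    fn get(&self, key: &str) -> Option<String>;
    fn put(&self, key: &str, value: String);
}

pub struct InMemoryCache {
    entries: Mutex<HashMap<String, String>>,
}

impl Cache for InMemoryCache {
    fn get(&self, key: &str) -> Option<String> {
        self.entries.lock().unwrap().get(key).cloned()
    }
    fn put(&self, key: &str, value: String) {
        self.entries.lock().unwrap().insert(key.to_string(), value);
    }
}

// Cache only the expensive retrieval and reasoning stages (indices 1 and 2).
let cached = CachedPipeline {
    pipeline: rag_pipeline,
    cache: Box::new(InMemoryCache { entries: Mutex::new(HashMap::new()) }),
    cache_strategy: CacheStrategy::Stage(vec![1, 2]),
};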

6. Cost Tracking

Why Critical: LM calls are expensive.

pub struct CostTracker {
    token_costs: HashMap<ModelProvider, TokenCost>,
    total_cost: f64,
    budget_alerts: Vec<BudgetAlert>,
}

impl CostTracker {
    pub fn track_execution(&mut self, trace: &ExecutionTrace) -> CostReport;
    pub fn estimate_cost(&self, pipeline: &Pipeline, input: &str) -> f64;
    pub fn optimize_for_budget(&self, budget: f64) -> OptimizationPlan;
}
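
A sketch of how per-token pricing could back estimate_cost and track_execution; the rates, the TokenCost fields, and the TokenUsage field names are placeholders, not real prices or a fixed API.

// Placeholder per-1K-token rates; real prices change and must be configured.
pub struct TokenCost {
    pub prompt_per_1k: f64,
    pub completion_per_1k: f64,
}

impl CostTracker {
    // Cost of one model call given its token usage and the provider's rates.
    fn call_cost(&self, provider: &ModelProvider, usage: &TokenUsage) -> f64 {
        let Some(rate) = self.token_costs.get(provider) else {
            return 0.0; // unknown provider: treat as free rather than fail
        };
        (usage.prompt_tokens as f64 / 1000.0) * rate.prompt_per_1k
            + (usage.completion_tokens as f64 / 1000.0) * rate.completion_per_1k
    }
}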

7. Observability Integration

Why Critical: Production monitoring.

pub struct ObservablePipeline {
    pipeline: Pipeline,
    metrics: MetricsCollector,
    tracer: Tracer,
    logger: Logger,
}

// OpenTelemetry integration
impl ObservablePipeline {
    pub async fn execute_with_telemetry(&self, input: &str) -> Result<String> {
        let span = self.tracer.start_span("pipeline_execution");
        // ... execution with metrics
    }
}
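
Expanding the elided body above, one possible shape is sketched below; the MetricsCollector, Tracer, and Logger method names are assumptions that would be mapped onto a concrete OpenTelemetry exporter.

impl ObservablePipeline {
    pub async fn execute_with_telemetry(&self, input: &str) -> Result<String> {
        // Span and metric names are illustrative.
        let span = self.tracer.start_span("pipeline_execution");
        let start = Instant::now();

        let result = self.pipeline.execute(input).await;

        self.metrics.record_duration("pipeline.latency", start.elapsed());
        self.metrics.increment(match &result {
            Ok(_) => "pipeline.success",
            Err(_) => "pipeline.error",
        });
        if let Err(e) = &result {
            self.logger.error(&format!("pipeline execution failed: {e}"));
        }

        span.end();
        result
    }
}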

Feature Comparison: AirsDSP vs DSPy

| Feature | DSPy | AirsDSP | Advantage |
|---|---|---|---|
| Architecture | Declarative, automated | Explicit, manual | Full control |
| Multiple Models | Limited | ✅ Full support | Cost, flexibility |
| Hybrid Pipelines | No | ✅ Yes | Optimization |
| Model Fallback | No | ✅ Yes | Reliability |
| Model Ensemble | No | ✅ Yes | Quality |
| AI Evaluation | Basic | ✅ G-Eval comprehensive | Quality assurance |
| Evaluation Criteria | Limited | ✅ Pre-built suites | Easy to use |
| Debugging | Limited | ✅ Comprehensive | Fast troubleshooting |
| Execution Trace | No | ✅ Full trace | Understanding |
| Prompt Inspection | No | ✅ Yes | Transparency |
| Performance Profiling | No | ✅ Yes | Optimization |
| Interactive Debugger | No | ✅ Yes | Development speed |
| Caching | No | ✅ Yes | Cost reduction |
| Cost Tracking | No | ✅ Yes | Budget management |
| Versioning | No | ✅ Yes | Reproducibility |
| Observability | Limited | ✅ Full support | Production ready |
| Philosophy | Automation | Explicit control | Rust alignment |

Implementation Roadmap

Phase 1: Foundation (Months 1-3)

Priority: Core + Multiple Models + Basic Debugging

  • ✅ Core DSP framework (Layer 1)
  • ✅ Multiple model support (OpenAI, Anthropic, local)
  • ✅ Basic model configuration
  • ✅ Execution inspector
  • ✅ Basic trace visualization

Deliverable: Working core with model flexibility and basic debugging

Phase 2: Evaluation (Months 4-5)

Priority: AI-as-a-Judge Evaluation

  • ✅ G-Eval judge implementation
  • ✅ Common criteria (faithfulness, relevance, correctness, etc.)
  • ✅ Pipeline evaluator
  • ✅ Evaluation reports (JSON, markdown, HTML)
  • ✅ Pre-built evaluation suites

Deliverable: Comprehensive evaluation framework

Phase 3: Advanced Debugging (Months 5-6)

Priority: Production Debugging Tools

  • ✅ Prompt inspector
  • ✅ Context visualizer
  • ✅ Performance profiler
  • ✅ HTML/JSON export
  • ✅ Bottleneck identification

Deliverable: Full debugging toolkit

Phase 4: Production Features (Months 7-9)

Priority: Production Readiness

  • ✅ Hybrid pipelines (different models per stage)
  • ✅ Model fallback chain
  • ✅ Caching system
  • ✅ Cost tracking
  • ✅ Pipeline versioning
  • ✅ Interactive debugger

Deliverable: Production-ready system

Phase 5: Advanced Features (Months 10-12)

Priority: Enterprise Features

  • ✅ Model ensemble
  • ✅ Intelligent routing
  • ✅ Observability (OpenTelemetry)
  • ✅ Continuous evaluation
  • ✅ A/B testing framework
  • ✅ Advanced cost optimization

Deliverable: Enterprise-grade platform

Target Market Segments

Segment 1: Engineering Teams

Profile: Teams building production LLM applications

Needs:

  • Full control over behavior
  • Comprehensive debugging
  • Cost management
  • Quality assurance

AirsDSP Value: Explicit control + production tooling

Segment 2: Research Labs

Profile: Researchers experimenting with novel approaches

Needs:

  • Flexibility to try new approaches
  • Detailed introspection
  • Reproducibility
  • Performance analysis

AirsDSP Value: Explicit architecture + comprehensive debugging

Segment 3: Enterprise Organizations

Profile: Large organizations with compliance and budget constraints

Needs:

  • Cost tracking and optimization
  • Quality guarantees
  • Observability and monitoring
  • Vendor flexibility

AirsDSP Value: Multiple models + cost tracking + observability

Success Metrics

Adoption Metrics

  • GitHub stars and forks
  • Crate downloads
  • Community contributions
  • Production deployments

Quality Metrics

  • Bug reports vs feature requests ratio
  • Documentation completeness
  • Test coverage
  • Performance benchmarks

Differentiation Metrics

  • Feature comparison with DSPy
  • Unique capabilities utilization
  • User satisfaction surveys
  • Production success stories

Key Takeaways

For Product Strategy

  1. Clear Differentiation: AirsDSP focuses on explicit control and production tooling
  2. Multiple Models: True flexibility, not vendor lock-in
  3. AI-as-a-Judge: G-Eval methodology for comprehensive evaluation
  4. Comprehensive Debugging: Leverage DSP's explicit architecture
  5. Production Ready: Cost tracking, caching, observability

For Implementation

  1. Phased Approach: Core → Evaluation → Debugging → Production → Advanced
  2. High-Value Features First: Multiple models, AI-eval, debugging are priorities
  3. Layered Architecture: Features fit naturally into Layer 1-4 structure
  4. Rust Advantages: Leverage Rust's type system, performance, safety
  5. Community Focus: Open source, well-documented, example-rich

For Users

  1. Explicit Over Implicit: Full visibility and control
  2. Production Ready: Comprehensive tooling for real-world use
  3. Cost Effective: Multiple models, caching, cost tracking
  4. Quality Assured: Comprehensive evaluation and debugging
  5. Vendor Agnostic: Not locked into any specific provider

References

  • DSP Framework Core: dsp_framework_core.md
  • DSP Pipeline Architecture: dsp_pipeline_architecture_examples.md
  • DSP Reasoning Strategies: dsp_reasoning_strategies_implementation.md
  • DSP Multi-Task System: dsp_multi_task_system_architecture.md
  • DSP Layered Architecture: dsp_layered_architecture_design.md

External References

  • G-Eval Paper: "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"
  • DeepEval: https://github.com/confident-ai/deepeval
  • DSPy: https://github.com/stanfordnlp/dspy

Document Status: Complete
Implementation Readiness: High - Clear product strategy and roadmap
Next Steps: Begin Phase 1 implementation with focus on core + multiple models + basic debugging