
DSPy Prompt Optimization: A Scientific Approach to Automotive Intelligence

Wes Lee · ai, nlp, dspy

Introduction: From Prompt Engineering to Prompt Science

The field of prompt engineering has long been dominated by trial-and-error approaches, where practitioners manually iterate through different prompting strategies hoping to find configurations that work. This project represents a paradigm shift: treating prompt optimization as a rigorous scientific discipline using Stanford’s DSPy framework to systematically compile and optimize prompts for structured data extraction.

Related: For the business context and strategic implications of this project, see the DSPy Automotive Extractor project page.

Try It Yourself!

Explore the DSPy optimization results and experiment with different prompting strategies through our interactive dashboard:

Launch Interactive Demo

The Technical Foundation: Two-Phase Research Methodology

This project implements a two-phase experimental methodology that produced several counterintuitive findings about DSPy optimization strategies:

Phase 1: Reasoning Field Impact Analysis ✅ CONFIRMED

  • Hypothesis: Explicit reasoning tokens improve extraction accuracy
  • Method: Compare 5 identical strategies with/without reasoning output fields
  • Results: Universal improvement across all strategies (100% success rate)
  • Champion: Contrastive CoT + Reasoning achieved 51.33% F1-score

Phase 2: Meta-Optimization Effectiveness ❌ REFUTED

  • Hypothesis: Advanced prompt engineering enhances DSPy-optimized baselines
  • Method: Apply 6 meta-optimization techniques to reasoning-enhanced strategies
  • Results: Failed to exceed 51.33% ceiling (best meta-optimized: 49.33%)
  • Critical Discovery: Instruction conflicts create performance degradation
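
For concreteness, the two phases reduce to a small experiment grid. The sketch below is purely illustrative; the strategy and meta-optimizer keys are assumptions based on the names that appear in the results later in this article.

# Illustrative experiment grid for the two phases (keys assumed from the
# strategy and meta-optimizer names used in the results sections below).
BASE_STRATEGIES = ["naive", "cot", "plan_and_solve", "self_refine", "contrastive_cot"]

# Phase 1: every strategy runs twice, with and without a reasoning output field.
phase_1_runs = [
    f"{name}_{variant}"
    for name in BASE_STRATEGIES
    for variant in ("without_reasoning", "with_reasoning")
]

# Phase 2: each meta-optimizer is layered onto the reasoning-enhanced champion.
META_OPTIMIZER_NAMES = [
    "domain_expertise", "specificity", "error_prevention",
    "context_anchoring", "format_enforcement", "constitutional",
]
phase_2_runs = [f"contrastive_cot_{meta}" for meta in META_OPTIMIZER_NAMES]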

Technical Architecture: Production-Grade Pipeline

Project Structure and Module Design

The system is architected as a modular, sequential pipeline with comprehensive observability:

dspy-automotive-extractor/
├── src/
│   ├── settings.py                    # Central configuration
│   ├── _01_load_data.py              # NHTSA data pipeline
│   ├── _02_define_schema.py          # DSPy signatures + 5 strategies
│   ├── _03_define_program.py         # Core extraction module
│   ├── _04_run_optimization.py       # Basic optimization (Phase 1)
│   ├── _05_meta_optimizers.py        # Meta-optimization techniques
│   ├── _06_run_meta_optimization.py  # Advanced optimization (Phase 2)
│   ├── verify_gpu.py                 # System diagnostics
│   ├── app.py                        # Local dashboard with live demo
│   └── app_cloud.py                  # Cloud-ready dashboard
├── data/
│   └── NHTSA_complaints.csv          # Automotive complaints dataset
├── results/
│   ├── optimized_*.json              # Compiled DSPy programs
│   └── results_summary.json          # Experimental results
└── requirements.txt                   # Dependencies

Environment Setup and Configuration

The project uses a centralized configuration system with comprehensive environment validation:

# From src/settings.py
def setup_environment():
    """Complete DSPy environment configuration with Langfuse tracking."""
    load_dotenv()
    
    # Configure DSPy with Ollama
    model_name = os.getenv("OLLAMA_MODEL", "gemma3:12b")
    llm = dspy.LM(model=f"ollama/{model_name}")
    dspy.settings.configure(lm=llm)
    
    # Initialize Langfuse for comprehensive observability
    configure_litellm_callbacks()
    langfuse_handler = CallbackHandler(
        secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
        public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
        host=os.getenv("LANGFUSE_HOST", "http://localhost:3000")
    )
    
    logger.info(f"✅ DSPy configured with {model_name}")
    logger.info(f"✅ Langfuse tracking enabled")
    
    return langfuse_handler

Data Pipeline: NHTSA Automotive Complaints

The pipeline processes real-world automotive complaint data with intelligent filtering:

# From src/_01_load_data.py
def load_and_clean_nhtsa_data(file_path: str, sample_size: int = 500) -> List[dspy.Example]:
    """Load and clean NHTSA automotive complaint data with quality filtering."""
    
    logger.info(f"Loading NHTSA data from {file_path}")
    df = pd.read_csv(file_path)
    logger.info(f"Loaded {len(df)} raw complaints")
    
    # Quality filtering pipeline
    df = df[
        # Content length requirements
        (df['NARRATIVE'].str.len() >= 100) &
        (df['NARRATIVE'].str.len() <= 5000) &
        
        # Remove redacted/incomplete content
        (~df['NARRATIVE'].str.contains('REDACTED', case=False, na=False)) &
        (~df['NARRATIVE'].str.contains('INFORMATION NOT PROVIDED', case=False, na=False)) &
        (~df['NARRATIVE'].str.contains('NO ADDITIONAL INFORMATION', case=False, na=False)) &
        
        # Ensure essential fields exist
        (df['MAKE'].notna()) &
        (df['MODEL'].notna()) &
        (df['YEAR'].notna()) &
        (df['YEAR'] >= 1990) &
        (df['YEAR'] <= 2025)
    ]
    
    logger.info(f"After filtering: {len(df)} quality complaints")
    
    # Create structured DSPy Examples
    examples = []
    for _, row in df.head(sample_size).iterrows():
        # Clean and normalize fields
        make = clean_automotive_field(row['MAKE'])
        model = clean_automotive_field(row['MODEL'])
        year = int(row['YEAR']) if pd.notna(row['YEAR']) else None
        
        # Create structured output target
        vehicle_info = VehicleInfo(make=make, model=model, year=year)
        
        # Create DSPy Example with input/output pairing
        example = dspy.Example(
            narrative=row['NARRATIVE'].strip(),
            vehicle_info=vehicle_info
        ).with_inputs('narrative')
        
        examples.append(example)
    
    logger.info(f"Created {len(examples)} structured examples")
    return examples
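
The `clean_automotive_field` helper used above is not shown in this excerpt; a minimal sketch of what it might do (the real implementation may also resolve manufacturer aliases) is:

# Hypothetical sketch of the clean_automotive_field() helper referenced above.
def clean_automotive_field(value) -> str:
    """Uppercase, trim, and collapse whitespace in a make/model field."""
    if pd.isna(value):
        return "UNKNOWN"
    cleaned = " ".join(str(value).strip().upper().split())
    return cleaned if cleaned else "UNKNOWN"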

Schema Definition: Strategy Pattern Implementation

The system implements 5 distinct prompting strategies using the Strategy Pattern:

# From src/_02_define_schema.py
from abc import ABC, abstractmethod
from typing import Type
import dspy

class PromptStrategy(ABC):
    """Abstract base class for prompting strategies."""
    
    @abstractmethod
    def get_docstring(self) -> str:
        """Return strategy-specific instructions."""
        pass

class ContrastiveCoTStrategy(PromptStrategy):
    """Contrastive Chain of Thought with positive/negative examples."""
    
    def get_docstring(self) -> str:
        return """
Extract vehicle information using contrastive reasoning analysis.

GOOD REASONING EXAMPLE:
Text: "I own a 2022 Tesla Model Y that has brake issues"
Analysis: "2022" is clearly a year (4 digits, recent), "Tesla" is the manufacturer, "Model Y" is the specific vehicle model
Result: Make=Tesla, Model=Model Y, Year=2022 ✅

BAD REASONING EXAMPLE:  
Text: "My car was going 65 mph with 50,000 miles"
Analysis: "65" and "50,000" are speed and mileage, not vehicle identification
Result: Make=UNKNOWN, Model=UNKNOWN, Year=UNKNOWN ✅

Now analyze the automotive complaint narrative using contrastive reasoning principles:
- What specific text indicates Make/Model/Year vs other numbers?
- How can you avoid confusing vehicle info with performance metrics?
- What evidence supports each extraction decision?

Provide your reasoning analysis, then extract the structured data.
"""

class SelfRefineStrategy(PromptStrategy):
    """Self-refinement with draft-critique-refine methodology."""
    
    def get_docstring(self) -> str:
        return """
Extract vehicle information using systematic self-refinement:

Step 1 - DRAFT: Extract your initial best guess for make, model, and year
Step 2 - CRITIQUE: Review your draft with these questions:
  - Is the make actually a vehicle manufacturer (not generic "car")?
  - Is the model specific enough (not just "truck" or "sedan")?  
  - Is the year realistic for vehicles (1990-2025 range)?
  - Did I confuse mileage/speed numbers with the model year?
  
Step 3 - REFINE: Based on your critique, provide your final extraction

Show your complete reasoning process including:
- Initial draft and evidence found
- Self-critique and identified issues  
- Final refinement and justification

Then provide the extracted structured data.
"""

# Signature definitions for with/without reasoning
class VehicleExtraction(dspy.Signature):
    """Extract vehicle make, model, and year from automotive complaint text."""
    narrative: str = dspy.InputField(desc="Automotive complaint narrative text")
    vehicle_info: VehicleInfo = dspy.OutputField(desc="Structured vehicle information")

class VehicleExtractionWithReasoning(dspy.Signature):
    """Extract vehicle information with explicit reasoning process."""
    narrative: str = dspy.InputField(desc="Automotive complaint narrative text")
    reasoning: str = dspy.OutputField(desc="Step-by-step extraction reasoning")
    vehicle_info: VehicleInfo = dspy.OutputField(desc="Structured vehicle information")
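
Two pieces referenced throughout the article are not shown above: the `VehicleInfo` output model and the `PROMPT_STRATEGIES` registry keyed by experiment name. A minimal sketch, assuming a Pydantic model and the naming convention used by the optimization scripts, could look like this:

# Hypothetical sketch of the structured output model and strategy registry
# (field defaults and registry keys are assumptions based on usage elsewhere).
from typing import Optional
from pydantic import BaseModel

class VehicleInfo(BaseModel):
    """Structured extraction target for a single complaint narrative."""
    make: str = "UNKNOWN"
    model: str = "UNKNOWN"
    year: Optional[int] = None

PROMPT_STRATEGIES = {
    "contrastive_cot_without_reasoning": ContrastiveCoTStrategy(),
    "contrastive_cot_with_reasoning": ContrastiveCoTStrategy(),
    "self_refine_without_reasoning": SelfRefineStrategy(),
    "self_refine_with_reasoning": SelfRefineStrategy(),
    # ... naive, chain-of-thought, and plan-and-solve entries omitted here
}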

DSPy Program: Modular Extraction Architecture

The core extraction module supports both standard and reasoning-enhanced modes:

# From src/_03_define_program.py
class ExtractionModule(dspy.Module):
    """Core DSPy module for vehicle information extraction with strategy support."""
    
    def __init__(self, strategy: PromptStrategy = None, include_reasoning: bool = False):
        super().__init__()
        self.strategy = strategy or NaiveStrategy()
        self.include_reasoning = include_reasoning
        
        # Select signature based on reasoning requirement
        if include_reasoning:
            self.signature = VehicleExtractionWithReasoning
        else:
            self.signature = VehicleExtraction
        
        # Create predictor with strategy-specific instructions
        self.predictor = dspy.ChainOfThought(self.signature)
        
        # Apply strategy-specific docstring
        self.predictor.signature.__doc__ = self.strategy.get_docstring()
    
    def forward(self, narrative: str) -> dspy.Prediction:
        """Extract vehicle information from narrative with robust error handling."""
        try:
            # Execute prediction with strategy-specific prompting
            prediction = self.predictor(narrative=narrative)
            
            # Validate and structure output
            if hasattr(prediction, 'vehicle_info'):
                return prediction
            else:
                # Handle legacy or malformed predictions
                vehicle_info = VehicleInfo(
                    make=getattr(prediction, 'make', 'UNKNOWN'),
                    model=getattr(prediction, 'model', 'UNKNOWN'),
                    year=self._parse_year(getattr(prediction, 'year', None))
                )
                prediction.vehicle_info = vehicle_info
                return prediction
                
        except Exception as e:
            logger.error(f"Extraction failed for strategy {self.strategy.__class__.__name__}: {e}")
            return self._create_fallback_prediction()
    
    def _create_fallback_prediction(self) -> dspy.Prediction:
        """Create safe fallback prediction for error cases."""
        fallback_info = VehicleInfo(make="UNKNOWN", model="UNKNOWN", year=None)
        return dspy.Prediction(vehicle_info=fallback_info)
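
The `_parse_year` helper used in the fallback path above is not shown; a minimal sketch (as a method on `ExtractionModule`, reusing the 1990-2025 range from the data filter) might be:

    # Hypothetical sketch of the _parse_year() helper used in forward() above.
    def _parse_year(self, value) -> int | None:
        """Coerce a raw year value to int, rejecting implausible model years."""
        try:
            year = int(str(value).strip())
        except (TypeError, ValueError):
            return None
        return year if 1990 <= year <= 2025 else None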

Evaluation Framework: F1-Score with Parallelization

The evaluation system provides robust performance measurement:

# From src/_03_define_program.py  
def extraction_metric(gold: dspy.Example, pred: dspy.Prediction, trace=None) -> float:
    """Calculate F1-score for vehicle information extraction."""
    
    try:
        # Extract gold standard
        gold_vehicle = gold.vehicle_info
        
        # Extract prediction
        if hasattr(pred, 'vehicle_info'):
            pred_vehicle = pred.vehicle_info
        else:
            # Handle direct field predictions
            pred_vehicle = VehicleInfo(
                make=getattr(pred, 'make', 'UNKNOWN'),
                model=getattr(pred, 'model', 'UNKNOWN'),
                year=getattr(pred, 'year', None)
            )
        
        # Calculate field-wise F1 scores
        make_f1 = f1_score_field(pred_vehicle.make, gold_vehicle.make)
        model_f1 = f1_score_field(pred_vehicle.model, gold_vehicle.model)
        year_f1 = f1_score_field(str(pred_vehicle.year), str(gold_vehicle.year))
        
        # Overall F1 is macro-average of field scores
        overall_f1 = (make_f1 + model_f1 + year_f1) / 3.0
        
        # Logging for debugging
        if trace:
            logger.debug(f"Gold: {gold_vehicle}")
            logger.debug(f"Pred: {pred_vehicle}")
            logger.debug(f"Scores: make={make_f1:.3f}, model={model_f1:.3f}, year={year_f1:.3f}")
        
        return overall_f1
        
    except Exception as e:
        logger.error(f"Metric calculation failed: {e}")
        return 0.0

def f1_score_field(pred: str, gold: str) -> float:
    """Calculate F1 score for individual field with normalization."""
    
    # Normalize strings for comparison
    pred_norm = normalize_automotive_field(pred)
    gold_norm = normalize_automotive_field(gold)
    
    # Exact match gets full score
    if pred_norm == gold_norm:
        return 1.0
    
    # Partial match scoring for common variations
    if pred_norm in gold_norm or gold_norm in pred_norm:
        return 0.8
    
    # No match
    return 0.0
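
The `normalize_automotive_field` helper is likewise not shown in the excerpt; a plausible minimal version, assuming the manufacturer abbreviations mentioned in the domain-expertise prompt later on, is:

# Hypothetical sketch of normalize_automotive_field(); the real implementation
# may resolve more aliases than the few shown here.
FIELD_ALIASES = {"CHEVY": "CHEVROLET", "VW": "VOLKSWAGEN", "BENZ": "MERCEDES-BENZ"}

def normalize_automotive_field(value) -> str:
    """Normalize a field for comparison: uppercase, trim, resolve common aliases."""
    if value is None:
        return "UNKNOWN"
    normalized = " ".join(str(value).strip().upper().split())
    return FIELD_ALIASES.get(normalized, normalized)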

Optimization Pipeline: BootstrapFewShot with Langfuse

The optimization process uses DSPy’s BootstrapFewShot teleprompter with comprehensive tracking:

# From src/_04_run_optimization.py
def run_optimization_experiment(strategy_name: str) -> Tuple[ExtractionModule, Dict[str, float]]:
    """Run complete optimization experiment with Langfuse tracking."""
    
    logger.info(f"🚀 Starting optimization for {strategy_name}")
    
    # Initialize environment and tracking
    langfuse_handler = setup_environment()
    
    # Load and split data
    examples = load_and_clean_nhtsa_data("data/NHTSA_complaints.csv", sample_size=500)
    train_examples, eval_examples = train_test_split(examples, test_size=0.1, random_state=42)
    
    logger.info(f"Dataset split: {len(train_examples)} train, {len(eval_examples)} eval")
    
    # Initialize strategy and model
    strategy = PROMPT_STRATEGIES[strategy_name]
    include_reasoning = "with_reasoning" in strategy_name
    model = ExtractionModule(strategy, include_reasoning=include_reasoning)
    
    # Configure DSPy teleprompter
    teleprompter = BootstrapFewShot(
        metric=extraction_metric,
        max_bootstrapped_demos=8,     # Learn from successful examples
        max_labeled_demos=4,          # Include hand-crafted demonstrations
        teacher_settings=dict(lm=dspy.settings.lm),
        student_settings=dict(lm=dspy.settings.lm)
    )
    
    # Create Langfuse trace for observability
    trace = langfuse_handler.trace(
        name=f"DSPy_Optimization_{strategy_name}",
        metadata={
            "strategy": strategy_name, 
            "reasoning": include_reasoning,
            "train_size": len(train_examples),
            "eval_size": len(eval_examples)
        }
    )
    
    with trace:
        # Run compilation/optimization
        logger.info("🔄 Running DSPy compilation...")
        compiled_model = teleprompter.compile(
            model, 
            trainset=train_examples,
            valset=eval_examples[:20]  # Use subset for validation during compilation
        )
        
        # Final evaluation on full eval set
        logger.info("📊 Running final evaluation...")
        evaluator = dspy.Evaluate(
            devset=eval_examples, 
            metric=extraction_metric, 
            num_threads=4,
            display_progress=True
        )
        final_score = evaluator(compiled_model)
        
        # Save results
        save_path = f"results/optimized_{strategy_name}.json"
        compiled_model.save(save_path)
        
        # Update central results tracking
        update_results_summary(
            strategy_name=strategy_name,
            score=final_score,
            trace_url=trace.get_trace_url(),
            optimized_path=save_path
        )
        
        logger.info(f"✅ Optimization complete. F1-Score: {final_score:.3f}")
    
    return compiled_model, {"overall": final_score}
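
Once an experiment finishes, the compiled program can be reloaded for inference. A minimal sketch, assuming the `results/optimized_<strategy>.json` naming convention used above:

# Hypothetical reuse of a compiled program; the file name assumes the
# save convention from run_optimization_experiment() above.
model = ExtractionModule(ContrastiveCoTStrategy(), include_reasoning=True)
model.load("results/optimized_contrastive_cot_with_reasoning.json")

prediction = model(narrative="My 2019 Honda Accord stalls at highway speed.")
print(prediction.vehicle_info)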

Meta-Optimization: Advanced Prompting Techniques

Phase 2 explores sophisticated meta-optimization approaches that, surprisingly, failed to improve performance:

# From src/_05_meta_optimizers.py
from abc import ABC, abstractmethod
from typing import Type, Dict, Any

class MetaOptimizer(ABC):
    """Abstract base class for meta-optimization techniques."""
    
    @abstractmethod
    def enhance_signature(self, base_signature: Type[dspy.Signature]) -> Type[dspy.Signature]:
        """Apply meta-optimization enhancement to a DSPy signature."""
        pass

class DomainExpertiseEnhancement(MetaOptimizer):
    """Inject automotive domain expertise into prompts."""
    
    def enhance_signature(self, base_signature: Type[dspy.Signature]) -> Type[dspy.Signature]:
        enhanced_docstring = f"""
{base_signature.__doc__}

AUTOMOTIVE DOMAIN EXPERTISE INJECTION:
- Major vehicle manufacturers: Toyota, Honda, Ford, Chevrolet, BMW, Mercedes, Tesla, Nissan, Hyundai, Volkswagen
- Common model patterns: Camry, Accord, F-150, Silverado, Model 3, 3 Series, C-Class, Altima, Elantra, Jetta
- Model years typically range from 1990-2025 for complaint data
- Manufacturer abbreviations: Chevy=Chevrolet, Benz=Mercedes-Benz, VW=Volkswagen  
- Trim levels (LX, EX, Limited, Sport) are NOT the model name
- Watch for model variants: "Model 3 Performance" → Model="Model 3"

Apply this automotive domain knowledge during extraction to improve accuracy.
"""
        
        # Create enhanced signature class dynamically
        class EnhancedSignature(base_signature):
            __doc__ = enhanced_docstring
        
        return EnhancedSignature

class FormatEnforcementEnhancement(MetaOptimizer):
    """Enforce strict output formatting requirements."""
    
    def enhance_signature(self, base_signature: Type[dspy.Signature]) -> Type[dspy.Signature]:
        enhanced_docstring = f"""
{base_signature.__doc__}

CRITICAL FORMAT ENFORCEMENT REQUIREMENTS:
- You MUST respond with a valid JSON object following the exact schema
- No additional text, explanations, commentary, or reasoning outside the JSON
- Follow this precise format: {{"make": "...", "model": "...", "year": ...}}
- If uncertain about any field, use "UNKNOWN" for make/model, null for year
- Do not include markdown formatting, code blocks, or extra whitespace
- Validate JSON structure before responding

RESPOND ONLY WITH THE JSON OBJECT. NO OTHER TEXT ALLOWED.
"""
        
        class EnhancedSignature(base_signature):
            __doc__ = enhanced_docstring
        
        return EnhancedSignature

class ConstitutionalEnhancement(MetaOptimizer):
    """Apply constitutional AI principles for multi-faceted reasoning."""
    
    def enhance_signature(self, base_signature: Type[dspy.Signature]) -> Type[dspy.Signature]:
        enhanced_docstring = f"""
{base_signature.__doc__}

CONSTITUTIONAL REASONING FRAMEWORK:
Apply these constitutional principles in order:

1. ACCURACY PRINCIPLE: Extract only information explicitly stated in the text
2. SPECIFICITY PRINCIPLE: Prefer specific vehicle identifiers over generic terms
3. CONSISTENCY PRINCIPLE: Ensure extracted year matches make/model era compatibility  
4. EVIDENCE PRINCIPLE: Base extractions on clear textual evidence
5. HUMILITY PRINCIPLE: Use "UNKNOWN" when information is ambiguous or absent

For each extraction, validate against ALL constitutional principles before finalizing.
"""
        
        class EnhancedSignature(base_signature):
            __doc__ = enhanced_docstring
        
        return EnhancedSignature

# Meta-optimizer registry for systematic testing
META_OPTIMIZERS = {
    "domain_expertise": DomainExpertiseEnhancement,
    "specificity": SpecificityEnhancement,
    "error_prevention": ErrorPreventionEnhancement,
    "context_anchoring": ContextAnchoringEnhancement,
    "format_enforcement": FormatEnforcementEnhancement,
    "constitutional": ConstitutionalEnhancement,
}
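
Because each meta-optimizer wraps an existing signature rather than a raw prompt string, it composes directly with the Phase 1 modules. A minimal sketch of applying one enhancement before compilation, using the registry above:

# Applying a meta-optimizer to the reasoning-enhanced signature before
# re-running the same BootstrapFewShot compilation used in Phase 1.
enhancer = META_OPTIMIZERS["domain_expertise"]()
EnhancedSignature = enhancer.enhance_signature(VehicleExtractionWithReasoning)

# The enhanced signature is then optimized exactly like the Phase 1 baselines.
predictor = dspy.ChainOfThought(EnhancedSignature)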

Experimental Results: The Meta-Optimization Paradox

The results revealed a fascinating paradox that challenges conventional wisdom about prompt optimization:

Phase 1: Universal Reasoning Field Success

Strategy            Without Reasoning   With Reasoning   Improvement   Business Impact
Contrastive CoT     42.67%              51.33%           +8.66% 🏆     20% error reduction
Naive               42.67%              46.67%           +4.00%        9% error reduction
Chain-of-Thought    42.67%              46.00%           +3.33%        8% error reduction
Plan & Solve        42.67%              46.00%           +3.33%        8% error reduction
Self-Refine         43.33%              45.33%           +2.00%        5% error reduction

Key Discovery: 100% of strategies improved with reasoning fields - this represents a universal optimization principle.

Phase 2: Meta-Optimization Performance Regression

# Critical conflict example discovered in analysis
# Contrastive CoT Strategy demands:
"Provide your reasoning showing how you applied good reasoning principles..."

# Format Enforcement Meta-Optimizer demands:  
"You MUST respond with ONLY a JSON object... No additional text or commentary"

# Result: Direct contradiction causing 24% performance drop (51.33% → 27.33%)

Meta-Optimized Strategy                F1-Score   vs Baseline   Status            Root Cause
Contrastive CoT + Domain Expertise     49.33%     -2.00%        ❌ Regression      Prompt complexity
Contrastive CoT + Format Enforcement   27.33%     -24.00%       ❌ Catastrophic    Instruction conflict
Contrastive CoT + Constitutional       46.00%     -5.33%        ❌ Regression      Cognitive overload
Contrastive CoT + Error Prevention     46.67%     -4.66%        ❌ Regression      Competing objectives

Critical Insight: Meta-optimization creates instruction conflicts that degrade performance, establishing reasoning fields as the optimization ceiling.


Deployment Architecture: Multi-Environment Support

Local Development Environment

# Complete local setup with GPU acceleration
git clone https://github.com/Adredes-weslee/dspy-automotive-extractor.git
cd dspy-automotive-extractor

# Install dependencies with UV package manager
pip install uv
python -m uv venv .venv
.\.venv\Scripts\Activate.ps1

# Install PyTorch with CUDA support  
.\.venv\Scripts\python.exe -m pip install torch==2.7.0+cu126 torchvision==0.22.0+cu126 torchaudio==2.7.0+cu126 --extra-index-url https://download.pytorch.org/whl/cu126

# Install project dependencies
python -m uv pip install -e .

# Download Ollama models
ollama pull gemma3:12b      # High-performance (8GB+ VRAM)
ollama pull qwen3:4b        # CPU-friendly alternative

# Configure environment
copy .env.template .env
# Edit .env with your Langfuse credentials (example entries after this block)

# Verify setup
.\.venv\Scripts\python.exe src/verify_gpu.py
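
The environment file only needs a handful of entries; these are the variables read by `setup_environment()` in `src/settings.py`, with placeholder key values:

OLLAMA_MODEL=gemma3:12b
LANGFUSE_SECRET_KEY=sk-lf-...        # placeholder - use your own secret key
LANGFUSE_PUBLIC_KEY=pk-lf-...        # placeholder - use your own public key
LANGFUSE_HOST=http://localhost:3000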

Experimental Pipeline Execution

# Phase 1: Reasoning Field Experiments
.\.venv\Scripts\python.exe src/_04_run_optimization.py naive_without_reasoning
.\.venv\Scripts\python.exe src/_04_run_optimization.py naive_with_reasoning
.\.venv\Scripts\python.exe src/_04_run_optimization.py contrastive_cot_with_reasoning

# Phase 2: Meta-Optimization Experiments  
.\.venv\Scripts\python.exe src/_06_run_meta_optimization.py meta
.\.venv\Scripts\python.exe src/_06_run_meta_optimization.py single --strategy contrastive_cot_domain_expertise

# Launch interactive dashboard
.\.venv\Scripts\python.exe -m streamlit run src/app.py

Cloud Deployment: Streamlit Community Cloud

# From src/app_cloud.py - Zero-dependency cloud deployment
def load_summary_data() -> Dict[str, Any]:
    """Load experimental results with demo data fallback for cloud deployment."""
    
    summary_path = Path("results/results_summary.json")
    
    if summary_path.exists():
        with open(summary_path, "r") as f:
            return json.load(f)
    else:
        # Embedded demo data for Streamlit Community Cloud
        return {
            "naive_without_reasoning": {"final_score": 42.67, "timestamp": "2025-06-30T08:00:00"},
            "naive_with_reasoning": {"final_score": 46.67, "timestamp": "2025-06-30T08:15:00"},
            "contrastive_cot_without_reasoning": {"final_score": 42.67, "timestamp": "2025-06-30T08:30:00"},
            "contrastive_cot_with_reasoning": {"final_score": 51.33, "timestamp": "2025-06-30T08:45:00"},
            "contrastive_cot_domain_expertise_bootstrap": {
                "final_score": 49.33, 
                "strategy_type": "meta_optimized",
                "timestamp": "2025-06-30T09:00:00"
            },
            "contrastive_cot_format_enforcement_bootstrap": {
                "final_score": 27.33,
                "strategy_type": "meta_optimized", 
                "timestamp": "2025-06-30T09:15:00"
            }
        }

def create_performance_visualization(results_data):
    """Create interactive performance comparison charts with Plotly."""
    
    df = pd.DataFrame(results_data)
    
    # Strategy type classification with enhanced logic
    def classify_strategy(strategy_name):
        if "meta_optimized" in strategy_name or strategy_name.endswith("_bootstrap"):
            return "Meta-Optimized"
        elif "with_reasoning" in strategy_name:
            return "Baseline (+ Reasoning)"  
        elif "without_reasoning" in strategy_name:
            return "Baseline (- Reasoning)"
        else:
            return "Baseline"
    
    df["Strategy_Type"] = df["Strategy"].apply(classify_strategy)
    
    # Interactive Plotly visualization with dynamic sizing
    fig = px.bar(
        df.sort_values("F1_Score", ascending=True),
        x="F1_Score",
        y="Strategy", 
        color="Strategy_Type",
        title="DSPy Optimization Results: Reasoning Fields vs Meta-Optimization",
        height=max(500, len(df) * 30),  # Dynamic height based on data
        color_discrete_map={
            "Baseline (- Reasoning)": "#87CEEB",
            "Baseline (+ Reasoning)": "#1f77b4", 
            "Baseline": "#1f77b4",
            "Meta-Optimized": "#ff7f0e"
        }
    )
    
    fig.update_layout(
        xaxis_title="F1-Score (%)",
        yaxis_title="Strategy",
        margin=dict(l=200),  # Space for strategy names
        showlegend=True
    )
    
    return fig

# Deploy to Streamlit Community Cloud
def main():
    st.set_page_config(page_title="DSPy Automotive Extractor", layout="wide")
    st.title("🚗 DSPy Automotive Extractor Dashboard")
    st.markdown("*Cloud version - Comprehensive optimization analysis with embedded demo data*")
    
    # Load data with cloud fallback
    summary_data = load_summary_data()
    
    # Create interactive dashboard
    tab1, tab2, tab3 = st.tabs(["📈 Results Analysis", "🧠 Experimental Insights", "🌐 Cloud Demo"])
    
    with tab1:
        display_enhanced_results_tab(summary_data)
    
    with tab2:
        display_analysis_tab(summary_data)
    
    with tab3:
        display_cloud_demo_tab()

System Diagnostics: Production Readiness

The project includes comprehensive diagnostic capabilities:

# From src/verify_gpu.py
def comprehensive_system_check():
    """Complete system validation for production deployment."""
    
    print("🔍 DSPy AUTOMOTIVE EXTRACTOR - SYSTEM DIAGNOSTICS")
    print("=" * 60)
    
    # PyTorch CUDA verification
    check_pytorch_cuda()
    
    # Ollama connectivity test  
    check_ollama_connection()
    
    # DSPy inference pipeline test
    test_dspy_inference()
    
    # Memory and performance validation
    check_system_resources()
    
    # Data pipeline validation
    validate_data_pipeline()

def check_pytorch_cuda():
    """Comprehensive PyTorch CUDA verification."""
    print("\n🔍 PYTORCH CUDA VERIFICATION")
    print("-" * 30)
    
    try:
        import torch
        
        if torch.cuda.is_available():
            print(f"✅ CUDA Available: {torch.version.cuda}")
            print(f"✅ GPU Count: {torch.cuda.device_count()}")
            
            for i in range(torch.cuda.device_count()):
                gpu_props = torch.cuda.get_device_properties(i)
                memory_gb = gpu_props.total_memory / (1024**3)
                print(f"   GPU {i}: {gpu_props.name} ({memory_gb:.1f} GB)")
            
            # Performance test
            device = torch.device("cuda:0")
            start_time = time.time()
            x = torch.randn(1000, 1000, device=device)
            y = torch.randn(1000, 1000, device=device)
            z = torch.mm(x, y)
            torch.cuda.synchronize()
            elapsed = time.time() - start_time
            
            print(f"✅ GPU Matrix Multiplication: {elapsed:.3f}s")
            
        else:
            print("❌ CUDA not available - will use CPU inference")
            print("⚠️  Performance will be significantly slower")
            
    except ImportError:
        print("❌ PyTorch not installed")

def check_ollama_connection():
    """Test Ollama service connectivity and model availability."""
    print("\n🔍 OLLAMA SERVICE VERIFICATION") 
    print("-" * 30)
    
    try:
        import requests
        
        # Check Ollama service
        response = requests.get("http://localhost:11434/api/tags", timeout=5)
        if response.status_code == 200:
            models = response.json().get("models", [])
            print(f"✅ Ollama service running")
            print(f"✅ Available models: {len(models)}")
            
            # Check for required models
            model_names = [model["name"] for model in models]
            required_models = ["gemma3:12b", "qwen3:4b"]
            
            for model in required_models:
                if any(model in name for name in model_names):
                    print(f"✅ Model available: {model}")
                else:
                    print(f"⚠️  Model missing: {model}")
                    
        else:
            print("❌ Ollama service not responding")
            
    except Exception as e:
        print(f"❌ Ollama connection failed: {e}")
        print("💡 Ensure Ollama is installed and running: ollama serve")

def test_dspy_inference():
    """Test complete DSPy inference pipeline."""
    print("\n🔍 DSPY INFERENCE VERIFICATION")
    print("-" * 30)
    
    try:
        import dspy
        from src._02_define_schema import VehicleExtraction
        from src._03_define_program import ExtractionModule
        
        # Configure DSPy
        model_name = os.getenv("OLLAMA_MODEL", "gemma3:12b")
        llm = dspy.LM(model=f"ollama/{model_name}")
        dspy.settings.configure(lm=llm)
        
        # Test inference
        program = ExtractionModule()
        test_narrative = "I own a 2022 Tesla Model Y with brake issues"
        
        start_time = time.time()
        result = program(narrative=test_narrative)
        elapsed = time.time() - start_time
        
        print(f"✅ DSPy inference successful")
        print(f"✅ Response time: {elapsed:.2f}s") 
        print(f"✅ Extracted: {result.vehicle_info}")
        
    except Exception as e:
        print(f"❌ DSPy inference failed: {e}")

Key Technical Discoveries

1. The Reasoning Field Universal Law

Technical Finding: Adding explicit reasoning output fields improves performance across ALL baseline strategies without exception.

Implementation:

# Standard signature
class VehicleExtraction(dspy.Signature):
    narrative: str = dspy.InputField()
    vehicle_info: VehicleInfo = dspy.OutputField()

# Reasoning-enhanced signature  
class VehicleExtractionWithReasoning(dspy.Signature):
    narrative: str = dspy.InputField()
    reasoning: str = dspy.OutputField(desc="Step-by-step extraction reasoning")
    vehicle_info: VehicleInfo = dspy.OutputField()

Result: Universal +4.26% average improvement, with Contrastive CoT achieving +8.66%.

2. The Meta-Optimization Paradox

Technical Finding: Advanced prompt engineering techniques consistently failed to improve DSPy-optimized baselines.

Root Cause Analysis:

# Instruction conflict example
base_strategy = "Provide reasoning showing your analysis..."
meta_optimizer = "Respond ONLY with JSON, no explanations..."
# Result: Contradictory requirements → 24% performance drop

3. The Framework Alignment Principle

Technical Finding: DSPy-native optimization outperforms external prompt engineering techniques.

Implication: Framework compatibility is more valuable than prompt sophistication.

4. The Performance Ceiling Effect

Technical Finding: Complex optimization approaches hit performance ceilings that simpler methods exceed.

Evidence: Meta-optimization peak (49.33%) < Reasoning field peak (51.33%)


Performance Optimization: Production Considerations

Hardware Requirements and Scaling

Component   Minimum      Recommended   Enterprise
RAM         8GB          16GB          32GB+
GPU VRAM    None (CPU)   8GB           16GB+
Storage     50GB         100GB         500GB+
CPU Cores   4            8             16+

Runtime Performance Benchmarks

Strategy Type            GPU Runtime   CPU Runtime   Throughput
Baseline (- Reasoning)   5-10 min      20-30 min     500 complaints/hour
Baseline (+ Reasoning)   10-15 min     30-45 min     350 complaints/hour
Meta-Optimized           15-25 min     45-60 min     250 complaints/hour

Deployment Scaling Strategies

# Horizontal scaling with process-based parallelism
import gc
from concurrent.futures import ProcessPoolExecutor, as_completed
from typing import List

import dspy
import torch
from tqdm import tqdm
# logger, run_optimization_experiment, and extraction_metric are assumed to be
# imported from the project modules shown earlier in this article.

def parallel_optimization(strategies: List[str], max_workers: int = 4):
    """Run multiple optimization experiments in parallel."""
    
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(run_optimization_experiment, strategy): strategy 
            for strategy in strategies
        }
        
        results = {}
        for future in tqdm(as_completed(futures), total=len(futures)):
            strategy = futures[future]
            try:
                model, scores = future.result()
                results[strategy] = scores
                logger.info(f"✅ {strategy}: {scores['overall']:.3f}")
            except Exception as e:
                logger.error(f"❌ {strategy} failed: {e}")
                results[strategy] = {"overall": 0.0, "error": str(e)}
        
        return results

# Memory optimization for large datasets
def batch_evaluation(model, examples: List[dspy.Example], batch_size: int = 50):
    """Evaluate model in batches to manage memory usage."""
    
    total_score = 0.0
    total_examples = len(examples)
    
    for i in tqdm(range(0, total_examples, batch_size), desc="Batch evaluation"):
        batch = examples[i:i + batch_size]
        
        batch_scores = []
        for example in batch:
            try:
                prediction = model(narrative=example.narrative)
                score = extraction_metric(example, prediction)
                batch_scores.append(score)
            except Exception as e:
                logger.warning(f"Evaluation failed for example {i}: {e}")
                batch_scores.append(0.0)
        
        total_score += sum(batch_scores)
        
        # Memory cleanup after each batch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        
        gc.collect()
    
    return total_score / total_examples
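
A hypothetical invocation of the parallel runner, comparing the champion strategy's two Phase 1 configurations on a single machine:

# Hypothetical usage of parallel_optimization() with two Phase 1 configurations.
if __name__ == "__main__":
    scores = parallel_optimization(
        ["contrastive_cot_without_reasoning", "contrastive_cot_with_reasoning"],
        max_workers=2,
    )
    print(scores)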

Future Research Directions: Expanding the Framework

Immediate Technical Extensions

  1. Multi-Domain Validation: Test reasoning field principles across medical, legal, and financial extraction tasks
  2. Advanced Metrics: Implement semantic similarity scoring beyond exact string matching
  3. Real-Time Processing: Stream processing capabilities for continuous complaint monitoring
  4. Multi-Modal Integration: Extend framework to process images, PDFs, and technical diagrams

Long-Term Research Opportunities

# Future research directions with technical foundations
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticSimilarityMetric:
    """Enhanced evaluation using semantic similarity instead of exact matching."""
    
    def __init__(self, similarity_threshold: float = 0.8):
        self.threshold = similarity_threshold
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
    
    def calculate_similarity(self, pred: str, gold: str) -> float:
        """Calculate semantic similarity between predicted and gold values."""
        if pred == gold:
            return 1.0
        
        pred_embedding = self.encoder.encode([pred])
        gold_embedding = self.encoder.encode([gold])
        
        similarity = cosine_similarity(pred_embedding, gold_embedding)[0][0]
        return max(0.0, similarity)

class MultiModalExtractionModule(dspy.Module):
    """Future extension for multi-modal document processing."""
    
    def __init__(self, include_vision: bool = False):
        super().__init__()
        self.include_vision = include_vision
        
        if include_vision:
            self.signature = VehicleExtractionMultiModal
        else:
            self.signature = VehicleExtraction
        
        self.predictor = dspy.ChainOfThought(self.signature)
    
    def forward(self, narrative: str, image_path: str = None) -> dspy.Prediction:
        """Extract from text and optionally images."""
        # Implementation for future multi-modal capabilities
        pass

class StreamingExtractionPipeline:
    """Real-time complaint processing pipeline."""
    
    def __init__(self, model: ExtractionModule, batch_size: int = 10):
        self.model = model
        self.batch_size = batch_size
        self.buffer = []
    
    async def process_stream(self, complaint_stream):
        """Process complaints in real-time batches."""
        async for complaint in complaint_stream:
            self.buffer.append(complaint)
            
            if len(self.buffer) >= self.batch_size:
                results = await self.process_batch(self.buffer)
                yield results
                self.buffer = []

Conclusion: Transforming Prompt Optimization Methodology

This project fundamentally challenges how we approach prompt optimization, providing the first rigorous scientific validation that reasoning fields combined with DSPy-native optimization form the optimization sweet spot, while meta-optimization yields diminishing returns on already-optimized baselines.

Technical Achievements Summary:

  1. Systematic Validation: First rigorous comparison of reasoning fields vs meta-optimization with 26 strategies tested
  2. Production Framework: Complete pipeline from data loading to cloud deployment with observability
  3. Reproducible Science: Quantitative methodology that eliminates subjective prompt engineering
  4. Framework Principles: Established DSPy-specific optimization principles that prioritize architectural alignment

Methodological Impact:

The discovery that DSPy’s framework alignment trumps prompt engineering sophistication represents a paradigm shift from creativity-driven to systematic, framework-aware optimization approaches. This has profound implications for:

  • Enterprise AI Development: Systematic optimization reduces development cycles
  • Research Methodology: Establishes quantitative foundations for prompt optimization research
  • Framework Design: Informs future development of LLM optimization frameworks
  • Best Practices: Provides evidence-based guidelines for structured extraction tasks

Code Quality and Architecture:

The implementation demonstrates production-ready practices including:

  • Comprehensive error handling and logging
  • Modular architecture with clear separation of concerns
  • Extensive documentation following Google-style conventions
  • Multi-environment deployment (local, cloud, enterprise)
  • Performance optimization and resource management
  • Systematic testing and validation frameworks

This research provides both theoretical insights and practical tools for building robust, high-performance structured extraction systems that prioritize framework compatibility over prompt complexity, establishing a new standard for systematic prompt optimization methodology.


To explore the complete DSPy Automotive Extractor platform, including its overall architecture, optimization methodology, and usage instructions, please refer to the DSPy Automotive Extractor: Systematic Prompt Optimization for Enterprise AI Project Page. The full codebase for the framework and the optimization techniques discussed herein is available on GitHub.