DSPy Prompt Optimization: A Scientific Approach to Automotive Intelligence
Introduction: From Prompt Engineering to Prompt Science
The field of prompt engineering has long been dominated by trial-and-error approaches, where practitioners manually iterate through different prompting strategies hoping to find configurations that work. This project represents a paradigm shift: treating prompt optimization as a rigorous scientific discipline using Stanford’s DSPy framework to systematically compile and optimize prompts for structured data extraction.
Related: For the business context and strategic implications of this project, see the DSPy Automotive Extractor project page.
Try It Yourself!
Explore the DSPy optimization results and experiment with different prompting strategies through our interactive dashboard:
Launch Interactive Demo
The Technical Foundation: Two-Phase Research Methodology
This project implements a two-phase experimental methodology that produced two decisive findings about DSPy optimization strategies:
Phase 1: Reasoning Field Impact Analysis ✅ CONFIRMED
- Hypothesis: Explicit reasoning tokens improve extraction accuracy
- Method: Compare the same five strategies with and without a reasoning output field
- Results: Universal improvement across all strategies (100% success rate)
- Champion: Contrastive CoT + Reasoning achieved 51.33% F1-score
Phase 2: Meta-Optimization Effectiveness ❌ REFUTED
- Hypothesis: Advanced prompt engineering enhances DSPy-optimized baselines
- Method: Apply 6 meta-optimization techniques to reasoning-enhanced strategies
- Results: Failed to exceed 51.33% ceiling (best meta-optimized: 49.33%)
- Critical Discovery: Instruction conflicts create performance degradation
Technical Architecture: Production-Grade Pipeline
Project Structure and Module Design
The system is architected as a modular, sequential pipeline with comprehensive observability:
dspy-automotive-extractor/
├── src/
│   ├── settings.py                   # Central configuration
│   ├── _01_load_data.py              # NHTSA data pipeline
│   ├── _02_define_schema.py          # DSPy signatures + 5 strategies
│   ├── _03_define_program.py         # Core extraction module
│   ├── _04_run_optimization.py       # Basic optimization (Phase 1)
│   ├── _05_meta_optimizers.py        # Meta-optimization techniques
│   ├── _06_run_meta_optimization.py  # Advanced optimization (Phase 2)
│   ├── verify_gpu.py                 # System diagnostics
│   ├── app.py                        # Local dashboard with live demo
│   └── app_cloud.py                  # Cloud-ready dashboard
├── data/
│   └── NHTSA_complaints.csv          # Automotive complaints dataset
├── results/
│   ├── optimized_*.json              # Compiled DSPy programs
│   └── results_summary.json          # Experimental results
└── requirements.txt                  # Dependencies
Environment Setup and Configuration
The project uses a centralized configuration system with comprehensive environment validation:
# From src/settings.py
def setup_environment():
    """Complete DSPy environment configuration with Langfuse tracking."""
    load_dotenv()

    # Configure DSPy with Ollama
    model_name = os.getenv("OLLAMA_MODEL", "gemma3:12b")
    llm = dspy.LM(model=f"ollama/{model_name}")
    dspy.settings.configure(lm=llm)

    # Initialize Langfuse for comprehensive observability
    configure_litellm_callbacks()
    langfuse_handler = CallbackHandler(
        secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
        public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
        host=os.getenv("LANGFUSE_HOST", "http://localhost:3000"),
    )

    logger.info(f"✅ DSPy configured with {model_name}")
    logger.info("✅ Langfuse tracking enabled")
    return langfuse_handler
Data Pipeline: NHTSA Automotive Complaints
The pipeline processes real-world automotive complaint data with intelligent filtering:
# From src/_01_load_data.py
def load_and_clean_nhtsa_data(file_path: str, sample_size: int = 500) -> List[dspy.Example]:
    """Load and clean NHTSA automotive complaint data with quality filtering."""
    logger.info(f"Loading NHTSA data from {file_path}")
    df = pd.read_csv(file_path)
    logger.info(f"Loaded {len(df)} raw complaints")

    # Quality filtering pipeline
    df = df[
        # Content length requirements
        (df['NARRATIVE'].str.len() >= 100) &
        (df['NARRATIVE'].str.len() <= 5000) &
        # Remove redacted/incomplete content
        (~df['NARRATIVE'].str.contains('REDACTED', case=False, na=False)) &
        (~df['NARRATIVE'].str.contains('INFORMATION NOT PROVIDED', case=False, na=False)) &
        (~df['NARRATIVE'].str.contains('NO ADDITIONAL INFORMATION', case=False, na=False)) &
        # Ensure essential fields exist
        (df['MAKE'].notna()) &
        (df['MODEL'].notna()) &
        (df['YEAR'].notna()) &
        (df['YEAR'] >= 1990) &
        (df['YEAR'] <= 2025)
    ]
    logger.info(f"After filtering: {len(df)} quality complaints")

    # Create structured DSPy Examples
    examples = []
    for _, row in df.head(sample_size).iterrows():
        # Clean and normalize fields
        make = clean_automotive_field(row['MAKE'])
        model = clean_automotive_field(row['MODEL'])
        year = int(row['YEAR']) if pd.notna(row['YEAR']) else None

        # Create structured output target
        vehicle_info = VehicleInfo(make=make, model=model, year=year)

        # Create DSPy Example with input/output pairing
        example = dspy.Example(
            narrative=row['NARRATIVE'].strip(),
            vehicle_info=vehicle_info,
        ).with_inputs('narrative')
        examples.append(example)

    logger.info(f"Created {len(examples)} structured examples")
    return examples
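The loader delegates field cleanup to a clean_automotive_field helper that this excerpt does not show. A minimal sketch consistent with how it is used above, assuming simple trim-and-uppercase normalization (the repository version may differ):

# Hypothetical sketch of clean_automotive_field; assumed behavior, not copied
# from the repository.
import pandas as pd

def clean_automotive_field(value) -> str:
    """Trim whitespace and uppercase a make/model field, defaulting to UNKNOWN."""
    if pd.isna(value):
        return "UNKNOWN"
    cleaned = str(value).strip().upper()
    return cleaned or "UNKNOWN"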
Schema Definition: Strategy Pattern Implementation
The system implements 5 distinct prompting strategies using the Strategy Pattern:
# From src/_02_define_schema.py
from abc import ABC, abstractmethod
from typing import Type

import dspy

class PromptStrategy(ABC):
    """Abstract base class for prompting strategies."""

    @abstractmethod
    def get_docstring(self) -> str:
        """Return strategy-specific instructions."""
        pass

class ContrastiveCoTStrategy(PromptStrategy):
    """Contrastive Chain of Thought with positive/negative examples."""

    def get_docstring(self) -> str:
        return """
        Extract vehicle information using contrastive reasoning analysis.

        GOOD REASONING EXAMPLE:
        Text: "I own a 2022 Tesla Model Y that has brake issues"
        Analysis: "2022" is clearly a year (4 digits, recent), "Tesla" is the manufacturer, "Model Y" is the specific vehicle model
        Result: Make=Tesla, Model=Model Y, Year=2022 ✅

        BAD REASONING EXAMPLE:
        Text: "My car was going 65 mph with 50,000 miles"
        Analysis: "65" and "50,000" are speed and mileage, not vehicle identification
        Result: Make=UNKNOWN, Model=UNKNOWN, Year=UNKNOWN ✅

        Now analyze the automotive complaint narrative using contrastive reasoning principles:
        - What specific text indicates Make/Model/Year vs other numbers?
        - How can you avoid confusing vehicle info with performance metrics?
        - What evidence supports each extraction decision?

        Provide your reasoning analysis, then extract the structured data.
        """

class SelfRefineStrategy(PromptStrategy):
    """Self-refinement with draft-critique-refine methodology."""

    def get_docstring(self) -> str:
        return """
        Extract vehicle information using systematic self-refinement:

        Step 1 - DRAFT: Extract your initial best guess for make, model, and year.
        Step 2 - CRITIQUE: Review your draft with these questions:
        - Is the make actually a vehicle manufacturer (not generic "car")?
        - Is the model specific enough (not just "truck" or "sedan")?
        - Is the year realistic for vehicles (1990-2025 range)?
        - Did I confuse mileage/speed numbers with the model year?
        Step 3 - REFINE: Based on your critique, provide your final extraction.

        Show your complete reasoning process including:
        - Initial draft and evidence found
        - Self-critique and identified issues
        - Final refinement and justification

        Then provide the extracted structured data.
        """

# Signature definitions for with/without reasoning
class VehicleExtraction(dspy.Signature):
    """Extract vehicle make, model, and year from automotive complaint text."""
    narrative: str = dspy.InputField(desc="Automotive complaint narrative text")
    vehicle_info: VehicleInfo = dspy.OutputField(desc="Structured vehicle information")

class VehicleExtractionWithReasoning(dspy.Signature):
    """Extract vehicle information with explicit reasoning process."""
    narrative: str = dspy.InputField(desc="Automotive complaint narrative text")
    reasoning: str = dspy.OutputField(desc="Step-by-step extraction reasoning")
    vehicle_info: VehicleInfo = dspy.OutputField(desc="Structured vehicle information")
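Both signatures type their output as VehicleInfo, which these excerpts reference but never define. A plausible minimal Pydantic model inferred from how the fields are used throughout the pipeline; the actual class in src/_02_define_schema.py may add validators:

# Hypothetical reconstruction of the VehicleInfo model; inferred from usage,
# not copied from the repository.
from typing import Optional
from pydantic import BaseModel, Field

class VehicleInfo(BaseModel):
    """Structured vehicle identification extracted from a complaint narrative."""
    make: str = Field(default="UNKNOWN", description="Manufacturer, e.g. 'TESLA'")
    model: str = Field(default="UNKNOWN", description="Specific model, e.g. 'MODEL Y'")
    year: Optional[int] = Field(default=None, description="Model year (1990-2025) or None")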
DSPy Program: Modular Extraction Architecture
The core extraction module supports both standard and reasoning-enhanced modes:
# From src/_03_define_program.py
class ExtractionModule(dspy.Module):
    """Core DSPy module for vehicle information extraction with strategy support."""

    def __init__(self, strategy: PromptStrategy = None, include_reasoning: bool = False):
        super().__init__()
        self.strategy = strategy or NaiveStrategy()
        self.include_reasoning = include_reasoning

        # Select signature based on reasoning requirement
        if include_reasoning:
            self.signature = VehicleExtractionWithReasoning
        else:
            self.signature = VehicleExtraction

        # Create predictor with strategy-specific instructions
        self.predictor = dspy.ChainOfThought(self.signature)
        # Apply strategy-specific docstring
        self.predictor.signature.__doc__ = self.strategy.get_docstring()

    def forward(self, narrative: str) -> dspy.Prediction:
        """Extract vehicle information from narrative with robust error handling."""
        try:
            # Execute prediction with strategy-specific prompting
            prediction = self.predictor(narrative=narrative)

            # Validate and structure output
            if hasattr(prediction, 'vehicle_info'):
                return prediction
            else:
                # Handle legacy or malformed predictions
                vehicle_info = VehicleInfo(
                    make=getattr(prediction, 'make', 'UNKNOWN'),
                    model=getattr(prediction, 'model', 'UNKNOWN'),
                    year=self._parse_year(getattr(prediction, 'year', None)),
                )
                prediction.vehicle_info = vehicle_info
                return prediction
        except Exception as e:
            logger.error(f"Extraction failed for strategy {self.strategy.__class__.__name__}: {e}")
            return self._create_fallback_prediction()

    def _create_fallback_prediction(self) -> dspy.Prediction:
        """Create safe fallback prediction for error cases."""
        fallback_info = VehicleInfo(make="UNKNOWN", model="UNKNOWN", year=None)
        return dspy.Prediction(vehicle_info=fallback_info)
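forward() delegates year coercion to a _parse_year helper that the excerpt omits. A minimal sketch, shown standalone here, assuming it pulls a plausible four-digit model year out of whatever the model returned:

# Hypothetical sketch of _parse_year; the repository version may handle more
# edge cases (two-digit years, ranges, etc.).
import re
from typing import Optional

def _parse_year(raw) -> Optional[int]:
    """Extract a 1990-2025 model year from a raw prediction value, else None."""
    if raw is None:
        return None
    match = re.search(r"\b(199\d|20[01]\d|202[0-5])\b", str(raw))
    return int(match.group(0)) if match else None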
Evaluation Framework: F1-Score with Parallelization
The evaluation system provides robust performance measurement:
# From src/_03_define_program.py
def extraction_metric(gold: dspy.Example, pred: dspy.Prediction, trace=None) -> float:
    """Calculate F1-score for vehicle information extraction."""
    try:
        # Extract gold standard
        gold_vehicle = gold.vehicle_info

        # Extract prediction
        if hasattr(pred, 'vehicle_info'):
            pred_vehicle = pred.vehicle_info
        else:
            # Handle direct field predictions
            pred_vehicle = VehicleInfo(
                make=getattr(pred, 'make', 'UNKNOWN'),
                model=getattr(pred, 'model', 'UNKNOWN'),
                year=getattr(pred, 'year', None),
            )

        # Calculate field-wise F1 scores
        make_f1 = f1_score_field(pred_vehicle.make, gold_vehicle.make)
        model_f1 = f1_score_field(pred_vehicle.model, gold_vehicle.model)
        year_f1 = f1_score_field(str(pred_vehicle.year), str(gold_vehicle.year))

        # Overall F1 is the macro-average of field scores
        overall_f1 = (make_f1 + model_f1 + year_f1) / 3.0

        # Logging for debugging
        if trace:
            logger.debug(f"Gold: {gold_vehicle}")
            logger.debug(f"Pred: {pred_vehicle}")
            logger.debug(f"Scores: make={make_f1:.3f}, model={model_f1:.3f}, year={year_f1:.3f}")

        return overall_f1
    except Exception as e:
        logger.error(f"Metric calculation failed: {e}")
        return 0.0

def f1_score_field(pred: str, gold: str) -> float:
    """Calculate F1 score for an individual field with normalization."""
    # Normalize strings for comparison
    pred_norm = normalize_automotive_field(pred)
    gold_norm = normalize_automotive_field(gold)

    # Exact match gets full score
    if pred_norm == gold_norm:
        return 1.0
    # Partial match scoring for common variations
    if pred_norm in gold_norm or gold_norm in pred_norm:
        return 0.8
    # No match
    return 0.0
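f1_score_field depends on a normalize_automotive_field helper that is not shown. A sketch of one reasonable implementation; the abbreviation table is an assumption mirroring the domain-expertise prompt later in this article (Chevy=Chevrolet, etc.):

# Hypothetical sketch of normalize_automotive_field; abbreviation mappings
# are assumed, not copied from the repository.
ABBREVIATIONS = {
    "chevy": "chevrolet",
    "benz": "mercedes-benz",
    "vw": "volkswagen",
}

def normalize_automotive_field(value: str) -> str:
    """Lowercase, trim, and expand common manufacturer abbreviations."""
    norm = (value or "unknown").strip().lower()
    return ABBREVIATIONS.get(norm, norm)

Under this normalization, f1_score_field("Chevy", "CHEVROLET") scores 1.0 via the exact-match branch, while "Model 3 Performance" vs "Model 3" falls into the 0.8 substring branch.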
Optimization Pipeline: BootstrapFewShot with Langfuse
The optimization process uses DSPy’s BootstrapFewShot teleprompter with comprehensive tracking:
# From src/_04_run_optimization.py
def run_optimization_experiment(strategy_name: str) -> Tuple[ExtractionModule, Dict[str, float]]:
    """Run a complete optimization experiment with Langfuse tracking."""
    logger.info(f"🚀 Starting optimization for {strategy_name}")

    # Initialize environment and tracking
    langfuse_handler = setup_environment()

    # Load and split data
    examples = load_and_clean_nhtsa_data("data/NHTSA_complaints.csv", sample_size=500)
    train_examples, eval_examples = train_test_split(examples, test_size=0.1, random_state=42)
    logger.info(f"Dataset split: {len(train_examples)} train, {len(eval_examples)} eval")

    # Initialize strategy and model
    strategy = PROMPT_STRATEGIES[strategy_name]
    include_reasoning = "with_reasoning" in strategy_name
    model = ExtractionModule(strategy, include_reasoning=include_reasoning)

    # Configure DSPy teleprompter
    teleprompter = BootstrapFewShot(
        metric=extraction_metric,
        max_bootstrapped_demos=8,  # Learn from successful examples
        max_labeled_demos=4,       # Include hand-crafted demonstrations
        teacher_settings=dict(lm=dspy.settings.lm),
        student_settings=dict(lm=dspy.settings.lm),
    )

    # Create Langfuse trace for observability
    trace = langfuse_handler.trace(
        name=f"DSPy_Optimization_{strategy_name}",
        metadata={
            "strategy": strategy_name,
            "reasoning": include_reasoning,
            "train_size": len(train_examples),
            "eval_size": len(eval_examples),
        },
    )

    with trace:
        # Run compilation/optimization
        logger.info("🔄 Running DSPy compilation...")
        compiled_model = teleprompter.compile(
            model,
            trainset=train_examples,
            valset=eval_examples[:20],  # Use a subset for validation during compilation
        )

        # Final evaluation on the full eval set
        logger.info("📊 Running final evaluation...")
        evaluator = dspy.Evaluate(
            devset=eval_examples,
            metric=extraction_metric,
            num_threads=4,
            display_progress=True,
        )
        final_score = evaluator(compiled_model)

        # Save results
        save_path = f"results/optimized_{strategy_name}.json"
        compiled_model.save(save_path)

        # Update central results tracking
        update_results_summary(
            strategy_name=strategy_name,
            score=final_score,
            trace_url=trace.get_trace_url(),
            optimized_path=save_path,
        )

    logger.info(f"✅ Optimization complete. F1-Score: {final_score:.3f}")
    return compiled_model, {"overall": final_score}
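The update_results_summary call above writes to the central results file but is not shown in the excerpt. A minimal sketch, assuming it merges one experiment's outcome into the results/results_summary.json file from the project tree (field names mirror the demo data embedded in app_cloud.py; the real helper may record more metadata):

# Hypothetical sketch of update_results_summary; assumed behavior.
import json
from datetime import datetime
from pathlib import Path

def update_results_summary(strategy_name: str, score: float, trace_url: str, optimized_path: str) -> None:
    """Merge one experiment's result into results/results_summary.json."""
    summary_path = Path("results/results_summary.json")
    summary = json.loads(summary_path.read_text()) if summary_path.exists() else {}
    summary[strategy_name] = {
        "final_score": round(score, 2),
        "trace_url": trace_url,
        "optimized_path": optimized_path,
        "timestamp": datetime.now().isoformat(timespec="seconds"),
    }
    summary_path.write_text(json.dumps(summary, indent=2))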
Meta-Optimization: Advanced Prompting Techniques
Phase 2 explores sophisticated meta-optimization approaches that, surprisingly, failed to improve performance:
# From src/_05_meta_optimizers.py
from abc import ABC, abstractmethod
from typing import Type, Dict, Any

class MetaOptimizer(ABC):
    """Abstract base class for meta-optimization techniques."""

    @abstractmethod
    def enhance_signature(self, base_signature: Type[dspy.Signature]) -> Type[dspy.Signature]:
        """Apply meta-optimization enhancement to a DSPy signature."""
        pass

class DomainExpertiseEnhancement(MetaOptimizer):
    """Inject automotive domain expertise into prompts."""

    def enhance_signature(self, base_signature: Type[dspy.Signature]) -> Type[dspy.Signature]:
        enhanced_docstring = f"""
        {base_signature.__doc__}

        AUTOMOTIVE DOMAIN EXPERTISE INJECTION:
        - Major vehicle manufacturers: Toyota, Honda, Ford, Chevrolet, BMW, Mercedes, Tesla, Nissan, Hyundai, Volkswagen
        - Common model patterns: Camry, Accord, F-150, Silverado, Model 3, 3 Series, C-Class, Altima, Elantra, Jetta
        - Model years typically range from 1990-2025 for complaint data
        - Manufacturer abbreviations: Chevy=Chevrolet, Benz=Mercedes-Benz, VW=Volkswagen
        - Trim levels (LX, EX, Limited, Sport) are NOT the model name
        - Watch for model variants: "Model 3 Performance" → Model="Model 3"

        Apply this automotive domain knowledge during extraction to improve accuracy.
        """

        # Create enhanced signature class dynamically
        class EnhancedSignature(base_signature):
            __doc__ = enhanced_docstring

        return EnhancedSignature

class FormatEnforcementEnhancement(MetaOptimizer):
    """Enforce strict output formatting requirements."""

    def enhance_signature(self, base_signature: Type[dspy.Signature]) -> Type[dspy.Signature]:
        enhanced_docstring = f"""
        {base_signature.__doc__}

        CRITICAL FORMAT ENFORCEMENT REQUIREMENTS:
        - You MUST respond with a valid JSON object following the exact schema
        - No additional text, explanations, commentary, or reasoning outside the JSON
        - Follow this precise format: {{"make": "...", "model": "...", "year": ...}}
        - If uncertain about any field, use "UNKNOWN" for make/model, null for year
        - Do not include markdown formatting, code blocks, or extra whitespace
        - Validate JSON structure before responding

        RESPOND ONLY WITH THE JSON OBJECT. NO OTHER TEXT ALLOWED.
        """

        class EnhancedSignature(base_signature):
            __doc__ = enhanced_docstring

        return EnhancedSignature

class ConstitutionalEnhancement(MetaOptimizer):
    """Apply constitutional AI principles for multi-faceted reasoning."""

    def enhance_signature(self, base_signature: Type[dspy.Signature]) -> Type[dspy.Signature]:
        enhanced_docstring = f"""
        {base_signature.__doc__}

        CONSTITUTIONAL REASONING FRAMEWORK:
        Apply these constitutional principles in order:
        1. ACCURACY PRINCIPLE: Extract only information explicitly stated in the text
        2. SPECIFICITY PRINCIPLE: Prefer specific vehicle identifiers over generic terms
        3. CONSISTENCY PRINCIPLE: Ensure extracted year matches make/model era compatibility
        4. EVIDENCE PRINCIPLE: Base extractions on clear textual evidence
        5. HUMILITY PRINCIPLE: Use "UNKNOWN" when information is ambiguous or absent

        For each extraction, validate against ALL constitutional principles before finalizing.
        """

        class EnhancedSignature(base_signature):
            __doc__ = enhanced_docstring

        return EnhancedSignature

# Meta-optimizer registry for systematic testing. SpecificityEnhancement,
# ErrorPreventionEnhancement, and ContextAnchoringEnhancement follow the same
# pattern and are omitted here for brevity.
META_OPTIMIZERS = {
    "domain_expertise": DomainExpertiseEnhancement,
    "specificity": SpecificityEnhancement,
    "error_prevention": ErrorPreventionEnhancement,
    "context_anchoring": ContextAnchoringEnhancement,
    "format_enforcement": FormatEnforcementEnhancement,
    "constitutional": ConstitutionalEnhancement,
}
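How a registry entry is wired into the pipeline is implied rather than shown in these excerpts. A hedged usage sketch follows; swapping the enhanced signature into ExtractionModule.predictor is an assumption about how _06_run_meta_optimization.py composes the pieces:

# Hypothetical usage sketch: enhance the reasoning signature with one
# registered meta-optimizer, then rebuild the predictor around it.
optimizer = META_OPTIMIZERS["domain_expertise"]()
EnhancedSignature = optimizer.enhance_signature(VehicleExtractionWithReasoning)

module = ExtractionModule(strategy=ContrastiveCoTStrategy(), include_reasoning=True)
module.predictor = dspy.ChainOfThought(EnhancedSignature)  # assumed wiring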
Experimental Results: The Meta-Optimization Paradox
The results revealed a fascinating paradox that challenges conventional wisdom about prompt optimization:
Phase 1: Universal Reasoning Field Success
| Strategy | Without Reasoning | With Reasoning | Improvement | Business Impact |
|---|---|---|---|---|
| Contrastive CoT | 42.67% | 51.33% | +8.66% | 🏆 20% error reduction |
| Naive | 42.67% | 46.67% | +4.0% | ✅ 9% error reduction |
| Chain-of-Thought | 42.67% | 46.0% | +3.33% | ✅ 8% error reduction |
| Plan & Solve | 42.67% | 46.0% | +3.33% | ✅ 8% error reduction |
| Self-Refine | 43.33% | 45.33% | +2.0% | ✅ 5% error reduction |
Key Discovery: 100% of strategies improved with reasoning fields - this represents a universal optimization principle.
Phase 2: Meta-Optimization Performance Regression
# Critical conflict example discovered in analysis
# Contrastive CoT Strategy demands:
"Provide your reasoning showing how you applied good reasoning principles..."
# Format Enforcement Meta-Optimizer demands:
"You MUST respond with ONLY a JSON object... No additional text or commentary"
# Result: Direct contradiction causing 24% performance drop (51.33% → 27.33%)
| Meta-Optimized Strategy | F1-Score | vs Baseline | Status | Root Cause |
|---|---|---|---|---|
| Contrastive CoT + Domain Expertise | 49.33% | -2.0% | ❌ Regression | Prompt complexity |
| Contrastive CoT + Format Enforcement | 27.33% | -24% | ❌ Catastrophic | Instruction conflict |
| Contrastive CoT + Constitutional | 46.0% | -5.33% | ❌ Regression | Cognitive overload |
| Contrastive CoT + Error Prevention | 46.67% | -4.66% | ❌ Regression | Competing objectives |
Critical Insight: Meta-optimization creates instruction conflicts that degrade performance, establishing reasoning fields as the optimization ceiling.
Deployment Architecture: Multi-Environment Support
Local Development Environment
# Complete local setup with GPU acceleration
git clone https://github.com/Adredes-weslee/dspy-automotive-extractor.git
cd dspy-automotive-extractor
# Install dependencies with UV package manager
pip install uv
python -m uv venv .venv
.\.venv\Scripts\Activate.ps1
# Install PyTorch with CUDA support
.\.venv\Scripts\python.exe -m pip install torch==2.7.0+cu126 torchvision==0.22.0+cu126 torchaudio==2.7.0+cu126 --extra-index-url https://download.pytorch.org/whl/cu126
# Install project dependencies
python -m uv pip install -e .
# Download Ollama models
ollama pull gemma3:12b # High-performance (8GB+ VRAM)
ollama pull qwen3:4b # CPU-friendly alternative
# Configure environment
copy .env.template .env
# Edit .env with your Langfuse credentials
# Verify setup
.\.venv\Scripts\python.exe src/verify_gpu.py
Experimental Pipeline Execution
# Phase 1: Reasoning Field Experiments
.\.venv\Scripts\python.exe src/_04_run_optimization.py naive_without_reasoning
.\.venv\Scripts\python.exe src/_04_run_optimization.py naive_with_reasoning
.\.venv\Scripts\python.exe src/_04_run_optimization.py contrastive_cot_with_reasoning
# Phase 2: Meta-Optimization Experiments
.\.venv\Scripts\python.exe src/_06_run_meta_optimization.py meta
.\.venv\Scripts\python.exe src/_06_run_meta_optimization.py single --strategy contrastive_cot_domain_expertise
# Launch interactive dashboard
.\.venv\Scripts\python.exe -m streamlit run src/app.py
Cloud Deployment: Streamlit Community Cloud
# From src/app_cloud.py - Zero-dependency cloud deployment
def load_summary_data() -> Dict[str, Any]:
    """Load experimental results with demo-data fallback for cloud deployment."""
    summary_path = Path("results/results_summary.json")
    if summary_path.exists():
        with open(summary_path, "r") as f:
            return json.load(f)
    else:
        # Embedded demo data for Streamlit Community Cloud
        return {
            "naive_without_reasoning": {"final_score": 42.67, "timestamp": "2025-06-30T08:00:00"},
            "naive_with_reasoning": {"final_score": 46.67, "timestamp": "2025-06-30T08:15:00"},
            "contrastive_cot_without_reasoning": {"final_score": 42.67, "timestamp": "2025-06-30T08:30:00"},
            "contrastive_cot_with_reasoning": {"final_score": 51.33, "timestamp": "2025-06-30T08:45:00"},
            "contrastive_cot_domain_expertise_bootstrap": {
                "final_score": 49.33,
                "strategy_type": "meta_optimized",
                "timestamp": "2025-06-30T09:00:00",
            },
            "contrastive_cot_format_enforcement_bootstrap": {
                "final_score": 27.33,
                "strategy_type": "meta_optimized",
                "timestamp": "2025-06-30T09:15:00",
            },
        }
def create_performance_visualization(results_data):
    """Create interactive performance comparison charts with Plotly."""
    df = pd.DataFrame(results_data)

    # Strategy type classification with enhanced logic
    def classify_strategy(strategy_name):
        if "meta_optimized" in strategy_name or strategy_name.endswith("_bootstrap"):
            return "Meta-Optimized"
        elif "with_reasoning" in strategy_name:
            return "Baseline (+ Reasoning)"
        elif "without_reasoning" in strategy_name:
            return "Baseline (- Reasoning)"
        else:
            return "Baseline"

    df["Strategy_Type"] = df["Strategy"].apply(classify_strategy)

    # Interactive Plotly visualization with dynamic sizing
    fig = px.bar(
        df.sort_values("F1_Score", ascending=True),
        x="F1_Score",
        y="Strategy",
        color="Strategy_Type",
        title="DSPy Optimization Results: Reasoning Fields vs Meta-Optimization",
        height=max(500, len(df) * 30),  # Dynamic height based on data
        color_discrete_map={
            "Baseline (- Reasoning)": "#87CEEB",
            "Baseline (+ Reasoning)": "#1f77b4",
            "Baseline": "#1f77b4",
            "Meta-Optimized": "#ff7f0e",
        },
    )
    fig.update_layout(
        xaxis_title="F1-Score (%)",
        yaxis_title="Strategy",
        margin=dict(l=200),  # Space for strategy names
        showlegend=True,
    )
    return fig
# Deploy to Streamlit Community Cloud
def main():
    st.set_page_config(page_title="DSPy Automotive Extractor", layout="wide")
    st.title("🚗 DSPy Automotive Extractor Dashboard")
    st.markdown("*Cloud version - Comprehensive optimization analysis with embedded demo data*")

    # Load data with cloud fallback
    summary_data = load_summary_data()

    # Create interactive dashboard
    tab1, tab2, tab3 = st.tabs(["📈 Results Analysis", "🧠 Experimental Insights", "🌐 Cloud Demo"])

    with tab1:
        display_enhanced_results_tab(summary_data)
    with tab2:
        display_analysis_tab(summary_data)
    with tab3:
        display_cloud_demo_tab()
System Diagnostics: Production Readiness
The project includes comprehensive diagnostic capabilities:
# From src/verify_gpu.py
def comprehensive_system_check():
    """Complete system validation for production deployment."""
    print("🔍 DSPy AUTOMOTIVE EXTRACTOR - SYSTEM DIAGNOSTICS")
    print("=" * 60)

    # PyTorch CUDA verification
    check_pytorch_cuda()
    # Ollama connectivity test
    check_ollama_connection()
    # DSPy inference pipeline test
    test_dspy_inference()
    # Memory and performance validation
    check_system_resources()
    # Data pipeline validation
    validate_data_pipeline()

def check_pytorch_cuda():
    """Comprehensive PyTorch CUDA verification."""
    print("\n🔍 PYTORCH CUDA VERIFICATION")
    print("-" * 30)
    try:
        import torch

        if torch.cuda.is_available():
            print(f"✅ CUDA Available: {torch.version.cuda}")
            print(f"✅ GPU Count: {torch.cuda.device_count()}")
            for i in range(torch.cuda.device_count()):
                gpu_props = torch.cuda.get_device_properties(i)
                memory_gb = gpu_props.total_memory / (1024**3)
                print(f"   GPU {i}: {gpu_props.name} ({memory_gb:.1f} GB)")

            # Performance test
            device = torch.device("cuda:0")
            start_time = time.time()
            x = torch.randn(1000, 1000, device=device)
            y = torch.randn(1000, 1000, device=device)
            z = torch.mm(x, y)
            torch.cuda.synchronize()
            elapsed = time.time() - start_time
            print(f"✅ GPU Matrix Multiplication: {elapsed:.3f}s")
        else:
            print("❌ CUDA not available - will use CPU inference")
            print("⚠️ Performance will be significantly slower")
    except ImportError:
        print("❌ PyTorch not installed")

def check_ollama_connection():
    """Test Ollama service connectivity and model availability."""
    print("\n🔍 OLLAMA SERVICE VERIFICATION")
    print("-" * 30)
    try:
        import requests

        # Check Ollama service
        response = requests.get("http://localhost:11434/api/tags", timeout=5)
        if response.status_code == 200:
            models = response.json().get("models", [])
            print("✅ Ollama service running")
            print(f"✅ Available models: {len(models)}")

            # Check for required models
            model_names = [model["name"] for model in models]
            required_models = ["gemma3:12b", "qwen3:4b"]
            for model in required_models:
                if any(model in name for name in model_names):
                    print(f"✅ Model available: {model}")
                else:
                    print(f"⚠️ Model missing: {model}")
        else:
            print("❌ Ollama service not responding")
    except Exception as e:
        print(f"❌ Ollama connection failed: {e}")
        print("💡 Ensure Ollama is installed and running: ollama serve")

def test_dspy_inference():
    """Test the complete DSPy inference pipeline."""
    print("\n🔍 DSPY INFERENCE VERIFICATION")
    print("-" * 30)
    try:
        import dspy
        from src._02_define_schema import VehicleExtraction
        from src._03_define_program import ExtractionModule

        # Configure DSPy
        model_name = os.getenv("OLLAMA_MODEL", "gemma3:12b")
        llm = dspy.LM(model=f"ollama/{model_name}")
        dspy.settings.configure(lm=llm)

        # Test inference
        program = ExtractionModule()
        test_narrative = "I own a 2022 Tesla Model Y with brake issues"
        start_time = time.time()
        result = program(narrative=test_narrative)
        elapsed = time.time() - start_time

        print("✅ DSPy inference successful")
        print(f"✅ Response time: {elapsed:.2f}s")
        print(f"✅ Extracted: {result.vehicle_info}")
    except Exception as e:
        print(f"❌ DSPy inference failed: {e}")
Key Technical Discoveries
1. The Reasoning Field Universal Law
Technical Finding: Adding explicit reasoning output fields improves performance across ALL baseline strategies without exception.
Implementation:
# Standard signature
class VehicleExtraction(dspy.Signature):
    narrative: str = dspy.InputField()
    vehicle_info: VehicleInfo = dspy.OutputField()

# Reasoning-enhanced signature
class VehicleExtractionWithReasoning(dspy.Signature):
    narrative: str = dspy.InputField()
    reasoning: str = dspy.OutputField(desc="Step-by-step extraction reasoning")
    vehicle_info: VehicleInfo = dspy.OutputField()
Result: Universal +4.26% average improvement, with Contrastive CoT achieving +8.66%.
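(For reference, the +4.26% figure is the arithmetic mean of the five per-strategy gains reported in Phase 1: (8.66 + 4.0 + 3.33 + 3.33 + 2.0) / 5 = 4.26.)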
2. The Meta-Optimization Paradox
Technical Finding: Advanced prompt engineering techniques consistently failed to improve DSPy-optimized baselines.
Root Cause Analysis:
# Instruction conflict example
base_strategy = "Provide reasoning showing your analysis..."
meta_optimizer = "Respond ONLY with JSON, no explanations..."
# Result: Contradictory requirements → 24% performance drop
3. The Framework Alignment Principle
Technical Finding: DSPy-native optimization outperforms external prompt engineering techniques.
Implication: Framework compatibility is more valuable than prompt sophistication.
4. The Performance Ceiling Effect
Technical Finding: Complex optimization approaches hit performance ceilings that simpler methods exceed.
Evidence: Meta-optimization peak (49.33%) < Reasoning field peak (51.33%)
Performance Optimization: Production Considerations
Hardware Requirements and Scaling
| Component | Minimum | Recommended | Enterprise |
|---|---|---|---|
| RAM | 8GB | 16GB | 32GB+ |
| GPU VRAM | None (CPU) | 8GB | 16GB+ |
| Storage | 50GB | 100GB | 500GB+ |
| CPU Cores | 4 | 8 | 16+ |
Runtime Performance Benchmarks
| Strategy Type | GPU Runtime | CPU Runtime | Throughput |
|---|---|---|---|
| Baseline (- Reasoning) | 5-10 min | 20-30 min | 500 complaints/hour |
| Baseline (+ Reasoning) | 10-15 min | 30-45 min | 350 complaints/hour |
| Meta-Optimized | 15-25 min | 45-60 min | 250 complaints/hour |
Deployment Scaling Strategies
# Horizontal scaling with multiprocessing
import gc
from concurrent.futures import ProcessPoolExecutor, as_completed
from typing import List

import torch
from tqdm import tqdm

def parallel_optimization(strategies: List[str], max_workers: int = 4):
    """Run multiple optimization experiments in parallel."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(run_optimization_experiment, strategy): strategy
            for strategy in strategies
        }
        results = {}
        for future in tqdm(as_completed(futures), total=len(futures)):
            strategy = futures[future]
            try:
                model, scores = future.result()
                results[strategy] = scores
                logger.info(f"✅ {strategy}: {scores['overall']:.3f}")
            except Exception as e:
                logger.error(f"❌ {strategy} failed: {e}")
                results[strategy] = {"overall": 0.0, "error": str(e)}
    return results

# Memory optimization for large datasets
def batch_evaluation(model, examples: List[dspy.Example], batch_size: int = 50):
    """Evaluate the model in batches to manage memory usage."""
    total_score = 0.0
    total_examples = len(examples)

    for i in tqdm(range(0, total_examples, batch_size), desc="Batch evaluation"):
        batch = examples[i:i + batch_size]
        batch_scores = []
        for example in batch:
            try:
                prediction = model(narrative=example.narrative)
                score = extraction_metric(example, prediction)
                batch_scores.append(score)
            except Exception as e:
                logger.warning(f"Evaluation failed for example {i}: {e}")
                batch_scores.append(0.0)
        total_score += sum(batch_scores)

        # Memory cleanup after each batch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

    return total_score / total_examples
Future Research Directions: Expanding the Framework
Immediate Technical Extensions
- Multi-Domain Validation: Test reasoning field principles across medical, legal, and financial extraction tasks
- Advanced Metrics: Implement semantic similarity scoring beyond exact string matching
- Real-Time Processing: Stream processing capabilities for continuous complaint monitoring
- Multi-Modal Integration: Extend framework to process images, PDFs, and technical diagrams
Long-Term Research Opportunities
# Future research directions with technical foundations
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticSimilarityMetric:
    """Enhanced evaluation using semantic similarity instead of exact matching."""

    def __init__(self, similarity_threshold: float = 0.8):
        self.threshold = similarity_threshold
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')

    def calculate_similarity(self, pred: str, gold: str) -> float:
        """Calculate semantic similarity between predicted and gold values."""
        if pred == gold:
            return 1.0
        pred_embedding = self.encoder.encode([pred])
        gold_embedding = self.encoder.encode([gold])
        similarity = cosine_similarity(pred_embedding, gold_embedding)[0][0]
        return max(0.0, similarity)

class MultiModalExtractionModule(dspy.Module):
    """Future extension for multi-modal document processing."""

    def __init__(self, include_vision: bool = False):
        super().__init__()
        self.include_vision = include_vision
        if include_vision:
            # VehicleExtractionMultiModal is a future signature, not yet defined
            self.signature = VehicleExtractionMultiModal
        else:
            self.signature = VehicleExtraction
        self.predictor = dspy.ChainOfThought(self.signature)

    def forward(self, narrative: str, image_path: str = None) -> dspy.Prediction:
        """Extract from text and optionally images."""
        # Implementation for future multi-modal capabilities
        pass

class StreamingExtractionPipeline:
    """Real-time complaint processing pipeline."""

    def __init__(self, model: ExtractionModule, batch_size: int = 10):
        self.model = model
        self.batch_size = batch_size
        self.buffer = []

    async def process_stream(self, complaint_stream):
        """Process complaints in real-time batches."""
        async for complaint in complaint_stream:
            self.buffer.append(complaint)
            if len(self.buffer) >= self.batch_size:
                results = await self.process_batch(self.buffer)
                yield results
                self.buffer = []
Conclusion: Transforming Prompt Optimization Methodology
This project fundamentally challenges how we approach prompt optimization, providing the first rigorous scientific validation that reasoning fields plus DSPy-aligned optimization form the sweet spot, while layering meta-optimization onto already-optimized baselines degrades performance.
Technical Achievements Summary:
- Systematic Validation: First rigorous comparison of reasoning fields vs meta-optimization with 26 strategies tested
- Production Framework: Complete pipeline from data loading to cloud deployment with observability
- Reproducible Science: Quantitative methodology that eliminates subjective prompt engineering
- Framework Principles: Established DSPy-specific optimization principles that prioritize architectural alignment
Methodological Impact:
The discovery that DSPy’s framework alignment trumps prompt engineering sophistication represents a paradigm shift from creativity-driven to systematic, framework-aware optimization approaches. This has profound implications for:
- Enterprise AI Development: Systematic optimization reduces development cycles
- Research Methodology: Establishes quantitative foundations for prompt optimization research
- Framework Design: Informs future development of LLM optimization frameworks
- Best Practices: Provides evidence-based guidelines for structured extraction tasks
Code Quality and Architecture:
The implementation demonstrates production-ready practices including:
- Comprehensive error handling and logging
- Modular architecture with clear separation of concerns
- Extensive documentation following Google-style conventions
- Multi-environment deployment (local, cloud, enterprise)
- Performance optimization and resource management
- Systematic testing and validation frameworks
This research provides both theoretical insights and practical tools for building robust, high-performance structured extraction systems that prioritize framework compatibility over prompt complexity, establishing a new standard for systematic prompt optimization methodology.
To explore the complete DSPy Automotive Extractor platform, including its overall architecture, optimization methodology, and usage instructions, please refer to the DSPy Automotive Extractor: Systematic Prompt Optimization for Enterprise AI Project Page. The full codebase for the framework and the optimization techniques discussed herein is available on GitHub.