ML to Production: Engineering Guide for Production ML Systems
Straight talk: The biggest ML productivity myth is that training an accurate model is the hard part. Getting that model to production, keeping it accurate, and running it reliably at scale — that is where most ML projects fail.
A widely cited figure: only 12% of ML models ever make it to production. The failure modes are predictable and preventable.
The Production Gap
The gap between a model that performs well in a Jupyter notebook and a reliable system serving predictions 24/7 is a cluster of engineering challenges:
Data pipeline brittleness: A model trained on a clean CSV fails when the upstream database schema changes or ETL produces unexpected nulls.
Infrastructure mismatch: The model is trained on a MacBook; production runs in a Linux container on Kubernetes. Libraries differ, hardware differs.
Feature drift: The real world changes. A credit risk model trained on pre-pandemic data degrades on post-pandemic data.
Ownership vacuum: Data science owns training. Engineering owns infrastructure. Nobody owns the intersection.
Missing operational tooling: No monitoring, no alerting, no rollback plan.
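Data pipeline brittleness is usually the first of these to bite, and the cheapest to defend against: validate inputs at the boundary, before they reach the model. A minimal sketch (the schema and field names below are illustrative, not from any particular system):

```python
# Minimal input-validation sketch (hypothetical schema; adapt to your pipeline).
# Rejects rows with missing fields or out-of-range values at the boundary,
# instead of failing deep inside inference when ETL produces unexpected nulls.

REQUIRED_FIELDS = {"age": (0, 120), "income": (0, 10_000_000)}

def validate_row(row: dict) -> list:
    """Return a list of validation errors; an empty list means the row is clean."""
    errors = []
    for field, (lo, hi) in REQUIRED_FIELDS.items():
        value = row.get(field)
        if value is None:
            errors.append(f"{field}: missing or null")
        elif not (lo <= value <= hi):
            errors.append(f"{field}: {value} outside [{lo}, {hi}]")
    return errors
```

In practice this kind of check belongs in the pipeline itself (or a schema tool such as Great Expectations), so that a schema change upstream fails loudly at ingestion rather than silently at prediction time.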
What "Production-Ready" Actually Means
Performance SLAs
- Latency: Real-time (sync) or batch? p95 latency budget?
- Throughput: Peak requests per second?
- Availability: 99.9% uptime?
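An SLA is only useful if it is checked continuously. A pure-Python sketch of a nearest-rank percentile check against a p95 budget (in production this comes from your metrics backend, not raw lists):

```python
def percentile(samples, p: float) -> float:
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def meets_sla(latencies_ms, p95_budget_ms: float) -> bool:
    """True if the observed p95 latency is within the agreed budget."""
    return percentile(latencies_ms, 95) <= p95_budget_ms
```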
Reliability Requirements
- Graceful degradation when the model service is unavailable
- Data validation for out-of-distribution inputs
- Idempotency guarantees
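Graceful degradation usually means wrapping the model call so that a failure returns a conservative default instead of erroring out the whole request. A sketch (the default score and names are illustrative assumptions):

```python
# Graceful-degradation sketch (names illustrative): if the model call fails,
# fall back to a conservative default rather than failing the request.

DEFAULT_SCORE = 0.5  # assumption: a neutral score is an acceptable fallback

def predict_with_fallback(model_call, features: dict) -> dict:
    try:
        score = model_call(features)
        return {"score": score, "degraded": False}
    except Exception:
        # In production: log the failure, emit a metric, and alert if the
        # fallback rate stays elevated.
        return {"score": DEFAULT_SCORE, "degraded": True}
```

The `degraded` flag matters: downstream consumers and dashboards need to distinguish real predictions from fallbacks, or the fallback silently becomes the product.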
Operational Requirements
- Full observability: logs, metrics, traces
- Rollback capability in <5 minutes
- Audit trail for regulated domains
Deployment Architecture Patterns
Pattern 1: REST/gRPC Endpoint
Real-time sync inference. Docker container + load balancer + Kubernetes.
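The request-handling core of such an endpoint is framework-agnostic; in practice it would be wrapped by FastAPI, Flask, or a gRPC servicer. A minimal sketch (payload shape and names are assumptions):

```python
import json

def handle_predict(body: bytes, model):
    """Parse, validate, predict. Returns (HTTP status, JSON response body)."""
    try:
        payload = json.loads(body)
        features = payload["features"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return 400, json.dumps({"error": "expected {'features': {...}}"})
    score = model(features)
    return 200, json.dumps({"score": score})
```

Keeping this core free of framework imports makes it trivially unit-testable, which is where latency and validation tests in CI/CD hook in.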
Pattern 2: Batch Pipeline
For workloads that tolerate latencies of minutes or more. Typically 10–50× cheaper than real-time serving.
Pattern 3: Streaming Inference
Near-real-time inference on continuous event streams. Adds operational complexity (stream processing infrastructure, state management).
Pattern 4: Multi-Model Ensemble
Multiple models served side by side, for A/B testing or ensemble predictions.
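The routing layer for A/B testing is usually a deterministic hash split, so a given user always hits the same model variant. A sketch (function and bucket scheme are illustrative):

```python
import hashlib

def assign_variant(user_id: str, challenger_fraction: float = 0.05) -> str:
    """Deterministic traffic split: the same user always gets the same model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "challenger" if bucket < challenger_fraction * 10_000 else "champion"
```

Hashing on a stable ID (rather than random assignment per request) keeps user experience consistent and makes the experiment's treatment groups reproducible.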
Model Monitoring
Monitor:
- Model metrics: Accuracy, precision, recall (with ground truth)
- Input distributions: Feature drift vs. training baseline
- Prediction distributions: Output score shifts
- System health: Latency p50/p95/p99, error rate, cost
Data Drift vs Concept Drift
Data drift: Distribution of inputs changes. Model may still be correct.
Concept drift: Underlying relationship changes. Requires retraining.
Label drift: Distribution of outcomes changes. Decision threshold needs adjustment.
Tooling options: Evidently AI (open source), WhyLabs, Arize, Fiddler.
MLOps Pipeline
Automated journey from code commit to monitored production:
- Pull validated training data
- Validate data quality
- Train model
- Evaluate vs test set & champion
- Register in model registry
- Deploy to staging → run tests
- Canary deployment (5% traffic)
- Promote to full production if metrics satisfied
- Decommission previous version
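The canary-to-promotion step above reduces to a gate on the challenger's metrics during the canary window. A sketch of such a gate (thresholds and metric names are illustrative assumptions):

```python
# Canary promotion sketch: promote only if the challenger holds up on both
# quality and latency at canary traffic (e.g. 5%). Thresholds illustrative.

def should_promote(champion: dict, challenger: dict,
                   max_accuracy_drop: float = 0.005,
                   max_latency_ratio: float = 1.10) -> bool:
    """Decide promotion from metrics gathered during the canary window."""
    accuracy_ok = challenger["accuracy"] >= champion["accuracy"] - max_accuracy_drop
    latency_ok = challenger["p95_ms"] <= champion["p95_ms"] * max_latency_ratio
    error_ok = challenger["error_rate"] <= champion["error_rate"]
    return accuracy_ok and latency_ok and error_ok
```

Encoding the gate as code (rather than a human eyeballing dashboards) is what makes the pipeline genuinely automated, and it doubles as the rollback trigger when a canary fails.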
Common Pitfalls
| Pitfall | Prevention |
|---|---|
| Training/serving skew | Feature store, single code path |
| No rollback | Model registry + blue-green deployment |
| Monitoring theatre | Actionable alerts to PagerDuty/Slack |
| Data leakage | Temporal validation splits |
| Ignoring latency | Latency testing in CI/CD |
Practical Progression
Stage 1 (0–3 models): MLflow tracking + Docker + GitHub Actions + manual deploy + Prometheus monitoring.
Stage 2 (3–10 models): Add model registry + automated retraining + drift detection.
Stage 3 (10+ models): Feature store + A/B testing framework + champion-challenger deployment.
Build when you need it, not before.
