Model Versioning in Production: Lessons from Breaking Things in the Dark
Deploy with confidence. We break down how to version ML models in production, why it matters, and the exact mistakes that taught us the hard way.
At 3 AM on a Tuesday, a recommendation model silently degraded. No errors. No alerts. Just predictions that drifted slowly wrong over the course of six hours. By the time anyone noticed, the damage was done—and we had no way to roll back because nobody had documented which model version was running.
That was our wake-up call. Model versioning isn't a nice-to-have in production. It's the difference between a graceful recovery and a complete system failure. Here's what we learned.
The Cost of Not Versioning
Without versioning, production breaks silently. A new model trains faster than expected and deploys automatically. Training data shifts slightly. Dependencies change. A month later, performance has degraded 8%, but you can't trace why because you don't know which version is running, when it was deployed, or what changed.
The real cost isn't the technical debt—it's the blind spot. You can't debug what you can't see.
What to Version
Version three things: the model artifact, the training data snapshot, and the preprocessing logic.
```python
# Model metadata structure
model_manifest = {
    "version": "2024.01.15.prod",
    "model_hash": "sha256:a3f2d1...",
    "training_date": "2024-01-15T09:22:00Z",
    "data_snapshot_id": "snapshot_v847",
    "preprocessing_version": "1.2.3",
    "framework": "tensorflow:2.14",
    "metrics": {
        "accuracy": 0.9847,
        "f1_score": 0.9156,
        "auc_roc": 0.9923
    },
    "deployed_by": "ml-pipeline",
    "rollback_target": "2024.01.10.prod"
}
```
Store this alongside your model. When something breaks, you have context.
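For illustration, here is a minimal sketch of writing that manifest next to the artifact. The helper name `write_manifest` is hypothetical; the hashing step shows one way to fill the `model_hash` field so the manifest is verifiably tied to the file it describes:

```python
import hashlib
import json
from pathlib import Path

def write_manifest(model_path: str, manifest: dict) -> Path:
    """Hypothetical helper: hash the model artifact and store the
    manifest as manifest.json in the same directory."""
    model_file = Path(model_path)
    digest = hashlib.sha256(model_file.read_bytes()).hexdigest()
    manifest["model_hash"] = f"sha256:{digest}"
    manifest_file = model_file.parent / "manifest.json"
    manifest_file.write_text(json.dumps(manifest, indent=2))
    return manifest_file
```

Because the hash is computed from the artifact itself, a mismatch between the manifest and the file on disk is detectable at load time.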
Implementation: The Boring Way That Works
Use a Date-Based Versioning Scheme
Adopt a date-based schema: `YYYY.MM.DD.stage`.

```bash
# Model storage structure
ml-models/
├── recommendation-v1/
│   ├── 2024.01.15.prod/
│   │   ├── model.h5
│   │   ├── manifest.json
│   │   └── scaler.pkl
│   ├── 2024.01.10.prod/
│   └── 2024.01.12.staging/
```
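As a sketch, two helpers for building and parsing these version strings (`make_version` and `parse_version` are hypothetical names, not part of any library). A side benefit of zero-padded dates: they sort lexicographically in chronological order, so "latest version" is a plain `sorted()` over directory names:

```python
from datetime import date

def make_version(stage, on=None):
    """Build a YYYY.MM.DD.stage version string.
    `on` defaults to today's date."""
    d = on or date.today()
    return f"{d:%Y.%m.%d}.{stage}"

def parse_version(version):
    """Split a version string back into (date, stage)."""
    year, month, day, stage = version.split(".", 3)
    return date(int(year), int(month), int(day)), stage
```

Sorting works because `"2024.01.10..." < "2024.01.15..."` holds as plain string comparison.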
Track Deployments Explicitly
Don't rely on implicit versioning. Record every deployment with context.
```python
import os
from datetime import datetime, timezone

class ModelRegistry:
    def deploy(self, model_version, environment, metadata):
        # Record who deployed what, where, and when
        deployment_record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model_version": model_version,
            "environment": environment,
            "deployed_by": os.environ.get("CI_USER"),
            "commit_hash": os.environ.get("GIT_COMMIT"),
            "status": "active"
        }
        self.log_deployment(deployment_record)
        return deployment_record
```
Enable Fast Rollbacks
Your deployment must support instant rollback without ceremony. Maintain a pointer to the active version, not hardcoded model paths.
```typescript
// Load model configuration at runtime
const loadModel = async (environment: string) => {
  const config = await fetch(
    `/api/model-registry/${environment}/active`
  ).then(r => r.json());

  return tf.loadLayersModel(
    `file://${config.model_path}/model.json`
  );
};
```
When you need to roll back, change a single pointer. Done in seconds.
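To make the pointer swap concrete, here is a minimal in-memory sketch. The class name `ActivePointer` and its methods are illustrative; a real registry would persist both the pointer and the audit trail in a database:

```python
class ActivePointer:
    """Illustrative sketch: the registry keeps one 'active'
    pointer per environment, plus an audit trail of changes."""

    def __init__(self):
        self._active = {}   # environment -> model_version
        self._history = []  # (environment, previous, new) tuples

    def promote(self, environment, model_version):
        # Record the transition, then move the pointer
        self._history.append(
            (environment, self._active.get(environment), model_version)
        )
        self._active[environment] = model_version

    def rollback(self, environment):
        # Restore the most recent previous version for this environment
        for env, previous, _current in reversed(self._history):
            if env == environment and previous is not None:
                self.promote(environment, previous)
                return previous
        raise LookupError(f"no previous version for {environment}")

    def active(self, environment):
        return self._active[environment]
```

The serving layer only ever reads `active()`, so a rollback is visible on the next model load without redeploying anything.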
Monitoring: Know When Things Go Wrong
Versioning solves half the problem. Monitoring solves the other half. Track model performance metrics continuously.
```python
import time

# Log predictions for monitoring
# (`metrics` stands in for your metrics client, e.g. StatsD or Prometheus)
def log_prediction(version, prediction, confidence, ground_truth=None):
    metrics.record({
        "model_version": version,
        "prediction": prediction,
        "confidence": confidence,
        "ground_truth": ground_truth,
        "timestamp": time.time()
    })
```
Compare current version performance against your baseline hourly. Drift detection catches silent failures before they spread.
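That hourly comparison can be as simple as a windowed mean against the stored baseline. A sketch, with the understanding that the function name and the 2% default tolerance are illustrative choices, not values from our pipeline:

```python
from statistics import mean

def has_drifted(recent_accuracy, baseline, tolerance=0.02):
    """Return True if the recent window's mean accuracy has
    fallen more than `tolerance` below the baseline."""
    if not recent_accuracy:
        return False  # no data in the window: nothing to compare
    return baseline - mean(recent_accuracy) > tolerance
```

Tie the alert to the `model_version` in your prediction logs and you know not just *that* performance dropped, but *which* deployment to roll back.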
The Takeaway
Model versioning is not about perfection or elaborate systems. At LavaPi, we've found that simple, consistent versioning practices—combined with explicit deployment tracking and continuous monitoring—eliminate the worst class of production failures: the ones you don't notice until they've been wrong for hours.
Version your models. Track your deployments. Monitor continuously. That's the formula. Everything else is implementation detail.
LavaPi Team
Digital Engineering Company