2024-06-10 7 min read

A/B Testing ML Models in Production Safely

Running A/B tests on ML models doesn't mean exposing users to poor predictions. Learn practical strategies to validate improvements without risk.

You've trained a new ML model. It performs better on your test set. But deploying it to 50% of users means 50% might get worse predictions—and you won't know until it's too late. The tension between moving fast and protecting user experience is real, and it's why many teams either skip validation entirely or move so cautiously they never ship improvements.

The solution isn't to avoid A/B testing. It's to test smarter.

Shadow Deployment: Validate Before Exposing Users

Shadow deployment runs your new model alongside production without using its predictions. Users always see results from the current model, while the new model runs silently in parallel, collecting metrics.

How It Works

Your new model processes the same input data as production, generates predictions, and logs them for analysis. You can then compare offline metrics—accuracy, latency, resource usage—before any user sees its output.

python
import logging

class ShadowModelWrapper:
    def __init__(self, production_model, shadow_model):
        self.prod = production_model
        self.shadow = shadow_model
        self.logger = logging.getLogger("shadow_metrics")
    
    def predict(self, features):
        # Always return production prediction
        prod_prediction = self.prod.predict(features)
        
        # Run shadow model silently; in a real system, consider moving this
        # off the request path so shadow latency never affects users
        try:
            shadow_prediction = self.shadow.predict(features)
            self.logger.info(f"shadow_pred={shadow_prediction}, prod_pred={prod_prediction}")
        except Exception as e:
            self.logger.error(f"Shadow model failed: {e}")
        
        return prod_prediction

This approach gives you real production data without real production risk. You can run it for hours or days, accumulating enough data to make a confident decision.
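Once the shadow logs accumulate, you need a way to compare the two models offline. Here is a minimal sketch of a log analyzer, assuming the `shadow_pred=..., prod_pred=...` line format emitted by the wrapper above; the function name and return shape are illustrative, not part of any library.

```python
import re

def summarize_shadow_log(lines):
    """Parse shadow-metrics log lines and report how often the models agree.

    Assumes the format logged by ShadowModelWrapper:
    "shadow_pred=X, prod_pred=Y".
    """
    pattern = re.compile(r"shadow_pred=(\S+), prod_pred=(\S+)")
    total = agree = 0
    for line in lines:
        match = pattern.search(line)
        if not match:
            continue  # skip unrelated log lines
        total += 1
        if match.group(1) == match.group(2):
            agree += 1
    return {
        "total": total,
        "agreement_rate": agree / total if total else 0.0,
    }
```

A low agreement rate is not automatically bad (the new model may be fixing mistakes), but a large, unexplained divergence is a signal to investigate before any live traffic sees the new model.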

Canary Releases: Gradual Rollout with Real Metrics

Once shadow deployment looks good, use a canary release to route a small percentage of live traffic to the new model. Start at 1–5%, monitor real user impact, then gradually increase.

Implementation Pattern

python
import hashlib

class CanaryRouter:
    def __init__(self, prod_model, canary_model, canary_percentage=5):
        self.prod = prod_model
        self.canary = canary_model
        self.canary_pct = canary_percentage

    def _bucket(self, user_id):
        # Python's built-in hash() is salted per process, so it would assign
        # users to different buckets on different servers or after restarts.
        # Use a deterministic hash for stable, consistent bucketing.
        digest = hashlib.sha256(str(user_id).encode()).hexdigest()
        return int(digest, 16) % 100

    def predict(self, user_id, features):
        if self._bucket(user_id) < self.canary_pct:
            return self.canary.predict(features)
        return self.prod.predict(features)

Key principle: route by user ID, not randomly per request. This ensures consistent experience—users always see the same model version.

Monitor four things during a canary:

  • Latency: Does the new model respond faster or slower?
  • Error rate: Does it fail more often?
  • Business metrics: Click-through rate, conversion, retention. Does the new model actually improve what matters?
  • User feedback: Watch for support tickets.

If any metric degrades, roll back immediately. If metrics improve, increase the canary percentage incrementally.
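That rollback decision can be automated. Below is a minimal sketch of a threshold check, assuming canary and production metrics have already been aggregated into dicts; the threshold values and metric names are illustrative and should match whatever your monitoring system reports.

```python
def should_rollback(canary_metrics, prod_metrics,
                    max_latency_regression_ms=50,
                    max_error_rate_increase=0.02):
    """Return True if the canary breaches pre-set thresholds vs production.

    Expects dicts with "p95_latency_ms" and "error_rate" keys; thresholds
    are examples, not recommendations for every workload.
    """
    latency_delta = canary_metrics["p95_latency_ms"] - prod_metrics["p95_latency_ms"]
    if latency_delta > max_latency_regression_ms:
        return True
    error_delta = canary_metrics["error_rate"] - prod_metrics["error_rate"]
    if error_delta > max_error_rate_increase:
        return True
    return False
```

The key design point is that the thresholds are set before launch: an automated check with pre-agreed limits removes the temptation to rationalize a degraded metric mid-rollout.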

Monitoring That Actually Catches Problems

Canaries only work if you know what to measure. Set thresholds before launch.

yaml
# Example: alert if latency p95 exceeds 200ms
alert_latency_p95: 200ms

# Alert if error rate jumps by 2%
alert_error_rate_increase: 2%

# Alert if conversion rate drops by 1%
alert_conversion_drop: 1%

At LavaPi, we've seen teams spend more time building canary infrastructure than the actual model. That's backwards. Use existing monitoring tools—Datadog, Prometheus, CloudWatch—and add A/B test awareness to them. Track which model served each prediction so you can attribute metrics correctly.
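Attributing metrics per model version can be as simple as tagging each sample with the version that served it. Here is a minimal in-memory sketch; in practice you would emit these as tagged metrics to Datadog, Prometheus, or CloudWatch rather than hold them in a Python object.

```python
from collections import defaultdict

class TaggedMetrics:
    """Track latency samples per model version so canary and prod compare cleanly."""

    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, model_version, latency_ms):
        # Tag every sample with the model that served the prediction
        self.samples[model_version].append(latency_ms)

    def p95(self, model_version):
        # Simple nearest-rank p95; monitoring backends do this for you
        data = sorted(self.samples[model_version])
        if not data:
            return None
        idx = min(len(data) - 1, int(0.95 * len(data)))
        return data[idx]
```

Without this per-version tagging, a canary at 5% traffic can hide a serious regression inside the blended aggregate, which is exactly the failure mode a canary is supposed to catch.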

The Complete Flow

  1. Train new model
  2. Shadow deploy for 24–48 hours
  3. Review offline metrics
  4. Start canary at 1%
  5. Monitor for 4–8 hours
  6. Increase to 10%, then 50%, then 100%
  7. Keep old model in ready state for instant rollback
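The ramp in steps 4-7 can be sketched as a simple gate: advance one stage only while metrics stay healthy, and drop straight back to the old model otherwise. The stage percentages and function below are illustrative, not a prescribed schedule.

```python
ROLLOUT_STAGES = [1, 10, 50, 100]  # canary percentage at each stage

def next_stage(current_pct, metrics_healthy):
    """Advance to the next rollout stage if metrics are healthy, else roll back.

    Returning 0 means all traffic goes back to the old model, which is why
    step 7 keeps it warm and ready to serve.
    """
    if not metrics_healthy:
        return 0  # instant rollback
    for pct in ROLLOUT_STAGES:
        if pct > current_pct:
            return pct
    return current_pct  # already fully rolled out
```

In a real system this function would be driven by the monitoring checks from the previous section, with a mandatory soak period at each stage before advancing.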

This process takes time but protects your users and your reputation. A bad model reaching 50% of users at once will cause far more damage than the few days it takes to validate properly.

The truth: A/B testing ML safely isn't complex. It's methodical. Plan for rollback, start small, and let data drive scale. Your users—and your metrics—will thank you.


LavaPi Team

Digital Engineering Company
