2024-11-24 5 min read

Experiment Tracking with MLflow: What to Log and What to Ignore

MLflow is powerful, but logging everything wastes storage and obscures signal. Learn what metrics, parameters, and artifacts actually matter for your ML experiments.

You've trained a model. Now you need to know: did it work? More importantly, can you reproduce it next month? This is where experiment tracking becomes critical—and where teams often go wrong by logging indiscriminately.

MLflow makes it easy to track experiments, but easy doesn't mean thoughtless. Logging every variable, every intermediate output, and every debug statement creates noise that makes it harder to compare runs, slower to retrieve results, and more expensive to store. The real skill is knowing what belongs in your experiment record and what should stay in your code.

What You Should Always Log

Model Performance Metrics

This is non-negotiable. Log metrics that directly answer: "Does this model work?" For classification, that's accuracy, precision, recall, F1-score, and AUC. For regression, MAE, RMSE, and R² are standard. Log both training and validation metrics—the gap between them tells you about overfitting.

```python
from mlflow import log_metric

log_metric("train_loss", train_loss, step=epoch)
log_metric("val_accuracy", val_accuracy, step=epoch)
log_metric("val_loss", val_loss, step=epoch)
```

Be specific with naming. Use prefixes like `train_`, `val_`, and `test_` to make comparisons across runs instant and unambiguous.
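One lightweight way to keep prefixes consistent is a small helper that applies them for you. This is a sketch, not part of MLflow: `log_split_metrics` is a hypothetical helper, and in real use you would pass `mlflow.log_metric` as the logger.

```python
def log_split_metrics(split, metrics, log_fn, step=None):
    """Log each metric under a split-prefixed name, e.g. val_loss.

    `log_fn` is your logging callable -- typically mlflow.log_metric.
    """
    for name, value in metrics.items():
        log_fn(f"{split}_{name}", value, step=step)

# With MLflow you would pass mlflow.log_metric; here we collect the
# generated names to show the prefixing behaviour.
logged = []
log_split_metrics(
    "val",
    {"loss": 0.52, "accuracy": 0.83},
    log_fn=lambda name, value, step=None: logged.append(name),
)
# logged is now ["val_loss", "val_accuracy"]
```

The same metric names (`loss`, `accuracy`) are reused across splits, so run comparisons line up column-for-column in the MLflow UI.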

Hyperparameters That Matter

Log every hyperparameter you tuned or changed. Learning rate, batch size, regularization strength, model architecture choices—these are your experiment's DNA. When you find a good result three months later, you need to know exactly what settings created it.

```python
from mlflow import log_param

log_param("learning_rate", 0.001)
log_param("batch_size", 32)
log_param("dropout_rate", 0.2)
log_param("optimizer", "adam")
```

Skip logging constants that never change. If you always use the same random seed across all experiments, it doesn't need to be logged—document it in your code instead.
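One way to keep the "skip constants" rule honest is to separate tuned settings from fixed ones in code and only log the former. The two-dict split below is a convention we're assuming for illustration, not an MLflow feature:

```python
# Settings you actively vary between runs -- these get logged.
TUNED = {
    "learning_rate": 0.001,
    "batch_size": 32,
    "dropout_rate": 0.2,
    "optimizer": "adam",
}

# Settings fixed for the whole project -- document in code, don't log.
FIXED = {
    "random_seed": 42,
    "num_workers": 4,
}

def params_to_log():
    """Return only the parameters worth recording in the experiment."""
    return dict(TUNED)
```

You would pass `params_to_log()` to `mlflow.log_params`, which logs a whole dict in one call instead of repeated `log_param` calls.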

What You Should Selectively Log

Feature Importance and Model Artifacts

Model weights and feature importance plots add insight, but they multiply storage costs. Log them if they inform your next steps. If you're exploring 50 random forest variants to find a baseline, you probably don't need the feature importance for every single run. If you're narrowing down to your final three candidates, absolutely log them—you'll need to explain model behavior to stakeholders.

```python
from mlflow import log_artifact
import matplotlib.pyplot as plt

# Only log if it changes your decision
if is_final_candidate:
    plt.figure()
    plt.barh(feature_names, importances)
    plt.savefig("/tmp/feature_importance.png")
    log_artifact("/tmp/feature_importance.png")
```

Data Summaries

Log statistics about your dataset—record count, class distribution, missing value rates—but only once per dataset. You don't need to repeat it across 100 runs using the same training data. This becomes essential when multiple team members run experiments; data validation up front saves debugging later.
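A summary like this can be computed once per dataset and attached to the first run that uses it. The sketch below builds the summary in plain Python; the field names are our own choice, not a standard schema:

```python
from collections import Counter

def summarize_dataset(labels, missing_rates, name):
    """Build a one-time dataset summary: size, class balance, missingness."""
    return {
        "dataset": name,
        "n_records": len(labels),
        "class_distribution": dict(Counter(labels)),
        "missing_rates": missing_rates,
    }

summary = summarize_dataset(
    labels=["spam", "ham", "ham", "ham"],
    missing_rates={"subject": 0.01},
    name="emails_v3",
)
# With MLflow: mlflow.log_dict(summary, "dataset_summary.json"),
# once per dataset version rather than once per run.
```

Tagging runs with the dataset name (here `emails_v3`) then lets anyone trace a run back to the single summary for that data version.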

What You Should Ignore

Training-Step Diagnostics

Logging loss or accuracy at every step is tempting but usually wasteful. Log at epoch intervals instead. If you need fine-grained diagnostics, save them locally during development, then remove them before running your final experiments.

```python
# Too much: a metric point at every training step
for step in range(10000):
    log_metric("loss", current_loss, step=step)

# Better: sample at a coarser interval
for step in range(10000):
    if step % 100 == 0:
        log_metric("loss", current_loss, step=step)
```

Implementation Details

Don't log library versions, Python path information, or system configuration unless they're known sources of variance in your results. The goal is reproducibility of the model, not the exact computational environment—that's what Docker and conda files are for.

The Practical Balance

When we work with clients at LavaPi on production ML systems, the pattern is consistent: teams waste months trying to parse sprawling experiment logs, then overcorrect by logging almost nothing. The solution is discipline. Log what affects model behavior and interpretation. Ignore operational details.

Start lean. Add logging only when you find yourself thinking "I wish I'd tracked that." Your future self, reviewing experiments six months from now, will appreciate the clarity.

LavaPi Team

Digital Engineering Company
