2026-03-05 6 min read

Running ML Models on Microcontrollers: TinyML Essentials

Deploy trained ML models directly to resource-constrained devices. Learn the practical workflow for TinyML, from model quantization to firmware integration.

Your edge device has 256KB of RAM and a 48MHz processor. Your ML model is 2MB. This isn't a hypothetical problem—it's the reality for thousands of IoT deployments today. TinyML bridges this gap by fitting trained neural networks into microcontrollers, enabling intelligent inference without cloud connectivity or power-hungry processors.

Unlike traditional cloud-based ML, edge inference on constrained hardware means faster response times, offline operation, and reduced bandwidth costs. The tradeoff is straightforward: you accept lower accuracy and reduced model complexity in exchange for deployment feasibility.

Model Preparation and Quantization

Your first step is aggressive model reduction. Start with a trained model—typically trained on standard hardware—then apply quantization to shrink it dramatically.

Quantization Techniques

Post-training quantization converts 32-bit floating-point weights to 8-bit integers. This alone reduces model size by 75%.
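The underlying arithmetic is affine quantization: each float is mapped to an 8-bit integer via a scale and a zero point. A minimal NumPy illustration (not the converter's actual code):

```python
import numpy as np

# Map float32 weights onto the int8 grid with a scale and zero point
w = np.random.randn(1000).astype(np.float32)

scale = (w.max() - w.min()) / 255.0            # int8 spans 256 levels
zero_point = np.round(-128 - w.min() / scale)
q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)

# Dequantize to see the approximation error the conversion introduces
w_hat = (q.astype(np.float32) - zero_point) * scale

print(q.nbytes / w.nbytes)                 # 0.25 -> the 75% size reduction
print(np.abs(w - w_hat).max() <= scale)    # error bounded by one grid step
```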

python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('model_dir')
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Full integer quantization needs a representative dataset so the
# converter can calibrate activation ranges. `calibration_samples`
# stands in for a few hundred typical inputs from your training data.
def representative_dataset():
    for sample in calibration_samples:
        yield [sample]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
quantized_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(quantized_model)

For even better results, quantization-aware training (QAT) incorporates quantization during training, preserving accuracy better than post-training approaches.
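Conceptually, QAT wraps each layer in a "fake quantization" step: weights are snapped to the int8 grid in the forward pass, while the backward pass treats the step as identity (the straight-through estimator), so the network learns weights that survive rounding. A standalone sketch of that step, not any framework's implementation:

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Quantize-dequantize in one step, as QAT does in the forward pass.
    Real QAT lets gradients flow through this op unchanged, so the
    underlying float weights keep training."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = np.round(qmin - w.min() / scale)
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

w = np.linspace(-1.0, 1.0, 5)
print(fake_quantize(w))  # values snapped to the nearest int8 level
```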

Profiling and Architecture Selection

Before committing to a model, profile its resource requirements. A 500KB model might work on an STM32L476, but fail on an Arduino Uno, which has only 32KB of flash and 2KB of SRAM.

TensorFlow Lite Micro provides a size estimation tool:

bash
bazel run tensorflow/lite/micro/tools:model_size -- \
  --model_file=model.tflite

Architecture matters too. Depthwise separable convolutions use 8–9× fewer operations than standard convolutions. MobileNet and SqueezeNet are built for this constraint.
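The 8–9× figure is easy to verify. A standard K×K convolution costs H·W·K²·C_in·C_out multiplies, while the depthwise separable version splits this into a per-channel K×K pass plus a 1×1 pointwise pass. The layer dimensions below are illustrative:

```python
H = W = 32                    # feature map size (illustrative)
K, C_in, C_out = 3, 64, 256   # kernel size and channel counts (illustrative)

standard = H * W * K * K * C_in * C_out
# depthwise (per-channel KxK) plus pointwise (1x1 across channels)
separable = H * W * K * K * C_in + H * W * C_in * C_out

print(round(standard / separable, 1))  # 8.7
```

The ratio works out to K²·C_out / (K² + C_out), so the savings grow with the output channel count.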

Firmware Integration and Deployment

Once quantized, the model becomes a C++ byte array embedded directly in your firmware.
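A common way to generate that array, assuming the quantized file from earlier is named model.tflite, is xxd:

```shell
# Emit a C array plus a length variable, both named after the file
xxd -i model.tflite > model_data.cc
# Produces: unsigned char model_tflite[] = {...};
#           unsigned int model_tflite_len = ...;
```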

Using TensorFlow Lite Micro

TensorFlow Lite Micro is purpose-built for microcontrollers. It requires no OS, no dynamic memory allocation, and compiles to ~80KB of binary code.

cpp
#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"

// The byte array generated from model.tflite
extern const unsigned char model_tflite[];
extern const unsigned int model_tflite_len;

// All tensors live in this statically allocated arena
constexpr int kTensorArenaSize = 10 * 1024;
static uint8_t tensor_arena[kTensorArenaSize];

const tflite::Model* model = tflite::GetModel(model_tflite);
static tflite::AllOpsResolver resolver;
static tflite::MicroInterpreter interpreter(
    model, resolver, tensor_arena, kTensorArenaSize);

interpreter.AllocateTensors();

TfLiteTensor* input = interpreter.input(0);
input->data.f[0] = sensor_reading;  // use data.int8 for fully int8 models

interpreter.Invoke();

TfLiteTensor* output = interpreter.output(0);
float prediction = output->data.f[0];

Memory Management

Microcontrollers require static memory allocation. The tensor_arena is a pre-allocated buffer that holds all intermediate tensors during inference. Size it based on profiling: too small and inference fails; too large and you waste precious RAM.
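Under the hood the arena is essentially a fixed buffer with a bump pointer. A simplified standalone sketch, not TFLM's actual planner (which also reuses regions once tensors are no longer needed):

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t kArenaSize = 10 * 1024;
static uint8_t tensor_arena[kArenaSize];
static size_t arena_used = 0;

// Hand out 16-byte-aligned chunks of the static buffer. There is no
// free(): the arena persists for the lifetime of the interpreter.
void* ArenaAlloc(size_t size) {
  size = (size + 15) & ~static_cast<size_t>(15);       // round up to 16 bytes
  if (arena_used + size > kArenaSize) return nullptr;  // arena exhausted
  void* p = &tensor_arena[arena_used];
  arena_used += size;
  return p;
}
```

When an allocation returns nullptr, AllocateTensors() fails in the real runtime; newer TFLM versions also expose arena_used_bytes() on the interpreter, which helps right-size the buffer after a successful allocation.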

Real-World Considerations

Deployment isn't just about fitting code into Flash. Account for:

  • Latency: Inference on a 48MHz ARM Cortex-M4 typically takes 50–500ms depending on model complexity
  • Power consumption: Each inference cycle draws current; ultra-low-power designs often duty-cycle, putting the MCU to sleep between inferences
  • Debugging: Use serial output and LED indicators for firmware validation—traditional debuggers often conflict with real-time sensor sampling
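Those latency figures follow from simple arithmetic. A back-of-envelope estimate, where the cycles-per-MAC value is a rough assumption for optimized int8 kernels rather than a measurement:

```python
clock_hz = 48_000_000      # 48 MHz Cortex-M4
macs = 2_000_000           # multiply-accumulates per inference (example model)
cycles_per_mac = 2         # rough assumption for optimized int8 kernels

latency_ms = macs * cycles_per_mac / clock_hz * 1000
print(round(latency_ms))   # 83 -> within the 50-500 ms range above
```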

At LavaPi, we've deployed TinyML models for vibration monitoring, environmental sensing, and anomaly detection on IoT gateways. The pattern is consistent: aggressive quantization, careful architecture selection, and thorough profiling.

The Takeaway

TinyML isn't magic—it's disciplined engineering. Your model must fit within strict resource budgets, and accuracy drops with aggressive quantization. But when implemented correctly, edge inference on microcontrollers delivers responsive, offline-capable systems that traditional cloud architectures simply can't match. Start with quantization, validate on hardware early, and iterate on your tensor arena size. The constraints are real, but the possibilities are worth the engineering effort.


LavaPi Team

Digital Engineering Company