Problem: Cloud AI Is Too Slow and Too Expensive for Edge Devices
You have a Raspberry Pi. You want to run image classification, keyword detection, or anomaly detection in real time. Sending every frame to a cloud API adds 200–800ms latency, burns money, and breaks when the internet goes down.
TinyML solves this. You run a quantized model directly on the Pi — inference in under 50ms, zero network calls.
You'll learn:
- How to install TensorFlow Lite runtime on Raspberry Pi OS (64-bit)
- How to run a quantized image classification model at 20+ FPS
- How to convert your own PyTorch or Keras model to .tflite for edge deployment
Time: 30 min | Difficulty: Intermediate
Why Cloud AI Fails at the Edge
Sending sensor data or camera frames to an external API creates three hard problems:
Latency. A round-trip to a cloud API averages 300ms. For real-time detection — people counting, gesture control, anomaly alerting — that's unusable.
Cost at scale. 10 frames/sec × 60 sec × 60 min × 24 hours = 864,000 API calls per day per device. At $0.001/call that's $864/day for one Pi.
Offline dependency. Industrial sensors, agricultural monitors, and home automation devices often run in low-connectivity environments. Cloud AI fails silently when the connection drops.
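The cost arithmetic above is worth sanity-checking against your own frame rate and pricing. A quick back-of-envelope helper (the $0.001/call figure is an illustrative assumption, not any provider's quoted rate):

```python
# Back-of-envelope cloud cost for continuous per-frame inference.
# price_per_call is a placeholder — substitute your provider's actual rate.
def daily_cloud_cost(fps: float, price_per_call: float) -> tuple[int, float]:
    calls_per_day = int(fps * 60 * 60 * 24)
    return calls_per_day, calls_per_day * price_per_call

calls, cost = daily_cloud_cost(fps=10, price_per_call=0.001)
print(f"{calls:,} calls/day -> ${cost:,.2f}/day per device")
# 864,000 calls/day -> $864.00/day per device
```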
TinyML + TensorFlow Lite moves inference to the device itself. The model runs on the Pi's ARM CPU — the Pi 5 has no built-in NPU, though the optional Raspberry Pi AI Kit adds a Hailo accelerator — with no outbound network requirement.
What You Need
- Raspberry Pi 4 (4GB) or Pi 5 (recommended) running Raspberry Pi OS 64-bit (Bookworm)
- Python 3.11+
- A USB camera or Pi Camera Module 3 (for camera examples)
- ~500MB free disk space
Check your OS and Python version before starting:
uname -m # Should output: aarch64
python3 --version # Should be 3.11 or 3.12
Solution
Step 1: Install TensorFlow Lite Runtime
The full TensorFlow package is 600MB+ and unnecessary for inference. Install only the lightweight runtime:
# Install system dependencies first
sudo apt update && sudo apt install -y \
libatlas-base-dev \
libjpeg-dev \
libopenjp2-7
# Install tflite-runtime — much smaller than full TensorFlow (~15MB)
pip3 install tflite-runtime --break-system-packages
Verify the install:
python3 -c "import tflite_runtime.interpreter as tflite; print('TFLite OK')"
Expected output: TFLite OK
If it fails:
- ERROR: Could not find a version that satisfies the requirement → add --extra-index-url https://google-coral.github.io/py-repo/ to the pip command
- ImportError: libatlas → run sudo apt install libatlas-base-dev and retry
Step 2: Download a Pre-Quantized Model
Start with MobileNetV2 quantized for INT8 — it's 3.5MB, runs fast, and classifies 1,000 ImageNet categories.
mkdir -p ~/tinyml && cd ~/tinyml
# Download INT8 quantized MobileNetV2
wget https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v2_1.0_224_quant.tflite
# Download matching labels
wget https://storage.googleapis.com/download.tensorflow.org/models/tflite/labels_mobilenet_quant_v1_224.txt
INT8 quantization reduces the model from fp32 (~14MB) to ~3.5MB with less than 1% accuracy loss on ImageNet. On Raspberry Pi 5, INT8 runs 3–4x faster than fp32 because the CPU's NEON SIMD unit accelerates 8-bit operations natively.
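The scale/zero_point dequantization used later in the inference script follows the standard affine quantization scheme: real ≈ (q − zero_point) × scale. A minimal sketch of both directions (the scale and values here are made up for illustration):

```python
import numpy as np

def quantize(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # real -> uint8: round to the nearest grid point, clamp to the uint8 range
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # uint8 -> real: the inverse mapping, lossy by at most scale/2 per element
    return (q.astype(np.float32) - zero_point) * scale

scale, zero_point = 1 / 256, 0  # illustrative values for a [0, 1) output range
x = np.array([0.012, 0.5, 0.87], dtype=np.float32)
roundtrip = dequantize(quantize(x, scale, zero_point), scale, zero_point)
print(np.max(np.abs(roundtrip - x)))  # bounded by scale/2 ≈ 0.00195
```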
Step 3: Run Image Classification
Create the inference script:
# classify.py
import numpy as np
import time
from PIL import Image
import tflite_runtime.interpreter as tflite
MODEL_PATH = "mobilenet_v2_1.0_224_quant.tflite"
LABELS_PATH = "labels_mobilenet_quant_v1_224.txt"
IMAGE_PATH = "test.jpg" # replace with your image
# Load labels
with open(LABELS_PATH) as f:
labels = [line.strip() for line in f.readlines()]
# Load model — use num_threads=4 to utilize all Pi cores
interpreter = tflite.Interpreter(model_path=MODEL_PATH, num_threads=4)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Preprocess: resize to 224x224, convert to uint8
img = Image.open(IMAGE_PATH).resize((224, 224))
input_data = np.expand_dims(np.array(img, dtype=np.uint8), axis=0)
# Run inference and measure latency
interpreter.set_tensor(input_details[0]['index'], input_data)
start = time.perf_counter()
interpreter.invoke()
elapsed_ms = (time.perf_counter() - start) * 1000
output = interpreter.get_tensor(output_details[0]['index'])[0]
# INT8 output is dequantized: scale and zero_point from output_details
scale, zero_point = output_details[0]['quantization']
scores = (output.astype(np.float32) - zero_point) * scale
top5 = np.argsort(scores)[::-1][:5]
print(f"Inference: {elapsed_ms:.1f}ms")
for i in top5:
print(f" {scores[i]:.3f} {labels[i]}")
Run it with a test image:
# Download a test image
wget -O test.jpg https://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Cute_dog.jpg/320px-Cute_dog.jpg
python3 classify.py
Expected output:
Inference: 38.2ms
0.812 golden retriever
0.071 Labrador retriever
0.034 kuvasz
0.021 Great Pyrenees
0.009 Sussex spaniel
38ms inference = ~26 FPS theoretical throughput on Pi 5 with num_threads=4.
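Single-shot latency numbers like this bounce around from frame to frame, so for a live FPS readout it helps to smooth them. A minimal exponential-moving-average helper (the 0.9 decay is an arbitrary choice, not anything TFLite prescribes):

```python
class LatencySmoother:
    """Exponential moving average over per-frame latencies (milliseconds)."""
    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.avg_ms = None

    def update(self, latency_ms: float) -> float:
        # First sample seeds the average; later samples blend in at (1 - decay)
        if self.avg_ms is None:
            self.avg_ms = latency_ms
        else:
            self.avg_ms = self.decay * self.avg_ms + (1 - self.decay) * latency_ms
        return self.avg_ms

    @property
    def fps(self) -> float:
        return 1000.0 / self.avg_ms if self.avg_ms else 0.0

s = LatencySmoother()
for ms in (40.0, 36.0, 38.0, 37.0):
    s.update(ms)
print(f"{s.avg_ms:.1f} ms -> {s.fps:.1f} FPS")
```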
Step 4: Run Live Camera Inference
For real-time inference from a USB camera or Pi Camera Module 3:
# camera_classify.py
import cv2
import numpy as np
import time
import tflite_runtime.interpreter as tflite
MODEL_PATH = "mobilenet_v2_1.0_224_quant.tflite"
LABELS_PATH = "labels_mobilenet_quant_v1_224.txt"
with open(LABELS_PATH) as f:
labels = [line.strip() for line in f.readlines()]
interpreter = tflite.Interpreter(model_path=MODEL_PATH, num_threads=4)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
scale, zero_point = output_details[0]['quantization']
cap = cv2.VideoCapture(0) # 0 = first USB camera; use /dev/video0 explicitly if needed
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
while True:
ret, frame = cap.read()
if not ret:
break
# Resize and convert BGR→RGB for the model
rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
resized = cv2.resize(rgb, (224, 224))
input_data = np.expand_dims(resized.astype(np.uint8), axis=0)
start = time.perf_counter()
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
fps = 1.0 / (time.perf_counter() - start)
output = interpreter.get_tensor(output_details[0]['index'])[0]
scores = (output.astype(np.float32) - zero_point) * scale
top_idx = np.argmax(scores)
label_text = f"{labels[top_idx]} ({scores[top_idx]:.2f}) | {fps:.1f} FPS"
cv2.putText(frame, label_text, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
cv2.imshow("TinyML Edge Inference", frame)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cap.release()
cv2.destroyAllWindows()
# Full OpenCV build required — the script uses cv2.imshow, which the headless package omits
pip3 install opencv-python --break-system-packages
python3 camera_classify.py
On Pi 5, expect 20–25 FPS at 224×224 model input. The camera capture itself is the bottleneck at higher resolutions.
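When capture is the bottleneck, a common fix is to decouple it from inference: a background thread keeps only the newest frame, so the model never works through a stale backlog. A stdlib-only sketch of that pattern (the string frames here stand in for real cap.read() output):

```python
import threading

class LatestFrame:
    """Thread-safe holder that keeps only the most recent frame."""
    def __init__(self):
        self._lock = threading.Lock()
        self._frame = None

    def put(self, frame):
        with self._lock:
            self._frame = frame  # overwrite — older frames are simply dropped

    def get(self):
        with self._lock:
            return self._frame

holder = LatestFrame()

def capture_loop(n_frames: int):
    # Stand-in for the camera loop: in the real script this would call cap.read()
    for i in range(n_frames):
        holder.put(f"frame-{i}")

t = threading.Thread(target=capture_loop, args=(100,))
t.start()
t.join()
print(holder.get())  # the inference loop only ever sees the newest frame
# prints "frame-99"
```

In the camera script above, the while loop would then call holder.get() instead of cap.read(), running inference at its own pace while capture runs at the camera's.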
Step 5: Convert Your Own Model to TFLite
If you trained a custom model in PyTorch or Keras, convert it to .tflite with INT8 quantization. Run this on your development machine (not the Pi):
# convert_to_tflite.py — run on your dev machine, not the Pi
import tensorflow as tf
import numpy as np
# --- Option A: Convert a Keras model ---
# model = tf.keras.models.load_model("my_model.h5")
# converter = tf.lite.TFLiteConverter.from_keras_model(model)
# --- Option B: Convert from a SavedModel directory ---
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model/")
# Enable INT8 quantization — requires representative dataset
converter.optimizations = [tf.lite.Optimize.DEFAULT]
def representative_dataset():
# Feed ~100 samples representative of your real input distribution
for _ in range(100):
sample = np.random.rand(1, 224, 224, 3).astype(np.float32)
yield [sample]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Force input/output to uint8 so no fp32 conversion happens at runtime
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()
with open("my_model_quant.tflite", "wb") as f:
f.write(tflite_model)
print(f"Model size: {len(tflite_model) / 1024:.1f} KB")
Copy the resulting .tflite file to the Pi and use it in the inference scripts above.
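Before copying the file over, you can sanity-check that the output really is a TFLite flatbuffer: the format's file identifier, the ASCII bytes TFL3, sits at byte offset 4. A minimal check:

```python
def looks_like_tflite(path_or_bytes) -> bool:
    """Check for the FlatBuffer file identifier 'TFL3' at byte offset 4."""
    if isinstance(path_or_bytes, (bytes, bytearray)):
        data = bytes(path_or_bytes)
    else:
        with open(path_or_bytes, "rb") as f:
            data = f.read(8)
    return len(data) >= 8 and data[4:8] == b"TFL3"

# Works on raw bytes too, e.g. the converter's return value before writing it out:
print(looks_like_tflite(b"\x1c\x00\x00\x00TFL3"))  # True
```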
If conversion fails with Some ops are not supported: Add tf.lite.OpsSet.SELECT_TF_OPS to supported_ops. This increases model size but ensures compatibility.
Verification
Run the benchmark script to measure your Pi's actual throughput:
# Official TFLite benchmark tool — gives per-layer timing too
wget https://storage.googleapis.com/download.tensorflow.org/models/tflite/linux/aarch64/benchmark_model
chmod +x benchmark_model
./benchmark_model \
--graph=mobilenet_v2_1.0_224_quant.tflite \
--num_threads=4 \
--num_runs=50
You should see on Pi 5:
INFO: Inference timings in us: Init: 12345, First inference: 38210, Warmup: 36500, Inference: 37800
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE.
INFO: Memory footprint delta from the start of the tool (RSS: 28.00 MB, VmPeak: 52.00 MB)
Target numbers by device:
| Device | MobileNetV2 INT8 | Notes |
|---|---|---|
| Pi 5 (4 threads) | ~38ms / 26 FPS | Best option for new builds |
| Pi 4 (4 threads) | ~65ms / 15 FPS | Sufficient for most use cases |
| Pi Zero 2W | ~320ms / 3 FPS | Single-shot detection only |
What You Learned
- TFLite runtime is the right tool for Pi inference — full TensorFlow is not needed
- INT8 quantization cuts model size 4x and speeds inference 3x with minimal accuracy cost
- num_threads=4 is the key flag for Pi 4/5 — the default of 1 thread leaves 75% of CPU unused
- The bottleneck on live video is usually camera capture resolution, not model inference
When NOT to use this approach: If your model exceeds ~10MB INT8, or if you need transformer-based NLP (BERT, LLaMA), TFLite on Pi CPU becomes impractical. For those cases, look at llama.cpp with GGUF quantization or run inference on a more powerful device and stream results to the Pi.
Next step: For object detection instead of classification, swap in ssd_mobilenet_v2_coco_quant.tflite — same pipeline, different model, outputs bounding boxes instead of class scores.
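Unlike the classification model, the SSD detection models emit several output tensors (for the COCO-quantized SSDs, typically boxes, class indices, scores, and a detection count). A hedged sketch of the one extra post-processing step you'll need — filtering detections by score (the arrays here are made-up examples, not real model output):

```python
import numpy as np

def filter_detections(boxes, classes, scores, threshold=0.5):
    """Keep only detections whose confidence score clears the threshold."""
    keep = scores >= threshold
    return boxes[keep], classes[keep].astype(int), scores[keep]

# Made-up arrays shaped like typical SSD outputs: [ymin, xmin, ymax, xmax] per box,
# normalized to [0, 1]; class indices are emitted as floats by the model
boxes = np.array([[0.1, 0.2, 0.5, 0.6], [0.0, 0.0, 0.9, 0.9]])
classes = np.array([16.0, 2.0])
scores = np.array([0.82, 0.31])

b, c, s = filter_detections(boxes, classes, scores)
print(c, s)  # only the 0.82-confidence detection survives
```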
Tested on Raspberry Pi 5 (8GB), Raspberry Pi OS Bookworm 64-bit, TFLite Runtime 2.16, Python 3.11