Run TinyML on Raspberry Pi: Edge AI Without Cloud Dependencies

Deploy TensorFlow Lite and ONNX models on Raspberry Pi 5 for real-time edge inference. No cloud, no latency, no API costs — just local AI.

Problem: Cloud AI Is Too Slow and Too Expensive for Edge Devices

You have a Raspberry Pi. You want to run image classification, keyword detection, or anomaly detection in real time. Sending every frame to a cloud API adds 200–800ms latency, burns money, and breaks when the internet goes down.

TinyML solves this. You run a quantized model directly on the Pi — inference in under 50ms, zero network calls.

You'll learn:

  • How to install TensorFlow Lite runtime on Raspberry Pi OS (64-bit)
  • How to run a quantized image classification model at 20+ FPS
  • How to convert your own PyTorch or Keras model to .tflite for edge deployment

Time: 30 min | Difficulty: Intermediate


Why Cloud AI Fails at the Edge

Sending sensor data or camera frames to an external API creates three hard problems:

Latency. A round-trip to a cloud API averages 300ms. For real-time detection — people counting, gesture control, anomaly alerting — that's unusable.

Cost at scale. 10 frames/sec × 60 sec × 60 min × 24 hours = 864,000 API calls per day per device. At $0.001/call that's $864/day for one Pi.

Offline dependency. Industrial sensors, agricultural monitors, and home automation devices often run in low-connectivity environments. Cloud AI fails silently when the connection drops.
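
The cost arithmetic above can be sketched as a quick back-of-the-envelope calculator. The frame rate and per-call price are the example figures from this section, not real vendor pricing:

```python
# Back-of-the-envelope cloud cost for continuous frame uploads.
# Figures are the illustrative numbers from the text, not real pricing.
FRAMES_PER_SEC = 10
PRICE_PER_CALL = 0.001  # USD per API call (hypothetical)

calls_per_day = FRAMES_PER_SEC * 60 * 60 * 24
cost_per_day = calls_per_day * PRICE_PER_CALL

print(f"{calls_per_day:,} calls/day -> ${cost_per_day:,.0f}/day per device")
# -> 864,000 calls/day -> $864/day per device
```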

TinyML + TensorFlow Lite moves inference to the device itself. The model runs entirely on the Pi's ARM CPU (note the Pi 5 has no built-in NPU, though Raspberry Pi sells an optional AI Kit accelerator), with no outbound network requirement.


What You Need

  • Raspberry Pi 4 (4GB) or Pi 5 (recommended) running Raspberry Pi OS 64-bit (Bookworm)
  • Python 3.11+
  • A USB camera or Pi Camera Module 3 (for camera examples)
  • ~500MB free disk space

Check your OS and Python version before starting:

uname -m          # Should output: aarch64
python3 --version # Should be 3.11 or 3.12
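
If you prefer a single preflight check from Python, this small sketch covers both in one go (it only inspects the interpreter and platform; nothing here is Pi-specific):

```python
# preflight.py - sanity-check architecture and Python version before installing
import platform
import sys

arch = platform.machine()           # "aarch64" on 64-bit Raspberry Pi OS
version_ok = sys.version_info >= (3, 11)

print(f"arch={arch} python={sys.version_info.major}.{sys.version_info.minor}")
if arch != "aarch64":
    print("Warning: this tutorial assumes a 64-bit (aarch64) OS")
if not version_ok:
    print("Warning: Python 3.11+ expected")
```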

Solution

Step 1: Install TensorFlow Lite Runtime

The full TensorFlow package is 600MB+ and unnecessary for inference. Install only the lightweight runtime:

# Install system dependencies first
sudo apt update && sudo apt install -y \
  libatlas-base-dev \
  libjpeg-dev \
  libopenjp2-7

# Install tflite-runtime — much smaller than full TensorFlow (~15MB)
pip3 install tflite-runtime --break-system-packages

Verify the install:

python3 -c "import tflite_runtime.interpreter as tflite; print('TFLite OK')"

Expected output: TFLite OK

If it fails:

  • ERROR: Could not find a version that satisfies the requirement → Add --extra-index-url https://google-coral.github.io/py-repo/ to the pip command
  • ImportError: libatlas → Run sudo apt install libatlas-base-dev and retry

Step 2: Download a Pre-Quantized Model

Start with MobileNetV2 quantized for INT8 — it's 3.5MB, runs fast, and classifies 1,000 ImageNet categories.

mkdir -p ~/tinyml && cd ~/tinyml

# Download INT8 quantized MobileNetV2
wget https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v2_1.0_224_quant.tflite

# Download matching labels
wget https://storage.googleapis.com/download.tensorflow.org/models/tflite/labels_mobilenet_quant_v1_224.txt

INT8 quantization reduces the model from fp32 (~14MB) to ~3.5MB with less than 1% accuracy loss on ImageNet. On Raspberry Pi 5, INT8 runs 3–4x faster than fp32 because the CPU's NEON SIMD unit accelerates 8-bit operations natively.
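
The dequantization that the inference script in Step 3 performs can be seen in isolation. TFLite's affine quantization scheme maps a quantized value q back to a real value via real = (q - zero_point) * scale; the scale and zero_point below are made-up illustration values, in practice you read them from output_details:

```python
import numpy as np

# Hypothetical quantization parameters, as would be reported by
# output_details[0]['quantization'] (values here are made up).
scale, zero_point = 1 / 256, 0      # typical for a uint8 softmax output

q = np.array([208, 18, 9], dtype=np.uint8)          # raw quantized scores
real = (q.astype(np.float32) - zero_point) * scale  # dequantize

print([round(float(x), 4) for x in real])  # -> [0.8125, 0.0703, 0.0352]
```

Working in uint8 end-to-end is exactly why the INT8 model is fast: the expensive layers never touch floating point, and only this cheap final rescale does.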


Step 3: Run Image Classification

Create the inference script:

# classify.py
import numpy as np
import time
from PIL import Image
import tflite_runtime.interpreter as tflite

MODEL_PATH = "mobilenet_v2_1.0_224_quant.tflite"
LABELS_PATH = "labels_mobilenet_quant_v1_224.txt"
IMAGE_PATH = "test.jpg"  # replace with your image

# Load labels
with open(LABELS_PATH) as f:
    labels = [line.strip() for line in f.readlines()]

# Load model — use num_threads=4 to utilize all Pi cores
interpreter = tflite.Interpreter(model_path=MODEL_PATH, num_threads=4)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Preprocess: force RGB (handles grayscale/RGBA inputs), resize to 224x224
img = Image.open(IMAGE_PATH).convert("RGB").resize((224, 224))
input_data = np.expand_dims(np.array(img, dtype=np.uint8), axis=0)

# Run inference and measure latency
interpreter.set_tensor(input_details[0]['index'], input_data)
start = time.perf_counter()
interpreter.invoke()
elapsed_ms = (time.perf_counter() - start) * 1000

output = interpreter.get_tensor(output_details[0]['index'])[0]

# INT8 output is dequantized: scale and zero_point from output_details
scale, zero_point = output_details[0]['quantization']
scores = (output.astype(np.float32) - zero_point) * scale

top5 = np.argsort(scores)[::-1][:5]
print(f"Inference: {elapsed_ms:.1f}ms")
for i in top5:
    print(f"  {scores[i]:.3f}  {labels[i]}")

Run it with a test image:

# Download a test image
wget -O test.jpg https://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Cute_dog.jpg/320px-Cute_dog.jpg

python3 classify.py

Expected output:

Inference: 38.2ms
  0.812  golden retriever
  0.071  Labrador retriever
  0.034  kuvasz
  0.021  Great Pyrenees
  0.009  Sussex spaniel

38ms inference = ~26 FPS theoretical throughput on Pi 5 with num_threads=4.
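
Single-shot timings like this jitter run to run, so for a stable number it helps to wrap the invoke loop in a small benchmark helper. This is a generic sketch: the dummy workload stands in for interpreter.invoke, and the p95 index is an approximation; swap in your own callable:

```python
import time
import statistics

def benchmark(fn, warmup=5, runs=50):
    """Time fn() over `runs` iterations after `warmup` discarded calls.
    Returns (mean_ms, approx_p95_ms)."""
    for _ in range(warmup):
        fn()                         # warm caches / thread pools
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    p95 = samples[int(len(samples) * 0.95) - 1]  # approximate 95th percentile
    return statistics.mean(samples), p95

# Stand-in workload; on the Pi, use: lambda: interpreter.invoke()
mean_ms, p95_ms = benchmark(lambda: sum(range(10_000)))
print(f"mean={mean_ms:.3f}ms p95={p95_ms:.3f}ms")
```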


Step 4: Run Live Camera Inference

For real-time inference from a USB camera or Pi Camera Module 3:

# camera_classify.py
import cv2
import numpy as np
import time
import tflite_runtime.interpreter as tflite

MODEL_PATH = "mobilenet_v2_1.0_224_quant.tflite"
LABELS_PATH = "labels_mobilenet_quant_v1_224.txt"

with open(LABELS_PATH) as f:
    labels = [line.strip() for line in f.readlines()]

interpreter = tflite.Interpreter(model_path=MODEL_PATH, num_threads=4)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
scale, zero_point = output_details[0]['quantization']

cap = cv2.VideoCapture(0)  # 0 = first USB camera; use /dev/video0 explicitly if needed
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Resize and convert BGR→RGB for the model
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb, (224, 224))
    input_data = np.expand_dims(resized.astype(np.uint8), axis=0)

    start = time.perf_counter()
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    fps = 1.0 / (time.perf_counter() - start)

    output = interpreter.get_tensor(output_details[0]['index'])[0]
    scores = (output.astype(np.float32) - zero_point) * scale
    top_idx = np.argmax(scores)

    label_text = f"{labels[top_idx]} ({scores[top_idx]:.2f}) | {fps:.1f} FPS"
    cv2.putText(frame, label_text, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    cv2.imshow("TinyML Edge Inference", frame)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Install OpenCV and run the script. Note the script calls cv2.imshow, which needs the full opencv-python build; the headless variant omits GUI support:

pip3 install opencv-python --break-system-packages
python3 camera_classify.py

On Pi 5, expect 20–25 FPS at 224×224 model input. The camera capture itself is the bottleneck at higher resolutions.
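
The per-frame FPS printed in the loop above jitters noticeably frame to frame. A common fix is an exponential moving average; here is a small sketch you could drop into the loop (the smoothing factor 0.9 is an arbitrary choice):

```python
class FpsMeter:
    """Exponentially smoothed FPS counter for per-frame timings."""
    def __init__(self, alpha=0.9):
        self.alpha = alpha   # weight on the previous estimate (0..1)
        self.fps = None

    def update(self, frame_seconds):
        instant = 1.0 / frame_seconds
        if self.fps is None:
            self.fps = instant                     # first frame: no history
        else:
            self.fps = self.alpha * self.fps + (1 - self.alpha) * instant
        return self.fps

meter = FpsMeter()
for dt in [0.040, 0.050, 0.030, 0.045]:   # simulated per-frame times
    smoothed = meter.update(dt)
print(f"smoothed FPS: {smoothed:.1f}")
```

In the camera loop, replace the raw `fps = 1.0 / ...` with `fps = meter.update(time.perf_counter() - start)` to get a steady on-screen readout.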


Step 5: Convert Your Own Model to TFLite

If you trained a custom model in PyTorch or Keras, convert it to .tflite with INT8 quantization. Run this on your development machine (not the Pi):

# convert_to_tflite.py — run on your dev machine, not the Pi
import tensorflow as tf
import numpy as np

# --- Option A: Convert a Keras model ---
# model = tf.keras.models.load_model("my_model.h5")
# converter = tf.lite.TFLiteConverter.from_keras_model(model)

# --- Option B: Convert a SavedModel directory ---
# (the converter loads the directory itself; no separate load step needed)
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model/")

# Enable INT8 quantization — requires representative dataset
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
    # Feed ~100 samples representative of your real input distribution
    for _ in range(100):
        sample = np.random.rand(1, 224, 224, 3).astype(np.float32)
        yield [sample]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

# Force input/output to uint8 so no fp32 conversion happens at runtime
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model = converter.convert()

with open("my_model_quant.tflite", "wb") as f:
    f.write(tflite_model)

print(f"Model size: {len(tflite_model) / 1024:.1f} KB")

Copy the resulting .tflite file to the Pi and use it in the inference scripts above.

If conversion fails with Some ops are not supported: Add tf.lite.OpsSet.SELECT_TF_OPS to supported_ops. This increases model size and requires the TensorFlow Flex delegate at runtime, which ships in the full TensorFlow package but not in the lightweight tflite-runtime used on the Pi, so prefer restructuring the model to use builtin ops where possible.
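
One more conversion caveat: the random data in representative_dataset above calibrates the quantization ranges poorly, and real samples give noticeably better accuracy. A hedged sketch of a generator built from an in-memory array of real calibration images (the array shape and [0, 1] scaling are assumptions; match whatever preprocessing your model was trained with):

```python
import numpy as np

def make_representative_dataset(images, max_samples=100):
    """Yield single-sample float32 batches for TFLite calibration.

    `images` is assumed to be a numpy array of shape (N, 224, 224, 3)
    holding *real* inputs, preprocessed the same way as your training
    data (here assumed scaled to [0, 1]).
    """
    def gen():
        for sample in images[:max_samples]:
            yield [np.expand_dims(sample.astype(np.float32), axis=0)]
    return gen

# Usage with the converter from the script above:
# converter.representative_dataset = make_representative_dataset(calib_images)
```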


Verification

Run the benchmark script to measure your Pi's actual throughput:

# Official TFLite benchmark tool — gives per-layer timing too
wget https://storage.googleapis.com/download.tensorflow.org/models/tflite/linux/aarch64/benchmark_model

chmod +x benchmark_model
./benchmark_model \
  --graph=mobilenet_v2_1.0_224_quant.tflite \
  --num_threads=4 \
  --num_runs=50

You should see on Pi 5:

INFO: Inference timings in us: Init: 12345, First inference: 38210, Warmup: 36500, Inference: 37800
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE.
INFO: Memory footprint delta from the start of the tool (RSS: 28.00 MB, VmPeak: 52.00 MB)

Target numbers by device:

Device           | MobileNetV2 INT8 | Notes
-----------------|------------------|-------------------------------
Pi 5 (4 threads) | ~38ms / 26 FPS   | Best option for new builds
Pi 4 (4 threads) | ~65ms / 15 FPS   | Sufficient for most use cases
Pi Zero 2 W      | ~320ms / 3 FPS   | Single-shot detection only

What You Learned

  • TFLite runtime is the right tool for Pi inference — full TensorFlow is not needed
  • INT8 quantization cuts model size 4x and speeds inference 3x with minimal accuracy cost
  • num_threads=4 is the key flag for Pi 4/5 — the default of 1 thread leaves 75% of CPU unused
  • The bottleneck on live video is usually camera capture resolution, not model inference

When NOT to use this approach: If your model exceeds ~10MB INT8, or if you need transformer-based NLP (BERT, LLaMA), TFLite on Pi CPU becomes impractical. For those cases, look at llama.cpp with GGUF quantization or run inference on a more powerful device and stream results to the Pi.

Next step: For object detection instead of classification, swap in ssd_mobilenet_v2_coco_quant.tflite — same pipeline, different model, outputs bounding boxes instead of class scores.
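
A detection model's raw outputs need a little post-processing before drawing boxes. As a hedged sketch, the common TFLite SSD convention is four output tensors (normalized boxes, class ids, scores, detection count); verify that layout against your model's output_details before relying on it:

```python
import numpy as np

def parse_ssd_outputs(boxes, classes, scores, count,
                      threshold=0.5, frame_w=640, frame_h=480):
    """Convert the common TFLite SSD output layout into pixel boxes.

    boxes:   (N, 4) normalized [ymin, xmin, ymax, xmax]
    classes: (N,) class indices; scores: (N,) confidences
    count:   number of valid detections
    """
    detections = []
    for i in range(int(count)):
        if scores[i] < threshold:
            continue                      # drop low-confidence detections
        ymin, xmin, ymax, xmax = boxes[i]
        detections.append({
            "class_id": int(classes[i]),
            "score": float(scores[i]),
            "box_px": (int(xmin * frame_w), int(ymin * frame_h),
                       int(xmax * frame_w), int(ymax * frame_h)),
        })
    return detections

# Synthetic example: one confident detection, one below threshold
boxes = np.array([[0.1, 0.2, 0.5, 0.6], [0.0, 0.0, 1.0, 1.0]])
dets = parse_ssd_outputs(boxes, np.array([16, 0]),
                         np.array([0.9, 0.3]), count=2)
print(dets)
```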

Tested on Raspberry Pi 5 (8GB), Raspberry Pi OS Bookworm 64-bit, TFLite Runtime 2.16, Python 3.11