Your 200MB TensorFlow model can't run on a phone. After TFLite INT8 quantization, it's 12MB and runs in 18ms on a Pixel 8. That's the promise of edge deployment: taking your bloated, GPU-hungry creation and forcing it onto a device with the computational power of a toaster. The reality is a gauntlet of conversion errors, mysterious accuracy drops, and deployment scripts that fail silently. This guide is for when you've got a working model.save() and the naive TFLiteConverter path has left you with a file that's either too slow, too big, or just plain broken. We're going to shove that model onto Android and Raspberry Pi, and we're going to make it fast.
## From SavedModel to .tflite: The Converter Gauntlet
You don't deploy a Keras model. You deploy a SavedModel. That's your artifact. The jump from model.save('my_model') to a .tflite file is where most attempts die. The basic converter is a one-liner that lies to you about its simplicity.
```python
import tensorflow as tf

# The naive converter. This alone will probably fail or give you a bad model.
converter = tf.lite.TFLiteConverter.from_saved_model('path/to/your/saved_model')

# This is where you start adding options that actually matter.
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # We'll get to quantization
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,   # The core TFLite ops
    tf.lite.OpsSet.SELECT_TF_OPS,     # CRITICAL: for ops TFLite doesn't support natively
]
tflite_model = converter.convert()

# Save it.
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
```
The first trap is supported_ops. If your model uses anything remotely exotic (e.g., tf.unique, certain types of tf.gather), leaving out SELECT_TF_OPS will cause a conversion error. The trade-off? A slightly larger binary and potential speed hit, but it's often the only way forward.
Real Error & Fix: ValueError: Input 0 of layer is incompatible with the layer... This often surfaces during conversion, not training. The fix is to check your input_shape in the first layer of your original Keras model matches the actual data shape you intend for inference, including the batch dimension. If your model expects (None, 224, 224, 3) (dynamic batch), ensure your representative dataset for quantization (next section) supplies data with shape (1, 224, 224, 3).
## Dynamic vs. Full INT8: Choosing Your Poison
Quantization isn't one thing. You have two main paths, and picking the wrong one turns your accurate model into a random number generator.
- Dynamic Range Quantization (The Quick and Dirty): Converts weights to INT8 but keeps activations in float. It's done post-training, requires no data, and is as simple as `converter.optimizations = [tf.lite.Optimize.DEFAULT]`. You get about a 3x smaller model immediately. The speedup is modest (maybe 1.5x), and accuracy loss is usually minimal. Use this when you need a quick win, are prototyping, or your model is stubbornly accuracy-sensitive.
- Full Integer Quantization (INT8) (The Performance Play): This is what gives you the 2x faster inference vs. FP32. It converts both weights and activations to INT8, which requires a representative dataset to calibrate the activation ranges. Without one, you'll get a model that only speaks integer and will fail at runtime. Use this when latency and binary size are non-negotiable, and you have a calibration dataset you can trust.
The rule of thumb: Start with Dynamic. If it's not fast enough, endure the pain of Full INT8.
## The Representative Dataset: Calibration Isn't Training
This is the most misunderstood step. A representative dataset isn't your validation set. It's a small, unlabeled sample (100-500 examples) of typical input data used solely to observe the range of activation values for each layer. No gradients are computed. No weights are updated. The converter uses these ranges to map float values to the 256 integer levels of INT8.
```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Use your tf.data pipeline, but yield samples one by one.
    # This is an MNIST example. Replace with your data source.
    (train_images, _), _ = tf.keras.datasets.mnist.load_data()
    images = train_images.astype(np.float32) / 255.0
    images = np.expand_dims(images, axis=-1)  # Add channel dimension
    for i in range(100):  # 100 calibration samples
        # Yield a single sample with a batch dimension of 1.
        yield [images[i:i+1]]

converter = tf.lite.TFLiteConverter.from_saved_model('path/to/saved_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# This line enforces full INT8 quantization: conversion fails if any op can't be quantized.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Set the I/O tensor types. Keep tf.float32 if your pipeline feeds floats.
converter.inference_input_type = tf.uint8   # Or tf.float32
converter.inference_output_type = tf.uint8  # Or tf.float32
int8_tflite_model = converter.convert()
```
Get this wrong—by using atypical data or too few samples—and your quantized model's accuracy will nosedive because the integer ranges are miscalibrated.
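To see why calibration quality matters, here is the arithmetic in miniature. This is a simplified pure-Python sketch of asymmetric uint8 affine quantization, roughly what the converter derives from observed activation ranges (the helper names are this sketch's invention, and the real TFLite kernels have more wrinkles): a value outside the calibrated range simply clips, and the reconstruction error explodes.

```python
def calibration_params(rmin, rmax, qmin=0, qmax=255):
    # Derive scale/zero_point from an observed activation range
    # (simplified: the range is nudged to include zero, as required).
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = round(qmin - rmin / scale)
    return scale, zero_point

def quantize(x, scale, zp, qmin=0, qmax=255):
    # Map a float to an integer level, clipping to the quantized range.
    return max(qmin, min(qmax, round(x / scale) + zp))

def dequantize(q, scale, zp):
    return scale * (q - zp)

x = 0.7  # a typical real activation value

# Well-calibrated: the representative dataset exercised [-1, 1].
scale, zp = calibration_params(-1.0, 1.0)
err_good = abs(dequantize(quantize(x, scale, zp), scale, zp) - x)
print(err_good)  # ~0.002: just rounding error

# Miscalibrated: atypical data suggested [-0.1, 0.1], so 0.7 clips to the top.
scale_bad, zp_bad = calibration_params(-0.1, 0.1)
err_bad = abs(dequantize(quantize(x, scale_bad, zp_bad), scale_bad, zp_bad) - x)
print(err_bad)   # ~0.60: the value is destroyed
```

Multiply that clipping across every layer and every activation, and the accuracy cliff dive stops being mysterious.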
## Shoving It Into Android: Kotlin and the Interpreter
On Android, you're not running TensorFlow. You're running the TensorFlow Lite interpreter, a lean C++ library wrapped in a Java/Kotlin API. The model is just a static asset.
- Add the dependency to your `app/build.gradle.kts`:

  ```kotlin
  dependencies {
      implementation("org.tensorflow:tensorflow-lite:2.14.0")
      // Optional: GPU delegate for acceleration
      implementation("org.tensorflow:tensorflow-lite-gpu:2.14.0")
  }
  ```

- Drop the `.tflite` file into `app/src/main/assets/`.
- Load and run it in your Kotlin code:
```kotlin
import android.content.Context
import org.tensorflow.lite.Interpreter
import java.io.FileInputStream
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.channels.FileChannel

class TFLiteModelRunner(context: Context) {
    private val interpreter: Interpreter

    init {
        // Memory-map the model from assets (no copy onto the Java heap).
        // Note: .tflite assets must be stored uncompressed
        // (aaptOptions { noCompress("tflite") } in your Gradle config).
        val fd = context.assets.openFd("model.tflite")
        val modelBuffer = FileInputStream(fd.fileDescriptor).channel.map(
            FileChannel.MapMode.READ_ONLY, fd.startOffset, fd.declaredLength
        )
        val options = Interpreter.Options().apply {
            // setNumThreads(4)
            // Use this for GPU acceleration (check compatibility first):
            // addDelegate(GpuDelegate())
        }
        interpreter = Interpreter(modelBuffer, options)
    }

    fun runInference(inputData: FloatArray): FloatArray {
        // 1. Prepare the input ByteBuffer (CRITICAL: must match the model's I/O types).
        val inputBuffer = ByteBuffer.allocateDirect(4 * inputData.size)
            .order(ByteOrder.nativeOrder())
        inputBuffer.asFloatBuffer().put(inputData)

        // 2. Prepare the output buffer (example for a [1, N] output tensor).
        val outputShape = interpreter.getOutputTensor(0).shape()
        val outputData = arrayOf(FloatArray(outputShape[1]))

        // 3. Run.
        interpreter.run(inputBuffer, outputData)
        return outputData[0]
    }
}
```
The devil is in the buffers. If you quantized to INT8, your ByteBuffer must be ByteBuffer.allocateDirect(inputData.size) and you'll load byte values, not float. Mismatch here causes silent, garbage outputs.
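The mismatch is easiest to see at the byte level. Here's a quick sketch in Python (the scale and zero-point are invented example values, not from a real model): an INT8 model wants exactly one byte per value, so packing raw float32s produces a buffer four times too large, and the interpreter ends up reading IEEE-754 bit patterns as data.

```python
import struct

scale, zero_point = 1 / 128, 128   # example quantization params (assumed)
floats = [0.5, -0.25, 0.99]

# Correct: quantize each float to a single uint8 byte, as an INT8 model expects.
quantized = bytes(max(0, min(255, round(f / scale) + zero_point)) for f in floats)
print(list(quantized))            # [192, 96, 255]

# Wrong: packing raw float32s gives 4 bytes per value. Feed this to an INT8
# model and it either rejects the buffer size or silently reads garbage.
wrong = struct.pack('<3f', *floats)
print(len(quantized), len(wrong))  # 3 12
```

The Kotlin equivalent of the correct path is a direct `ByteBuffer` of size `inputData.size`, filled with the quantized byte values.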
## Raspberry Pi: The ARM Battle
On Raspberry Pi (Linux ARM), you bypass the full TensorFlow install and use the tflite_runtime pip package—a 30MB install vs 500MB.
```shell
# For Python on Raspberry Pi OS (32-bit or 64-bit ARM)
pip install tflite-runtime
```
Then, your Python script is interpreter-centric:
```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Load the model
interpreter = tflite.Interpreter(model_path='model.tflite')
interpreter.allocate_tensors()

# Get input/output details
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Prepare input (your_float_data is your preprocessed float array).
# Check whether the input tensor is quantized.
if input_details['dtype'] == np.uint8:
    # Quantized model: map your float input to uint8 using scale/zero_point.
    input_scale, input_zero_point = input_details['quantization']
    input_data = np.round(your_float_data / input_scale + input_zero_point).astype(np.uint8)
else:
    # Float model
    input_data = np.array(your_float_data, dtype=np.float32)

# Run inference
interpreter.set_tensor(input_details['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details['index'])

# If the output is quantized, dequantize it.
if output_details['dtype'] == np.uint8:
    output_scale, output_zero_point = output_details['quantization']
    output_data = output_scale * (output_data.astype(np.float32) - output_zero_point)
```
Real Error & Fix: SavedModel load error: No such attribute 'call'. This happens when you try to load a Keras-formatted .keras file as a SavedModel in a serving environment. The fix is explicit saving: Use model.export('path/') for a pure SavedModel (best for TFLite conversion/TF Serving), or use model.save('path.keras') for the Keras format. Know which one you have.
## The Hard Numbers: What You Gain, What You Lose
Let's be concrete. Here's what you can expect when quantizing a medium-sized vision model (like a MobileNetV2) for a Pixel 8 and a Raspberry Pi 4.
| Metric | Original FP32 (SavedModel) | Dynamic Quantization (INT8 Weights) | Full INT8 Quantization |
|---|---|---|---|
| Model Size | 14 MB | 4.7 MB (3x smaller) | 4.7 MB (3x smaller) |
| Inference Latency (Pixel 8) | 42 ms | 28 ms (1.5x faster) | 18 ms (2.3x faster) |
| Inference Latency (RPi 4) | 210 ms | 155 ms (1.4x faster) | 95 ms (2.2x faster) |
| Accuracy (Top-1) | 71.8% | 71.5% (-0.3 pp) | 70.1% (-1.7 pp) |
Table: Benchmark of a MobileNetV2-style model on ImageNet validation subset. Your mileage will vary.
The trade-off is clear. Full INT8 buys you speed at the cost of accuracy. The 1.7 percentage point drop is typical and often acceptable for edge applications. If it's not, you need to debug.
## Debugging the Accuracy Cliff Dive
Your full INT8 model is fast but useless. Don't panic. Use the TFLite Model Analyzer.
```shell
# One option: inspect the flatbuffer directly (requires the flatbuffers tooling)
flatc --version
```

A simpler, more direct method: use the Python interpreter to dump every tensor's dtype and quantization parameters.

```python
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='your_model.tflite')
interpreter.allocate_tensors()

for detail in interpreter.get_tensor_details():
    print(f"Name: {detail['name']:30} | Shape: {detail['shape']} | "
          f"Dtype: {detail['dtype']} | Quant: {detail['quantization']}")
```
Look for layers that stayed in float (dtype: float32). These are ops that TFLite cannot quantize (e.g., some tf.unique, certain custom ops). They become delegation boundaries, forcing costly data conversion between int and float, destroying your latency gains and potentially hurting accuracy. The solution is often to replace that op, use a different model architecture, or accept a hybrid model.
Profile on-device. On Android, use Android Studio's System Trace. On Raspberry Pi, use simple wall-clock timing around interpreter.invoke(). The bottleneck might not be the model but your data preprocessing on the CPU.
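That wall-clock timing deserves a little rigor: warm up first (the first few invokes pay for allocation and delegate initialization), then report percentiles rather than a mean. A minimal stdlib harness, sketched here with a stand-in workload (the `benchmark` function and its parameters are this sketch's invention; on the Pi you'd pass `interpreter.invoke` as the callable):

```python
import time

def benchmark(invoke, warmup=10, runs=50):
    # Wall-clock a zero-arg inference callable, e.g. interpreter.invoke.
    for _ in range(warmup):   # warm caches; first invokes pay one-time costs
        invoke()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        invoke()
        times.append((time.perf_counter() - t0) * 1000.0)  # milliseconds
    times.sort()
    return {'p50_ms': times[len(times) // 2],
            'p95_ms': times[int(len(times) * 0.95)]}

# Stand-in workload here; on-device: benchmark(interpreter.invoke)
stats = benchmark(lambda: sum(range(1000)))
print(sorted(stats))  # ['p50_ms', 'p95_ms']
```

Wrap your preprocessing in the same harness separately; if resizing and normalizing the image costs more than `invoke()`, quantizing harder won't save you.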
## Next Steps: Beyond the Basic .tflite File
You have a working, quantized model on device. What now?
- Benchmark for Real: Use the TensorFlow Lite Benchmark Tool for Android. On Pi, write a tight loop and measure. Compare against the 15,000 req/s on 4-core CPU benchmark for TF Serving to set realistic expectations for your edge hardware.
- Explore Delegates: The GPU delegate for Android can speed up compatible ops further. The XNNPACK delegate (enabled by default in recent versions) optimizes float models. For Raspberry Pi, the Coral USB Accelerator (Edge TPU) offers delegate support for insane speedups on fully compatible models.
- Consider the Pipeline: If your edge app is part of a larger system, look at TFX—which powers 50%+ of Google's production ML models—to automate the retraining, validation, and conversion pipeline that feeds your edge deployments.
- Embrace the Ecosystem: Remember, TensorFlow Lite is deployed on 6B+ devices. You're not building a one-off hack; you're using a core platform. Integrate with ML Kit for turn-key Android solutions, or explore TensorFlow.js for browser-based edge AI if your target expands.
The edge isn't the consolation prize for models that aren't big enough for the cloud. It's the constraint that forces efficiency, the requirement that breeds ingenuity. Stop trying to shrink the cloud. Build for the edge from the start.