Android 16 AI: Integrate On-Device Gemini Nano in Kotlin

Add on-device AI to your Android 16 app using Gemini Nano and the ML Kit Inference API — no internet, no API key required.

Problem: Your Android App Needs AI Without the Cloud

You want to add text summarization, smart replies, or content classification to your Android app — but cloud API calls mean latency, cost, and privacy concerns.

Android 16 ships with Gemini Nano baked in. Here's how to use it directly from Kotlin.

You'll learn:

  • How to check if Gemini Nano is available on a device
  • How to run inference with the Android AI Core API
  • How to stream responses for a responsive UX

Time: 25 min | Level: Intermediate


Why This Happens

Google introduced the AICore system service in Android 14 QPR1 and expanded it significantly for Android 16. The model runs on-device via a system-level singleton — your app doesn't bundle the model weights, it requests access through a structured API.

This means:

  • Gemini Nano must be downloaded on the device (happens automatically on supported hardware)
  • Pixel 9+ and devices with 8GB+ RAM are primary targets
  • The API is part of Google Play Services, not AOSP

Common blockers:

  • Device doesn't meet hardware requirements (API returns FEATURE_NOT_SUPPORTED)
  • Model not yet downloaded (API returns DOWNLOADING)
  • Requesting too large a context window for on-device limits

Android AI Core architecture diagram showing app → AICore service → Gemini Nano model Your app talks to AICore as a system service — the model never lives in your APK


Solution

Step 1: Add Dependencies

// build.gradle.kts (app)
dependencies {
    // Android AI Core - on-device inference
    implementation("com.google.android.gms:play-services-mlkit-language-id:17.0.0")
    implementation("com.google.ai.edge.aicore:aicore:0.0.1-exp01")
}

Also add the required manifest declaration:

<!-- AndroidManifest.xml -->
<uses-feature
    android:name="android.software.ai_capabilities"
    android:required="false" />

Setting required="false" lets your app install on all devices — you'll handle availability gracefully in code.

Expected: Gradle sync completes, no dependency conflicts.

If it fails:

  • Unresolved reference aicore: Confirm you're on AGP 8.5+ and targeting compileSdk = 36
  • Manifest merge conflict: Remove duplicate <uses-feature> from a library

Step 2: Check Availability Before Use

Never assume Gemini Nano is ready. Always check first.

import com.google.ai.edge.aicore.GenerativeModel
import com.google.ai.edge.aicore.DownloadCallback
import com.google.ai.edge.aicore.generationConfig

class AiAvailabilityChecker(private val context: Context) {

    suspend fun checkAndPrepare(): Result<GenerativeModel> {
        val config = generationConfig {
            context = this@AiAvailabilityChecker.context
        }

        return try {
            val model = GenerativeModel(
                generationConfig = config
            )

            // prepareInferenceSession triggers download if needed
            when (val state = model.checkAvailability()) {
                AvailabilityStatus.AVAILABLE -> Result.success(model)
                AvailabilityStatus.DOWNLOADING -> {
                    // Wait for download to complete
                    awaitDownload(model)
                }
                else -> Result.failure(
                    UnsupportedOperationException("Gemini Nano not supported: $state")
                )
            }
        } catch (e: Exception) {
            Result.failure(e)
        }
    }

    private suspend fun awaitDownload(model: GenerativeModel): Result<GenerativeModel> {
        return suspendCoroutine { cont ->
            model.downloadModel(object : DownloadCallback {
                override fun onDownloadCompleted() {
                    cont.resume(Result.success(model))
                }
                override fun onDownloadFailed(e: Exception) {
                    cont.resume(Result.failure(e))
                }
            })
        }
    }
}

Why checkAvailability() first: Calling generateContent() on an unavailable model throws a checked exception. Explicit state checking makes your error handling cleaner and your UX more informative.


Step 3: Run Inference with Streaming

Streaming matters here — on-device models can be slower than cloud APIs on mid-range hardware.

import com.google.ai.edge.aicore.generationConfig
import kotlinx.coroutines.flow.Flow

class OnDeviceAiService(private val context: Context) {

    private val model by lazy {
        GenerativeModel(
            generationConfig = generationConfig {
                context = this@OnDeviceAiService.context
                // Keep temperature low for factual tasks
                temperature = 0.2f
                // On-device limit: typically 512-1024 tokens output
                maxOutputTokens = 512
            }
        )
    }

    // Streaming: emit tokens as they're generated
    fun summarize(text: String): Flow<String> {
        val prompt = buildPrompt(text)
        return model.generateContentStream(prompt)
            .map { chunk -> chunk.text ?: "" }
    }

    // Non-streaming: wait for full response
    suspend fun classify(text: String): String {
        val prompt = "Classify this text into one category " +
            "(positive/negative/neutral): $text"
        val response = model.generateContent(prompt)
        return response.text?.trim() ?: "unknown"
    }

    private fun buildPrompt(input: String): String =
        """
        Summarize the following in 2-3 sentences. Be concise.
        
        Text: $input
        
        Summary:
        """.trimIndent()
}

Expected: summarize() returns a Flow<String> that emits tokens progressively.

If it fails:

  • maxOutputTokens exception: On-device models cap at 512 on most Pixel 9 devices — reduce if you see TOKEN_LIMIT_EXCEEDED
  • Empty chunk.text: Some chunks carry metadata only; the ?: "" fallback handles this

Step 4: Wire Up to Your UI (Compose)

@Composable
fun SummarizerScreen(
    aiService: OnDeviceAiService = remember { OnDeviceAiService(LocalContext.current) }
) {
    var inputText by remember { mutableStateOf("") }
    var summary by remember { mutableStateOf("") }
    var isLoading by remember { mutableStateOf(false) }
    val scope = rememberCoroutineScope()

    Column(modifier = Modifier.padding(16.dp)) {
        OutlinedTextField(
            value = inputText,
            onValueChange = { inputText = it },
            label = { Text("Paste text to summarize") },
            modifier = Modifier.fillMaxWidth()
        )

        Spacer(modifier = Modifier.height(8.dp))

        Button(
            onClick = {
                isLoading = true
                summary = ""
                scope.launch {
                    aiService.summarize(inputText)
                        .collect { token ->
                            summary += token  // Stream tokens into UI
                        }
                    isLoading = false
                }
            },
            enabled = inputText.isNotBlank() && !isLoading
        ) {
            Text(if (isLoading) "Summarizing..." else "Summarize")
        }

        Spacer(modifier = Modifier.height(16.dp))

        if (summary.isNotBlank()) {
            Text(
                text = summary,
                style = MaterialTheme.typography.bodyMedium
            )
        }
    }
}

Why stream into UI: First token appears in ~300ms on Pixel 9. Without streaming, the user sees a blank screen for 3-8 seconds. The += pattern gives the typing effect for free.

Compose UI with text input and streaming summary output Tokens stream in progressively — no spinner needed


Verification

# Run on a physical device (emulator doesn't support AICore)
adb shell cmd aicore status

You should see:

AICore status: AVAILABLE
Gemini Nano: DOWNLOADED (v1.x.x)

Then run your app and paste a paragraph of text. Summary should begin appearing within 1 second.

Successful inference output in Android Studio Logcat Logcat showing inference completing with token count and latency


What You Learned

  • Gemini Nano lives in the system, not your APK — always check availability before calling inference
  • Streaming (generateContentStream) is essential for on-device AI UX; latency is real
  • Token limits are hardware-dependent — design prompts to stay under 512 output tokens

Limitations to know:

  • No on-device fine-tuning — you work with the base model only
  • Context window is smaller than cloud Gemini; chunk long documents before sending
  • Emulators don't support AICore — test on physical hardware only

When NOT to use this:

  • Tasks requiring >1000 output tokens (summarizing entire books, long-form generation) — fall back to cloud
  • Devices running Android <14 QPR1 — add a cloud fallback path

Tested on Android 16 DP2, Pixel 9 Pro, Gemini Nano 1.0, Kotlin 2.0, Compose BOM 2025.04