Run and Fine-Tune LLMs on Mac with MLX-LM 2026

Run and fine-tune LLMs on Apple Silicon using MLX-LM. Covers install, quantization, LoRA fine-tuning, and serving — tested on M2/M3 with 16GB RAM.

MLX-LM lets you run and fine-tune LLMs directly on Apple Silicon — using the full unified memory pool, no discrete GPU required. On a 16GB M2 MacBook Pro, you can run Mistral 7B at ~40 tokens/sec and fine-tune a LoRA adapter on a custom dataset in under 30 minutes, all for $0 in cloud spend.

This guide covers everything: install, model download, quantization, inference, LoRA fine-tuning, and local serving with an OpenAI-compatible endpoint.

You'll learn:

  • How to install MLX-LM with uv and download models from Hugging Face
  • How to run quantized 4-bit inference on 8GB and 16GB Macs
  • How to fine-tune a LoRA adapter on your own dataset and merge it back
  • How to serve any MLX model with an OpenAI-compatible HTTP server

Time: 25 min | Difficulty: Intermediate


Why MLX-LM Outperforms Other Mac Inference Stacks

Most inference tools (llama.cpp, Ollama) were built for NVIDIA GPUs and ported to Metal. MLX was designed by Apple's ML Research team from scratch for the unified memory architecture of Apple Silicon. The result: zero memory copies between CPU and GPU, and a compute graph that maps directly to the Neural Engine + GPU together.

For fine-tuning especially, this matters. PyTorch on Mac requires moving tensors across memory buses. MLX keeps everything in one flat address space, which cuts fine-tune overhead by 30–40% on the same M-series chip.

Figure: MLX-LM pipeline. Model weights load into unified memory once, shared by the CPU, GPU, and Neural Engine with zero copy overhead.

Supported Apple Silicon: M1, M2, M3, M4 — all variants (Pro, Max, Ultra)
Minimum RAM: 8GB (quantized 7B models) · 16GB (13B quantized, or 7B full precision) · 32GB+ (34B+ quantized)
macOS: Ventura 13.5+ required · Sonoma 14.x and Sequoia 15.x recommended


Install MLX-LM

The MLX team ships mlx-lm on PyPI. Use uv for fast, isolated installs — it avoids the pip dependency conflicts that often break MLX on shared environments.

Step 1: Install uv and create a virtual environment

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a project environment — Python 3.11+ required
uv venv mlx-env --python 3.12
source mlx-env/bin/activate

Expected output: Using Python 3.12.x and a prompt prefix (mlx-env).

Step 2: Install mlx-lm

# mlx-lm pulls the mlx package as a dependency automatically
uv pip install mlx-lm

Expected output: Successfully installed mlx-0.x.x mlx-lm-0.x.x

If it fails:

  • ERROR: No matching distribution for mlx → You're on an Intel Mac. MLX only runs on Apple Silicon (M1 and later); Intel Macs are not supported.
  • ImportError: dlopen ... no suitable image → macOS version too old. Upgrade to Ventura 13.5+.

Step 3: Verify install

python -c "import mlx_lm; print('mlx-lm ready')"

You should see: mlx-lm ready


Download and Run a Model

MLX-LM reads models in the MLX format, which the community maintains on Hugging Face under the mlx-community org. These are pre-converted and pre-quantized — no manual conversion needed for common models.

Step 4: Run your first model

# --model pulls from mlx-community HuggingFace org automatically
# 4-bit quantized Mistral 7B — fits in 8GB RAM
mlx_lm.generate \
  --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --prompt "Explain LoRA fine-tuning in two sentences." \
  --max-tokens 200

Expected output: Model downloads to ~/.cache/huggingface/hub/, then streams the response. First run takes 1–2 min for download. Subsequent runs start in under 3 seconds.

If it fails:

  • huggingface_hub.errors.GatedRepoError → Model is gated. Run huggingface-cli login and accept the license on the model page.
  • mlx.core.metal.is_available() == False → Running inside Rosetta or a non-native Python. Install a native arm64 Python: uv venv --python 3.12 pulls the arm64 build automatically.

Step 5: Try a chat model with system prompt

mlx_lm.generate \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --prompt "What is the capital of France?" \
  --system "You are a concise assistant. Answer in one sentence." \
  --max-tokens 60 \
  --temp 0.0   # temp=0 for deterministic output — good for testing

Recommended models by RAM:

RAM   | Model                    | Quantization | Speed (tok/s)
8GB   | Llama 3.2 3B Instruct    | 4-bit        | ~90
8GB   | Mistral 7B Instruct v0.3 | 4-bit        | ~40
16GB  | Llama 3.1 8B Instruct    | 4-bit        | ~45
16GB  | Mistral 7B Instruct v0.3 | 8-bit        | ~32
32GB  | Llama 3.3 70B Instruct   | 4-bit        | ~12
64GB  | Llama 3.3 70B Instruct   | 8-bit        | ~9

Convert and Quantize Your Own Model

If you want to run a model not already in the mlx-community org, convert it yourself in two commands.

Step 6: Convert a HuggingFace model to MLX format

# Convert Qwen2.5-7B-Instruct to MLX 4-bit
# --quantize applies post-training quantization during conversion
mlx_lm.convert \
  --hf-path Qwen/Qwen2.5-7B-Instruct \
  --mlx-path ./qwen2.5-7b-mlx-4bit \
  --quantize \
  --q-bits 4        # 4-bit = best RAM/speed tradeoff for 7B models

Expected output: Progress bar downloading weights, then Saved weights to ./qwen2.5-7b-mlx-4bit. Conversion takes 5–10 min depending on model size and internet speed.

Quantization options:

--q-bits | RAM for 7B | Quality       | Use case
4        | ~4.5 GB    | Good          | Daily inference on 8GB Mac
6        | ~6 GB      | Better        | 16GB Mac, quality-sensitive tasks
8        | ~8 GB      | Near lossless | 16GB Mac, code and math
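
The RAM figures above follow from simple arithmetic: quantized weights take roughly parameter-count × bits / 8 bytes, plus overhead for quantization scales, embeddings, and runtime buffers. A minimal sketch, assuming a ~25% overhead factor chosen for illustration (not an MLX constant):

```python
def estimate_quantized_ram_gb(n_params: float, q_bits: int, overhead: float = 1.25) -> float:
    """Rough RAM estimate for a quantized model.

    n_params: parameter count (e.g. 7e9 for a 7B model)
    q_bits:   quantization bit width (4, 6, or 8)
    overhead: fudge factor for scales, embeddings, and runtime buffers
              (~25% assumed here; tune it against real measurements)
    """
    bytes_total = n_params * q_bits / 8 * overhead
    return bytes_total / 1e9

# A 7B model at the three common bit widths:
for bits in (4, 6, 8):
    print(f"{bits}-bit: ~{estimate_quantized_ram_gb(7e9, bits):.1f} GB")
```

The estimates land close to the table above, which is why 4-bit is the default choice on 8GB machines.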

Step 7: Run the converted model

mlx_lm.generate \
  --model ./qwen2.5-7b-mlx-4bit \
  --prompt "Write a Python function to reverse a linked list." \
  --max-tokens 400

LoRA Fine-Tuning on Your Own Dataset

This is where MLX-LM shines over Ollama and llama.cpp — it supports full LoRA fine-tuning on-device, no cloud required. A 7B model fine-tuned on 500 examples takes about 20–25 minutes on an M2 Pro.

Step 8: Prepare your dataset

MLX-LM expects JSONL format with prompt and completion keys — or chat format with a messages key.

mkdir -p ./finetune-data

Create ./finetune-data/train.jsonl:

{"prompt": "What is gradient descent?", "completion": "Gradient descent is an optimization algorithm that iteratively adjusts model parameters in the direction that reduces the loss function."}
{"prompt": "Explain attention in transformers.", "completion": "Attention allows each token to weigh the relevance of every other token in the sequence when computing its output representation."}

Create ./finetune-data/valid.jsonl with 10–20% of your data for validation loss tracking.

Minimum dataset size: 50 examples will train but overfit. 200–500 examples is the practical minimum for a visible behavior change; 1,000–5,000 for a meaningful style or domain shift.
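
Past a few dozen examples, writing JSONL by hand gets tedious. A minimal sketch that shuffles a list of examples and writes train.jsonl and valid.jsonl in the format above; the 90/10 split is a suggested default, not an MLX requirement:

```python
import json
import random
from pathlib import Path

def write_splits(examples, out_dir="./finetune-data", valid_frac=0.1, seed=42):
    """Shuffle examples and write train.jsonl / valid.jsonl for mlx_lm.lora."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    rng = random.Random(seed)              # fixed seed for a reproducible split
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_frac))
    splits = {"valid.jsonl": shuffled[:n_valid], "train.jsonl": shuffled[n_valid:]}
    for name, rows in splits.items():
        with open(out / name, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

examples = [
    {"prompt": "What is gradient descent?",
     "completion": "An optimization algorithm that iteratively reduces the loss."},
    {"prompt": "Explain attention in transformers.",
     "completion": "Attention lets each token weigh the relevance of every other token."},
]
write_splits(examples)
```

Point --data at the output directory; mlx_lm.lora picks up both files by name.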

Step 9: Run LoRA fine-tuning

# batch-size: 4 fits in 16GB; drop to 2 for 8GB
# lora-layers: how many transformer layers get LoRA; more = slower but better
# iters: ~600 iters on 200 examples = ~3 epochs
mlx_lm.lora \
  --train \
  --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --data ./finetune-data \
  --batch-size 4 \
  --lora-layers 16 \
  --iters 600 \
  --learning-rate 1e-4 \
  --adapter-path ./mistral-lora-adapter

Expected output: Training loss printed every 10 iterations. Valid loss every 100. Final adapter saved to ./mistral-lora-adapter/adapters.safetensors.

Watch for:

  • Loss not decreasing after 100 iters → Learning rate too high. Try --learning-rate 5e-5.
  • mlx.core.metal.MetalAllocationError → Batch size too large. Drop --batch-size to 2 or 1.
  • Valid loss rising while train loss drops → Overfitting. Reduce --iters by 30%.
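
The first symptom can be checked mechanically from the printed loss values. A sketch of a plateau detector over (iteration, loss) pairs; the 10-entry window and 2% threshold are arbitrary choices for illustration:

```python
def loss_plateaued(history, window=10, min_rel_drop=0.02):
    """Return True if loss failed to drop by min_rel_drop over the last `window` entries.

    history: list of (iteration, train_loss) tuples, oldest first,
             e.g. collected from the loss printed every 10 iterations.
    """
    if len(history) < window:
        return False                       # not enough data to judge yet
    recent = [loss for _, loss in history[-window:]]
    drop = (recent[0] - recent[-1]) / recent[0]
    return drop < min_rel_drop

# Loss stuck around 2.1 -> plateau; consider lowering --learning-rate
flat = [(i * 10, 2.1) for i in range(12)]
falling = [(i * 10, 3.0 - 0.05 * i) for i in range(12)]
print(loss_plateaued(flat), loss_plateaued(falling))
```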

Figure: LoRA training loop. Frozen base weights + trainable rank decomposition matrices → adapter saved separately → merged on demand.

Step 10: Test the adapter before merging

# Load base model + adapter without merging — faster for iteration
mlx_lm.generate \
  --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --adapter-path ./mistral-lora-adapter \
  --prompt "What is gradient descent?" \
  --max-tokens 150

Step 11: Merge adapter into base model

Once satisfied with the adapter, fuse it into the weights for single-file distribution and faster load times.

mlx_lm.fuse \
  --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --adapter-path ./mistral-lora-adapter \
  --save-path ./mistral-finetuned-merged \
  --de-quantize   # optional: merge into full-precision weights before re-quantizing

Expected output: Saved fused model to ./mistral-finetuned-merged


Serve MLX Models with an OpenAI-Compatible API

MLX-LM ships a built-in HTTP server that mirrors the OpenAI /v1/chat/completions endpoint. Any tool that talks to OpenAI — LangChain, Continue.dev, Cursor, Open WebUI — works against it with a one-line config change.

Step 12: Start the server

# Serves on http://localhost:8080 by default
mlx_lm.server \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --port 8080 \
  --host 127.0.0.1   # keep local-only; don't expose to LAN without auth

Expected output: A startup log confirming the server is listening on http://127.0.0.1:8080. The exact format varies by mlx-lm version; recent versions also print a warning that mlx_lm.server is intended for local development, not production.

Step 13: Test the endpoint

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.1-8B-Instruct-4bit",
    "messages": [{"role": "user", "content": "What year did the Berlin Wall fall?"}],
    "max_tokens": 60,
    "temperature": 0.0
  }'

Expected output: JSON response with choices[0].message.content containing the answer.
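
The same request can be sent from Python with only the standard library. A minimal sketch; chat() assumes the Step 12 server is running on localhost:8080, so only call it with the server up:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"   # mlx_lm.server from Step 12

def build_request(prompt, model="mlx-community/Llama-3.1-8B-Instruct-4bit"):
    """Build the same JSON body the curl example sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 60,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt):
    """Send the request and pull the assistant text out of the response."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("What year did the Berlin Wall fall?")  # requires the server to be running
```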

Step 14: Point LangChain at the local server

from langchain_openai import ChatOpenAI

# base_url overrides the OpenAI endpoint — api_key is required but ignored by mlx_lm.server
llm = ChatOpenAI(
    model="mlx-community/Llama-3.1-8B-Instruct-4bit",
    base_url="http://localhost:8080/v1",
    api_key="not-needed",   # mlx_lm.server doesn't validate keys
    temperature=0.0,
)

response = llm.invoke("What is LoRA fine-tuning?")
print(response.content)

MLX-LM vs Ollama on Apple Silicon

Both run locally on Mac. The right choice depends on your use case.

                                   | MLX-LM             | Ollama
Backend                            | Apple MLX (native) | llama.cpp (Metal port)
Fine-tuning (LoRA)                 | ✅ Built-in         | ❌ Not supported
Model format                       | MLX / SafeTensors  | GGUF
Model hub                          | mlx-community HF   | Ollama registry
Inference speed (7B 4-bit, M2 Pro) | ~40 tok/s          | ~35 tok/s
OpenAI-compatible server           | mlx_lm.server      | ollama serve
GUI / easy onboarding              | ❌ CLI only         | ✅ Desktop app
Python API                         | ✅ First-class      | ⚠️ REST only
Custom quantization                | --q-bits 2–8       | ⚠️ GGUF presets only
Pricing (self-hosted)              | Free               | Free

Choose MLX-LM if: You need LoRA fine-tuning, Python-native integration, or custom quantization control.
Choose Ollama if: You want a one-click install with a GUI and don't need fine-tuning.


Verification

# Confirm end-to-end: convert → infer → serve
mlx_lm.generate \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --prompt "Summarize LoRA in one sentence." \
  --max-tokens 80

You should see: A coherent one-sentence summary streamed to your terminal within 2–3 seconds of model load.

Check GPU utilization while inference runs:

# In a second terminal — watch GPU usage spike during generation
sudo powermetrics --samplers gpu_power -i 500 -n 5

You should see: GPU Active % jumping to 60–90% during token generation, confirming MLX is using the GPU.


What You Learned

  • MLX's unified memory architecture eliminates CPU↔GPU transfer overhead, giving Apple Silicon a real edge for local LLM inference and fine-tuning
  • 4-bit quantization (--q-bits 4) fits 7B models in under 5GB, leaving room for the OS and applications on 8GB Macs
  • LoRA adapters let you fine-tune a 7B model in 20–25 minutes without touching base weights — meaning you can swap adapters at load time without re-downloading the base model
  • mlx_lm.server makes any MLX model a drop-in OpenAI API replacement, compatible with LangChain, Continue.dev, and Cursor out of the box

Limitation: MLX-LM doesn't support multi-GPU or distributed inference across multiple Macs. For that, look at vLLM on Linux or llama.cpp with RPC. MLX also can't run on Intel Macs — if you're on Intel, Ollama with llama.cpp is your only native option.

Tested on MLX-LM 0.21.x, MLX 0.24.x, macOS Sequoia 15.3, M2 Pro 16GB and M3 Max 36GB


FAQ

Q: Does MLX-LM work on 8GB RAM Macs?
A: Yes — use 4-bit quantized models under 7B parameters. Mistral 7B 4-bit uses ~4.5GB, leaving ~3GB for macOS. Llama 3.2 3B 4-bit uses ~2.2GB and runs comfortably at 80+ tok/s on M1/M2.

Q: What's the difference between --lora-layers 8 and --lora-layers 16?
A: More LoRA layers means more trainable parameters, better fine-tune quality, but slower training and higher memory use. Start with 8 for quick experiments, increase to 16–32 for production adapters on 16GB+ machines.
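
You can put rough numbers on that trade-off: each adapted linear layer of shape d_out × d_in adds r × (d_in + d_out) trainable weights (the two low-rank factors). A sketch using assumed Mistral-7B-style shapes (rank 8, with q_proj 4096→4096 and v_proj 4096→1024 per layer); check your model's config for the real dimensions:

```python
def lora_params(layer_shapes, rank=8):
    """Trainable parameters added by LoRA adapters.

    layer_shapes: list of (d_in, d_out) for each adapted linear layer.
    Each layer gets two low-rank matrices: A (rank x d_in) and B (d_out x rank).
    """
    return sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)

# Assumed per-layer shapes: q_proj 4096->4096, v_proj 4096->1024
per_layer = [(4096, 4096), (4096, 1024)]

for n_layers in (8, 16, 32):
    total = lora_params(per_layer * n_layers)
    print(f"--lora-layers {n_layers}: ~{total / 1e6:.1f}M trainable params")
```

Even at 32 layers this is only a few million parameters, a tiny fraction of the 7B base, which is why LoRA training fits in laptop RAM.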

Q: Can MLX-LM fine-tune with QLoRA (quantized base + LoRA)?
A: Yes — when you pass a 4-bit quantized model as --model with --train, MLX-LM automatically does QLoRA: the base stays quantized, only the LoRA adapter weights are full precision. No extra flags needed.

Q: How does mlx-lm compare to Hugging Face Transformers on Mac?
A: Transformers on Mac uses PyTorch + MPS backend, which still copies tensors between CPU and GPU memory pools. MLX's unified memory eliminates that copy. In practice, MLX-LM is 20–40% faster for inference and uses 10–20% less peak RAM on equivalent models.

Q: Can I use a fine-tuned MLX model with Open WebUI?
A: Yes — start mlx_lm.server on port 8080, then point Open WebUI's OpenAI-compatible connection to http://localhost:8080. Set any string as the API key (it's ignored). The model dropdown will show the MLX model name.