Format Fine-Tuning Datasets: ShareGPT vs Alpaca Compared 2026

ShareGPT vs Alpaca dataset formatting for LLM fine-tuning explained. Convert, validate, and pick the right format for Unsloth, Axolotl, and TRL. Python 3.12.

ShareGPT vs Alpaca: TL;DR

ShareGPT vs Alpaca dataset formatting is the first real decision you make when fine-tuning an LLM — and the wrong choice silently degrades your model.

|                             | ShareGPT                                   | Alpaca                            |
|-----------------------------|--------------------------------------------|-----------------------------------|
| Best for                    | Multi-turn chat, tool use, system prompts  | Single-turn instruction following |
| Structure                   | conversations list of {from, value}        | instruction, input, output keys   |
| Framework support           | Unsloth, Axolotl, TRL, LLaMA-Factory       | Axolotl, TRL, older scripts       |
| System prompt               | Native (from: system)                      | Bolted on via instruction field   |
| Multi-turn                  | ✅ Native                                  | ❌ Workaround only                |
| Hugging Face Datasets       | load_dataset("json", ...)                  | load_dataset("json", ...)         |
| Self-hosted fine-tune cost  | Free (GPU time only)                       | Free (GPU time only)              |

Choose ShareGPT if: You're fine-tuning a chat model, need multi-turn data, or use Unsloth/Axolotl in 2026.
Choose Alpaca if: You have a legacy dataset already in Alpaca format and a single-turn task like summarization or classification.


Why Format Matters More Than You Think

Most fine-tuning failures aren't caused by bad hyperparameters. They're caused by mismatched data formatting — the model sees malformed tokens, the chat template gets applied twice, or turn boundaries collapse.

Symptoms of wrong format:

  • Loss drops normally but the model ignores instructions at inference
  • apply_chat_template throws a KeyError on conversations
  • Multi-turn evals show the model repeating the human turn verbatim
  • Axolotl warns: dataset_type: sharegpt but found alpaca keys

Both formats are JSON (or JSONL). The difference is in key names and nesting depth.
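A quick way to catch the key mismatch Axolotl warns about is to check the first record before training. detect_format below is an invented helper for illustration, not part of any framework:

```python
# Heuristic format check on a single record.
# detect_format is a hypothetical helper, not a datasets/Axolotl/Unsloth API.
def detect_format(record: dict) -> str:
    if "conversations" in record:
        return "sharegpt"
    if {"instruction", "output"} <= record.keys():
        return "alpaca"
    return "unknown"


print(detect_format({"instruction": "Sum.", "input": "", "output": "ok"}))  # alpaca
print(detect_format({"conversations": [{"from": "human", "value": "hi"}]}))  # sharegpt
```

Run it against json.loads of your file's first line; if it prints "unknown", inspect the keys before going further.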


[Figure] ShareGPT vs Alpaca dataset structure: field mapping and conversation nesting. ShareGPT wraps turns in a conversations list; Alpaca flattens everything into three top-level keys.


Alpaca Format: Structure and Limits

Alpaca was introduced with the Stanford Alpaca paper in 2023. It's flat and simple:

{
  "instruction": "Summarize the following customer complaint in one sentence.",
  "input": "I ordered a laptop on March 1st and it still hasn't arrived...",
  "output": "Customer placed an order on March 1st and has not received it after 10 days."
}

When input is empty, most loaders drop the input block and pair instruction with output directly:

{
  "instruction": "Write a Python function that returns the Fibonacci sequence up to n.",
  "input": "",
  "output": "def fibonacci(n):\n    a, b = 0, 1\n    result = []\n    while a < n:\n        result.append(a)\n        a, b = b, a + b\n    return result"
}
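Under the hood, most Alpaca loaders render each record into a single prompt string. A sketch using the two templates from the original Stanford Alpaca repo (exact wording varies by trainer; render_alpaca is an invented name):

```python
# Render an Alpaca record into a training string.
# Templates follow the original Stanford Alpaca repo; trainers may differ.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def render_alpaca(record: dict) -> str:
    # Use the two-block template only when 'input' is non-empty.
    if record.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(**record) + record["output"]
    return PROMPT_NO_INPUT.format(instruction=record["instruction"]) + record["output"]
```

Note that the rendered string has no role boundaries at all, which is exactly why per-turn loss masking is impossible in this format.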

What Alpaca can't do natively:

  • Multi-turn dialogue (no concept of a "conversation")
  • Per-example system prompts (instruction doubles as system context)
  • Tool call / function call turns
  • Role-aware masking (you can't mask the human turn loss separately)

For 2026 chat models — Llama 3.3, Qwen 2.5, Mistral Small 3 — Alpaca is the wrong default.


ShareGPT Format: Structure and Strengths

ShareGPT wraps all turns in a conversations array. Each turn has a from role and a value string:

{
  "conversations": [
    {
      "from": "system",
      "value": "You are a senior Python developer. Be concise and correct."
    },
    {
      "from": "human",
      "value": "Write a Python function that returns the Fibonacci sequence up to n."
    },
    {
      "from": "gpt",
      "value": "def fibonacci(n):\n    a, b = 0, 1\n    result = []\n    while a < n:\n        result.append(a)\n        a, b = b, a + b\n    return result"
    }
  ]
}

Multi-turn adds more objects to the same array:

{
  "conversations": [
    {"from": "system", "value": "You are a helpful coding assistant."},
    {"from": "human", "value": "What does `__slots__` do in Python?"},
    {"from": "gpt", "value": "`__slots__` restricts instance attributes to a fixed set, reducing memory overhead per object."},
    {"from": "human", "value": "Give me an example with a dataclass comparison."},
    {"from": "gpt", "value": "class Point:\n    __slots__ = ('x', 'y')\n    def __init__(self, x, y):\n        self.x = x\n        self.y = y\n\n# vs @dataclass which uses __dict__ by default"}
  ]
}

Role name variants — different loaders accept different spellings. Unsloth and Axolotl both accept these by default:

| Canonical | Alternatives accepted |
|-----------|-----------------------|
| human     | user                  |
| gpt       | assistant, model      |
| system    | system                |

Stick to human / gpt / system unless your framework specifies otherwise.
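If your data mixes spellings, you can normalize roles yourself before handing the file to a stricter loader. normalize_turn is a hypothetical helper, not a framework API:

```python
# Map accepted role variants onto canonical human/gpt/system spellings.
# normalize_turn is an invented helper for illustration.
ROLE_MAP = {
    "user": "human", "human": "human",
    "assistant": "gpt", "model": "gpt", "gpt": "gpt",
    "system": "system",
}


def normalize_turn(turn: dict) -> dict:
    role = turn["from"].lower()
    if role not in ROLE_MAP:
        # Fail loudly rather than train on a turn with an unknown role
        raise ValueError(f"Unknown role: {turn['from']!r}")
    return {"from": ROLE_MAP[role], "value": turn["value"]}


print(normalize_turn({"from": "assistant", "value": "Hi"}))  # {'from': 'gpt', 'value': 'Hi'}
```

Map over every turn in every conversations list and you get a file any of the loaders below will accept.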


Converting Alpaca → ShareGPT in Python

When your dataset is in Alpaca format but your framework needs ShareGPT, use this converter. It handles both the input-present and input-empty cases:

# convert_alpaca_to_sharegpt.py
# Converts an Alpaca JSONL dataset to ShareGPT format.
# Tested on Python 3.12, datasets==2.19, transformers==4.41

import json
from pathlib import Path


def alpaca_to_sharegpt(record: dict) -> dict:
    instruction = record.get("instruction", "").strip()
    input_text = record.get("input", "").strip()
    output = record.get("output", "").strip()

    # Merge instruction + input when both are present
    human_turn = f"{instruction}\n\n{input_text}" if input_text else instruction

    return {
        "conversations": [
            {"from": "human", "value": human_turn},
            {"from": "gpt", "value": output},
        ]
    }


def convert_file(src: str, dst: str) -> None:
    src_path = Path(src)
    dst_path = Path(dst)

    converted = []
    with src_path.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            converted.append(alpaca_to_sharegpt(record))

    with dst_path.open("w") as f:
        for item in converted:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

    print(f"Converted {len(converted)} records → {dst_path}")


if __name__ == "__main__":
    convert_file("train_alpaca.jsonl", "train_sharegpt.jsonl")

Expected output:

Converted 52002 records → train_sharegpt.jsonl

If it fails:

  • KeyError: 'instruction' → your source file uses a different schema; inspect with python -c "import json; print(json.loads(open('file.jsonl').readline()).keys())"
  • UnicodeDecodeError → add encoding="utf-8" to both open() calls

Loading Each Format with Hugging Face Datasets

# load_datasets.py
# Shows how to load both formats for inspection before training.

from datasets import load_dataset

# Alpaca — flat JSON keys
alpaca_ds = load_dataset("json", data_files="train_alpaca.jsonl", split="train")
print(alpaca_ds[0].keys())  # dict_keys(['instruction', 'input', 'output'])

# ShareGPT — nested conversations list
sharegpt_ds = load_dataset("json", data_files="train_sharegpt.jsonl", split="train")
print(sharegpt_ds[0]["conversations"][0])  # {'from': 'human', 'value': '...'}
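Because ShareGPT records carry explicit turns, it is also cheap to histogram turn counts before training, which catches single-turn data masquerading as chat data. A minimal sketch (turn_histogram is an invented helper, shown on an in-memory sample):

```python
# Histogram of turns per ShareGPT record, computed before training.
from collections import Counter


def turn_histogram(records: list[dict]) -> Counter:
    """Count how many records have each number of turns."""
    return Counter(len(r["conversations"]) for r in records)


sample = [
    {"conversations": [{"from": "human", "value": "hi"},
                       {"from": "gpt", "value": "hello"}]},
    {"conversations": [{"from": "system", "value": "Be terse."},
                       {"from": "human", "value": "One?"},
                       {"from": "gpt", "value": "1"},
                       {"from": "human", "value": "Two?"},
                       {"from": "gpt", "value": "2"}]},
]
print(turn_histogram(sample))  # Counter({2: 1, 5: 1})
```

With a real dataset, pass in list(sharegpt_ds); a histogram dominated by 2-turn records tells you the "chat" data is effectively single-turn.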

Using ShareGPT with Unsloth (Llama 3.1, Qwen 2.5)

Unsloth's standardize_sharegpt converts ShareGPT's from/value turns into the role/content structure that apply_chat_template expects, normalizing role-name variants along the way. This is the full pipeline as of Unsloth 2025.11:

# unsloth_sharegpt_train.py
# Fine-tune Llama 3.1 8B on a ShareGPT dataset with Unsloth.
# (Llama 3.3 ships only as a 70B model; the 8B checkpoint is from the 3.1 line.)
# Requires: unsloth[colab-new], datasets, trl

from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template, standardize_sharegpt
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,
)

tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

dataset = load_dataset("json", data_files="train_sharegpt.jsonl", split="train")

# Convert {'from', 'value'} turns to {'role', 'content'} and validate structure
dataset = standardize_sharegpt(dataset)


def apply_template(examples):
    # apply_chat_template expects a list of conversation dicts
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize=False,       # Return strings; SFTTrainer tokenizes internally
            add_generation_prompt=False,  # Don't append assistant prefix during training
        )
        for convo in examples["conversations"]
    ]
    return {"text": texts}


dataset = dataset.map(apply_template, batched=True)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",   # Must match the key set in apply_template above
        max_seq_length=4096,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        output_dir="./outputs",
    ),
)

trainer.train()

Key parameters explained:

  • tokenize=False — returns a formatted string; SFTTrainer handles tokenization so you don't apply it twice
  • add_generation_prompt=False — during training you include the full assistant turn; only set True at inference
  • dataset_text_field="text" — must match the key name you write in apply_template

Using Alpaca with Axolotl

Axolotl still has strong Alpaca support via its alpaca dataset type. In your config.yaml:

# axolotl_alpaca_config.yaml
base_model: mistralai/Mistral-Small-3.1-24B-Instruct
model_type: MistralForCausalLM

datasets:
  - path: train_alpaca.jsonl
    ds_type: json
    type: alpaca           # Axolotl's built-in Alpaca formatter

sequence_len: 2048
val_set_size: 0.02

output_dir: ./axolotl-output
num_epochs: 3
learning_rate: 2e-5
micro_batch_size: 2
gradient_accumulation_steps: 4

For ShareGPT in Axolotl, swap type: alpaca for type: sharegpt:

datasets:
  - path: train_sharegpt.jsonl
    ds_type: json
    type: sharegpt
    conversation: llama-3   # Maps roles to Llama 3's chat template tokens

If Axolotl warns conversation not set → explicitly add conversation: chatml or the model-specific template name.


Validation: Catch Format Errors Before Training

Running a 3-hour fine-tune only to find malformed data at epoch 2 is painful. Validate first:

# validate_sharegpt.py
# Checks every record in a ShareGPT JSONL for required keys and role ordering.

import json
import sys
from pathlib import Path

VALID_ROLES = {"human", "gpt", "system", "user", "assistant", "model"}


def validate_record(record: dict, idx: int) -> list[str]:
    errors = []
    if "conversations" not in record:
        errors.append(f"[{idx}] Missing 'conversations' key")
        return errors

    convos = record["conversations"]
    if not isinstance(convos, list) or len(convos) == 0:
        errors.append(f"[{idx}] 'conversations' must be a non-empty list")
        return errors

    for turn_idx, turn in enumerate(convos):
        if "from" not in turn:
            errors.append(f"[{idx}] Turn {turn_idx} missing 'from'")
        elif turn["from"] not in VALID_ROLES:
            errors.append(f"[{idx}] Turn {turn_idx} unknown role: {turn['from']}")
        if "value" not in turn or not isinstance(turn["value"], str):
            errors.append(f"[{idx}] Turn {turn_idx} missing or non-string 'value'")

    return errors


def validate_file(path: str) -> None:
    total = 0
    all_errors = []
    with Path(path).open() as f:
        for idx, line in enumerate(f):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                all_errors.append(f"[{idx}] JSON parse error: {e}")
                continue
            all_errors.extend(validate_record(record, idx))
            total += 1

    if all_errors:
        print(f"Found {len(all_errors)} errors in {total} records:")
        for err in all_errors[:20]:   # Show first 20 to avoid wall of text
            print(" ", err)
        sys.exit(1)
    else:
        print(f"✅ {total} records valid")


if __name__ == "__main__":
    validate_file(sys.argv[1])
Run it:

python validate_sharegpt.py train_sharegpt.jsonl
# ✅ 52002 records valid

Head-to-Head: When Each Format Wins

| Scenario                          | Winner   | Reason                                                                   |
|-----------------------------------|----------|--------------------------------------------------------------------------|
| Chat assistant fine-tune          | ShareGPT | Multi-turn is native; role masking works correctly                       |
| Single-turn summarization         | Alpaca   | Simpler structure, less conversion overhead                              |
| Tool/function calling data        | ShareGPT | Tool turns map to from: tool naturally                                   |
| Legacy dataset from 2023          | Alpaca   | Already formatted; conversion adds risk with no benefit                  |
| Unsloth + Llama 3.3               | ShareGPT | standardize_sharegpt + apply_chat_template pipeline is battle-tested     |
| Axolotl + Mistral                 | Either   | Axolotl handles both natively                                            |
| Mixing system prompts per example | ShareGPT | from: system per conversation; Alpaca has one global instruction field   |
| Filtering by turn count           | ShareGPT | len(example["conversations"]) is trivial; Alpaca has no concept of turns |

What You Learned

  • Alpaca is flat and fast to set up, but it has no real multi-turn support — don't use it for chat models in 2026.
  • ShareGPT's conversations list maps cleanly to apply_chat_template, which every major framework now expects.
  • Always run a validation script before training — malformed records cause late-epoch crashes, not early ones.
  • tokenize=False + add_generation_prompt=False is the correct pairing during training; flip add_generation_prompt=True only at inference.
  • Axolotl accepts both formats via type: alpaca or type: sharegpt — no code required, just config.

Tested on Unsloth 2025.11, Axolotl 0.7, TRL 0.9, Python 3.12, CUDA 12.4, RTX 4090 (24GB VRAM)


FAQ

Q: Can I mix Alpaca and ShareGPT records in one training run?
A: No. Each framework expects one format per dataset entry. Convert everything to ShareGPT first using the script above, then concatenate the JSONL files.

Q: What's the minimum number of ShareGPT records needed for fine-tuning?
A: Quality beats quantity. 500–1,000 high-quality, diverse records typically outperform 50,000 noisy ones. Start with 1,000 and evaluate before scaling.

Q: Does ShareGPT format work with OpenAI's fine-tuning API?
A: OpenAI uses a similar but distinct format — messages with role/content keys, not conversations with from/value. Convert with: {"messages": [{"role": t["from"].replace("gpt","assistant").replace("human","user"), "content": t["value"]} for t in record["conversations"]]}.
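The inline comprehension works, but chained .replace calls are fragile: unknown roles pass through silently, and any value containing the substring "gpt" or "human" elsewhere would be a hazard if the mapping logic ever changed. A dict lookup fails loudly instead (sharegpt_to_openai is an invented helper name):

```python
# ShareGPT record -> OpenAI fine-tuning messages format.
# sharegpt_to_openai is a hypothetical helper, not an OpenAI SDK function.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def sharegpt_to_openai(record: dict) -> dict:
    # A KeyError on an unexpected role is a feature: bad data fails fast.
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in record["conversations"]
        ]
    }


record = {"conversations": [{"from": "human", "value": "Hi"},
                            {"from": "gpt", "value": "Hello!"}]}
print(sharegpt_to_openai(record)["messages"][0])  # {'role': 'user', 'content': 'Hi'}
```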

Q: How do I handle tool call turns in ShareGPT?
A: Add turns with "from": "tool" and put the tool result in "value". Unsloth and LLaMA-Factory both support this. Axolotl requires conversation: tool_use in the dataset config.
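As an illustration only (get_weather is an invented tool, and the exact string the assistant emits for a tool call depends on the model's chat template), a tool-use record might look like:

```json
{
  "conversations": [
    {"from": "system", "value": "You may call get_weather(city)."},
    {"from": "human", "value": "What's the weather in Oslo?"},
    {"from": "gpt", "value": "{\"name\": \"get_weather\", \"arguments\": {\"city\": \"Oslo\"}}"},
    {"from": "tool", "value": "{\"temp_c\": 4, \"condition\": \"rain\"}"},
    {"from": "gpt", "value": "It's about 4°C and raining in Oslo."}
  ]
}
```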

Q: Does the input field in Alpaca have a cost at inference?
A: It adds tokens to the prompt, so yes — longer prompts cost more on hosted APIs (OpenAI, Anthropic, etc., priced in USD per million tokens). In ShareGPT you can include the same context inside a human turn and control token usage more precisely.