Transfer Learning with ResNet and EfficientNet: 95% Accuracy on 500 Images

Use transfer learning to achieve high accuracy on custom classification tasks with small datasets — correct layer freezing strategy, feature extraction vs. fine-tuning, learning rate selection, and aggressive data augmentation.

You have 500 labeled images and need a classifier. Training from scratch will fail. Transfer learning will give you 95% accuracy.

Your GPU is bored. It’s used to crunching billions of tokens or churning through ImageNet-1k, not your paltry collection of 500 cat/dog/bird/whatever images. If you try to train a CNN from random weights on this, you’ll get a model that memorizes noise and fails spectacularly on anything new. This isn't a limitation of your skill; it's basic math. A modern convolutional layer has thousands of parameters hungry for patterns that simply don't exist in 500 samples.

The escape hatch is transfer learning. You’re not starting from scratch; you’re starting from a model that has already seen 1.28 million images and learned universal features like edges, textures, and shapes. Your job is to repurpose that pre-trained knowledge. In practice, transfer learning cuts the labeled data you need by one to two orders of magnitude compared with training from scratch. With the right strategy, hitting 95% accuracy is not just possible—it’s expected.

Let’s stop the guesswork and build something that works.

Choosing Your Architectural Backbone: ResNet, EfficientNet, or ViT?

Your first decision is which pre-trained model to use as your feature extractor. This isn't a philosophical choice; it's a trade-off between accuracy, speed, and compatibility with your tiny dataset.

| Model | ImageNet Top-1 Acc. | Parameters | Best For... | Pitfall with Small Data |
| --- | --- | --- | --- | --- |
| ResNet-50 | 76.1% | 25M | Reliability, extensive tutorials, stable training | Lower baseline accuracy |
| EfficientNet-B4 | 82.9% | 19M | Higher accuracy with fewer params, good FLOPs efficiency | Slightly more fragile tuning; can overfit faster |
| ViT-B/16 | ~80–85%* | 86M | State-of-the-art potential with huge data | Will underperform badly on 500 images without major regularization |

*ViT performance is highly dependent on dataset size. The original ViT paper (Dosovitskiy et al., ICLR 2021) found that Vision Transformers only surpass comparable CNNs when pre-trained on roughly 300M images (JFT-300M).

For your 500-image project, the choice is clear: Use a CNN. ViTs dominate today's large-scale benchmark leaderboards, but they are data-hungry beasts. A ResNet or EfficientNet provides a dense, spatially-aware feature map that is far more sample-efficient for transfer learning.

Verdict: Start with EfficientNet-B4. It gives you more accuracy for fewer parameters than ResNet-50, which is crucial when data is limited. We'll use the timm library, a treasure trove of pre-trained models.

import torch
import timm
import torch.nn as nn

def get_pretrained_backbone(model_name='efficientnet_b4', num_classes=10,
                            pretrained=True, freeze_backbone=True):
    """
    Fetches a pre-trained model, optionally freezes it, and attaches a fresh classifier head.
    """
    # Create model from timm; num_classes=0 drops the original classifier head
    model = timm.create_model(model_name, pretrained=pretrained, num_classes=0)

    # Phase 1 default: freeze the backbone so only the new head trains
    if freeze_backbone:
        for param in model.parameters():
            param.requires_grad = False

    # Get the pooled feature dimension
    num_features = model.num_features  # e.g., 1792 for efficientnet_b4

    # Create a new sequential head
    classifier_head = nn.Sequential(
        nn.Linear(num_features, 512),
        nn.ReLU(),
        nn.Dropout(0.3),  # Immediate regularization for small data
        nn.Linear(512, num_classes)
    )

    # Wrap it in a simple container model
    class TransferModel(nn.Module):
        def __init__(self, backbone, head):
            super().__init__()
            self.backbone = backbone
            self.head = head

        def forward(self, x):
            features = self.backbone(x)
            return self.head(features)

    return TransferModel(model, classifier_head)


model = get_pretrained_backbone(model_name='efficientnet_b4', num_classes=3)
print(f"Model ready. Backbone features: {model.backbone.num_features}")

The Fine-Tuning Fork in the Road: Feature Extraction vs. Full Fine-Tuning

You have two paths, and choosing wrong means wasted hours.

  1. Feature Extraction (Freeze the backbone): Lock all the pre-trained layers. Only train the new classifier head you just attached. This is fast, stable, and prevents catastrophic forgetting of useful features. Use this as your mandatory first step to get a stable baseline.
  2. Full Fine-Tuning (Unfreeze some/all): After the head is trained, you can unfreeze some backbone layers to let them adapt specifically to your dataset. This can boost performance but is the fast track to overfitting if done carelessly.

The Decision Framework: For 500 images, your default plan should be: Step 1: Feature Extraction (freeze all backbone layers, train only the head for ~20 epochs). Step 2: If validation accuracy plateaus, proceed to cautious fine-tuning.
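Step 1 is only a few lines of PyTorch. A minimal sketch, assuming a TransferModel-style wrapper with `backbone` and `head` attributes as defined earlier (the helper name `freeze_backbone` is ours):

```python
import torch.nn as nn

def freeze_backbone(model: nn.Module) -> int:
    """Freeze all backbone parameters so only the head trains (feature extraction).

    Returns the number of remaining trainable parameters as a sanity check."""
    for param in model.backbone.parameters():
        param.requires_grad = False
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Printing the returned count before training is a cheap sanity check: for the head above it should be on the order of a million parameters (1792×512 + 512×num_classes plus biases), not the full 19M.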

A Layer Freezing Strategy That Actually Works

The classic beginner mistake is to unfreeze the entire model and set a global learning rate. This is like taking a master painter, shaking their arm violently, and asking them to add a tiny detail. You'll destroy the pre-trained knowledge.

Here is a phased strategy:

  1. Phase 1 - Freeze & Train the Head: Freeze every backbone parameter (requires_grad = False) and train only the new head with a relatively high LR (e.g., 1e-3).
  2. Phase 2 - Unfreeze & Differential Learning Rates: Unfreeze the backbone layers but apply a much smaller learning rate to them than to the head. The earlier the layer (e.g., edge detectors), the less it should change.
from torch.optim import AdamW

# After Phase 1, prepare for Phase 2 fine-tuning
def prepare_for_fine_tuning(model, base_lr=1e-3, backbone_lr_factor=0.1):
    """
    Sets up parameter groups for differential learning rates.
    """
    # Unfreeze the backbone
    for param in model.backbone.parameters():
        param.requires_grad = True

    # Group parameters
    backbone_params = []
    head_params = []
    for name, param in model.named_parameters():
        if param.requires_grad:
            if 'backbone' in name:
                backbone_params.append(param)
            else:
                head_params.append(param)

    # Create optimizer with different LRs
    optimizer = AdamW([
        {'params': backbone_params, 'lr': base_lr * backbone_lr_factor},
        {'params': head_params, 'lr': base_lr}
    ])
    return optimizer

optimizer = prepare_for_fine_tuning(model, base_lr=1e-4, backbone_lr_factor=0.1)
print("Optimizer ready with differential LR: Head LR=1e-4, Backbone LR=1e-5")

Why Your Learning Rate is Probably Wrong

If you take the standard lr=1e-3 from a CIFAR-10 tutorial and apply it to an unfrozen EfficientNet backbone, you will destroy its carefully pre-trained weights. For fine-tuning pre-trained layers, especially early ones, even 1e-4 is often too high.

  • For the new, randomly initialized head: Start with lr=1e-3. It needs to learn quickly.
  • For the pre-trained backbone layers during fine-tuning: Start with lr=1e-5 or even lr=1e-6. You are making subtle refinements, not relearning from scratch.

Real Error & Fix: Training Plateaus After Epoch 5

  • Symptoms: Val accuracy stops improving early. Loss flatlines.
  • Likely Cause: Learning rate is too high for the fine-tuning stage, causing noisy, destabilizing updates that don't converge.
  • Exact Fix: Use CosineAnnealingLR with T_max=total_epochs. This smoothly decays the LR to zero, helping convergence. Also, re-check your base LR and reduce it.
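A minimal sketch of that fix with torch.optim.lr_scheduler.CosineAnnealingLR. The Linear model here is a stand-in for the real transfer model, and the epoch count is a placeholder:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 3)  # stand-in for the real transfer model
total_epochs = 30

optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=total_epochs)

lrs = []
for epoch in range(total_epochs):
    optimizer.step()   # placeholder for the real per-batch training steps
    scheduler.step()   # decay the LR once per epoch
    lrs.append(optimizer.param_groups[0]["lr"])
```

With T_max equal to the total epoch count, the LR follows one smooth cosine curve from 1e-4 down to (near) zero, so the final epochs make only tiny, stabilizing updates.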

Data Augmentation: Your 500 Images are Now 50,000

This is non-negotiable. You must artificially expand your dataset. Albumentations is the tool of choice here for speed and flexibility.

import albumentations as A
from albumentations.pytorch import ToTensorV2

def get_train_transforms(img_size=224):
    return A.Compose([
        A.RandomResizedCrop(size=(img_size, img_size), scale=(0.8, 1.0)),  # Albumentations <1.4 uses height=/width= instead of size=
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.2),
        A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.05, rotate_limit=15, p=0.5),
        # CutMix or MixUp are applied later in the training loop, not here.
        A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ToTensorV2(),
    ])

def get_val_transforms(img_size=224):
    return A.Compose([
        A.Resize(height=img_size, width=img_size),
        A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ToTensorV2(),
    ])

For small datasets, go beyond basic flips and crops. MixUp and CutMix are "advanced augmentation" techniques that blend images and labels, acting as powerful regularizers. They make your model less confident on single samples and more robust.
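MixUp itself is only a few lines: blend pairs of images and their one-hot labels with a Beta-sampled coefficient. A minimal sketch (mixup_batch is our own helper name; timm also ships a ready-made timm.data.Mixup if you prefer):

```python
import torch
import torch.nn.functional as F

def mixup_batch(x, y, num_classes, alpha=0.2):
    """Blend each image with a shuffled partner; labels become soft mixtures."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_soft = F.one_hot(y, num_classes).float()
    y_mixed = lam * y_soft + (1 - lam) * y_soft[perm]
    return x_mixed, y_mixed
```

Train on (x_mixed, y_mixed) with a loss that accepts soft targets; nn.CrossEntropyLoss does since PyTorch 1.10.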

Battling Class Imbalance: More Than Just Weighted Loss

If your 500 images are split 400/80/20, your model will become very good at predicting the first class and ignore the others.

  1. Weighted Loss: The first line of defense. Calculate class weights inversely proportional to their frequency and pass them to nn.CrossEntropyLoss.
    # Suppose class counts: [400, 80, 20]
    class_counts = torch.tensor([400., 80., 20.])
    class_weights = 1. / class_counts
    class_weights = class_weights / class_weights.sum() * len(class_counts) # Normalize
    criterion = nn.CrossEntropyLoss(weight=class_weights)
    
  2. Oversampling with Augmentation: Use a weighted sampler (like WeightedRandomSampler) to ensure each batch has a balanced number of samples from each class. The minority class images will be seen more often, each time with different augmentations.
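A sketch of that sampler setup, assuming you have a flat list of integer class labels for the training set (make_balanced_sampler is our own helper name):

```python
import torch
from torch.utils.data import WeightedRandomSampler

def make_balanced_sampler(labels):
    """Sample minority-class images more often so batches are roughly balanced."""
    labels = torch.as_tensor(labels)
    class_counts = torch.bincount(labels)
    # Each sample is weighted by the inverse frequency of its class
    sample_weights = (1.0 / class_counts.float())[labels]
    return WeightedRandomSampler(sample_weights,
                                 num_samples=len(labels),
                                 replacement=True)
```

Pass it to DataLoader(..., sampler=sampler) and drop shuffle=True; the two arguments are mutually exclusive.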

Real Error & Fix: Overfitting with 98% Train / 62% Val Accuracy

  • Symptoms: Near-perfect training accuracy, dismal validation accuracy.
  • Likely Cause: Model capacity is too high for the data, or regularization is insufficient.
  • Exact Fix: 1) Add/Increase Dropout(0.3–0.5) in your classifier head. 2) Ramp up your data augmentation (add MixUp). 3) Apply weight decay (1e-4) to the optimizer.

Evaluation: Moving Beyond "95% Accuracy"

Hitting a number is good. Understanding why is professional. Accuracy hides sins.

  1. Per-Class Accuracy: Use TorchMetrics. If your "95% accuracy" comes from perfect performance on the majority class and 70% on a minor class, you have a problem.
    from torchmetrics import Accuracy
    # Use a 'multiclass' accuracy with num_classes argument
    metric = Accuracy(task="multiclass", num_classes=3, average=None) # Returns accuracy per class
    
  2. Confusion Matrix: Visualize where your model is confusing classes. This directly informs your next step—maybe you need more specific augmentation for two similar classes.
  3. Grad-CAM Visualization: This tells you what your model is looking at. Use it to debug false positives. Is your "wolf" classifier activating on snow backgrounds because all your wolf pictures have snow? Grad-CAM will show you.
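If you would rather avoid an extra dependency, per-class accuracy and the confusion matrix are a few lines of plain PyTorch (these helper names are ours):

```python
import torch

def confusion_matrix(preds, targets, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = torch.zeros(num_classes, num_classes, dtype=torch.long)
    for t, p in zip(targets.tolist(), preds.tolist()):
        cm[t, p] += 1
    return cm

def per_class_accuracy(cm):
    """Diagonal over row sums; rows with no samples yield 0 instead of NaN."""
    return cm.diag().float() / cm.sum(dim=1).clamp(min=1).float()
```

Reading the matrix row by row shows exactly which class pairs the model confuses, which is the information accuracy alone hides.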

Next Steps: From Working Model to Robust Pipeline

You now have a strategy to get to 95%. To move from a notebook experiment to a reliable solution:

  1. Automate the Workflow: Use PyTorch Lightning. It structures your code into LightningModule and DataModule, making the training loop, checkpointing, and logging someone else's problem. It seamlessly integrates with the differential LR and freezing strategies we discussed.
  2. Experiment Tracking: Don't just tweak and hope. Log your hyperparameters (LR, augmentation strength, dropout) and metrics for each run. Tools like Weights & Biases or TensorBoard are essential.
  3. Push the Regularization Envelope: If you're still borderline overfitting, add label smoothing to your loss or employ knowledge distillation using a larger model as a "teacher." Distillation often retains most of the teacher's accuracy at a fraction of its size, which could let you use a smaller, faster model in production.
  4. Optimize for Deployment: Use PEFT (Parameter-Efficient Fine-Tuning) methods like LoRA to fine-tune only a tiny subset of parameters. This can match full fine-tuning performance while keeping checkpoint sizes small. For ultimate speed, trace your model with torch.jit or convert to ONNX.
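The label smoothing mentioned in step 3 is a one-line change with nn.CrossEntropyLoss (supported since PyTorch 1.10). A quick sketch showing its effect on a confidently correct prediction:

```python
import torch
import torch.nn as nn

# label_smoothing spreads a little probability mass over non-target classes,
# penalizing over-confident predictions even when they are correct.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.tensor([[10.0, 0.0, 0.0]])  # very confident, correct
target = torch.tensor([0])

loss_smooth = criterion(logits, target)
loss_plain = nn.CrossEntropyLoss()(logits, target)
```

The smoothed loss never reaches zero on confident predictions, which keeps the gradient signal alive and acts as a mild regularizer.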

Your 500-image problem is solved. The process is no longer alchemy—it's engineering. You start with a strong, pre-trained backbone, you train cautiously with heavy regularization, and you validate thoroughly. Now go make that bored GPU earn its keep.