What is LoRA Fine-tuning? A Complete Guide

A deep dive into Low-Rank Adaptation — how it works mathematically, what rank and alpha actually mean, when to use QLoRA vs LoRA, and how to get real results from a small dataset.

May 16, 2026·14 min read

What is fine-tuning, really?

A base language model is, loosely, a giant compressed table of probability distributions: it has learned, from trillions of tokens, which tokens tend to follow which contexts across almost every topic humans have written about. What it hasn't learned is how your domain talks, what your terminology means, or what style your outputs should take. Fine-tuning is the process of continuing that learning on your data.

Classic fine-tuning updates every weight in the model. For a 7-billion-parameter model with 32-bit weights, that's 28 GB of parameters, and each parameter also needs a gradient and optimizer state, so training memory easily reaches 3–4× the size of the weights before activations are even counted. That ruled out fine-tuning for anyone without a warehouse of A100s.
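The arithmetic behind that estimate can be sketched in a few lines. This is a rough calculation, not a profiler: it assumes 32-bit values throughout and an Adam-style optimizer that keeps two extra values per weight, and it ignores activation memory. The function name is made up for illustration.

```python
# Back-of-the-envelope memory for full fine-tuning.
# Assumptions (illustrative): fp32 everywhere, Adam-style optimizer with
# two state values per weight, activations ignored.
BYTES_FP32 = 4

def full_finetune_memory_gb(n_params: float) -> dict:
    weights = n_params * BYTES_FP32 / 1e9
    gradients = weights          # one gradient per weight
    optimizer = 2 * weights      # two moment estimates per weight
    return {
        "weights_gb": weights,
        "gradients_gb": gradients,
        "optimizer_gb": optimizer,
        "total_gb": weights + gradients + optimizer,
    }

mem = full_finetune_memory_gb(7e9)
print(mem)
```

Under these assumptions a 7B model comes to 28 GB of weights and 112 GB in total, four times the weights alone.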

LoRA changes the game by asking a different question: do we need to update every weight?

The core idea: low-rank adaptation

LoRA is based on a hypothesis, backed by empirical results: the weight changes needed to adapt a pre-trained model to a new task live in a low-dimensional subspace. In practice, this means that even though a weight matrix might be 4096×4096 (about 16.8M values), the update to that matrix can be expressed as the product of two much smaller matrices.

Formally, instead of learning a full update ΔW to an existing weight matrix W ∈ ℝ^(d×k), LoRA constrains the update to:

ΔW = B · A

where A ∈ ℝ^(r×k) and B ∈ ℝ^(d×r), and r (the rank) is much smaller than both d and k. The original matrix W stays frozen — only A and B are trained.
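To see how much smaller the two factors are, here is a quick count of trainable values for one 4096×4096 matrix. The rank r = 16 is just an example value, and the helper name is hypothetical.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable values in B (d x r) plus A (r x k)."""
    return d * r + r * k

d, k, r = 4096, 4096, 16          # r = 16 is an example rank
full = d * k                      # values in a full update to W
lora = lora_param_count(d, k, r)  # values in B and A combined

print(full, lora, f"{lora / full:.2%}")  # 16777216 131072 0.78%
```

At this rank the adapter trains under 1% of the values a full update to the same matrix would need.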

At initialisation, A is filled with random Gaussian values and B with zeros, so ΔW = 0 at the start and the model behaves identically to the base model before any training step. This is important: the adapter starts as a no-op and learns from there, rather than introducing noise upfront.
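The no-op property is easy to check on a toy example. The sketch below uses plain Python lists rather than a tensor library; the tiny shapes and the Gaussian standard deviation of 0.02 are arbitrary choices for illustration.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

random.seed(0)
d, k, r = 4, 3, 2  # toy sizes

A = [[random.gauss(0.0, 0.02) for _ in range(k)] for _ in range(r)]  # r x k, random Gaussian
B = [[0.0] * r for _ in range(d)]                                    # d x r, all zeros

delta_W = matmul(B, A)  # d x k update; every entry is zero because B is zero
print(delta_W)
```

Because B is all zeros, the product is zero whatever values A holds, so adding ΔW to W changes nothing until the first gradient step moves B.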

What rank actually means

Rank is the most important hyperparameter in LoRA. It controls how many dimensions of change you give the adapter to work with.

r = 4 or r = 8: tiny adapters, fast training, low memory. Good for stylistic changes (tone, format, phrasing) or narrow domain vocabulary. Typical for instruction-following tasks.
r = 16 or r = 32: more expressive. Can learn new facts, change reasoning patterns, adapt to complex multi-turn dialogue. Good default starting point.
r = 64+: rarely needed. The adapter starts to overfit easily and you lose the efficiency benefits of LoRA. If you need this much expressiveness, full fine-tuning may be more appropriate.

A common mistake is picking rank = 64 "to be safe." Higher rank is not safer — it's just more parameters that can memorise noise. Start at r = 16 and go lower if you have a small, clean dataset.

What alpha does

Alpha (lora_alpha) is a scaling factor applied to the adapter output before it's added to the base model:

output = W·x + (alpha/r) · B·A·x

The scaling factor is alpha / r, not just alpha. This means if you double the rank while leaving alpha alone, you halve the scale applied to the adapter's output, and you'd need to double alpha to compensate. A common convention is to set alpha = 2 * r (e.g. r = 16, alpha = 32), which keeps the scaling factor fixed at 2 regardless of rank.

Another convention is alpha = r (the scaling factor becomes 1.0 — no scaling). Both work. What matters is understanding that alpha controls how much the adapter shifts the model's behaviour. Crank alpha too high and the adapter overwhelms the base model's general reasoning ability. Too low and the adapter barely registers.
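A few concrete combinations make the interaction between rank and alpha easier to see. The helper below is a hypothetical one-liner, and the (r, alpha) pairs are example values only.

```python
def lora_scale(alpha: float, r: int) -> float:
    """Multiplier applied to B·A·x before it is added to W·x."""
    return alpha / r

for r, alpha in [(8, 16), (16, 32), (32, 32), (32, 64)]:
    print(f"r={r:>2}  alpha={alpha:>2}  scale={lora_scale(alpha, r)}")
```

Going from r = 16 to r = 32 with alpha held at 32 drops the scale from 2.0 to 1.0; raising alpha to 64 restores it.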

Which layers get adapted?

In a transformer, LoRA is typically applied to the attention matrices: q_proj, k_proj, v_proj, and o_proj. Some configurations also add the MLP layers (gate_proj, up_proj, down_proj).

For most fine-tuning tasks, targeting only the attention layers is sufficient. Adding MLP layers increases adapter size and training time for marginal gains unless you're doing heavy knowledge injection. A reasonable default:

target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
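To put rough numbers on the attention-only versus attention-plus-MLP trade-off, the sketch below counts adapter parameters per target set. The layer shapes (hidden size 4096, MLP width 11008, 32 layers) are assumptions typical of a 7B-class model, not values taken from any specific checkpoint, and the file-size figure assumes 2 bytes per value.

```python
HIDDEN, INTERMEDIATE, LAYERS = 4096, 11008, 32  # assumed 7B-class shapes

# (input_features, output_features) of each projection in one transformer block
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def adapter_params(target_modules, r):
    """Total trainable values: r * (in + out) per adapted matrix, over all layers."""
    per_layer = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_layer * LAYERS

attention_only = ["q_proj", "k_proj", "v_proj", "o_proj"]
with_mlp = attention_only + ["gate_proj", "up_proj", "down_proj"]

for name, modules in [("attention only", attention_only), ("attention + MLP", with_mlp)]:
    n = adapter_params(modules, r=16)
    print(f"{name}: {n:,} params, ~{n * 2 / 1e6:.0f} MB at 16-bit")
```

Under these assumptions, attention-only at r = 16 comes to about 16.8M trainable values (~34 MB), and adding the MLP projections raises that to about 40M (~80 MB).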

LoRA vs QLoRA

QLoRA (Quantized LoRA) applies LoRA on top of a 4-bit quantized base model. The base model is loaded in NF4 (4-bit NormalFloat), which cuts its memory footprint roughly 4× compared with 16-bit weights. The LoRA adapters themselves are still trained in 16-bit; only the frozen base model is quantized.

Practical implications:

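Use LoRA when you have a GPU with enough VRAM to load the full 16-bit base model. A 7B model at 16-bit needs ~14 GB for its weights alone. A single 80 GB H100 can handle models up to roughly 30B this way; a 70B model needs ~140 GB at 16-bit, so it calls for multiple GPUs or QLoRA.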
Use QLoRA when VRAM is the bottleneck. A 4-bit 7B model fits in ~6 GB, making fine-tuning accessible on a 3090 or even a T4. Quality is slightly lower than LoRA but often indistinguishable on practical tasks.
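The weight-memory figures above follow from simple arithmetic. This sketch covers the frozen base weights only; adapters, activations, optimizer state, and quantization metadata all come on top, which is presumably where the gap between the raw 4-bit figure and the ~6 GB quoted above comes from. The function name is illustrative.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9

n = 7e9  # a 7B-parameter model
print(weight_memory_gb(n, 16))  # 14.0
print(weight_memory_gb(n, 4))   # 3.5
```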

If you're fine-tuning via a managed service (Together.ai, Modal, Replicate), VRAM isn't your problem: the provider handles it. Use standard 16-bit LoRA for best quality.

The training loop, concretely

Here's what actually happens when you kick off a LoRA run:

Base model weights are loaded and frozen — no gradients computed for them.
Tiny A and B matrices are injected into the target layers and initialised (A: random Gaussian, B: zeros).
Each training step: forward pass through the (frozen) base + adapters → compute loss against your labels → backprop → update only A and B.
At the end: you have a small adapter file (typically 10–200 MB for a 7B model at r = 16) that can be merged into the base model or loaded separately for inference.
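The steps above can be sketched end to end on a toy problem. This is not how you would train a real model: it uses a 2×2 weight matrix, hand-written gradients in place of a framework's autograd, full-batch gradient descent, and a learning rate chosen for the toy problem only. All names, sizes, and data are illustrative assumptions.

```python
import copy
import random

random.seed(0)
d, k, r, alpha = 2, 2, 1, 1.0
scale = alpha / r

W = [[1.0, 0.0], [0.0, 1.0]]      # frozen base weights (d x k)
W_before = copy.deepcopy(W)       # kept only to confirm W never changes
A = [[random.gauss(0.0, 0.1) for _ in range(k)] for _ in range(r)]  # r x k, Gaussian
B = [[0.0] * r for _ in range(d)]                                    # d x r, zeros

# Toy task: targets are W x plus a rank-1 shift (u v^T) x the adapter must learn.
u, v = [1.0, -1.0], [0.5, 0.5]

def target(x):
    vx = sum(v[m] * x[m] for m in range(k))
    return [sum(W_before[i][m] * x[m] for m in range(k)) + u[i] * vx for i in range(d)]

data = [(x, target(x)) for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]

def forward(x):
    h = [sum(A[j][m] * x[m] for m in range(k)) for j in range(r)]   # A x
    y = [sum(W[i][m] * x[m] for m in range(k))                      # W x
         + scale * sum(B[i][j] * h[j] for j in range(r))            # + scale * B (A x)
         for i in range(d)]
    return h, y

def loss():
    return 0.5 * sum((yi - ti) ** 2 for x, t in data for yi, ti in zip(forward(x)[1], t))

initial_loss = loss()
lr = 0.05  # toy value, unrelated to real LoRA learning rates
for step in range(500):
    grad_A = [[0.0] * k for _ in range(r)]
    grad_B = [[0.0] * r for _ in range(d)]
    for x, t in data:
        h, y = forward(x)
        err = [yi - ti for yi, ti in zip(y, t)]
        for i in range(d):
            for j in range(r):
                grad_B[i][j] += scale * err[i] * h[j]
        for j in range(r):
            back = sum(B[i][j] * err[i] for i in range(d))  # error pushed back through B
            for m in range(k):
                grad_A[j][m] += scale * back * x[m]
    # Update only A and B; W receives no gradient and is never modified.
    for j in range(r):
        for m in range(k):
            A[j][m] -= lr * grad_A[j][m]
    for i in range(d):
        for j in range(r):
            B[i][j] -= lr * grad_B[i][j]

final_loss = loss()
print(initial_loss, final_loss, W == W_before)
```

The loss starts at 1.5 because the adapter is a no-op at initialisation, then falls as A and B pick up the rank-1 shift while W stays exactly as loaded.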

The adapter file is tiny relative to the base model. This is why LoRA is also attractive for multi-tenant serving: you can load one base model and swap adapters per-request with very low overhead.
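Merging and loading separately give the same outputs, which a small worked example can confirm. The matrices below are arbitrary toy values, and the helpers are plain-Python stand-ins for tensor operations.

```python
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

r, alpha = 1, 2.0
scale = alpha / r
W = [[1.0, 2.0], [3.0, 4.0]]   # base weights, d x k
B = [[0.5], [-1.0]]            # d x r
A = [[2.0, 1.0]]               # r x k
x = [1.0, 3.0]

# Adapter kept separate: W x + scale * B (A x)
Ax = matvec(A, x)
separate = [wx + scale * bax for wx, bax in zip(matvec(W, x), matvec(B, Ax))]

# Adapter merged once: W' = W + scale * B A, then a single matrix product per request
BA = matmul(B, A)
W_merged = [[w + scale * ba for w, ba in zip(w_row, ba_row)] for w_row, ba_row in zip(W, BA)]
merged = matvec(W_merged, x)

print(separate, merged)  # [12.0, 5.0] [12.0, 5.0]
```

The trade-off is that a merged matrix costs nothing extra at inference but is tied to one adapter, while the separate form keeps W shared and pays for two small extra products per adapted layer.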

Learning rate and epochs

LoRA adapters are sensitive to learning rate. Common pitfalls:

LR too high (e.g. 3e-4): adapter diverges or overshoots — output becomes incoherent or repetitive.
LR too low (e.g. 1e-6): adapter barely learns — no meaningful behaviour change.
Good range: 5e-5 to 2e-4 with a cosine scheduler and 5–10% warmup steps.
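For readers who haven't seen one written out, a cosine schedule with warmup looks roughly like this. It is a minimal sketch assuming linear warmup and decay all the way to zero; training libraries ship their own schedulers, whose details may differ. The function name, the peak of 1e-4, and the 1000-step run are example values.

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float, warmup_frac: float = 0.05) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=1000, peak_lr=1e-4) for s in range(1000)]
print(schedule[0], max(schedule), schedule[-1])
```

The rate climbs to its peak over the first 5% of steps, then decays smoothly, so the largest updates happen only after the early, noisy steps have passed.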

Epochs: for small datasets (<1000 samples), 3–5 epochs is typical. Training for more epochs on a small dataset leads to memorisation: the model starts repeating your exact training examples verbatim. Use a validation split and watch eval loss; stop when it stops decreasing.
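The "stop when eval loss stops decreasing" rule can be written as a small helper. This is one simple variant, assuming you record one eval loss per epoch and keep a checkpoint for each; the function name and the loss values are made up purely to illustrate the logic.

```python
def best_epoch(eval_losses, patience=1):
    """Index of the epoch to keep. Stops scanning once eval loss has failed
    to improve for `patience` consecutive epochs."""
    best_idx, best, bad = 0, float("inf"), 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best_idx, best, bad = i, loss, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_idx

history = [1.92, 1.41, 1.30, 1.33, 1.39]  # made-up eval losses, one per epoch
print(best_epoch(history))  # 2, i.e. keep the checkpoint from the third epoch
```

A larger patience tolerates a noisy eval curve at the cost of a few extra epochs of compute.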

What you actually get out of it

A well-trained LoRA adapter gives you a model that speaks in your domain's register, uses your terminology correctly, follows your output format reliably, and handles the specific tasks in your training set with high consistency. What it does not give you: reliable new factual knowledge (the model still hallucinates on facts its training data doesn't cover), reasoning ability beyond the base model, or robustness to inputs wildly different from training.

Fine-tuned models are specialists, not generalists. The art is knowing what to specialise on.

A practical checklist

Start with a base model that's already good at your task type (instruction-tuned, not raw pretrain)
r = 16, alpha = 32 as your first run — tune from there
LR = 2e-4 with cosine decay and 5% warmup
Target: q, k, v, o projections only
Dataset: 100–500 high-quality pairs beat 5,000 mediocre ones
3 epochs, watch eval loss, and stop early if it starts rising
QLoRA only if you're VRAM-constrained; otherwise use 16-bit LoRA
Merge the adapter before deploying for inference speed, unless you plan to swap adapters per request on a shared base model

The next guide covers what makes a fine-tuning dataset actually work — because rank and alpha won't save you from bad training data.
