CL^YC^BLE
warming up...
CL^YC^BLE
warming up...
A deep dive into Low-Rank Adaptation — how it works mathematically, what rank and alpha actually mean, when to use QLoRA vs LoRA, and how to get real results from a small dataset.
A base language model is a giant lookup table of probability distributions — it has learned, from trillions of tokens, which words tend to follow which other words across every topic humans have written about. What it hasn't learned is how your domain talks, what your terminology means, or what style your outputs should take. Fine-tuning is the process of continuing that learning on your data.
Classic fine-tuning updates every weight in the model. For a 7-billion-parameter model with 32-bit weights, that's 28 GB of parameters, each needing a gradient and an optimizer state — easily 3–4× memory overhead during training. That ruled out fine-tuning for anyone without a warehouse of A100s.
LoRA changes the game by asking a different question: do we need to update every weight?
LoRA is based on a hypothesis, backed by empirical results: the weight changes needed to adapt a pre-trained model to a new task live in a low-dimensional subspace. In practice, this means that even though a weight matrix might be 4096×4096 (16.7M values), the update to that matrix can be expressed as the product of two much smaller matrices.
Formally, instead of learning a full update ΔW to an existing weight matrix W ∈ ℝ^(d×k), LoRA constrains the update to:
ΔW = B · Awhere A ∈ ℝ^(r×k) and B ∈ ℝ^(d×r), and r (the rank) is much smaller than both d and k. The original matrix W stays frozen — only A and B are trained.
During training, A is initialised with a random Gaussian and B with zeros — so ΔW = 0 at the start and the model behaves identically to the base model before any training step. This is important: it means the adapter initialises as a no-op and learns from there, rather than introducing noise upfront.
Rank is the most important hyperparameter in LoRA. It controls how many dimensions of change you give the adapter to work with.
A common mistake is picking rank = 64 "to be safe." Higher rank is not safer — it's just more parameters that can memorise noise. Start at r = 16 and go lower if you have a small, clean dataset.
Alpha (lora_alpha) is a scaling factor applied to the adapter output before it's added to the base model: output = W·x + (alpha/r) · B·A·x
The scaling factor is alpha / r, not just alpha. This means if you double the rank, you halve the effective learning signal from the adapter — you'd need to double alpha to compensate. A common convention is to set alpha = 2 * r (e.g. r = 16, alpha = 32), which keeps the adapter's influence constant regardless of rank.
Another convention is alpha = r (the scaling factor becomes 1.0 — no scaling). Both work. What matters is understanding that alpha controls how much the adapter shifts the model's behaviour. Crank alpha too high and the adapter overwhelms the base model's general reasoning ability. Too low and the adapter barely registers.
In a transformer, LoRA is typically applied to the attention matrices: q_proj, k_proj, v_proj, and o_proj. Some configurations also add the MLP layers (gate_proj, up_proj, down_proj).
For most fine-tuning tasks, targeting only the attention layers is sufficient. Adding MLP layers increases adapter size and training time for marginal gains unless you're doing heavy knowledge injection. A reasonable default:
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]QLoRA (Quantized LoRA) applies LoRA on top of a 4-bit quantized base model. The base model is loaded in NF4 (4-bit NormalFloat), reducing its memory footprint by 4×. The LoRA adapters themselves are still trained in 16-bit — only the frozen base model is quantized.
Practical implications:
If you're fine-tuning via a managed service (Together.ai, Modal, Replicate), VRAM isn't your problem — the provider handles it. Use full LoRA (16-bit) for best quality.
Here's what actually happens when you kick off a LoRA run:
The adapter file is tiny relative to the base model. This is why LoRA is also attractive for multi-tenant serving: you can load one base model and swap adapters per-request with very low overhead.
LoRA adapters are sensitive to learning rate. Common pitfalls:
Epochs: for small datasets (<1000 samples), 3–5 epochs is typical. More epochs on a small dataset causes memorisation — the model starts repeating your exact training examples verbatim. Use a validation split and watch eval loss; stop when it stops decreasing.
A well-trained LoRA adapter gives you a model that speaks in your domain's register, uses your terminology correctly, follows your output format reliably, and handles the specific tasks in your training set with high consistency. What it does not give you: new factual knowledge (hallucinates on facts not in training), reasoning ability beyond the base model, or robustness to inputs wildly different from training.
Fine-tuned models are specialists, not generalists. The art is knowing what to specialise on.
The next guide covers what makes a fine-tuning dataset actually work — because rank and alpha won't save you from bad training data.