CL^YC^BLE
Why 300 great pairs beat 3,000 mediocre ones. A practical guide to writing training pairs that produce consistent, domain-specific behavior — not a confused base model with new clothes.
Every guide on LoRA covers rank, alpha, learning rate, and layer targeting. Very few spend enough time on the thing that actually determines whether your fine-tune works: the training pairs.
You can tune hyperparameters for days. If your dataset is mediocre, you get a mediocre model — dressed up in slightly different clothes than the base model it started from, but not reliably better at your actual task. A well-constructed dataset of 300 pairs consistently outperforms a scraped dump of 3,000.
Here's why, and how to build the 300-pair version.
Every pair in your supervised fine-tuning (SFT) dataset is a demonstration. You're showing the model: given this input, produce this output. The model is not reasoning about why the output is correct — it's learning to reproduce the pattern.
This has a critical implication: the model can only learn what's present in the output. If your ideal response requires a specific tone, structure, domain terminology, or reasoning style, that must be present in the training outputs — not described in a system prompt, not hinted at in the input. Shown. Demonstrated. Every time.
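As a minimal sketch of what "shown, not described" looks like on disk, the snippet below serializes two pairs as JSON Lines. The field names `input` and `output`, the warehouse scenario, and the "Recommendation / Reason" layout are assumptions made for illustration; use whatever schema your training framework expects.

```python
import json

# Hypothetical pairs. Both outputs demonstrate the same structure
# (a "Recommendation:" line followed by a "Reason:" line) rather than
# relying on a system prompt to describe that structure.
pairs = [
    {
        "input": "what's the best way to restock zone 16?",
        "output": "Recommendation: restock zone 16 from the overflow racks first.\n"
                  "Reason: overflow stock is closest and already counted.",
    },
    {
        "input": "should I move the slow sellers out of zone 4?",
        "output": "Recommendation: move them to upper shelving.\n"
                  "Reason: floor-level slots should go to fast-moving items.",
    },
]


def to_jsonl(records):
    """Serialize one JSON object per line (a common layout for SFT datasets)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


jsonl_text = to_jsonl(pairs)
print(jsonl_text)
```

Every output carries the structure itself, so the pattern the model sees is the pattern you want back.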
Inconsistency in training data creates conflicting gradients. If 60% of your pairs format responses with a bullet list and 40% use flowing prose, the model does not settle on one style. It learns both and reproduces the mix, which means inconsistent output at inference time, switching styles unpredictably.
If 10% of your pairs use the right domain terminology and 90% use generic language, the model learns to use generic language. The 10% is noise relative to the dominant pattern.
Consistency is not just a quality metric — it's the signal. The model is learning the consistent pattern. If there isn't one, there's nothing to learn.
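One rough way to spot a split like the 60/40 case above is to classify each output by style and look at the shares. The sketch below uses a deliberately crude heuristic (any line that starts with a list marker counts as "bullets"); the function names and the sample outputs are hypothetical.

```python
import re
from collections import Counter

# Matches a line that starts with "-", "*", "•", or a number like "1." / "1)".
LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", flags=re.MULTILINE)


def output_style(text):
    """Crude heuristic: 'bullets' if any line starts with a list marker, else 'prose'."""
    return "bullets" if LIST_MARKER.search(text) else "prose"


def style_shares(outputs):
    """Return the fraction of outputs that fall into each style."""
    counts = Counter(output_style(o) for o in outputs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {style: n / total for style, n in counts.items()}


outputs = [
    "- check stock\n- reorder",
    "1. check stock\n2. reorder",
    "Check the stock level, then reorder if it is low.",
    "* check stock",
    "Reorder once the count drops below the threshold.",
]
shares = style_shares(outputs)
print(shares)  # {'bullets': 0.6, 'prose': 0.4}
```

A result anywhere near an even split suggests the outputs need to be rewritten into one style before training.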
The distribution of training inputs must match the distribution of inputs at inference time. If users will ask conversational questions ("what's the best way to restock zone 16?"), your training inputs should be conversational questions — not formal structured queries. If users paste raw CSV data, train on raw CSV data.
Mismatch between training input format and inference input format is one of the most common causes of "it works on the eval set but not in production."
Don't write the output you'd accept from a junior teammate. Write the output you'd be proud to ship. If you wouldn't put your name on it, don't put it in training data.
This matters especially for edge cases. It's easy to write good outputs for easy inputs. The hard inputs — ambiguous queries, partial information, edge-case terminology — are where the model needs the clearest signal, because those are the situations where the base model will be most uncertain.
A dataset of 500 examples where 450 are variations of the same query type will produce a model that handles that query type well and fails on everything else. Map out the categories of inputs your model will receive, then allocate pairs across all categories — even rare ones.
Run a simple category audit before writing any examples: list every input category, estimate how often it will occur, and decide how many pairs it gets.
Rare-case pairs are worth more per example than common-case pairs, because pretraining gives the base model less signal for unusual situations.
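A category audit can be as small as counting labels against the list of categories you expect. The sketch below assumes each pair has already been tagged with a category; the category names and the `min_per_category` floor of 10 are illustrative assumptions, not recommended values.

```python
from collections import Counter


def audit_coverage(pair_categories, expected_categories, min_per_category=10):
    """Return {category: count} for every expected category below the floor."""
    counts = Counter(pair_categories)
    return {
        category: counts.get(category, 0)
        for category in expected_categories
        if counts.get(category, 0) < min_per_category
    }


# Hypothetical labels for a 60-pair draft dataset.
expected = ["restock", "returns", "out_of_domain", "ambiguous"]
labels = ["restock"] * 45 + ["returns"] * 12 + ["ambiguous"] * 3

gaps = audit_coverage(labels, expected)
print(gaps)  # {'out_of_domain': 0, 'ambiguous': 3}
```

Categories that show up in the result are the ones to write pairs for next.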
For tasks where refusal or "I don't know" is the correct output, include explicit examples of it. If your model should decline to answer questions outside its domain, train it on out-of-domain inputs paired with well-formed refusals. If it should ask for clarification when the input is ambiguous, train it on ambiguous inputs paired with clarifying questions.
Without negative examples, the model never learns when not to answer — it just pattern-matches to the most plausible answer it can generate, which is the wrong behaviour for a reliable production system.
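For illustration, negative examples might look like the pairs below, with a small helper that reports which kinds are still missing. The `kind` field, the wording of the refusal, and the clarifying question are all hypothetical.

```python
# Hypothetical negative examples: one refusal and one clarifying question.
negative_pairs = [
    {
        "kind": "out_of_domain",
        "input": "can you draft my tax return?",
        "output": "I can only help with warehouse operations, so I can't help with tax returns.",
    },
    {
        "kind": "ambiguous",
        "input": "restock it",
        "output": "Which zone do you want to restock?",
    },
]


def missing_negative_kinds(pairs, required=("out_of_domain", "ambiguous")):
    """Return the required negative-example kinds that have no pair yet."""
    present = {p["kind"] for p in pairs}
    return [kind for kind in required if kind not in present]


print(missing_negative_kinds(negative_pairs))      # []
print(missing_negative_kinds(negative_pairs[:1]))  # ['ambiguous']
```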
The best training pairs come from real user interactions, real documents, or real task completions — not hypothetical scenarios you invented. If you have logs of how experts handled past requests, mine them. If you have existing customer emails and the replies that were sent, use them.
Real data has authentic variation in phrasing, tone, and complexity that you can't fully replicate by inventing scenarios. It also tests whether the output you're training on is actually what worked — not just what you think should work.
When constructing pairs from scratch, start with the ideal output. Describe exactly what the model should produce: the structure, the phrasing conventions, the level of detail, the terminology. Then write the input that would elicit that output.
Working input-first tends to produce outputs shaped by what's easy to write rather than what's ideal to receive. Output-first forces you to be precise about what you actually want.
If multiple people write training pairs, every person's stylistic choices become noise relative to every other person's. This is the hidden cost of crowdsourcing training data.
For small datasets (fewer than 1,000 pairs), have one person (ideally a domain expert) write all outputs, or have one person review and rewrite all outputs for consistency before training. The voice needs to be singular.
How many pairs you need depends entirely on task complexity and on how much behaviour change you need from the base model.
When in doubt, train on 200 pairs, evaluate hard, then add more targeted pairs for failure modes — don't front-load with volume.
Before you submit a training job, do these checks:
Sample 20 random pairs from your dataset. For each one: would you be satisfied if the model produced this exact output for this input? If more than 2 of the 20 fail that bar, your dataset isn't ready.
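The spot check above can be sketched in a few lines: draw the sample, review it by hand, then apply the "no more than 2 failures in 20" rule. The function names are hypothetical, and the verdicts still have to come from a human reviewer.

```python
import random


def sample_for_review(pairs, k=20, seed=0):
    """Draw up to k pairs at random; the fixed seed makes the sample reproducible."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


def dataset_ready(review_verdicts, max_failures=2):
    """review_verdicts holds one boolean per reviewed pair (True = acceptable as-is)."""
    failures = sum(1 for ok in review_verdicts if not ok)
    return failures <= max_failures


# Hypothetical dataset of 300 placeholder pairs.
pairs = [{"input": f"question {i}", "output": f"answer {i}"} for i in range(300)]
to_review = sample_for_review(pairs)

print(len(to_review))                              # 20
print(dataset_ready([True] * 18 + [False] * 2))    # True
print(dataset_ready([True] * 17 + [False] * 3))    # False
```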
Find 5 pairs that cover similar inputs. Do the outputs use the same structure, terminology, and level of detail? If they don't, you have a consistency problem.
List all the input categories your model will encounter. Does your dataset have examples in each category? Are the proportions reasonable? Mark any gaps before training.
If you have 150 high-quality pairs and need more signal, augmentation is more reliable than writing mediocre new pairs.
Don't use LLM-generated augmentation uncritically. GPT-4 paraphrasing your outputs will introduce GPT-4's patterns — which may conflict with the specific style you're trying to train. Review all augmented pairs before including them.
Set aside 10–15% of your pairs as a held-out evaluation set before training — not randomly sampled from the full set, but deliberately chosen to cover all your input categories. Your eval loss on this set is the signal that tells you when to stop.
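A category-aware split can be sketched as follows. Note the substitution: this version picks at random within each category as a stand-in for the deliberate, hand-chosen selection described above, so treat its output as a starting point to review. The `category` field and the 15% default are assumptions.

```python
import random
from collections import defaultdict


def split_by_category(pairs, eval_fraction=0.15, seed=0):
    """Hold out roughly eval_fraction of each category.

    Categories with two or more pairs contribute at least one held-out pair.
    A category with a single pair stays entirely in the training set.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for pair in pairs:
        by_category[pair["category"]].append(pair)

    train, held_out = [], []
    for category in sorted(by_category):
        items = by_category[category][:]
        rng.shuffle(items)
        n_eval = max(1, round(len(items) * eval_fraction))
        n_eval = min(n_eval, len(items) - 1)  # always leave at least one pair to train on
        held_out.extend(items[:n_eval])
        train.extend(items[n_eval:])
    return train, held_out


# Hypothetical dataset with one common and two rarer categories.
dataset = (
    [{"category": "restock", "input": f"r{i}", "output": "..."} for i in range(40)]
    + [{"category": "returns", "input": f"t{i}", "output": "..."} for i in range(10)]
    + [{"category": "ambiguous", "input": f"a{i}", "output": "..."} for i in range(4)]
)
train_set, eval_set = split_by_category(dataset)
print(sorted({p["category"] for p in eval_set}))  # ['ambiguous', 'restock', 'returns']
```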
Eval loss going down: still learning. Eval loss going up while train loss goes down: overfitting, so stop now. Eval loss plateauing while train loss keeps falling: a good stopping point; consider fewer epochs on the next run.
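These rules can be written down as a small classifier over per-epoch losses. This is a simplified sketch: it compares only the last two epochs, and the `tolerance` that separates "flat" from "moving" is an assumed value you would tune to your own loss scale.

```python
def loss_signal(train_losses, eval_losses, tolerance=1e-3):
    """Classify the most recent epoch; both lists need at least two entries."""
    d_train = train_losses[-1] - train_losses[-2]
    d_eval = eval_losses[-1] - eval_losses[-2]

    if d_eval < -tolerance:
        return "still learning"
    if d_eval > tolerance and d_train < -tolerance:
        return "overfitting: stop"
    if abs(d_eval) <= tolerance and d_train < -tolerance:
        return "plateau: good stopping point"
    return "inconclusive"


print(loss_signal([1.0, 0.8], [1.1, 0.9]))  # still learning
print(loss_signal([0.8, 0.6], [0.9, 1.0]))  # overfitting: stop
print(loss_signal([0.6, 0.5], [0.9, 0.9]))  # plateau: good stopping point
```

In a real run, a single noisy epoch can trip any of these branches, so it is worth looking at the trend over several epochs before acting on it.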
A model that scores well on your held-out eval set and poorly in actual use means your eval set isn't representative of real inputs. Fix the eval set, not the model.
Fine-tuning is iterative. The first run is not expected to be perfect — it's expected to show you where the dataset is weak. The second run is where things get good.