
The Dataset Quality Guide: What Actually Makes a Fine-tune Work

Why 300 great pairs beat 3,000 mediocre ones. A practical guide to writing training pairs that produce consistent, domain-specific behavior — not a confused base model with new clothes.

May 16, 2026 · 9 min read

The single most important variable in fine-tuning

Every guide on LoRA covers rank, alpha, learning rate, and layer targeting. Very few spend enough time on the thing that actually determines whether your fine-tune works: the training pairs.

You can tune hyperparameters for days. If your dataset is mediocre, you get a mediocre model — dressed up in slightly different clothes than the base model it started from, but not reliably better at your actual task. A well-constructed dataset of 300 pairs consistently outperforms a scraped dump of 3,000.

Here's why, and how to build the 300-pair version.

What a training pair actually teaches

Every pair in your supervised fine-tuning (SFT) dataset is a demonstration. You're showing the model: given this input, produce this output. The model is not reasoning about why the output is correct — it's learning to reproduce the pattern.

This has a critical implication: the model can only learn what's present in the output. If your ideal response requires a specific tone, structure, domain terminology, or reasoning style, that must be present in the training outputs — not described in a system prompt, not hinted at in the input. Shown. Demonstrated. Every time.

Why consistency beats volume

Inconsistency in training data creates conflicting gradients. If 60% of your pairs format responses with a bullet list and 40% use flowing prose, the model does not settle on one style. It learns both, roughly in proportion, and at inference time it switches between them unpredictably.

If 10% of your pairs use the right domain terminology and 90% use generic language, the model learns to use generic language. The 10% is noise relative to the dominant pattern.

Consistency is not just a quality metric — it's the signal. The model is learning the consistent pattern. If there isn't one, there's nothing to learn.
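A quick way to see whether you have a format split like the 60/40 case is to label every output and look at the shares. This is a minimal sketch, not a definitive checker: it assumes each pair is a dict with hypothetical `input` and `output` keys, and it uses a crude heuristic (half or more of the non-empty lines start with a list marker) to call an output "bullets".

```python
import re
from collections import Counter

# Matches lines that start with "-", "*", "•" or a number like "1." / "1)".
BULLET_LINE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")

def output_style(text: str) -> str:
    """Label an output 'bullets', 'prose', or 'empty' using a simple line heuristic."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "empty"
    bullet_lines = sum(1 for ln in lines if BULLET_LINE.match(ln))
    return "bullets" if bullet_lines / len(lines) >= 0.5 else "prose"

def style_shares(pairs: list[dict]) -> dict[str, float]:
    """Return the fraction of outputs that fall under each style label."""
    counts = Counter(output_style(p["output"]) for p in pairs)
    total = sum(counts.values())
    return {style: n / total for style, n in counts.items()}

# Illustrative pairs only; the field names are an assumption.
pairs = [
    {"input": "How do I restock zone 16?",
     "output": "- Check the tote count\n- Move stock from overflow"},
    {"input": "Where do returns go?",
     "output": "- Scan the item\n- Place it in the returns bin"},
    {"input": "Who handles damaged stock?",
     "output": "Damaged stock goes to the shift lead, who logs it."},
]
shares = style_shares(pairs)
print(shares)
```

If one style does not clearly dominate, rewrite the minority outputs before training rather than hoping the model picks the right one.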

The four properties of a good pair

1. Inputs that look like real inference inputs

The distribution of training inputs must match the distribution of inputs at inference time. If users will ask conversational questions ("what's the best way to restock zone 16?"), your training inputs should be conversational questions — not formal structured queries. If users paste raw CSV data, train on raw CSV data.

Mismatch between training input format and inference input format is one of the most common causes of "it works on the eval set but not in production."

2. Outputs that are the gold standard, not the acceptable standard

Don't write the output you'd accept from a junior teammate. Write the output you'd be proud to ship. If you wouldn't put your name on it, don't put it in training data.

This matters especially for edge cases. It's easy to write good outputs for easy inputs. The hard inputs — ambiguous queries, partial information, edge-case terminology — are where the model needs the clearest signal, because those are the situations where the base model will be most uncertain.

3. Coverage of the input space, not just the common cases

A dataset of 500 examples where 450 are variations of the same query type will produce a model that handles that query type well and fails on everything else. Map out the categories of inputs your model will receive, then allocate pairs across all categories — even rare ones.

A simple category audit before writing any examples:

What are all the things users will ask or submit?
What's the frequency distribution (roughly)?
Are the rare cases represented in the training data at least in proportion to their real frequency?

Rare-case pairs are worth more per example than common-case pairs, because pretraining gives the base model less signal for unusual situations.
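The audit questions above can be turned into a small report once each pair carries a category label. The sketch below is one possible shape: the category names, the expected traffic shares, and the `min_pairs` floor of 5 are all assumptions for illustration, not recommended values.

```python
from collections import Counter

def coverage_report(pair_categories, expected_share, min_pairs=5):
    """Compare each category's share of the dataset with its estimated share of real traffic.

    A category is flagged when it has fewer than `min_pairs` examples or when its
    dataset share is below its expected share.
    """
    counts = Counter(pair_categories)
    total = len(pair_categories)
    report = {}
    for category, expected in expected_share.items():
        n = counts.get(category, 0)
        actual = n / total if total else 0.0
        report[category] = {
            "pairs": n,
            "dataset_share": round(actual, 3),
            "expected_share": expected,
            "under_represented": n < min_pairs or actual < expected,
        }
    return report

# Hypothetical labels: 45 of 50 pairs are the same query type.
categories = ["restock"] * 45 + ["returns"] * 4 + ["damaged"] * 1
expected = {"restock": 0.6, "returns": 0.3, "damaged": 0.1}

report = coverage_report(categories, expected)
for name, row in report.items():
    print(name, row)
```

In this made-up dataset, "returns" and "damaged" are both flagged, which is the lopsided situation described above in miniature.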

4. Explicit failure modes taught as negatives

For tasks where refusal or "I don't know" is the correct output, include explicit examples of it. If your model should decline to answer questions outside its domain, train it on inputs that are out of domain paired with well-formed declinations. If it should ask for clarification when the input is ambiguous, train it on ambiguous inputs paired with clarification questions.

Without negative examples, the model never learns when not to answer — it just pattern-matches to the most plausible answer it can generate, which is the wrong behaviour for a reliable production system.
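To make this concrete, here is what two such pairs might look like when serialised as JSONL. Everything in it is hypothetical: the warehouse-assistant domain, the `input`/`output` field names, and the wording of the replies are placeholders, and your training framework may expect a different schema.

```python
import json

# Illustrative pairs only; adapt the wording and schema to your own task.
refusal_pairs = [
    {
        # Out-of-domain input paired with a well-formed declination.
        "input": "Can you help me write a cover letter?",
        "output": "I can only help with warehouse operations questions, "
                  "so I can't assist with cover letters.",
    },
    {
        # Ambiguous input paired with a clarification question.
        "input": "restock it",
        "output": "Which zone do you want to restock, and which item?",
    },
]

def to_jsonl(records):
    """Serialise one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

jsonl_text = to_jsonl(refusal_pairs)
print(jsonl_text)
```

The same consistency rule applies here: every declination and every clarification question should follow one phrasing pattern.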

Practical pair construction: the process

Start with real data, not fabricated scenarios

The best training pairs come from real user interactions, real documents, or real task completions — not hypothetical scenarios you invented. If you have logs of how experts handled past requests, mine them. If you have existing customer emails and the replies that were sent, use them.

Real data has authentic variation in phrasing, tone, and complexity that you can't fully replicate by inventing scenarios. It also tests whether the output you're training on is actually what worked — not just what you think should work.

Write outputs before inputs

When constructing pairs from scratch, start with the ideal output. Describe exactly what the model should produce: the structure, the phrasing conventions, the level of detail, the terminology. Then write the input that would elicit that output.

Working input-first tends to produce outputs shaped by what's easy to write rather than what's ideal to receive. Output-first forces you to be precise about what you actually want.

The one-author rule

If multiple people write training pairs, every person's stylistic choices become noise relative to every other person's. This is the hidden cost of crowdsourcing training data.

For small datasets (<1000 pairs), have one person (ideally a domain expert) write all outputs, or have one person review and rewrite all outputs for consistency before training. The voice needs to be singular.

How many pairs do you need?

This depends entirely on the task complexity and how much behaviour change you need from the base model. Rough guidelines:

Tone/format/style changes: 100–300 pairs. The base model already knows the task — you're shaping how it presents answers.
Domain terminology and workflow: 300–700 pairs. You're teaching new vocabulary and task-specific patterns.
Complex multi-step reasoning or specialised knowledge: 700–2000 pairs. The base model needs significant new signal.
More than 2000 pairs: rarely needed for SFT. If you're here, consider whether the task is too broad and should be decomposed into multiple specialist models.

When in doubt, train on 200 pairs, evaluate hard, then add more targeted pairs for failure modes — don't front-load with volume.

Evaluating dataset quality before training

Before you submit a training job, do these checks:

The random sample review

Sample 20 random pairs from your dataset. For each one: would you be satisfied if the model produced this exact output for this input? If more than 2 of the 20 fail that bar, your dataset isn't ready.
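A minimal sketch of that review loop, assuming the pairs are held in a Python list and a human records one True/False verdict per sampled pair. The function names and the fixed seed are illustrative choices.

```python
import random

def sample_for_review(pairs, k=20, seed=0):
    """Draw up to k pairs without replacement; a fixed seed keeps the sample reproducible."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))

def dataset_ready(verdicts, max_failures=2):
    """verdicts holds one boolean per reviewed pair (True = reviewer was satisfied)."""
    failures = sum(1 for ok in verdicts if not ok)
    return failures <= max_failures

# Placeholder dataset of 300 pairs.
pairs = [{"input": f"question {i}", "output": f"answer {i}"} for i in range(300)]
to_review = sample_for_review(pairs)
print(len(to_review))

# After a human fills in the verdicts:
print(dataset_ready([True] * 18 + [False] * 2))  # 2 failures: still within the bar
print(dataset_ready([True] * 17 + [False] * 3))  # 3 failures: not ready
```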

The consistency check

Find 5 pairs that cover similar inputs. Do the outputs use the same structure, terminology, and level of detail? If they don't, you have a consistency problem.
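Finding pairs with similar inputs by hand gets tedious past a few dozen examples. One rough way to surface candidates is word-overlap (Jaccard) similarity between inputs, sketched below. The 0.5 threshold and the `input` field name are assumptions, and the comparison of the outputs themselves is still a human job.

```python
import re
from itertools import combinations

def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Share of distinct words that two strings have in common."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def similar_input_pairs(pairs, threshold=0.5):
    """Return (index_a, index_b, score) for every pair of inputs at or above the threshold."""
    hits = []
    for (i, p), (j, q) in combinations(enumerate(pairs), 2):
        score = jaccard(p["input"], q["input"])
        if score >= threshold:
            hits.append((i, j, round(score, 2)))
    return sorted(hits, key=lambda h: h[2], reverse=True)

pairs = [
    {"input": "how do I restock zone 16", "output": "..."},
    {"input": "how do I restock zone 4", "output": "..."},
    {"input": "where do returns go", "output": "..."},
]
hits = similar_input_pairs(pairs)
print(hits)  # [(0, 1, 0.71)]
```

Read the outputs of each reported pair side by side. This compares every input with every other input, which is fine for datasets of a few thousand pairs or fewer.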

The coverage audit

List all the input categories your model will encounter. Does your dataset have examples in each category? Are the proportions reasonable? Mark any gaps before training.

Augmentation: how to get more without writing more

If you have 150 high-quality pairs and need more signal, augmentation is more reliable than writing mediocre new pairs.

Paraphrase inputs: take existing inputs and rephrase them (formally, casually, abbreviated, verbose). Keep the same output. This teaches the model that multiple phrasings map to the same response.
Swap terminology: if your domain has synonyms (zone / section, tote / bin, picker / operator), create variants that use alternate terms with consistent outputs.
Add noise inputs: create inputs with typos, missing context, or ambiguous phrasing, paired with appropriate handling (ask for clarification, or make a reasonable assumption and state it).
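The terminology swap is the easiest of the three to automate. Here is a minimal sketch using the synonym examples above. It only rewrites inputs and copies the output unchanged, it matches whole words only (so plurals such as "totes" are left alone), and the `input`/`output` field names are assumptions.

```python
import re

# Synonym map taken from the examples above; extend it for your own domain.
SYNONYMS = {"zone": "section", "tote": "bin", "picker": "operator"}

def swap_terms(text, synonyms=SYNONYMS):
    """Replace whole-word occurrences of each term, keeping a leading capital letter."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, synonyms)) + r")\b", re.IGNORECASE
    )

    def repl(match):
        word = match.group(0)
        new = synonyms[word.lower()]
        return new.capitalize() if word[0].isupper() else new

    return pattern.sub(repl, text)

def augment(pairs):
    """Create one variant per pair whose input actually changes; the output is reused as-is."""
    variants = []
    for p in pairs:
        swapped = swap_terms(p["input"])
        if swapped != p["input"]:
            variants.append({"input": swapped, "output": p["output"]})
    return variants

pairs = [
    {"input": "Which tote goes to zone 16?", "output": "..."},
    {"input": "Where do returns go?", "output": "..."},
]
variants = augment(pairs)
print(variants[0]["input"])  # Which bin goes to section 16?
```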

Don't use LLM-generated augmentation uncritically. If GPT-4 paraphrases your inputs or rewrites your outputs, it introduces GPT-4's own patterns, which may conflict with the specific style you're trying to train. Review all augmented pairs before including them.

The evaluation set matters as much as the training set

Set aside 10–15% of your pairs as a held-out evaluation set before training. Don't sample it blindly from the full set: stratify it so that every input category is covered. Your eval loss on this set is the signal that tells you when to stop.

Eval loss going down: still learning.
Eval loss going up while train loss goes down: overfitting. Stop now.
Eval loss plateauing before train loss: good stopping point. Consider reducing epochs on the next run.
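Those three readings can be written down as a small decision rule over the logged losses. This is a sketch under assumptions: it compares the mean of the last `window` values with the mean of the `window` values before them, and treats a relative change within `tol` as flat. The window of 3 and the tolerance of 1% are illustrative defaults, not tuned values.

```python
def loss_trend(values, window=3, tol=0.01):
    """Return 'down', 'up', 'flat', or 'unknown' for the most recent stretch of a loss curve."""
    if len(values) < 2 * window:
        return "unknown"
    recent = sum(values[-window:]) / window
    previous = sum(values[-2 * window:-window]) / window
    if previous == 0:
        return "flat"
    change = (recent - previous) / previous
    if change < -tol:
        return "down"
    if change > tol:
        return "up"
    return "flat"

def stopping_advice(train_losses, eval_losses, window=3, tol=0.01):
    """Map the train and eval trends onto the three cases described above."""
    train = loss_trend(train_losses, window, tol)
    evl = loss_trend(eval_losses, window, tol)
    if evl == "unknown":
        return "keep training (not enough history)"
    if evl == "down":
        return "keep training"
    if evl == "up" and train == "down":
        return "overfitting: stop"
    if evl == "flat" and train == "down":
        return "plateau: good stopping point"
    return "inconclusive: inspect the curves"

# Made-up loss values, one per evaluation step.
train = [2.0, 1.6, 1.3, 1.1, 0.9, 0.8]
print(stopping_advice(train, [2.0, 1.8, 1.6, 1.4, 1.2, 1.0]))
print(stopping_advice(train, [2.1, 1.8, 1.6, 1.7, 1.9, 2.0]))
print(stopping_advice(train, [1.5, 1.5, 1.5, 1.5, 1.5, 1.5]))
```

On small datasets eval loss is noisy, so treat the result as a prompt to look at the curve rather than as an automatic stop signal.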

If a model scores well on your held-out eval set but poorly in actual use, your eval set isn't representative of real inputs. Fix the eval set, not the model.
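One way to build a held-out set that covers every category is a stratified split: shuffle within each category and take a slice from each. The sketch below assumes every pair carries a hypothetical `category` label. It holds out at least one pair per category where possible, but never a category's only pair, so tiny categories can push the overall held-out share slightly above the target.

```python
import random
from collections import defaultdict

def stratified_holdout(pairs, eval_fraction=0.15, seed=0):
    """Split pairs into (train, held_out), sampling the held-out part per category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for p in pairs:
        by_category[p["category"]].append(p)

    train, held_out = [], []
    for items in by_category.values():
        items = items[:]  # copy so the caller's list is not reordered
        rng.shuffle(items)
        # At least one eval pair per category, but always leave one for training.
        n_eval = min(max(1, round(len(items) * eval_fraction)), len(items) - 1)
        held_out.extend(items[:n_eval])
        train.extend(items[n_eval:])
    return train, held_out

# Placeholder dataset with one common and two rarer categories.
pairs = (
    [{"category": "restock", "input": f"r{i}", "output": "..."} for i in range(20)]
    + [{"category": "returns", "input": f"t{i}", "output": "..."} for i in range(10)]
    + [{"category": "damaged", "input": f"d{i}", "output": "..."} for i in range(2)]
)
train_set, eval_set = stratified_holdout(pairs)
print(len(train_set), len(eval_set))
print(sorted({p["category"] for p in eval_set}))
```

Do the split once, before any training run, and keep the held-out pairs fixed across runs so results stay comparable.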

A dataset-building workflow

1. Define the exact task and the exact output format you want
2. List all input categories and estimate their frequency
3. Write 10 gold-standard pairs across different categories to establish the pattern
4. Review those 10 pairs until they're perfect: this is your style guide in concrete form
5. Write or collect the remaining pairs, holding that standard
6. One-author review pass: consistency check across all pairs
7. Hold out 10–15% as eval set (stratified by category)
8. Train, watch eval loss, stop when it plateaus
9. Manual eval on 20 random outputs to identify failure modes
10. Add targeted pairs for failure modes and repeat

Fine-tuning is iterative. The first run is not expected to be perfect — it's expected to show you where the dataset is weak. The second run is where things get good.
