Image Fine-tuning · Flux · SDXL · LoRA · DreamBooth · Replicate · FAL

Fine-tuning Image Models: Flux, SDXL, and DreamBooth LoRA Explained

From DreamBooth to Flux LoRA — how image fine-tuning actually works, what your dataset needs to look like, trigger words, choosing between Replicate and FAL.ai, and what makes a good image adapter.

May 18, 2026·11 min read

Why image fine-tuning is different from language fine-tuning

When you fine-tune a language model, you are adjusting probability distributions over tokens — teaching the model which words are likely in your domain. Image fine-tuning is fundamentally different: you are adjusting latent representations in a diffusion process, teaching the model to associate a concept (a specific person, product, style, or object) with a visual pattern it has never seen before.

The good news: image fine-tuning converges faster than language fine-tuning. A strong LoRA adapter for a specific subject can be trained on as few as 15–30 images in roughly 1,000 training steps. The bad news: image quality is highly sensitive to dataset quality. Blurry inputs produce blurry outputs, and inconsistent lighting or backgrounds confuse the model about what you are actually teaching it.

DreamBooth vs LoRA — what is the difference?

DreamBooth (Ruiz et al., 2022) was the original technique for personalizing diffusion models. It fine-tunes the entire U-Net on a small set of subject images, using a technique called prior preservation loss to prevent the model from forgetting what the class looks like in general while it learns your specific subject.
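As a rough illustration of how prior preservation fits into the objective, the sketch below adds a weighted class-prior term to the subject reconstruction term. It assumes a plain mean-squared error over toy lists of floats; the names dreambooth_loss and prior_weight are illustrative and not taken from any particular trainer.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(subject_pred, subject_target,
                    class_pred, class_target, prior_weight=1.0):
    """Subject loss plus a weighted prior-preservation loss.

    subject_*: model output and target for one of your subject images.
    class_*:   model output and target for a generic image of the same
               class (for example, any dog when the subject is your dog).
    prior_weight is a hypothetical name for the weighting factor.
    """
    subject_loss = mse(subject_pred, subject_target)
    prior_loss = mse(class_pred, class_target)
    return subject_loss + prior_weight * prior_loss


# Toy numbers only: the second term penalizes drift on generic class images.
loss = dreambooth_loss([0.5, 0.0], [1.0, 0.0],
                       [0.0, 0.0], [0.0, 2.0], prior_weight=0.5)
print(loss)
```

Setting prior_weight to 0 reduces this to plain subject fine-tuning, which is the setting where the model is most likely to forget what the class looks like in general.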

LoRA applied to image models pursues the same goal but adds the low-rank constraint familiar from language fine-tuning: instead of updating all weights in the U-Net (or, for Flux, the diffusion transformer), it freezes them and trains two small matrices per target layer whose product is added to the original weight. The result is a much smaller file (~50–200 MB vs 2–6 GB for a full checkpoint), faster training, and the ability to stack multiple LoRAs at inference time.
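To make the size difference concrete, here is a small sketch that counts trainable parameters for one layer under a full update versus a low-rank update. The layer width of 3072 and the rank of 16 are assumptions chosen for illustration, not values read from a specific model or trainer.

```python
def full_update_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in weight is updated."""
    return d_out * d_in


def lora_update_params(d_out, d_in, rank):
    """Trainable parameters for a low-rank update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


# Hypothetical square layer of width 3072, LoRA rank 16.
full = full_update_params(3072, 3072)
lora = lora_update_params(3072, 3072, rank=16)
print(full, lora, full // lora)
```

For this assumed layer the low-rank update trains 96 times fewer parameters than the full update, which is why adapter files stay small.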

Today, LoRA is the standard. DreamBooth is mostly used as a training recipe (the prior preservation loss and dataset structure) while LoRA is used as the weight update method. Most tools — including Replicate, FAL.ai, and Claycable — use DreamBooth-style datasets with LoRA weight updates.

The dataset: what you actually need

The most common mistake is treating image fine-tuning like image generation: collecting 200 random photos and hoping the model figures it out. It won't. The model needs signal, not variety.

Subject LoRA (person, product, character)

15–30 images is enough. More is not always better — 100 images of the same person in the same outfit teaches less than 20 images with variation.
Vary pose, lighting, angle, and background. Twenty images shot from a single static angle will produce a model that can only generate that exact angle.
Cropping matters. For face/person LoRAs, include a mix of tight face crops (512px) and full-body shots. For product LoRAs, include shots on plain backgrounds, in context, from different angles.
Quality over quantity. Blurry, low-resolution, or heavily compressed images will degrade the adapter. At minimum 512px on the short side; 1024px is better for Flux.
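The checklist above can be turned into a quick pre-flight check. The sketch below works on a mapping of file names to (width, height) pairs, so it needs no image library; the function name, the report fields, and the way sizes are collected are all assumptions for illustration.

```python
def check_subject_dataset(sizes, min_short_side=512,
                          min_images=15, max_images=30):
    """Flag undersized images and report whether the count is in range.

    sizes: dict mapping file name -> (width, height) in pixels.
    """
    # An image is too small when its short side is below the minimum.
    too_small = sorted(
        name for name, (width, height) in sizes.items()
        if min(width, height) < min_short_side
    )
    usable = len(sizes) - len(too_small)

    notes = []
    if usable < min_images:
        notes.append(f"only {usable} usable images; aim for at least {min_images}")
    elif usable > max_images:
        notes.append(f"{usable} usable images; check that extras add real variation")
    return {"usable": usable, "too_small": too_small, "notes": notes}


# Example with made-up file names and sizes.
sizes = {f"img_{i:02d}.jpg": (1024, 1024) for i in range(16)}
sizes["thumb.jpg"] = (640, 400)
print(check_subject_dataset(sizes))
```

For Flux, passing min_short_side=1024 applies the stricter resolution suggested above. A check like this cannot judge blur or variation, which still need a visual pass.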

Style LoRA

50–200 images in your target style. More is better here since you are teaching a visual grammar, not a specific subject.
Images should be visually consistent — same color palette, line weight, rendering style.
Caption each image accurately. The captions teach the model which visual elements map to which words.

Trigger words

A trigger word is a token you include in prompts at inference time to activate the LoRA. At training time, your captions are written in the form "a photo of TOK", where TOK is your trigger word. The model learns to associate TOK with the visual pattern in your images.

Best practices:

Use a short, unique token: TOK, sks, ohwx, or your brand abbreviation.
Avoid common English words: using "dog" as a trigger word will conflict with the model's prior knowledge of dogs.
Be consistent: use the exact same trigger word in every caption and at inference time.
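These practices are easy to enforce in a small helper. The sketch below builds captions in the "a photo of TOK" form and checks that every caption contains the exact trigger word; the function names, the short list of common words, and the length limits are illustrative assumptions, not rules from any trainer.

```python
import re

# A tiny, hypothetical deny-list; a real one would be much longer.
COMMON_WORDS = {"dog", "cat", "person", "man", "woman", "car", "photo"}


def validate_trigger(trigger):
    """Reject triggers that are common words or not short alphanumeric tokens."""
    if trigger.lower() in COMMON_WORDS:
        raise ValueError(f"{trigger!r} is a common word and will clash with the prior")
    if not re.fullmatch(r"[A-Za-z0-9]{2,12}", trigger):
        raise ValueError(f"{trigger!r} should be 2-12 letters or digits")
    return trigger


def build_caption(trigger, detail=""):
    """Return 'a photo of <trigger>' with an optional trailing description."""
    base = f"a photo of {trigger}"
    return f"{base}, {detail}" if detail else base


def captions_are_consistent(captions, trigger):
    """True only if every caption contains the trigger with exact casing."""
    pattern = re.compile(rf"\b{re.escape(trigger)}\b")
    return all(pattern.search(caption) for caption in captions)


trigger = validate_trigger("ohwx")
captions = [build_caption(trigger), build_caption(trigger, "standing on a beach")]
print(captions, captions_are_consistent(captions, trigger))
```

The match is case-sensitive on purpose, so a caption that uses "tok" where the others use "TOK" is reported as inconsistent.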

Choosing a base model

The base model determines the ceiling on your adapter's quality. Claycable currently supports three:

Flux Dev — the highest quality open model as of 2025. 12B parameters, 1024px native resolution. Use this when quality matters. Training takes longer (1,000–2,000 steps recommended) but the results are significantly sharper than SDXL.
SDXL 1.0 — the reliable workhorse. About 3.5B parameters for the base model (6.6B when the optional refiner is counted), 1024px native. Widely supported, fast inference, huge ecosystem of existing LoRAs you can stack with yours. Good choice for style adapters and product shots.
SD 1.5 — smallest and fastest. 0.86B parameters in the U-Net, 512px native. Use when you need fast turnaround or lightweight deployment. Quality is noticeably lower than SDXL/Flux but training is cheap and inference is very fast.

Replicate vs FAL.ai — which to choose

Replicate and FAL.ai are both serverless GPU providers with REST APIs; Google Colab is a free notebook alternative. The differences:

Replicate — per-second billing on A40/A100 GPUs. Slightly slower cold starts but very transparent pricing. Flux Dev trainer (ostris/flux-dev-lora-trainer) is the gold standard. Good for non-time-sensitive batch training.
FAL.ai — faster queue clearing, often 3–8 minutes for Flux LoRA jobs. Slightly less transparent pricing but competitive. Better for tight loops where you want to iterate quickly.
Google Colab — free T4 GPU with the Claycable notebook. Slower than either provider (~2–4 hours for a Flux job on T4) but completely free. Good for experimentation.

Recommendation: Start on Replicate for serious work (better documentation, more model options), use FAL for speed, use Colab when exploring.

Training steps vs epochs

Image fine-tuning is typically configured in steps, not epochs. A step is one gradient update on one batch of images. The right number of steps depends on your dataset size:

15–20 images: 1,000–1,500 steps
30–50 images: 1,500–2,500 steps
100+ images (style LoRA): 2,000–4,000 steps

Over-training is real. At too many steps, the model overfits to your training images — it can only generate compositions and poses it has seen, rather than generalizing. The symptom is that all your generations look exactly like one of your training images regardless of what you prompt.
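Because trainers count steps while most intuition is in epochs, it helps to convert between the two. The sketch below computes how many times each image is seen for a given step count and returns the step ranges listed above; how it treats dataset sizes that fall between the listed tiers (for example 25 or 70 images) is an assumption made for illustration.

```python
def epochs_for(steps, num_images, batch_size=1):
    """Passes over the dataset implied by a step count.

    Each step consumes one batch, so images seen = steps * batch_size.
    """
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return steps * batch_size / num_images


def suggested_steps(num_images):
    """Return a (low, high) step range following the tiers in this post.

    Sizes between the listed tiers are assigned to the next tier up,
    which is an assumption rather than a stated recommendation.
    """
    if num_images <= 20:
        return (1000, 1500)
    if num_images <= 50:
        return (1500, 2500)
    return (2000, 4000)


low, high = suggested_steps(20)
print(low, high, epochs_for(low, 20), epochs_for(high, 20))
```

With 20 images and a batch size of 1, 1,000 steps means each image is seen 50 times. If the same step count is applied to a much smaller dataset, the per-image repetition climbs quickly, which is one way to end up with the overfitting symptom described above.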

What to do with your adapter

Once your adapter is trained and in the Vault, you can:

Generate images by passing the adapter weights to any Flux/SDXL inference endpoint
Chain it with other LoRAs — most inference engines support stacking (e.g. your subject LoRA + a style LoRA)
Pass it to an agent — in Claycable, an Armstrong-style agent can call your image adapter to generate product shots, marketing assets, or training data for other models
Export to Hugging Face to make it publicly available or use with ComfyUI/Automatic1111
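Stacking works because each LoRA is just an additive update to the same frozen weight. The toy sketch below merges two adapters into one small matrix using plain Python lists; real inference engines do the equivalent per target layer, and the function names, scales, and 2x2 sizes here are assumptions for illustration only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_loras(base, loras):
    """Return base + sum(scale * B @ A) over all adapters.

    loras: list of (B, A, scale) tuples, where B @ A has the shape of base.
    The base matrix is copied, not modified in place.
    """
    merged = [row[:] for row in base]
    for b_matrix, a_matrix, scale in loras:
        delta = matmul(b_matrix, a_matrix)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value
    return merged


base = [[1.0, 0.0], [0.0, 1.0]]
subject_lora = ([[1.0], [0.0]], [[0.0, 2.0]], 1.0)  # rank-1, full strength
style_lora = ([[0.0], [1.0]], [[4.0, 0.0]], 0.5)    # rank-1, half strength
print(apply_loras(base, [subject_lora, style_lora]))
```

The per-adapter scale is the knob most tools expose as LoRA strength: lowering it on the style adapter keeps the subject recognizable while still shifting the look.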

The Armstrong use case

This is worth calling out explicitly because it's non-obvious. If you train a visual LoRA on a company's product catalog, you can give that LoRA to an orchestrating agent as a tool. The agent can then generate marketing images on demand — not from a stock prompt, but from the exact products, in the exact brand aesthetic, with the exact trigger word baked in.

Combined with a text adapter trained on the brand voice, the agent can write copy and generate images for a full campaign — no human in the loop for the initial draft. This is what multimodal fine-tuning unlocks that prompt engineering never will: the model's weights actually contain the brand, not just a reference to it.
