From DreamBooth to Flux LoRA — how image fine-tuning actually works, what your dataset needs to look like, trigger words, choosing between Replicate and FAL.ai, and what makes a good image adapter.
When you fine-tune a language model, you are adjusting probability distributions over tokens — teaching the model which words are likely in your domain. Image fine-tuning is fundamentally different: you are adjusting latent representations in a diffusion process, teaching the model to associate a concept (a specific person, product, style, or object) with a visual pattern it has never seen before.
The good news: image fine-tuning converges faster than language fine-tuning. A strong LoRA adapter for a specific subject can be trained with as few as 15–30 images and 1,000 training steps. The bad news: image quality is highly sensitive to dataset quality. Blurry inputs produce blurry outputs, and inconsistent lighting or backgrounds confuse the model about what you are actually teaching it.
DreamBooth (Ruiz et al., 2022) was the original technique for personalizing diffusion models. It fine-tunes the entire U-Net on a small set of subject images, using a technique called prior preservation loss to prevent the model from forgetting what the class looks like in general while it learns your specific subject.
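To make the prior preservation idea concrete, here is a minimal sketch of how the two loss terms combine. It assumes a noise-prediction objective (the model predicts the noise that was added to an image), which is how common open-source implementations frame it. The function names and the `prior_weight` parameter are illustrative, not taken from any particular library.

```python
def mse(predicted, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def dreambooth_loss(subject_pred, subject_noise, class_pred, class_noise,
                    prior_weight=1.0):
    """Subject loss plus weighted prior preservation loss.

    subject_*: prediction and true noise for YOUR subject images.
    class_*:   prediction and true noise for generic class images
               (for example, ordinary dogs produced by the base model).
    prior_weight: hypothetical name for the weighting hyperparameter.
    """
    subject_term = mse(subject_pred, subject_noise)  # learn the new subject
    prior_term = mse(class_pred, class_noise)        # keep the general class intact
    return subject_term + prior_weight * prior_term
```

Setting `prior_weight` to 0 in this sketch reduces it to plain fine-tuning on the subject images, which is the setting where forgetting the general class becomes a risk.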
LoRA applied to image models does the same thing but adds the low-rank constraint from language fine-tuning: instead of updating all weights in the U-Net, it trains two small matrices per target layer. The result is a much smaller file (~50–200 MB vs 2–6 GB for a full checkpoint), faster training, and the ability to stack multiple LoRAs at inference time.
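The size difference follows from simple arithmetic on the two small matrices. The sketch below uses an illustrative 1024×1024 layer and rank 16 (assumed numbers, not a description of any specific U-Net) and shows how a trained pair of matrices would be merged into a frozen weight.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full_weight_params, lora_params) for one target layer."""
    full = d_out * d_in            # every entry of the original weight matrix
    lora = rank * (d_in + d_out)   # B is d_out x rank, A is rank x d_in
    return full, lora


def matmul(a, b):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def merged_weight(w, b, a, scale=1.0):
    """Frozen weight plus the scaled low-rank update: W + scale * (B @ A)."""
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


full, lora = lora_param_counts(1024, 1024, 16)
print(full, lora, lora / full)  # 1048576 32768 0.03125
```

Because the update is additive, stacking several LoRAs amounts to adding each adapter's scaled `B @ A` term to the same frozen weight.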
Today, LoRA is the standard. DreamBooth is mostly used as a training recipe (the prior preservation loss and dataset structure) while LoRA is used as the weight update method. Most tools — including Replicate, FAL.ai, and Claycable — use DreamBooth-style datasets with LoRA weight updates.
The most common mistake is treating image fine-tuning like image generation: collecting 200 random photos and hoping the model figures it out. It won't. The model needs signal, not variety.
A trigger word is a token you include in prompts at inference time to activate the LoRA. At training time, your captions are written as "a photo of TOK", where TOK is your trigger word. The model learns to associate TOK with the visual pattern in your images.
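As a rough illustration of the captioning step, the sketch below writes one caption file per image using the "a photo of TOK" template. The sidecar `.txt` layout and the `write_captions` helper are assumptions for this example; check what your training provider actually expects before relying on it.

```python
from pathlib import Path

# Image extensions treated as training images (an assumption for this sketch).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def write_captions(dataset_dir, trigger_word):
    """Write 'a photo of <trigger_word>' next to every image in dataset_dir.

    Returns the list of caption files written, in sorted order.
    """
    written = []
    for image_path in sorted(Path(dataset_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip notes, existing captions, and other non-image files
        caption_path = image_path.with_suffix(".txt")
        caption_path.write_text(f"a photo of {trigger_word}\n", encoding="utf-8")
        written.append(caption_path)
    return written
```

Calling `write_captions("my_dataset", "TOK")` on a hypothetical `my_dataset` folder would leave `photo1.txt` beside `photo1.jpg`, and so on.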
Best practices:
Use a rare token such as TOK, sks, ohwx, or your brand abbreviation.
Avoid common words: using dog as a trigger word will conflict with the model's prior knowledge of dogs.

The base model determines the ceiling on your adapter's quality. Claycable currently supports three:
Replicate and FAL.ai are both serverless GPU providers with REST APIs. The differences:
Recommendation: start on Replicate for serious work (better documentation, more model options), use FAL.ai for speed, and use Colab when exploring.
Image fine-tuning is typically configured in steps, not epochs. A step is one gradient update on one batch of images. The right number of steps depends on your dataset size:
Over-training is real. At too many steps, the model overfits to your training images — it can only generate compositions and poses it has seen, rather than generalizing. The symptom is that all your generations look exactly like one of your training images regardless of what you prompt.
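One way to reason about over-training is to convert a step count into the number of times the model sees each image. This is a small sketch of that arithmetic. The batch size of 1 in the example is an assumption, and the function does not encode any recommended threshold.

```python
def passes_per_image(steps, batch_size, num_images):
    """Average number of times each training image is seen.

    One step processes one batch, so the total number of images seen is
    steps * batch_size; dividing by the dataset size gives the
    epoch-equivalent count.
    """
    if batch_size <= 0 or num_images <= 0:
        raise ValueError("batch_size and num_images must be positive")
    return steps * batch_size / num_images


# 1,000 steps over 20 images at batch size 1: each image is seen 50 times.
print(passes_per_image(1000, 1, 20))  # 50.0
```

Halving the dataset while keeping the step count fixed doubles this figure, which is why a fixed step budget can overfit a small dataset and underfit a large one.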
Once your adapter is trained and in the Vault, you can:
This is worth calling out explicitly because it's non-obvious. If you train a visual LoRA on a company's product catalog, you can give that LoRA to an orchestrating agent as a tool. The agent can then generate marketing images on demand — not from a stock prompt, but from the exact products, in the exact brand aesthetic, with the exact trigger word baked in.
Combined with a text adapter trained on the brand voice, the agent can write copy and generate images for a full campaign — no human in the loop for the initial draft. This is what multimodal fine-tuning unlocks that prompt engineering never will: the model's weights actually contain the brand, not just a reference to it.
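A minimal sketch of what "giving the LoRA to an agent as a tool" could look like is shown below. Everything here is hypothetical: `make_image_tool` is an illustrative helper, and `generate` stands in for whatever function calls your image provider with the adapter loaded. The only point is that the tool, not the agent, is responsible for injecting the trigger word.

```python
import re
from typing import Callable


def make_image_tool(trigger_word: str,
                    generate: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a hypothetical image backend so every prompt carries the trigger word.

    generate: placeholder for a function that takes a prompt and returns
              something like an image URL.
    """
    pattern = re.compile(rf"\b{re.escape(trigger_word)}\b")

    def tool(description: str) -> str:
        # Add the trigger word only if the agent did not already include it.
        if pattern.search(description):
            prompt = description
        else:
            prompt = f"a photo of {trigger_word}, {description}"
        return generate(prompt)

    return tool
```

With a stub backend such as `lambda prompt: prompt`, calling the tool with "on a marble table" yields the prompt "a photo of TOK, on a marble table", so the agent never needs to know the trigger word.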