In this example, we'll be training a Flux.1 LoRA model using the SimpleTuner toolkit.
Flux requires a lot of system RAM in addition to GPU memory. Simply quantising the model at startup requires about 50GB of system memory. If quantisation takes an excessively long time, you may need to assess your hardware's capabilities and whether any changes are needed.
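As a rough pre-flight check, the sketch below compares total system RAM against the ~50GB figure quoted above. The helper names are hypothetical and not part of SimpleTuner, 1GB is taken as 1024**3 bytes, and the `os.sysconf` lookup is only expected to work on Unix-like systems.

```python
import os

# Approximate figure quoted above for quantising Flux at startup.
REQUIRED_RAM_GB = 50


def has_enough_ram(total_bytes, required_gb=REQUIRED_RAM_GB):
    """Return True if total_bytes covers required_gb (1 GB taken as 1024**3 bytes)."""
    return total_bytes >= required_gb * 1024**3


def total_ram_bytes():
    """Best-effort total physical RAM; None where sysconf lacks these names."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None


if __name__ == "__main__":
    ram = total_ram_bytes()
    if ram is None:
        print("Could not determine system RAM on this platform.")
    elif has_enough_ram(ram):
        print("System RAM looks sufficient for quantising at startup.")
    else:
        print("Less than ~50GB of system RAM; quantisation may be slow or get killed.")
```

This only looks at total installed memory, not what is free at the moment training starts.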
When you're training every component of a rank-16 LoRA (MLP, projections, multimodal blocks), it ends up using:
- a bit more than 32G VRAM when not quantising the base model
- a bit more than 20G VRAM when quantising to int8 + bf16 base/LoRA weights
- a bit more than 13G VRAM when quantising to int2 + bf16 base/LoRA weights
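The figures above can be turned into a quick lookup. This is a minimal sketch: the function name and labels are hypothetical, and the thresholds are simply the approximate rank-16 numbers from the list, so treat the result as a starting point rather than a guarantee.

```python
# Approximate rank-16 LoRA figures from the list above, in GB of VRAM.
# Each entry is (label, VRAM that must be exceeded for that mode to fit).
VRAM_TIERS = [
    ("no quantisation", 32),
    ("int8", 20),
    ("int2", 13),
]


def suggest_quantisation(vram_gb):
    """Return the least aggressive option that should fit, or None if nothing fits."""
    for label, needed_gb in VRAM_TIERS:
        # "a bit more than" is modelled as a strict comparison.
        if vram_gb > needed_gb:
            return label
    return None


if __name__ == "__main__":
    for card_gb in (48, 24, 16, 12):
        print(card_gb, "GB ->", suggest_quantisation(card_gb))
```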
To have reliable results, you'll need:
- at minimum a single 3090 or V100 GPU
- ideally multiple A6000s
Luckily, these are readily available through providers such as LambdaLabs, which provides the lowest available rates and localised clusters for multi-node training.
Unlike other models, AMD and Apple GPUs do not work for training Flux.
Make sure that you have Python installed; SimpleTuner does well with 3.10 or 3.11. Python 3.12 should not be used.
You can check this by running:
python --version
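The same check can be done from inside Python. A minimal sketch, assuming only the 3.10/3.11 guidance given above; the helper name is hypothetical.

```python
import sys

# Versions this guide recommends; 3.12 is deliberately excluded.
SUPPORTED_VERSIONS = {(3, 10), (3, 11)}


def is_supported_python(version=None):
    """Check a (major, minor, ...) tuple; defaults to the running interpreter."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) in SUPPORTED_VERSIONS


if __name__ == "__main__":
    if is_supported_python():
        print("Python version is fine for SimpleTuner.")
    else:
        print("Use Python 3.10 or 3.11; found", sys.version.split()[0])
```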
If you don't have Python 3.11 installed on Ubuntu, you can try the following:
apt -y install python3.11 python3.11-venv
For Vast, RunPod, and TensorDock (among others), the following will work on a CUDA 12.2-12.4 image:
apt -y install nvidia-cuda-toolkit libgl1-mesa-glx
If libgl1-mesa-glx is not found, you might need to use libgl1-mesa-dri instead. Your mileage may vary.
Clone the SimpleTuner repository and set up the python venv:
git clone --branch=release https://github.com/bghira/SimpleTuner.git
cd SimpleTuner
# if python --version shows 3.11, you can also just use the 'python' command here.
python3.11 -m venv .venv
source .venv/bin/activate
pip install -U poetry pip
Note: We're currently installing the release branch here; the main branch may contain experimental features that might have better results or lower memory use.
Depending on your system, you will run one of 3 commands:
# MacOS
poetry install -C install/apple
# Linux
poetry install
# Linux with ROCM
poetry install -C install/rocm
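The choice between the three commands can be sketched as a small function. The function name is hypothetical, and detecting ROCm by looking for an /opt/rocm directory is an assumption rather than something SimpleTuner does itself.

```python
import os
import platform


def poetry_install_command(system, has_rocm=False):
    """Map a platform.system() value to one of the three commands above."""
    if system == "Darwin":
        return "poetry install -C install/apple"
    if system == "Linux":
        return "poetry install -C install/rocm" if has_rocm else "poetry install"
    raise ValueError(f"No install command listed for {system!r}")


if __name__ == "__main__":
    # Assumption: an /opt/rocm directory indicates a ROCm installation.
    rocm_present = os.path.isdir("/opt/rocm")
    try:
        print(poetry_install_command(platform.system(), rocm_present))
    except ValueError as err:
        print(err)
```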
The following must be executed for an AMD MI300X to be useable:
apt install amd-smi-lib
pushd /opt/rocm/share/amd_smi
python3 -m pip install --upgrade pip
python3 -m pip install .
popd
To run SimpleTuner, you will need to set up a configuration file, the dataset and model directories, and a dataloader configuration file.
An experimental script, configure.py, may allow you to entirely skip this section through an interactive step-by-step configuration. It contains some safety features that help avoid common pitfalls.
Note: This doesn't configure your dataloader. You will still have to do that manually, later.
To run it:
python configure.py
⚠️ For users located in countries where Hugging Face Hub is not readily accessible, you should add HF_ENDPOINT=https://hf-mirror.com to your ~/.bashrc or ~/.zshrc depending on which $SHELL your system uses.
If you prefer to manually configure:
Copy config/config.json.example to config/config.json:
cp config/config.json.example config/config.json
There, you will possibly need to modify the following variables:
- model_type - Set this to lora.
- model_family - Set this to flux.
- pretrained_model_name_or_path - Set this to black-forest-labs/FLUX.1-dev.
  - Note that you will probably need to log in to Hugging Face and be granted access to download this model. We will go over logging in to Hugging Face later in this tutorial.
- output_dir - Set this to the directory where you want to store your checkpoints and validation images. It's recommended to use a full path here.
- train_batch_size - This should be kept at 1, especially if you have a very small dataset.
- validation_resolution - As Flux is a 1024px model, you can set this to 1024x1024.
  - Additionally, Flux was fine-tuned on multi-aspect buckets, and other resolutions may be specified using commas to separate them: 1024x1024,1280x768,2048x2048
- validation_guidance - Use whatever you are used to selecting at inference time for Flux.
- validation_guidance_real - Use >1.0 to use CFG for Flux inference. Slows validations down, but produces better results. Does best with an empty VALIDATION_NEGATIVE_PROMPT.
- validation_num_inference_steps - Use somewhere around 20 to save time while still seeing decent quality. Flux isn't very diverse, and more steps might just waste time.
- --lora_rank=4 if you wish to substantially reduce the size of the LoRA being trained. This can help with VRAM use.
- If training a Schnell LoRA, you'll have to supply --flux_fast_schedule=true manually here as well.
- gradient_accumulation_steps - Previous guidance was to avoid these with bf16 training since they would degrade the model. Further testing showed this is not necessarily the case for Flux.
  - This option causes update steps to be accumulated over several steps. This will increase the training runtime linearly, such that a value of 2 will make your training run half as quickly, and take twice as long.
- optimizer - Beginners are recommended to stick with adamw_bf16, though optimi-lion and optimi-stableadamw are also good choices.
- mixed_precision - Beginners should keep this in bf16
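To make the list above concrete, here is a minimal sketch that assembles those settings into a JSON document. The key spelling (no leading dashes), the output path, and the helper name are assumptions for illustration; check config/config.json.example for the exact keys your version expects.

```python
import json

# Values taken from the list above; output_dir is a placeholder path.
BASE_CONFIG = {
    "model_type": "lora",
    "model_family": "flux",
    "pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
    "output_dir": "/path/to/output",  # hypothetical, use a full path of your own
    "train_batch_size": 1,
    "validation_resolution": "1024x1024",
    "validation_num_inference_steps": 20,
    "optimizer": "adamw_bf16",
    "mixed_precision": "bf16",
}


def render_config(overrides=None):
    """Merge overrides into the base settings and return indented JSON text."""
    config = dict(BASE_CONFIG)
    config.update(overrides or {})
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Printed rather than written to disk, so nothing is overwritten by accident.
    print(render_config({"lora_rank": 4}))
```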
Inside config/config.json is the "primary validation prompt", which is typically the main instance_prompt you are training on for your single subject or style. Additionally, a JSON file may be created that contains extra prompts to run through during validations.
The example config file config/user_prompt_library.json.example contains the following format:
{
  "nickname": "the prompt goes here",
  "another_nickname": "another prompt goes here"
}
The nicknames are used as the filenames for the validation images, so keep them short and compatible with your filesystem.
To point the trainer to this prompt library, add it to TRAINER_EXTRA_ARGS by adding a new line at the end of config.json:
"--user_prompt_library": "config/user_prompt_library.json",
A set of diverse prompts will help determine whether the model is collapsing as it trains. In this example, the word <token> should be replaced with your subject name (instance_prompt).
{
  "anime_<token>": "a breathtaking anime-style portrait of <token>, capturing her essence with vibrant colors and expressive features",
  "chef_<token>": "a high-quality, detailed photograph of <token> as a sous-chef, immersed in the art of culinary creation",
  "just_<token>": "a lifelike and intimate portrait of <token>, showcasing her unique personality and charm",
  "cinematic_<token>": "a cinematic, visually stunning photo of <token>, emphasizing her dramatic and captivating presence",
  "elegant_<token>": "an elegant and timeless portrait of <token>, exuding grace and sophistication",
  "adventurous_<token>": "a dynamic and adventurous photo of <token>, captured in an exciting, action-filled moment",
  "mysterious_<token>": "a mysterious and enigmatic portrait of <token>, shrouded in shadows and intrigue",
  "vintage_<token>": "a vintage-style portrait of <token>, evoking the charm and nostalgia of a bygone era",
  "artistic_<token>": "an artistic and abstract representation of <token>, blending creativity with visual storytelling",
  "futuristic_<token>": "a futuristic and cutting-edge portrayal of <token>, set against a backdrop of advanced technology",
  "woman": "a beautifully crafted portrait of a woman, highlighting her natural beauty and unique features",
  "man": "a powerful and striking portrait of a man, capturing his strength and character",
  "boy": "a playful and spirited portrait of a boy, capturing youthful energy and innocence",
  "girl": "a charming and vibrant portrait of a girl, emphasizing her bright personality and joy",
  "family": "a heartwarming and cohesive family portrait, showcasing the bonds and connections between loved ones"
}
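Replacing <token> by hand across many prompts is error-prone, so a small helper can do it. This is a sketch only: the function name and the example subject are hypothetical. A separate short key_name is used for the nicknames because they end up in filenames.

```python
import json


def fill_prompt_library(template, subject, key_name):
    """Replace <token> with key_name in nicknames and with subject in prompts."""
    return {
        nickname.replace("<token>", key_name): prompt.replace("<token>", subject)
        for nickname, prompt in template.items()
    }


if __name__ == "__main__":
    template = {
        "anime_<token>": "a breathtaking anime-style portrait of <token>",
        "woman": "a beautifully crafted portrait of a woman",
    }
    # "my subject" and "subject" are placeholder values for illustration.
    library = fill_prompt_library(template, subject="my subject", key_name="subject")
    print(json.dumps(library, indent=2))
```

Entries without <token>, such as the generic "woman" prompt, pass through unchanged.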
ℹ️ Flux is a flow-matching model and shorter prompts that have strong similarities will result in practically the same image being produced by the model. Be sure to use longer, more descriptive prompts.
Tested on Apple and NVIDIA systems, Hugging Face Optimum-Quanto can be used to reduce the precision and VRAM requirements, training Flux on just 20GB.
Inside your SimpleTuner venv:
pip install optimum-quanto
For config.json users:
"base_model_precision": "int8-quanto",
"text_encoder_1_precision": "no_change",
"text_encoder_2_precision": "no_change",
"lora_rank": 16,
"max_grad_norm": 1.0,
"base_model_default_dtype": "bf16"
#################################################
#################################################
# When training 'mmdit', we find very stable training that makes the model take longer to learn.
# When training 'all', we can easily shift the model distribution, but it is more prone to forgetting and benefits from high quality data.
# When training 'all+ffs', all attention layers are trained in addition to the feed-forward which can help with adapting the model objective for the LoRA.
# - This mode has been reported to lack portability, and platforms such as ComfyUI might not be able to load the LoRA.
# The option to train only the 'context' blocks is offered as well, but its impact is unknown, and is offered as an experimental choice.
# - An extension to this mode, 'context+ffs' is also available, which is useful for pretraining new tokens into a LoRA before continuing finetuning it via `--init_lora`.
"--flux_lora_target": "all",
# If you want to use LoftQ initialisation, you can't use Quanto to quantise the base model.
# This possibly offers better/faster convergence, but only works on NVIDIA devices and requires Bits n Bytes and is incompatible with Quanto.
# Other options are 'default', 'gaussian' (difficult), and untested options: 'olora' and 'pissa'.
"--lora_init_type": "loftq",
⚠️ Image quality for training is more important for Flux than for most other models, as it will absorb the artifacts in your images first, and then learn the concept/subject.
It's crucial to have a substantial dataset to train your model on. There are limitations on the dataset size, and you will need to ensure that your dataset is large enough to train your model effectively. Note that the bare minimum dataset size is train_batch_size * gradient_accumulation_steps as well as more than vae_batch_size. The dataset will not be useable if it is too small.
ℹ️ With few enough images, you might see the message "no images detected in dataset"; increasing the repeats value will overcome this limitation.
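The minimum-size rule can be written as a one-line check. A minimal sketch with a hypothetical function name; it deliberately ignores the repeats setting, since how repeats interacts with these limits is not spelled out here.

```python
def dataset_is_large_enough(num_images, train_batch_size,
                            gradient_accumulation_steps, vae_batch_size):
    """Apply the bare-minimum rule: at least batch * accumulation steps,
    and strictly more images than vae_batch_size."""
    minimum = train_batch_size * gradient_accumulation_steps
    return num_images >= minimum and num_images > vae_batch_size


if __name__ == "__main__":
    # 10 images, batch size 1, no accumulation, VAE batch size 4.
    print(dataset_is_large_enough(10, 1, 1, 4))
    # 3 images cannot fill one accumulated step of 1 * 4.
    print(dataset_is_large_enough(3, 1, 4, 2))
```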
Depending on the dataset you have, you will need to set up your dataset directory and dataloader configuration file differently. In this example, we will be using pseudo-camera-10k as the dataset.
Create a --data_backend_config (config/multidatabackend.json) document containing this:
[
  {
    "id": "pseudo-camera-10k-flux",
    "type": "local",
    "crop": true,
    "crop_aspect": "square",
    "crop_style": "center",
    "resolution": 512,
    "minimum_image_size": 512,
    "maximum_image_size": 512,
    "target_downsample_size": 512,
    "resolution_type": "pixel_area",
    "cache_dir_vae": "cache/vae/flux/pseudo-camera-10k",
    "instance_data_dir": "datasets/pseudo-camera-10k",
    "disabled": false,
    "skip_file_discovery": "",
    "caption_strategy": "filename",
    "metadata_backend": "discovery"
  },
  {
    "id": "dreambooth-subject",
    "type": "local",
    "crop": false,
    "resolution": 1024,
    "minimum_image_size": 1024,
    "maximum_image_size": 1024,
    "target_downsample_size": 1024,
    "resolution_type": "pixel_area",
    "cache_dir_vae": "cache/vae/flux/dreambooth-subject",
    "instance_data_dir": "datasets/dreambooth-subject",
    "caption_strategy": "instanceprompt",
    "instance_prompt": "the name of your subject goes here",
    "metadata_backend": "discovery"
  },
  {
    "id": "dreambooth-subject-512",
    "type": "local",
    "crop": false,
    "resolution": 512,
    "minimum_image_size": 512,
    "maximum_image_size": 512,
    "target_downsample_size": 512,
    "resolution_type": "pixel_area",
    "cache_dir_vae": "cache/vae/flux/dreambooth-subject-512",
    "instance_data_dir": "datasets/dreambooth-subject",
    "caption_strategy": "instanceprompt",
    "instance_prompt": "the name of your subject goes here",
    "metadata_backend": "discovery"
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/flux",
    "disabled": false,
    "write_batch_size": 128
  }
]
ℹ️ Running 512px and 1024px datasets concurrently is supported, and could result in better convergence for Flux.
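The two dreambooth-subject entries above differ only in resolution, so they can be generated from one helper if you want to add more sizes. This is a sketch with a hypothetical function name; the field values mirror the example config, and the rule that only non-1024 resolutions get an id suffix is just how the example happens to be named.

```python
import json


def subject_backend(resolution, name="dreambooth-subject",
                    instance_prompt="the name of your subject goes here"):
    """Build one dataset entry shaped like the dreambooth-subject examples."""
    # The example uses a bare id at 1024px and a "-512" suffix at 512px.
    backend_id = name if resolution == 1024 else f"{name}-{resolution}"
    return {
        "id": backend_id,
        "type": "local",
        "crop": False,
        "resolution": resolution,
        "minimum_image_size": resolution,
        "maximum_image_size": resolution,
        "target_downsample_size": resolution,
        "resolution_type": "pixel_area",
        "cache_dir_vae": f"cache/vae/flux/{backend_id}",
        # Both resolutions read from the same image directory.
        "instance_data_dir": f"datasets/{name}",
        "caption_strategy": "instanceprompt",
        "instance_prompt": instance_prompt,
        "metadata_backend": "discovery",
    }


if __name__ == "__main__":
    entries = [subject_backend(1024), subject_backend(512)]
    print(json.dumps(entries, indent=2))
```

Each resolution gets its own VAE cache directory while sharing the source images, which matches the layout of the example.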
Then, create a datasets directory:
mkdir -p datasets
pushd datasets
huggingface-cli download --repo-type=dataset bghira/pseudo-camera-10k --local-dir=pseudo-camera-10k
mkdir dreambooth-subject
# place your images into dreambooth-subject/ now
popd
This will download about 10k photograph samples to your datasets/pseudo-camera-10k directory, which will be automatically created for you.
Your Dreambooth images should go into the datasets/dreambooth-subject directory.
You'll want to log in to WandB and HF Hub before beginning training, especially if you're using --push_to_hub and --report_to=wandb.
If you're going to be pushing items to a Git LFS repository manually, you should also run git config --global credential.helper store.
Run the following commands:
wandb login
and
huggingface-cli login
Follow the instructions to log in to both services.
From the SimpleTuner directory, one simply has to run:
./train.sh
This will begin the text embed and VAE output caching to disk.
For more information, see the dataloader and tutorial documents.
Note: It's unclear whether training on multi-aspect buckets works correctly for Flux at the moment. It's recommended to use crop_style=random and crop_aspect=square.
In ComfyUI, you'll need to put Flux through another node called AdaptiveGuider. One of the members from our community has provided a modified node (IdiotSandwichTheThird/ComfyUI-Adaptive-Guidan...) along with an example workflow.
Inferencing the CFG-distilled LoRA is as easy as using a lower guidance_scale around the value trained with.
The Dev model arrives guidance-distilled out of the box, which means it does a very straight shot trajectory to the teacher model outputs. This is done through a guidance vector that is fed into the model at training and inference time - the value of this vector greatly impacts what type of resulting LoRA you end up with:
- A value of 1.0 (the default) will preserve the initial distillation done to the Dev model
  - This is the most compatible mode
  - Inference is just as fast as the original model
  - Flow-matching distillation reduces the creativity and output variability of the model, as with the original Flux Dev model (everything keeps the same composition/look)
- A higher value (tested around 3.5-4.5) will reintroduce the CFG objective into the model
  - This requires the inference pipeline to have support for CFG
  - Inference is either 50% slower with no VRAM increase, or about 20% slower with a 20% VRAM increase when using batched CFG inference
  - However, this style of training improves creativity and model output variability, which might be required for certain training tasks
We can partially reintroduce distillation to a de-distilled model by continuing to tune the model using a vector value of 1.0. It will never fully recover, but it'll at least be more useable.
- Reintroducing CFG has the end impact of either:
  - Increasing inference latency by 2x when we sequentially calculate the unconditional output, e.g. with two separate forward passes
  - Increasing the VRAM consumption equivalently to using num_images_per_prompt=2 and receiving two images at inference time, accompanied by the same percent slowdown.
    - This is often a less extreme slowdown than sequential computation, but the VRAM use might be too much for most consumer training hardware.
    - This method is not currently integrated into SimpleTuner, but work is ongoing.
- Inference workflows for ComfyUI or other applications (e.g. AUTOMATIC1111) will need to be modified to also enable "true" CFG, which might not be currently possible out of the box.
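For readers unfamiliar with why "true" CFG needs a second forward pass, the sketch below shows the standard classifier-free guidance combination on plain lists of numbers. It is an illustration of the general technique, not SimpleTuner's or ComfyUI's actual pipeline code, and the function name is hypothetical.

```python
def cfg_combine(cond, uncond, guidance_scale):
    """Standard classifier-free guidance: push the unconditional prediction
    towards the conditional one by guidance_scale."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Stand-ins for the two model outputs; real predictions are tensors
    # obtained from a conditional and an unconditional forward pass.
    cond = [2.0, 4.0]
    uncond = [1.0, 1.0]
    print(cfg_combine(cond, uncond, 1.0))
    print(cfg_combine(cond, uncond, 3.0))
```

At a scale of 1.0 the result is just the conditional prediction, so the unconditional pass can be skipped entirely; any value above 1.0 requires both predictions, either from two sequential passes or from one batch of two.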
- Minimum 8bit quantisation is required for a 24G card to train this model - but 32G (V100) cards suffer a more tragic fate.
  - Without quantising the model, a rank-1 LoRA sits at just over 32GB of memory use, in a way that prevents a 32G V100 from actually working
  - Using the optimi-lion optimiser may reduce memory use just enough to make the V100 work.
- Quantising the model doesn't harm training
  - It allows you to push higher batch sizes and possibly obtain a better result
  - Behaves the same as full-precision training - fp32 won't make your model any better than bf16+int8.
- As usual, fp8 quantisation runs more slowly than int8 and might have a worse result due to the use of e4m3fn in Quanto
  - fp16 training similarly is bad for Flux; this model wants the range of bf16
  - e5m2 level precision is better at fp8, but we haven't looked into how to enable it yet. Sorry, H100 owners. We weep for you.
- When loading the LoRA in ComfyUI later, you must use the same base model precision as you trained your LoRA on.
- int4 is weird and really only works on A100 and H100 cards due to a reliance on custom bf16 kernels
- If you get SIGKILL after the text encoders are unloaded, this means you do not have enough system memory to quantise Flux.
  - Try loading with --base_model_precision=bf16, but if that does not work, you might just need more memory.
- Direct Schnell training really needs a bit more time in the oven - currently, the results do not look good
  - If you absolutely must train Schnell, try the x-flux trainer from X-Labs
  - Ostris' ai-toolkit uses a low-rank adapter probably pulled from OpenFLUX.1 as a source of CFG that can be inverted from the final result - this will probably be implemented here eventually after results are more widely available and tests have completed
- A LoRA trained on Dev will, however, run just fine on Schnell
- Dev+Schnell merge 50/50 just fine, and the LoRAs can possibly be trained from that, which will then run on Schnell or Dev
ℹ️ When merging Schnell with Dev in any way, the license of Dev takes over and it becomes non-commercial. This shouldn't really matter for most users, but it's worth noting.
- It's been reported that Flux trains similarly to SD 1.5 LoRAs
- However, a model as large as 12B has empirically performed better with lower learning rates.
  - LoRA at 1e-3 might totally roast the thing. LoRA at 1e-5 does nearly nothing.
- Ranks as large as 64 through 128 might be undesirable on a 12B model due to general difficulties that scale up with the size of the base model.
  - Try a smaller network first (rank-1, rank-4) and work your way up - they'll train faster, and might do everything you need.
  - If you're finding that it's excessively difficult to train your concept into the model, you might need a higher rank and more regularisation data.
- Other diffusion transformer models like PixArt and SD3 majorly benefit from --max_grad_norm and SimpleTuner keeps a pretty high value for this by default on Flux.
  - A lower value would keep the model from falling apart too soon, but can also make it very difficult to learn new concepts that venture far from the base model data distribution. The model might get stuck and never improve.
- Higher learning rates are better for LoKr (1e-3 with AdamW, 2e-4 with Lion)
  - Other algorithms need more exploration.
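The learning-rate guidance above can be captured as a simple sanity check before launching a run. This is a sketch only: the names are hypothetical, and the bounds are the two extremes mentioned in the list (1e-3 and 1e-5 for a standard LoRA), not tuned recommendations.

```python
# Extremes quoted above for a standard LoRA on Flux.
LORA_LR_TOO_HIGH = 1e-3
LORA_LR_TOO_LOW = 1e-5

# Starting points quoted above for LoKr, keyed by optimiser family.
LOKR_STARTING_LR = {"adamw": 1e-3, "lion": 2e-4}


def describe_lora_lr(learning_rate):
    """Classify a standard-LoRA learning rate against the quoted extremes."""
    if learning_rate >= LORA_LR_TOO_HIGH:
        return "likely too high"
    if learning_rate <= LORA_LR_TOO_LOW:
        return "likely too low"
    return "within the range worth trying"


if __name__ == "__main__":
    for lr in (1e-3, 1e-4, 1e-5):
        print(lr, "->", describe_lora_lr(lr))
```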
Flux will immediately absorb bad image artifacts. It's just how it is - a final training run on just high quality data may be required to fix it at the end.
When you do these things (among others), some square grid artifacts may begin appearing in the samples:
- Overtrain with low quality data
- Use too high of a learning rate
- Overtrain (in general) a low-capacity network with too many images
- Undertrain (also) a high-capacity network with too few images
- Use weird aspect ratios or training data sizes
- Training for too long on square crops probably won't damage this model too much. Go nuts, it's great and reliable.
- On the other hand, using the natural aspect buckets of your dataset might overly bias these shapes during inference time.
  - This could be a desirable quality, as it keeps aspect-dependent styles like cinematic stuff from bleeding into other resolutions too much.
  - However, if you're looking to improve results equally across many aspect buckets, you might have to experiment with crop_aspect=random which comes with its own downsides.
- Mixing dataset configurations by defining your image directory dataset multiple times has produced really good results and a nicely generalised model.
The users of Terminus Research who worked on this probably more than their day jobs to figure it out
Lambda Labs for generous compute allocations that were used for tests and verifications for large scale training runs
Especially @JimmyCarter and @kaibioinfo for coming up with some of the best ideas and putting them into action, offering pull requests and running exhaustive tests for analysis - even daring to use their own faces for DreamBooth experimentation.