Z-Image Turbo LoRA Training Guide | Create Custom LoRAs with AI Toolkit

Once you understand the basics of LoRA, the next step is creating your own. This guide walks you through training a LoRA for Z-Image Turbo using Ostris AI Toolkit, from dataset preparation to inference testing.

What You’ll Learn

  • Required environment and setup for Z-Image Turbo LoRA training
  • AI Toolkit installation steps
  • Dataset preparation methods and best practices
  • Training parameter configuration and execution
  • Using trained LoRAs in ComfyUI
  • Measured timings on Apple Silicon (MPS)

Prerequisites

  • Understanding of LoRA basics
  • Basic ComfyUI operation skills
  • Basic terminal/command line familiarity

Key Concepts for Z-Image Turbo LoRA Training

Why a Training Adapter Is Needed

Z-Image Turbo is a distilled model. While standard models require 20-50 steps to generate an image, Z-Image Turbo is optimized to generate in just 8 steps.

This distillation is efficient, but it creates a problem for LoRA training. Training a LoRA directly on a distilled model breaks the fast generation capability acquired through distillation. This is called “Turbo Drift.”

The Training Adapter solves this by temporarily reversing the distillation effect during training. At inference time, you remove the adapter and use only the LoRA.

Training Approaches

| Method | Inference Speed | Difficulty | Notes |
|---|---|---|---|
| Turbo + Training Adapter v2 | 8 steps (fastest) | Low | Recommended for beginners; most popular |
| De-Turbo model training | 20-30 steps | Medium | No adapter needed; better for extended fine-tuning |
| Base model training | High quality but slower | High | Best likeness according to community reports |

This guide uses the most common Turbo + Training Adapter v2 approach.

Environment Setup

Requirements

| Item | Minimum | Recommended |
|---|---|---|
| GPU (NVIDIA) | 12GB VRAM | 24GB VRAM |
| GPU (Apple Silicon) | 32GB unified memory | 64GB unified memory |
| Python | 3.10+ | 3.10-3.11 |
| PyTorch | 2.0+ | 2.8+ (MPS support) |
| Disk | 50GB | 100GB+ |
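
You can sanity-check the Python and PyTorch side of these requirements before installing anything. A minimal sketch (the thresholds in the comments mirror the table above; nothing here is specific to AI Toolkit):

Environment Check
import sys
import torch

print(f"Python  : {sys.version.split()[0]}")  # want 3.10-3.11
print(f"PyTorch : {torch.__version__}")       # want 2.0+, ideally 2.8+

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"CUDA GPU: {props.name}, {vram_gb:.0f}GB VRAM (12GB min, 24GB recommended)")
elif torch.backends.mps.is_available():
    print("Apple Silicon MPS available (32GB unified memory min, 64GB recommended)")
else:
    print("No CUDA or MPS device found; CPU-only training is not practical")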

Required Models

| Model | Size | Purpose |
|---|---|---|
| Z-Image Turbo BF16 | ~12GB | Base model |
| Training Adapter v2 | ~324MB | De-distillation adapter |
| Qwen 3 4B | included | Text encoder |
| VAE (ae.safetensors) | included | Image encode/decode |

AI Toolkit Setup

AI Toolkit Installation
git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
git submodule update --init --recursive
pip install -r requirements.txt

Download Training Adapter v2:

Training Adapter v2 Download
# Download from HuggingFace:
# ostris/zimage_turbo_training_adapter repository
# zimage_turbo_training_adapter_v2.safetensors (324MB)
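
If you prefer to script the download, the huggingface_hub client can fetch the adapter directly. A minimal sketch; the repo ID and filename match the assistant_lora_path used in the training config later in this guide:

Training Adapter v2 Download (Python)
from huggingface_hub import hf_hub_download

# Downloads to the local HuggingFace cache and returns the file path
path = hf_hub_download(
    repo_id="ostris/zimage_turbo_training_adapter",
    filename="zimage_turbo_training_adapter_v2.safetensors",
)
print(f"Adapter saved to: {path}")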

Setup Timings (Measured)

| Step | Time |
|---|---|
| AI Toolkit clone | ~10 sec |
| pip install | ~20 sec (depends on dependencies) |
| Training Adapter v2 download | ~5 sec (depends on connection) |
| Total | ~35 sec |

Dataset Preparation

Image Requirements

The quality of your LoRA depends entirely on your dataset.

| Purpose | Recommended Count | Notes |
|---|---|---|
| Minimum test | 5-15 images | For verification only |
| Style training | 30-120 images | ~45 is a good balance |
| High-quality character | 70-80 images | Reproduces skin texture |

Dataset rules:

  • Resolution: 1024px+ recommended (512px works but lower quality)
  • Diversity: Include different poses, angles, expressions, backgrounds
  • Consistency: Keep the learning target (subject identity, etc.) consistent
  • Backgrounds: For character LoRAs, vary the backgrounds
  • Avoid: Blurry, low-res, watermarked, or multi-subject images

Composition Distribution (Character LoRA)

| Framing | Proportion | Reason |
|---|---|---|
| Close-up (face-centered) | 40-50% | Prioritize facial features |
| Medium shot (upper body) | 30-40% | Learn body type and clothing |
| Full body | 10-20% | Overall proportions |

Creating Captions

For each image, create a caption file (.txt) with the same base name.

datasets/
└── my_dataset/
    ├── image1.jpg
    ├── image1.txt  ← "sks dog, a photo of a cute shiba inu dog"
    ├── image2.jpg
    ├── image2.txt
    └── ...

The trigger word (e.g., sks) should be a unique string that doesn’t conflict with existing vocabulary. Include it in all captions to activate the LoRA effect during inference.
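
Caption mistakes (a missing .txt file, a forgotten trigger word) are cheap to catch now and expensive to discover mid-training. A minimal pre-flight sketch, assuming the folder layout above and sks as the trigger word:

Dataset Pre-Flight Check
from pathlib import Path

DATASET = Path("datasets/my_dataset")  # adjust to your dataset folder
TRIGGER = "sks"                        # must appear in every caption

image_exts = {".jpg", ".jpeg", ".png", ".webp"}
for img in sorted(DATASET.iterdir()):
    if img.suffix.lower() not in image_exts:
        continue
    caption = img.with_suffix(".txt")
    if not caption.exists():
        print(f"MISSING CAPTION: {img.name}")
    elif TRIGGER not in caption.read_text(encoding="utf-8"):
        print(f"NO TRIGGER WORD: {caption.name}")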

Training Configuration

YAML Config

AI Toolkit manages training settings via YAML files. Template for Z-Image Turbo:

Z-Image Turbo LoRA Training Config (24GB GPU)
---
job: extension
config:
  name: "my_zimage_lora_v1"
  process:
    - type: 'sd_trainer'
      training_folder: "output"
      device: cuda:0
      trigger_word: "sks"
      network:
        type: "lora"
        linear: 16
        linear_alpha: 16
      save:
        dtype: float16
        save_every: 250
        max_step_saves_to_keep: 4
      datasets:
        - folder_path: "/path/to/your/dataset"
          caption_ext: "txt"
          caption_dropout_rate: 0.05
          cache_latents_to_disk: true
          resolution: [ 512, 768, 1024 ]
      train:
        batch_size: 1
        steps: 3000
        gradient_accumulation_steps: 1
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: "flowmatch"
        optimizer: "adamw8bit"
        lr: 1e-4
        ema_config:
          use_ema: true
          ema_decay: 0.99
        dtype: bf16
      model:
        name_or_path: "Tongyi-MAI/Z-Image-Turbo"
        arch: "zimage"
        assistant_lora_path: "ostris/zimage_turbo_training_adapter/zimage_turbo_training_adapter_v2.safetensors"
        quantize: true
      sample:
        sampler: "flowmatch"
        sample_every: 250
        width: 1024
        height: 1024
        prompts:
          - "sks, portrait photo, natural lighting"
        seed: 42
        guidance_scale: 1
        sample_steps: 8

Key Parameters

| Parameter | Recommended | Description |
|---|---|---|
| linear (rank) | 8-16 | LoRA rank. Higher = more expressive but larger file |
| lr | 1e-4 to 5e-5 | Learning rate. Too high = overfitting; too low = underfitting |
| steps | 3,000-5,600 | Total steps. Adjust based on dataset size |
| batch_size | 1-2 | Use 1 for small datasets |
| optimizer | adamw8bit | Memory efficient. Use adamw on Apple Silicon |
| resolution | [512, 768, 1024] | Multi-resolution bucketing for size variety |
| cache_latents_to_disk | true | Cache VAE encodings to disk for speed |
| gradient_checkpointing | true | Required for VRAM savings (24GB or less) |

Apple Silicon (MPS) Notes

When running AI Toolkit on Apple Silicon (MPS), the following config changes are needed (a script for applying them is sketched after this list):

  • device: Change to mps:0
  • optimizer: adamw (adamw8bit is CUDA-only)
  • quantize: false (MPS doesn’t support quantization; 64GB unified memory allows unquantized training)
  • num_workers: 0 (add to dataset config; MPS tensors don’t support multiprocess sharing)
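
Rather than hand-editing a copy of the config, you can apply these overrides with a short script. A sketch assuming PyYAML is installed and the config follows the layout shown above (a process list containing train, model, and datasets sections):

MPS Config Patch
import yaml

# Load the CUDA config and rewrite the MPS-incompatible settings
with open("config/your_config.yaml") as f:
    cfg = yaml.safe_load(f)

proc = cfg["config"]["process"][0]
proc["device"] = "mps:0"              # use the Apple Silicon GPU
proc["train"]["optimizer"] = "adamw"  # adamw8bit is CUDA-only
proc["model"]["quantize"] = False     # quantization is unsupported on MPS
for ds in proc["datasets"]:
    ds["num_workers"] = 0             # MPS tensors can't be shared across worker processes

with open("config/your_config_mps.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)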

Running Training

Training Command
cd ai-toolkit
python run.py config/your_config.yaml

When training starts, the following steps execute in order:

  1. Model loading: Load transformer, text encoder, and VAE
  2. Training Adapter merge: Integrate the de-distillation adapter
  3. LoRA network creation: Build training network at specified rank
  4. Latent caching: Save VAE-encoded dataset images to disk
  5. Training loop: Execute training for the specified number of steps

Monitoring Training

Monitor loss values in the logs. Normal training shows a gradual decrease in loss.
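
One simple trend check is to compare the average loss over the first and last N steps, which is the same comparison used in the measured results below. A sketch, assuming you have copied per-step loss values out of the training log:

Loss Trend Check
# Replace with per-step loss values copied from your training log
losses = [0.45, 0.38, 0.52, 0.31, 0.60, 0.41, 0.29, 0.55]

n = min(20, len(losses) // 2)  # window size; the measured results below use 20
first = sum(losses[:n]) / n
last = sum(losses[-n:]) / n
print(f"avg loss, first {n} steps: {first:.3f}; last {n} steps: {last:.3f}")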

Measured Results: 100 Steps with 5 Training Images

We ran 100 steps of training on Apple Silicon M4 Pro (64GB) and performed the following verification.

Loss trend: The average loss over the first 20 steps was 0.383; over the last 20 steps it was also 0.383, so no meaningful decrease occurred within 100 steps. Individual step losses varied widely (0.21-0.60).

LoRA weight changes: LoRA B matrices (initialized at 0) moved to a mean norm of 0.16, confirming that gradient updates did occur.

Inference impact: Comparing images generated with the same prompt and seed, with and without LoRA, 98% of pixels showed differences. However, the mean difference was only 3.4/255 — the composition and subject were identical, with only subtle texture and color tone changes.
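
This kind of pixel-level comparison takes only a few lines of NumPy. A sketch, assuming two same-size images generated with and without the LoRA (the file names are placeholders):

With/Without LoRA Image Diff
import numpy as np
from PIL import Image

# int16 avoids uint8 wraparound when subtracting
base = np.asarray(Image.open("sample_no_lora.png").convert("RGB"), dtype=np.int16)
lora = np.asarray(Image.open("sample_with_lora.png").convert("RGB"), dtype=np.int16)

diff = np.abs(base - lora)
changed = (diff.max(axis=-1) > 0).mean() * 100  # % of pixels with any channel change
print(f"pixels changed : {changed:.1f}%")
print(f"mean difference: {diff.mean():.1f}/255")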

Conclusion: 5 images and 100 steps are sufficient to verify the pipeline works, but not enough to learn subject identity (e.g., steering toward the training data’s Shiba Inu). For practical LoRAs, we recommend at least 15+ images and 1,000+ steps.

Signs of overfitting:

  • Extremely low loss values
  • Sample images identical to training data
  • No response to prompt changes

Solutions: Reduce steps, lower learning rate, add more data.

Measured Training Speed (Apple Silicon M4 Pro, 64GB)

| Process | Time |
|---|---|
| Model loading | ~20 sec |
| Training Adapter merge | ~2 sec |
| Text encoder loading | ~1 sec |
| Latent caching (5 images × 3 resolutions) | ~15 sec |
| Per step | ~25 sec (512-1024px mixed) |
| 100 steps | ~42 min |
| 500 steps | ~3.5 hours |
| 3,000 steps | ~21 hours |

With an NVIDIA RTX 4090, expect ~10-15 seconds per step, completing 3,000 steps in roughly 8-12 hours.

Using Trained LoRAs

ComfyUI Inference

After training completes, place the .safetensors file from the output directory into ComfyUI’s models/loras/.

ComfyUI workflow structure:

UNETLoader (Z-Image Turbo)
  ↓ MODEL
LoRA Loader (trained LoRA)
  ↓ MODEL              ↓ CLIP
  KSampler ← CLIPTextEncode (prompt with trigger word)
  ↓ LATENT
VAEDecode → SaveImage

Strength Adjustment

  • LoRA strength: Start at 0.5-0.8 and adjust
  • If the effect is too strong and causes artifacts, reduce to ~0.5
  • If the effect is too weak or invisible, increase to 0.9-1.0

Stacking with Existing LoRAs

Multiple LoRAs can be combined:

Style LoRA (0.6) + Character LoRA (0.3) = Total 0.9

Keep total weight below 1.0 for best results.

Troubleshooting

Common Issues

| Issue | Cause | Solution |
|---|---|---|
| Only generates training images | Overfitting | Reduce steps, lower LR, add more data |
| LoRA has no visible effect | Undertrained | Increase steps, raise LR |
| Black/noisy samples | Config error | Verify cfg=1, steps=8 |
| MPS DataLoader error | Multiprocess unsupported | Set num_workers: 0 |
| Out of Memory | Model too large | Set quantize: true, lower resolution |

Useful Resources