
Add Z-Image Text-to-Image Generation Support#3261

Merged
ivarflakstad merged 9 commits intohuggingface:mainfrom
SpenserCai:z_image_support
Jan 2, 2026
Conversation

@SpenserCai
Contributor

@SpenserCai SpenserCai commented Dec 24, 2025

Summary

This PR adds support for Z-Image, Alibaba's ~6B-parameter text-to-image generation model based on Flow Matching. The implementation follows Candle's architecture conventions and includes the full inference pipeline.

Model Overview

Z-Image is a state-of-the-art text-to-image model featuring:

  • Transformer: ~6B-parameter DiT with 30 main layers + 2 noise refiner + 2 context refiner
  • Text Encoder: Qwen3-based encoder (outputs second-to-last hidden states)
  • VAE: AutoEncoderKL with diffusers format weights
  • Scheduler: FlowMatchEulerDiscreteScheduler with dynamic timestep shifting
  • Position Encoding: 3D RoPE (Frame/Height/Width axes)
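
To illustrate the scheduler entry above, here is a minimal sketch of a single Euler update in a flow-matching sampler, written in plain Rust over slices rather than Candle tensors. The function names `euler_step` and `linear_sigmas` are hypothetical and are not taken from this PR; the actual `scheduler.rs` may organise this differently.

```rust
/// One Euler step of a flow-matching sampler (illustrative sketch only).
/// `velocity` is the model's predicted velocity at noise level `sigma`.
fn euler_step(sample: &[f32], velocity: &[f32], sigma: f32, sigma_next: f32) -> Vec<f32> {
    // dt is negative when denoising from sigma = 1.0 towards sigma = 0.0.
    let dt = sigma_next - sigma;
    sample.iter().zip(velocity).map(|(x, v)| x + dt * v).collect()
}

/// Evenly spaced noise levels from 1.0 down to 0.0 (num_steps + 1 values).
/// Hypothetical helper; a real scheduler would also apply timestep shifting.
fn linear_sigmas(num_steps: usize) -> Vec<f32> {
    (0..=num_steps)
        .map(|i| 1.0 - i as f32 / num_steps as f32)
        .collect()
}
```

A sampling loop would walk consecutive pairs of sigmas and call `euler_step` once per pair, feeding each result back into the model.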

Model Links:

🔧 Usage Examples

Basic Usage (CUDA)

cargo run --features cuda --example z_image --release -- \
    --model-path weights/Z-Image-Turbo \
    --prompt "A beautiful landscape with mountains and a lake" \
    --width 1024 --height 768 \
    --num-steps 8

Using Metal (macOS)

cargo run --features metal --example z_image --release -- \
    --model-path weights/Z-Image-Turbo \
    --prompt "A futuristic city at night with neon lights" \
    --width 1024 --height 1024 \
    --num-steps 9

Files Changed

New Files

| File | Lines | Description |
| --- | --- | --- |
| candle-transformers/src/models/z_image/mod.rs | 34 | Module exports |
| candle-transformers/src/models/z_image/transformer.rs | 940 | Core Transformer (Config, TimestepEmbedder, RopeEmbedder, ZImageAttention, ZImageTransformerBlock, FinalLayer, ZImageTransformer2DModel) |
| candle-transformers/src/models/z_image/text_encoder.rs | 453 | Qwen3-based Text Encoder |
| candle-transformers/src/models/z_image/vae.rs | 684 | AutoEncoderKL (diffusers format) |
| candle-transformers/src/models/z_image/scheduler.rs | 237 | FlowMatchEulerDiscreteScheduler |
| candle-transformers/src/models/z_image/sampling.rs | 133 | Sampling utilities (noise generation, shift calculation) |
| candle-transformers/src/models/z_image/preprocess.rs | 169 | Input preprocessing (image postprocessing) |
| candle-examples/examples/z_image/main.rs | 393 | Complete inference example |
| candle-examples/examples/z_image/README.md | 128 | Example documentation |

Modified Files

| File | Change |
| --- | --- |
| candle-transformers/src/models/mod.rs | Added pub mod z_image; |

Implementation Highlights

1. Optimized Patchify/Unpatchify

The implementation uses optimized 6D tensor operations for the F=1 (single frame) case, avoiding Candle's 7D+ dimension limitations:

// Patchify: (B, C, 1, H, W) → (B, num_patches, patch_dim)
// Matches Python: permute(1, 3, 5, 2, 4, 6, 0)
let x = x.permute((0, 2, 4, 3, 5, 1))?;  // (B, H_t, W_t, pH, pW, C)
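
For readers who want to see what the permuted layout means element by element, here is a rough sketch of the same patchify ordering on a flat buffer, without Candle. It handles a single (C, H, W) image with a square patch size `p`. The function name and signature are illustrative assumptions, not the PR's API.

```rust
/// Patchify one (C, H, W) row-major image into (H/p * W/p) tokens.
/// Each token is laid out as (pH, pW, C), mirroring the target layout
/// (B, H_t, W_t, pH, pW, C) before the final reshape.
fn patchify(x: &[f32], c: usize, h: usize, w: usize, p: usize) -> Vec<Vec<f32>> {
    assert_eq!(x.len(), c * h * w, "buffer does not match (C, H, W)");
    assert!(h % p == 0 && w % p == 0, "H and W must be divisible by the patch size");
    let (h_t, w_t) = (h / p, w / p);
    let mut tokens = Vec::with_capacity(h_t * w_t);
    for ht in 0..h_t {
        for wt in 0..w_t {
            let mut token = Vec::with_capacity(p * p * c);
            for ph in 0..p {
                for pw in 0..p {
                    for ch in 0..c {
                        // Pixel coordinates of this patch element in the source image.
                        let (row, col) = (ht * p + ph, wt * p + pw);
                        token.push(x[ch * h * w + row * w + col]);
                    }
                }
            }
            tokens.push(token);
        }
    }
    tokens
}
```

The channel index varies fastest inside each token, which is what moving C to the last axis in the permute achieves.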

2. 3D RoPE Position Encoding

Implements 3D Rotary Position Embeddings with pre-computed sin/cos caches:

pub struct RopeEmbedder {
    axes_dims: Vec<usize>,  // [32, 48, 48] for Frame/H/W
    axes_lens: Vec<usize>,  // [1536, 512, 512] max positions
    cos_cached: Vec<Tensor>,
    sin_cached: Vec<Tensor>,
}
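
As a sketch of how such caches could be filled, the snippet below precomputes one cos/sin table per axis using the standard RoPE frequency formula. It uses plain `Vec<f64>` instead of `Tensor`, and the `theta` base is passed in as a parameter because the value used by the model is not stated here. Both function names are hypothetical.

```rust
/// Precompute cos/sin tables for one RoPE axis.
/// Returns (cos, sin), each of shape [max_len][axis_dim / 2].
fn rope_axis_cache(axis_dim: usize, max_len: usize, theta: f64) -> (Vec<Vec<f64>>, Vec<Vec<f64>>) {
    assert!(axis_dim % 2 == 0, "each axis needs an even number of channels");
    let half = axis_dim / 2;
    // Standard RoPE frequencies: theta^(-2i / axis_dim) for i in 0..half.
    let freqs: Vec<f64> = (0..half)
        .map(|i| theta.powf(-((2 * i) as f64) / axis_dim as f64))
        .collect();
    let mut cos: Vec<Vec<f64>> = Vec::with_capacity(max_len);
    let mut sin: Vec<Vec<f64>> = Vec::with_capacity(max_len);
    for pos in 0..max_len {
        cos.push(freqs.iter().map(|f| (pos as f64 * f).cos()).collect());
        sin.push(freqs.iter().map(|f| (pos as f64 * f).sin()).collect());
    }
    (cos, sin)
}

/// Build one (cos, sin) cache per axis, e.g. Frame, Height, Width.
fn build_caches(
    axes_dims: &[usize],
    axes_lens: &[usize],
    theta: f64,
) -> Vec<(Vec<Vec<f64>>, Vec<Vec<f64>>)> {
    axes_dims
        .iter()
        .zip(axes_lens)
        .map(|(&d, &l)| rope_axis_cache(d, l, theta))
        .collect()
}
```

With `axes_dims = [32, 48, 48]` the three axes contribute 16 + 24 + 24 = 64 rotation pairs, covering 128 channels in total.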

3. AdaLN Modulation with Tanh Gate

// Z-Image specific: tanh gate instead of sigmoid
let gate_msa = gate_msa.tanh()?;
let gate_mlp = gate_mlp.tanh()?;
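
A small sketch of what the tanh gate does to a residual update, again on plain slices. It assumes the gate scales a branch output (attention or MLP) before it is added back to the input; the helper name `gated_residual` is made up for this illustration.

```rust
/// Gated residual update: out = x + tanh(gate) * branch.
/// tanh bounds the effective gate to (-1, 1), so the branch contribution
/// can be scaled down or sign-flipped but never amplified.
fn gated_residual(x: &[f32], branch: &[f32], gate: &[f32]) -> Vec<f32> {
    assert!(x.len() == branch.len() && x.len() == gate.len());
    x.iter()
        .zip(branch)
        .zip(gate)
        .map(|((x, b), g)| x + g.tanh() * b)
        .collect()
}
```

A gate of zero leaves the input untouched, while a large positive gate approaches a plain residual connection.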

4. Dynamic Timestep Shifting

pub fn calculate_shift(seq_len: usize, base_seq: usize, max_seq: usize, base_shift: f64, max_shift: f64) -> f64 {
    // Convert to f64 before subtracting so that seq_len < base_seq cannot underflow usize.
    let m = (max_shift - base_shift) / (max_seq as f64 - base_seq as f64);
    base_shift + m * (seq_len as f64 - base_seq as f64)
}
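
One plausible way a shift value like this gets applied is the exponential time shift used by flow-matching schedulers with dynamic shifting, where the shift acts as `mu`. The sketch below assumes that form; whether `scheduler.rs` in this PR uses exactly this formula is an assumption, and `shifted_sigmas` is a hypothetical helper.

```rust
/// Exponential time shift: sigma' = exp(mu) / (exp(mu) + (1 / sigma - 1)).
/// `mu` is assumed to be the value produced by the shift calculation.
fn time_shift(mu: f64, sigma: f64) -> f64 {
    mu.exp() / (mu.exp() + (1.0 / sigma - 1.0))
}

/// Apply the shift to a linear schedule of `num_steps` noise levels in (0, 1].
fn shifted_sigmas(num_steps: usize, mu: f64) -> Vec<f64> {
    (0..num_steps)
        .map(|i| 1.0 - i as f64 / num_steps as f64) // 1.0 down to 1 / num_steps
        .map(|s| time_shift(mu, s))
        .collect()
}
```

A positive `mu` pushes intermediate noise levels upward, so larger images (longer sequences, larger shift) spend more of their steps at high noise.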

Image Size Requirements

Image dimensions must be divisible by 16:

  • ✅ 1024×1024, 1024×768, 768×1024, 512×512, 1280×720
  • ❌ 1920×1080 (1080 is not divisible by 16)

Latent size formula: latent = 2 × (image_size ÷ 16)
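
The size rule and the latent formula can be checked up front with a few lines of Rust. This is only a sketch of the arithmetic stated above; `latent_dims` is a hypothetical helper and not a function from the example binary.

```rust
/// Returns the latent (height, width) for an image size, or an error if
/// either dimension is not a positive multiple of 16.
fn latent_dims(width: usize, height: usize) -> Result<(usize, usize), String> {
    for (name, v) in [("width", width), ("height", height)] {
        if v == 0 || v % 16 != 0 {
            return Err(format!("{name} = {v} must be a positive multiple of 16"));
        }
    }
    // latent = 2 × (image_size ÷ 16), applied per dimension.
    Ok((2 * (height / 16), 2 * (width / 16)))
}
```

For example, a 1024×768 request maps to a 96×128 (height × width) latent, while 1920×1080 is rejected because 1080 ÷ 16 = 67.5.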

📝 Testing Status

| Test | Status |
| --- | --- |
| cargo check --features metal | ✅ Pass |
| cargo clippy --workspace --tests --examples --benches -- -D warnings | ✅ Pass |
| cargo fmt --all -- --check | ✅ Pass |
| Inference test (1024×768, Metal) | ✅ Pass |
| Inference test (1024×1024, Metal) | ✅ Pass |

Sample Output

Metal

[Image: sample output generated on Metal]

CUDA

[Image: sample output generated on CUDA]

Checklist

  • Code compiles without errors
  • Passes cargo clippy --workspace --tests --examples --benches -- -D warnings
  • Passes cargo fmt --all -- --check
  • Example runs successfully
  • README documentation added
  • Follows Candle architecture conventions
  • Weight mapping matches original implementation

References

Z-Image
Diffusers

Additional Fix: Clippy Warning in candle-nn

While implementing SDPA support for Z-Image, I discovered a minor clippy warning in candle-nn/src/ops.rs:1040 introduced by PR #3196. @EricLBuehler

Issue: clippy::nonminimal_bool warning

// Before
let supports_sdpa_full_mask = !self.mask.is_some() || q_seq <= k_seq;

// After
let supports_sdpa_full_mask = self.mask.is_none() || q_seq <= k_seq;

@SpenserCai SpenserCai mentioned this pull request Dec 24, 2025
@AlpineVibrations

awesome! stoked.

@SpenserCai
Contributor Author

Consistency Test

I also ran a consistency test, comparing ModelScope's online inference against the Rust example using the same prompt and CFG settings. The generated images are almost identical, which suggests that the Candle implementation is consistent with the original diffusers implementation.

[Image: consistency test output]

Member

@ivarflakstad ivarflakstad left a comment

This is great! 🙌
I've verified the output on cuda and it looks great.

Most of my comments are nits or just that documentation is slightly off. Solid work.

@SpenserCai
Copy link
Contributor Author

Thank you for your review. I will address the comments shortly.

SpenserCai and others added 3 commits January 2, 2026 11:50
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
Member

@ivarflakstad ivarflakstad left a comment

lgtm! 👌

@ivarflakstad ivarflakstad merged commit 3a0d1cb into huggingface:main Jan 2, 2026
9 checks passed
@SpenserCai SpenserCai deleted the z_image_support branch January 4, 2026 01:44