arXiv 2026

Asymmetric Flow Models

Scalable high-dimensional pixel generation. Latent-to-pixel finetuning beats its latent base.

AsymFLUX.2 klein generated image grid

AsymFLUX.2 klein generations. AsymFlow finetunes FLUX.2 klein into a pixel-space flow model, producing highly realistic images with rich visual styles and fine detail.

Abstract

Flow-based generation in high-dimensional pixel spaces is difficult because velocity prediction requires modeling high-dimensional noise, even when data has strong low-rank structure. We present Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while keeping data prediction full-dimensional. From this asymmetric prediction, AsymFlow analytically recovers the full-dimensional velocity without changing the network architecture or training and sampling procedures. On ImageNet 256×256, AsymFlow achieves a leading 1.57 FID, outperforming prior DiT/JiT-like pixel diffusion models by a large margin. AsymFlow also provides the first route for finetuning pretrained latent flow models into pixel-space models: aligning the low-rank pixel subspace to the latent space gives a seamless initialization that preserves the latent model's high-level semantics and structure, so finetuning mainly improves low-level mismatches rather than relearning pixel generation. Finetuned from FLUX.2 klein, AsymFLUX.2 klein establishes a new state of the art for pixel-space text-to-image generation, beating its latent base on HPSv3, DPG-Bench, and GenEval while qualitatively showing substantially improved visual realism.

1.57 FID

#1 in ImageNet 256 pixel flow models

AsymFlow achieves 1.76 FID with JiT-H/16 network and 1.57 FID with an additional REPA loss, outperforming prior DiT/JiT-like pixel diffusion models by a large margin.

10.66 HPSv3

#1 on HPSv3 text-to-image benchmark

Finetuned pixel AsymFLUX.2 klein shows improved visual styles and realism, beating the latent FLUX.2 klein base model on HPSv3, DPG-Bench, and GenEval.

Rank-Asymmetric Flow Parameterization

Standard velocity prediction asks the network to predict \(u = \epsilon - x_0\), so both the data term and the noise term are full-dimensional. In uncompressed pixel space, this forces a transformer to represent high-dimensional noise that wastes model capacity.

AsymFlow changes the target to asymmetric velocity \(u_\mathrm{A} = P\epsilon - x_0\), where \(P\) is a patch-wise low-rank projector. The model predicts noise in a low-rank subspace and data in the full-dimensional space, then recovers the full velocity analytically. This reduces the burden of noise prediction.

AsymFlow parameterization and velocity recovery diagram

AsymFlow keeps the data term full-dimensional while restricting noise to a low-rank subspace.

Orthogonal Component View

\(u_\mathrm{A}\) can be decomposed into the low-rank subspace \(\mathrm{Im}(P)\) and its orthogonal complement \(\mathrm{Im}(I-P)\). The model effectively predicts \(u\) in the low-rank subspace and \(x_0\) in the orthogonal complement.

Thus, \(x_0\)-prediction is the zero-rank special case, and \(u\)-prediction is the full-rank special case. A small but non-zero rank is optimal as it retains the numerical stability and controllability of \(u\)-prediction in a meaningful subspace while reducing noise prediction difficulty.

AsymFlow parameterization family

Varying the rank yields a parameterization family whose endpoints recover full \(x_0\)-prediction and full \(u\)-prediction.

Finetuning Latent Models into Pixel Models

By aligning the patch-wise low-rank pixel subspace with the latent space, AsymFlow can initialize a pixel model directly from a pretrained latent model. The initialized pixel denoising trajectory is coupled to the latent trajectory, so it already preserves high-level semantics and structure. Finetuning therefore mainly corrects low-level details and texture, rather than relearning text-to-image generation.

Using this initialization, we finetune FLUX.2 klein 9B into AsymFLUX.2 klein for pixel-space text-to-image generation. A variance-reduced flow matching objective with perceptual correction is adopted to improve fine detail and texture.

Latent-to-pixel initialization comparison

The lifted low-rank pixel initialization preserves the semantics and structure of the latent model.

Results

ImageNet pixel diffusion comparison chart

One ImageNet, AsymFlow improves over the \(x_0\)-prediction JiT baseline by a clear margin, reaching the best FID with patch rank 8.

ImageNet 256×256 Pixel Diffusion Comparison

Method Pred. (±) Params GFLOPs FID↓
Hierarchical CNNs (skip connections / U-Net-like)
ADM-G \(\epsilon\) 554M 2240 4.59
Hierarchical transformers (skip connections / U-ViT-like)
RIN \(\epsilon\) 320M 668 3.42
SiD, UViT/2 \(\epsilon\) 2B 1110 2.44
VDM++, UViT/2 \(\epsilon\) 2B 1110 2.12
SiD2, UViT/2 \(\epsilon\) - 274 1.73
EPG-G/16 \(x_0\) 1.4B 642 1.58
SiD2, UViT/1 \(\epsilon\) - 1306 1.38
Hierarchical transformers (decoder head / DDT-like)
PixNerd-XL/16 \(\epsilon - x_0\) 700M 268 2.15
DiP-XL/16 \(\epsilon - x_0\) 631M - 1.79
DeCo-XL/16 \(\epsilon - x_0\) 682M 245 1.62
PixelDiT-XL/16 \(\epsilon - x_0\) 797M 311 1.61
Plain transformers (DiT-like)
PixelFlow-XL/4 \(\epsilon - x_0\) 677M 5818 1.98
JiT-H/16 \(x_0\) 953M 363 1.86*
PixelGen-XL/16 \(x_0\) 676M 260 1.83
JiT-G/16 \(x_0\) 2B 766 1.82*
PixelREPA-H/16 \(x_0\) 953M 363 1.81*
AsymFlow-H/16 \(P\epsilon - x_0\) 953M 363 1.57

FLOP estimation follows the convention in PixelDiT. * denotes JiT evaluation protocol, which may have slightly better FID than ADM.

System-Level Text-to-Image Diffusion Comparison

Method HPSv3↑ DPG↑ GenEval↑
Latent diffusion models
SDXL 8.20 74.7 0.55
PixArt-\(\Sigma\) 9.37 80.5 0.54
Hunyuan-DiT 8.19 78.9 0.63
FLUX.1 dev 10.43 84.0 0.67
Qwen-Image 9.52 87.8 0.86
FLUX.2 klein Base 9.50 85.2 0.80
Pixel diffusion models
PixelDiT-T2I 8.95 83.5 0.74
AsymFLUX.2 klein 10.66 86.8 0.82

All text-to-image evaluations generate 1024×1024 images.

Qualitative comparison of text-to-image models

Pixel AsymFLUX.2 klein produces more realistic styles and textures that distinguish it from latent diffusion models.

Interactive Demo

If the embedded demo does not load, please visit the HuggingFace Space.

Citation

BibTeX
@article{chen2026asymmetric,
  title={Asymmetric Flow Models},
  author={Hansheng Chen and Jan Ackermann and Minseo Kim and Gordon Wetzstein and Leonidas Guibas},
  journal={arXiv preprint arXiv:2605.12964},
  url={https://arxiv.org/abs/2605.12964},
  year={2026},
}