AI · 5 min read · April 25, 2026

StyleVAR: Autoregressive Style Transfer via Discrete Latent Codes

Researchers build conditional image synthesis into VAR framework using blended cross-attention, achieving texture transfer while preserving content structure across multiple scales.

Source: arxiv/cs.AI · Liqi Jing, Dingming Zhang, Peinian Li, Lichen Zhu · open original ↗

StyleVAR applies autoregressive modeling to style transfer by tokenizing images and conditioning generation on style and content signals through blended cross-attention.

  • Images are decomposed into multi-scale tokens via a VQ-VAE, then modeled autoregressively by a transformer.
  • A blended cross-attention mechanism lets the target representation attend to its own history while style and content signals guide the emphasis.
  • A scale-dependent blending coefficient balances style texture against content structure at each generation stage.
  • Two-stage training: supervised fine-tuning on triplet datasets, followed by reinforcement learning with a DreamSim-based reward.
  • Outperforms the AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP metrics.
  • Handles landscapes and architecture well; struggles with internet images and human faces due to gaps in content diversity.
  • The GRPO reinforcement stage improves perceptual metrics beyond the supervised baseline.
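The blended cross-attention and scale-dependent coefficient described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact blending form is an assumption, with a hypothetical coefficient `lam` that weights style-token attention against content-token attention while the target always attends to its own history.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention (single head, no projections)."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def blended_cross_attention(target, style, content, lam):
    """Hypothetical blending form: the target tokens attend to their own
    history (self-attention) plus a convex mix of style and content
    cross-attention. `lam` in [0, 1] is the scale-dependent coefficient:
    higher lam emphasizes style texture, lower lam content structure."""
    self_out = attention(target, target, target)       # own token history
    style_out = attention(target, style, style)        # style guidance
    content_out = attention(target, content, content)  # content guidance
    return self_out + lam * style_out + (1.0 - lam) * content_out

rng = np.random.default_rng(0)
d = 16
target = rng.standard_normal((8, d))    # tokens generated so far at this scale
style = rng.standard_normal((12, d))    # style image tokens
content = rng.standard_normal((12, d))  # content image tokens

# Coarse scales weight content structure; fine scales weight style texture.
for lam in (0.2, 0.5, 0.8):
    out = blended_cross_attention(target, style, content, lam)
    print(lam, out.shape)
```

At `lam = 0` the block reduces to self-attention plus pure content guidance, and at `lam = 1` to self-attention plus pure style guidance, which is the trade-off the per-scale coefficient is meant to schedule.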

Frequently asked

  • How does StyleVAR differ from AdaIN? StyleVAR uses autoregressive discrete modeling in a learned latent space with blended cross-attention, whereas AdaIN operates in feature space via instance normalization. StyleVAR's multi-scale tokenization and reinforcement learning alignment enable better texture transfer while preserving content structure, and it outperforms AdaIN on perceptual metrics like LPIPS and DreamSim.
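For contrast, the AdaIN baseline mentioned above is a one-line feature-space operation (Huang & Belongie, 2017): content features are re-normalized channel-wise to match the style features' mean and standard deviation. A minimal numpy sketch:

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive Instance Normalization: shift/scale each channel of the
    content features to the style features' statistics. Shapes: (C, H, W)."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    return s_std * (content_feat - c_mean) / c_std + s_mean

rng = np.random.default_rng(1)
content = rng.standard_normal((3, 8, 8))
style = 2.0 * rng.standard_normal((3, 8, 8)) + 5.0

out = adain(content, style)
# Per-channel means of the output now match the style features' means.
print(np.allclose(out.mean(axis=(1, 2)), style.mean(axis=(1, 2))))  # True
```

Because AdaIN matches only first- and second-order channel statistics in a continuous feature space, it has no mechanism for the token-level, multi-scale conditioning that StyleVAR's discrete autoregressive formulation provides.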
