AI · 5 min read · April 25, 2026
StyleVAR: Autoregressive Style Transfer via Discrete Latent Codes
Researchers build conditional image synthesis into the VAR framework using blended cross-attention, achieving texture transfer while preserving content structure across multiple scales.
StyleVAR applies autoregressive modeling to style transfer by tokenizing images and conditioning generation on style and content signals through blended attention.
- Images are decomposed into multi-scale tokens via a VQ-VAE, then modeled autoregressively by a transformer.
- A blended cross-attention mechanism lets the target representation attend to its own history while style and content signals guide emphasis.
- A scale-dependent blending coefficient balances style texture against content structure at each generation stage.
- Two-stage training: supervised fine-tuning on triplet datasets, then reinforcement learning with a DreamSim reward.
- Outperforms the AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP metrics.
- Handles landscapes and architecture well; struggles with internet images and human faces due to content diversity gaps.
- The GRPO reinforcement stage improves perceptual metrics beyond the supervised baseline.
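The blended cross-attention with a scale-dependent coefficient can be sketched as follows. This is a minimal single-head NumPy illustration under stated assumptions, not the paper's implementation: the function names (`blended_cross_attention`, `scale_lambda`), the additive blending form, and the linear coefficient schedule are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def blended_cross_attention(q, self_kv, style_kv, content_kv, lam):
    """Hypothetical blended attention: target queries attend to their own
    token history plus style and content tokens, with coefficient `lam`
    shifting emphasis between style texture and content structure."""
    d = q.shape[-1]
    def attend(kv):
        scores = q @ kv.T / np.sqrt(d)   # scaled dot-product attention
        return softmax(scores) @ kv
    # self-history attention, plus a lam-weighted mix of style vs. content
    return attend(self_kv) + lam * attend(style_kv) + (1.0 - lam) * attend(content_kv)

def scale_lambda(k, num_scales, lam_max=0.8):
    """Assumed linear schedule: coarse scales (small k) favor content
    structure, fine scales favor style texture."""
    return lam_max * k / (num_scales - 1)
```

A caller would compute `lam = scale_lambda(k, K)` once per generation scale and pass it into the attention call for every token at that scale.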
Frequently asked questions
- How does StyleVAR differ from AdaIN? StyleVAR uses autoregressive discrete modeling in a learned latent space with blended cross-attention, whereas AdaIN operates directly in feature space via instance normalization. StyleVAR's multi-scale tokenization and reinforcement learning alignment enable better texture transfer while preserving content structure, and it outperforms AdaIN on perceptual metrics such as LPIPS and DreamSim.
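For contrast, the AdaIN baseline mentioned above is a feature-space operation: it aligns the per-channel mean and standard deviation of content features to those of style features (Huang & Belongie, 2017). A minimal NumPy sketch, with the `(channels, H, W)` feature layout as an assumption:

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization: normalize each content channel,
    then rescale and shift it to match the style channel's statistics.
    Features are assumed to have shape (channels, H, W)."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True)
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    return s_std * (content_feat - c_mean) / (c_std + eps) + s_mean
```

Because AdaIN only matches first- and second-order channel statistics, it cannot model the longer-range texture dependencies that StyleVAR's autoregressive token modeling captures.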