AI · 8 min read · April 17, 2026
Three-Phase Transformer: Structural Prior for Decoder Efficiency
A residual-stream architecture using cyclic channel partitioning and phase-aligned rotations achieves a 7.2% perplexity reduction with minimal parameter overhead.
Three-Phase Transformer partitions hidden vectors into cyclic channels processed with phase-respecting operations, improving perplexity at negligible parameter cost.
- Hidden vector split into N equally-sized cyclic channels, each with its own per-channel RMSNorm (see the sketch after this list).
- 2D Givens rotations between the attention and FFN sublayers rotate channel i by θ + i·(2π/N).
- Fixed Gabriel's horn profile injected into a DC subspace, orthogonal to RoPE's relative positioning (sketched after the FAQ below).
- At 123M parameters, achieves a 7.2% perplexity reduction over a RoPE baseline with only 1,536 extra parameters.
- Converges 1.93x faster in steps and 1.64x faster in wall-clock time on WikiText-103.
- The N=3 phase design is borrowed from balanced three-phase AC electrical systems.
- Channel partitioning acts as a self-stabilizing equilibrium without explicit geometric enforcement.
- Rotation angle drift follows a U-shaped profile across the 12 layers.
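The post doesn't include the authors' code, so here is a minimal PyTorch sketch of the channel partitioning, per-channel RMSNorm, and phase-offset Givens rotations. The pairing of adjacent coordinates for the rotations, the learned per-layer base angle θ, and all class and function names are assumptions for illustration, not the paper's implementation (`nn.RMSNorm` requires PyTorch ≥ 2.4).

```python
import torch
import torch.nn as nn


class PhaseRotate(nn.Module):
    """Phase-offset 2D Givens rotations: channel i is rotated by
    theta + i * (2*pi / n_channels). Assumption: the rotation acts on
    each adjacent (even, odd) coordinate pair within a channel, and
    theta is a learned per-layer scalar."""

    def __init__(self, n_channels: int = 3):
        super().__init__()
        self.n_channels = n_channels
        self.theta = nn.Parameter(torch.zeros(1))  # base rotation angle

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden); hidden must split evenly into channels.
        out = []
        for i, c in enumerate(x.chunk(self.n_channels, dim=-1)):
            angle = self.theta + i * (2 * torch.pi / self.n_channels)
            cos, sin = torch.cos(angle), torch.sin(angle)
            even, odd = c[..., 0::2], c[..., 1::2]
            # Rotate each (even, odd) pair and re-interleave.
            rot = torch.stack(
                (cos * even - sin * odd, sin * even + cos * odd), dim=-1
            ).flatten(-2)
            out.append(rot)
        return torch.cat(out, dim=-1)


class ThreePhaseBlock(nn.Module):
    """One decoder block with per-channel RMSNorm and a phase rotation
    between the attention and FFN sublayers (sketch, causal mask omitted)."""

    def __init__(self, d_model: int = 768, n_heads: int = 12, n_channels: int = 3):
        super().__init__()
        d_ch = d_model // n_channels
        self.norms = nn.ModuleList([nn.RMSNorm(d_ch) for _ in range(n_channels)])
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.rotate = PhaseRotate(n_channels)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Per-channel RMSNorm before attention.
        chunks = x.chunk(len(self.norms), dim=-1)
        h = torch.cat([n(c) for n, c in zip(self.norms, chunks)], dim=-1)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Phase-aligned rotation between attention and FFN.
        x = self.rotate(x)
        return x + self.ffn(x)
```

Chunking along the last dimension keeps each partition contiguous; in this sketch the cyclic structure enters only through the phase offsets i·(2π/N), which stagger the three channels evenly around the circle in the spirit of balanced three-phase AC.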
Frequently asked
- Three-Phase Transformer (3PT) partitions the hidden vector into N equally-sized cyclic channels, each processed with phase-respecting operations including per-channel normalization and 2D Givens rotations. Unlike standard Transformers, it injects a fixed geometric structure (Gabriel's horn profile) into a DC subspace orthogonal to RoPE. Rather than bolting on external modules, this structural prior improves convergence and perplexity with minimal parameter overhead.
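The post doesn't specify how the Gabriel's horn profile is parameterized or which coordinates form the DC subspace. The sketch below assumes the profile is the horn's radius 1/(1+pos) over token positions and that the last few hidden coordinates (left untouched by RoPE) serve as the DC subspace; `dc_dim` and both function names are hypothetical.

```python
import torch


def gabriels_horn_profile(seq_len: int, dc_dim: int) -> torch.Tensor:
    """Fixed Gabriel's horn radius profile (y = 1/x for x >= 1) over
    positions, broadcast into a small DC subspace. Assumption: value
    1/(1 + pos) per token; the paper's exact form may differ."""
    pos = torch.arange(seq_len, dtype=torch.float32)
    radius = 1.0 / (1.0 + pos)
    return radius[:, None].expand(seq_len, dc_dim)  # (seq, dc_dim)


def inject_dc(x: torch.Tensor, dc_dim: int = 4) -> torch.Tensor:
    """Add the fixed profile to the last dc_dim coordinates, leaving the
    RoPE-rotated coordinates untouched (illustrative subspace choice)."""
    profile = gabriels_horn_profile(x.size(1), dc_dim).to(x.dtype)
    x = x.clone()
    x[..., -dc_dim:] = x[..., -dc_dim:] + profile
    return x
```

Because the profile occupies coordinates that RoPE never rotates, it carries absolute positional structure without interfering with RoPE's relative encoding, which is how the post describes the two coexisting.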