AI · 8 min read · 24 April 2026

GEM activation functions match ReLU speed with smoother gradients

Krause proposes rational activation functions with tunable smoothness that reduce optimization friction in deep networks while maintaining computational efficiency.

Source: arxiv/cs.AI · Eylon E. Krause · open original ↗

Krause introduces GEM, a family of smooth rational activation functions that approximate ReLU performance with better gradient flow for deep architectures.

  • GEM uses log-logistic CDF gating to achieve C^{2N}-smoothness without sacrificing ReLU-like behavior.
  • Three variants: base GEM, E-GEM (epsilon-parameterized for arbitrary L^p approximation), SE-GEM (piecewise with smooth junctions).
  • N=1 smoothness optimal for standard CNNs; N=2 preferred for transformers, revealing architecture-dependent tradeoffs.
  • On CIFAR-100 with ResNet-56, E-GEM narrows the accuracy deficit relative to GELU from 6.10% to 0.62%.
  • SE-GEM surpasses GELU on CIFAR-10 (92.51% vs 92.44%), first GEM variant to outperform GELU baseline.
  • GPT-2 (124M) achieves lowest perplexity with GEM (72.57 vs 73.76 GELU); BERT-small validation loss improves to 6.656.
  • Epsilon parameter reveals scale-dependent optimum: small epsilon for deep CNNs, large epsilon for shallow transformers.
  • Purely rational arithmetic enables efficient hardware implementation without transcendental operations.
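The paper's exact formula is not reproduced above, but the ingredients it names (a log-logistic CDF gate, C^{2N} smoothness, purely rational arithmetic) suggest a shape like the following. This is a hypothetical reconstruction, not the published definition: gate the identity with the log-logistic CDF F(t) = t^{2N} / (1 + t^{2N}) on the positive half-line and output zero on the negative half-line. Near the origin the positive branch behaves like x^{2N+1}, so its first 2N derivatives vanish and match the zero branch, giving C^{2N} smoothness, and only powers, additions, and divisions are needed.

```python
import numpy as np

def gem_like(x, N=1):
    """Illustrative GEM-style activation (hypothetical reconstruction,
    not the paper's exact formula): x * F(x) with log-logistic CDF gate
    F(t) = t^{2N} / (1 + t^{2N}) for x > 0, and 0 for x <= 0.
    Uses only rational arithmetic -- no exp, erf, or tanh."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)          # negative branch: exactly zero, like ReLU
    pos = x > 0
    xp = x[pos]
    # positive branch: x^{2N+1} / (1 + x^{2N}) ~ x for large x, ~ x^{2N+1} near 0
    out[pos] = xp ** (2 * N + 1) / (1.0 + xp ** (2 * N))
    return out

x = np.array([-2.0, 0.0, 0.5, 2.0, 10.0])
print(gem_like(x, N=1))   # 0 for x <= 0, approaches x as x grows
print(np.maximum(x, 0))   # ReLU for comparison
```

For N=1 this is a cubic-over-quadratic rational on the positive side, which is why it can be evaluated on hardware without transcendental operations.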

Frequently asked questions

  • GEM (Geometric Monomial) is a family of smooth activation functions using log-logistic gating that approximates ReLU behavior while maintaining continuous derivatives up to order 2N. Unlike ReLU's sharp kink at zero, GEM's smoothness reduces gradient discontinuities, improving optimization in deep networks. It uses only rational arithmetic, avoiding expensive transcendental operations.
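The "gradient discontinuity" point can be made concrete numerically. The sketch below compares ReLU's derivative, which jumps from 0 to 1 at the origin, with the derivative of the hypothetical GEM-style branch x^{2N+1} / (1 + x^{2N}) used above (again an assumed reconstruction, not the paper's formula), which passes through zero continuously:

```python
import numpy as np

def relu_grad(x):
    # ReLU's derivative: 0 for x <= 0, 1 for x > 0 -- a jump at the origin.
    return (np.asarray(x, dtype=float) > 0).astype(float)

def gem_like_grad(x, N=1):
    """Derivative of the hypothetical GEM-style branch x^{p+1}/(1 + x^p),
    p = 2N, for x > 0; zero otherwise. By the quotient rule this is
    ((p+1) x^p + x^{2p}) / (1 + x^p)^2 -- still purely rational."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    pos = x > 0
    xp = x[pos]
    p = 2 * N
    g[pos] = ((p + 1) * xp**p + xp**(2 * p)) / (1.0 + xp**p) ** 2
    return g

eps = 1e-6
print(relu_grad(np.array([-eps, eps])))      # [0., 1.] -> discontinuous
print(gem_like_grad(np.array([-eps, eps])))  # both ~0  -> continuous at 0
```

Just on either side of zero, ReLU's gradient flips between 0 and 1, while the smooth variant's gradient stays near zero on both sides, which is the friction reduction the summary describes.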
