AI · 8 min read · April 17, 2026
Automating Feature Preprocessing Beats Manual Tuning for Tabular ML
A study of 15 search algorithms on 45 datasets finds that evolution-based and random search outperform complex surrogate models for automated feature pipeline construction.
Automating feature preprocessing selection and ordering outperforms manual construction; evolution-based and random search algorithms lead.
- Feature preprocessing order and selection critically affect classical ML model performance on tabular data.
- Manual pipeline construction requires data scientists to make many sequential decisions with unclear payoff.
- The Auto-FP problem maps onto hyperparameter optimization (HPO) or neural architecture search (NAS) frameworks.
- Evolution-based algorithms achieve the best average ranking across 45 public datasets.
- Random search performs surprisingly well, beating many sophisticated surrogate-model approaches.
- Bandit-based and surrogate-model algorithms underperform on Auto-FP despite their success in HPO and NAS.
- Bottleneck analysis identifies gaps between current algorithms and optimal preprocessing discovery.
- AutoML tools show limitations when integrated with automated preprocessing pipelines.
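The random-search baseline above can be sketched concretely: sample a subset of preprocessing operators and an ordering for them, score each candidate pipeline by cross-validation, and keep the best. The operator pool, trial budget, dataset, and scikit-learn estimators below are illustrative assumptions, not the study's actual setup.

```python
import random

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

# Illustrative operator pool; a real Auto-FP search space is much larger.
OPERATORS = {
    "standardize": StandardScaler,
    "minmax": MinMaxScaler,
    "power": PowerTransformer,
}

def random_pipeline(rng, max_len=3):
    """Sample both WHICH operators to use and in WHAT order."""
    k = rng.randint(1, max_len)
    names = rng.sample(list(OPERATORS), k)
    steps = [(name, OPERATORS[name]()) for name in names]
    steps.append(("clf", LogisticRegression(max_iter=1000)))
    return Pipeline(steps)

def random_search(X, y, n_trials=8, seed=0):
    """Evaluate random pipelines by cross-validation; keep the best one."""
    rng = random.Random(seed)
    best_score, best_pipe = -1.0, None
    for _ in range(n_trials):
        pipe = random_pipeline(rng)
        score = cross_val_score(pipe, X, y, cv=3).mean()
        if score > best_score:
            best_score, best_pipe = score, pipe
    return best_score, best_pipe

X, y = load_breast_cancer(return_X_y=True)
score, pipe = random_search(X, y)
print(f"best CV accuracy: {score:.3f}")
```

Note that there is no model of the search space at all: each trial is independent, which is exactly the property that lets random search sidestep the surrogate-fitting overhead discussed in the findings.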
Frequently asked
- Why do random search and evolution-based methods beat surrogate models here? The preprocessing search space appears to be irregular and high-dimensional, making it difficult for surrogate models and bandit algorithms to build accurate predictive models of performance. Random search avoids the overhead of model building and explores the space more uniformly. Evolution-based methods succeed because they adapt through mutation and selection without relying on learned surrogates.
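The mutation-and-selection loop mentioned above can be sketched with a toy population of operator sequences. The fitness function here is a hypothetical stand-in (in real Auto-FP it would be a cross-validated model score), and the operator names and "ideal" ordering are invented for illustration.

```python
import random

# Toy evolution over preprocessing-operator sequences.
OPS = ["scale", "power", "quantile", "normalize", "binarize"]
TARGET = ["scale", "power", "quantile"]  # hypothetical "ideal" ordering

def fitness(seq):
    # Stand-in objective: reward positions matching the ideal sequence.
    # In real Auto-FP this would be a cross-validated accuracy.
    return sum(a == b for a, b in zip(seq, TARGET))

def mutate(seq, rng):
    # Point mutation: replace one operator with a random alternative.
    child = list(seq)
    child[rng.randrange(len(child))] = rng.choice(OPS)
    return child

def evolve(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.choice(OPS) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]                       # selection
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        pop = survivors + children                             # elitist replacement
    return max(pop, key=fitness)

best = evolve()
print("best sequence:", best)
```

Because selection keeps the top half of each generation, the best fitness never decreases, and no learned model of the search space is required at any point.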