How does ADC reduce annotation costs compared to hiring human labelers?

ADC uses LLMs to design the class hierarchy and generate search queries, then collects samples from search engines, so each sample arrives with the label implied by its query. This removes the per-sample cost of human annotation. The Clothing-ADC dataset (1M+ images) was built at negligible labeling cost, whereas labeling it manually would take thousands of hours. The remaining cost is quality assurance and noise detection on the automated output.
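
The query-generation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `CLASS_DESIGN` dictionary is a hypothetical stand-in for what an LLM might return, and the attribute names, values, and the `build_queries` helper are all assumptions made for the example. The search-engine call itself is left out.

```python
from itertools import product

# Hypothetical class design, standing in for LLM output.
# None of these classes or attributes are taken from the paper.
CLASS_DESIGN = {
    "dress": {"color": ["red", "black"], "material": ["cotton", "silk"]},
    "jacket": {"color": ["blue"], "material": ["denim", "leather"]},
}


def build_queries(class_design):
    """Expand each main class into (label, subclass, query) records."""
    records = []
    for label, attributes in class_design.items():
        names = sorted(attributes)  # fixed attribute order for reproducibility
        # Every combination of attribute values becomes one fine-grained subclass.
        for values in product(*(attributes[name] for name in names)):
            subclass = " ".join(values)
            records.append(
                {"label": label, "subclass": subclass, "query": f"{subclass} {label}"}
            )
    return records


queries = build_queries(CLASS_DESIGN)
for record in queries[:3]:
    print(record)
```

Each image returned for a query would inherit that query's `label` and `subclass`, which is why no per-sample annotation is needed and also why some labels end up wrong.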

What is label noise and why does ADC still have it at 10.7%?

Label noise occurs when a sample is tagged with an incorrect class. ADC's automated curation lowers the noise rate from 22.2% to 10.7%, but it does not remove noise entirely, because automated search and collection inherently return some mislabeled samples: a search for 'red dress' may also return images of other red clothing. The paper treats this residual noise as acceptable and provides tools to detect and handle noisy labels during model training rather than trying to eliminate noise upfront.
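
To make the figures concrete, the sketch below computes a noise rate on a small audited sample and the relative reduction implied by the two reported rates. The toy labels and both helper functions are assumptions for illustration; only the 22.2% and 10.7% figures come from the article.

```python
def noise_rate(collected_labels, verified_labels):
    """Fraction of samples whose collected label differs from the verified one."""
    if len(collected_labels) != len(verified_labels):
        raise ValueError("label lists must have the same length")
    if not collected_labels:
        raise ValueError("need at least one sample")
    mismatches = sum(c != v for c, v in zip(collected_labels, verified_labels))
    return mismatches / len(collected_labels)


def relative_reduction(before, after):
    """Share of the original noise that was removed."""
    return (before - after) / before


# Hypothetical audit of five collected samples against human-verified labels.
collected = ["dress", "dress", "dress", "skirt", "dress"]
verified = ["dress", "skirt", "dress", "skirt", "dress"]
print(noise_rate(collected, verified))  # 1 mismatch out of 5 -> 0.2

# Reported rates: 22.2% before curation, 10.7% after.
print(round(relative_reduction(0.222, 0.107), 3))  # 0.518
```

Read this way, curation removes roughly half of the label errors, and about one label in nine is still wrong.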

Can ADC be applied to non-image datasets like text or time-series data?

The paper focuses on image classification as a proof of concept. The core idea, using LLMs to design classes and generate collection queries, is domain-agnostic and could extend to text (e.g., generating search terms for document collection) or other modalities. However, the paper provides no evidence or benchmarks for non-visual domains, so practitioners would need to adapt and validate the approach themselves.


Artificial Intelligence · 6 min read · 21 April 2026

Automating Dataset Creation with LLMs and Search Engines

Researchers propose ADC, a method to build large labeled datasets automatically using language models and web search, reducing manual annotation work and cost.

Source: arxiv/cs.LG · Minghao Liu, Zonglin Di, Jiaheng Wei, Zhongruo Wang, Hengxiang Zhang, Ruixuan Xiao, Haoyu Wang, Jinlong Pang, Hao Chen, Ankit Shah, Hongxin Wei, Xinlei He, Zhaowei Zhao, Haobo Wang, Lei Feng, Jindong Wang, James Davis, Yang Liu


ADC automates dataset construction by using LLMs to design classes and generate search queries, collecting and curating samples with minimal human effort.

— LLMs design detailed class hierarchies and generate search code to collect images from search engines automatically.
— Clothing-ADC dataset contains 1M+ images across 12 main classes and 12,000 fine-grained subclasses built via automation.
— Automated curation achieves 79% agreement with human annotators, reducing label noise from 22.2% to 10.7%.
— Real-world challenges remain: label errors persist and data distributions become imbalanced across classes.
— Open-source toolkit includes methods for label error detection and robust learning under noisy conditions.
— Three benchmark datasets created for label noise detection, noise-robust learning, and class-imbalanced learning evaluation.
— Existing methods evaluated on benchmarks to establish baselines and guide future research directions.
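
The class-imbalance problem noted in the list above can be quantified with a simple ratio, and one common mitigation is to weight classes by inverse frequency. The sketch below is an assumption-laden illustration: the toy label counts are invented, and the weighting formula is the widely used "balanced" heuristic, not a method this summary attributes to the paper.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical collected labels: search engines return far more of some classes.
labels = ["dress"] * 6 + ["jacket"] * 3 + ["hat"] * 1
weights = inverse_frequency_weights(labels)

print(imbalance_ratio(labels))  # 6 / 1 -> 6.0
print(weights)
```

Rare classes receive larger weights, so a loss function that uses them does not simply favor the classes that search engines happen to return most often.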


#dataset #llm #automation #labeling #curation
