Artificial Intelligence · 6 min read · 21 April 2026

Automating Dataset Creation with LLMs and Search Engines

Researchers propose ADC, a method to build large labeled datasets automatically using language models and web search, reducing manual annotation work and cost.

Source: arxiv/cs.LG · Minghao Liu, Zonglin Di, Jiaheng Wei, Zhongruo Wang, Hengxiang Zhang, Ruixuan Xiao, Haoyu Wang, Jinlong Pang, Hao Chen, Ankit Shah, Hongxin Wei, Xinlei He, Zhaowei Zhao, Haobo Wang, Lei Feng, Jindong Wang, James Davis, Yang Liu

ADC automates dataset construction by using LLMs to design classes and generate search queries, collecting and curating samples with minimal human effort.

  • LLMs design detailed class hierarchies and generate search code to collect images from search engines automatically.
  • Clothing-ADC dataset contains 1M+ images across 12 main classes and 12,000 fine-grained subclasses built via automation.
  • Automated curation achieves 79% agreement with human annotators, reducing label noise from 22.2% to 10.7%.
  • Real-world challenges remain: label errors persist and data distributions become imbalanced across classes.
  • Open-source toolkit includes methods for label error detection and robust learning under noisy conditions.
  • Three benchmark datasets created for label noise detection, noise-robust learning, and class-imbalanced learning evaluation.
  • Existing methods evaluated on benchmarks to establish baselines and guide future research directions.
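
The collection and evaluation steps above can be illustrated with a minimal Python sketch. It is not the authors' toolkit: the hierarchy is hard-coded where ADC would ask an LLM to propose it, no search engine is actually called, and the names `HIERARCHY`, `build_queries`, `agreement_rate`, and `imbalance_ratio` are hypothetical helpers introduced only for this example.

```python
from collections import Counter
from itertools import product

# Hypothetical class hierarchy. In ADC an LLM would propose the classes
# and attributes; here they are hard-coded for illustration.
HIERARCHY = {
    "sweater": {"color": ["red", "navy"], "material": ["wool", "cotton"]},
    "dress": {"color": ["black"], "material": ["silk", "linen", "denim"]},
}


def build_queries(hierarchy):
    """Yield (main_class, query) pairs, one per attribute combination.

    Each combination plays the role of a fine-grained subclass, and every
    sample returned for a query would inherit the main class as its label.
    """
    for main_class, attributes in hierarchy.items():
        names = sorted(attributes)  # fixed attribute order for stable queries
        for combo in product(*(attributes[name] for name in names)):
            yield main_class, " ".join(combo) + " " + main_class


def agreement_rate(auto_labels, human_labels):
    """Fraction of samples where the automated label matches the human one."""
    if not auto_labels or len(auto_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)


def imbalance_ratio(labels):
    """Ratio of the largest class size to the smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


queries = list(build_queries(HIERARCHY))
for main_class, query in queries:
    print(f"{main_class}: {query}")
```

Because each class contributes one query per attribute combination, classes with more attribute values yield more samples, which is one simple way the class imbalance noted above can arise. `agreement_rate` corresponds to the kind of comparison behind the reported 79% agreement figure, applied here only to toy inputs.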

Frequently asked questions

  • ADC uses LLMs to design classes and generate search queries automatically, then collects samples from search engines without manual labeling. This eliminates the per-sample human annotation cost. The Clothing-ADC dataset (1M+ images) was built with negligible cost, whereas manual labeling would require thousands of hours. Remaining costs are limited to quality assurance and noise detection on the automated output.
