Artificial Intelligence · 8 min read · May 2, 2026

Benchmark Rubrics Shift LLM Scores in Financial NLP Tasks

How wording changes in evaluation criteria and metric selection alter model rankings on financial text benchmarks, requiring governance over gold-label assumptions.

Source: arxiv/cs.AI · Sidi Chang, Peiying Zhu, Yuxiao Chen, Rongdong Chai

Gold labels in financial NLP benchmarks are not objective; rubric wording and metric choice materially shift model rankings and require explicit governance.

  • Rubric wording changes shift model-assigned labels by 17–30 percentage points, especially near decision boundaries.
  • Not all metrics remain informative under real class distributions; within-one accuracy and worst-class accuracy are unreliable.
  • Exact accuracy, macro-F1, and weighted kappa are defensible metrics for the Japanese Financial Implicit-Commitment Recognition benchmark.
  • Ranking disagreement emerges when using all five metrics but vanishes when restricted to identifiable metrics.
  • Measurement risk arises from confounded rubric variants that mix semantics, examples, and verbosity without isolating causes.
  • Supervised financial benchmarks need explicit reporting discipline on rubric governance and metric selection, not just new leaderboards.
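The three metrics the study defends (exact accuracy, macro-F1, and weighted kappa) can be sketched in plain Python. These are illustrative implementations, not the paper's code, and the quadratic weighting in the kappa is an assumption, since the summary says only "weighted kappa":

```python
def exact_accuracy(y_true, y_pred):
    """Fraction of items where the predicted label matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so minority classes count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with squared-distance weights: errors between
    distant label grades are penalized more than adjacent confusions."""
    n = len(y_true)
    # Normalized confusion matrix of observed (gold, predicted) pairs.
    obs = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        obs[t][p] += 1 / n
    row = [sum(obs[i]) for i in range(n_classes)]
    col = [sum(obs[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2          # quadratic disagreement weight
            num += w * obs[i][j]      # observed weighted disagreement
            den += w * row[i] * col[j]  # chance-expected disagreement
    return 1 - num / den if den else 1.0
```

Because macro-F1 averages classes without frequency weighting, it stays informative under the skewed class distributions the study highlights, where plain accuracy can be dominated by the majority class.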

Frequently asked questions

  • Why do rubric wording changes shift labels so much? Rubric wording changes alter how annotators and models interpret boundary cases, especially near decision thresholds (e.g., whether a statement is an implicit commitment or not). The study found that agreement between rubric variants ranged from 70% to 83%, meaning 17–30% of labels shifted. This is not random noise; it reflects genuine ambiguity in the rubric itself, not the model's capability.
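The agreement figures above can be reproduced with a simple pairwise comparison of the labels produced under two rubric variants. A minimal sketch (function and variable names are hypothetical, not from the paper):

```python
def rubric_agreement(labels_a, labels_b):
    """Fraction of items that receive the same label under two rubric
    variants; 1 minus this value is the label-shift rate."""
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def shifted_items(labels_a, labels_b):
    """Indices whose label changed between variants -- candidates for the
    boundary cases the study says drive the disagreement."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
```

For example, an agreement of 0.70 between two variants corresponds to the 30% label-shift figure at the high end of the reported range.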

Related