- Artificial Intelligence · arxiv/cs.AI · 8 min
Benchmark Rubrics Shift LLM Scores in Financial NLP Tasks
How wording changes in evaluation criteria and metric selection alter model rankings on financial text benchmarks, requiring governance over gold-label assumptions.
May 2, 2026
- Artificial Intelligence · arxiv/cs.AI · 6 min
Measuring Where Chatbots Beat Humans on Tests
Researchers apply psychometric methods to identify test items where LLMs systematically outperform human learners, revealing assessment vulnerabilities.
April 17, 2026