- AI · arxiv/cs.AI · 8 min
Benchmark Rubrics Shift LLM Scores in Financial NLP Tasks
How wording changes in evaluation criteria and metric selection alter model rankings on financial text benchmarks, requiring governance over gold-label assumptions.
May 2, 2026
- AI · arxiv/cs.AI · 6 min
Measuring Where Chatbots Beat Humans on Tests
Researchers apply psychometric methods to identify test items where LLMs systematically outperform human learners, revealing assessment vulnerabilities.
April 17, 2026