LLM Confidence Calibration Breaks Down on Hard Tasks, New Benchmark Finds
A preregistered study reports that LLM confidence is not a reliable proxy for correctness: models are overconfident on hard tasks and underconfident on easy ones. The authors introduce LifeEval to test calibration across difficulty levels.
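To make "overconfident" and "underconfident" concrete, here is a minimal sketch of one common way to measure the gap: mean stated confidence minus observed accuracy, computed separately for each difficulty level. This is not LifeEval's actual metric or data format, which the summary above does not describe. The function name `calibration_gap_by_difficulty`, the record layout, and the sample values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def calibration_gap_by_difficulty(records):
    """Return {difficulty: mean_confidence - accuracy}.

    `records` is an iterable of (difficulty, confidence, correct) tuples,
    a hypothetical layout assumed for this sketch.
    A positive gap means overconfidence, a negative gap underconfidence.
    """
    # Per difficulty level: [sum of confidences, number correct, total count]
    totals = defaultdict(lambda: [0.0, 0, 0])
    for difficulty, confidence, correct in records:
        bucket = totals[difficulty]
        bucket[0] += confidence
        bucket[1] += int(correct)
        bucket[2] += 1
    return {
        level: conf_sum / n - n_correct / n
        for level, (conf_sum, n_correct, n) in totals.items()
    }


# Made-up illustrative data, NOT results from the study.
records = [
    ("easy", 0.70, True), ("easy", 0.80, True),
    ("easy", 0.75, True), ("easy", 0.75, True),
    ("hard", 0.90, False), ("hard", 0.85, True),
    ("hard", 0.95, False), ("hard", 0.90, False),
]

gaps = calibration_gap_by_difficulty(records)
for level, gap in gaps.items():
    label = "overconfident" if gap > 0 else "underconfident" if gap < 0 else "calibrated"
    print(f"{level}: gap={gap:+.2f} ({label})")
```

With this toy data the easy bucket has a negative gap and the hard bucket a positive one, the same direction of miscalibration the study reports. A fuller analysis would typically also bin by confidence level (as in expected calibration error) rather than rely on a single mean per bucket.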