Five leading large language models (LLMs) disagreed on the verdict for 67% of 1,000 real-world fact-check claims, according to a study by Lenz Research (lenz.io). The analysis highlights significant inconsistency among top AI models when evaluating the truthfulness of user-submitted claims.
Lenz Research presented 1,000 recent claims to five frontier LLMs, asking each to assign one of four verdicts: True, Mostly True, Misleading, or False. The claims were not drawn from standard benchmarks; they were real user submissions to a fact-checking platform. The models’ verdicts diverged in 672 cases (67.2%), meaning the five models did not all return the same verdict. In 34% of claims, the disagreement spanned two or more verdict categories, reflecting substantive differences rather than minor calibration shifts. Overall inter-rater reliability, measured by Krippendorff’s alpha, was 0.639, indicating moderate agreement that falls below the 0.667 floor Krippendorff suggests for drawing even tentative conclusions.
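The three figures reported above (share of claims with any disagreement, share with a wide split, and Krippendorff’s alpha) can be reproduced from a table of verdicts along the following lines. This is a minimal sketch, not Lenz Research’s code: the function names and the toy data are invented for illustration, "spanned two or more verdict categories" is assumed to mean that the highest and lowest verdicts sit at least two steps apart on the four-point scale, and alpha is computed with the nominal distance metric because the study’s choice of metric is not stated.

```python
from collections import Counter
from itertools import permutations

# Four-point scale from the article, ordered from least to most true.
VERDICTS = ["False", "Misleading", "Mostly True", "True"]
RANK = {verdict: i for i, verdict in enumerate(VERDICTS)}


def disagreement_rate(ratings):
    """Share of claims where the models did not all return the same verdict."""
    return sum(len(set(row)) > 1 for row in ratings) / len(ratings)


def wide_split_rate(ratings, min_span=2):
    """Share of claims whose extreme verdicts are at least min_span steps apart."""
    def span(row):
        ranks = [RANK[v] for v in row]
        return max(ranks) - min(ranks)
    return sum(span(row) >= min_span for row in ratings) / len(ratings)


def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha, nominal metric, assuming no missing ratings."""
    # Coincidence matrix: each ordered pair of ratings within a claim
    # contributes 1 / (raters - 1).
    coincidences = Counter()
    for row in ratings:
        m = len(row)
        for i, j in permutations(range(m), 2):
            coincidences[(row[i], row[j])] += 1 / (m - 1)

    # Marginal totals per verdict and the overall number of ratings.
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    observed = sum(w for (c, k), w in coincidences.items() if c != k) / n
    expected = sum(totals[c] * totals[k]
                   for c, k in permutations(totals, 2)) / (n * (n - 1))
    if expected == 0:
        # Every rating is identical, so alpha is undefined; treated as 1.0 here.
        return 1.0
    return 1 - observed / expected


# Toy data, invented for illustration: one row per claim, one verdict per model.
toy_ratings = [
    ["True", "True", "True", "True", "True"],
    ["False", "False", "False", "False", "False"],
    ["True", "Mostly True", "True", "True", "Mostly True"],
    ["Misleading", "False", "Mostly True", "Misleading", "True"],
]

print(disagreement_rate(toy_ratings))   # 0.5
print(wide_split_rate(toy_ratings))     # 0.25
print(krippendorff_alpha_nominal(toy_ratings))
```

Because the four verdicts are ordered, an ordinal distance metric would weight a True/False split more heavily than a True/Mostly True split and could yield a different alpha than the nominal version sketched here.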
This finding underscores the difficulty of relying on LLMs for consistent fact verification: the models agreed mainly on definitive verdicts but fractured on the more nuanced middle categories. The study provides a rare quantitative measure of disagreement among leading AI systems on real-world factual claims, emphasizing the complexity of automated fact-checking in practice.
Future work will likely focus on improving model alignment and calibration to reduce such discrepancies. Lenz Research noted that monitoring how these models evolve in handling ambiguous or borderline claims will be crucial for applications in journalism, content moderation, and information verification, where accuracy and consistency are paramount.