Researchers have developed IntentGrasp, a comprehensive benchmark for evaluating the intent-understanding capabilities of Large Language Models (LLMs) (S1).
IntentGrasp is constructed from 49 open-licensed corpora across 12 domains (S1).
The benchmark includes a training set of 262,759 instances and two evaluation sets: the All Set with 12,909 test cases and the Gem Set with 470 cases (S1).
Evaluations of 20 LLMs across 7 families, including GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7, revealed unsatisfactory performance: scores fell below 60% on the All Set and below 25% on the Gem Set (S1).
Notably, 17 out of 20 models performed worse than a random-guess baseline on the Gem Set, while human performance was estimated at ~81.1% (S1).
To improve intent understanding, the researchers propose Intentional Fine-Tuning (IFT), which fine-tunes models on the IntentGrasp training set (S1).
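As a rough illustration of this kind of supervised fine-tuning, the sketch below trains a causal LM on serialized query/intent pairs. The base model ("gpt2"), the field names ("query", "intent"), the local file name, and all hyperparameters are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch of supervised fine-tuning on IntentGrasp-style data.
# The base model ("gpt2"), field names ("query", "intent"), local file name,
# and hyperparameters are illustrative assumptions, not the paper's setup.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

# Hypothetical local export of the IntentGrasp training split.
raw = load_dataset("json", data_files="intentgrasp_train.jsonl")["train"]

def to_features(example):
    # Serialize each instance as "query -> gold intent" and tokenize it.
    text = f"Query: {example['query']}\nIntent: {example['intent']}"
    enc = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    # Standard causal-LM objective: predict the sequence itself; ignore padding.
    enc["labels"] = [
        tok if tok != tokenizer.pad_token_id else -100 for tok in enc["input_ids"]
    ]
    return enc

train_set = raw.map(to_features, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ift_out", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=train_set,
)
trainer.train()
```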
IFT yielded significant gains of 30+ F1 points on the All Set and 20+ points on the Gem Set (S1).
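For reference, these gains are measured in F1, the harmonic mean of precision P and recall R:

```latex
F_1 = \frac{2PR}{P + R}
```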
Leave-one-domain-out experiments demonstrated the strong cross-domain generalizability of IFT (S1).
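A minimal sketch of that evaluation protocol follows, assuming each instance carries a "domain" field; the train_fn/eval_fn callables (e.g., the IFT sketch above and an F1 scorer) are assumptions, not the authors' code.

```python
# Sketch of leave-one-domain-out evaluation over IntentGrasp's 12 domains.
# The "domain" field and the train_fn/eval_fn callables are assumptions.
def leave_one_domain_out(instances, domains, train_fn, eval_fn):
    """Hold out each domain in turn: fine-tune on the rest, test on it."""
    results = {}
    for held_out in domains:
        train = [x for x in instances if x["domain"] != held_out]
        test = [x for x in instances if x["domain"] == held_out]
        model = train_fn(train)                   # e.g., the IFT sketch above
        results[held_out] = eval_fn(model, test)  # e.g., F1 on held-out domain
    return results
```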
The IntentGrasp data is available on Hugging Face, and the code is on GitHub (S1).
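Since the exact Hugging Face dataset ID and split names are not given here, the loading snippet below uses placeholders: "ORG/IntentGrasp" and the split names are hypothetical stand-ins.

```python
# Hypothetical loading snippet: "ORG/IntentGrasp" and the split names are
# placeholders for the actual Hugging Face dataset ID and configuration.
from datasets import load_dataset

train = load_dataset("ORG/IntentGrasp", split="train")       # 262,759 instances
all_set = load_dataset("ORG/IntentGrasp", split="test_all")  # 12,909 cases
gem_set = load_dataset("ORG/IntentGrasp", split="test_gem")  # 470 cases
```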