Researchers have introduced SalesSim, a framework and testbed for evaluating how multimodal large language models simulate realistic, persona-driven customer behaviour in multi-turn, multimodal, tool-augmented online retail conversations. The paper, by Yada Pruksachatkun, Elaine Wan, Lyanna Chen, Kai-Wei Chang, and Chien-Sheng Wu, was submitted to arXiv on 8 May 2026.
SalesSim models retail interaction and decision-making as a grounded, agentic process: shoppers with diverse backgrounds, preferences, and dealbreakers interact with a sales agent, seek clarifications, and make informed purchasing decisions. The framework centres evaluation on decision alignment, the consistency between the simulator's actions and its persona specification, alongside conversational quality.
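The paper does not spell out how decision alignment is computed, but the idea of checking a simulator's choices against a persona specification can be sketched as follows. The `Persona` fields and the constraint-satisfaction scoring here are illustrative assumptions, not the paper's actual metric.

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    preferences: set = field(default_factory=set)   # attributes the shopper wants present
    dealbreakers: set = field(default_factory=set)  # attributes that must be absent

def decision_alignment(persona: Persona, product_attrs: set) -> float:
    """Fraction of the persona's constraints that a chosen product satisfies.

    Hypothetical scoring: each preference should appear in the product's
    attributes, each dealbreaker should not.
    """
    checks = []
    for pref in persona.preferences:
        checks.append(pref in product_attrs)
    for db in persona.dealbreakers:
        checks.append(db not in product_attrs)
    return sum(checks) / len(checks) if checks else 1.0

persona = Persona(preferences={"waterproof", "under_100"},
                  dealbreakers={"leather"})
# A leather, non-budget product satisfies only 1 of 3 constraints:
score = decision_alignment(persona, {"waterproof", "leather"})
```

Averaging such per-decision scores over many conversations would yield the kind of "average alignment" figure the paper reports.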
Benchmarking six state-of-the-art open- and closed-source models revealed significant behavioural gaps. While the models produce fluent conversations, they display substantially lower lexical diversity than human interactions and overdisclose their criteria across personas. They also tend to be swayed by the sales agent's suggestions and to drift from their persona specifications.
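Lexical diversity is commonly measured with a distinct-n ratio (unique n-grams over total n-grams); the paper does not say which metric it uses, so this distinct-1 sketch is an assumption for illustration only.

```python
def distinct_1(utterances: list[str]) -> float:
    """Ratio of unique tokens to total tokens across a set of utterances.

    A low value indicates repetitive, templated phrasing; human dialogue
    typically scores higher than simulator output on measures like this.
    """
    tokens = [tok for utt in utterances for tok in utt.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

templated = ["i want a blue shirt", "i want a blue jacket"]   # repetitive
varied = ["do you have this in navy", "what fabric is the jacket"]
print(distinct_1(templated))  # 0.6
print(distinct_1(varied))     # 1.0
```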
Even the strongest model achieves less than 79% average alignment with its underlying persona specification. This highlights a fundamental challenge for current multimodal language models: maintaining consistency with an assigned customer profile over extended retail interactions.
To address these limitations, the researchers propose UserGRPO, a multi-turn, multi-objective reinforcement learning recipe that optimises both conversational fluency and decision alignment under persona specifications. In their experiments, UserGRPO boosts the baseline model's decision alignment by 13.8% while also improving conversational quality.
The paper is categorised under Computation and Language and carries the identifier arXiv:2605.08334. By introducing SalesSim, the researchers provide the community with a new testbed for investigating and improving user-simulator adherence in goal-oriented settings.