Jiacheng Xu and ten co-authors submitted a paper titled "VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing" to arXiv on May 7, 2026 (arXiv:2605.06765).
The paper introduces VITA-QinYu, an end-to-end spoken language model (SLM) designed to go beyond natural conversation and support both role-playing and singing generation.
VITA-QinYu uses a hybrid speech-text paradigm that combines interleaved text-audio modeling with multi-codebook audio tokens.
This design enables richer paralinguistic representation while keeping the two modalities clearly separated to avoid cross-modal interference.
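To make the interleaving idea concrete, here is a minimal sketch of how text tokens and multi-codebook audio frames might share one decoder sequence while occupying disjoint id ranges. All names and sizes (`TEXT_VOCAB`, `NUM_CODEBOOKS`, the offset scheme) are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of interleaved text-audio modeling with
# multi-codebook audio tokens. Vocabulary sizes and the offset
# scheme are assumptions for illustration only.

TEXT_VOCAB = 32_000      # assumed text vocabulary size
NUM_CODEBOOKS = 4        # assumed number of audio codebooks
CODEBOOK_SIZE = 1_024    # assumed entries per codebook

def flatten_audio_frame(frame):
    """Map one audio frame (one code per codebook) into a shared
    token space by giving each codebook its own offset placed
    after the text vocabulary, so text and audio ids never clash."""
    return [
        TEXT_VOCAB + cb * CODEBOOK_SIZE + code
        for cb, code in enumerate(frame)
    ]

def interleave(text_chunks, audio_chunks):
    """Alternate text-token chunks with flattened audio frames so a
    single decoder-only LM sees one mixed sequence; the disjoint id
    ranges keep the modalities from interfering at the token level."""
    seq = []
    for text, audio in zip(text_chunks, audio_chunks):
        seq.extend(text)                    # text tokens pass through as-is
        for frame in audio:                 # each frame: one code per codebook
            seq.extend(flatten_audio_frame(frame))
    return seq

# Two text chunks, each followed by audio frames with 4 codebook codes.
seq = interleave(
    [[101, 102], [103]],
    [[(5, 6, 7, 8), (9, 10, 11, 12)], [(1, 2, 3, 4)]],
)
```

Because every audio token id is at least `TEXT_VOCAB`, a single softmax over the combined vocabulary can emit either modality without ambiguity about which codebook a prediction belongs to.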
The researchers built a comprehensive data-generation pipeline that synthesized 15.8K hours of training data spanning natural conversation, role-playing, and singing.
VITA-QinYu demonstrated superior expressiveness, outperforming peer SLMs by 7 percentage points on objective role-playing benchmarks.
It also surpassed peer models by 0.13 points on a 5-point mean opinion score (MOS) scale for singing.
The model further achieved state-of-the-art conversational accuracy and fluency, exceeding prior SLMs by 1.38 and 4.98 percentage points on the C3 and URO benchmarks, respectively.
The authors have open-sourced the code and models and provide an easy-to-use demo with full-stack support for streaming and full-duplex interaction.