Jiacheng Xu and ten co-authors submitted a paper titled "VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing" to arXiv on May 7, 2026 (arXiv:2605.06765).
The paper introduces VITA-QinYu, an end-to-end spoken language model (SLM) designed to go beyond natural conversation and support both role-playing and singing generation.
VITA-QinYu uses a hybrid speech-text paradigm that combines interleaved text-audio modeling with multi-codebook audio tokens.
This design enables richer paralinguistic representation while keeping the two modalities clearly separated to avoid cross-modal interference.
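To make the interleaving idea concrete, here is a minimal sketch of how text tokens and multi-codebook audio frames might share one decoder sequence while occupying disjoint id ranges. All names and sizes (`TEXT_VOCAB`, `NUM_CODEBOOKS`, the offset scheme) are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of interleaved text-audio modeling with
# multi-codebook audio tokens. Vocabulary sizes and the offset
# scheme are assumptions for illustration only.

TEXT_VOCAB = 32_000      # assumed text vocabulary size
NUM_CODEBOOKS = 4        # assumed number of audio codebooks
CODEBOOK_SIZE = 1_024    # assumed entries per codebook

def flatten_audio_frame(frame):
    """Map one audio frame (one code per codebook) into a shared
    token space by giving each codebook its own offset placed
    after the text vocabulary, so text and audio ids never clash."""
    return [
        TEXT_VOCAB + cb * CODEBOOK_SIZE + code
        for cb, code in enumerate(frame)
    ]

def interleave(text_chunks, audio_chunks):
    """Alternate text-token chunks with flattened audio frames so a
    single decoder-only LM sees one mixed sequence; the disjoint id
    ranges keep the modalities from interfering at the token level."""
    seq = []
    for text, audio in zip(text_chunks, audio_chunks):
        seq.extend(text)                    # text tokens pass through as-is
        for frame in audio:                 # each frame: one code per codebook
            seq.extend(flatten_audio_frame(frame))
    return seq

# Two text chunks, each followed by audio frames with 4 codebook codes.
seq = interleave(
    [[101, 102], [103]],
    [[(5, 6, 7, 8), (9, 10, 11, 12)], [(1, 2, 3, 4)]],
)
```

Because every audio token id is at least `TEXT_VOCAB`, a single softmax over the combined vocabulary can emit either modality without ambiguity about which codebook a prediction belongs to.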
The researchers built a comprehensive data-generation pipeline that synthesized 15.8K hours of training data spanning natural conversation, role-playing, and singing.
VITA-QinYu demonstrated superior expressiveness, outperforming peer SLMs by 7 percentage points on objective role-playing benchmarks.
It also surpassed peer models by 0.13 points on a 5-point mean opinion score (MOS) scale for singing.
The model further achieved state-of-the-art conversational accuracy and fluency, exceeding prior SLMs by 1.38 and 4.98 percentage points on the C3 and URO benchmarks, respectively.
The authors have open-sourced the code and models and provide an easy-to-use demo with full-stack support for streaming and full-duplex interaction.