LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control

Abstract

Numerical voice impression (VI) control (e.g., scaling brightness) enables fine-grained control in text-to-speech (TTS). However, it faces two challenges: the lack of a public corpus, and impression leakage, in which the reference audio biases the synthesized voice away from the target VI. To address the first challenge, we introduce LibriTTS-VI, the first public VI corpus, built on LibriTTS-R. For the second, we hypothesize that a single reference causes leakage by entangling speaker identity and VI. To mitigate this, we propose 1) disentangled training, which uses two utterances from the same speaker for speaker and VI conditioning, and 2) a reference-free method that controls the impression solely via the target VI. Experimentally, our best method improves controllability: the 11-dimensional VI mean squared error drops from 0.61 to 0.41 objectively and from 1.15 to 0.92 subjectively. A comparison with a prompt-based TTS reveals imprecise numerical control and entanglement between VI and text semantics, both of which our methods overcome.
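The controllability numbers above are mean squared errors over the 11 VI dimensions. A minimal sketch of that metric follows. It assumes the error is averaged across dimensions for one utterance; how the estimated VI is obtained (an objective predictor or listener ratings) is not described on this page, so the function name `vi_mse` and the input vectors here are purely illustrative.

```python
from typing import Sequence


def vi_mse(target: Sequence[float], estimated: Sequence[float]) -> float:
    """Mean squared error between two VI vectors of equal length."""
    if len(target) != len(estimated):
        raise ValueError("VI vectors must have the same number of dimensions")
    if len(target) == 0:
        raise ValueError("VI vectors must be non-empty")
    squared_errors = [(t - e) ** 2 for t, e in zip(target, estimated)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical 11-dimensional vectors on the 1-7 scale (not data from the paper).
target_vi = [4.0] * 11
estimated_vi = [4.5] * 11
print(vi_mse(target_vi, estimated_vi))
```

A lower value means the synthesized voice is closer to the requested impression on average.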

Overview

Fig. 1: Overview of the VITS-based VIC systems (VIC-base, VIC-dis, and VIC-srf) and the Qwen3-TTS prompt-based inference pipeline.

Methods:

VIC-base (a): Baseline. The method of [1] adapted to VITS. A single reference utterance provides both the speaker identity and the target VI.
VIC-dis (b): Proposed method. Mitigates the impression leakage in (a) by using two distinct utterances from the same speaker during training, one for speaker identity and one for the target VI.
VIC-srf (c): Proposed method. Removes the reference audio dependency in (a) by generating speaker embeddings solely from the target VI.
QVD-z / QVD-f (d): Qwen3-TTS [3] (LLM-based TTS) with VI control via natural language prompts. QVD-z uses the zero-shot VoiceDesign model; QVD-f uses a model fine-tuned with VI prompts.
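The two-utterance idea behind VIC-dis can be sketched as a sampling step. This is only an illustration under stated assumptions: the function `sample_disentangled_pair`, the `utts_by_speaker` mapping, and the utterance IDs are hypothetical, and the actual training pipeline is not shown on this page.

```python
import random
from typing import Dict, List, Tuple


def sample_disentangled_pair(
    utts_by_speaker: Dict[str, List[str]],
    speaker: str,
    rng: random.Random,
) -> Tuple[str, str]:
    """Pick two distinct utterances of one speaker.

    The first is meant for speaker conditioning, the second for VI
    conditioning. Utterance IDs are assumed to be unique per speaker.
    """
    utterances = utts_by_speaker[speaker]
    if len(utterances) < 2:
        raise ValueError(f"speaker {speaker!r} needs at least two utterances")
    # random.Random.sample draws without replacement, so the two picks differ.
    speaker_ref, vi_ref = rng.sample(utterances, 2)
    return speaker_ref, vi_ref


# Hypothetical corpus index (IDs are placeholders, not LibriTTS-VI entries).
corpus = {"spkA": ["spkA_utt1", "spkA_utt2", "spkA_utt3"]}
speaker_ref, vi_ref = sample_disentangled_pair(corpus, "spkA", random.Random(0))
print(speaker_ref, vi_ref)
```

In a single-reference setup such as VIC-base, both roles would be filled by the same utterance, which is the entanglement the disentangled training is meant to avoid.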

Audio Demos for speaker 8555

Each row corresponds to one VI dimension. Use the slider to set the modulation level (-3 to +3), then click a method button to play.
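As a rough illustration of what a modulation level does, the sketch below shifts a single VI dimension and leaves the others unchanged. It assumes the level is an additive offset on the 1-to-7 VI scale and that out-of-range results are clipped; both are assumptions for illustration, and `modulate_vi` and the example values are hypothetical.

```python
from typing import Dict

VI_MIN, VI_MAX = 1.0, 7.0  # VI scores use a continuous 1-7 scale


def modulate_vi(base_vi: Dict[str, float], dimension: str, level: float) -> Dict[str, float]:
    """Return a copy of base_vi with one dimension shifted by `level`.

    Assumption: the level (-3 to +3) is added to the score and the result
    is clipped to the 1-7 range.
    """
    if dimension not in base_vi:
        raise KeyError(f"unknown VI dimension: {dimension}")
    if not -3 <= level <= 3:
        raise ValueError("modulation level must be within [-3, +3]")
    modulated = dict(base_vi)  # copy so the input stays untouched
    modulated[dimension] = min(VI_MAX, max(VI_MIN, base_vi[dimension] + level))
    return modulated


# Hypothetical base scores for two of the 11 dimensions.
base = {"dark-bright": 4.0, "cold-warm": 5.5}
print(modulate_vi(base, "dark-bright", 2))
```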

GT Reference (8555_284447_000019_000002.wav):

VI	Modulation Level

Audio Demos for speaker 1089

Each row corresponds to one VI dimension. Use the slider to set the modulation level (-3 to +3), then click a method button to play.

GT Reference (1089_134686_000009_000005.wav):

VI	Modulation Level

LLM-generated Voice Impressions (speaker 8555)

Following the methodology of [1], an LLM (Gemini 2.5 Pro) generates target VI values for a given speaking style, starting from speaker 8555's neutral VI. The VITS-based methods (VIC-base, VIC-dis, and VIC-srf) then synthesize speech from these VI values.

Target style: Neutral

Neutral VIs:

    # Impressions
    ## {
    ## "low-high": 5.640702247619629,
    ## "m-f": 6.366395950317383,
    ## "clear-hoarse": 3.642221450805664,
    ## "calm-restless": 3.2887918949127197,
    ## "powerful-weak": 4.879675388336182,
    ## "youthful-aged": 3.544085741043091,
    ## "thick-thin": 3.981926679611206,
    ## "firm-relaxed": 4.565456390380859,
    ## "dark-bright": 3.752795934677124,
    ## "cold-warm": 4.111004829406738,
    ## "speed": 4.218106858506785,
    ## },

Target style: Sleepy

Prompt to LLM:

    # Task
    ## We have data evaluating the impression of a voice based on the following 11 impressions, using a continuous scale from 1 to 7.
    ## The score for each impression represents the score in a neutral state.
    ## Generate scores for when the speaker speaks in the target speaking style.
    ## The output should be in JSON format for each impression.
    # Impressions
    ## {
    ## "low-high": 5.640702247619629,
    ## "m-f": 6.366395950317383,
    ## "clear-hoarse": 3.642221450805664,
    ## "calm-restless": 3.2887918949127197,
    ## "powerful-weak": 4.879675388336182,
    ## "youthful-aged": 3.544085741043091,
    ## "thick-thin": 3.981926679611206,
    ## "firm-relaxed": 4.565456390380859,
    ## "dark-bright": 3.752795934677124,
    ## "cold-warm": 4.111004829406738,
    ## "speed": 4.218106858506785,
    ## },
    # Target
    ## Sleepy

LLM Output:

    {
      "low-high": 3.140702247619629,
      "m-f": 5.366395950317383,
      "clear-hoarse": 5.642221450805664,
      "calm-restless": 1.2887918949127197,
      "powerful-weak": 6.379675388336182,
      "youthful-aged": 4.044085741043091,
      "thick-thin": 2.981926679611206,
      "firm-relaxed": 6.565456390380859,
      "dark-bright": 2.252795934677124,
      "cold-warm": 4.911004829406738,
      "speed": 1.7181068585067854
    }

Target style: Urgent, attention grabbing

Prompt to LLM:

    # Task
    ## We have data evaluating the impression of a voice based on the following 11 impressions, using a continuous scale from 1 to 7.
    ## The score for each impression represents the score in a neutral state.
    ## Generate scores for when the speaker speaks in the target speaking style.
    ## The output should be in JSON format for each impression.
    # Impressions
    ## {
    ## "low-high": 5.640702247619629,
    ## "m-f": 6.366395950317383,
    ## "clear-hoarse": 3.642221450805664,
    ## "calm-restless": 3.2887918949127197,
    ## "powerful-weak": 4.879675388336182,
    ## "youthful-aged": 3.544085741043091,
    ## "thick-thin": 3.981926679611206,
    ## "firm-relaxed": 4.565456390380859,
    ## "dark-bright": 3.752795934677124,
    ## "cold-warm": 4.111004829406738,
    ## "speed": 4.218106858506785,
    ## },
    # Target
    ## Urgent, attention grabbing

LLM Output:

    {
      "low-high": 6.640702247619629,
      "m-f": 6.666395950317383,
      "clear-hoarse": 1.642221450805664,
      "calm-restless": 5.78879189491272,
      "powerful-weak": 2.3796753883361816,
      "youthful-aged": 3.544085741043091,
      "thick-thin": 2.981926679611206,
      "firm-relaxed": 1.5654563903808594,
      "dark-bright": 5.752795934677124,
      "cold-warm": 3.6110048294067383,
      "speed": 6.718106858506785
    }
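The prompt layout and the LLM outputs shown for each style can be handled with a few lines of code. The sketch below assembles a prompt in the same layout from a neutral VI dictionary and compares a JSON reply against the neutral scores. The helper names `build_prompt` and `vi_deltas` are hypothetical, the line breaks of the prompt are inferred from how it is displayed on this page, and no LLM is called: the reply string is a two-dimension excerpt of the Sleepy output.

```python
import json
from typing import Dict


def build_prompt(neutral_vi: Dict[str, float], target_style: str) -> str:
    """Assemble the prompt text from neutral VI scores and a target style."""
    lines = [
        "# Task",
        "## We have data evaluating the impression of a voice based on the "
        "following 11 impressions, using a continuous scale from 1 to 7.",
        "## The score for each impression represents the score in a neutral state.",
        "## Generate scores for when the speaker speaks in the target speaking style.",
        "## The output should be in JSON format for each impression.",
        "# Impressions",
        "## {",
    ]
    lines += [f'## "{name}": {value},' for name, value in neutral_vi.items()]
    lines += ["## },", "# Target", f"## {target_style}"]
    return "\n".join(lines)


def vi_deltas(neutral_vi: Dict[str, float], llm_output_json: str) -> Dict[str, float]:
    """Per-dimension change (target minus neutral), rounded to 3 decimals."""
    target_vi = json.loads(llm_output_json)
    missing = set(neutral_vi) - set(target_vi)
    if missing:
        raise ValueError(f"LLM output lacks dimensions: {sorted(missing)}")
    return {name: round(target_vi[name] - neutral_vi[name], 3) for name in neutral_vi}


# Two of the 11 neutral scores for speaker 8555, copied from the Neutral entry.
neutral = {"low-high": 5.640702247619629, "speed": 4.218106858506785}
# Matching excerpt of the Sleepy LLM output.
sleepy_reply = '{"low-high": 3.140702247619629, "speed": 1.7181068585067854}'

print(build_prompt(neutral, "Sleepy"))
print(vi_deltas(neutral, sleepy_reply))
```

For this excerpt, both dimensions come out 2.5 points below the neutral scores, which suggests the LLM applies round offsets to the neutral values for the Sleepy style.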

References

[1] K. Fujita, et al., "Voice Impression Control in Zero-Shot TTS", Interspeech 2025.

[2] Y. Koizumi, et al., "LibriTTS-R: Restoration of a Large-Scale Multi-Speaker TTS Corpus", Interspeech 2023.

[3] H. Hu, et al., "Qwen3-TTS Technical Report", arXiv:2601.15621, 2026.