LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control

Abstract

Fine-grained control over voice impressions (e.g., making a voice brighter or calmer) is a key frontier for more controllable text-to-speech. However, this nascent field faces two challenges. The first is impression leakage, where the synthesized voice is undesirably influenced by the speaker's reference audio rather than by the separately specified target impression. The second is the lack of a public, annotated corpus. To mitigate impression leakage, we propose two methods: 1) a training strategy that uses one utterance for speaker identity and a different utterance from the same speaker for the target impression, and 2) a novel reference-free model that generates a speaker embedding solely from the target impression, which improves robustness against leakage and offers the convenience of reference-free generation. Objective and subjective evaluations demonstrate a significant improvement in controllability. Our best method reduced the mean squared error of 11-dimensional voice impression vectors from 0.61 to 0.41 objectively and from 1.15 to 0.92 subjectively, while maintaining high fidelity. To foster reproducible research, we introduce LibriTTS-VI, the first public voice impression dataset released with clear annotation standards, built upon the LibriTTS-R corpus [2].
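As a rough illustration of the controllability metric quoted in the abstract, the sketch below computes the mean squared error between a target voice impression vector and an estimated one. The 11 dimension names are taken from the impression lists shown further down this page; the function name `vi_mse`, the example values, and the choice to average over the 11 dimensions of a single vector are assumptions made for illustration, not the paper's evaluation code.

```python
from typing import Sequence

# The 11 voice impression dimensions, named as in the lists on this page.
VI_DIMS = [
    "low-high", "m-f", "clear-hoarse", "calm-restless", "powerful-weak",
    "youthful-aged", "thick-thin", "firm-relaxed", "dark-bright",
    "cold-warm", "speed",
]


def vi_mse(target: Sequence[float], estimated: Sequence[float]) -> float:
    """Mean squared error over the 11 dimensions of one VI vector (assumed averaging)."""
    if len(target) != len(VI_DIMS) or len(estimated) != len(VI_DIMS):
        raise ValueError(f"expected {len(VI_DIMS)} values per vector")
    squared_errors = [(t - e) ** 2 for t, e in zip(target, estimated)]
    return sum(squared_errors) / len(VI_DIMS)


# Hypothetical values: the estimate misses the target by 1.0 on one dimension.
target_vi = [4.0] * len(VI_DIMS)
estimated_vi = [4.0] * len(VI_DIMS)
estimated_vi[VI_DIMS.index("dark-bright")] = 5.0

print(vi_mse(target_vi, estimated_vi))
```

In this toy case a single one-point miss on the 1 to 7 scale contributes 1/11 to the error, which gives a feel for the size of the reported reduction from 0.61 to 0.41.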

Overview

Fig 1: Overview of the baseline (VIC-base) and the two proposed methods (VIC-sep and VIC-rfg).

Audio Demos for speaker 8555

[Interactive audio player: select a voice impression (VI) and a modulation level to play the corresponding sample.]
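This page does not spell out how a modulation level maps onto a voice impression vector. One plausible reading, sketched below, is that the level is an additive offset applied to a single VI dimension and clipped to the 1 to 7 rating scale. The function name `modulate`, the additive rule, and the example values are all assumptions for illustration only.

```python
from typing import Dict

VI_MIN, VI_MAX = 1.0, 7.0  # rating scale used for voice impressions on this page


def modulate(vi: Dict[str, float], dim: str, level: float) -> Dict[str, float]:
    """Return a copy of `vi` with `level` added to one dimension, clipped to [1, 7].

    Assumption: modulation is a simple additive offset on one dimension.
    """
    if dim not in vi:
        raise KeyError(f"unknown voice impression dimension: {dim}")
    modified = dict(vi)  # leave the input vector untouched
    modified[dim] = min(VI_MAX, max(VI_MIN, vi[dim] + level))
    return modified


# Hypothetical two-dimensional excerpt of a VI vector.
neutral_excerpt = {"dark-bright": 3.75, "speed": 4.25}

brighter = modulate(neutral_excerpt, "dark-bright", 2.0)
fastest = modulate(neutral_excerpt, "speed", 4.0)  # would exceed 7, so it is clipped
print(brighter, fastest)
```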

Audio Demos for speaker 1089

[Interactive audio player: select a voice impression (VI) and a modulation level to play the corresponding sample.]

LLM-generated Voice Impressions (speaker 8555)

Following the methodology of Voice Impression Control (VIC) [1], these voice impressions were generated by an LLM (Gemini 2.5 Pro) based on speaker 8555's neutral voice. Click a button to play the corresponding audio.

Target Style	Audio Samples & Details
Neutral	Neutral VIs:
	# Impressions
	## {
	## "low-high": 5.640702247619629,
	## "m-f": 6.366395950317383,
	## "clear-hoarse": 3.642221450805664,
	## "calm-restless": 3.2887918949127197,
	## "powerful-weak": 4.879675388336182,
	## "youthful-aged": 3.544085741043091,
	## "thick-thin": 3.981926679611206,
	## "firm-relaxed": 4.565456390380859,
	## "dark-bright": 3.752795934677124,
	## "cold-warm": 4.111004829406738,
	## "speed": 4.218106858506785,
	## },
Sleepy	Prompt to LLM:
	# Task
	## We have data evaluating the impression of a voice based on the following 11 impressions, using a continuous scale from 1 to 7.
	## The score for each impression represents the score in a neutral state.
	## Generate scores for when the speaker speaks in the target speaking style.
	## The output should be in JSON format for each impression.
	# Impressions
	## {
	## "low-high": 5.640702247619629,
	## "m-f": 6.366395950317383,
	## "clear-hoarse": 3.642221450805664,
	## "calm-restless": 3.2887918949127197,
	## "powerful-weak": 4.879675388336182,
	## "youthful-aged": 3.544085741043091,
	## "thick-thin": 3.981926679611206,
	## "firm-relaxed": 4.565456390380859,
	## "dark-bright": 3.752795934677124,
	## "cold-warm": 4.111004829406738,
	## "speed": 4.218106858506785,
	## },
	# Target
	## Sleepy
	LLM Output:
	{
	"low-high": 3.140702247619629,
	"m-f": 5.366395950317383,
	"clear-hoarse": 5.642221450805664,
	"calm-restless": 1.2887918949127197,
	"powerful-weak": 6.379675388336182,
	"youthful-aged": 4.044085741043091,
	"thick-thin": 2.981926679611206,
	"firm-relaxed": 6.565456390380859,
	"dark-bright": 2.252795934677124,
	"cold-warm": 4.911004829406738,
	"speed": 1.7181068585067854
	}
Urgent, attention grabbing	Prompt to LLM:
	# Task
	## We have data evaluating the impression of a voice based on the following 11 impressions, using a continuous scale from 1 to 7.
	## The score for each impression represents the score in a neutral state.
	## Generate scores for when the speaker speaks in the target speaking style.
	## The output should be in JSON format for each impression.
	# Impressions
	## {
	## "low-high": 5.640702247619629,
	## "m-f": 6.366395950317383,
	## "clear-hoarse": 3.642221450805664,
	## "calm-restless": 3.2887918949127197,
	## "powerful-weak": 4.879675388336182,
	## "youthful-aged": 3.544085741043091,
	## "thick-thin": 3.981926679611206,
	## "firm-relaxed": 4.565456390380859,
	## "dark-bright": 3.752795934677124,
	## "cold-warm": 4.111004829406738,
	## "speed": 4.218106858506785,
	## },
	# Target
	## Urgent, attention grabbing
	LLM Output:
	{
	"low-high": 6.640702247619629,
	"m-f": 6.666395950317383,
	"clear-hoarse": 1.642221450805664,
	"calm-restless": 5.78879189491272,
	"powerful-weak": 2.3796753883361816,
	"youthful-aged": 3.544085741043091,
	"thick-thin": 2.981926679611206,
	"firm-relaxed": 1.5654563903808594,
	"dark-bright": 5.752795934677124,
	"cold-warm": 3.6110048294067383,
	"speed": 6.718106858506785
	}
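The prompt-and-response flow in the table above can be sketched in a few lines of Python. The example below assembles a prompt in the layout shown on this page, parses a JSON reply into a target VI vector, and reports how far each dimension moved from neutral. The helper names (`build_prompt`, `parse_llm_output`, `vi_shift`), the clipping of replies to the 1 to 7 scale, and the two-dimensional excerpt are assumptions for illustration. The call to the LLM itself is deliberately left out, so the reply is pasted in as a string.

```python
import json
from typing import Dict

TASK_LINES = [
    "# Task",
    "## We have data evaluating the impression of a voice based on the "
    "following 11 impressions, using a continuous scale from 1 to 7.",
    "## The score for each impression represents the score in a neutral state.",
    "## Generate scores for when the speaker speaks in the target speaking style.",
    "## The output should be in JSON format for each impression.",
]


def build_prompt(neutral: Dict[str, float], target_style: str) -> str:
    """Assemble a prompt in the layout shown in the table above."""
    lines = list(TASK_LINES) + ["# Impressions", "## {"]
    lines += [f'## "{name}": {score},' for name, score in neutral.items()]
    lines += ["## },", "# Target", f"## {target_style}"]
    return "\n".join(lines)


def parse_llm_output(raw: str, neutral: Dict[str, float]) -> Dict[str, float]:
    """Parse the JSON reply and keep the neutral vector's dimensions, clipped to [1, 7]."""
    scores = json.loads(raw)
    missing = set(neutral) - set(scores)
    if missing:
        raise ValueError(f"LLM output is missing dimensions: {sorted(missing)}")
    return {name: min(7.0, max(1.0, float(scores[name]))) for name in neutral}


def vi_shift(neutral: Dict[str, float], target: Dict[str, float]) -> Dict[str, float]:
    """Per-dimension change from the neutral vector to the target vector."""
    return {name: target[name] - neutral[name] for name in neutral}


# Two-dimensional excerpt of the neutral and "Sleepy" values listed above.
neutral = {"calm-restless": 3.2887918949127197, "speed": 4.218106858506785}
reply = '{"calm-restless": 1.2887918949127197, "speed": 1.7181068585067854}'

prompt = build_prompt(neutral, "Sleepy")
target = parse_llm_output(reply, neutral)
shift = vi_shift(neutral, target)
print(prompt)
print(shift)
```

For the "Sleepy" excerpt, the shift is about -2.0 on calm-restless and about -2.5 on speed, which matches the intuition of a calmer, slower voice.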

References

[1] K. Fujita, et al., "Voice Impression Control in Zero-Shot TTS", Interspeech 2025.

[2] Y. Koizumi, et al., "LibriTTS-R: Restoration of a Large-Scale Multi-Speaker TTS Corpus", Interspeech 2023.