Masked Duration Model for Utterance Duration-Controllable Text-to-Speech
By: Taewoo Kim, Choongsang Cho, Young Han Lee
Format: Article
Published: IEEE, 2024-01-01
Description
Recent advances in neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, but maintaining naturalness while controlling utterance duration remains challenging. Existing approaches to controlling utterance duration rely on post-processing techniques that compromise naturalness, and these techniques are not effective in all scenarios, particularly when addressing variable utterance durations. This study presents a novel masked duration model that enables controllable utterance duration in TTS synthesis. The approach uses audio prompts, text prompts, and masks to predict phone durations within the masked span, whose total corresponds to the utterance duration. This enables precise control of the utterance duration: the target duration is determined first, and the phone durations are then predicted to fit it. The model allows fine-grained control over utterance duration, enabling more nuanced and realistic speech outputs. Additionally, an adversarial training strategy is employed to strengthen the alignment between audio and text prompts. The experimental results demonstrate that the proposed model outperformed the baseline model in utterance duration control, and ablation studies validate the effectiveness of adversarial training in enhancing model performance. This technology is suitable for applications requiring precise control over utterance duration, such as automatic voice dubbing.
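The core idea of fixing the target utterance duration first and then filling in phone durations over the masked span can be illustrated with a toy sketch. This is not the paper's neural model: the uniform split over masked positions below is an assumed stand-in for the learned duration predictor, and the function name and inputs are hypothetical.

```python
def fill_masked_durations(durations, mask, target_total):
    """Toy illustration of masked duration filling.

    durations: per-phone durations in frames; entries where mask is True
               are unknown and ignored.
    mask:      True where the phone duration must be predicted.
    target_total: desired total utterance duration in frames.

    A uniform split over the masked span stands in for the model's
    learned prediction (an assumption for illustration only).
    """
    known = sum(d for d, m in zip(durations, mask) if not m)
    remaining = target_total - known
    n_masked = sum(mask)
    per_phone = remaining / n_masked  # uniform stand-in prediction
    return [per_phone if m else d for d, m in zip(durations, mask)]


# Phones 2 and 3 are masked; the filled durations sum to the target.
filled = fill_masked_durations([5, 0, 0, 7], [False, True, True, False], 20)
```

Because the known durations are subtracted from the target before the masked span is filled, the total utterance duration is met exactly regardless of how the masked portion is distributed.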