Thai Text To Speech with Tacotron2 | Lifelike Speech Synthesis

This AI clones your voice with just a few hours of recording!

Prim Wong
Jun 29, 2021

Our main goal for this text-to-speech project is not simply to make a machine talk, but to make it talk the way humans naturally speak in conversation. One day we will be able to go about our lives with voice assistants in our own mother tongue, hardly noticing the difference.

Training Text to Speech in Thai using Tacotron2

ยินดีที่ได้รู้จัก นี่คือเสียงจากปัญญาประดิษฐ์ (“Nice to meet you. This is the voice of an artificial intelligence.”)

GitHub

Enhancing Quality Education of the Blind

I believe in equality of education. Thai_TTS reduces limitations for the blind, helping them become independent learners and making everyday life easier.

Equality of Education

Tacotron 2

Tacotron is a generative model that synthesizes speech directly from characters, introducing key techniques that make the sequence-to-sequence framework perform very well for text to speech.

The Tacotron 2 model consists of two main parts: a spectrogram prediction network, which converts character embeddings into a mel spectrogram, and a WaveNet vocoder, which turns the mel spectrogram into a waveform.

Tacotron 2 improves on Tacotron by predicting a “mel-scale spectrogram” instead of a “linear-scale spectrogram”.
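To make the two-stage pipeline concrete, here is a minimal inference sketch using NVIDIA’s published English Tacotron 2 and WaveGlow checkpoints from PyTorch Hub. This is for illustration only: the post trains its own Thai model, WaveGlow stands in for the WaveNet vocoder here, and a CUDA GPU is assumed.

    import torch

    # Stage 1: characters -> mel spectrogram (NVIDIA's English checkpoint).
    tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                               'nvidia_tacotron2', model_math='fp16')
    tacotron2 = tacotron2.to('cuda').eval()

    # Stage 2: mel spectrogram -> waveform (WaveGlow vocoder).
    waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                              'nvidia_waveglow', model_math='fp16')
    waveglow = waveglow.remove_weightnorm(waveglow).to('cuda').eval()

    utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
    sequences, lengths = utils.prepare_input_sequence(["Hello world."])

    with torch.no_grad():
        mel, _, _ = tacotron2.infer(sequences, lengths)  # text -> mel
        audio = waveglow.infer(mel)                      # mel -> waveform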

Pipeline

  • Processing the Tsync library (Thai dataset)
  • Checking the configuration files (delete unused character embeddings)
  • Trimming out the silences (optional)
  • Trying out different sampling rates (22050 Hz worked best) or quantization
  • Train the model

Processing the Tsync Library (Thai Dataset)

Thank you to AI for Thai for the great contribution of providing a Thai-language corpus for training text to speech.

https://aiforthai.in.th/corpus.php

There are two folders once you download the files:
1. wav — containing the wave files
2. wrd_ph — the corresponding text (script files)

For the Tacotron 2 format, we only need each wav file’s path together with its script.

Format Tsync

*The script file (txt) shouldn’t have any blank/empty lines*
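As a sketch of this preprocessing step, the snippet below pairs each wav file with its script and writes the Tacotron 2 filelist format (wav_path|text, one pair per line), skipping empty scripts. The folder names follow the download layout above; the output path and the matching-by-stem convention are assumptions.

    from pathlib import Path

    wav_dir, txt_dir = Path('wav'), Path('wrd_ph')
    Path('filelists').mkdir(exist_ok=True)

    with open('filelists/tsync_train.txt', 'w', encoding='utf-8') as out:
        for wav_path in sorted(wav_dir.glob('*.wav')):
            # Matching script file is assumed to share the wav file's stem.
            txt_path = txt_dir / (wav_path.stem + '.txt')
            text = txt_path.read_text(encoding='utf-8').strip()
            if not text:  # the filelist must not contain blank/empty lines
                continue
            out.write(f'{wav_path}|{text}\n')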

Trimming out the silences

We trimmed out the silences in the audio files so that the model would learn a better alignment. We used sr (sampling rate) = 22050 and threshold = 20.

librosa trim
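A minimal trimming sketch with librosa under the settings above; note that sr=22050 also resamples on load, and librosa calls the threshold top_db. The file names are illustrative.

    import librosa
    import soundfile as sf

    # Load and resample to 22050 Hz, then strip leading/trailing silence.
    y, sr = librosa.load('wav/tsync_0001.wav', sr=22050)  # hypothetical file name
    y_trimmed, _ = librosa.effects.trim(y, top_db=20)     # threshold = 20 dB
    sf.write('wav_trimmed/tsync_0001.wav', y_trimmed, sr)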

Our Final Model’s output

Train the model

Everything is ready! Let’s train the model!

STAGE 1 — Experimental

We experimented with training on both Thai and English datasets with a cold start (training from scratch). The results were alike; neither was much better than the other.

The preliminary result of training on LJSpeech from scratch: the alignment is not learned. Nevertheless, it is clear that the mel spectrogram prediction is remarkably good, with the predicted and target spectrograms nearly indistinguishable from one another.

Sample Audio in English

What is it? It’s no reason. Even in the past.

STAGE 2 — Training with Warm Start

Warm-start training helps the model learn a better alignment, which is extremely important in a text-to-speech model.

Training from a pre-trained model can lead to faster convergence. By default, the dataset-dependent text embedding layers are ignored.

Pre-trained warm start: https://github.com/Prim9000/Thai_TTS#training-using-pre-trained-warm-start
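A sketch of what warm starting does under the hood, paraphrasing the NVIDIA Tacotron 2 training script (the checkpoint key names are assumptions): the pre-trained weights are loaded, but the dataset-dependent character-embedding table is dropped, because the Thai character set differs from the English one the checkpoint was trained on.

    import torch

    def warm_start(model, checkpoint_path, ignore_layers=('embedding.weight',)):
        # Load pre-trained weights, skipping dataset-dependent layers
        # such as the character-embedding table.
        state = torch.load(checkpoint_path, map_location='cpu')['state_dict']
        state = {k: v for k, v in state.items() if k not in ignore_layers}
        model_state = model.state_dict()   # a freshly initialized Tacotron 2
        model_state.update(state)          # ignored layers keep fresh weights
        model.load_state_dict(model_state)
        return model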

The predicted mel spectrogram matches the target mel spectrogram closely, with an excellent training loss. Furthermore, using a warm start enabled the model to learn the alignment.

Sample Audio

สวัสดีคะ นี่คือเสียงสังเคราะห์ (“Hello, this is a synthesized voice.”)

STAGE 3 — Fluently speaking TTS

These are the TensorBoard results from training for 10,000 steps. They show a strong model with a steadily decreasing loss.

The model learned the alignment well, producing the “diagonal” pattern that indicates a good alignment. Tacotron uses an attention mechanism, so the plot shows the attention scores between the input text and the audio frames.

A good alignment was learned
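To inspect an alignment yourself, a small matplotlib sketch like the one below renders the attention matrix; here `alignment` is assumed to be the 2-D array of attention scores produced during training or inference.

    import matplotlib.pyplot as plt
    import numpy as np

    def plot_alignment(alignment):
        # A clear diagonal band means the model has learned which
        # characters correspond to which audio frames.
        fig, ax = plt.subplots(figsize=(6, 4))
        im = ax.imshow(np.asarray(alignment), aspect='auto',
                       origin='lower', interpolation='none')
        ax.set_xlabel('Decoder timestep (audio frames)')
        ax.set_ylabel('Encoder timestep (characters)')
        fig.colorbar(im, ax=ax)
        plt.show()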

Sample Audio

สวัสดีคะ ฉันชื่อพริม (“Hello, my name is Prim.”)

สวัสดีคะ นี่คือเสียงสังเคราะห์โดยพริม (“Hello, this is a voice synthesized by Prim.”)

มะลิ้งกิ๊งก๊องสามารองก๊องแก๊งป๊ะริ๊งปิ๊งป๊องเมี้ยงปร๊ะเมี้ยง (nonsense syllables, used as a tongue-twister stress test)

Hi. Welcome to text synthesizer by Prim

OUR JOURNEY

It began with the slurred voices synthesized in our very first trials of training TTS; then we moved to the warm start to get a better-learned alignment. At last, a fluent speaker emerged from our native Thai TTS. Astonishing!

Let’s try the result!

Feel free to visit our Colaboratory notebooks and run inference with this model. All you need to do is change the text!

Training Notebook

Inference Notebook

Trained with and without Full Stops

With Full stops

Without Full stops

Voices synthesized by a model trained with full stops are noticeably better than those from a Tacotron 2 trained without them. In my opinion, training with full stops gives the model a cue that the sentence has ended.

“That’s one small step for man,

one giant leap for mankind.”

Neil Armstrong

We will keep developing our model toward an even better TTS with more lifelike speech synthesis.

Evaluation

Mean Opinion Score (MOS) measures the quality of synthesized speech with ratings from 1 to 5. MOS is a standard measure for subjective sound-quality tests and is obtained in blind tests with human subjects.

https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
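For illustration, computing a MOS from collected ratings is just an average over listeners’ 1-to-5 scores; the numbers below are made up, since we do not report our own MOS here.

    import statistics

    # Hypothetical listener ratings (1-5) for a set of synthesized clips.
    ratings = [4, 5, 3, 4, 4, 5, 3, 4]

    mos = statistics.mean(ratings)
    sd = statistics.stdev(ratings)
    print(f"MOS = {mos:.2f} (sd {sd:.2f}, n = {len(ratings)})")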

Future Plan

Training with WaveGlow + more data (for more natural speech)
