Thai Text To Speech with Tacotron2 | Lifelike Speech Synthesis
This AI clones your voice with just a few hours of recording!
Our main goal for this text-to-speech project is not just to make a machine talk, but to make it converse the way humans naturally speak. We want voice assistants in our own mother tongue that are hardly distinguishable from a real person.
ยินดีที่ได้รู้จัก นี่คือเสียงจากปัญญาประดิษฐ์ (Nice to meet you. This is a voice from an artificial intelligence.)
GitHub
Enhancing Quality Education of the Blind
I believe in equality of education: reducing limitations for the blind, making them independent learners, and providing an easier life with Thai_TTS.
Tacotron 2
Tacotron is a generative model that synthesizes speech directly from characters, introducing key techniques that make the sequence-to-sequence framework perform very well for text to speech.
Tacotron 2 consists of two main parts: a spectrogram prediction network, which converts character embeddings into a mel spectrogram, and a WaveNet vocoder, which turns the mel spectrogram into a waveform.
Tacotron 2 improves on the original Tacotron by predicting a mel-scale spectrogram instead of a linear-scale spectrogram.
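To make the difference concrete, here is a minimal sketch, using librosa, of computing both spectrogram types from a wave file (the file name is a placeholder, and the STFT parameters are common Tacotron 2 defaults, not necessarily the exact ones we used):

```python
import librosa
import numpy as np

# Load audio at the sampling rate we trained with
y, sr = librosa.load("sample.wav", sr=22050)

# Linear-scale spectrogram: what the original Tacotron predicted
linear = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Mel-scale spectrogram: what Tacotron 2 predicts; the same energies
# warped onto 80 perceptually spaced frequency bands
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
```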
Pipeline
- Processing the Tsync Library (Thai datasets)
- Checking the configuration files (delete unused character embeddings; see the sketch after this list)
- Trimming out the silences (optional)
- Trying out different sampling rates (22050 Hz worked best) or quantization
- Training the model
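For the configuration step, this is roughly what changes: the English symbol set in NVIDIA's text/symbols.py is replaced with the characters that actually occur in the Thai corpus, and the matching hparams are updated. A minimal sketch (the Thai character list below is illustrative, not our exact set):

```python
# text/symbols.py — swap the English symbol set for Thai characters
# (illustrative subset; list every character that appears in your corpus)
_pad = "_"
_punctuation = "!'(),.:;? "
_thai = "กขคงจฉชซญดตถทธนบปผฝพฟภมยรลวศษสหอฮะัาำิีึืุูเแโใไ็่้๊๋"
symbols = [_pad] + list(_punctuation) + list(_thai)

# hparams.py — the settings we touched (sketch)
text_cleaners = ["basic_cleaners"]  # English cleaners would mangle Thai text
sampling_rate = 22050               # the rate that worked best for us
```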
Processing the Tsync Library (Thai Datasets)
Thank you to AI for Thai for the great contribution of a Thai-language corpus for training text to speech.
There are two folders once you have downloaded the files:
1. wav — containing the wave files
2. wrd_ph — the corresponding text (script files)
For the Tacotron 2 filelist format, we only need each wav file's path and its corresponding script.
*The script file (txt) shouldn’t have any blank/empty lines*
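Here is a minimal sketch of building that filelist (folder names, file extensions, and matching file names across the two folders are assumptions about the layout described above):

```python
# Build the Tacotron 2 filelist: one "wav_path|transcript" pair per line
import os

wav_dir, txt_dir = "tsync/wav", "tsync/wrd_ph"

with open("filelists/thai_train.txt", "w", encoding="utf-8") as out:
    for name in sorted(os.listdir(txt_dir)):
        with open(os.path.join(txt_dir, name), encoding="utf-8") as f:
            text = f.read().strip()
        if not text:  # skip blank scripts; empty lines break training
            continue
        wav_path = os.path.join(wav_dir, name.replace(".txt", ".wav"))
        out.write(f"{wav_path}|{text}\n")
```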
Trimming out the silences
We trimmed the silences from the audio files so that the model would learn a better alignment. We used a sampling rate of 22050 Hz and a threshold of 20 dB.
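A minimal sketch with librosa at those settings (file paths are placeholders; soundfile is used for writing):

```python
import librosa
import soundfile as sf

# Load and resample to 22050 Hz in one step
y, sr = librosa.load("tsync/wav/sample.wav", sr=22050)

# Trim leading/trailing silence below the 20 dB threshold
trimmed, _ = librosa.effects.trim(y, top_db=20)

sf.write("trimmed/sample.wav", trimmed, sr)
```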
Our Final Model’s output
Train the model
Everything is ready! Let’s train the model!
STAGE 1 — Experimental
We experimented with training on both Thai and English datasets from a cold start (training from scratch). The results are similar; neither is much better than the other.
Preliminary result of training on LJSpeech from scratch: the alignment fails to learn. Nevertheless, the predicted mel spectrogram matches the target remarkably well; the two are nearly indistinguishable.
Sample Audio in English
What is it? It’s no reason. Even in the past.
STAGE 2 — Training with Warm Start
Warm-start training helps the model learn a better alignment, which is extremely important in a text-to-speech model.
Training from a pre-trained model can lead to faster convergence. By default, the dataset-dependent text embedding layers are ignored.
The predicted and target mel spectrograms match closely, and the training loss looks excellent. Moreover, warm starting is what got the model's alignment to learn.
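Concretely, with NVIDIA's Tacotron 2 repo this is enabled by passing --warm_start to train.py, e.g. `python train.py --output_directory=outdir --log_directory=logdir -c tacotron2_statedict.pt --warm_start`. Under the hood it does roughly the following (a simplified sketch of the repo's warm_start_model):

```python
import torch

def warm_start_model(checkpoint_path, model, ignore_layers=("embedding.weight",)):
    """Load pre-trained weights, but skip dataset-dependent layers:
    the character embedding must be re-learned for the Thai symbol set."""
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    pretrained = {k: v for k, v in checkpoint["state_dict"].items()
                  if k not in ignore_layers}
    state = model.state_dict()
    state.update(pretrained)  # overwrite everything except the ignored layers
    model.load_state_dict(state)
    return model
```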
Sample Audio
สวัสดีคะ นี่คือเสียงสังเคราะห์ (Hello, this is a synthesized voice.)
STAGE 3 — A Fluently Speaking TTS
These are the TensorBoard results after training for 10,000 steps. They show a strong model with a steadily decreasing loss.
The alignment learned well and shows the “diagonal” pattern that marks a good alignment. Tacotron uses an attention mechanism, so the plot shows the attention scores between the input text and the audio frames.
Sample Audio
สวัสดีคะ ฉันชื่อพริม (Hello, my name is Prim.)
สวัสดีคะ นี่คือเสียงสังเคราะห์โดยพริม (Hello, this is a voice synthesized by Prim.)
มะลิ้งกิ๊งก๊องสามารองก๊องแก๊งป๊ะริ๊งปิ๊งป๊องเมี้ยงปร๊ะเมี้ยง (a nonsense Thai tongue twister)
Hi. Welcome to text synthesizer by Prim
OUR JOURNEY
It began with slurred synthesized speech in our very first TTS training run. Then we tried warm starting to get a better alignment learned. To our delight, a fluent speaker emerged from our native TTS. Astonishing!
Let’s try the result!
Feel free to visit our Colaboratory notebooks and try inference with this model. All you need to do is change the text!
Training Notebook
Inference Notebook
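For reference, here is a condensed sketch of what an inference pass looks like, assuming you run it inside NVIDIA's tacotron2 repo with a pre-trained WaveGlow vocoder (the checkpoint file names are placeholders):

```python
import numpy as np
import torch
from hparams import create_hparams
from model import Tacotron2
from text import text_to_sequence

hparams = create_hparams()
model = Tacotron2(hparams).cuda().eval()
model.load_state_dict(torch.load("thai_tacotron2.pt")["state_dict"])

# basic_cleaners leaves Thai characters untouched
sequence = np.array(text_to_sequence("สวัสดีค่ะ", ["basic_cleaners"]))[None, :]
sequence = torch.from_numpy(sequence).cuda().long()

# Predict the mel spectrogram, then vocode it to a waveform
_, mel_outputs_postnet, _, alignments = model.inference(sequence)
waveglow = torch.load("waveglow_256channels.pt")["model"].cuda().eval()
with torch.no_grad():
    audio = waveglow.infer(mel_outputs_postnet, sigma=0.666)
```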
Trained with and without Full Stops
With Full stops
Without Full stops
The voice synthesized by the model trained with full stops is noticeably better than the one trained without them. In my opinion, training with full stops gives the model a cue that the sentence has ended.
“That’s one small step for man,
one giant leap for mankind.”
Neil Armstrong
We will keep developing our model to make an even better, more lifelike speech synthesizer.
Evaluation
Mean Opinion Score (MOS) measures the quality of synthesized speech with ratings from 1 to 5. MOS is a standard measure for subjective sound-quality tests, and the scores were obtained in blind tests with human subjects.
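The computation itself is just the arithmetic mean of all listener ratings; a tiny sketch with made-up scores:

```python
# MOS = (1/N) * sum of all 1-5 ratings for a sample (scores are hypothetical)
ratings = [4, 5, 3, 4, 4, 5, 4, 3]  # one score per listener
mos = sum(ratings) / len(ratings)
print(f"MOS = {mos:.2f}")  # MOS = 4.00
```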
Future Plan
Training with WaveGlow + more data (for more natural speech)
Special Thanks
My Contact
Email : prim9000@gmail.com
GitHub : https://github.com/Prim9000
Medium : https://prim9000.medium.com/
Facebook : https://www.facebook.com/prim.wong.98/
Hope you have fun with ThaiTTS!