WaveNet: Deep learning based speech and audio synthesis on steroids

Deep learning has attracted a great deal of interest in recent years. The modeling power of deep architectures surpasses the accuracy of previous techniques in various research fields; besides image processing, audio and speech modeling is one such field.
Traditionally, speech parameters were extracted from raw audio with various vocoder algorithms, and these parameters were then modeled, e.g. with hidden Markov models or deep neural networks. A novel deep learning architecture by Google, called WaveNet, removes the need for parameter extraction: with a powerful autoregressive regression technique it models the raw waveform directly, at high quality.
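To give a flavor of how WaveNet operates on the raw waveform, the sketch below shows a causal, dilated 1-D convolution, the core building block of the architecture. This is an illustrative toy only (a real WaveNet adds gated activations, residual and skip connections, and a softmax over quantized sample values); the function name and plain-Python formulation are my own.

```python
def causal_dilated_conv(x, w, dilation):
    """Convolve signal x with filter w so that the output at time t
    depends only on input samples at times <= t (causality).
    w[0] is applied to the current sample, w[1] to the sample
    `dilation` steps in the past, and so on."""
    k = len(w)
    pad = (k - 1) * dilation          # left-pad: no future samples are used
    xp = [0.0] * pad + list(x)
    return [sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
            for t in range(len(x))]

# Stacking such layers with dilations 1, 2, 4, 8, ... doubles the
# receptive field per layer: with kernel size 2 and dilations
# [1, 2, 4, 8], each output sample sees the previous 16 input samples.
receptive_field = 1 + sum((2 - 1) * d for d in [1, 2, 4, 8])  # = 16
```

The exponentially growing dilations are what let WaveNet cover thousands of audio samples of context with only a handful of layers, which is why no hand-crafted vocoder parameters are needed.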
This presentation introduces the basics of deep learning based speech modeling, describes the WaveNet model, and discusses our experiments with WaveNet applied to Hungarian speech synthesis.

Dr. Tóth Bálint Pál
assistant professor, BME

Bálint Pál Tóth is an assistant professor at the Budapest University of Technology and Economics, Department of Telecommunications and Media Informatics (BME TMIT). He has been active in machine learning since 2007. He was the first to create a hidden Markov model based text-to-speech (TTS) system for Hungarian, and he wrote his PhD thesis on this topic. In recent years his research has focused on deep learning, and he is involved in numerous R&D projects.
Bálint Pál Tóth received an NVIDIA hardware grant in 2015, and under his leadership his department was named an official NVIDIA GPU Education Center based on his deep learning related research and education activities.