Hosted by IDTechEx
Posted on July 04, 2017

Innovative voice creation based on deep learning

Neural networks have revolutionized computer vision and automatic speech recognition. This machine learning revolution is now delivering on its promises as it enters the text-to-speech arena.
 
Acapela Group is actively working on Deep Neural Networks (DNNs) and is proud to present the first achievements of its research in this fascinating field, creating new opportunities for voice interfaces.
 
Their R&D lab has developed Acapela DNN, an engine capable of creating a voice using a limited amount of existing or new speech recordings.
 
"Acapela DNN represents 'Acapela's ultimate talking machine', benefiting from our speech expertise and learning from our vast voice and language databases to model voice identities and reproduce speech, in many languages. This is much more than concatenating speech recordings from the studio like we used to do with unit selection. We are talking about creating a voice signal and persona from scratch and in many languages and it is happening now. We need only one week to release a new voice based on a few minutes of speech recordings", says Vincent Pagel, R&D and Linguistic Group manager, Acapela Group.
 
While synthetic voice creation has traditionally required rich audio material recorded by a professional voice actor in a professional studio, under the supervision of a linguistic expert, Acapela can now create a voice from an average of 10 to 15 minutes of clean audio recordings together with the text transcription of those samples.
 
Voices can be created from minutes or hours of speech recordings, depending on the targeted usage. In specific cases such as voice replacement for patients, Acapela DNN can work with a few minutes of speech. For professional usage, such as creating a voice for a video game or a passenger information system, Acapela DNN will need more recordings. Naturally, the more data there is, the more the DNN can learn the speaker's specific habits and create a voice that matches the original.
 
The first results of voices created using this approach are impressive.
 
The group has worked on voice recordings of well-known people and has also created voices for individuals who can no longer speak due to surgery or disease. These individuals will be the first to speak with voices created with Acapela DNN.
 
Other ongoing experiments include voices for video games and robots. Voice creation based on DNNs is virtually limitless: with this new approach, Acapela aims to push the boundaries of the technology, allowing everyone to have a voice.
 
Material needed: average of 10-15 min of clean recordings + text transcription
 
Acapela DNN is trained offline on all the voices in the catalogue: the group feeds it every text and acoustic database it has for all of its voices. This means Acapela DNN knows a great deal about human speech in general, but does not yet know anything about a specific person's voice and must hear that voice for a while before it can reproduce it.
 
  • 1st pass algorithm: 'Voice ID' parameters define the digital signature (or sonority) of the speaker's vocal tract.
  • 2nd pass algorithm: additional Acapela DNN training matches the imprint of the voice, down to its fine-grained details (accents, speaking habits, etc.).
  • A new voice is then created from a limited amount of audio data.
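The two-pass process above can be sketched in miniature. The code below is a hypothetical illustration only: the function names, the mean-vector 'Voice ID', and the tiny linear model are all our assumptions, not Acapela's actual pipeline. It shows the shape of the idea: first extract a compact signature from the new speaker's recordings, then nudge a generic model toward that speaker's data with a few training steps.

```python
def voice_id(frames):
    """1st pass (sketch): derive a 'Voice ID' signature as the mean of
    the speaker's acoustic feature frames -- a stand-in for a learned
    speaker embedding."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def adapt(base_weights, frames, targets, lr=0.1, steps=100):
    """2nd pass (sketch): fine-tune a generic model's weights on the new
    speaker's data. A single linear model trained by stochastic gradient
    descent stands in for the full network."""
    w = list(base_weights)
    for _ in range(steps):
        for x, y in zip(frames, targets):
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w
```

With only two toy feature frames, `voice_id` returns their mean, and `adapt` pulls the generic weights toward the speaker's targets; a real system would do the same with far richer features and a deep network.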
 
About DNN
A deep neural network (DNN) is an artificial neural network (ANN) with multiple hidden layers between the input and output layers. DNNs can model complex non-linear relationships. In text-to-speech, they are used to learn the relationship between a set of input texts and their acoustic realizations by different speakers.
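As a toy illustration of that definition, the forward pass of a network with hidden layers between input and output can be written in a few lines of plain Python. This is a sketch only: real TTS networks map rich linguistic features to acoustic parameters through far larger layers, and the weights here are arbitrary examples.

```python
def relu(v):
    # Non-linearity applied after each hidden layer.
    return [max(0.0, x) for x in v]

def dense(x, W, b):
    """One fully connected layer: y = W x + b."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def dnn_forward(x, layers):
    """Pass the input through the hidden layers (with ReLU) and a final
    linear output layer -- the stacked hidden layers are the 'deep' in DNN."""
    *hidden, out = layers
    for W, b in hidden:
        x = relu(dense(x, W, b))
    W, b = out
    return dense(x, W, b)

layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # hidden layer (identity, for clarity)
    ([[1.0, 1.0]], [0.5]),                   # linear output layer
]
dnn_forward([1.0, 2.0], layers)  # -> [3.5]
```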
 
Neural networks are a set of algorithms, modelled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labelling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated.
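Translating real-world data into the numerical vectors mentioned above can be as simple as one-hot encoding each character of a text, as sketched below (a minimal example; production speech systems use much richer linguistic features):

```python
def one_hot_text(text, alphabet="abcdefghijklmnopqrstuvwxyz "):
    """Translate raw text into numerical vectors: one one-hot vector per
    character, the kind of input a neural network can consume.
    Characters outside the alphabet map to the zero vector."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    vectors = []
    for ch in text.lower():
        v = [0.0] * len(alphabet)
        if ch in index:
            v[index[ch]] = 1.0
        vectors.append(v)
    return vectors
```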
 
Source and top image: Acapela Group