How much can we infer about a person's looks from the way they speak? A group of researchers from MIT has created an artificial intelligence system that can reconstruct people's faces by listening to their voices. For more information see the IDTechEx report on Voice, Speech, Conversation-Based User Interfaces 2019-2029: Technologies, Players, Markets.
Called Speech2Face, the deep neural network was tasked with reconstructing a facial image of a person from a short audio recording of that person speaking. The team designed and trained the network using millions of natural videos of people speaking, collected from the Internet, including YouTube. During training, the model learns correlations between voices and faces that allow it to produce images capturing physical attributes of the speakers such as age, gender and ethnicity. This was a purely academic investigation, and the researchers acknowledge ethical considerations due to the sensitivity of facial information.
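The core idea described above can be sketched in miniature: a voice encoder reduces a speech spectrogram to a compact feature vector, which a separately trained face decoder would then turn into an image. The sketch below is purely illustrative; the layer sizes, pooling choice, and the `voice_encoder` function are assumptions for demonstration, not the actual Speech2Face architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def voice_encoder(spectrogram, weights):
    # Illustrative stand-in for the paper's voice encoder: average the
    # spectrogram over time, then project into a hypothetical
    # face-feature space (a real face decoder would map this to pixels).
    pooled = spectrogram.mean(axis=1)   # shape: (n_freq,)
    return np.tanh(weights @ pooled)    # shape: (face_dim,)

n_freq, face_dim = 64, 16
# Randomly initialised weights; in the real system these would be
# learned from millions of speaking-face videos.
weights = rng.standard_normal((face_dim, n_freq)) * 0.1

# A toy "3-second clip": 64 frequency bins by 300 time frames.
clip = rng.standard_normal((n_freq, 300))
face_features = voice_encoder(clip, weights)
print(face_features.shape)  # (16,)
```

The point of the sketch is the pipeline shape, not the numbers: speech comes in as a time-frequency representation and leaves as a fixed-length vector of face-related attributes, which is why the output captures broad traits like age and gender rather than an individual identity.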
The method cannot identify the true identity of a person from their voice, because the model is trained to capture visual features that are common to many individuals, and it does so only where there is strong enough statistical evidence connecting those visual features with attributes of the speaker's voice. Like any machine learning model, it is affected by an uneven distribution of training data.
For example, high-pitched male voices, including those of children, were incorrectly rendered as female, and low-pitched female voices as male.
Voice recognition may be used in the future for phone banking, as in the Voice ID programme launched last year.
Source and top image: Speech2Face