A new world enabled by speech/voice user interface

Speech user interfaces have been part of our lives for a long time. Many of us still remember trying the speech control systems in older cars, which usually had poor recognition capability and responded very slowly. High-end vehicles have typically been equipped with displays operated via touch screens or mechanical buttons. However, the landscape is changing, and not only in automotive applications, as described in IDTechEx Research's report Voice, Speech, Conversation-Based User Interface 2019-2029: Technology, Player, Market.

Speech recognition gives machines the ability to hear

Speech recognition (SR) is the "ear" of a machine. It is the basis of any speech user interface, because the spoken input drives the whole interaction. From 1993 onwards, the accuracy of speech recognition stagnated at around 70% with traditional models, which led to poor user experiences: users easily became frustrated and lost patience. It was machine learning, and more specifically deep learning, that significantly increased recognition accuracy in the 2010s. In 2016, Microsoft reported a speech recognition system that reached human parity with a word error rate of 5.9%, and in 2017 Google reported an accuracy of 95%. These improvements indicate that machines can be as good as human beings at "hearing", and speech recognition has now become a commodity.
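The word error rate quoted above is usually computed as the number of word substitutions, deletions and insertions needed to turn the recognized text into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that calculation, not the evaluation code used by Microsoft or Google. The function name and the example sentences are hypothetical, and treating "accuracy" as 1 minus the word error rate is an assumption, since the figures above do not state how accuracy was defined.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical in-car command: one substituted word and one dropped word.
reference = "turn on the air conditioning please"
hypothesis = "turn on the hair conditioning"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}")  # prints "WER: 33.3%"
```

Under the assumed convention that accuracy is 1 minus the word error rate, a word error rate of 5.9% would correspond to roughly 94% accuracy. Because inserted words also count as errors, the word error rate can exceed 100%, so the conversion is only a rough guide.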

Besides the "ear", a machine also needs a "brain", a "mouth" and other organs to achieve natural language speech interaction. Speech synthesis (also known as text-to-speech, or TTS) has become a commodity as well. Breakthroughs are now needed in natural language understanding and real-world knowledge integration, which are considered the core of the machine "brain".

These developments have benefited largely from artificial intelligence (AI). Speech and vision are two important early applications of AI in which deep learning (DL) is applied. Giants such as Apple, Amazon, IBM, Google and Microsoft are all working on smart speech.

Speech/Voice-enabled applications

With high recognition accuracy, the ability to speak and growing intelligence, speech user interfaces can disrupt a wide range of applications. Returning to the automotive application mentioned at the beginning, many functions and services can be controlled by voice. Examples include communication (dialling and messaging), navigation, entertainment (such as music playback), vehicle control (air conditioning, windscreen wipers, sunroof, mirrors, etc.), and vehicle and everyday information checks (tyre pressure, mileage, weather, traffic, etc.).

Home automation is another example. Since 2017, CES has been a showcase for Amazon Alexa, with a large number of devices integrating Alexa for voice control. This extends voice control from a central voice-activated smart speaker (such as Amazon Echo or Google Home) or a smartphone to home appliances such as lights, ovens, air conditioners and televisions. At CES 2019, we again saw a variety of voice-powered home appliances with AI integration.

Apart from automotive and home automation, speech/voice interfaces are also disrupting applications in healthcare, travel and hotels, banking, finance and insurance, education, gaming and entertainment, and more. Speech interfaces also play an important role in spoken machine translation.

These disruptive technologies are reshaping our daily lives and creating new business models. By 2029, the total market could reach $15.5 billion. More applications, technology introductions and market analysis can be found in the report Voice, Speech, Conversation-Based User Interface 2019-2029: Technology, Player, Market.

Authored By: Dr Xiaoxi He