A simple introduction to speech recognition

By Elena Nisioti

Have you ever tried to watch a foreign movie without subtitles, in a language that you’ve only recently learned? Or talked on the phone with someone who is eating at the same time? Speech recognition is a task where even humans face difficulties if the conditions are not ideal. But, with recent successes in Artificial Intelligence (AI), one cannot help but wonder: how is it possible that we have not perfected voice-to-text technology yet?

Speech recognition: useful or unreliable?

Speech recognition is in fact one of the frontiers of AI: speech is one of the oldest human abilities, and understanding it is one of the most desirable functionalities for software. In contrast to other related technological problems, research does not aim solely at improving it, but at perfecting it. As Andrew Ng, AI expert and Google Brain co-founder, proclaimed, getting speech recognition from 95 per cent to 99 per cent accuracy is what makes the difference between an annoyingly unreliable tool and an incredibly useful one.

Applications of speech recognition

Automatically converting speech to text can change the way we interact with our digital surroundings in a number of ways. One must begin by considering cases where speech recognition is not a luxury, but a necessity. Groups such as people with disabilities or the elderly can find in speech recognition the assistive technology that will enable them to interact seamlessly with their personal devices. Furthermore, speech-to-text technology can be used to create hands-free interfaces, which are essential for tasks that require the continuous attention of users, such as driving.

Nevertheless, the most common example of speech recognition is probably its wide adoption in smartphones, where services such as virtual assistants, voice dictation, voice search and voice typing are considered an integral part of the software. In such a fast-moving technological reality, one can easily miss the forest for the trees. In order to fully appreciate its potential, we must embrace the fact that speech recognition is not just a solution for people who are too lazy to type, but signifies a new way of interacting with technology. As it can give the impression of a more anthropomorphic and natural interaction, it has already been leveraged in augmented reality applications, such as computer games and Google Maps.

Voice-to-text tools

With a variety of applications comes a variety of tools that people can find in today’s market. It only takes a close look at large technology companies, such as Apple, Google and IBM, to observe that voice-to-text is an important arrow in their consumer-targeting quiver. In particular, Apple has placed significant emphasis on exploring the ways in which its users can interact with its software, with products such as the Apple Watch, Siri and the Apple Pencil.

At the other end of the market spectrum lies IBM’s approach. Watson, an intelligent question-answering system, does not primarily target individual consumers, but has already managed to transform the healthcare industry and government by offering software support systems that can assist humans in their decision-making.

How does speech recognition work?

Speech recognition is a task that remains unsolved to date and has absorbed decades of work from electrical engineers, mathematicians and computer scientists. However, at its core, speech recognition is the simple problem of trying to understand how speech signals can be translated to text. It is the characteristics of real speech, such as accent, background noise and ambiguity, that have made it hard to come up with perfect solutions. But let’s try to get a high-level understanding of the problem.

What is speech?

Speech is a sound wave, which we can picture as a signal that evolves in time. Before a computer can handle it, we first need to sample it to convert it to digital form, and then segment it into signals of shorter duration. Most processing techniques also need to recognize the frequencies that are present in a person’s voice, similar to the musical tones in a song. In the end, speech looks like the bars that you see in the equalizer of a music player, where each bar represents the volume of a different frequency in the person’s speech.
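To make the three steps above (sampling, segmenting, finding frequencies) more concrete, here is a minimal Python sketch. It is only an illustration: a pure 440 Hz tone stands in for real speech, and the sample rate, frame size and function names are assumptions chosen for this example, not values taken from any particular speech recognition system. The frequency analysis uses a naive discrete Fourier transform, which is slow but easy to read.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this example)
FRAME_SIZE = 200    # samples per frame, i.e. 25 ms at 8 kHz (assumed)


def sample_tone(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Sample a pure sine tone; a stand-in for a recorded speech signal."""
    n = round(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


def split_into_frames(samples, frame_size=FRAME_SIZE):
    """Segment the signal into consecutive, non-overlapping frames."""
    last_start = len(samples) - frame_size
    return [samples[i:i + frame_size] for i in range(0, last_start + 1, frame_size)]


def magnitude_spectrum(frame):
    """Naive discrete Fourier transform: one magnitude ("bar") per frequency bin."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2)
    ]


def dominant_frequency(frame, sample_rate=SAMPLE_RATE):
    """Return the frequency (in Hz) of the tallest bar in the frame's spectrum."""
    spectrum = magnitude_spectrum(frame)
    peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * sample_rate / len(frame)


signal = sample_tone(440, 0.1)      # 0.1 seconds of a 440 Hz tone
frames = split_into_frames(signal)  # short segments of the signal
print(len(frames), dominant_frequency(frames[0]))
```

Each list returned by `magnitude_spectrum` corresponds to one snapshot of the equalizer bars described above. Real systems typically use overlapping frames and a fast Fourier transform, but the idea is the same.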

What is text?

Since the computer is trying to convert speech to text, one would expect the output of such a system to be sentences, words, or even letters. However, when one considers all the possible accents and pronunciations of a word, one can suspect that mapping a word directly to a sound is hard. Instead, speech-to-text systems recognize phonemes, which are units of sound that all speakers of a particular language use. For example, the words “cat” and “skill” contain the same phoneme /k/, which can account for the sound of the letters c or k in most words of the English language.
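One way to picture this is as a small pronunciation lexicon that maps words to phoneme sequences. The sketch below is purely illustrative: the three entries are hand-written with ARPAbet-style symbols rather than loaded from a real pronunciation dictionary, and `LEXICON` and `shared_phonemes` are hypothetical names introduced for this example.

```python
# A tiny, hand-written lexicon (illustrative only): word -> phoneme sequence.
LEXICON = {
    "cat": ["K", "AE", "T"],
    "skill": ["S", "K", "IH", "L"],
    "kit": ["K", "IH", "T"],
}


def shared_phonemes(word_a, word_b, lexicon=LEXICON):
    """Return the set of phonemes that appear in both words."""
    return set(lexicon[word_a]) & set(lexicon[word_b])


# "cat" is spelled with a c and "skill" with a k, yet both contain the phoneme K.
print(shared_phonemes("cat", "skill"))
```

Because different spellings collapse onto the same small set of sounds, a recognizer only has to tell apart a few dozen phonemes rather than every possible word.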

What is speech recognition?

Every speech-to-text system is trying to answer the following question: “Based on the sound that I just heard, which phoneme did the person most probably use?” This is a purely statistical problem that has traditionally been solved using Hidden Markov Models, as these are among the simplest and most powerful mathematical tools that we have to describe how observations (speech) depend on hidden information (phonemes).
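As a rough illustration of how a Hidden Markov Model answers that question, the sketch below uses the Viterbi algorithm, a standard way of finding the most probable sequence of hidden states given a sequence of observations. Everything about the model is an assumption made up for this example: the three phonemes, the three coarse "sound" labels standing in for real acoustic measurements, and all the probabilities. A real recognizer would learn these numbers from large amounts of recorded speech.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable sequence of hidden states for the observations."""
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               the state that path came from)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        best.append({
            s: max((prev[p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states)
            for s in states
        })
    # Start from the most probable final state and walk the back-pointers.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Toy model with made-up numbers (illustrative only).
states = ["k", "ae", "t"]  # hidden information: phonemes
start_p = {"k": 0.6, "ae": 0.2, "t": 0.2}
trans_p = {  # probability of moving from one phoneme to the next
    "k": {"k": 0.3, "ae": 0.6, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.4, "t": 0.5},
    "t": {"k": 0.2, "ae": 0.2, "t": 0.6},
}
emit_p = {  # probability that a phoneme produces each kind of sound
    "k": {"burst": 0.7, "voiced": 0.1, "click": 0.2},
    "ae": {"burst": 0.1, "voiced": 0.8, "click": 0.1},
    "t": {"burst": 0.2, "voiced": 0.1, "click": 0.7},
}

heard = ["burst", "voiced", "voiced", "click"]  # observations: one label per frame
print(viterbi(heard, states, start_p, trans_p, emit_p))
```

Note that the decoder assigns one phoneme to every frame, so a long vowel shows up as a repeated state. Collapsing the repeats gives the phoneme sequence of the word.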

How can AI help?

The introduction of neural networks did not change the core mechanism of speech recognition, but helped to improve the performance of these systems. Understanding different accents, ignoring background noise and processing huge amounts of data in parallel are a few of the advantages that deep learning brought, and they are the reasons why the tools in today’s market are so successful.

Looking into the future

It is expected that, by 2020, 75 per cent of homes in the USA will have a smart speaker. The eagerness with which people have welcomed home assistants, such as Amazon Echo, into their homes is revealing. Having a device that listens to, and most importantly stores in the cloud, every word that you say could have seemed uncomfortable or privacy-invading in the past. Nevertheless, everyone today, from consumers to companies and the public sector, is eager to adopt this technology. Judging by the rapid progress of deep learning, keyboards are probably not far from joining cassettes and vinyl as objects of technological nostalgia on the shelves of vintage shops.