In a Nutshell

Voice Recognition VAs (Almost) Come of Age

By Tim Lindner

2017 is starting off with a significant media focus on voice recognition technology. Prominent publications such as The Economist are doubling down on their coverage, making it the cover story and sole subject of the first Technology Quarterly of the new year. This level of attention is the result of an explosion of new applications that embed AI-infused personal assistants such as Apple’s Siri, Amazon’s Alexa, Google’s Assistant, and Microsoft’s Cortana. Indeed, the crescendo of news coverage has led many publications, including IEEE Spectrum, and globally recognized technology analysts such as J. Walter Thompson Intelligence to cite CES 2017 as the start of the year of voice recognition.

How thrilling! Yet with fame sometimes comes notoriety, as we saw recently in San Diego, Calif. At the very moment CES was touting the year of voice recognition, a morning TV news broadcast in San Diego about a little girl in Texas who had accidentally ordered a dollhouse and four pounds of cookies through Alexa itself triggered Alexas in a number of listeners’ homes to order dollhouses. This is not as surprising as it may appear; Alexa is always listening in the background for questions and commands. When called by name, with a question or command immediately following, it swings into action to answer the question or act on the command.
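For readers curious about the mechanics, here is a rough sketch, in Python with invented names, of the kind of wake-word gate at work. It is not Amazon’s implementation, but it shows why the gate does not care who is speaking:

```python
# Illustrative sketch of wake-word-gated command handling -- not Amazon's code.
WAKE_WORD = "alexa"

def handle_utterance(transcribed_utterance: str) -> None:
    """Act on an utterance only if it begins with the wake word."""
    words = transcribed_utterance.lower().split()
    if not words or words[0].strip(",.!?") != WAKE_WORD:
        return  # stay in passive listening mode

    command = " ".join(words[1:])
    # Note there is no check of WHO is speaking: a TV anchor saying
    # "Alexa, order me a dollhouse" passes this gate just as easily
    # as the device's owner does.
    if command.startswith("order"):
        print("Placing order:", command[len("order"):].strip())
    else:
        print("Answering:", command)

handle_utterance("Alexa, order me a dollhouse")   # places an order
handle_utterance("She ordered a dollhouse")       # ignored: no wake word
```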

Imagine if the news story had been about a child accidentally buying a gun rather than a dollhouse.

The two powerful technologies that have come together at the beginning of 2017 and are responsible for the explosion of applications using voice recognition are AI (artificial intelligence) and speech recognition engines. Together, they make virtual assistants like Siri and Alexa possible.

Together they deliver the extraordinary, but taken for granted, capability that approximates what humans do naturally: Have a conversation. When we talk to each other, we mutually understand what we are talking about. My brain interprets the words my partner’s vocal cords and mouth are making audible to understand what she is telling me, and vice versa. Simply put, a conversation is the result of two intelligent beings using rules-based sounds to communicate concepts that allow us to understand a common idea.

Humans also have the natural ability to distinguish between what is being said and who is saying it. For example, if I listen to a speech by someone I do not personally know, I can understand the idea the speaker is communicating because I understand the language and the meaning of the words she is using. At a general level of engagement, I can understand any human who speaks my language and properly uses the words of that language to convey an idea in a context I can understand. This is the essence of what speaker “independent” speech recognition engines do.

However, if I know someone’s voice profile, it is because I have had a relationship at some level with that person, and in that relationship I have gained an understanding of the way he thinks and expresses ideas. I can anticipate how he might respond to what I am saying, based on the words I use or even how I say them (emotional context). I have moved from a general understanding to a nuanced understanding of what is being communicated. This is the essence of what speaker “dependent” speech recognition engines do.
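A rough structural sketch may help make the distinction concrete. The class and method names below are invented for illustration; they are not any vendor’s API:

```python
# Simplified contrast between the two kinds of engine -- a structural sketch only.

class SpeakerIndependentEngine:
    """Transcribes speech from anyone; it has no knowledge of the individual voice."""
    def transcribe(self, audio: bytes) -> str:
        # One general model, trained in advance on many different voices.
        return self._general_model(audio)

    def _general_model(self, audio: bytes) -> str:
        return "recognized words"  # stand-in for a real acoustic/language model


class SpeakerDependentEngine:
    """Works only after enrollment: it learns one user's voice templates first."""
    def __init__(self) -> None:
        self.templates: dict[str, list[bytes]] = {}  # user_id -> voice samples

    def enroll(self, user_id: str, samples: list[bytes]) -> None:
        # The user reads the task vocabulary aloud so the engine can build
        # templates for how this particular person pronounces each word.
        self.templates[user_id] = samples

    def transcribe(self, user_id: str, audio: bytes) -> str:
        if user_id not in self.templates:
            raise ValueError("this user has not trained the engine yet")
        # Matching against one person's templates also means the engine
        # implicitly knows who is supposed to be speaking.
        return self._match(self.templates[user_id], audio)

    def _match(self, templates: list[bytes], audio: bytes) -> str:
        return "recognized words"  # stand-in for template matching
```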

B2B (business-to-business) voice applications, with or without AI, have been predominantly based on speaker dependent recognition engines. These applications have been in actual commercial operation since the early 1990s in verticals such as food distribution, aftermarket auto parts, and retail distribution. Companies offering B2B applications based on voice technology, such as Voxware Inc. and Honeywell (Vocollect), have focused on the development and use of speaker dependent recognition engines.

Speaker dependence makes sense for a number of reasons. Commercial applications are found in very noisy places like factories and warehouses. Having detailed knowledge of a person’s unique voice profile helps the recognition engine “find” the user amid the noise. Commercial applications are also very focused on specific work processes, such as picking products from warehouse shelves to fulfill customer orders. This is a highly repetitive process, so only a limited number of words and phrases are needed, and a user can typically train the engine on them in less than an hour. Globally, the number of people using speaker dependent voice applications is in the millions, not billions.
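To give a sense of how small that working vocabulary can be, here is a hypothetical fragment of a picking grammar in Python; the specific words are invented, not drawn from Voxware’s or Honeywell’s products:

```python
# Hypothetical picking vocabulary -- invented for illustration.
# Limiting recognition to a few dozen words, spoken by one enrolled user,
# helps keep accuracy high even in a noisy facility.
DIGITS = [str(d) for d in range(10)]
COMMANDS = ["ready", "repeat", "skip", "short", "confirm"]
VOCABULARY = set(DIGITS + COMMANDS)

def is_valid_utterance(words: list[str]) -> bool:
    """Accept only utterances built entirely from the small task vocabulary."""
    return bool(words) and all(w in VOCABULARY for w in words)

print(is_valid_utterance(["4", "7"]))                    # True: check-digit readback
print(is_valid_utterance(["repeat"]))                    # True: command
print(is_valid_utterance(["order", "a", "dollhouse"]))   # False: out of grammar
```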

Apple, Google, Microsoft, IBM, and Amazon have invested heavily during the past two decades in both AI and speech recognition. They are the companies most written about in relation to advances in the use of voice technology, and this makes sense because the largest market for voice is in consumer applications, with the potential for deployment to billions of people. Their speech engine development has focused on the speaker “independent” recognition engines used for Siri, Alexa, and the others in the consumer virtual assistant space.

Putting the Star Trek universal translator aside, isn’t this the basic challenge for Siri, Alexa, and the others? They have to recognize the language, the structure of that language, and the meaning of its words as vocalized by any human with whom they engage in a conversation. That is a very tall order when you consider that Siri is expected to work the moment you turn on your new iPhone, and Alexa the moment you turn on your new Amazon Echo device.

The point here is that there is no allowance for time to train Siri to uniquely recognize your voice among the billions of iPhone users. It must generalize. And to function, it is not enough for Siri to recognize words; it must understand their meaning. This is what AI makes increasingly possible in speaker independent virtual assistants. AI is how virtual assistants will overcome the lack of training time with an individual speaker and jump-start their ability to anticipate what that individual is thinking. AI, fed massive amounts of data from conversations with virtual assistants like Siri and Alexa (have you read the Terms of Use for these virtual assistants recently?), will become ever more capable of anticipating anyone’s conversational direction.

This is great for the business of recommending goods and services to buy, and Apple, Amazon, and the other developers of virtual assistants are in the business of selling you more goods and services.

While voice recognition has come a very long way and is generally valuable in everyday life, the dollhouse incident reminds us that AI has not conquered all of the challenges associated with speaker independent recognition engines. All of those households in San Diego tuned to that morning’s TV broadcast, with their Alexas listening in the background, uncovered the flaw that a speaker dependent recognition engine would have had a better chance of catching: knowing who was actually speaking.
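In principle, a speaker-verifying engine could close that gap by checking the voice against an enrolled profile before acting on a purchase. The sketch below is purely illustrative; the similarity function, the pretend voiceprints, and the threshold are assumptions, not a description of any shipping assistant:

```python
# Illustrative sketch: gate purchase commands on speaker verification.
# `voice_similarity` and the 0.8 threshold are hypothetical stand-ins for a
# real speaker-verification model.
ENROLLED_PROFILE = {"owner": [0.12, 0.87, 0.45]}  # pretend voiceprint embedding

def voice_similarity(embedding_a, embedding_b) -> float:
    # Placeholder similarity score in [0, 1]; a real system would compare
    # speaker embeddings produced by an acoustic model.
    return 1.0 - sum(abs(a - b) for a, b in zip(embedding_a, embedding_b)) / len(embedding_a)

def handle_purchase(command: str, speaker_embedding) -> str:
    if voice_similarity(ENROLLED_PROFILE["owner"], speaker_embedding) < 0.8:
        return "Voice not recognized -- purchase not placed."
    return "Order placed: " + command

# A TV anchor's voice would (ideally) fall below the threshold:
print(handle_purchase("a dollhouse", [0.90, 0.10, 0.05]))  # rejected
print(handle_purchase("a dollhouse", [0.13, 0.86, 0.44]))  # the enrolled owner
```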

February 21, 2017
