Why Haven't We Seen a Killer Voice App?

If your team makes voice tech, where's your killer voice app?

"Killer voice app" is a phrase we've heard time and again at Spokestack. It shows up in the question above, but also in related questions like, "What is the killer voice app?" and, "If your team makes voice tech, where's your killer voice app?"

OK, so I haven't personally heard that last one. People tend to be pretty polite in this space, presumably because they wonder if you have the killer voice app up your sleeve, and they wouldn't want to alienate you if you do.

The problem is that people who say "killer voice app" sometimes don't know what they're really saying, so I'm going to clear that up. We'll have to take a little detour through literary theory and cognitive science, but stick with me, and it'll all come together.

I'll give you the punchline first, as a bit of motivation.

There's no such thing as a "voice app", at least not in the way that people commonly use the term.

What's a "Skill" Anyway?

It's important to understand that at this point in the history of consumer technology, the word "app" is part of our shared vocabulary, and that's due to the smartphone. Before smartphones, we had "applications" for computers, but the smartphone was a smaller computer, so they made the word "application" smaller to match. We sometimes forget, though, that voice tech has been on smartphones nearly as long as Apple's App Store itself. The App Store launched in 2008, and Siri launched as an app in 2010. But apps and voice as an interface didn't develop in tandem, partly because the developer community at large couldn't interact with consumers via voice until Amazon released the Echo in 2014.

Amazon called voice experiences mediated by Alexa "skills", but they were too late — we already knew what an app was, and these were apps. Google wasn't far behind with the Google Home and its "actions", but those, of course, were apps too.

With the advent of smart speakers, people started wondering about "killer voice apps", but what they really meant — and mean to this day, because imagination is hard — is "dominant experience mediated by a smart speaker" or, even more simply, "Alexa skill".

I promised you some technical jargon earlier, and here comes the first part. This conceptual switcheroo of "app" for "skill" is a prime example of metonymy, which is naming an entity or concept using a word or phrase for something that's closely related. We use this literary device all the time without thinking, even in stuffy formal contexts like newswire dispatches. The US media love to report on the latest intrigue in "Washington" when they're really talking about members of a government that happens to be located in a city named Washington.

When we use metonymy, we're just giving something a different name. If you nitpick with a reporter about their terminology, they'll tell you that of course they're talking about people, not a place. The word substitution may be semi-unconscious, but the speaker knows the difference between the two concepts. Similarly, people talking about "voice apps" can tell you they're actually talking about something you access via a smart speaker, not a piece of software you download to and run from a device you're holding.

From Books to Brains

Now let's take it up a notch, so to speak. "Metonymy" is often used in effete literary analysis alongside other fun words like "synecdoche" and "metalepsis" (see the Wikipedia article on metonymy), but it also figures prominently in cognitive science literature alongside words like "metaphor", which is more interesting to our discussion here.

One book that traffics heavily in both metonymy and metaphor is Women, Fire, and Dangerous Things by George Lakoff. I highly recommend it for anyone interested in:

  • Learning what the three words in the title might have in common

  • Discovering the rich nuances of words they thought they understood

  • Mocking a couple thousand years of traditional categorization theory

"Metaphor" is a more common word for many of us. Put simply, it involves mapping terms from one domain onto another, so that our experience with domain A can help us better understand domain B. To take an example from Lakoff, a common metaphor is "more is up; less is down", which maps the domain of verticality or direction onto that of quantity. He illustrates with phrases like "the crime rate keeps rising" and "that stock has fallen again" (p. 276, emphasis in original).

In metaphor, the two domains involved aren't necessarily related — and often aren't — but we relate them in our minds because one exists at a more basic level of our human experience than the other or helps us understand the other one. Crime rates and stocks are abstract, but "up" and "down" are wired into our experience of the world.

The sneaky thing about metaphor, though, is that once a really good one has "clicked" in our brains, we run the risk of unconsciously equating the two domains and forgetting how different they actually are. Details blur, and crucial nuance begins to evaporate. This is what's happened with "apps" and "skills".

Now, Back to Apps

Apps are collections of features built on top of a general-purpose computing platform, typically a mobile operating system. In other words, if a computer can produce a particular result, an app can present it to a user. Operating systems themselves do place some security- and privacy-related restrictions on what apps can do on or to a given device, but that's beside the point for our discussion here.

Smart speaker skills are limited to audio input and output, and in some cases a structured visual display. The display language used for Echo Shows has gotten more complex recently, but still doesn't approach the freedom of a blank canvas. General computation can happen on the backend of a smart speaker skill, but the interface with the user is highly limited. If you're writing a smart speaker skill, you often want to assume that you don't have a visual component at all so you can reach the most users possible. Not everyone has an Echo Show.

We can talk about how Apple and Google wield tyrannical market power in their app stores, but at the end of the day an "app" is still a self-contained experience between a developer and a user. We don't open the app store to use an app, only to acquire it. An app doesn't automatically inherit the look and feel of the app store. We don't get confused about whether we're interacting with an app or the host operating system. Yet the opposite is true on every count for smart speaker skills: we reach them through the assistant, they inherit its look and feel, and it's easy to lose track of which one we're talking to.

So the "voice app" earworm is two phenomena put together, one benign and one that's holding back the industry. It starts with the simple renaming of "smart speaker skill" to "voice app" (metonymy). Every time we use the name "app", though, we further entrench the much subtler and more broadly scoped metaphor that maps general-purpose computers onto smart speakers. I would bring in some set theory terms to say why this mapping doesn't work, but we've used enough jargon already. Suffice it to say that while there's certainly sophisticated technology at play, the smart speaker ecosystem doesn't allow the developer nearly the same latitude as a computer's operating system.

What Should We Say Instead?

"Voice app" is an enticing shorthand. You might even spot it on our site from time to time. We usually use it to mean "a mobile app with a voice interface", but the other usage might have slipped in a bit too. Really, though, "skill" is a better term, so let's just go with that.

I'm also open to suggestions for new names, but whatever we decide on, let's stop fooling ourselves by saying that skills are apps and that "voice" needs a breakout hit of some kind. They're not, and it doesn't. Voice is another interface that's coming into its own, just like the keyboard, mouse, touch, and multi-finger gestures. Just like those, it will be clunky at first. Users will work around the clunkiest bits, developers will find ways to fuse the new interface with the old ones in unexpected and delightful ways, and those of us working on the fundamental tech will keep refining until one day a voice interface feels as natural as scrolling down a web page does to most of us right now.

There are no "voice apps". There are apps waiting to be enhanced by a voice interface and apps waiting to be built from the ground up with voice as a first-class interface citizen. Don't keep them waiting any longer!