We’ve been having conversations for thousands of years. Whether to convey information, conduct transactions, or simply to check in on one another, people have yammered away, chattering and gesticulating, through spoken conversation for countless generations. Only in the last few millennia have we begun to commit our conversations to writing, and only in the last few decades have we begun to outsource them to the computer, a machine that shows much more affinity for written correspondence than for the slangy vagaries of spoken language.
Computers have trouble because between spoken and written language, speech is more primordial. To have successful conversations with us, machines must grapple with the messiness of human speech: the disfluencies and pauses, the gestures and body language, and the variations in word choice and spoken dialect that can stymie even the most carefully crafted human-computer interaction. In the human-to-human scenario, spoken language also has the privilege of face-to-face contact, where we can readily interpret nonverbal social cues.
In contrast, written language immediately concretizes as we commit it to record and retains usages long after they become obsolete in spoken communication (the salutation “To whom it may concern,” for example), generating its own fossil record of outdated terms and phrases. Because it tends to be more consistent, polished, and formal, written text is fundamentally much easier for machines to parse and understand.
Spoken language has no such luxury. Besides the nonverbal cues that decorate conversations with emphasis and emotional context, there are also verbal cues and vocal behaviors that modulate conversation in nuanced ways: how something is said, not what. Whether rapid-fire, low-pitched, or high-decibel, whether sarcastic, stilted, or sighing, our spoken language conveys much more than the written word could ever muster. So when it comes to voice interfaces—the machines we conduct spoken conversations with—we face exciting challenges as designers and content strategists.
Voice Interactions
We interact with voice interfaces for a variety of reasons, but according to Michael McTear, Zoraida Callejas, and David Griol in The Conversational Interface, those motivations by and large mirror the reasons we initiate conversations with other people, too (). Generally, we start up a conversation because:
- we need something done (such as a transaction),
- we want to know something (information of some sort), or
- we are social beings and want someone to talk to (conversation for conversation’s sake).
These three categories—which I call transactional, informational, and prosocial—also characterize essentially every voice interaction: a single conversation from beginning to end that realizes some outcome for the user, starting with the voice interface’s first greeting and ending with the user exiting the interface. Note here that a conversation in our human sense—a chat between people that leads to some result and lasts an arbitrary length of time—could encompass multiple transactional, informational, and prosocial voice interactions in succession. In other words, a voice interaction is a conversation, but a conversation is not necessarily a single voice interaction.
Purely prosocial conversations are more gimmicky than captivating in most voice interfaces, because machines don’t yet have the capacity to really want to know how we’re doing and to do the sort of glad-handing humans crave. There’s also ongoing debate as to whether users actually prefer the sort of organic human conversation that begins with a prosocial voice interaction and shifts seamlessly into other types. In fact, in Voice User Interface Design, Michael Cohen, James Giangola, and Jennifer Balogh recommend sticking to users’ expectations by mimicking how they interact with other voice interfaces rather than trying too hard to be human—potentially alienating them in the process ().
That leaves two genres of conversations we can have with one another that a voice interface can easily have with us, too: a transactional voice interaction realizing some outcome (“buy iced tea”) and an informational voice interaction teaching us something new (“discuss a musical”).
Transactional voice interactions
Unless you’re tapping buttons on a food delivery app, you’re generally having a conversation—and therefore a voice interaction—when you order a Hawaiian pizza with extra pineapple. Even when we walk up to the counter and place an order, the conversation quickly pivots from an initial smattering of neighborly small talk to the real mission at hand: ordering a pizza (generously topped with pineapple, as it should be).
Alison: Hey, how’s it going?
Burhan: Hi, welcome to Crust Deluxe! It’s cold out there. How can I help you?
Alison: Can I get a Hawaiian pizza with extra pineapple?
Burhan: Sure, what size?
Alison: Large.
Burhan: Anything else?
Alison: No thanks, that’s it.
Burhan: Something to drink?
Alison: I’ll have a bottle of Coke.
Burhan: You got it. That’ll be $13.55 and about fifteen minutes.
Each progressive disclosure in this transactional conversation reveals more and more of the desired outcome of the transaction: a service rendered or a product delivered. Transactional conversations have certain key traits: they’re direct, to the point, and economical. They quickly dispense with pleasantries.
Informational voice interactions
Meanwhile, some conversations are primarily about obtaining information. Though Alison might visit Crust Deluxe with the sole purpose of placing an order, she might not actually want to walk out with a pizza at all. She might be just as interested in whether they serve halal or kosher dishes, gluten-free options, or something else. Here, though we again have a prosocial mini-conversation at the beginning to establish politeness, we’re after much more.
Alison: Hey, how’s it going?
Burhan: Hi, welcome to Crust Deluxe! It’s cold out there. How can I help you?
Alison: Can I ask a few questions?
Burhan: Of course! Go right ahead.
Alison: Do you have any halal options on the menu?
Burhan: Absolutely! We can make any pie halal by request. We also have lots of vegetarian, ovo-lacto, and vegan options. Are you thinking about any other dietary restrictions?
Alison: What about gluten-free pizzas?
Burhan: We can definitely do a gluten-free crust for you, no problem, for both our deep-dish and thin-crust pizzas. Anything else I can answer for you?
Alison: That’s it for now. Good to know. Thanks!
Burhan: Anytime, come back soon!
This is a very different dialogue. Here, the goal is to get a certain set of facts. Informational conversations are investigative quests for the truth—research expeditions to gather data, news, or facts. Voice interactions that are informational might be more long-winded than transactional conversations by necessity. Responses tend to be lengthier, more informative, and carefully communicated so the customer understands the key takeaways.
Voice Interfaces
At their core, voice interfaces employ speech to support users in reaching their goals. But simply because an interface has a voice component doesn’t mean that every user interaction with it is mediated through voice. Because multimodal voice interfaces can lean on visual components like screens as crutches, we’re most concerned in this book with pure voice interfaces, which depend entirely on spoken conversation, lack any visual component whatsoever, and are therefore much more nuanced and challenging to tackle.
Though voice interfaces have long been integral to the imagined future of humanity in science fiction, only recently have those lofty visions become fully realized in genuine voice interfaces.
Interactive voice response (IVR) systems
Though written conversational interfaces have been fixtures of computing for many decades, voice interfaces first emerged in the early 1990s with text-to-speech (TTS) dictation programs that recited written text aloud, as well as speech-enabled in-car systems that gave directions to a user-provided address. With the advent of interactive voice response (IVR) systems, intended as an alternative to overburdened customer service representatives, we became acquainted with the first true voice interfaces that engaged in authentic conversation.
IVR systems allowed organizations to reduce their reliance on call centers but soon became notorious for their clunkiness. Commonplace in the corporate world, these systems were primarily designed as metaphorical switchboards to guide customers to a real phone agent (“Say Reservations to book a flight or check an itinerary”); chances are you will enter a conversation with one when you call an airline or hotel conglomerate. Despite their functional issues and users’ frustration with their inability to speak to an actual human right away, IVR systems proliferated in the early 1990s across a variety of industries (, PDF).
While IVR systems are great for highly repetitive, monotonous conversations that generally don’t veer from a single format, they have a reputation for less scintillating conversation than we’re used to in real life (or even in science fiction).
Screen readers
Parallel to the evolution of IVR systems was the invention of the screen reader, a tool that transcribes visual content into synthesized speech. For Blind or visually impaired website users, it’s the predominant method of interacting with text, multimedia, or form elements. Screen readers represent perhaps the closest equivalent we have today to an out-of-the-box implementat
Recommended Story For You :

GET YOUR VINCHECKUP REPORT

The Future Of Marketing Is Here

Images Aren’t Good Enough For Your Audience Today!

Last copies left! Hurry up!

GET THIS WORLD CLASS FOREX SYSTEM WITH AMAZING 40+ RECOVERY FACTOR

Browse FREE CALENDARS AND PLANNERS

Creates Beautiful & Amazing Graphics In MINUTES

Uninstall any Unwanted Program out of the Box

Did you know that you can try our Forex Robots for free?
