Basics of Voice

Guest post by Susan Westwater.

If you haven't yet started to consider voice as part of your content, user experience, or marketing strategy, it's time you started. Here's a primer on the basics.

What is Voice?

Voice is the new shiny object, forecast to have the fastest adoption and growth rates of any technology to date. We first knew it as Siri and Cortana, and now as Alexa and Google Home. Siri is still at the party, but she currently seems focused on being the DJ (and she does a great job of it, although sometimes she wants to play Blink-182 a little too often).

Voice started as a novelty feature and developed utility as users adopted it for search on their smartphones. It has continued to evolve thanks to the growing momentum of smart speakers and digital assistants. While voice search is still important, there are now many other use cases.

To illustrate this a bit further: at the start of 2016, there were just over 130 Alexa skills (aka apps); by year's end there were more than 5,000, and by the end of 2017, more than 25,000. While skills are plentiful, most digital assistant usage focuses on playing music, getting information, or the more utilitarian features such as setting timers, alarms, or reminders.

Because of this rapid adoption of digital assistants, there are some interesting predictions around voice and voice usage. Gartner has gone on record stating that by 2021, early adopter brands that redesign their websites to support visual and voice search will increase digital commerce revenue by 30%. A report by Canalys predicts that smart speakers will reach a global installed base of 100 million in 2018, with continued growth to 300 million by 2022. Those growth trends exceed what we saw with the adoption of mobile.

Does voice need its own strategy?

With a growth trend that surpasses mobile, it's natural to ask whether voice demands its own strategy. After all, the need to adapt to mobile drove the development of mobile-first strategy, and there is definitely a need to adapt content for the voice channel. However, the content prioritization that mobile and user-first strategies brought forth still very much applies. If there is one thing we have learned from the digital world's evolution from website to smartphone, it's that a user's information needs don't change because of the interface or device. The nature of their interaction, however, does change, and that change is amplified with voice.

While voice does not require a new strategy (user-first still very much applies), it does bring its own set of considerations and requirements for a successful experience. Website content is a good starting point for creating voice content, but keep in mind the loss of visual context. That context can be supporting information around key tasks or simply each paragraph's relationship to the others. If you are developing a podcast from web content, the end result shouldn't be the equivalent of an audiobook.

With digital assistants, questions and answers need to include a bit more information so that it's clear what the next question or input relates to as the conversation progresses. Confirming responses helps with clarity, but it needs to happen in a natural flow and only when appropriate (how much confirmation is needed when, say, asking an Echo to turn on the lights?). Excessive confirmation makes the exchange tedious and repetitive, which can disrupt the conversation flow and jar the user back into the reality that they are interacting with a device, not a person. In fact, Google just released an update called "Continued Conversation" to reduce the repetition that came from having to say "Hey Google" before every request.

When mapping responses, make sure you are choosing the appropriate phrase for each answer. "Sounds good!" in reply to a user saying "no" is very awkward, so it's important to get into the specifics (I actually saw this example happen in a demo).
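To make the idea concrete, here is a minimal sketch (all names hypothetical, not from any real voice SDK) of keying acknowledgment phrases to the user's answer so a "no" never gets a cheerful "Sounds good!":

```python
# Hypothetical response map: pick the acknowledgment that fits the answer.
ACKNOWLEDGMENTS = {
    "yes": "Sounds good!",
    "no": "No problem, we can skip that.",
}

def acknowledge(answer: str) -> str:
    """Return a phrase that matches the user's yes/no answer."""
    # Normalize the spoken answer, then fall back to a neutral phrase.
    return ACKNOWLEDGMENTS.get(answer.strip().lower(), "Okay.")

print(acknowledge("No"))  # a "no" gets its own phrasing, not "Sounds good!"
```

The point is simply that each branch of the conversation gets its own wording, rather than one generic confirmation reused everywhere.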

Finally, make sure you have a plan for when something unexpected happens (like an odd or out-of-bounds request) so you can repair the conversation and let the user move forward instead of ending up in an infinite loop of "I can't help with that" or "I didn't understand you." Existing methodologies can help here: an approach similar to chatbot response mapping or call center scripting can map out a voice experience and provide a framework of the expected, making it easier to identify how and when the unexpected can happen.
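One common pattern for avoiding that infinite loop is to count failed attempts and change the response each time, eventually offering a way out. A rough sketch, with hypothetical wording and a made-up retry limit:

```python
# Hypothetical conversation-repair sketch: escalate the fallback response
# instead of repeating "I didn't understand" forever.
MAX_RETRIES = 2

def handle_unrecognized(retry_count: int) -> str:
    """Return a repair prompt that varies with how many attempts have failed."""
    if retry_count == 0:
        # First miss: ask the user to rephrase.
        return "Sorry, I didn't catch that. Could you rephrase?"
    if retry_count < MAX_RETRIES:
        # Second miss: suggest things the experience can actually do.
        return "You can ask me to play music, set a timer, or get the news."
    # Too many misses: exit gracefully rather than looping.
    return "I'm still having trouble. Let's start over; you can say 'help' anytime."
```

The exact thresholds and phrasing would come from your own response mapping, but the structure (rephrase, then suggest, then exit) keeps the user out of a dead end.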

Style and tone guidelines still apply to voice experiences, although now voice and tone are no longer metaphors. With voice, it's important to have a defined idea of how your brand actually sounds; that sonic branding is critical to building a relationship with your target audience. Instead of using visual cues to indicate what your brand stands for, you now need word choice, inflection, and tonal expression to build your brand's persona in the user's mind. It's helpful to create a brand persona, complete with a backstory, to inform what words your brand would use and how it would say them. The persona should help everyone understand whether the brand is professional or casual, humorous or serious, energetic or controlled, and so on. It's also very important to be consistent with that persona so there is a unified brand personality across all modes and channels. Otherwise, you risk audience confusion and harm to your brand.

Some key terms & basics to help navigate the conversations about voice

Having been to three conferences this summer where voice was a topic (at two of them, voice was actually THE topic), I noticed a number of terms and acronyms already invading how development and UX professionals talk about voice and the devices associated with it. Here's a short list of the more frequently used terms to help you follow along.

Main Players & Their Devices

  • Apple: Siri & HomePod
  • Google: Assistant (available on Home devices and Pixel Android phones)
  • Microsoft: Cortana
  • Samsung: Bixby assistant on Samsung phones
  • Amazon: Alexa & Echo family – Dot, Look, Plus, Show, and Spot
  • IBM: Watson Experiences

Types of Interactions

  • Voice Only: Applications that have voice as the only input and output. These are not very common as even the Amazon Echo and Google Home have visual cues as part of their experience.
  • Voice First: Instances where voice is the primary input and output but not the only input. Examples would be the Amazon Echo Show and Echo Spot.
  • Voice-Added: In these experiences, voice is not the primary method for input or output but instead is used as an option for assisting with input. Voice to text on mobile is a common example of this.

Other Terminology:

  • Digital Virtual Assistant (also known as Digital Assistants): Internet-connected in-home devices that respond to voice commands and have the ability to capture and broadcast audio content. These can also be called Voice Assistants or Virtual Voice Assistants (VVA). Examples: Amazon Echo, Google Home, Apple HomePod.
  • Alexa Skills: Skills are voice-driven capabilities that enhance and personalize Alexa devices, enabling users to do anything from playing a game to getting information or even hearing a story. There are tens of thousands of skills available, but it's also possible to create your own skill for your own needs.
  • Alexa Flash Briefing: A Flash Briefing is a news update that Alexa can read or play. Users can customize their Flash Briefings by choosing from a list of news sources and setting the order in which they are read.
  • Google Actions: Actions let users extend the functionality of the Google Assistant, which powers Google Home. They range from a quick command, such as turning on a light or playing music, to a longer conversation, such as playing a game.
  • Natural Language SEO: A search engine optimization approach where content is presented in a question and answer format that aligns with how users want to search when speaking. For example, it's more likely a person would ask "Where is the nearest store?" as opposed to "Nearest store to me."
  • Long-tail keywords: These are highly targeted search phrases that specifically serve searcher intent. They typically have lower search volume because of their detail and specificity, but they also have a higher conversion rate. These types of searches will increase in volume as search moves from text to voice and the average length and specificity of queries grows.
  • Natural Language Processing (NLP): A branch of artificial intelligence that helps computers understand, interpret, and manipulate human language. It refers to the process of analyzing text and extracting data (basically translating voice into text into data that the software can understand).
  • Natural Language Understanding (NLU): The aspect of NLP that deals with machine reading comprehension. A combination of components including a comprehensive lexicon, grammar rules, and semantic theory work together to break down sentences and guide comprehension.
  • Natural Language Generation (NLG): The natural language processing task of generating natural language from a machine representation system. Basically the reverse of NLU, this is the process that translates data back into a written narrative that becomes the spoken response back to a user.
  • Contextual Speech: Refers to the context, or surrounding words, phrases, and paragraph of writing that defines the meaning of a particular word. In contextual speech, this definition expands to also include the pronunciation in speech. For example, "Forty times" when spoken can be interpreted as: 40 times, 4 tee times, or even, 4 tea times but the meaning can be derived from the words around it in requests such as “tell me 40 times” or “book me 4 tee times at Cog Hill.”
  • Utterance: What a user says.
  • Intent: What a user actually means.
  • Conversation Repair: In conversational analysis, repair refers to the process by which a speaker recognizes a speech error and repeats what has been said with some sort of correction. In the context of voice, repair refers to the process by which the software recognizes an error in understanding or request and then seeks to correct it.
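Several of these terms fit together: many utterances can map to one intent, and anything unmatched is where conversation repair kicks in. A toy illustration (a simple lookup table, not a real NLU engine; all names are made up):

```python
# Toy utterance-to-intent table: different phrasings, same intent.
INTENTS = {
    "turn on the lights": "LightsOn",
    "lights on": "LightsOn",
    "switch the lights on": "LightsOn",
    "what's the weather": "GetWeather",
}

def resolve_intent(utterance: str) -> str:
    """Map what the user said (utterance) to what they meant (intent)."""
    # Unrecognized utterances route to a fallback, which triggers repair.
    return INTENTS.get(utterance.strip().lower(), "Fallback")

print(resolve_intent("Lights on"))      # -> LightsOn
print(resolve_intent("order a pizza"))  # -> Fallback (conversation repair)
```

Real assistants use NLP/NLU to do this matching statistically rather than with an exact-match table, but the utterance/intent/fallback structure is the same.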