The anatomy of a voice command 🦴

When designing VUIs, designers constantly need to think about the objective of the voice interactions. What is the user trying to accomplish?

Let’s take a closer look 👀

The anatomy of a voice command consists of 4 key factors: intents, entities, context and fulfillment.

1️⃣ Intents

When building a conversational experience, the number one task is to understand what the users are saying. That’s what intents are for.

An intent has to be created in the development tool (for example, Dialogflow) for anything a user might request.

For example, for an event’s virtual assistant, I would create intents such as “check the schedule”, “search talks” and “find the way”.

For each intent, examples of the various ways the user might communicate that intention should be provided. These are called training phrases, and only a few examples are needed to start.

The tool will use this information to train a machine-learning model to understand not just the examples, but also a lot of new phrases that mean the same thing.

So now, whenever the user says something, the model will match it with the intent that best fits.

A typical Dialogflow agent usually has from one to thousands of intents, each trained to recognise specific user needs.

As people use the agent, the phrases they use can be incorporated into the intents as additional training examples. So the more usage it gets, the smarter it becomes.
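For illustration, here is a minimal sketch of how an intent with its training phrases could be created programmatically, assuming the google-cloud-dialogflow Python client; the intent name, training phrases and project ID are invented for the event example, and in practice the same thing can be done by hand in the Dialogflow console.

```python
from google.cloud import dialogflow


def create_search_talks_intent(project_id: str):
    """Create a hypothetical "search-talks" intent with a few training phrases."""
    intents_client = dialogflow.IntentsClient()
    parent = dialogflow.AgentsClient.agent_path(project_id)

    # A few examples of how users might express this intention.
    phrases = [
        "What talks are there tomorrow?",
        "Search for design talks",
        "Are there any talks about chatbots?",
    ]
    training_phrases = [
        dialogflow.Intent.TrainingPhrase(
            parts=[dialogflow.Intent.TrainingPhrase.Part(text=phrase)]
        )
        for phrase in phrases
    ]

    # A simple static response the agent can give when this intent matches.
    message = dialogflow.Intent.Message(
        text=dialogflow.Intent.Message.Text(
            text=["Sure, which day are you interested in?"]
        )
    )

    intent = dialogflow.Intent(
        display_name="search-talks",
        training_phrases=training_phrases,
        messages=[message],
    )
    return intents_client.create_intent(request={"parent": parent, "intent": intent})
```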

2️⃣ Entities

Intents determine what a user wants to do. But often, when users express an intent, they want the agent to act on specific pieces of information contained in the statement. That is exactly what entities are for: their function is to extract useful information from what users say to the assistant.

In addition to matching the statement to an intent, it is often helpful to pick up important facts, from dates and times to names and places. Entities automatically extract this type of information from what the user says.

Tools like Dialogflow come with built-in entities that pick out critical information for common concepts such as dates, names and amounts with units. But of course, you can also define your own entities by providing a list of words or phrases that fit a given concept.

Now, when creating an intent and adding example phrases, the tool will recognise which entities might be present.

When a user says something that matches this intent, the values for any matching entities will be automatically extracted, and those values can be used in the backend code to give the user what they are asking for (see the sketch after the example below).

This example matched three entities. One of them is a system entity, and the other two are developer entities.

  1. @design-talks 👉🏼 matching the words Conversational design
  2. @sys.date 👉🏼 matching the word tomorrow
  3. @stage 👉🏼 matching the words main stage
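As a sketch of what this looks like from the backend side (again assuming the google-cloud-dialogflow Python client, with made-up project, session and intent names), the matched intent and the extracted entity values come back together in the query result:

```python
from google.cloud import dialogflow

session_client = dialogflow.SessionsClient()
# "events-assistant" and "session-123" are invented identifiers for this sketch.
session = session_client.session_path("events-assistant", "session-123")

query_input = dialogflow.QueryInput(
    text=dialogflow.TextInput(
        text="At what time are the conversational design talks tomorrow at the main stage?",
        language_code="en",
    )
)

response = session_client.detect_intent(
    request={"session": session, "query_input": query_input}
)

result = response.query_result
print(result.intent.display_name)  # the intent that best fits, e.g. "knowing-the-schedule"
print(result.parameters)           # the extracted entity values: design-talks, date and stage
print(result.fulfillment_text)     # the response the agent will give back
```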

Dialogflow has three types of entities:

  1. System entity
  2. Developer entity
  3. User entity

System entities are built into Dialogflow. They cover common use cases such as dates and times, amounts with units, and names.

Developer entities allow you to define entities based on a list of words, either through the Dialogflow console, the API, or by uploading a CSV. Any word in the list will be matched by the entity, and synonyms can also be provided for the words in the list.

Finally, there are user entities. These are special entities that can be defined for a specific user session. They allow matching transitory things like the details of the user’s previous order, or a list of their favourite talks from previous events.
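For illustration, a developer entity like @design-talks could be defined through the API roughly like this. This is a sketch, again assuming the Python client; the entity values and synonyms are invented for the event example.

```python
from google.cloud import dialogflow


def create_design_talks_entity(project_id: str):
    """Define a hypothetical @design-talks developer entity with synonyms."""
    entity_types_client = dialogflow.EntityTypesClient()
    parent = dialogflow.AgentsClient.agent_path(project_id)

    entity_type = dialogflow.EntityType(
        display_name="design-talks",
        kind=dialogflow.EntityType.Kind.KIND_MAP,  # map entries with synonyms
        entities=[
            dialogflow.EntityType.Entity(
                value="Conversational design",
                synonyms=["conversational design", "conversation design", "VUI design"],
            ),
            dialogflow.EntityType.Entity(
                value="Visual design",
                synonyms=["visual design", "UI design"],
            ),
        ],
    )
    return entity_types_client.create_entity_type(
        request={"parent": parent, "entity_type": entity_type}
    )
```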

A conversation is a process in which two speakers negotiate meaning and understanding through a back-and-forth exchange called a dialog.

Continuing with the same example…

Imagine our assistant is asked at what time the design talks are tomorrow at the main stage. Before the assistant can answer, it needs to know a few things: the type of talk, the date and the place.

It might happen that the user provides all this information in one statement, for example: “At what time are the conversational design talks tomorrow at the main stage?”

In this case, the assistant knows the type of talk, the date and the place, so it can satisfy the request.

But what if the user gives just part of the information?

We have the date and place, but we still need the type of talk.

When adding entities to an intent, it’s possible to mark them as required. This means that if the user doesn’t provide something in the first statement, the model can ask for it. To do so, a final step is required: defining the questions that the model will ask the user to request that information. This is known as a prompt. Prompts can be customised so they sound more natural.

In my example, I will define that the “type of talk” and the “date” are mandatory, so if the user does not provide that information from the beginning, the exchange should look like this:

Prompt: What kind of talk are you interested in?

User: Conversational design talks.

Bot: Great. It’s tomorrow at 4:30 PM at the Main Stage. Do you want to book a seat?

User: Yes!

More than one prompt can be defined, so that if the user does not understand and provides erroneous information, a different question is displayed with more instructions, or even with the possible options. For example:

Prompt: Would you like to attend a design talk, or a developer talk?
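In Dialogflow this corresponds to marking the intent’s parameters as mandatory and attaching the prompts to them. A rough sketch, reusing the hypothetical intent and entity names from before:

```python
from google.cloud import dialogflow

# Mark the "design-talks" parameter as required and give the agent
# the questions it can ask when the user leaves it out.
talk_type = dialogflow.Intent.Parameter(
    display_name="design-talks",
    entity_type_display_name="@design-talks",
    value="$design-talks",
    mandatory=True,
    prompts=[
        "What kind of talk are you interested in?",
        "Would you like to attend a design talk, or a developer talk?",
    ],
)

date = dialogflow.Intent.Parameter(
    display_name="date",
    entity_type_display_name="@sys.date",
    value="$date",
    mandatory=True,
    prompts=["For which day?"],
)

intent = dialogflow.Intent(
    display_name="knowing-the-schedule",
    parameters=[talk_type, date],
    # training phrases and messages omitted for brevity
)
```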

3️⃣ Context

Another fundamental concept for making the conversation natural and fluid is working with contexts. If we talk to a person because we want to book our place in a talk, and we first ask them about the content of the talk and then tell them that we want to attend, they won’t ask us again for the name of the talk we want to go to.

The idea is that the chatbot works the same way. That is why contexts are needed: to transfer information between different intents and make the conversation as natural as possible.
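In practice, a Dialogflow webhook can carry this shared information forward by setting an output context on its response; any intent that lists that context as an input context can then reuse the stored parameters. A minimal sketch, with the same made-up project, session and context names as before:

```python
# A minimal sketch of a webhook response (Dialogflow v2 format) that stores
# the talk details in an output context so follow-up intents can reuse them.
webhook_response = {
    "fulfillmentText": "The Conversational design talk is tomorrow at 4:30 PM at the Main Stage.",
    "outputContexts": [
        {
            # Context names embed the project and session of the current conversation;
            # "events-assistant", "session-123" and "talk-followup" are invented here.
            "name": "projects/events-assistant/agent/sessions/session-123/contexts/talk-followup",
            "lifespanCount": 5,  # keep the context alive for the next five turns
            "parameters": {
                "design-talks": "Conversational design",
                "date": "2021-03-05",
                "stage": "main stage",
            },
        }
    ],
}
```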

4️⃣ Fulfillment

After going through all these steps throughout the conversation, the final step is fulfillment. Let’s suppose the main goal was to book a seat at the talk, so the system must confirm the talk, the schedule and the availability. For that, a very useful tool is Dialogflow’s integration with external systems through an endpoint, where a call is made to check the availability of the talk and make the subsequent reservation.

So, back to our example: the system has to give the user an answer. As I mentioned earlier, a more complex response can be given, integrating it with the reservation system and adding extra information. Or it can be a very simple, plain-text response confirming the reservation of the talk.

Bot: Alright. Your reservation for tomorrow’s Conversational design talk at 4:30 PM at the Main Stage is confirmed. Do you need anything else?
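As a rough sketch of what that fulfillment could look like on the backend, here is a minimal Flask webhook; book_seat() stands in for whatever hypothetical reservation system the event actually uses, and the field names follow the Dialogflow v2 webhook format.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def book_seat(talk, date, stage):
    # Placeholder for a real call to the event's reservation system.
    return True


@app.route("/webhook", methods=["POST"])
def webhook():
    body = request.get_json()
    params = body["queryResult"]["parameters"]  # entity values extracted by Dialogflow
    talk = params.get("design-talks")
    date = params.get("date")
    stage = params.get("stage")

    if book_seat(talk, date, stage):
        text = (
            f"Alright. Your reservation for tomorrow's {talk} talk "
            f"at 4:30 PM at the {stage} is confirmed. Do you need anything else?"
        )
    else:
        text = f"Sorry, there are no seats left for the {talk} talk."

    # Dialogflow reads the answer from the fulfillmentText field of the response.
    return jsonify({"fulfillmentText": text})
```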

That’s all! What remains is to test the prototype and confirm that there have been no errors, gaps or loops.

Keep in mind that the tool is generally handled by the developer. However, it is very important that we, as designers, know how it works, so we can provide developers with all the information they need to load into the tool and make sure there are no bumps or conversations that weren’t thought of.

So maybe it is a good idea to work on this together, taking into account not only the happy path but also error handling, prompts, intents and training phrases. The crucial thing here is to have everything mapped.

😲 I’ll be posting about testing VUIs soon! Don’t miss out

In the meantime… 👇🏼

👀 Keep learning!

  • Actions on Google integrations — Link
  • Dialogflow Intents: Know what your users want — Link
  • Full list of Dialogflow’s system entities — Link
  • Conversational UI links & resources to have on hand 🤖 — Link

Hope you find it useful 💪🏽

Thanks! 👏🏽
