Voice Application Development for Android

Speaker 1

00:00

You know, it's wild how second nature it's become to just talk to our devices. Hey, Google, set a timer, Siri, what's the weather? We barely think about it.

Speaker 2

00:09

Yeah, it really feels like something we just take for granted now.

Speaker 1

00:12

But pull back for a second. How does that actually happen? How does your phone hear you, understand what you want, and then, you know, do something?

Speaker 2

00:19

It does feel a bit like a magic trick, doesn't it? But behind that simple interaction is this whole layered world of technology.

Speaker 1

00:27

It's quite complex, actually, and that's exactly the world we're diving into today. We're taking a deep dive into how you build these voice-based applications, specifically thinking about Android devices.

Speaker 2

00:38

Okay, and our guide for this exploration is fascinating. It is a detailed technical guide published back in 2013.

Speaker 1

00:45

2013, so a bit of a snapshot from that era.

Speaker 2

00:48

Exactly. It gives us a really interesting look at the tools and approaches developers were using then, leveraging Google's own capabilities and also some open source software.

Speaker 1

00:56

Right. So our mission in this deep dive is to kind of cut through the complexity. We want to unpack the core concepts, the essential building blocks.

Speaker 2

01:04

The fundamentals, yeah.

Speaker 1

01:05

The fundamentals, and show you the journey developers took to create apps you could talk to, all without you needing to pore over the original technical manual yourself.

Speaker 2

01:15

Sounds good.

Speaker 1

01:16

We'll start with the absolute basics, speaking and listening, and build up from there, maybe getting into complex conversations and even early virtual assistants.

Speaker 2

01:25

Okay, let's unpack this. Starting at the very beginning seems right. Before a device can respond or have any kind of conversation, it first has to be able to speak and hear.

Speaker 1

01:36

Right. The fundamental capabilities Android provides for this are text-to-speech, TTS, and automated speech recognition, or ASR.

Speaker 2

01:45

And thinking about why these are important, well, it opens up so many possibilities. Imagine your hands are full, like you're driving and need directions.

Speaker 1

01:53

Oh yeah, it's simple then.

Speaker 2

01:55

And this is crucial for accessibility. Think about someone with a visual impairment using a screen reader, that's TTS, or ASR helping someone communicate if they have difficulty speaking, much like Stephen Hawking used speech synthesis technology.

Speaker 1

02:10

They're really foundational tools. Okay, let's start with TTS, text-to-speech. Simply put, it turns written text into spoken audio.

Speaker 2

02:19

Right, and the technology works in stages. First, it needs to understand the text itself, things like, you know, how to pronounce words that look similar but sound different based on context.

Speaker 1

02:28

Like "read" versus "read"?

Speaker 2

02:31

Exactly. Or converting numbers and abbreviations into full words. The cool part for developers, often, is that the system handles a lot of this linguistic complexity. You don't always have to get into the weeds.

Speaker 1

02:41

Okay, so it understands the text, then it has to generate the actual sound?

Speaker 2

02:46

Exactly. A common approach, especially around the time this book was written, was something called concatenative synthesis.

Speaker 1

02:51

Concatenative synthesis, okay.

Speaker 2

02:53

Think of it like building a sentence by stitching together tiny prerecorded pieces of speech. These could be sounds, syllables, words, even short phrases.

Speaker 1

03:01

Ah, like digital Lego bricks for voice.

Speaker 2

03:04

Kind of. Algorithms select the right pieces and join them smoothly, trying to mimic natural rhythm and intonation. When it's done well, it can sound remarkably natural.

Speaker 1

03:14

It's kind of amazing how they make those pieces fit together. But wait, if they're stitching together prerecorded bits, why not just record a human voice actor saying everything the app might need to say?

Speaker 2

03:24

That's a great point, and sometimes apps do use professional voice actors, you know, for specific prompts where quality and consistency are paramount.

Speaker 1

03:33

Like a standard greeting or instruction?

Speaker 2

03:36

Exactly. But TTS becomes absolutely essential when the text is dynamic, when you can't possibly prerecord everything it might need to say.

Speaker 1

03:43

Ah, right, like reading out a text message.

Speaker 2

03:45

You just got, or a news headline that just updated.

Speaker 1

03:48

Or someone's name from your contact list.

Speaker 2

03:49

Precisely. You just can't anticipate every single phrase or name. So while the quality might sometimes be a trade-off compared to, say, a perfectly recorded voice, TTS offers that vital flexibility for dynamic content.

Speaker 1

04:02

And Android has had this built in for ages, right? How do developers actually hook into it?

Speaker 2

04:07

Yeah, the capability has been there since Android 1.6, believe it or not. Developers use the framework provided. A key step is making sure the necessary language data is actually on the user's device.

Speaker 1

04:20

The voice itself?

Speaker 2

04:22

Yeah, the voice files, the rules for that language. The system lets developers check for this using something called an intent, and even prompt the user to install it if it's missing.

Speaker 1

04:31

Okay.

Speaker 2

04:32

The book suggests using a common software design pattern, a singleton, basically ensuring only one instance of the TTS engine is created. This helps manage resources efficiently. And the examples in the book show how you might use this to, say, read back text the user typed in, or maybe read text loaded from a file. You can specify the language too, like English, or even a regional variation.

Speaker 1

04:52

Okay, so that's how the device speaks. Now the other side, hearing us: automated speech recognition, ASR. This is turning our spoken words into text.

Speaker 2

05:02

Right, and like TTS, it involves steps. First, the device needs to capture the sound from the microphone and process it. Think of it as cleaning up the audio.

Speaker 1

05:10

Getting rid of background noise?

Speaker 2

05:11

Yeah, removing noise, maybe echo, and just preparing it digitally for analysis.

Speaker 1

05:17

Then comes the recognition part itself, breaking down the audio into what, sounds?

Speaker 2

05:22

Basically, yeah, into tiny segments, phones, the basic sounds of the language, and then it tries to match them.

Speaker 1

05:30

Wow.

Speaker 2

05:31

This is where powerful statistical models come into play. These models are trained on massive amounts of recorded speech, learning how different sounds are typically pronounced in different contexts by different people.

Speaker 1

05:41

Wow.

Speaker 2

05:42

Okay, they build what's called an acoustic model. It's essentially a statistical map of how sounds relate to words.

Speaker 1

05:48

But words can sound exactly alike. You mentioned "read" and "red", or "to" and "two". How does it know the difference?

Speaker 2

05:55

Ah, good question. That's where another statistical model helps, the language model.

Speaker 1

05:59

Language model.

Speaker 2

06:00

This one understands the probability of words appearing together in sequence. It knows that after "I went", the word "to" is far, far more likely than "two".

Speaker 1

06:10

Right, context.

Speaker 2

06:12

Exactly. The language model provides that crucial context to help resolve those ambiguities.
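The word-sequence idea described here can be illustrated with a toy bigram model. This is only a sketch under stated assumptions: the tiny corpus and the function names `bigram_prob` and `pick` are invented for illustration, and real recognizers use far larger models with smoothing.

```python
from collections import Counter

# Toy training text (an assumption for illustration only).
corpus = ("i went to the shop . i went to the park . "
          "i have two dogs . we went to school").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """Estimate P(word | prev) from raw counts, without smoothing."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def pick(prev, candidates):
    """Choose the sound-alike candidate most likely to follow `prev`."""
    return max(candidates, key=lambda w: bigram_prob(prev, w))

print(pick("went", ["two", "to"]))
print(pick("have", ["to", "two"]))
```

With this corpus, "to" wins after "went" and "two" wins after "have", which is the kind of disambiguation the acoustic model alone cannot do.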

Speaker 1

06:16

And the result isn't always just one single interpretation, is it? Like, it's not always certain?

Speaker 2

06:22

No, definitely not. Typically, the ASR system gives you back a list of possible results, ranked by how confident it is in each one. A list, yeah. It's often called an N-best list. Each possibility comes with a confidence score, usually from zero to one. A score near one means the system is pretty sure it got it right.

Speaker 1

06:40

And that's useful for the developer?

Speaker 2

06:43

Incredibly valuable. They can just pick the top result if the confidence is high. Or if it's lower, or if the top one doesn't make sense in context, they can look at the others in the list, or maybe even use the confidence score to decide, hmm, I'd better ask the user to confirm this.
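One way to picture that decision is a small function over an N-best list. This is a minimal sketch, not the Android API: the `Hypothesis` type, the `choose_action` name, and both thresholds are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    confidence: float  # 0.0 (unsure) to 1.0 (very sure)

def choose_action(n_best, accept_threshold=0.8, confirm_threshold=0.4):
    """Decide what to do with a ranked list of recognition hypotheses.

    Returns ("accept", text), ("confirm", text), or ("reject", None).
    The threshold values are illustrative, not recommended settings.
    """
    if not n_best:
        return ("reject", None)
    best = max(n_best, key=lambda h: h.confidence)
    if best.confidence >= accept_threshold:
        return ("accept", best.text)
    if best.confidence >= confirm_threshold:
        return ("confirm", best.text)   # ask "Did you say ...?"
    return ("reject", None)             # ask the user to repeat

results = [Hypothesis("pizza places", 0.92), Hypothesis("pizza plays", 0.31)]
print(choose_action(results))
```

An app could also walk further down the list when the top entry makes no sense in context, which this sketch leaves out.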

Speaker 1

06:58

This capability has also been around on Android for a while, since version 2.1, often when you tap the little microphone icon on the keyboard.

Speaker 2

07:05

Yes, exactly, and developers have flexibility here too. You can use a simple built-in tool, an intent that handles the "Speak now" prompt and feedback automatically. Super easy, quick and dirty, pretty much. Or if you want more control over the look and feel, the user interface, you could use a more advanced component, a SpeechRecognizer instance. This lets you manage the UI yourself and react to specific recognition events, like when the user starts or stops speaking.

Speaker 1

07:32

More control, more work?

Speaker 2

07:35

Typically, yeah. The book again suggests using a library approach here, like an ASRLib, just to keep the code organized and reusable.

Speaker 1

07:42

And you mentioned language models. Can you tell the system what kind of speech to expect, like am I dictating an email or just barking a search query?

Speaker 2

07:51

Exactly. You can specify different language models. There's one designed for free-form dictation, like long sentences, and another optimized for shorter phrases, like web search queries.

Speaker 1

08:00

Ah.

Speaker 2

08:01

The book does note though, that even with these models, the input can still be quite open ended, so the developer might need to do more processing afterwards to figure out the specific command or meaning.

Speaker 1

08:11

Oh and because these systems often connect to cloud services for the heavy lifting.

Speaker 2

08:15

Right, the recognition part. Yeah, the app usually needs permission to access the Internet, and you need to handle potential errors like no speech detected or no match found or maybe a network problem.

Speaker 1

08:29

Got it. So we've got the building blocks. The device can speak, TTS, and it can listen and turn speech into text, ASR, even giving us a list of possibilities with confidence scores. How do we actually put those together to build simple interactions?

Speaker 2

08:45

That's the next logical step, right, moving from just hearing or speaking to creating a basic back and forth. Think about those early voice actions.

Speaker 1

08:53

Like on Google Now back in the day?

Speaker 2

08:54

Exactly, telling your phone "call Mom" or "go to wikipedia.org". These are structured commands, simple cause and effect.

Speaker 1

09:01

And they're built just by combining those core TTS and ASR capabilities we just talked about?

Speaker 2

09:06

Pretty much. The book provides some straightforward examples. One is an app called VoiceSearch.

Speaker 1

09:10

It just takes whatever you say, listens using ASR?

Speaker 2

09:13

Right. It grabs the top result from that N-best list, the one with the highest confidence, assumes it's right, and immediately plugs it into a standard Android web search intent. Boom, search results appear. Very simple.

Speaker 1

09:24

Okay, but that immediately brings up a potential problem, which you hinted at. What if the ASR got it wrong? This seems particularly tricky in another example app the book mentions, VoiceLaunch, which tries to launch an installed application based on what the user says. What if you don't say the exact app name? Like, maybe you say "music player", but the app is actually called "Play Music".

Speaker 2

09:48

This is where the idea of similarity measures comes in. It's a crucial concept. The app needs a way to compare what the user said to the actual names of the apps installed on the device to find the best match, even if it's not identical.

Speaker 1

10:02

How does it do that? Just check if the letters are similar.

Speaker 2

10:05

That's part of it, orthographic similarity, looking at the spelling. But crucially, it can also look at phonetic similarity.

Speaker 1

10:11

How words sound alike?

Speaker 2

10:12

Yes, so it could figure out that, I don't know, "photos" and "fotos" probably refer to the same thing, even if the spelling's different.

Speaker 1

10:20

Okay, that's clever.

Speaker 2

10:21

The book mentions using algorithms like Soundex for this phonetic comparison, although it notes the specific implementation they included was primarily tuned for English. The key thing is normalizing the input first, like removing spaces and making everything lowercase, before you do the comparison.
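For readers who want to see the normalize-then-compare step concretely, here is a small sketch of classic American Soundex. It is not the book's implementation; the helper names `normalize`, `soundex`, and `phonetic_match` and the app names are assumptions for illustration.

```python
# Letter-to-digit table used by classic American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def normalize(text):
    """Lowercase the input and drop spaces, digits, and punctuation."""
    return "".join(ch for ch in text.lower() if ch.isalpha())

def soundex(word):
    """Return the 4-character Soundex code, or "" for empty input."""
    word = normalize(word)
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue              # h and w do not separate equal codes
        code = CODES.get(ch, "")  # vowels map to "" and reset the run
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]

def phonetic_match(spoken, app_names):
    """Return the app names whose Soundex code equals the spoken text's."""
    target = soundex(spoken)
    return [name for name in app_names if soundex(name) == target]

print(phonetic_match("calender", ["Calendar", "Camera", "Clock"]))
```

Because Soundex always keeps the first letter, a pair such as "photos" and "fotos" gets different codes, so an app would likely combine a phonetic measure like this with an orthographic one rather than rely on either alone.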

Speaker 1

10:39

Okay, so even with similarity measures, ASR isn't perfect. That potential for error means you often need to double-check with the user, right? Confirm things?

Speaker 2

10:50

Absolutely. Confirmation is vital for robust interaction. The book includes a simple example building on that VoiceSearch app. After recognizing something like "pizza places", it might use TTS to ask, "Did you say pizza places?" And then it uses ASR again, but this time listening specifically for a simple yes or no.

Speaker 1

11:06

Ah, constraining the expected input.

Speaker 2

11:11

Exactly. It's a basic but really important step, especially when you're dealing with single critical pieces of data before taking an action.

Speaker 1

11:17

So we can make the device speak, listen, perform these simple command-action pairs, handle some ambiguity with similarity, and even ask for basic yes/no confirmation. But these interactions still feel quite rigid. You know, you have to say things in a very specific way, or it's just one command at a time. How do you make the conversation more flexible, like guide the user through collecting multiple pieces of information?

Speaker 2

11:41

Okay, yeah, that takes us into the realm of more structured conversations, often called form-filling dialogue.

Speaker 1

11:47

Form filling, like on a website?

Speaker 2

11:49

Exactly, the same idea. The goal is to gather several distinct pieces of information from the user, one by one, but doing it through voice instead of text boxes and dropdowns.

Speaker 1

11:58

Okay, so, like booking a flight, it might ask, "What city are you flying from?" Then once you answer, "What is your destination?"

Speaker 2

12:05

Exactly. And then maybe, "What date do you want to travel?" To manage this, you need a system. You need a way to define the pieces of information you need. Think of these as slots to be filled.

Speaker 1

12:14

Like fields on a form?

Speaker 2

12:15

Precisely, and you need an algorithm, some logic that knows how to navigate the conversation to collect the info for each slot in some sensible order.

Speaker 1

12:24

The book points to something called VoiceXML as a kind of model for this.

Speaker 2

12:28

Yeah. VoiceXML is, or was, a W3C standard for defining these kinds of voice dialogues, often used in call center systems. It uses concepts like forms, which contain fields or slots. Each field has a prompt, which is what the system asks the user.

Speaker 1

12:44

"What is your destination?", right?

Speaker 2

12:46

And optionally, fields can have grammars associated with them, which constrain or help interpret what the user can say in response.

Speaker 1

12:53

So for a destination field, the grammar might only accept city names.

Speaker 2

12:57

Potentially, yes. And VoiceXML uses a concept called the form interpretation algorithm, or FIA. It's basically the logic engine that steps through the form, asking for one piece of required information at a time until all the necessary slots are filled. The book uses a simplified subset of these ideas specifically for Android development.

Speaker 1

13:16

And there's a specific library in the book to help build this?

Speaker 2

13:19

Yes, a library called FormFillLib, containing classes to represent these forms and fields. It works by parsing XML files that the developer writes. These XML files define the structure of the conversation: what questions to ask, in what order, which fields are needed.

Speaker 1

13:35

So the conversation logic is separate from the main app code?

Speaker 2

13:39

Exactly. It uses standard Android tools like XmlPullParser, handled via another helper library, XMLLib, to read these definitions. Then a key piece called the dialogue interpreter class steps through this structure, triggering the right TTS prompt and listening for ASR responses to fill each field.
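The slot-by-slot loop described in this part of the conversation can be sketched in a few lines. This is a simplified illustration, not the book's library or the full VoiceXML form interpretation algorithm: `Field`, `run_form`, and the `ask` callback (standing in for a TTS prompt followed by ASR) are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Field:
    name: str
    prompt: str
    value: Optional[str] = None   # None means the slot is still empty

def run_form(fields, ask, max_turns=20):
    """Prompt for the first unfilled field until every slot has a value.

    `ask(prompt)` stands in for "speak the prompt, then listen"; it should
    return the recognized text, or "" when nothing usable was heard.
    `max_turns` is a safety limit so a silent user cannot loop forever.
    """
    for _ in range(max_turns):
        pending = next((f for f in fields if f.value is None), None)
        if pending is None:
            break                      # all slots filled
        answer = ask(pending.prompt).strip()
        if answer:
            pending.value = answer     # empty answer means re-prompt
    return {f.name: f.value for f in fields}

# Scripted answers simulate a user who is not heard the first time.
scripted = iter(["", "London", "Paris", "tomorrow"])
form = [
    Field("origin", "What city are you flying from?"),
    Field("destination", "What is your destination?"),
    Field("date", "What date do you want to travel?"),
]
print(run_form(form, lambda prompt: next(scripted)))
```

Keeping the field definitions as data, separate from the loop, mirrors the separation between the XML dialogue description and the interpreter that the speakers describe.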

Speaker 1

13:55

Does it handle background tasks like parsing might take time?

Speaker 2

13:59

Good point. It's designed to do the potentially slow work, like parsing the XML or waiting for ASR, in the background using Android's AsyncTask, so the main app remains responsive. That separation of concerns is really nice.

Speaker 1

14:11

A great example used in the book is the MusicBrain app. What does that do?

Speaker 2

14:15

Right. The MusicBrain demo app uses this form-filling library. It guides the user through a voice dialogue asking for details like maybe a word that appears in an album title, or a start and end date range.

Speaker 1

14:25

Using that form structure?

Speaker 2

14:27

Exactly. Once it collects all the pieces of information it needs by filling the slots in its form, it uses that collected information to query the MusicBrainz web service, which is a big online music database.

Speaker 1

14:39

Ah. So it's combining the voice interface with external data, a mashup?

Speaker 2

14:43

Precisely. It shows how you can take the structured data gathered via voice and use it to interact with online services: retrieve information, process it, maybe filter or sort the results, like sorting albums by release date using helper classes, and then present that back to the user, perhaps speaking the results or showing them on screen.

Speaker 1

15:03

Okay, so form filling lets us manage these multi step conversations to get structured data like album name and date range. But you mentioned that the ASR input within each step was still somewhat open ended in these basic examples. How do we make the app understand more than just the words the user says? How do we get it to understand the meaning behind the words?

Speaker 2

15:24

Right. That's a critical step towards more intelligent interaction, and that's where grammars come in, leading us into the field of natural language understanding, or NLU.

Speaker 1

15:32

Grammars and NLU Okay.

Speaker 2

15:34

Grammars are tools designed specifically to help the application interpret more complex user inputs. They help extract not just the sequence of words, but the underlying meaning and specific structured pieces of information.

Speaker 1

15:47

So going beyond just recognizing "show me flights to London" as a sequence of words?

Speaker 2

15:52

To understanding that the user's intent is to see flights and the destination parameter is London.

Speaker 1

15:58

Got it. How do you create these grammars?

Speaker 2

16:01

The book discusses two main approaches. First, there are handcrafted grammars.

Speaker 1

16:05

Written manually by developers?

Speaker 2

16:07

Exactly, you write them yourself, often in an XML format like SRGS, the Speech Recognition Grammar Specification, though the book uses its own simplified XML format. You define the structure of acceptable phrases using rules, items within rules, alternatives, optional parts, and links between different rules.

Speaker 1

16:25

Can you give an example for that flight query?

Speaker 2

16:28

Sure, you might have a top-level rule with an ID like "find flight". Inside that you might have an item for the phrase "show flights" or "find flights". Then maybe an optional item, repeat="0-1", for the word "to", and then, crucially, a reference to another rule, like a ruleref with uri="#city", which defines all the valid city names.

Speaker 1

16:49

And the #city rule would list London, Paris, New York?

Speaker 2

16:53

Right. And within those rules you can use special semantic tags. So next to the item "London" in the city rule, you might have a tag whose content is out="LHR". This tells the system: if the user says "London", don't just return the word "London", return the airport code LHR.
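A rough way to see how such a grammar yields structured output is to compile it into a regular expression and attach a semantic value to each alternative. This is a sketch under assumptions, not SRGS and not the book's format: the `parse` function and the `CITY_TAGS` table are hypothetical, and only the LHR code comes from the discussion (the other airport codes are illustrative).

```python
import re

# City alternatives with their semantic tags (codes other than LHR are
# illustrative additions, not taken from the discussion).
CITY_TAGS = {"london": "LHR", "paris": "CDG", "new york": "JFK"}

city_alternatives = "|".join(re.escape(city) for city in CITY_TAGS)

# Top-level rule: ("show" | "find") "flights" ["to"] <city>
FIND_FLIGHT = re.compile(
    rf"^(?:show|find) flights(?: to)? (?P<city>{city_alternatives})$"
)

def parse(utterance):
    """Return the intent and semantic slot value, or None if no rule matches."""
    match = FIND_FLIGHT.match(utterance.lower().strip())
    if match is None:
        return None
    return {"intent": "find_flight",
            "destination": CITY_TAGS[match.group("city")]}

print(parse("Show flights to London"))
print(parse("I want a flight to London"))
```

The second call returns None, which illustrates the brittleness of handcrafted rules: any phrasing the author did not anticipate simply fails to match.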

Speaker 1

17:09

Extracting structured data directly based on the grammar match. That's powerful.

Speaker 2

17:15

Very powerful. But as you can imagine, writing these grammars to cover all the different ways a user might phrase something...

Speaker 1

17:19

"Flights to London", "show me London flights", "I want a flight to London".

Speaker 2

17:23

Exactly. Designing handcrafted grammars for spontaneous, unpredictable speech is incredibly hard and very time-consuming. That's the big challenge.

Speaker 1

17:30

So what's the alternative?

Speaker 2

17:32

That's where the second type comes in. Statistical grammars, or more broadly, statistical NLU models.

Speaker 1

17:37

Learned from data?

Speaker 2

17:39

Yes, these aren't written by hand. They're trained on vast amounts of real-world language data using machine learning techniques.

Speaker 1

17:47

And the advantage is they can be much more flexible, handle variations you didn't explicitly code for?

Speaker 2

17:52

That's the key benefit. Because they work based on probabilities and patterns learned from how people actually speak, they can often handle more irregular wording, synonyms, even slightly ungrammatical inputs that would break a strict handcrafted grammar.

Speaker 1

18:07

What's the downside?

Speaker 2

18:09

The main one is they require huge data sets to train effectively, and access to these trained models often comes via cloud services. The book mentions a service from a company called Maluuba as an example of a real-world statistical NLU system available around that time.

Speaker 1

18:22

And that kind of system tries to identify the core intention and the relevant details the entities.

Speaker 2

18:27

Precisely. You give it a phrase like "What's the weather in Belfast for tomorrow?", and a statistical NLU system could analyze it and return something structured, like: the category is weather, the action is check status, and the entities are location, Belfast, and date, tomorrow, maybe even resolving "tomorrow" to the actual calendar date. It's focused on extracting that core meaning, often regardless of the exact sentence structure used.

Speaker 1

18:52

Does the book include a library to help developers work with these different grammar types.

Speaker 2

18:56

It does, an NLULib. It contains classes for handling those handcrafted grammars: parsing the XML definitions into Java objects, converting the rules into patterns, often using regular expressions behind the scenes, and then using Java's matching tools to check user input against these patterns. It also extracts the semantic information based on those tags we talked about.

Speaker 1

19:16

And for the statistical ones?

Speaker 2

19:18

The library also includes code demonstrating how to connect to external statistical NLU services like that Maluuba API, sending the user's text and parsing the structured semantic result that comes back, assuming, of course, the developer has API access.

Speaker 1

19:32

And is there a demo app to play with this?

Speaker 2

19:35

Yes, a grammar test app. It's quite useful. It lets you input text, either typing it or using the results from ASR, and then test that input against either a handcrafted grammar file you provide or by sending it off to the statistical service.

Speaker 1

19:48

So you can see the difference?

Speaker 2

19:51

Exactly. It shows you whether the input is considered valid according to the grammar and, more importantly, what semantic information or structured representation it extracts. It's a clear way to see the different capabilities and outputs of the two approaches to understanding language.

Speaker 1

20:06

Okay, this is really building up. We've gone from basic speaking and listening to simple commands, structured form filling, and now understanding meaning with grammars and NLU. How do we make these voice apps even more robust and user friendly, maybe for a wider audience or in different situations.

Speaker 2

20:23

Well, two key aspects the book covers next are multilinguality and multimodality.

Speaker 1

20:28

Multilinguality, supporting different languages. Seems obvious, but important.

Speaker 2

20:32

Absolutely essential if you want your app to reach a global audience. It means being able to use TTS and ASR in languages other than just the default.

Speaker 1

20:39

How do developers handle that?

Speaker 2

20:42

They specify the languages using standard codes, like ISO 639-1 codes, "en" for English, "es" for Spanish, and so on. The Android system provides ways, again using intents, to check which languages are actually supported or installed on the user's specific device.

Speaker 1

20:57

Because not all devices might have all languages preinstalled, right?

Speaker 2

21:02

And this ties directly into the broader concept of localization in Android development, you know, providing different text strings, images, and layouts in resource folders like res/values-es for Spanish users versus res/values-en for English users. It's about adapting the whole app experience. The book has a very simple "silly parrot" app just to show switching the TTS and ASR language.

Speaker 1

21:22

Okay, so that's multiple languages. What about multimodality? Sounds complex.

Speaker 2

21:26

It just means combining voice interaction with the traditional graphical user interface, the GUI, the buttons and screens we're used to tapping on.

Speaker 1

21:32

Ah, voice and touch working together. Why do that?

Speaker 2

21:36

Because sometimes tapping is just easier or faster than speaking for certain inputs, or users might want visual feedback confirming what the system understood, or maybe they start by voice and then finish by touch, or vice versa. The idea is to create a seamless link. A link between what? Between the fields of information we were talking about in form-filling dialogues and the visual elements on the screen

22:00

like drop-down lists (Spinners in Android), lists you can scroll (ListViews), radio buttons, checkboxes, and text entry fields (EditText).

Speaker 1

22:07

Okay, so how does that work in practice?

Speaker 2

22:10

Imagine you have a field in your voice dialogue for, say, urgency, with options low, medium, high. You might also have radio buttons on the screen for the same options. If the user says "medium urgency", the app not only fills that voice field internally, but it also automatically checks the medium radio button on the screen.

Speaker 1

22:27

Ah, synchronization.

Speaker 2

22:30

Exactly. And conversely, if the user taps the high radio button on the screen, the app knows that the urgency information has been provided, so the voice dialogue system shouldn't ask for it orally anymore. The state is shared.
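The shared-state idea can be sketched with a single object that both the voice side and the touch side update. This is a toy model, not the book's library: the class `MultimodalField`, the attribute `checked_button` (standing in for a real radio group), and the method bodies are assumptions, with method names chosen only to echo the oral-to-GUI and GUI-to-oral directions mentioned in the discussion.

```python
class MultimodalField:
    """One piece of information that can be filled by voice or by touch."""

    def __init__(self, name, options):
        self.name = name
        self.options = options
        self.value = None            # what the dialogue knows
        self.checked_button = None   # stand-in for the on-screen radio button

    def oral_to_gui(self, utterance):
        """Fill the field from speech and mirror the result on screen."""
        words = utterance.lower().split()
        match = next((opt for opt in self.options if opt in words), None)
        if match is None:
            return False             # utterance named no valid option
        self.value = match
        self.checked_button = match
        return True

    def gui_to_oral(self, option):
        """Fill the field from a tap so the dialogue will not ask again."""
        if option not in self.options:
            raise ValueError(f"unknown option: {option}")
        self.checked_button = option
        self.value = option

    def needs_prompt(self):
        return self.value is None

urgency = MultimodalField("urgency", ["low", "medium", "high"])
urgency.oral_to_gui("medium urgency")
print(urgency.checked_button, urgency.needs_prompt())
```

Restricting accepted speech to the same option list the GUI displays is what keeps the two views from drifting apart.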

Speaker 1

22:41

That sounds really useful, but potentially complex to manage.

Speaker 2

22:44

It requires careful design. Grammars become important here too, to ensure that the voice input for a specific field actually matches one of the valid options available in the corresponding GUI element, like the items in a drop-down list.

Speaker 1

22:58

And the book provides help for this too?

Speaker 2

23:00

Another library, yes. Building on the form-filling one, a MultimodalFormFillLib. It extends the basic form library by adding grammar checking within the dialogue flow, so it only accepts voice input that actually matches the grammar defined for the current field. And crucially, it includes methods like oralToGui and guiToOral, specifically designed to synchronize the state between the internal voice fields and the visual GUI elements.

Speaker 1

23:24

Is there an example app for this?

Speaker 2

23:26

They use a mock send-message app as a demonstration. It lets the user provide details for sending a message, recipient, urgency, maybe the message body itself, either by speaking the information...

Speaker 1

23:35

Following the form-filling prompts.

Speaker 2

23:37

Right, or by interacting directly with the GUI elements on the screen, like picking a contact from a list or tapping a radio button for urgency. The app keeps track of the information consistently, regardless of whether it came via voice or touch. It really highlights how voice and touch don't have to be separate, isolated interaction modes. They can complement each other within the same task.

Speaker 1

23:59

Okay, this is quite a journey. We've layered on all these capabilities: speaking, listening, simple commands, handling ambiguity, multi-step forms, really understanding language with grammars and NLU, adding flexibility with multiple languages, and combining voice with the screen through multimodality. What happens when you take all these pieces and integrate them into one coherent system?

Speaker 2

24:23

Well, that's when you get into the realm of virtual personal assistants, VPAs.

Speaker 1

24:25

Ah, the Siris, the Google Assistants, the Alexas of the world.

Speaker 2

24:30

Exactly. These are the conversational agents we're much more familiar with today. They really represent the culmination of all these underlying technologies we've been discussing, designed to understand potentially complex requests and perform a whole range of tasks.

Speaker 1

24:43

And the fundamental challenge for a VPA trying to bring everything together must be accurately figuring out the user's intention from whatever they happen to say, however they say it.

Speaker 2

24:53

That's the core of it. And as we saw when discussing grammars and NLU, you can approach this understanding in different ways. You might try to classify the user's input using statistical methods trained on lots of data.

Speaker 1

25:04

Which is good for more open-ended questions or requests.

Speaker 2

25:07

Right, or you might use more structured grammars for specific commands, extracting the core intent and any relevant details or parameters. Often modern systems use a hybrid approach.

Speaker 1

25:18

And once the intention is hopefully understood, the VPA needs more logic, right? It needs a system to decide what to do next.

Speaker 2

25:25

Definitely, it needs dialogue management. Based on the understood intent and the current state of the conversation, the dialogue manager decides the next action. Does it need to ask a clarifying question, Does it have enough information to perform the task? Does it need to access some data?

Speaker 1

25:40

And then it needs to generate a response.

Speaker 2

25:42

Yes, response generation: figuring out what to say back to the user via TTS, or what information to display on the screen, or what action to actually perform on the device or via a web service.

Speaker 1

25:54

This whole idea of conversational AI has a long history, doesn't it, even before smartphones?

Speaker 2

25:59

Oh yeah, it goes way back. The book even mentions early chatbots like ELIZA from the 1960s. But for building more sophisticated VPAs around the time the book was written, it focuses on using a specific platform called Pandorabots.

Speaker 1

26:11

Pandorabots, what's that?

Speaker 2

26:13

It's a platform, still around today actually, for creating and hosting conversational agents or chatbots. It uses a language called AIML. AIML, Artificial Intelligence Markup Language, is based on XML, and it lets you define the bot's conversational behavior using categories. Each category has a pattern, which is basically...

Speaker 1

26:31

What the user might say, a potential input phrase.

Speaker 2

26:33

Right, and a template, which defines how the bot should respond. You can use wildcards like * in the patterns to make them more flexible, matching multiple variations of user input, and there are special tags like srai that let you redirect from one pattern to another, helping to handle synonyms or rephrased requests without duplicating logic.

Speaker 1

26:53

What happens if the user says something that doesn't match any pattern?

Speaker 2

26:57

Good question. AIML has the concept of an ultimate default category, a fallback pattern that matches anything else, usually triggering a response like "Sorry, I didn't understand that" or "Could you rephrase?" It's crucial for making the bot seem less brittle.
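As an illustration of categories, wildcards, srai redirection, and the ultimate default category, here is a toy matcher in Python. It is a simplified sketch and not how Pandorabots actually matches input: real AIML interpreters use a more elaborate matching algorithm, and the sample categories below are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample categories: a literal pattern, an srai redirect, a wildcard
# with <star/>, and the ultimate default category (bare "*").
AIML = """
<aiml version="1.0.1">
  <category><pattern>HELLO</pattern><template>Hi there!</template></category>
  <category><pattern>HI *</pattern><template><srai>HELLO</srai></template></category>
  <category><pattern>MY NAME IS *</pattern><template>Nice to meet you, <star/>.</template></category>
  <category><pattern>*</pattern><template>Sorry, I didn't understand that.</template></category>
</aiml>
"""


def load_categories(aiml_text):
    """Parse AIML into (pattern, template element) pairs, most specific first."""
    root = ET.fromstring(aiml_text)
    cats = [(c.find("pattern").text.strip().upper(), c.find("template"))
            for c in root.iter("category")]
    # Simplification: fewer wildcards and more words first, so "*" is tried last.
    cats.sort(key=lambda c: (c[0].count("*"), -len(c[0].split())))
    return cats


def to_regex(pattern):
    """Turn an AIML pattern into a regex where * matches one or more words."""
    parts = [r"(.+)" if word == "*" else re.escape(word) for word in pattern.split()]
    return re.compile(" ".join(parts))


def normalize(text):
    """Uppercase the input and drop punctuation, as AIML matching expects."""
    return " ".join(re.findall(r"[A-Z0-9']+", text.upper()))


def render(template, star, cats):
    """Build the reply, expanding <srai> (redirect) and <star/> (wildcard text)."""
    out = [template.text or ""]
    for child in template:
        if child.tag == "srai":
            out.append(respond(child.text or "", cats))
        elif child.tag == "star":
            out.append(star.title())  # title-casing is cosmetic, for this demo only
        out.append(child.tail or "")
    return "".join(out).strip()


def respond(user_input, cats):
    """Return the template output of the first category whose pattern matches."""
    text = normalize(user_input)
    for pattern, template in cats:
        match = to_regex(pattern).fullmatch(text)
        if match:
            star = match.group(1) if match.groups() else ""
            return render(template, star, cats)
    return ""
```

With `cats = load_categories(AIML)`, the input "hi robot" is redirected through srai to the HELLO category, and anything unmatched falls through to the default category.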

Speaker 1

27:12

Okay, so the core conversation logic, matching input patterns to output templates, runs on the Pandorabots platform. How does that connect to doing things on the Android device?

Speaker 2

27:22

This is where a really interesting feature mentioned in the book comes in, specifically designed for mobile VPAs. It's the oob tag: O-O-B, out of band. It's a special tag that the AIML developer can embed within the bot's response template alongside the text the bot is supposed to say.

27:37

How does that work? So the bot's template might have the text "Okay, I'll search the web for that for you," but hidden within the same AIML template there could also be an oob tag containing a command like search, wrapped around the query the user asked for. Okay. The Android app communicates with the Pandorabot online. It sends the user's transcribed speech from ASR. The Pandorabot finds the matching pattern and sends back the AIML template response. The Android app

28:01

then parses this response. It takes the regular text part, "Okay, I'll search," and sends it to the TTS engine to be spoken. But it also looks for any oob tags. If it finds one, like search, it intercepts that command and executes the corresponding action on the device, in this case launching a web search with the specified query. Wow.
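The client-side parsing step can be sketched as follows. This is a minimal Python illustration rather than the book's Android code: the sample response string, the exact nesting of a search element inside the out-of-band (oob) element, and the handler table are assumptions. On a real device the handler would launch the search instead of returning a string.

```python
import re

# Matches an out-of-band block, and a single command element inside it.
OOB_RE = re.compile(r"<oob>(.*?)</oob>", re.DOTALL)
CMD_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def split_response(template_output):
    """Separate the text to be spoken from any out-of-band commands."""
    commands = []
    for oob_body in OOB_RE.findall(template_output):
        commands.extend(CMD_RE.findall(oob_body))  # (command name, argument) pairs
    # Remove the oob blocks and tidy whitespace so only speakable text remains.
    spoken = " ".join(OOB_RE.sub(" ", template_output).split())
    return spoken, commands


def dispatch(commands, handlers):
    """Run the handler registered for each command, skipping unknown ones."""
    return [handlers[name](arg) for name, arg in commands if name in handlers]


# Hypothetical bot output and handler table for demonstration.
SAMPLE = "Okay, I'll search the web for that for you. <oob><search>android voice apps</search></oob>"
HANDLERS = {"search": lambda query: "SEARCH:" + query}
```

Here `split_response(SAMPLE)` yields the sentence for the TTS engine plus one `("search", "android voice apps")` command, and `dispatch` routes that command to whatever function the app has registered for it.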

Speaker 1

28:20

So the conversation logic and knowledge stay centralized on the Pandorabots server, defined in AIML, but the actual device actions, searching, launching apps, making calls, opening URLs, are triggered locally on the Android device by these hidden commands embedded in the bot's response.

Speaker 2

28:36

Exactly. It's a clever way to decouple the conversational intelligence from the specific device functionalities. The book provides a VPLib library to handle this communication: connecting to a specific Pandorabot online using its bot ID, sending the ASR input, parsing the XML response from Pandorabots, looking for the spoken part, often marked by a that tag, and checking for oob tags, and then running the corresponding functions on the device

29:01

based on those tags. The book did note a limitation at the time that the gender of the TTS voice couldn't be controlled programmatically, which might affect the perceived persona of the bot.
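A rough Python sketch of the two ends of that exchange follows: building the request and pulling the reply out of the XML that comes back. The form field names, the sample response layout, and the helper names are assumptions made for illustration; they are not confirmed by the book or checked against the live Pandorabots service, and no network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_request_body(bot_id, user_text):
    """Encode the bot identifier and the recognised speech as form data.

    The field names "botid" and "input" are assumptions for this sketch.
    """
    return urlencode({"botid": bot_id, "input": user_text})


def extract_reply(xml_text):
    """Return the text inside the first <that> element, or None if absent."""
    root = ET.fromstring(xml_text)
    that = root.find("that")
    if that is None:
        return None
    # itertext() also collects text nested inside child elements.
    return "".join(that.itertext()).strip()


# Hypothetical server response, shaped the way the conversation describes it.
SAMPLE_XML = """
<result botid="EXAMPLE_BOT_ID">
  <input>hello</input>
  <that>Hi there!</that>
</result>
"""
```

After `extract_reply` returns the bot's reply, the app would still need to separate any embedded device commands from the speakable text before handing the rest to TTS.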

Speaker 1

29:10

Does the book include sample VPAs built using this Pandorabots approach?

Speaker 2

29:14

It does. It describes three: Jack, Derek, and Stacy. Jack is based on a well-known general-purpose AIML bot called ALICE, designed for broad, open-ended conversation. Derek, on the other hand, is presented as a specialized bot. It's trained specifically on a particular knowledge domain. The example uses FAQs about type two diabetes. This highlights how you can encode expert or domain-specific knowledge using these AIML patterns and templates.

Speaker 1

29:38

So one generalist, one specialist, and Stacy.

Speaker 2

29:41

Stacy is basically Jack, the general conversational bot, but enhanced with that oob tag functionality, so Stacy can not only chat but can also actually control device functions like searching online or launching apps, based on understanding those embedded commands within the conversation flow.

Speaker 1

30:00

That really brings it all together. If you look back at how the book illustrates the overall VPA structure, you know, starting with ASR capturing the audio, then spoken language understanding, whether that's NLU, grammars, or AIML pattern matching, to figure out the intent.

Speaker 2

30:16

Right, the understanding part.

Speaker 1

30:18

Then dialogue management deciding the next step, response generation formulating the output, and finally TTS speaking the response.

Speaker 2

30:25

All while potentially connecting to external data sources and knowledge bases, and triggering those device actions via things like the oob tag.

Speaker 1

30:32

You can really see how all those individual building blocks we discussed, TTS, ASR, forms, grammars, multimodality, fit together into that complete VPA system.

Speaker 2

30:42

It is the full picture. It shows how you assemble these components to create those conversational agents.
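The whole chain recapped above can be sketched as a single turn through five stub stages. Every function here is a hypothetical stand-in, written in Python for brevity: a real Android app would call the platform's speech recognizer and TTS engine, and the understanding step would be a grammar, an NLU model, or an AIML bot rather than a string prefix check.

```python
def asr(audio):
    """Stand-in for speech recognition: pretend the audio is already text."""
    return audio.lower().strip()


def understand(text):
    """Stand-in for spoken language understanding: one hard-coded command."""
    prefix = "search for "
    if text.startswith(prefix):
        return {"intent": "search", "query": text[len(prefix):]}
    return {"intent": "chat", "text": text}


def manage(meaning):
    """Dialogue management: choose the next action from the understood intent."""
    if meaning["intent"] == "search" and meaning["query"]:
        return {"act": "do_search", "query": meaning["query"]}
    return {"act": "fallback"}


def generate(action):
    """Response generation: the text to speak plus an optional device command."""
    if action["act"] == "do_search":
        return f"Okay, searching for {action['query']}.", ("search", action["query"])
    return "Sorry, I didn't understand that.", None


def tts(text):
    """Stand-in for speech synthesis: tag the text instead of speaking it."""
    return f"[spoken] {text}"


def run_turn(audio):
    """One full turn: ASR, understanding, dialogue management, generation, TTS."""
    speech, device_command = generate(manage(understand(asr(audio))))
    return tts(speech), device_command
```

Keeping each stage behind its own function is the point of the sketch: any one of them can be swapped for a real component without touching the others.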

Speaker 1

30:46

And that really brings our deep dive to a close, doesn't it? We've journeyed all the way from the absolute fundamentals, the device's ability to speak using TTS and listen using ASR.

Speaker 2

30:57

The core building blocks.

Speaker 1

30:59

Through building those first simple command interactions, bringing in ideas like similarity measures and confirmation.

Speaker 2

31:05

Making them a bit more robust.

Speaker 1

31:06

And structuring more complex multi-turn conversations using form-filling dialogues, managing the flow to gather information.

Speaker 2

31:12

Getting that structured data.

Speaker 1

31:14

Then diving into how to understand user input more deeply, moving beyond words to meaning using both handcrafted and statistical grammars via NLU.

Speaker 2

31:23

Extracting intent and entities.

Speaker 1

31:27

And then adding layers of flexibility through multilinguality, supporting different languages, and multimodality, seamlessly combining voice with graphical interfaces.

Speaker 2

31:36

Making the interaction richer and more adaptable.

Speaker 1

31:38

And finally we saw how all these technologies converge in the architecture of virtual personal assistants, using platforms like Pandorabots with AIML and leveraging clever techniques like that oob tag to bridge the gap between conversation and actually doing things on the device.

Speaker 2

31:56

Absolutely. This deep dive, even though it's rooted in a technical guide from over a decade ago now, really gives you a solid conceptual shortcut. It helps you understand the core challenges and the fundamental building blocks involved in bringing voice interaction to life on a device.

Speaker 1

32:13

Yeah, it lays out the pieces really clearly. And considering all these elements together, from breaking down sounds and words to understanding complex intent, managing dialogue flow over multiple turns, generating responses, and having the ability to trigger pretty much any device function or connect to any web service, it really opens up your imagination, doesn't it, to the kind of truly personalized voice applications you could potentially create.

Speaker 2

32:36

Right. It makes you think beyond the general-purpose assistants we mostly use today. What if a voice assistant wasn't just generic? What if it was deeply, deeply specialized?

Speaker 1

32:45

Yeah, like, imagine a VPA that understands the incredibly specific vocabulary, the jargon, the unique needs of your particular hobby or your specific job, maybe using a custom-built grammar or a finely tuned NLU model. And it could instantly connect you to niche online resources or internal databases that a general assistant wouldn't even know existed.

Speaker 2

33:08

Or think about something more personal, maybe an assistant that reads you recipes, but it uses curated audio samples to sound like a comforting, familiar voice, maybe even your grandmother's voice, if you had the recordings. And it lets you navigate the cooking steps completely hands-free, using just simple voice commands tailored to that specific task.

Speaker 1

33:26

That would be amazing. It makes you wonder what kinds of unique, genuinely helpful, or even just wonderfully quirky and specific voice interfaces are still waiting to be built, especially when you start combining these foundational technologies, TTS, ASR, NLU, dialogue management, device control, in new and unexpected ways.

Speaker 2

33:45

Exactly. What's the potential beyond the mainstream assistants we interact with now? It definitely leaves you with something to think about.

Transcript source: Provided by creator in RSS feed