Artificial Intelligence for .NET: Speech, Language, and Search: Building Smart Applications with Mic

Speaker 1

00:00

Welcome to the deep dive. Today, we're really getting into the weeds of artificial intelligence.

Speaker 2

00:06

Specifically for .NET developers.

Speaker 1

00:09

Right, exactly. We're looking at how you can actually use things like speech, language, and search.

Speaker 2

00:14

All powered by Microsoft's Cognitive Services, based on the source material we've got.

Speaker 1

00:18

Yeah, so think of this as, well, moving past the buzzwords.

Speaker 2

00:21

Right, from the big ideas down to the practical tools you can use to build smarter apps.

Speaker 1

00:25

Our goal here is, you know, to cut through the noise, pull out the really key concepts, get...

Speaker 2

00:29

...those aha moments, maybe uncover some surprising bits.

Speaker 1

00:33

And do it without you needing to, like, write a single line of code right now. It's...

Speaker 2

00:37

...about understanding the essentials, the foundation, the services, so you know what's possible.

Speaker 1

00:43

Okay, so let's lay that foundation. AI: it's everywhere, and it often sounds like science fiction. What's the real deal?

Speaker 2

00:50

Well, at its core, it's about building systems that do things needing human...

Speaker 1

00:54

...like intelligence, but maybe not sentient robots just yet.

Speaker 2

00:58

Hah, no, it's much more grounded. Think specific tasks, specific capabilities that are available now.

Speaker 1

01:05

And getting here wasn't exactly a smooth ride, was it? There were these AI winters.

Speaker 2

01:10

That's right, two main periods: sort of mid-seventies to early eighties, and then again late eighties to early nineties. What happened there? Basically, the hype got way ahead of the actual technology. Promises were made that just couldn't be delivered with the computing power...

Speaker 1

01:25

...back then. So, uh, funding dried up, progress stalled. Exactly.

Speaker 2

01:29

People got disillusioned. But then things started picking up again.

Speaker 1

01:33

Why? What changed?

Speaker 2

01:34

Computers got faster, cheaper, you know, Moore's law in action. Suddenly some older ideas became...

Speaker 1

01:40

...feasible, and we started seeing these specialized systems achieving big things.

Speaker 2

01:45

Yes, like IBM's Deep Blue beating Garry Kasparov at chess in ninety-seven.

Speaker 1

01:50

That was huge, a machine beating the world champ in such a complex game.

Speaker 2

01:53

It showed that focused AI could really excel. But the current boom, that really kicked off more recently.

Speaker 1

01:58

Driven by the tech giants investing heavily.

Speaker 2

02:00

Absolutely, and then you had moments like IBM Watson winning Jeopardy, yeah, in twenty eleven.

Speaker 1

02:05

Right, that really showcased natural language processing in a big way.

Speaker 2

02:09

It did, and after that companies really started productizing these AI capabilities, making them available as services.

Speaker 1

02:16

Which brings us to today, where AI feels like it's baked into so much tech.

Speaker 2

02:21

Your phone, websites, even video game characters reacting to what you do.

Speaker 1

02:26

So it went from big dreams, some setbacks, and now it's practical, usable tech.

Speaker 2

02:32

Pretty much, and that practicality is changing how we even interact with computers.

Speaker 1

02:37

Let's talk about user interfaces. We started way back with the command line, the CLI.

Speaker 2

02:41

Powerful, yeah, if you knew the commands, but super intimidating for beginners.

Speaker 1

02:45

Like learning a secret code. Kind of, yeah.

Speaker 2

02:47

Then came the GUI, the graphical user interface.

Speaker 1

02:51

Total game changer: windows, icons, the mouse.

Speaker 2

02:54

Building on work from places like Xerox PARC, then popularized by Apple and Microsoft. It made computing accessible.

Speaker 1

03:01

You could see what you were doing. Exactly.

Speaker 2

03:03

But now there's another shift happening towards conversation.

Speaker 1

03:06

The conversational user interface, or CUI, right?

Speaker 2

03:09

The idea is you just talk or type to the system, like messaging a friend. No clicking through menus.

Speaker 1

03:14

So ordering pizza is just typing "get me a large pepperoni."

Speaker 2

03:18

That's the goal: simple, natural interaction. Messaging apps have made us really comfortable...

Speaker 1

03:23

...with this. Okay, but that sounds like it needs some serious smarts behind it.

Speaker 2

03:26

It absolutely does. This is where AI is crucial, specifically natural language understanding, NLU.

Speaker 1

03:33

Because the system has to figure out what you actually mean, not just what you typed.

Speaker 2

03:38

Precisely, it needs NLU in the back end to interpret that conversational input.

Speaker 1

03:43

So "what's the weather" and "forecast for today" mean the same thing to the system?

Speaker 2

03:49

A good NLU system should understand that, yes. It identifies the user's goal, the intent.

Speaker 1

03:55

It pulls out the important bits of information.

Speaker 2

03:57

Those are the entities. So, "weather in London tomorrow": the intent is get weather, and the entities are London and tomorrow.

Speaker 1

04:03

Seems intuitive for us, but you mentioned CUIs aren't perfect.

Speaker 2

04:07

There are challenges. Oh, definitely. They struggle with really complex, nuanced conversations, and there are risks.

Speaker 1

04:13

Remember Tay, Microsoft's Twitter bot? Oh yeah, that went sideways fast.

Speaker 2

04:17

It learned from interactions, but unfortunately it learned toxic stuff very quickly and started spewing offensive tweets.

Speaker 1

04:25

Had to be shut down almost immediately.

Speaker 2

04:27

A really stark reminder that AI learning from the real world needs careful controls. You can't just let it loose without safeguards.

Speaker 1

04:34

So maybe a mix is better for now, combining CUI and GUI?

Speaker 2

04:38

Yeah, the source suggests a hybrid approach often makes sense. Use conversation for simple things, stick to graphical interfaces for more complex tasks.

Speaker 1

04:47

Okay, let's dig into that NLU piece more. It's fundamental. Why is it considered an AI-hard problem?

Speaker 2

04:54

Because human language is just incredibly complex and subtle. Getting a machine to grasp it properly isn't just one algorithm.

Speaker 1

05:01

It's like computer vision or machine translation: it requires lots of different techniques working together.

Speaker 2

05:05

Exactly, there are multiple layers of difficulty. Like what? Well, first, there's syntax: the grammar, the structure of sentences. Machines need to parse that correctly.

Speaker 1

05:14

Okay, sentence rules. Makes sense.

Speaker 2

05:16

Then semantics. That's the meaning of words and sentences, synonyms, words with multiple meanings.

Speaker 1

05:22

"I like apples" versus "I'm fond of apples": same meaning, different words.

Speaker 2

05:27

Right, the machine needs to get that underlying concept.

Speaker 1

05:30

Sounds tricky. What's...

Speaker 2

05:32

...next? Pragmatics. This is maybe the toughest bit. It's understanding the implied meaning, the context. What's not being explicitly said. Exactly. If I say, "Wow, it's hot in here," I might actually mean, "Can you open a window?" The machine needs situational awareness.

Speaker 1

05:48

Which computers usually lack. They don't have our common sense or world knowledge. Precisely.

Speaker 2

05:53

And then you've got just plain...

Speaker 1

05:55

...ambiguity, words meaning different things.

Speaker 2

05:57

Yeah, like "bank": riverbank or financial bank? Or sentences that can be read multiple ways, like "I saw a man on a hill with a telescope."

Speaker 1

06:04

The classic "who has the telescope?" Right.

Speaker 2

06:06

And finally, just the sheer variation in language: spoken versus written, dialects, slang, typos. It's messy, very messy for a machine to handle consistently.

Speaker 1

06:16

So syntax, semantics, pragmatics, ambiguity, variation: quite a challenge. Were there early attempts to crack this?

Speaker 2

06:24

Yeah, some famous ones. ELIZA, back in the sixties, mimicked a therapist using pattern matching. It seemed smart, but didn't really...

Speaker 1

06:31

...understand. More like clever tricks.

Speaker 2

06:33

Kind of. A bigger step was SHRDLU, around nineteen seventy.

Speaker 1

06:37

SHRDLU? What did it do?

Speaker 2

06:39

It operated in a tiny virtual world of blocks. You could tell it "pick up the blue pyramid" or ask questions about the blocks. And it understood? Within that very limited world, yes, remarkably well. It showed that NLU was possible if you constrain the domain significantly.

Speaker 1

06:55

Okay, so NLU is hard but vital. How do developers like our listeners actually use it today without building it all themselves.

Speaker 2

07:03

That's where cloud services come in, like Microsoft's LUIS, the Language Understanding Intelligent Service.

Speaker 1

07:09

So it's like NLU as a service? Pretty much.

Speaker 2

07:11

You don't need to be the deep learning expert. Your job is mainly to train it for your specific application.

Speaker 1

07:16

How does that training work?

Speaker 2

07:17

You feed it example sentences, they're called utterances, that your users might...

Speaker 1

07:21

...say. Things like "find me a nearby Italian restaurant."

Speaker 2

07:24

Exactly. And for each utterance you tell LUIS the user's goal, the intent, like find restaurant, and you label the key info, the...

Speaker 1

07:32

...entities. So Italian would be cuisine type, nearby implies location. Precisely.

Speaker 2

07:37

You provide lots of examples, and LUIS learns from them using machine learning algorithms. Then when a new sentence comes in, it predicts the intent and extracts the entities.

Speaker 1

07:47

What are the main bits you configure in LUIS?

Speaker 2

07:50

You define your intents, the actions users can take, and you define your entities, the data points you need.

Speaker 1

07:55

Are there different types of entities?

Speaker 2

07:57

Yes, quite a few. Simple entities are ones you define, like product category. But LUIS also has prebuilt entities, which are super useful.

Speaker 1

08:05

What do they cover?

Speaker 2

08:06

Common stuff like dates, times, numbers, locations, email addresses, percentages. Saves you a lot of effort. It already knows how to recognize "next Tuesday at three pm."

Speaker 1

08:16

That's handy. What else?

Speaker 2

08:17

You can create composite entities to group related entities, like an order entity containing item and quantity, and hierarchical entities for parent-child relationships, like person name having first name and last name. They help...

Speaker 1

08:30

...structure the extracted data. What about phrase lists?

Speaker 2

08:33

Think of them as giving LUIS hints. You list words or phrases that are strong indicators for certain intents or entities, like a list of all your product names or synonyms for "book a meeting."

Speaker 1

08:46

It helps boost the signal for important terms. Exactly.

Speaker 2

08:49

And then there's active learning. This is really important...

Speaker 1

08:51

...after you launch. What does that do?

Speaker 2

08:53

LUIS identifies utterances it wasn't very sure about. It shows them to you. You clarify the correct intents and entities, and that feedback helps retrain and improve the model over time.

Speaker 1

09:04

So the model gets smarter based on real user interactions.

Speaker 2

09:07

Correct. It's a continuous improvement cycle. And...

Speaker 1

09:10

...the overall flow for an app using LUIS? Your...

Speaker 2

09:13

...app gets the user's text and sends it to the LUIS API. LUIS sends back JSON with the predicted intent and entities. Your app uses that info to...

Speaker 1

09:21

...do the right thing, like calling another API, querying a database, whatever...

Speaker 2

09:25

...the action is. Exactly. It integrates nicely with things like the Microsoft Bot Framework for building chatbots.
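
The flow just described (send text, get back JSON with an intent and entities, then act on it) can be sketched roughly as follows. This is a minimal illustration, not the service's actual client code: the sample JSON is hand-written and only modeled on the general shape of a LUIS prediction, and the field names, the `GetWeather` intent, the handler table, and the `min_score` cutoff are all assumptions made for the example.

```python
import json

# Hypothetical response, modeled on the general shape of a LUIS prediction
# (a top-scoring intent plus a list of entities). Field names are assumptions.
SAMPLE_RESPONSE = """
{
  "query": "weather in London tomorrow",
  "topScoringIntent": {"intent": "GetWeather", "score": 0.97},
  "entities": [
    {"entity": "london", "type": "Location"},
    {"entity": "tomorrow", "type": "Date"}
  ]
}
"""


def get_weather(entities):
    # Placeholder action; a real app would call a weather API here.
    place = entities.get("Location", "somewhere")
    when = entities.get("Date", "today")
    return f"Looking up weather for {place} ({when})"


# Map intent names to the functions that carry them out (illustrative only).
HANDLERS = {"GetWeather": get_weather}


def handle_prediction(raw_json, min_score=0.5):
    """Parse a prediction and route it to the matching handler."""
    data = json.loads(raw_json)
    top = data["topScoringIntent"]
    # Collapse the entity list into a {type: value} lookup.
    entities = {e["type"]: e["entity"] for e in data["entities"]}
    handler = HANDLERS.get(top["intent"])
    if handler is None or top["score"] < min_score:
        return "Sorry, I didn't understand that."
    return handler(entities)


print(handle_prediction(SAMPLE_RESPONSE))
```

Keeping the intent-to-handler mapping in a table means adding a new intent is one new function and one new entry, and the low-score fallback gives the app a safe answer when the prediction is uncertain.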

Speaker 1

09:31

Okay, LUIS handles the core understanding. What other text analysis tools are there in Cognitive Services?

Speaker 2

09:38

Several useful ones. There's the Bing Spell Check API. Just basic spell check? It's smarter than that. It's contextual. It understands that "booking" is correct in "booking a flight," but maybe not somewhere else. It gets proper nouns like Microsoft, even if slightly misspelled. Ah...

Speaker 1

09:56

...so it considers the surrounding words. Useful for cleaning up user input. Definitely.

Speaker 2

10:01

It even handles some slang and common brand name misspellings.

Speaker 1

10:05

What else in the text suite?

Speaker 2

10:06

The Text Analytics API bundles a few things. Language detection figures out what language the text is in, useful for routing support tickets or filtering content.

Speaker 1

10:15

And sentiment analysis? That seems really popular.

Speaker 2

10:17

It is. Analyzing if text is positive, negative, or neutral. Companies use it constantly for customer reviews, social media monitoring.

Speaker 1

10:24

Getting a pulse on customer opinion at scale.

Speaker 2

10:26

Right, it usually gives a score like point nine for very positive, point one for very negative.
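
To make the score idea concrete, here is a small sketch of turning a zero-to-one sentiment score into a coarse label. The thresholds and the sample scores are invented for illustration; they are not values returned by the service, and a real application would tune the cutoffs on its own data.

```python
def label_sentiment(score, low=0.4, high=0.6):
    """Map a 0..1 sentiment score to a coarse label.

    The `low` and `high` thresholds are illustrative assumptions.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= high:
        return "positive"
    if score <= low:
        return "negative"
    return "neutral"


# Made-up review texts with made-up scores, standing in for API results.
reviews = {
    "Loved it, works perfectly": 0.9,
    "Broke after a day": 0.1,
    "It arrived on Tuesday": 0.5,
}

for text, score in reviews.items():
    print(f"{score:.1f}  {label_sentiment(score):8}  {text}")
```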

Speaker 1

10:31

Does it do summarization too?

Speaker 2

10:33

Not exactly summarization, but key phrase extraction pulls out the main talking points, the important noun phrases, and topic detection can group large amounts of text like reviews, into underlying themes.

Speaker 1

10:45

The source also mentioned something called the Web language model or web LM.

Speaker 2

10:49

Yeah, that's a language model trained on, well, huge amounts of web data from Bing. It understands common word sequences and probabilities.

Speaker 1

10:57

What's that used for?

Speaker 2

10:58

For things like word breaking, splitting "buyticketsnow" into "buy tickets now." Calculating joint probability: how likely is the phrase "natural language processing" versus, say, "natural language pineapple"? Okay...

Speaker 1

11:12

...measuring how natural a phrase sounds.

Speaker 2

11:14

And conditional probability, predicting the next word: given "artificial," how likely is "intelligence" to follow? This powers things like autocorrect and text suggestions.
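
Joint and conditional probability are easier to see with numbers. The sketch below estimates both from a tiny hand-made corpus using simple bigram counts. It is a toy stand-in for the idea, not how the Web Language Model itself is built: the corpus, the function names, and the unsmoothed counting are all assumptions for the example.

```python
from collections import Counter

# Tiny illustrative corpus; a real model is trained on vastly more text.
corpus = [
    "artificial intelligence is everywhere",
    "natural language processing is hard",
    "artificial intelligence needs natural language processing",
    "artificial sweetener is everywhere",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

total = sum(unigrams.values())


def conditional(prev, word):
    """Estimate P(word | prev) from counts; 0.0 if prev was never seen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


def joint(phrase):
    """Estimate P(phrase) as P(first word) times the bigram conditionals."""
    words = phrase.split()
    prob = unigrams[words[0]] / total
    for prev, word in zip(words, words[1:]):
        prob *= conditional(prev, word)
    return prob


print(conditional("artificial", "intelligence"))
print(joint("natural language processing"))
print(joint("natural language pineapple"))
```

Because nothing is smoothed, any phrase containing an unseen word pair gets probability zero, which is exactly why "natural language pineapple" scores below "natural language processing" here.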

Speaker 1

11:22

Wow, quite a toolbox for text. Now, what about turning speech into text and back?

Speaker 2

11:27

That's where the speech APIs come in. Speech to text, STT, converts audio to text. Text to speech, TTS, does the reverse.

Speaker 1

11:34

How does STT work, generally?

Speaker 2

11:36

It analyzes the audio signal, breaks it down into basic sound units called phonemes, and uses acoustic and language models to figure out the most likely sequence of words. It usually gives a confidence score...

Speaker 1

11:48

...too. And Microsoft's offerings?

Speaker 2

11:50

There are standard speech APIs, but the really interesting one is the Custom Speech Service, CRIS. Custom how so? CRIS lets you adapt the speech recognition model to your specific scenario.

Speaker 1

12:03

What does that mean?

Speaker 2

12:04

You can upload your own audio data and accurate transcripts. If your app will be used in a noisy factory or involves lots of specific jargon or product names, you can train a model that's much better at understanding that specific audio environment and vocabulary.

Speaker 1

12:18

Ah, so you tailor it to overcome background noise or specialized language. Exactly.

Speaker 2

12:23

It can make a huge difference in accuracy for specific use cases compared to a general purpose model.

Speaker 1

12:28

And what about recognizing who is talking?

Speaker 2

12:30

That's speaker recognition. Two main types. Verification confirms if a voice matches a known person, like voice login. It usually needs enrollment, where the person says specific phrases.

Speaker 1

12:41

Okay, one to one matching.

Speaker 2

12:42

And identification, which tries to figure out which speaker from a pre-enrolled group is the one talking. Useful for transcription that notes who said...

Speaker 1

12:51

...what. And for the other way, text to speech, making the computer talk naturally? Yeah...

Speaker 2

12:55

TTS takes text and generates audio. The source mentions SSML, Speech Synthesis Markup Language. What's that for? It lets you control how the text is spoken: things like emphasis, pitch, speaking rate, pauses, even pronunciation of specific words. It helps make the synthesized voice sound less...

Speaker 1

13:14

...robotic. Speech to text feels like it's gotten way better recently.
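
For a sense of what the SSML mentioned above looks like, here is a short hand-written fragment using standard elements (`emphasis`, `break`, `prosody`), checked for well-formedness with Python's XML parser. The sentence content and attribute values are made up for illustration, and the snippet only validates the markup; it does not call any speech service.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML: emphasis on one word, a half-second pause,
# then a slower, lower-pitched sentence.
ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Your order is <emphasis level="strong">confirmed</emphasis>.
  <break time="500ms"/>
  <prosody rate="slow" pitch="low">It should arrive on Tuesday.</prosody>
</speak>"""

# Parsing raises ParseError if the markup is not well-formed XML.
root = ET.fromstring(ssml)

# Strip the namespace prefix to list the control elements that were used.
controls = [child.tag.split("}")[-1] for child in root]
print(controls)
```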

Speaker 2

13:17

It really has, largely thanks to deep learning models, but accuracy is still a challenge, especially in noisy places or with strong accents. The source notes that even with claims of low error rates, like Google's four point nine percent, that's often in ideal conditions.

Speaker 1

13:32

Which is why that custom speech service is valuable for bridging the gap in real world scenarios.

Speaker 2

13:36

Precisely.

Speaker 1

13:37

Okay, shifting focus again, let's talk search and recommendations making information findable.

Speaker 2

13:42

Right, we have explicit search. You type a query, but AI enables implicit search.

Speaker 1

13:48

Where the system anticipates what you need.

Speaker 2

13:50

Yeah, like Amazon showing "customers who bought this also bought" or related items. It's proactive.

Speaker 1

13:56

The source mentioned the three Ps of search.

Speaker 2

13:58

Right: pervasive, search everywhere; predictive, anticipating needs; proactive, giving answers before you ask.

Speaker 1

14:05

That's the ideal, and Microsoft has Bing APIs for web, image, and news search. Let's focus on recommendations, though. How do those work?

Speaker 2

14:15

The main goal is usually to increase sales or engagement by suggesting relevant things.

Speaker 1

14:20

What kinds are there?

Speaker 2

14:20

Frequently bought together, FBT, is common: items often bought in the same transaction, like a camera and a memory card. Makes sense. Then item-to-item, which is a type of collaborative filtering. It suggests items based on what other similar users liked. "People who viewed this also viewed."
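
The frequently-bought-together idea can be illustrated with plain co-occurrence counting over transactions. This is a deliberately simple sketch under stated assumptions: the baskets are invented, the function name is hypothetical, and raw pair counts are a stand-in for whatever the recommendation service actually computes.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions; each set is the items in one purchase.
transactions = [
    {"camera", "memory card", "tripod"},
    {"camera", "memory card"},
    {"camera", "camera bag"},
    {"memory card", "card reader"},
]

# Count how often each unordered pair of items shares a basket.
pair_counts = Counter()
for basket in transactions:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1


def bought_together(item, top_n=2):
    """Return up to top_n items most often bought alongside `item`."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == item:
            scores[b] += count
        elif b == item:
            scores[a] += count
    return [other for other, _ in scores.most_common(top_n)]


print(bought_together("camera", top_n=1))
```

Raw counts favor popular items, so a production system would usually normalize by how often each item sells on its own.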

Speaker 1

14:36

Based on collective behavior.

Speaker 2

14:38

And user to item, which is more personalized. It looks at your past history, views, purchases to recommend things specifically for you.

Speaker 1

14:44

How do you build these using the Microsoft service?

Speaker 2

14:47

You need data, two main types. Catalog data, which is info about your items, products, articles, whatever, including features. And usage data, records of user interactions like clicks, purchases, ratings. So you feed it...

Speaker 1

15:01

...your product list and how people...

Speaker 2

15:02

...interact with it. Exactly. Then you train different recommendation models, called builds, on that data. There are specific builds for FBT and others, like SAR, that handle item-to-item and user-to-item. And...

Speaker 1

15:12

...the quality of recommendations depends heavily on that input data.

Speaker 2

15:15

Absolutely, good data, good quantity leads to better recommendations.

Speaker 1

15:19

The source also mentioned ranking and offline evaluation.

Speaker 2

15:22

Ranking is crucial: how do you order the results? It's often based on relevance scores derived from usage data and item features. Offline evaluation lets you test your trained models on historical data before deploying them, to see which build performs best.

Speaker 1

15:36

Okay, so we've covered a lot of ground, from AI history to interfaces, NLU with LUIS, text analysis, speech tech, search, and recommendations.

Speaker 2

15:47

It's quite a journey, but the key takeaway, I think, is how these advanced AI capabilities are becoming accessible. Right.

Speaker 1

15:53

Things that needed huge research teams are now available as APIs, like Cognitive Services.

Speaker 2

15:59

Especially for developers already in the .NET world. It lowers the barrier significantly to adding intelligence.

Speaker 1

16:05

It makes you really think about where AI is already working behind the scenes in...

Speaker 2

16:08

...your life, or how these tools could reshape industries. Think about customer service, retail, healthcare.

Speaker 1

16:14

Definitely, yeah. And that brings us to the future. The source touches on this idea of AI-first organizations.

Speaker 2

16:22

Yeah, companies embedding AI into their core strategy, their products, how their people work.

Speaker 1

16:27

And it addresses the big question about jobs.

Speaker 2

16:29

The perspective offered is interesting: tasks, not jobs, will be eliminated. The focus shifts to how human roles will change and work alongside AI.

Speaker 1

16:38

Augmented intelligence, not just artificial intelligence replacing humans. Exactly.

Speaker 2

16:43

Combining the strengths of both. Humans and machines working together can achieve way more than either could alone.

Speaker 1

16:49

It's a vision of AI becoming woven into everything, cars, factories, shopping, daily life.

Speaker 2

16:55

A fundamental transformation.

Speaker 1

16:56

So to wrap up, we've seen how AI has evolved, and how tools like Cognitive Services make it practical for developers, especially with .NET, to build apps that understand language, speech, and user needs through search and recommendations.

Speaker 2

17:10

It really puts powerful capabilities within reach.

Speaker 1

17:13

And thinking about that future, that augmented intelligence idea where tasks change and humans partner with AI, here's a final thought for you to consider. If AI is set to transform our tasks and merge with our capabilities, what completely new roles, new kinds of expertise, or maybe even entirely new opportunities might emerge from this human-machine partnership in the coming years? Things we perhaps can't even quite imagine...

Speaker 2

17:37

...today. Something definitely worth pondering.

Transcript source: Provided by creator in RSS feed

Artificial Intelligence for .NET: Speech, Language, and Search: Building Smart Applications with Microsoft Cognitive Services APIs
