All Things Voice Recognition and JavaScript with Ian Lavery - RRU 280 | React Round Up podcast

Speaker 1

00:00

Welcome to React Round Up, the podcast where we keep you updated on all things React related. This show is brought to you by Unvoid and Top End Devs. Unvoid provides high-quality design and software development services on a client-friendly business model. Unlike all other software agencies, Unvoid allows clients to only pay after the work is delivered and approved. Visit unvoid dot com to learn more and reach out.

00:28

If you know a company that needs more professionals to help with design and software development, that's U-N-V-O-I-D dot com. And Top End Devs helps you stay up to date with cutting-edge technologies like JavaScript, Ruby, Elixir, and AI. Visit topenddevs dot com to join their AI dev boot camp and weekly community meetups and to access expert tutorials. I'm Lucas Paganini, founder of Unvoid and host of this podcast. Thank you for tuning in. Let's jump into the episode.

Speaker 2

01:07

Hey everybody, and welcome to another episode of React Round Up. I am your host today, TJ VanToll, and with me on the panel I have Paige Niedringhaus. [Paige: Hey everyone.] And our special guest today is actually a React Round Up returning champion. We have Ian Lavery here. Ian, welcome back to the show.

Speaker 3

01:24

Hey, thanks for having me back.

Speaker 2

01:25

Yeah, so why don't you start. For people listening, I think, looking back, your last show was about a year ago. We'll have to look up the episode number and toss it in the show notes. But it's been a while, so why don't you let people know who you are, what you do, your background, why you're famous, all those sorts of things.

Speaker 3

01:41

Yeah. So I work for a speech recognition company called Picovoice, and we're a developer-focused company that tries to empower developers all over, on any platform, to bring voice to their platform. So we have a whole variety of different products that cover speech-to-text, voice activation, wake word, all that, and we just want everybody to have a voice on their platform. Besides that, I do interactive media art, and I play bass in a couple of bands.

Speaker 4

02:16

That's awesome, not just one band but multiple.

Speaker 3

02:20

Yeah, I'm an overachiever, I guess.

Speaker 2

02:22

Well, cool. So Picovoice looks interesting. I remember us talking about it last time, but maybe you can give an overview of how it works. Like, if I use Picovoice, what am I getting? Am I getting a service that I can send audio to, and it comes back with the words? What other features are there? Maybe you could give us the rundown of everything it does, everything you do.

Speaker 3

02:44

Yeah. So the big thing with us, and the thing that sets us apart from pretty much every other voice service, is that we're entirely on device, and so there is no service. There's no cloud API that you're calling to send your audio to, which, I mean, look around. That's pretty much every

03:05

single voice thing: it's just an API. So we're one of the only ones out there that is actually giving you the ability to hold on to your audio data and your users' audio data, process it on the device, and return the result. Again, we have a variety of products. So we have wake word detection, where it's just like "Hey Siri" and "OK Google." All it's doing is sitting there processing frames of audio, waiting for you to say the thing, and then when it wakes up, it

03:34

does the thing that you tell it to do. But we also have voice activity detection, which just basically peaks when it hears somebody talking. And obviously speech-to-text. Everyone wants speech-to-text, so auto transcription of voice.
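
[Editor's sketch] The "processing frames of audio" loop Ian describes could look roughly like the following in TypeScript. The `FrameDetector` interface, the 512-sample frame length used in the usage note, and the loudness-based stand-in detector are all assumptions made for illustration. They are not Picovoice's API, and a real wake word engine runs a trained model on each frame rather than a loudness check.

```typescript
// Hypothetical interface: a detector that inspects one fixed-size frame at a time.
interface FrameDetector {
  frameLength: number;
  process(frame: Int16Array): boolean; // true when the detector "fires" on this frame
}

// Stand-in detector (an assumption, not a wake word model):
// fires when the mean absolute amplitude of the frame exceeds a threshold.
function makeEnergyDetector(frameLength: number, threshold: number): FrameDetector {
  return {
    frameLength,
    process(frame: Int16Array): boolean {
      let sum = 0;
      for (let i = 0; i < frame.length; i++) sum += Math.abs(frame[i]);
      return sum / frame.length > threshold;
    },
  };
}

// Feed 16-bit PCM audio to the detector one whole frame at a time and
// return the indices of the frames on which it fired.
function runDetector(pcm: Int16Array, detector: FrameDetector): number[] {
  const hits: number[] = [];
  const n = detector.frameLength;
  for (let start = 0, index = 0; start + n <= pcm.length; start += n, index++) {
    // subarray creates a view on the same memory, so no audio is copied per frame
    if (detector.process(pcm.subarray(start, start + n))) hits.push(index);
  }
  return hits;
}
```

Calling `runDetector(pcm, makeEnergyDetector(512, 500))` walks the buffer frame by frame, which mirrors the "sit there and wait for the thing" behaviour; in a live application the frames would come from the microphone instead of a pre-filled array.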

Speaker 2

03:49

Yeah, it's very cool. It's also one of those problems that I feel like is becoming more commonplace. We have smart devices in our house. Our phones can listen to wake words and that sort of thing. But I'm still sort of fascinated by the underlying technology. Maybe you could give us the world's simplest rundown of, like, how does it actually

04:09

work on the back end? Like, do you just have a whole bunch of low-level C code that looks for patterns in audio data, or, I don't know. We don't need, like, two hours, but I'm just...

Speaker 3

04:20

No, it's a good question. So, I mean, basically, it's deep learning, right? It's machine learning. So through machine learning, we teach a machine a statistical model of what a word sounds like, or what

04:32

a series of sounds sounds like. So we basically take audio in. When we're teaching our machine, all we're doing is sending it frames of audio that are labeled, and we get it to remember them and form a little statistical pattern. And then for something like wake word, it's just like, hey, remember this pattern of three things. Just remember that and say, hey, I think I saw it.

04:57

So it's a lot more complicated when you get into speech to text because not only are you teaching it every sound in the language, but you're also teaching it every word in the language, because then you're dealing with audio and writing, which are different things. I think people think language is a combination of those things, but really

05:19

they're two entirely separate things. There's the series of sounds you make with your mouth that other people understand, and then there's the symbols you write them down with, and the grammar and punctuation and everything that you put into the written form, and they're different, so we actually have to treat them differently. But you'll see

05:39

a lot of the big cloud providers out there. The reason they got it so right so fast is because they had such large machines in the cloud in order to do this, so it sort of outpaced the actual progress of voice recognition, and now everything's kind of caught up and we can actually do it on device, which is a big win because, to be honest, we were boiling the ocean for a while doing speech-to-text, and now we can do it on a microcontroller.

Speaker 4

06:10

So if you're using something like Picovoice, is it something where you as a user have to train the models, or are the models already there? It's trained, it knows you're speaking English or it knows you're speaking Spanish, and it should be smart enough to be able to take that audio and translate it into the correct written words?

Speaker 3

06:34

Right. So, for speech-to-text, for instance, we basically just have a general language model. We offer eight different languages, and you just give it the language you want and it will understand that language. But we actually use this thing called transfer learning, and we have a website, Picovoice Console, where you can basically... we have sort of a general model, but then you

06:59

sort of do train it yourself. Because for something like wake word, we have a model that understands a bunch of sounds in whatever language you give it, but then you want it to represent a certain series of sounds, like "OK Google." So you literally type that into our console and hit train, and then it will pop out a model

07:18

that understands that. So when we say you train it, it's not like, oh, you have to go out and gather four thousand recordings of this word and, you know, submit it to something and watch statistics go and decide. No, no. We did the hard work.

Speaker 2

07:38

I was gonna say, because by saying that, you're sort of implying that you went out and got four thousand recordings of these different words, right? Or like...

Speaker 3

07:47

No, no. So the thing is, again, we've trained the general model, so it understands the sounds we need it to understand. You just tell us which sounds you want to form your wake word, and we pop out a model that just waits for that series of sounds.

Speaker 2

08:02

Interesting, because I would have guessed that your building of the model was to get a bunch of people to say, like... it kind of breaks my mind a little bit that it's possible, right, that you can sort of generalize.

Speaker 3

08:14

That's the old style. So I worked at a speech recognition company right out of college. We had one of the early wake word engines, and it was all B2B, the company. We'd basically enter a contract with a company that says, hey, we're going to go out and gather four thousand recordings of this wake word, and we're going to train it and

08:42

then deliver you the model. And it was very formal, and that was basically state of the art at the time. But we're actually a bit past that now because we're able to use this concept of transfer learning to take a general model and just kind of point it in the right direction. So we no longer need to do all that pounding the pavement, asking for people to say a wake word, because that was a lot of work and it took months, like, every time somebody signed

09:11

a contract. And I know because I was running the crowdsourcing technology for that company. So I would have to post these jobs, and these people would record it on their mobile device, and I'd have to go through all the recordings, and, you know, some people would just speak their manifesto into the phone, and I'd be like, no, no, no, no.

Speaker 4

09:37

So one thing that I'm curious about is, I'm assuming that when you would do these wake word gatherings, you would have to take into account accents, because I know that is something that every automated assistant struggles with. There's English accents, Scottish accents, Caribbean accents, all speaking English, but all slightly differently. So is Picovoice able to account for that and be able to interpret, you know, a deep Southern accent versus maybe a New York or Boston accent?

Speaker 3

10:11

Yeah. So, I mean, that's still a challenge for us. But I think the reason we're a bit more resilient to it is because we've trained this general model on, geez, like, ten, a hundred thousand hours of speech. It's heard all the accents. Not all the accents, but it's heard a lot of variation, so it

10:34

tends to be a bit more resilient. When I was doing the old style where we would get people to record, that was actually a lot less resilient to it because we only had, you know, three hundred participants recording these wake words, and how much variety are you going to get between three hundred people? Like, not enough. But when we train these general models, we have tens of thousands of different speakers, maybe more, so we tend

11:01

to be a lot more sensitive to the variations. But it is definitely a challenge, because even us as humans, if you hear a really thick accent that you're not used to, it can be confusing. We're not perfect either with it. So it's a challenge.

Speaker 2

11:20

So I think you added multiple language support. I believe that's new, or at least newish, from the last time we talked. So does that generalizability make that easier, or... I imagine there's still all sorts of challenges that go into that.

Speaker 3

11:38

Yeah. So when you actually work with a totally different language, that's basically starting over, because accents is one thing: you've already taught it the series of sounds in the language, and you're just looking for a combination of those sounds and those symbols. But when you move into a new language, there's a new set of symbols, and there's a new set of sounds. You know, there's

12:00

everybody has an inventory. We call it a phonemic inventory, and it's basically a series of sounds that you hear in the language, and every language has a different phonemic inventory, and we need to train the machine to understand only that inventory of sounds and all the symbols that go into that. So when we start a new language, we

12:21

have to do it completely from scratch. We have to get new data in that language, we need to get new text in that language, and we need to do our best to even understand the language enough to work with it, because we need to listen to these recordings. We need to normalize the text we get and make sure it's not full of symbols and stuff, but understand it enough so that we actually don't confuse the machine learning process, and that could be a real challenge. It's a lot of work.

Speaker 2

12:55

Actually, so it's fascinating. Does that mean, when you kick off a new language, I feel like you almost need to have a professional linguist on staff for almost each of these languages, right? Or do you bring on somebody who's, you know, a world-class, I don't know, Spanish linguist to help? Or how much of it are you able, as a software developer, to sort of test on your own, and how much do you have to rely on a

13:22

native speaker as the only person that can actually figure some of these things out.

Speaker 3

13:26

Yeah, so we do have, basically, our machine learning team. They do have to be part linguist, because if you've studied languages, you at least understand the components, and basically every language is just a combination of the components. So they have a lot of expertise in that field to understand, when they approach a new language, how it works. But then that's not enough. So usually what we'll do is we will get a native speaker.

13:59

Usually we'll basically hire somebody on a contract to work with us to help with the language, because you do need that expertise. The fact is, even somebody who's a language expert, if they sit down to an entirely new language, they're not going to be able to understand it enough to do the work that needs to be done to actually get it to a production-ready state. So we often do need to get a native speaker in there to provide their input, and that will really

14:31

speed the process along. We tried to do it without experts a couple times, and you just don't get the performance and you spend a lot more time. You waste a lot more time, I should say.

Speaker 4

14:44

Sure. I mean, that makes a lot of sense. When you think about getting expertise in anything else, it will almost undoubtedly go much quicker if you have somebody who is proficient in whatever it is that you're trying to do.

Speaker 3

14:58

Yeah, well, they can recognize mistakes, grammar and stuff, the stuff that's really hard to pick up as a non native speaker.

Speaker 4

15:04

Yes. So what languages do you currently offer Picovoice for?

Speaker 3

15:09

So we have I believe last year we announced we had Spanish, French, German, English, and then this year we added four new languages. We added Japanese, Korean, Portuguese, and Italian.

Speaker 4

15:26

These are some tough ones.

Speaker 3

15:28

Yeah, well, especially when you get into the written forms of Korean and Japanese, they become very challenging. You know, in English we have twenty-six characters. Japanese has two alphabets of fifty-six and then an additional alphabet of tens of thousands. So yeah, the text representation of that is really difficult. The actual spoken version of Japanese is a lot easier than English because Japanese has fifty-six sounds, and they all

16:05

map to a combination of characters. English mapping a combination of characters to the sound is incredibly difficult. Turns out we made some mistakes early on and we didn't really fix them.

Speaker 4

16:20

I mean, just thinking about the number of spellings that we have for the same-sounding word based on context, I cannot even imagine how you would be able to figure that out for a transcript.

Speaker 3

16:32

And it's all exceptions in English. It's like, oh, yeah, it's this unless this or this unless this, and like here's three different reasons why this rule is wrong.

Speaker 2

16:41

And yeah, you see this when you have younger kids that are starting to write, and you look at their writing, because they don't know the exceptions yet, right, but they can speak it. And it's words you don't even think about too, because we internalize them so quickly. One of my kids spelled "because" wrong, and then you're like, oh, well,

17:02

"because," it's pretty easy. But then you think about it for like half a second and you realize, actually, the word "because" makes absolutely no sense, right?

Speaker 3

17:09

Like, if you try and explain it, you suddenly find yourself going, it just is what it is. Yes, just memorize it. Yep.

Speaker 4

17:19

I mean, that's really fantastic that you have taken on, and it sounds like gotten through, some very difficult dialects. What are future languages that you hope to be able to process as well?

Speaker 3

17:32

So next year we're going to double our language count again, I think, and we're going to do Chinese, Vietnamese, what else? Dutch, I believe, Russian, Polish, I think. Yeah, I can't remember all of them. But to be a fully inclusive speech recognition company, you basically need a bare minimum of, like, fifty languages. So we're going to get to, like, twenty of

18:04

the most popular and hold there for a while. That's kind of our plan, because that covers a lot of people. That covers the majority of people, because even in the cases where people might not speak the language, they usually are like, oh, but I speak this more popular language. But to really get up there, I mean, you do need to get to fifty or something. And I mean, Google has like one hundred and fifty, so, you know, it's kind of a never-ending thing for us.

Speaker 4

18:37

How about Hindi? That's a big one.

Speaker 3

18:39

Oh yeah, that's actually one of the other ones we're going to do next year.

Speaker 2

18:43

Yeah, so I guess I've got to ask one last question. Are there any languages you've come to hate, like, because it was very difficult or...

Speaker 3

18:56

It's funny how much you can hate your own language. No, actually, seriously, English is the only one. I look at all the other languages we've done, and I'm like, these are so much easier. English is, actually, it's just, it

19:11

came out of a mess of languages. It was a lot of combinations that happened over time, and, you know, a lot of English developed during illiteracy, and so there's really interesting examples you can find of stuff where it's just like, oh, yeah, this was just a mistake that happened, you know, two hundred years ago that they kept in. Or actually, I have a fun fact: the word "dumb."

19:37

So you look at that, and you're like, why does it have a B at the end? Apparently there was a time where the ruling class of England was trying to make it harder to write English so that the peasantry couldn't pick it up. And they literally just added some letters to the language here or there, and were like, this is the proper way to write it, just to confuse people. And we literally still have that to this day. So English is so weird.

Speaker 4

20:06

So that's why knife has a K in front of it.

Speaker 3

20:09

Yeah, yeah, stuff like that. I think they were just messing with us, and now we just have to live with that.

Speaker 2

20:17

So I want to pivot a little bit and talk about the actual web development, like the side where you might actually use a service like this, because I remember last time we chatted a little bit too about common use cases, right? So maybe we could just start with a review. We have a lot of web developers who listen to this show. I guess, A, what would using something like this look like? Like, how do you actually get it in an app?

20:41

And B I guess like, what are some common use cases that you see for use on the web as well?

Speaker 3

20:46

Right. So one of the big things is, obviously, on the web, people are a lot more comfortable calling an API, and that is what they've come to expect for speech recognition and stuff. But we're actually kind of bringing back the power of the browser itself. I mean, the browser is a virtual environment that can run whatever you want. And we actually can run entirely in the browser on the

21:12

client side. And that's big because, I mean, these days we're getting a lot of progressive web apps, and the sort of web app is a big thing,

21:24

especially with SaaS companies and stuff. So if you're running a SaaS company and you want to integrate voice into your console or something, having it on the client side, I mean, it lowers the latency, it gives you a lot more direct control of what happens when you get voice, and it means you can be robust to connection issues, which, like,

21:50

you know, that's a huge thing. Not everyone has amazing Internet, and you don't want to have to be making calls out to an API and just hoping it comes back for your feature to work. This will just work. And also, on top of all that, it's less expensive because we're not calling an API. We're not depending on cloud infrastructure. So if you're a developer and you integrate Picovoice into your web app, your client is going to be using their

22:19

machine to do the processing. So I think it's just a win win situation for that. Yeah.

Speaker 2

22:26

I feel it's especially important considering it's audio too. So with bandwidth, you're not just shipping off a couple of things in a query string to some service. You're, like, uploading...

Speaker 3

22:37

Audio, mega audio.

Speaker 2

22:40

Yeah, so the bandwidth consideration is amplified significantly.

Speaker 3

22:46

Yeah, no, and that actually allows for something like real-time audio. Real-time audio is very challenging to do for an API because you basically need to stream it to the service and have responses being streamed back. That's really expensive. That's like a constant bandwidth issue. But when you're doing real-time audio and it's all running in your browser on the client side, it's snappy and you can do things that require timing and...

Speaker 2

23:16

Yeah, very cool. And I think I remember from last time too that one of the ways you keep it snappy is it's not JavaScript code running in the browser, right? I don't remember your exact tech stack, but I know you have some sort of fancy way of doing that. Maybe you could walk people through some of the magic and challenges of how that works.

Speaker 3

23:35

Yeah, so our core code is in C, because we were trying to keep it as efficient and snappy as possible. Now, when you think of C code next to React, you're like, how does this even work? Like, can these two ever talk? But it turns out they can, with WASM. And what we do is we compile basically all our core code into a WASM binary, and then we ship that with our npm package. So when you npm install Picovoice, part of what's going to be shipped

24:10

with your website is our WASM blob. And WASM's really cool because it basically just wraps your native code in JavaScript and then allows you to attach to it like any sort of dynamic library: say, here's the functions I want to call, here's the data I'm going to put into it, and then you just call it like

24:37

you would any other library. It's a little trickier to work with because, I mean, one of the things we're pretty aware of, and I'm sure the listeners of your show are aware of, is JavaScript is like, eh, types, whatever. Even TypeScript is like, yeah, types, but, you know, a number is a number, right? Well, C is like, what? How many bits is your number? Like, it needs to know.

25:06

So you start to need to think about that. When you work with WASM, you start to need to think about, okay, is this a thirty-two-bit int going in here? And you need to start to think of memory, like, okay, I need to pass in a pointer here to get something back and then convert that pointer to a JavaScript object of some sort.

25:30

So it's challenging to work with, but once you get it working, it's extremely powerful, because then we can ship something that's an incredibly complex piece of code and just put basically a slim interface of JavaScript around it, and then any JavaScript developer can just call it, just talk to it like it's JavaScript. They don't need to worry about

25:54

the WASM. That was our problem. Yeah, so it's challenging to work with, but if anybody has a challenging problem that requires the efficiency of C, don't be afraid of it. It's not that hard, and it is pretty amazing when you start working with it.
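
[Editor's sketch] The bit-width and pointer bookkeeping Ian mentions can be illustrated without a real module. In the TypeScript below, a plain `ArrayBuffer` stands in for a WASM module's linear memory (in real code that would be the `buffer` of a `WebAssembly.Memory`), and `alloc` is a hypothetical bump allocator standing in for whatever allocation function a compiled C library exports. None of these names come from Picovoice's code.

```typescript
// Stand-in for a WASM module's linear memory: one 64 KiB page of raw bytes.
const linearMemory = new ArrayBuffer(64 * 1024);

// Hypothetical bump allocator: returns a byte offset into linear memory,
// which is all a "pointer" is from the JavaScript side.
let nextFree = 8;
function alloc(byteLength: number, align: number): number {
  const ptr = Math.ceil(nextFree / align) * align;
  nextFree = ptr + byteLength;
  return ptr;
}

// Copy samples into linear memory as 16-bit integers and return the pointer
// that native code would receive. The typed array decides the bit width.
function writeInt16Samples(samples: number[]): number {
  const bytes = samples.length * Int16Array.BYTES_PER_ELEMENT;
  const ptr = alloc(bytes, Int16Array.BYTES_PER_ELEMENT);
  new Int16Array(linearMemory, ptr, samples.length).set(samples);
  return ptr;
}

// Read a 32-bit signed integer result back out of linear memory.
function readInt32(ptr: number): number {
  return new DataView(linearMemory).getInt32(ptr, true); // WASM memory is little-endian
}
```

A JavaScript number such as 40000 does not fit in a signed 16-bit slot, so writing it through `writeInt16Samples` stores -25536 instead. That silent wraparound is the kind of "how many bits is your number?" question that a thin JavaScript wrapper has to answer on the caller's behalf.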

Speaker 4

26:12

Okay, so there is an npm package if you want to use JavaScript with it. But what if you are a Python developer, or maybe you're working with a microcontroller like Arduino? Are there options for other languages like that?

Speaker 3

26:32

Yeah. So, I mean, since we're a developer-focused company, we're pretty obsessed with our SDKs. So I think for our two most popular products, we have like twenty SDKs for each one, and it covers all the favorites. And we even have, you know, we have three... no, we have four different web SDKs. We have vanilla JavaScript, but we also have React, Angular, and Vue.

27:03

So we basically wanted it to be like, use it in your favorite environment. Yeah, use it like you use anything else in your stack. We don't want to disturb that, basically.

Speaker 4

27:17

Right, that's awesome. So what are some of the use cases that you've seen people employing it for recently?

Speaker 3

27:25

So we've actually come to some interesting ones lately. Auto content moderation is a big one right now. So let's say you're Minecraft or something, or, I guess, let's go Fortnite, and you have open audio streams for hundreds of thousands of players, and you're

27:48

trying to moderate all that. That's very difficult. And it turns out a lot of big companies out there are using auto moderation, which basically takes that audio and looks for key phrases, let's call them, and it's just looking to flag them, and then they'll usually have, you know, a person go in and inspect the actual content of it and decide whether it was a mistake or whether it is actually a bannable offense. Yeah, so that's actually a

28:27

new, exciting one. Also call centers, it turns out. Again, you've got open phone lines, like a whole building full of them, and you're trying to understand, you know, what's being said on all these different calls, and you can't have people listening to all that audio. So a lot of big call center companies need some sort of automated system to take in all the audio from all their phones and do something with it. So we're encountering more use cases like that lately.
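
[Editor's sketch] The flag-then-review flow described for moderation and call centers might be sketched as below, operating on text that a speech-to-text engine has already produced. The `Utterance` and `ReviewFlag` shapes and the simple substring matching are assumptions for illustration; production systems presumably match far more carefully, and the final decision still goes to a person.

```typescript
// One transcribed chunk of speech, attributed to a speaker (hypothetical shape).
interface Utterance {
  speakerId: string;
  text: string;
}

// A match queued for a human moderator to inspect (hypothetical shape).
interface ReviewFlag {
  speakerId: string;
  phrase: string;
  text: string;
}

// Scan transcripts for any of the key phrases, case-insensitively, and
// collect every match for later human review. Nothing is banned automatically.
function flagForReview(utterances: Utterance[], keyPhrases: string[]): ReviewFlag[] {
  const flags: ReviewFlag[] = [];
  for (const utterance of utterances) {
    const normalized = utterance.text.toLowerCase();
    for (const phrase of keyPhrases) {
      if (normalized.indexOf(phrase.toLowerCase()) !== -1) {
        flags.push({ speakerId: utterance.speakerId, phrase, text: utterance.text });
      }
    }
  }
  return flags;
}
```

Keeping the original text on each flag lets the reviewer see the context of the match, which is what makes the "was it a mistake or a bannable offense" decision possible.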

Speaker 2

28:57

Actually, those are both really fascinating. It's funny. The content moderation one really resonated with me because, I don't know, my kids are eleven, so they're right at that impressionable age, but they're also right at the age where they want to play games that are the sort where they have open audio. So there's a game we play that's like five v five, so five people on each team, and it has a way

29:21

for you to do audio communication. And the very first thing I did was make sure to shut that off, like disable it, because I'm a professional Internet user, and the first thing you learn is, I don't trust anybody. I wouldn't even want to hear it myself, much less my kids, though.

Speaker 3

29:38

I know. It brings me back to when I was, you know, eleven or twelve. The Internet was a new, exciting thing, and I remember I was really into going to different video game

29:53

websites and stuff. And then there were these chat rooms about video games you could go to, and it was literally just a room where everybody's microphones are on and you just start talking, and when I think of that now, I'm like,

30:08

oh my god, that's frightening. Yeah. But I mean, the fact is, we could keep those spaces safe with these sorts of tools, because then the ne'er-do-wells out there at least get banned when they're being inappropriate or whatever.

Speaker 4

30:28

Oh god. Well, one thing that you put in the show notes today that I would really like to hear more about is a new speech-to-text engine, or engines: Cheetah and Leopard. So maybe you could tell us a little bit more about those.

Speaker 3

30:41

Yeah, so I think last time we spoke, we actually didn't have a publicly available speech-to-text engine, and we were using our speech-to-intent engine, which was called Rhino. Yeah, the founder of the company is pretty obsessed with animals. So with Rhino, basically, you teach it a small grammar and then it would understand that grammar, which is great for stuff like, you know, controlling a coffee maker or, like,

31:12

you know, there's only so many functions that it needs to understand. But we decided to kind of go that extra mile and bring speech-to-text to devices. And traditionally language models are in the gigabyte realm of size, and we actually got ours down to twenty megabytes per language, and that's sort of the big win for this: we can run on anything that can take twenty megabytes of memory or of storage. And so Leopard and Cheetah

31:46

are actually two different sides of the same coin. So Leopard is a speech-to-text engine that takes in a set amount of audio, so like an audio file or something, and gives you a transcript of that, and that's an easier problem because you can basically say, okay, this is all the audio I'm going to get, so I'm going to look forward, I'm going to look back, I'm going to make inferences based on the

32:10

future and the past and give you a response. But then Cheetah, of course, because it's the fast one, it's real time, so it has zero look-ahead, which means it will take in every frame of audio that you give it and it will return what it thinks is being said. So they're both speech-to-text engines, but they just work for different use cases. So, I mean, with audio files, the accuracy is much better, but of course you sacrifice the sort of real-time effect.
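
[Editor's sketch] The look-ahead trade-off can be shown with a toy numerical analogy. The TypeScript below is not speech recognition and not how Leopard or Cheetah work internally: it just smooths a list of per-frame scores, once with access to past and future frames (batch) and once with zero look-ahead (streaming). All names are made up for illustration.

```typescript
// Batch style: the whole recording is available up front, so each output
// can average over frames both before and after the current one.
function smoothBatch(scores: number[], radius: number): number[] {
  return scores.map((_, i) => {
    const lo = Math.max(0, i - radius);
    const hi = Math.min(scores.length - 1, i + radius);
    let sum = 0;
    for (let j = lo; j <= hi; j++) sum += scores[j];
    return sum / (hi - lo + 1);
  });
}

// Streaming style: zero look-ahead. Each output is produced as soon as its
// frame arrives, using only that frame and the ones that came before it.
class StreamingSmoother {
  private history: number[] = [];
  private radius: number;

  constructor(radius: number) {
    this.radius = radius;
  }

  process(score: number): number {
    this.history.push(score);
    if (this.history.length > this.radius + 1) this.history.shift();
    let sum = 0;
    for (const s of this.history) sum += s;
    return sum / this.history.length;
  }
}
```

With the input `[0, 0, 3, 0, 0]` and a radius of 1, the batch version already reacts at index 1 because it can see the spike coming, while the streaming version cannot respond until the spike frame itself arrives. The batch version has more context per decision; the streaming version answers immediately.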

Speaker 2

32:38

Yeah, so twenty megs is impressive, but is that still small enough for a browser to use? Like, does a user have to download that to use it in their web app?

Speaker 3

32:47

So that was a challenge. We recently did the web SDKs for Cheetah and Leopard, and we actually had to kind of redesign our whole system of delivering

32:58

language to the browser to handle this. So, yes, twenty megabytes is a lot, but we actually separate the language model from the package, so basically we let the developer decide how that's delivered to the user. But we also made it part of our system that it could be either a base64 representation that you can bake into your website, if you just want it to always be there, or, if you want to be kind of smarter about it, what you can do is put it

33:26

in your public folder and have it downloaded to the user's browser on first load, and then cached in local storage for the rest of the time. So the very first load, yeah, it'll be a twenty megabyte load, but the second load will be instant because they already have the language model.
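A minimal sketch of the two delivery options just described: decoding a model baked in as base64, or downloading it once and serving it from a cache afterwards. The names `ModelCache`, `Fetcher`, `loadModel`, and `decodeBase64Model` are hypothetical and not part of any SDK. The storage backend is left abstract on purpose: browsers typically cap `localStorage` at around 5 MB per origin, so a file this size would more plausibly sit in something like IndexedDB or the Cache API, which is an assumption on my part rather than something stated in the episode.

```typescript
// Minimal async key-value cache; in a browser this could wrap IndexedDB or the Cache API.
interface ModelCache {
  get(key: string): Promise<Uint8Array | undefined>;
  set(key: string, value: Uint8Array): Promise<void>;
}

// Anything that can turn a URL into raw bytes.
type Fetcher = (url: string) => Promise<Uint8Array>;

// Option 1: the model is baked into the bundle as a base64 string.
function decodeBase64Model(base64: string): Uint8Array {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Option 2: download on first use, then serve from the cache on later loads.
async function loadModel(
  url: string,
  cache: ModelCache,
  fetcher: Fetcher
): Promise<Uint8Array> {
  const cached = await cache.get(url);
  if (cached !== undefined) return cached; // second load: no network needed
  const bytes = await fetcher(url); // first load: pay the download once
  await cache.set(url, bytes);
  return bytes;
}

// A default fetcher built on the standard fetch API.
const fetchBytes: Fetcher = async (url) => {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`Model download failed: ${response.status}`);
  return new Uint8Array(await response.arrayBuffer());
};
```

Passing the cache and fetcher in as parameters keeps the caching logic independent of where the bytes are stored, so the same function can be exercised with an in-memory map.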

Speaker 2

33:44

It's a pretty neat system, because I think it's the nature of the beast. I mean, in a way, it's kind of more of a native app feature, and native apps are downloading like gigs of stuff at times, and so it's a feature that helps the web sort of compete with that.

34:01

So I think it makes sense, and honestly, I think this solution is kind of clever, because that's kind of all you can do. You can't magically get it to the user ahead of time, like through an app store or something.

Speaker 3

34:15

And a developer, if they want to be clever about it, they can stream it from their public folder asynchronously on the first load, so that by the time the user wants to activate the voice feature, it's already downloaded. You know, it's just the sort of thing you need to handle in these sorts of ways. Because, you know, we were working with a company recently that they do

34:38

this all the time in their mobile apps. Their mobile app actually downloads stuff all the time to keep their app working, and it does it all asynchronously, like when you open up the app, and, you know, the user's none the wiser, but behind the scenes there's all this stuff. So when you look up, why is this app using three point six gigabytes when I downloaded it and it was only five hundred megabytes, that's because they only delivered like the core code and the rest of

35:06

it was downloaded later on. Yeah, so it's just how you do stuff now: keep the package sizes small, but then just deliver the features kind of as they're being used.
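A minimal sketch of that "start the download early, wait for it only when needed" idea. The helper name `prefetch`, the stand-in `downloadModel`, and the file path in the usage comment are all hypothetical.

```typescript
// Start a download immediately, but only wait for it when the result is needed.
function prefetch<T>(load: () => Promise<T>): () => Promise<T> {
  // Kick off the work right away (for example, on page load).
  const pending = load();
  // Avoid an unhandled-rejection warning if the download fails before anyone awaits it;
  // callers that await the returned promise still see the original error.
  pending.catch(() => undefined);
  // Every caller shares the same in-flight (or already finished) download.
  return () => pending;
}

// Hypothetical usage; `downloadModel` stands in for whatever fetches the model file:
//
//   const getModel = prefetch(() => downloadModel("/models/speech_model.bin")); // on page load
//   micButton.onclick = async () => startVoice(await getModel());               // much later
```

If the user clicks before the download finishes, they wait only for the remainder; if they click afterwards, the model is available immediately.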

Speaker 2

35:19

Yeah, I know iOS and Android even have APIs built in to help you do that sort of thing because it's such a common model.

Speaker 3

35:27

Yeah, I think all the big companies want that. You know, if you're Spotify, you've got to have the features. You don't want people to see three point six gigabytes when they go to download your app. There's like a sticker shock thing that happens. So it's kind of a funny thing, because it ends up being like when you book an Airbnb and there's all these extra expenses that get reported later, or a flight where you get like the info later.

35:57

It's sort of like that. It's like reduce the sticker shock and then we will show you the expenses after.

Speaker 2

36:03

So you also have an article in here about writing a podcast transcription server, which I'm struggling to pronounce for some reason, which is a fascinating idea. Like, I know when we were talking before the show too, we've done transcriptions and videos. I'm sure there's other people, call centers is another example, right, with things that you want to transcribe. So does that use Cheetah or Leopard, or how does that work?

Speaker 3

36:29

Yeah, so it uses Leopard, because we actually have the ability to get a whole file, like an hour long podcast, and transcribe it from start to finish. And yeah, the reason I kind of came up with that as an idea to sort of demo our technology is, I've listened to podcasts for years, and it's so often on a long running show, I'm sure on this show, you get the "Hey, have we talked about that? Did we talk about this? I feel like we've talked about this." And having show notes to go

36:59

back to is probably a really helpful thing. Or like I was thinking of doing a next phase of the article where I actually make a podcast searchable. So I made it transcribable and basically stored the text representation. But once you have the text representation, you can make it searchable and then you can start being like, oh, when did I say this? And then it will just

37:21

pop up the episode you said it in. So it was just kind of an idea I came up with because I see a lot of people using Leopard on a server to basically hook into an event that's happening somewhere, whether it be an RSS feed that's updated or, you know, a new audio file or video is uploaded, and it hooks into that event, it runs it through Leopard and then

37:48

stores it in a database. I thought that was kind of a universal use case. It just seems like a fundamental part of the web to have something like that in a server.
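A minimal sketch of the pipeline described here: react to a new episode, transcribe it, store the text, and search it later. Everything named below (`Episode`, `TranscriptStore`, `onNewEpisode`, the `Transcriber` type) is hypothetical; the transcriber is passed in as a plain function so the sketch makes no claims about the real Leopard API, and an in-memory map stands in for the database.

```typescript
interface Episode {
  id: string;
  title: string;
  audioUrl: string;
}

interface TranscriptRecord {
  episodeId: string;
  title: string;
  text: string;
}

// Stand-in for a batch speech-to-text call; an assumption, not a real SDK signature.
type Transcriber = (audioUrl: string) => Promise<string>;

// In-memory stand-in for the database that holds transcripts.
class TranscriptStore {
  private records = new Map<string, TranscriptRecord>();

  has(episodeId: string): boolean {
    return this.records.has(episodeId);
  }

  save(record: TranscriptRecord): void {
    this.records.set(record.episodeId, record);
  }

  // Naive case-insensitive substring search over every stored transcript.
  search(query: string): TranscriptRecord[] {
    const needle = query.trim().toLowerCase();
    if (needle === "") return [];
    return [...this.records.values()].filter((record) =>
      record.text.toLowerCase().includes(needle)
    );
  }
}

// Event handler: call this when the feed (or an upload hook) reports a new episode.
// Returns true if the episode was transcribed, false if it was already stored.
async function onNewEpisode(
  episode: Episode,
  transcribe: Transcriber,
  store: TranscriptStore
): Promise<boolean> {
  if (store.has(episode.id)) return false;
  const text = await transcribe(episode.audioUrl);
  store.save({ episodeId: episode.id, title: episode.title, text });
  return true;
}
```

A real deployment would likely swap the substring scan for a full-text index, but the shape of the flow (event, transcription, storage, query) stays the same.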

Speaker 4

37:57

Yeah, I mean, it'd be so useful, and it would help, I think, everybody, including people who just want to reread part of a podcast if they're looking for something specific, instead of having to just kind of hop through trying to figure out where that useful bit of information was.

Speaker 3

38:16

Well, and you can think too, let's say you attached it to your podcast. You can deliver these transcripts along with your podcast, because if you have the server hook in, transcribe it, and then deliver the transcript along with the podcast, suddenly you've got a follow-along-with-the-transcript podcast. So these sorts of things are useful for auto captioning videos or audio as well, for accessibility.

Speaker 2

38:43

It's accessibility. It's also like marketing. People like it for SEO purposes too, because, you know, Google can index audio, but if you have a transcription, it absolutely can do it even better.

Speaker 3

38:54

Yeah, one hundred percent correct. Yeah, search engines aren't very good at indexing audio, so you just have to plaster the text somewhere.

Speaker 2

39:03

Can you recognize different speakers? Because that's the other thing about a transcript, right, is knowing who's talking. Do you have the ability internally, even if you don't know names obviously, to say, like, this is voice one, voice two?

Speaker 3

39:15

Yeah. So actually we're working right now, I'll give you guys the scoop, on a speaker identification system that will basically be able to tell people apart. Because yeah, when you think of something like a Zoom meeting, if you want to have meeting notes, it would be really useful to have like, this came from this person, this came from this person, this came from this person. I mean, Zoom obviously has the ability

39:44

to know where the audio is coming from, so it can kind of just label it. But if you have an anonymous audio stream with a bunch of different voices, that's challenging because you don't know. You just have to base your assumptions about who's different on the character of the voice. That's actually a problem we're working out right now. And I mean, that's useful for not

40:04

only speaker labeling, but also speaker verification. So if we want to voice activate something that only responds to your voice, that's another use case for it.

Speaker 4

40:14

That would be cool.

Speaker 2

40:15

Well, this has been a blast. Is there anything that you wanted to discuss today that we have not gotten to at all?

Speaker 3

40:24

No, I think we covered a lot here.

Speaker 2

40:26

Yeah, yeah, excellent. So why don't we move into our picks, and Paige, do you want to kick us off?

Speaker 4

40:32

Sure. So my pick is going to continue the trend that I started last week, which was Star Trek. As many of you who have been listening for a while know, I've been on a Star Trek journey through Next Generation and forward. So most recently I've begun watching Star Trek: Lower Decks, which is their animated series, and I think there's maybe two, maybe three seasons of it,

40:58

but it is in the style of Rick and Morty, and it is the funniest Star Trek that I've ever seen, to the point where I'm actually laughing, which is unusual for anything animated, but it is really good. There's a lot of references to other Star Trek franchises, so if you are familiar with Next Generation or Voyager or Enterprise, they throw in all sorts of little jokes that are related to those characters. So I would definitely

41:28

recommend it. It's as family friendly as the rest of the Star Trek franchise is, and it's also got a much bigger dose of humor than most of them do. So if you're looking for something that's quick, twenty, twenty five minute episodes, I would definitely say it's a good one.

Speaker 2

41:46

Excellent. I still have not gotten into the Star Trek world, so at some point. I've had it recommended several times, but I feel like you can't casually wade into it, right, like you kind of have to commit. Yeah, so my pick this week is going to be The Great British Bake Off, which I think was a previous pick of yours, Paige. I started off watching it because I just wanted to know what it was about, right,

42:10

just that sort of thing. And then next thing I knew I had watched a few episodes and I didn't even really understand why.

Speaker 3

42:16

It's so comforting, that show. Like, there's just something so positive and warm about it.

Speaker 2

42:22

It's strangely compelling. Like, I can't even understand why I ended up watching it, but it's quite good. I think Netflix has like five seasons or so. I mean, I don't know how much I'm going to watch, but it's a good thing to just have on when you're not sure what to do. It's just comforting, nice to have on in the background. So I've been sucked into that as well.

Speaker 3

42:42

That's funny. My wife and I literally just started watching that a few weeks ago, and for the same reason, like, why are people talking about this so much? And yeah, now we're shouting at the screen, that's not a good bake. Look at the problem mature.

Speaker 2

42:57

Yeah, excellent, Ian. What picks do you have for us?

Speaker 3

43:03

Yeah. So I think last time I brought Mandy, a really gnarly horror film. So I figure I'd back up and maybe do something a little different this time and actually go with something tech related. So we've been working with

43:16

Mixpanel recently, which is an amazing service. It's really helped us because we're trying to add some analytics to our website and our console, but custom analytics that allow us to basically track when somebody enters the website, what they interact with and for how long, and develop our own metrics based on the code we actually put in the website, and Mixpanel is

43:41

amazing at this. Basically all they do is they say, hey, we're just going to take events and we're going to represent them in a whole bunch of different ways. You can filter, you can form funnels, you can show user flow, you can take a user and actually watch where they go on the

43:59

website and stuff. It's super, super helpful for us. We've had a total crush on them since we started working with their product, because not only is their UI so nice to work with, but it's just made our life easier. We were thinking of building this ourselves, because basically we wanted to add data analytics to our website, but Google Analytics and that

44:21

sort of general analytics were just not enough. We needed really specific ones, right, and we were going to build it ourselves, and then we stumbled upon Mixpanel and it was like, oh my god, this saved us. Like, they made what we could

44:36

have made. We would have had to start a new company to make what they made, and it's just so helpful. So definitely, for any developer out there that wants to add custom analytics, Mixpanel is really, really helpful.

Speaker 2

44:51

Awesome, excellent. Well, this has been amazing. My last question for you: if people want to follow you, keep up with you, where are the best places to go to do that?

Speaker 3

45:00

Yeah, I mean, let's see, I don't have like professional socials out there, but I am on Medium as Ian Lavery, so you can read any articles I put up there. You can follow Picovoice AI on Twitter, and we have a YouTube channel. If you want to check out my bands, yeah, Fellow Kids and Sleep Circle, check them out.

Speaker 2

45:23

Yeah, no, excellent, that's great. We'll get those in the show notes. And yeah, thanks for joining us today. This is a great chat.

Speaker 3

45:30

Yeah, thanks for having me. This is great.

Speaker 2

45:32

Cool, all right, everybody until next week.

Speaker 4

45:35

See you then.

Transcript source: Provided by creator in RSS feed
