Simplifying Algorithms - Vadim Smolyakov - HS#18 | HockeyStick Show podcast

⁠¶ Guest Introduction: Vadim Smolyakov

00:00

I'm Miko Pawlikowski and this is HockeyStick. Today we talk about algorithms in machine learning. I'm joined by Vadim Smolyakov, the author of "Machine Learning Algorithms in Depth" by Manning, a data scientist in the Enterprise and Security DI R&D team at Microsoft, and a former PhD student in AI at MIT CSAIL. His job today is to simplify ML algorithms enough for me to understand. And if that wasn't hard enough, he's not allowed to use any pictures.

00:33

Welcome to this episode and thank you for flying HockeyStick. I don't get to speak to a lot of people who have done the MIT CSAIL. It's like a mystical, legendary course at this stage with all the hype around AI.

⁠¶ MIT CSAIL Experience

00:48

Maybe let's start there. How was it? How did you enjoy it? It was definitely an experience.

00:53

I really liked the theoretical aspects of the content treatment. Right now there are a lot of news articles coming out about AI, and people try to catch up on the latest large language models, which really took over in the past few years. But what I think is really unique about MIT CSAIL is the theoretical treatment of the subject, really getting in depth and understanding how things work under the hood.

01:23

It also seems to be like the who's who, a lot of names that are recognizable now went through that.

⁠¶ Bayesian Inference and Non-Parametrics

01:28

So you focus on Bayesian inference. What was your thesis about? My focus was on Bayesian nonparametrics. It's a very interesting set of models in which the parameters grow with data. So one example is Dirichlet process K-means, where you are trying to classify a number of, let's say, species, and you don't know how many there are, right? So as you keep discovering new species, you add new clusters.

02:01

And for standard K-means to work, you need to set the number of clusters K. With Dirichlet process K-means, this number of clusters is set automatically, which is one of the main advantages of the algorithm. So Bayesian nonparametrics deals with models in which the number of parameters grows with data. It's a clever way of expanding the model size and capacity to fit the data available.
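The idea of opening a new cluster whenever a point is too far from every existing one can be sketched in a few lines. This is a minimal illustration of the hard-assignment variant usually called DP-means, which may or may not match the exact formulation in the book; the function name `dp_means` and the distance penalty `lam` are assumptions made for this example.

```python
def dp_means(points, lam, max_iter=100):
    """Hard-assignment clustering where the number of clusters is not fixed.

    A point opens a new cluster when its squared distance to every
    existing centroid exceeds `lam` (an assumed, user-chosen penalty).
    """
    dim = len(points[0])
    # Start with a single cluster located at the global mean.
    centroids = [tuple(sum(p[d] for p in points) / len(points) for d in range(dim))]
    assignments = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            k = min(range(len(centroids)), key=dists.__getitem__)
            if dists[k] > lam:
                centroids.append(tuple(p))  # discover a new "species"
                k = len(centroids) - 1
            if assignments[i] != k:
                assignments[i] = k
                changed = True
        # Move each non-empty cluster's centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = tuple(
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                )
        if not changed:
            break
    # Drop clusters that ended up empty and renumber the rest.
    used = sorted(set(assignments))
    remap = {old: new for new, old in enumerate(used)}
    return [centroids[k] for k in used], [remap[a] for a in assignments]
```

Calling `dp_means(points, lam=9.0)` on three well-separated groups of 2-D points returns three centroids without the caller ever specifying K; a larger `lam` yields fewer clusters.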

⁠¶ Vadim's Work at Microsoft

02:30

And did you manage to bring that kind of research with you and expand on it in what you do at Microsoft at the moment? Yeah, at Microsoft at the moment, I'm like a local ML expert on the security team. So the nature of the machine learning problems switched from Bayesian inference to more anomaly detection type problems. Essentially, at Microsoft I worked on time series anomaly detection. I worked on support ticket classification and routing. I worked on hyper-personalization.

03:01

And, most recently, an LLM data copilot. The principles carry over, but the Bayesian nonparametric nature of the work doesn't necessarily extend to my work right now.

⁠¶ The Origin of Vadim's Book

03:14

One of the reasons we actually met is your book, Machine Learning Algorithms in Depth. Can you tell us a little bit about the origin story of your book? Why did you write it? I've always liked writing, even before graduate school. And to me, working on a machine learning project in grad school and then writing it up in an eight-page publication was not enough. I wanted to do more. I wanted to blog about the concepts I was learning. I wanted to have a journal.

03:47

I've done a number of blog posts, and I realized that, again, even that wasn't enough for me. I wanted to compile them into a collection of algorithms, a collection of books. And at the same time as I was writing this library of algorithms as part of my graduate studies, I was getting more experience, and I figured, wouldn't it be nice one day to put all of this together in a format which would be accessible to a wide audience?

04:17

And I was personally transitioning my area of study from wireless communications, which I did during my master's, to more machine learning during my PhD. So I had to learn a lot of these concepts from scratch. There was a steep learning curve, and I figured if I can do it, then so can other people. That was a big motivation behind writing a book: to be able to teach these cool concepts that I learned in graduate school to a wide audience interested in the topic.

04:51

So that sounds like a long time coming, right? From the moment you started writing those blog posts to a finished book, how many years did that take? I would say it was two years of just writing the book, but I also had a lot of materials prepared ahead of time, like code and some ideas on what to write about, which would take another two years just to compile everything together.

05:16

Sounds a little bit like some of the authors I speak to, and some of my friends are of this camp, who basically like using writing as a tool to understand things better. There's this saying that if you can't explain something simply enough, you don't understand it well enough. Are you also of that mind, that writing a book is the best way for yourself to organize this information so that you really can explain it to other people? Yeah, definitely.

05:43

And it takes several passes. You have one point of view of an algorithm, and then you start writing about it and it's, oh, there are actually more concepts here. I'm thinking of decision trees right now. You have a certain exposure to decision trees at first as interpretable models, and then you realize that, hey, it's actually a recursive algorithm, and you grow the trees recursively until they reach maximum depth.

06:08

And some of these parameters, like max depth, start making a lot of sense, and then you start thinking about bias-variance trade-offs, and you really understand the algorithm in depth. Okay, so dear listeners, I think you know where this is going. Now we're going to try to go through some of those algorithms from the book and give you a sneak peek, enough to understand some of the things you might not know.

06:33

And also enough to go and buy Vadim's book, obviously, but before we do that, so who's the target audience of the book?

⁠¶ Target Audience for the Book

06:41

Who is it for? And who's it not for? I wanted to make the book intermediate level so that, anyone who has some experience with machine learning could benefit from it, but also somebody who's new to machine learning will be able to pick up the concepts.

06:57

So specifically, I'd say the audience could be graduate students, it could be undergraduate students who are interested in the topic, it could be people who are trying to get into the field of machine learning but are working in the industry right now as, let's say, a software developer. The book does derive algorithms from scratch, so there are some requirements in terms of mathematics that are good to know: linear algebra, probability, calculus.

07:26

So I would say anyone with an interest in machine learning should be able to benefit from this book. Okay. But who shouldn't read it? What kind of expectations would be misguided for people approaching your book? The book is written in depth, for somebody who's interested in understanding the algorithms from scratch, how they work under the hood.

07:51

So if you don't have an interest in that, and you just want to use the libraries, import scikit-learn or Hugging Face, you wouldn't benefit as much from reading it. Fair enough. So with that warning ahead of us.

⁠¶ Explaining Bayesian Algorithms

08:05

Imagine that you're speaking to a five-year-old software engineer, which we basically are right now. Where should we start? What's the first example that you cover in the book? Does it have Bayesian next to its name? Yeah, the first thing I talk about is the Bayesian worldview. Basically, it's a way to view the world in which you start with some prior knowledge, right? Bayesians talk a lot about priors. You start with prior knowledge of a particular aspect of the world.

08:35

The world is too complex to have priors over everything. So you typically try to model a particular problem. You start with prior knowledge. And then, as you observe data, you update that prior knowledge into what's called the posterior. It's the probability of the parameters given the data, right? So as you're observing data, you're evolving your understanding of the world into something new, a posterior distribution. So what's an example of an algorithm like that?
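The prior-to-posterior update described here has a closed form in the simplest textbook case: a Beta prior over a coin's bias, updated with observed flips. The sketch below is an illustration chosen for this note, not an example taken from the book; the function names are assumptions.

```python
def update_beta(alpha, beta, observations):
    """Conjugate Bayesian update for a Bernoulli likelihood.

    Prior: Beta(alpha, beta). Observations: a list of 1s (success) and 0s.
    Posterior: Beta(alpha + successes, beta + failures).
    """
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


prior = (1, 1)  # Beta(1, 1) is uniform: no preference before seeing data
posterior = update_beta(*prior, [1, 1, 0, 1, 1, 1, 0, 1])
print(posterior, beta_mean(*posterior))
```

With six successes and two failures the uniform prior becomes Beta(7, 3), so the estimated success probability moves from 0.5 to 0.7. Feeding the posterior back in as the next prior continues the update as more data arrives.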

09:06

[unclear] is an example of an algorithm of this nature. So, let's say, anything with a graphical model; a Gaussian mixture model is an example of that. In a Gaussian mixture model, you're modeling the distribution of data points using a mixture, or a collection, of Gaussian distributions. So essentially, the model is a scaled sum of Gaussians that are parametrized by a mean and covariance.

09:39

And the idea is to learn the mean, the covariance matrix, and the mixture proportions from the data itself. There are several algorithms for learning it. One popular algorithm is the EM algorithm, which I talk about in the book. But the idea is to be able to describe the data closely in this form of Gaussians, in a way that maximizes the likelihood of the data. We may start off with the knowledge that all data is distributed as a uniform Gaussian distribution.

10:19

And as we observe more points, we update that knowledge: we evolve the shape of a uniform into a distribution that actually covers the points in a close way. So that would be one example of how the Bayesian approach applies here. So it sounds like basically some kind of iterative process where you're taking new data and nudging your parameters in the right direction to fit your new data more closely. So that's a worldview algorithm.
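To make the "iteratively nudge the parameters" picture concrete, here is a minimal sketch of the EM algorithm for a one-dimensional Gaussian mixture, using only the standard library. The function names, the starting values, and the synthetic data are assumptions for illustration; a real implementation would work in log space and handle multivariate covariances.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture by alternating E and M steps."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: re-estimate proportions, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = (
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj + 1e-6
            )
    return means, variances, weights


# Synthetic data: two groups centred near 0 and 5 (an assumed toy setup).
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(5, 1) for _ in range(200)]
fitted_means, fitted_vars, fitted_weights = em_gmm_1d(
    data, means=[min(data), max(data)], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print(fitted_means, fitted_weights)
```

Each iteration cannot decrease the data likelihood, which is the sense in which the parameters move "in the right direction". The number of components is fixed up front here, which is exactly the limitation the nonparametric models discussed next remove.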

10:51

You also mentioned previously, when we were talking about your background, nonparametrics. Can you tell us a bit more about that? Bayesian nonparametrics are the models in which the number of parameters grows with the amount of data. The number of parameters automatically gets inferred from the data itself.

11:13

One example, since we just talked about Gaussian mixture models, is the Dirichlet process mixture model, which is an extension of Gaussian mixtures with a potentially infinite number of mixture components, obviously constrained towards the simplest model that describes the data best, like the Occam's razor principle. But yeah, in a Dirichlet process mixture model you don't know the number of clusters, and you infer that from the data itself automatically.

11:44

Yeah, so that's an example of a Bayesian nonparametric model and its main advantage. Okay. So that's a little bit more abstract than the previous example you gave of clustering various species of plants. Where does it have practical application in the day-to-day life of a developer? So clustering is a type of unsupervised learning where you are interested in understanding underlying patterns in data, a very abstract notion. And the applications are many, right?

12:19

So one example of clustering could be to detect anomalies. For instance, if you group data and there's an outlier, a point that's sufficiently far away from all the existing points, you could see that as an anomaly. That would be one application where clustering is important. Another application could be customer segmentation. You're interested in figuring out different cohorts of customers and their lifetime value for a particular product.
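As a rough illustration of clustering-based anomaly detection, the sketch below flags any point that lies farther than a chosen distance from every cluster centre. The function name `flag_anomalies`, the centroids, and the threshold are hypothetical; in practice the centroids would come from a fitted clustering model and the threshold from the data.

```python
import math


def flag_anomalies(points, centroids, threshold):
    """Return the points whose nearest centroid is more than `threshold` away."""
    return [
        p for p in points
        if min(math.dist(p, c) for c in centroids) > threshold
    ]


centroids = [(0.0, 0.0), (10.0, 10.0)]  # assumed output of a clustering step
points = [(0.5, 0.2), (9.8, 10.1), (5.0, 5.0)]
print(flag_anomalies(points, centroids, threshold=2.0))
```

Here only `(5.0, 5.0)` is reported, because it sits about 7 units from both centres while the other two points are well inside a cluster.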

12:55

The example I give in the book is that of classifying iris species. It's a classic machine learning data set, and it's simple enough to understand. So it's a toy example, but the applications are numerous. A lot of people listening to this come from a classical software engineering background. And when we say algos, they think about binary search and stuff like that, chasing down the complexity and thinking about constraints.

13:26

And then we've got the machine learning algorithms that, somehow sound exotic. And, with all the hype around AI, everybody's wondering, oh, should I be looking into more of that? You mentioned the Bayesian worldview and non parametrics and a few applications of those. I wonder if this is like a representative sample of machine learning algorithms.

13:49

And, the second part of the question is assuming that's the case, what other algorithms would you place firmly in this basic 101 machine learning algorithm set that everybody should be aware of? So first I want to make a distinction between kind of classical algorithms and machine learning algorithms, you have some sort of task that you're trying to solve, right? An algorithm is essentially a sequence of steps in solving that task.

14:15

So an example could be, like you said, binary search over a sorted array, or it could be sorting itself. And you're interested in runtime and memory complexity to characterize the algorithm, and in running it in the fastest possible time using a small amount of memory. So, for instance, comparison-based sorting has O(n log n) runtime complexity. The same carries over to machine learning as well.

14:43

But with the difference that in machine learning you're given a collection of input-output pairs, and you try to learn the rules to map the inputs to the outputs during training. So instead of having a fixed set of instructions, as in quicksort, for example, if you're classifying points, then you are learning the classification boundaries between the existing points. So you're learning the rules when it comes to machine learning algorithms.

15:14

And we did talk about nonparametrics and we talked about Bayesian algorithms. Actually, a lot of algorithms are derived from principles of applied probability and Bayes' rule. Examples include Naive Bayes and mixture models. Some of the principles, like maximizing likelihood, are a common theme across a variety of algorithms, just like in deep learning, where choosing the loss function is a common theme across a variety of deep learning models. So definitely.

15:49

A big category of algorithms that we haven't touched upon yet is deep learning algorithms.

⁠¶ Supervised vs Unsupervised Learning

15:57

And in general, to classify the algorithm types, they come in supervised and unsupervised fashion. So the supervised algorithms have a label associated with every example, in other words, what the right answer looks like given the problem. And given enough of these right answers, the algorithm learns how to create right answers by itself, through generalization.

16:28

What I mean by that is, the goal of machine learning is to generalize to unseen data, to be able to demonstrate that something has been learned. So you mentioned supervised and unsupervised. And from what you said, I understand that in supervised learning there's basically some kind of underlying function that we're trying to approximate by giving it examples, right? So that as many unknown data points as possible land as close as possible to where we would like them to.

16:57

By comparison, what does it actually mean when the algorithm is unsupervised? So when an algorithm is unsupervised, we don't have a label to learn from. Instead, what we're interested in is understanding patterns in data. We're interested in making sense of a lot of data. Clustering is one example of unsupervised learning, where we group data into clusters and then try to make sense of each individual cluster or interpret it for our application.

17:30

What would be some other examples of unsupervised learning? Another example that comes to mind is extracting features from data. Essentially, autoencoders take an input and reconstruct an output from the input, but there is a bottleneck layer in between, which forces the autoencoder to learn a compressed representation of the input data before generating the output, right?

17:57

So this bottleneck layer, which has fewer dimensions than the input, forces the autoencoder to learn something useful about the data, and this could be used as a feature later on in a downstream algorithm. So this all makes sense at the high level, right?

18:18

But I'm trying to come up with a more concrete example of the most basic version of an algorithm that you can have, and to understand what it would look like. Because, like you said, that's the main difference: with something like binary search, you've got a well-understood algorithm that's well analyzed, and then you just apply it to data and you get some output. Whereas in the machine learning algorithm world, you're doing kind of the opposite.

18:48

You're learning the rules and trying to come up with the actual algorithm. Really? I don't know if that's the right way of saying that. But, that's the bit that you're trying to figure out rather than just applying it. I'm trying to think, what's the simplest, algorithm that we could maybe talk a little bit more in detail, of how it works, because, like I said, it's abstract and it might be a little bit hard to wrap your head around how that actually works in practice.

19:17

in my book, I talk about a lot of different algorithms. we can touch on a number of different algorithms.

⁠¶ Decision Trees and Random Forests

19:22

But let's start with decision trees. I think they're widely used, and they're interpretable models. Essentially, a decision tree learns to construct a sequence of if-else conditions, right? You could trace the reasoning behind the decision tree just by looking at how decisions are made through that if-else tree.

19:45

for example, if you're applying for a loan and the loan gets rejected, then you could, analyze this decision, why the loan got rejected by looking at the decision tree and figuring out what branch of the decision tree was taken to lead to the outcome. and yeah, in some cases it's, an important design choice to use an interpretable model like a decision tree. And, the interpretability, extends through an ensemble of these models.

20:16

Like, random forest is an ensemble of decision trees, and we could extract feature importances from that. Let's talk about decision trees in detail. Essentially, it's a greedy and recursive algorithm that starts with a certain depth of a tree and grows the depth on each iteration until the maximum depth is reached. It's trying to optimize the Gini index, which is a measure of impurity.

20:42

And at each iteration we are trying to understand how to divide our feature range in a way that optimizes the Gini index. And once we complete one level, we move on to the next level of the tree, and so on until the maximum depth is reached. So it's a greedy algorithm and it's a recursive algorithm.
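The greedy, recursive procedure described above can be sketched as follows: at each node, try every midpoint threshold on every feature, keep the split with the lowest weighted Gini impurity, and recurse until `max_depth` is reached or the node is pure. This is a simplified illustration in the spirit of CART, not the book's implementation; the function names and the toy loan data are assumptions.

```python
def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Greedy search over features and midpoint thresholds."""
    best = None  # (weighted impurity, feature index, threshold)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    split = best_split(rows, labels) if depth < max_depth and gini(labels) > 0 else None
    if split is None:
        # Leaf: predict the majority class (ties broken alphabetically).
        return max(sorted(set(labels)), key=labels.count)
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Follow the if/else path from the root down to a leaf."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree


# Hypothetical loan data: (income, existing debt) -> decision.
rows = [(20, 5), (25, 8), (30, 2), (60, 5), (70, 9), (80, 1)]
labels = ["reject", "reject", "reject", "approve", "approve", "approve"]
tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (75, 3)))
```

The learned tree is a nested dictionary that reads like an if-else chain, which is what makes the decision traceable: for the toy data it splits once on income at 45 and then stops because both children are pure.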

21:09

And the one I'm talking about is called CART, C A R T. So is that a deterministic way of doing that, or is the maximum depth you mentioned an arbitrary decision, a hyperparameter effectively? The algorithm itself, being greedy, is deterministic. However, there's a way to introduce randomness, and this is what's done in random forest. You could introduce randomness in several ways. You could sample the features that you're evaluating at each iteration.

21:40

That sampling introduces randomness. You could also run the algorithm on a subset of the data, so that the data that the algorithm sees is different each time. And this is important because you are trying to reduce the variance of the algorithm. Basically, if you're working in the regression setting, where you're trying to predict a continuous quantity using a random forest, for example, so you have a random forest regressor, then you want to minimize the mean squared error.

22:17

And mean squared error can be written as bias squared plus variance. So to minimize mean squared error, you want to minimize bias, and you want to minimize variance. The way to minimize variance is by taking an average of a large number of trees. And it's important to make sure that the trees are decorrelated, because this will actually help minimize the variance. And then injecting randomness into individual decision trees will help decorrelate them. So they're basically different-looking trees.
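The two facts used in this argument can be written out explicitly. The symbols are assumptions introduced here: $f$ is the true function, $\hat{f}$ the fitted model, $T_b$ the $b$-th of $B$ trees, each with variance $\sigma^2$ and pairwise correlation $\rho$. When the target itself is noisy, an irreducible noise term is added to the first identity.

```latex
\mathrm{MSE}\big(\hat{f}(x)\big)
  = \mathbb{E}\big[(\hat{f}(x) - f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}

\mathrm{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second line shows why decorrelation matters: adding more trees shrinks only the $(1-\rho)\sigma^2/B$ term, so the variance of the average cannot drop below $\rho\,\sigma^2$ unless the trees are made less correlated.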

22:52

So in a practical sense, let's go back to the example you mentioned, a decision tree to either grant or deny your request for a loan. Would that decision tree be recalculated on the fly, or do you run the algorithm once, get your current best model for deciding whether to give people loans, and version that? Because I understand that the decision tree is the actual output of your machine learning algorithm, right?

23:25

And then the decision tree is like an algorithm in itself, right? That you run to evaluate whether you give the loan or not. So how does that work in practice? I would say there are two different modes of machine learning algorithms. One is training and the other is testing. So in training, you're learning all the parameters that are learnable in the machine learning algorithm. and, you need to have the right data for it. you need to have the labels in this case.

23:52

While during testing, you fix the parameters that you've learned and you're focusing on prediction, meaning given new input data, what would the output be like given a new customer with their own profile? What should the output be for that particular person? Okay. So what I'm picturing is like a massive database of, okay, this person with all the parameters about them. This is their business plan. This is their, previous exits and stuff like that.

24:19

And this is the amount they want, and the decisions that were previously made by humans. You use that to somehow feed into the decision tree maker, is that the right way of saying that? Yeah. And it spits out a decision tree version 1.7 that you start running, right? Is that how it works? Yeah, I would imagine so.

⁠¶ Challenges in Implementing ML Algorithms

24:42

What are some of the difficulties in terms of the actual software implementation of these things? Again, going back to binary search, we've got some people who thought about that and came up with this optimized idea. Then we've got a few people who sat down and optimized that for whatever hardware. And we've got a pretty speedy binary search or, I don't know, quicksort. These algorithms seem to be much more custom and much more aligned with the data.

25:14

So what are some of the complications of that in terms of actually implementing this? Or maybe that's a completely wrong way of thinking about it, and if that's the case, just tell me. Yeah. When it comes to implementation, some of the computer science principles that you mentioned carry over, and I can talk about some of the computer science paradigms, like algorithmic paradigms, later as well. But yeah, it's a matter of getting it right.

25:40

I think the correctness of the algorithm is very important. computational complexity, like runtime and, memory complexity are also important. Being able to scale the algorithm is important. it's an important challenge.

25:58

Some algorithms, like random forests, are more amenable to parallelization because the trees are generated in parallel, whereas another ensemble, like boosted algorithms, works by sequentially fitting the residuals of trees. They work in a sequential manner, so they are less amenable to parallelization. I would say the number one challenge is to get the math correct, and then to translate that math into code, and then from there on to have low computational and low memory complexity.
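The sequential dependency in boosting is easy to see in code: each round fits a small tree to the residuals left by all previous rounds, so round t cannot start before round t-1 finishes. The sketch below uses depth-one trees (stumps) on a single feature with squared error; the function names, learning rate, and data are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None  # (sum of squared errors, threshold, left mean, right mean)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_rounds=20, lr=0.5):
    """Each round depends on the predictions accumulated by all earlier rounds."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still unexplained
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 1, 1, 5, 5, 5, 5]
model = boost(xs, ys)
print(model(2), model(7))
```

By contrast, the trees of a random forest never look at each other's output, so they can be built on separate cores and only combined at the end.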

26:33

So I guess it's not all that different after all. I'm guessing some of the things will be common. You mentioned the greedy aspect, which comes from the classic algorithms. I'm guessing a lot of that will be dynamic programming, and you're probably going to apply all the usual tricks, like divide and conquer and stuff like that, wherever you can. But is there anything particularly common and unusual that you wouldn't be doing with classical algorithms that you do a lot in ML?

27:05

There are different phases, like training and testing, right? Learning the parameters and predicting with them. The notion of learnable parameters itself, I think, is a key difference. What's that? What are learnable parameters? Essentially, variables that you try to fit, variables that you try to optimize for the data. It's like room for growth, or room for adaptability, in the algorithm itself. Having an objective function is another key differentiator.

27:36

A lot of Bayesian algorithms maximize the log likelihood or minimize the negative log likelihood. That's another difference. A methodology for learning these parameters would be another difference. For example, it could be backpropagation in deep learning, right? There's a methodology for learning the parameters of the model. Or it could be Bayes' rule as a way of updating the parameters in a graphical model, for example. That makes me think.
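As a small worked example of an objective function with learnable parameters, the sketch below evaluates the negative log likelihood of data under a Gaussian and compares the closed-form maximum likelihood estimates with nearby alternatives. The function names and the data are assumptions for illustration.

```python
import math


def gaussian_nll(data, mean, var):
    """Negative log likelihood of i.i.d. data under a Normal(mean, var) model."""
    n = len(data)
    return 0.5 * n * math.log(2 * math.pi * var) + sum((x - mean) ** 2 for x in data) / (2 * var)


def gaussian_mle(data):
    """Closed-form minimiser of the NLL: sample mean and (biased) sample variance."""
    mean = sum(data) / len(data)
    var = sum((x - mean) ** 2 for x in data) / len(data)
    return mean, var


data = [1.0, 2.0, 3.0, 4.0]
mle_mean, mle_var = gaussian_mle(data)
print(mle_mean, mle_var, gaussian_nll(data, mle_mean, mle_var))
```

Here the learnable parameters are `mean` and `var`, and the objective is the NLL: any other choice of the two gives a larger value than the estimate returned by `gaussian_mle`. When no closed form exists, the same objective is minimised numerically, for example by gradient descent.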

28:05

So is it true that at the moment all of ML is being completely dominated by deep learning? And when people talk about ML, they basically talk about deep learning most of the time? Backpropagation and stuff like that has been a super hot topic because of the ChatGPTs of the world. And the rest is becoming a little bit less in fashion at the moment? It comes in waves.

28:33

I tend to focus on fundamentals because fundamentals are never going to be out of fashion. A solid background in applied probability, calculus, linear algebra, Bayesian inference, deep learning: these are all going to be in fashion for a really long time. Definitely, large language models showed so much growth in the past few years. And these are deep learning models, starting from natural language machine translation, encoder-decoder type architectures, and going to GPT-4

29:11

and whatever the next GPT is, in size and in performance. It's interesting to think about. I'm really happy that they took off at such speed and there's so much interest in AI. So what was that? 2019 or something like that when the first version of ChatGPT came out, right? It's been a few years now. As someone who specializes in a lot of these fundamental algorithms and understands how they're derived and where they come from and their limitations,

29:43

What do you think of all the hype that's currently flowing around, AGI being just around the corner and AI taking your job and all of that, I'm a believer in co pilots. So I think, AI is helping people with their job. I'm not sure if they're going to be taking over the job, but, also a big believer in automation, automation as a way of helping a developer deal with less pleasant aspects of the job, right? if AI can do that, that's fantastic.

30:13

But I think a lot of the planning and thinking is still up to the human, to reason, to decide. Yeah, I benefited a lot from copilots. They're really great at summarizing a lot of resources available online, and through retrieval-augmented generation systems you could accomplish a lot. I'm a big believer in copilots. I think this is probably something that might be getting a little bit of a bad rep, because everybody just wants the final step, right?

30:45

It was the same thing with self driving cars. my Tesla is driving itself pretty well, maybe 95% of the time, if I'm on like a longer route and I'm on the motorway or whatever, It's doing most of the work already pretty well, I'm still responsible for it and I have to look but what everybody wants is like the final step when you can just kick back and relax and not do any of that and I think that's understandable.

31:12

But at the same time it's like making the current, intermediate step of a co pilot situation, maybe sounds a little bit less glamorous than it actually is because it's already pretty cool. so totally agree with you on that. we've done one example of the decision tree.

⁠¶ Top Machine Learning Algorithms

31:32

I wonder what would be your top three hall of fame machine learning algorithms. I saw your book, and there are some things that I keep seeing elsewhere, like Markov chains and Monte Carlo, stuff like that. There are some things that sound interesting, like genetic algorithms, and I wonder what that actually means.

31:52

But if you were to give us your top three favorite Hall of Fame algorithms and tell us a little bit about how they work, high level again, for a five-year-old software engineer, what would be your selection? What's on that menu? I definitely have to mention that one of them would be a Markov chain Monte Carlo type algorithm. So a Markov chain is essentially a sequence of random variables in which the future is independent of the past, given the present.

32:18

So the future state of a random variable only depends on the present state, which reminds me of a quote: it doesn't really matter where you're coming from, all that really matters is where you're going. So, Markov chain Monte Carlo. One of my favorite algorithms in that area is the Metropolis-Hastings algorithm. The idea there is that you're after a posterior distribution. You want to draw samples from this posterior distribution. You want to study it, analyze it. The posterior is like the goal, the answer.

32:49

But it's hard to sample from it, because real-life models are complex. And what you do instead is approximate it with something called a proposal distribution. A proposal distribution is easier to sample from. So what happens is you draw samples from a proposal distribution, and then, based on the Metropolis-Hastings ratio, you evaluate these samples, and you either accept them or reject them. You either take them or you drop them. And you repeat this process many times.

33:22

So Metropolis-Hastings enables sampling from these high-dimensional distribution spaces, and it's simple enough to implement from scratch. It's a great algorithm overall. There are various improvements on top of it. It's definitely not the most efficient algorithm, but it's a really good one. That's why I bring it up. But how do you come up with this proposal distribution? Proposals are something that's easier to sample from.
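A from-scratch version really does fit in a few lines. The sketch below is the random-walk special case: the proposal is a Gaussian centred on the current state, which is symmetric, so the Hastings correction cancels and the acceptance ratio reduces to the ratio of target densities. The function name, the step size, and the toy target (a normal centred at 3, known only up to a constant) are assumptions for illustration.

```python
import math
import random


def metropolis_hastings(log_target, n_samples, x0=0.0, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalised log density."""
    rng = random.Random(seed)
    x = x0
    samples = []
    accepted = 0
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)  # symmetric Gaussian proposal
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, target(proposal) / target(x)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
            accepted += 1
        samples.append(x)  # a rejected proposal repeats the current state
    return samples, accepted / n_samples


# Toy "posterior": proportional to a normal density with mean 3, variance 1.
samples, acceptance_rate = metropolis_hastings(lambda x: -0.5 * (x - 3.0) ** 2, 20000)
kept = samples[2000:]  # discard burn-in
print(sum(kept) / len(kept), acceptance_rate)
```

Only the ratio of target densities is needed, which is why the method works even when the normalising constant of the posterior is unknown. The `step` parameter controls the trade-off discussed next: a proposal that is too wide gets rejected often, while one that is too narrow explores slowly.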

33:52

So it could be a Gaussian with a certain mean and covariance, like a multivariate Gaussian in high-dimensional problems. Typically, you want to have a high acceptance ratio, so the closer your proposal is to the actual target distribution, the target posterior, the better. So you're trying to estimate, based on domain knowledge or otherwise, the proximity, how close you can get to the target. I see. Because I keep thinking the classical way about it.

34:23

So it's not like one of those algorithms where you just have the steps. There's a step which is basically: suggest a reasonable distribution that approximates it with something that's well known, look at the data, and come up with something that should be reasonable. And then you use Metropolis-Hastings to evaluate, basically. It's much more artisanal, right?

34:46

There's always this step of staring at the data and using a mix of your experience and creativity to come up with something that sounds about right, which is scary for someone who comes from the very exact world of algorithms. This is scary stuff. All right, cool. So Metropolis-Hastings, is that two names? I believe the algorithm is named after its inventors. So that's an interesting approach.

35:15

What would be number two of your top three? I would pick approximate nearest neighbors because of its popularity; in current retrieval augmented generation systems, it's used everywhere. Essentially, approximate nearest neighbors is an improvement on K nearest neighbors. With K nearest neighbors, if you're given a query point, you want to compute the distance between that query point and all the other points in the training data set.

35:46

You want to compute the distances, then sort these distances, and then select the top K closest points. So this is a highly computationally intensive operation. First of all, you have to compute n d-dimensional distances. Then you have to sort, which is n log n, and select the top K. So approximate nearest neighbors is a way to get around it.
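For reference, the exact, brute-force version being described looks roughly like this. It is a minimal sketch using Euclidean distance; the function name `knn_exact` and the sample points are made up for illustration.

```python
import math


def knn_exact(query, points, k):
    """Brute-force k nearest neighbors: distance to every point, then sort."""
    # One distance computation per point, each over all d dimensions.
    distances = [(math.dist(query, p), i) for i, p in enumerate(points)]
    # Sorting all n distances is the expensive step for large data sets.
    distances.sort()
    # Keep the indices of the k closest points.
    return [i for _, i in distances[:k]]


points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.2, 0.1), (4.0, 4.5)]
nearest = knn_exact((0.0, 0.1), points, k=2)
```

Every query touches every point, which is the cost the approximate methods try to avoid.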

36:10

And there are three approximate nearest neighbor flavors that I could talk about. One is tree-based ANN. Essentially, what you do with tree-based ANN is you divide up the space into regions, and each leaf in the tree is a region. One example is k-d trees. And then, based on the problem, whether it's a classification problem or a regression problem, you can compute the final answer either by taking a majority vote for classification or taking an average of the points in the region for regression.
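A rough sketch of the tree-based idea, assuming a simple k-d tree with median splits and small leaves. The dictionary layout, the `leaf_size` parameter, and the helper names are choices made for this example; a production library would be more elaborate.

```python
from collections import Counter


def build_kd_tree(items, depth=0, leaf_size=2):
    """items: list of (point, target) pairs. Each leaf holds one region of space."""
    if len(items) <= leaf_size:
        return {"leaf": items}
    axis = depth % len(items[0][0])  # cycle through the coordinates
    items = sorted(items, key=lambda item: item[0][axis])
    mid = len(items) // 2
    return {
        "axis": axis,
        "split": items[mid][0][axis],
        "left": build_kd_tree(items[:mid], depth + 1, leaf_size),
        "right": build_kd_tree(items[mid:], depth + 1, leaf_size),
    }


def find_region(tree, query):
    # Walk down to the single leaf whose region contains the query.
    # There is no backtracking, so the answer is approximate, not exact.
    while "leaf" not in tree:
        side = "left" if query[tree["axis"]] < tree["split"] else "right"
        tree = tree[side]
    return tree["leaf"]


def classify(tree, query):
    # Majority vote over the labels in the query's region.
    labels = [target for _, target in find_region(tree, query)]
    return Counter(labels).most_common(1)[0][0]


def regress(tree, query):
    # Average of the target values in the query's region.
    values = [target for _, target in find_region(tree, query)]
    return sum(values) / len(values)


coords = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
label_tree = build_kd_tree(list(zip(coords, ["a", "a", "a", "b", "b", "b"])), leaf_size=3)
value_tree = build_kd_tree(list(zip(coords, [1.0, 2.0, 3.0, 10.0, 20.0, 30.0])), leaf_size=3)

predicted_label = classify(label_tree, (0.05, 0.05))
predicted_value = regress(value_tree, (5.0, 5.0))
```

A query only visits one path from the root to a leaf, so the cost depends on the depth of the tree rather than on the total number of points.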

36:48

Can we take a quick detour? Because those are words that have meaning, but they probably have a more particular meaning in the machine learning context. So could you quickly tell us what's the difference between regression and classification types of algorithms? So in regression, you're interested in estimating a continuous quantity, a real value, such as, let's say, a stock price. In classification, you're interested in estimating a discrete quantity.

37:19

For example, it could be a particular customer age group. So the difference is in the quantity you're estimating: for continuous, it's regression; for discrete, it's classification. Back to approximate nearest neighbors: in tree-based ANN, we have our space, which we divide into regions. And based on the points in each region and the task at hand, we either average the points, if we're looking for a regression

37:48

answer, or we take a majority vote, looking at the class labels and taking the majority label as the answer, if the problem is a classification problem. Another example of approximate nearest neighbors is locality-sensitive hashing. What we do there is essentially group points into buckets based on their proximity to each other. And instead of searching through all the points, we only look inside the bucket to find the K nearest neighbors.

38:21

So this helps reduce computational complexity dramatically. So first cluster them using one of the other algorithms, and then you just look inside the cluster? That would be the third type, which is clustering. In the clustering sense, we group the points into clusters and only look inside the cluster. So the buckets we were talking about, how are they different from clusters? The buckets are formed in a slightly different way.

38:48

Essentially, you can visualize it as points on a high-dimensional sphere, and you cut through the sphere with hyperplanes, and points that are captured between the same hyperplanes are placed into the same bucket. So they're based on locality there: points that are closer together on that sphere get grouped into the same bucket. Okay. All right.
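The hyperplane picture can be sketched as follows. This assumes the random-hyperplane flavor of locality-sensitive hashing, where each hyperplane passes through the origin and contributes one bit to the bucket key; the function names and the sample points are illustrative only.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(n_planes, dim, seed=0):
    rng = random.Random(seed)
    # Each hyperplane goes through the origin and is defined by a random normal vector.
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def bucket_key(point, hyperplanes):
    # One bit per hyperplane: which side of the plane the point falls on.
    return tuple(
        sum(w * x for w, x in zip(plane, point)) >= 0 for plane in hyperplanes
    )


def build_index(points, hyperplanes):
    buckets = defaultdict(list)
    for i, p in enumerate(points):
        buckets[bucket_key(p, hyperplanes)].append(i)
    return buckets


def query(q, points, buckets, hyperplanes, k):
    # Search only inside the query's own bucket instead of all points.
    candidates = buckets.get(bucket_key(q, hyperplanes), [])
    return sorted(candidates, key=lambda i: math.dist(q, points[i]))[:k]


points = [(1.0, 0.1), (0.9, 0.2), (-1.0, 0.1), (0.1, -1.0)]
planes = make_hyperplanes(n_planes=4, dim=2)
buckets = build_index(points, planes)
nearest = query(points[0], points, buckets, planes, k=1)
```

Points that lie in a similar direction from the origin tend to fall on the same side of every hyperplane and so share a key. A true neighbor can still land in a different bucket, which is where the "approximate" comes from; practical systems usually use several hash tables to reduce those misses.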

39:12

So these three methods, tree-based ANN, locality-sensitive hashing ANN, and clustering-based ANN, help reduce the complexity of exact K-NN. So what would be some of the applications of approximate nearest neighbors? Where would we see that in practice, maybe in production? Can you give us an example? So in retrieval augmented generation systems, you have a vector store, or in semantic search, for example, you're interested in retrieving the

39:46

closest unit from the vector store. And there's like a bunch of dimensions. Yeah. So you can use this approximation to get something quicker, even if it's not exact, which is pretty cool. All right. So that was number two on your top three list. What's number three? I would say attention and transformers would be my third on the list of favorite algorithms. Self-attention methods really revolutionized the space. And the idea there is to attend to the context.

40:27

It originated from neural machine translation. If we are translating a target word to a different language, we need to understand the context around that word.

40:38

We need to understand the whole sentence around it before we can translate a single word, and self-attention mechanisms enable us to do just that, in a parallelized fashion. It could also be seen as a soft dictionary lookup with query, key, value pairs, in which the target word is a query and you're computing the inner product between the query and the key, and multiplying that by the value stored in that soft lookup dictionary.

41:12

And that's how you get the famous formula involving those three variables. Essentially we're trying to understand the context and the contribution of every word in the sentence to the target word which we're translating. And we have learnable parameters. We're keeping track of word embeddings and of word position, and these learnable parameters help us find the closest match in the target language to the word which we're translating.
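The formula in question is usually written as softmax(QK^T / sqrt(d_k)) V. Below is a minimal sketch of that soft lookup in plain Python. The two-dimensional "embeddings" for the three words are made up for the example, and the learned projection matrices and position information that a real transformer applies are left out.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Inner product between the query and every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the values: the result of the "soft" lookup.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up 2-D embeddings for the words "I", "like", "cats".
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Self-attention: queries, keys, and values all come from the same sentence.
outputs, weights = attention(x, x, x)
```

Each row of `weights` says how much one word draws on every word in the sentence, and each row sums to 1. Because every query is handled independently, the rows can be computed in parallel.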

41:42

Okay, so I've got a sentence, I don't know, "I like cats." And we want to understand how the "like" is connected to the "cats", right? Uh huh. So does it mean that we're calculating the complete product of connections between all the pairs of words, embeddings, or whatever is underlying there? How does it work?

42:10

I do have a chapter in my book on self-attention and transformers. It comes back to the "Attention Is All You Need" architecture, the encoder-decoder architecture, and the paper that first introduced it. Famous paper. Yeah, essentially we're predicting one word at a time, in a masked, causal way, right? We are looking at all the words that came before that word and figuring out the highest-probability next word in our dictionary.
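The "masked, causal" part can be illustrated with a small sketch: each position is only allowed to put attention weight on itself and on earlier positions. The function name and the all-zero score matrix are assumptions for the example; in a real model the scores come from query-key inner products.

```python
import math


def causal_attention_weights(scores):
    """scores[i][j]: raw attention score of position i toward position j."""
    weights = []
    for i, row in enumerate(scores):
        # Position i may only look at positions 0..i (itself and earlier words).
        visible = row[: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Later positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


# Equal raw scores for a three-word sentence, to make the mask easy to see.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(scores)
```

With equal scores, the first word attends only to itself, the second splits its weight over the first two words, and the third spreads it over all three, so no position ever sees a word that comes after it.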

42:39

And of course there are different varieties of architectures now when it comes to transformers. There's the decoder-only GPT family, then there's encoder-only BERT, and then there are encoder-decoder architectures like T5. And they're suitable for different applications. GPT has been very popular when it comes to generative AI. We've got the top three from Vadim. For anybody else who is struggling a little bit like me to go through that, probably the best way is to go grab a book.

43:13

The book is still in MEAP, right? The Manning Early Access Program. And I think I looked it up on the website. I think it said August this year for the final version. Is that right? Everything is written and finished. It's up to the production folks at Manning to actually have the print version ready. The PDF is available and all the contents are there right now. Got it. Manning, please hurry up. We want the book finished. What's next for you, Vadim?

43:45

I've been thinking about maybe making an online course on a machine learning topic. I'm exploring different media right now. Writing a book is one medium. I'm getting into YouTube a little bit more. I started posting content on YouTube.

44:00

I also have an Instagram channel, at the life guide now, in which I talk about inspirational, motivational content related to different quotes and different things that helped me grow and go through difficult periods, kind of things that help me and things I want to share with the world. Yeah, growing those channels and maybe looking at online courses is my next step. Yeah, I think, to be honest with you, YouTube is probably my favorite way of learning things at the moment.

44:34

It's basically got everything and anything that you need. On any topic, really, you're going to find something, and on many topics you're going to find so many different ways of explaining something. And it's a nice medium because it's so flexible, right? You can explain, you can show, you can give examples, you can demonstrate. It's amazing.

44:55

If our civilization fails some thousand years from now, I hope that YouTube survives, because for the next one to pick it up, that's a lot of knowledge that's encoded in there, and in a very nice, easy-to-consume way. I'm going to ask you, before I let you go, for some predictions, given the crazy rate of acceleration in all the things. There seems to be an AI startup on every corner now. And they seem to be going almost as quickly as they're coming.

⁠¶ Future of AI and ML

45:27

Where do you think we're going to see the most development in the coming years? Where would you personally love to see development in the coming years? I actually recently attended a keynote at the machine learning and data science conference, MLADS, at Microsoft. And I was really inspired by agents and autonomous thinking units as part of copilots and assistants. There's so much room for growth in that space.

45:55

There are different form factors. We're used to our phones and laptops, right? But imagine having a copilot that's not on your phone or on your laptop, but somebody who's portable, somebody who's with you, somebody who understands you really well and helps you do your tasks or helps you have a good time. Yeah, so different form factors, like a portable copilot, like a device that could be with you and learn from you and interact with you.

46:29

So I think redesigning what we have today in terms of LLM agents, or populations of agents, not just focused on language but other types of agents, is going to be the next step forward. I would like to challenge you a little bit on that, because I was thinking like that initially when I was watching, for example, the Rabbit R1 keynote, and they were giving this demo of how you're just going to talk to it, and it's going to effectively use the UIs in various apps.

47:02

And I was like, oh, that's a great idea. All these apps, they have weird UI things, and I don't want to click that. I don't want to learn it. I just wish it was automated. And there was also Humane AI. And they both seem to suck quite a lot. I watched some of the reviews, I even ordered the Rabbit R1, and it just doesn't seem to be working all that well. I think Humane AI was already talking about hoping to be acquired by someone who can take it in a better direction.

47:31

And my thinking was actually, what is so wrong with the phones? There's a smartphone, it's already evolved, and it's already got basically everything you need to run a reasonably sized model. So why don't we just like the idea of having that naturally evolve to be more prominent in your phone? And why do we need a new device for that? What do you think about that? It has to make sense, right? If it's not working as expected, then people are not going to buy it, right?

48:04

But it has to add value to our lives. It could be the interaction with the device. Instead of clicking, you simply use eye-tracking software and you could click using your eyes, as an example. Something seamless, something that removes the bottlenecks. Instead of typing, of course, we have now all the interactions with our devices, but something that simplifies our lives. It has to have value. Yeah, that's for sure.

48:36

I'm just wondering, there are a few startups now that are working on these humanoid robots, right? There's obviously the Tesla Optimus and a bunch of others. I think Unitree announced that you can now order their $16,000 mini, four- or five-foot-tall humanoid, which, I guess, is not mini anymore. That's pretty big. But I'm just wondering if I actually need that yet. Don't get me wrong, I would love to get one of these.

49:08

And if I had 16 grand lying around that I had no use for, I would have ordered one already. But I do wonder whether that's literally around the corner or whether this is going to be another one of the self-driving car situations, where it's been "next year" for a decade and a half at least now. Have you ordered one? No, we do have a robo vacuum though. Oh, those are awesome.

49:34

If there's a way I could reduce the amount of chores I need to do, that could free up my time, but I know it's not an easy problem. Even things like grasping are not an easy problem for robotics. So it might be a few more years. I'm glad that you're a fellow vacuum cleaner aficionado. I love mine. I upgraded last year to one that finally has the mop thing. It not only vacuums, but also mops the floor and cleans itself up and dries itself up and everything.

50:05

And it's been easily one of the best investments that I've made. I did have to basically change my flat layout quite significantly. I got rid of all the carpets, and I laid better flooring just so that I know that all of the floors can be mopped and cleaned by the robot, and it does it every day, and I couldn't be happier. In that respect, robots, I'm looking forward. I can definitely bring one home.

⁠¶ Conclusion and Farewell

50:31

All right, Vadim, it's been a pleasure. That was probably the most challenging episode we've done, when you try to talk about algorithms without actually being able to show them and give an example or point to some code. What I'm hoping we achieved here was a high-level map that people can now go and look up in books like yours. Again, let me plug that. It's called "Machine Learning Algorithms in Depth", by Manning. My guest was Vadim Smolyakov. Vadim, thank you very much.

51:03

I'll see you next time. Thank you.

Transcript source: Provided by creator in RSS feed: download file
