
The History of Machine Learning with Tom Mitchell

Feb 23, 2026 | 1 hr 8 min | Ep. 1

Episode description

Tom Mitchell, Founders University Professor at Carnegie Mellon University, kicks off the podcast with this recording of his February 2026 seminar talk on “The History of Machine Learning.”

He takes us from the writings of early philosophers about whether it is even possible to form correct general laws given only specific examples, to today’s machine learning algorithms that underlie a trillion-dollar AI economy. Along the way we see the thoughts and recollections of many of the pioneers in the field, in the form of excerpts from upcoming podcast episodes featuring full interviews with each.

Tom discusses the wonderful creativity and diversity of approaches explored during the 1980s, the integration of statistics and probability into the field in the 1990s and early 2000s, and the amazing progress over the past decade that has brought us today’s AI systems. He reflects in the end on what we should learn from this history.

Recorded at Carnegie Mellon University.

Transcript

Tom Mitchell: Welcome to "Machine Learning: How Did We Get Here?" I'm Tom Mitchell, your podcast host. Now, many people ask, how did we get to this point where today we have these amazing AI systems? I have a one-sentence answer to that question.

Tom Mitchell: We tried for fifty years to write intelligent programs by hand, but we discovered about a decade ago that it was actually much easier and much more successful to use machine learning methods to instead train them to become intelligent. So the real question is, how did machine learning get here?

Tom Mitchell: What were the successes along the way, and the failures? Who were the people involved? What were they thinking? What even made them want to get into this field in the first place? This first episode will set the stage for the podcast.

Tom Mitchell: It is a recording of a lecture I gave this month, in February 2026, at Carnegie Mellon University, and it attempts to cover in one hour a seventy-five-year history of the field of machine learning.

Tom Mitchell: Most of the rest of the episodes in the podcast involve interviews with various pioneers in the field who made very significant contributions along the way. Before we start, I want to thank Carnegie Mellon University and also the Stanford University Digital Economy Lab for supporting the podcast. And I want to thank Maddie Smith, our podcast producer.

Tom Mitchell: I hope you enjoy the podcast.

Tom Mitchell: If we're going to talk about machine learning, it's only fair to start with the first people who talked about how on earth learning is possible, which were the philosophers. As early as Aristotle, he was talking about the question of how it is that people could look at examples of things and learn their general essence, in his words.

Tom Mitchell: About a century later, there was a school of philosophers called the Pyrrhonists, who really zeroed in on the problem of induction and how it can be justified. When we say induction, what we really mean is the process of coming up with a general rule from looking at specific examples.

Tom Mitchell: And so they talked about questions like, well, if all of the swans we've seen so far in our life are white, should we conclude that all swans are white? What would be the justification for that? Maybe there's a black swan out there that we haven't seen. And that debate went on for some time.

Tom Mitchell: Around 1300, William of Ockham suggested something that we now call Occam's razor, the policy that we should prefer the simplest hypothesis. So, indeed, if all the swans we've seen so far are white, then the simplest hypothesis is that all swans are white. That was his prescription.

Tom Mitchell: Later on, around 1600, Francis Bacon brought up the importance of data collection, of actively experimenting to collect data that could falsify hypotheses that weren't correct. And then in the 1700s, the philosopher David Hume really kind of nailed the problem of induction.

Tom Mitchell: He argued very persuasively that it's really impossible to generalize from examples if you don't have some additional assumption that you're making. And he pointed out that even the assumption that the future will be like the past is itself not a provable assumption; it is just a guess that we use. So his point was that people do induction, but it's a habit. It's not a justified, rational, provable, correct process.

Tom Mitchell: So they had plenty to say. Around the 1940s, when computers became available, Alan Turing, who's often called the father of computing, suggested that maybe computers could learn.

Tom Mitchell: He said, "Instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates a child's? If this were then subjected to an appropriate course of education, one would obtain the adult brain." So he had the idea that maybe computers could learn.

Tom Mitchell: But he did not have an algorithm by which they would learn. That waited until the 1950s, when there were two important seminal events. One was a computer program written by an IBM researcher named Art Samuel, and his program learned to play checkers. I'll just read you a couple of sentences from the abstract of this paper.

Tom Mitchell: He said, "Two machine learning procedures have been investigated in some detail using the game of checkers. Enough work has been done to verify the fact that a computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program."

Tom Mitchell: And then he went on to point out, "The principles of machine learning verified by these experiments are, of course, applicable to many other situations." So he had really maybe one of the first demonstrations of a program that learned to do something interesting.

Tom Mitchell: And he understood that the techniques he was using were very general. Now, how did he get the computer to learn to play checkers? His program learned an evaluation function that would assign a numerical score to any checkers position, and that score would be higher the better the checkers position was from your point of view as you're playing the game. And then you would use that to control a search, a look-ahead search for which move to take.

Tom Mitchell: That evaluation function was a linear weighted combination of board features that he made up, things like how many checkers are on the board that are mine, how many are on the board that are yours, and so forth.
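A linear evaluation function of the kind described here can be sketched in a few lines. This is only an illustration: the feature names, the weights, and the one-step look-ahead are assumptions made for the example, not Samuel's actual features or search procedure.

```python
# Illustrative sketch of a linear evaluation function for board positions.
# The features and weights below are hypothetical, not Samuel's originals.

def extract_features(board):
    """Turn a board summary (a dict of piece counts) into a feature vector."""
    return [
        board["my_pieces"],
        board["your_pieces"],
        board["my_kings"],
        board["your_kings"],
    ]

def evaluate(board, weights):
    """Score a position as a weighted sum of its features."""
    return sum(w * f for w, f in zip(weights, extract_features(board)))

def best_move(candidate_boards, weights):
    """One-step look-ahead: pick the successor position with the highest score."""
    return max(candidate_boards, key=lambda b: evaluate(b, weights))

# Learning would adjust these weights; here they are fixed by hand.
weights = [1.0, -1.0, 2.0, -2.0]

positions = [
    {"my_pieces": 8, "your_pieces": 8, "my_kings": 0, "your_kings": 0},
    {"my_pieces": 8, "your_pieces": 7, "my_kings": 0, "your_kings": 0},
    {"my_pieces": 7, "your_pieces": 8, "my_kings": 0, "your_kings": 0},
]
chosen = best_move(positions, weights)
print(chosen, evaluate(chosen, weights))
```

In this sketch the position where the opponent has lost a piece scores highest, so it is the one selected. What a learner would change is only the `weights` list.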

Tom Mitchell: So his program learned. What it learned was that evaluation function. How did it learn it? By playing games against itself. And he points out that in eight to ten hours, it could learn well enough to beat him. Those ideas persisted through the decades.

Tom Mitchell: They were reused over and over, including in the computer programs that finally beat the world chess champion, the world backgammon champion, and the world Go champion. So those ideas were really seminal.

Tom Mitchell: A second thing that happened in the fifties was the invention of the first early version of neural networks by Frank Rosenblatt from Cornell. He was interested in neuroscience: how can the neurons in the brain be used to learn?

Tom Mitchell: And he ended up building a neural network, simple at least by today's standards, that consisted of one layer of neurons, where there would be a receptive field input, say an image, and then the neurons would respond to that and produce an output set of neuron firings.

Tom Mitchell: What got learned in that case were the connection strengths between the input to the neuron and the probability that it would fire. And the way he trained it was what we now call supervised learning. You show an input and what the output should be. And he had schemes for updating those weights to fit the data.

Tom Mitchell: Now, the importance of this work is that it catalyzed a whole bunch of work in the 1960s, for the next decade, looking at different algorithms for tuning the weights of perceptron-style systems.

Tom Mitchell: That work proceeded for a decade or so, and at the end of the 1960s, two MIT scientists, Marvin Minsky and Seymour Papert, wrote a book called Perceptrons.

Tom Mitchell: But unfortunately, that book proved that a single-layer perceptron, which is the only thing we knew how to train at that point, could never even represent many, many functions that we wanted to learn. It could only represent linear functions, not even exclusive or, you know, where the output would be one if input one is a one and the other is a zero, or if it's a zero and a one, but the output would have to be zero if they were both one. You can't even represent that simple function with a perceptron, no matter how you train it.

Tom Mitchell: So this really kind of put the kibosh on work on perceptrons following the publication of this book.
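The exclusive-or limitation can be seen in a small experiment. The sketch below uses the textbook perceptron update rule on a single threshold unit; the learning rate, epoch limit, and function names are assumptions for illustration and are not taken from Rosenblatt's or Minsky and Papert's work. It trains on AND, which a single unit can represent, and on XOR, which it cannot.

```python
# Minimal single-unit perceptron with the textbook update rule.
# Hyperparameters and names are illustrative assumptions.

def predict(weights, bias, x):
    """Threshold unit: fire (1) if the weighted sum is positive."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def train_perceptron(examples, epochs=100, lr=1.0):
    """Return (weights, bias, number of mistakes in the final epoch)."""
    weights, bias = [0.0, 0.0], 0.0
    errors = 0
    for _ in range(epochs):
        errors = 0
        for x, target in examples:
            delta = target - predict(weights, bias, x)
            if delta != 0:
                errors += 1
                # Nudge the weights toward the desired output.
                weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
                bias += lr * delta
        if errors == 0:  # every example classified correctly
            break
    return weights, bias, errors

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_errors = train_perceptron(AND)
_, _, xor_errors = train_perceptron(XOR)
print("AND mistakes in final epoch:", and_errors)
print("XOR mistakes in final epoch:", xor_errors)
```

Training on AND reaches an epoch with no mistakes, while training on XOR never does, however many epochs are allowed, because no single line separates the XOR positives from the negatives.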

Tom Mitchell: Now, if we're not going to be able to, or don't want to, spend our time figuring out how to learn perceptrons, then what's next? Well, it turned out one of Minsky's PhD students, Patrick Winston, published his thesis the next year, and Winston suggested that instead of learning perceptron-type representations of information, we should learn symbolic descriptions. In his thesis, he showed how his program could learn descriptions of different physical structures like an arch or a tower.

Tom Mitchell: And he would train the program by showing it line drawings of positive and negative examples of, in this example, arches. And then the program would process those incrementally arriving examples to produce a symbolic description that would describe the different parts and the relations among them.

Tom Mitchell: For example, an arch could be two rectangles which don't touch each other, but which jointly support a roof of any shape. So this was an important step because it shifted the focus onto learning a much richer kind of representation, symbolic descriptions. And this became the new paradigm which dominated the 1970s.

Tom Mitchell: So during the seventies, there were a number of people working on learning symbolic descriptions. My favorite is the Meta-DENDRAL program, developed by Bruce Buchanan at Stanford. This program, again, was a symbolic learning program.

Tom Mitchell: What it learned was rules that would predict how molecules would shatter inside a mass spectrometer, and therefore predict what the mass spectrum of a new molecule would be. And those rules, again, symbolically described a subgraph of atoms within the molecular graph.

Tom Mitchell: And the rules would say, if you find this subgraph, then specific bonds in that subgraph are likely to fragment when you put this in a mass spectrometer. And this was an important step forward. I asked Bruce Buchanan how well it worked. What was this program able to do? Did it work?

Bruce Buchanan: Well, for one small class of steroid molecules, the ketoandrostanes, if you will, we had fewer than a dozen spectra, and we were able to tease out the rules that determine how a new ketoandrostane would fragment in a mass spectrometer, and we were able to publish that set of rules in a refereed chemistry journal.

Bruce Buchanan: And it was, to our knowledge, the first time that the result of a machine learning program, symbolic learning, had been published in a refereed journal.

Tom Mitchell: So that was an important milestone for machine learning, really the first time that a program discovered some knowledge that was useful enough to get published in that domain.

Tom Mitchell: Now, on a personal note, I was a PhD student at Stanford at the time, and Bruce became my PhD advisor, so my PhD thesis was also built around this same data set.

Tom Mitchell: And for my thesis I developed a system called Version Spaces that was the first symbolic learning algorithm where you could prove that it would converge, and furthermore, that the learner would know when it had converged, so it would know it was done.

Tom Mitchell: And it did that by maintaining not just one hypothesis that it would modify, but by keeping track of every hypothesis consistent with the data that it had seen. And this also opened up the possibility of what we call today active learning. It made it easy for the system to play twenty questions with the teacher.
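The idea of tracking every consistent hypothesis can be illustrated on a toy problem. The sketch below is an assumption-laden simplification: it enumerates a tiny hypothesis space explicitly, whereas the version space approach is usually described in terms of compact boundary sets. The attributes, the wildcard hypothesis language, and the query heuristic are all hypothetical choices made for the example.

```python
from itertools import product

# Toy attribute space (hypothetical). A hypothesis fixes a value per
# attribute or uses "?" to accept any value.
ATTRIBUTES = {
    "shape": ["circle", "square"],
    "color": ["red", "blue"],
    "size": ["small", "large"],
}
WILDCARD = "?"

def all_hypotheses():
    """Enumerate every conjunctive hypothesis over the attributes."""
    choices = [values + [WILDCARD] for values in ATTRIBUTES.values()]
    return list(product(*choices))

def matches(hypothesis, example):
    """True if the hypothesis classifies the example as positive."""
    return all(h == WILDCARD or h == v for h, v in zip(hypothesis, example))

def update(version_space, example, label):
    """Keep only the hypotheses that agree with the labeled example."""
    return [h for h in version_space if matches(h, example) == label]

def has_converged(version_space):
    """The learner knows it is done when one hypothesis remains."""
    return len(version_space) == 1

def best_query(version_space, unlabeled):
    """Pick the example that splits the remaining hypotheses most evenly."""
    def imbalance(example):
        positives = sum(matches(h, example) for h in version_space)
        return abs(2 * positives - len(version_space))
    return min(unlabeled, key=imbalance)

version_space = all_hypotheses()                      # 27 hypotheses
version_space = update(version_space, ("circle", "red", "small"), True)
version_space = update(version_space, ("circle", "red", "large"), True)

# "Twenty questions": choose the most informative example to ask about.
instances = list(product(*ATTRIBUTES.values()))
query = best_query(version_space, instances)
accepted_by = sum(matches(h, query) for h in version_space)

version_space = update(version_space, ("square", "red", "large"), False)
version_space = update(version_space, ("circle", "blue", "small"), False)
print(version_space, has_converged(version_space))
```

After two positive and two negative examples only the hypothesis "red circles of any size" remains, so the learner can tell it has converged. The chosen query is accepted by exactly half of the hypotheses remaining at that point, so either answer from the teacher eliminates half of them.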

Tom Mitchell: The system could ask the teacher, "Please label this example," so that it could reduce the set of hypotheses as quickly as possible. So by the end of the seventies, there seemed to be enough work going on in the field that it was time to hold a meeting.

Tom Mitchell: And so the first workshop in machine learning was held here at CMU in Wean Hall, a couple of buildings that direction. It was organized by Jaime Carbonell, who was an assistant professor here at the time; Richard Michalski, who was a more senior professor at Illinois; and myself. I was at the time an assistant professor at Rutgers University.

Tom Mitchell: And so we held this meeting, pulled together some people. One of the people who attended was a student of Richard Michalski named Tom Dietterich.

Tom Mitchell: And Tom went on to make many contributions in the field of machine learning. And so I asked Tom, what was the field like in 1980?

Tom Dietterich: I'd say it was really chaotic. You know, I attended that very first machine learning workshop that was organized. I think you were one of the core organizers, at CMU, and there were probably thirty people in the room and probably thirty completely different talks. I remember I was talking about a sort of algorithm comparison paper that I had published at IJCAI '79, I think, so just before that workshop, in which I was by hand executing these very simple algorithms for this kind of subgraph learning problem and comparing how many subgraph isomorphism calculations they had to do.

Tom Dietterich: But it was like the first attempt to actually compare multiple machine learning algorithms that were more or less trying to do the same thing. There were a couple of them there, and, you know, I think John Anderson was there talking about cognitive models.

Tom Dietterich: You were there talking about the beginnings of EBL and the LEX system for calculus, symbolic integration. I remember the most interesting talk, I thought, was Ross Quinlan's talk on ID3, where he was trying to take these reverse-enumerated chess endgames and learn decision trees that would completely, exactly, losslessly compress those giant tables into a small decision tree.

Tom Dietterich: A really important thing people should understand is that in those days we believed there was a right answer for our machine learning problems. And it would often happen that I would run the algorithms and it would not get the right answer. It would not get the logical expression that we thought was the right answer. It would get something that was actually equally accurate on the training data.

Tom Dietterich: And actually it worked pretty well, although we didn't really have a set idea of a separate test set in those days. I mean, it was not a field of statistics.

Tom Dietterich: The idea was, we were coming out of, really, the John McCarthy program of "Programs with Common Sense," which didn't have a lot to do with common sense, but was about: we're going to represent everything in logic, and we're going to use logical inference as the execution engine.

Tom Mitchell: So there's Tom's take on what things were like.

Tom Mitchell: He mentioned that he thought the most interesting talk was Ross Quinlan's talk. I agree, I thought that was the most interesting talk. Ross's talk presented the idea that we should learn decision trees.

Tom Mitchell: A decision tree is something where you classify your example by putting it at the root of the tree, and then you sort it down to a leaf in the tree based on its features, and the leaf tells you what the output classification label should be. That's what gets learned. So I asked Ross how he came up with this idea.

JR Quinlan: I had done a PhD under a psychologist, Earl Hunt. And part of his work involved decision trees, which I learned about, of course, as a student, but then put in the back of my mind for fifteen years or so. And then I was at Stanford on sabbatical at the same time as Donald Michie was teaching a course on learning, and he had a challenge for the class. I sat in on the class, and the challenge was to work out a way of predicting a win in a very simple chess endgame, king-rook versus king-knight. So I remembered Earl Hunt's work on decision trees, and I thought, well, maybe that would be the way to go.

JR Quinlan: So I developed a thing called ID3, which was just a simple decision tree program. No pruning, just straight decision tree. And that seemed to solve the problem pretty well, up to about ninety-five percent. And then I got that up to one hundred the next year. And then, remember, the first real time I talked about this was at that conference.

JR Quinlan: You organized the workshop in 1980 at Pittsburgh, at Carnegie Mellon. You, Richard, and Jaime all set up that workshop. And then I gave a talk on decision tree learning.

Tom Mitchell: So there's Ross's story. He got the idea of decision trees from his thesis advisor many years earlier, but it turns out Ross was the one who came up with the algorithm that actually successfully discovered useful decision trees. And that whole idea of decision tree learning became very important in the field.
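A small sketch can make the decision tree idea concrete: grow a tree from labeled examples, then classify by sorting an example from the root down to a leaf. The code below follows the general shape of ID3 (choose the feature with the highest information gain, recurse, no pruning), but it is a simplified illustration, not Quinlan's program. The weather-style data set and all names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Reduction in label entropy from splitting on one feature."""
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, features):
    """Return a leaf label (str) or a (feature, branches) tuple."""
    if len(set(labels)) == 1:
        return labels[0]                             # pure leaf
    if not features:
        return Counter(labels).most_common(1)[0][0]  # majority leaf
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    branches = {}
    for value in set(row[best] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree(
            [rows[i] for i in keep],
            [labels[i] for i in keep],
            [f for f in features if f != best],
        )
    return (best, branches)

def classify(tree, example):
    """Sort the example from the root down to a leaf; the leaf is the label."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[example[feature]]  # KeyError on an unseen value
    return tree

# Hypothetical toy data.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "overcast", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "play", "stay"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))
```

On this data the tree splits on `outlook` first and only consults `windy` on the rainy branch, and it reproduces every training label exactly, which is the kind of lossless compression of a table that Tom Dietterich describes.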

Tom Mitchell: By 2010, decision tree learning was probably one of the most commercially used approaches in machine learning. So in the early eighties, there were various experiments like these, trying to build machine learning systems, but really no theory, no theory that could tell us, for example, how many examples we would have to present to a learner in order for it to reliably learn.

Tom Mitchell: And that changed in 1984, when Les Valiant published a paper on what he calls probably approximately correct learning. It really was the first practical theory to tell us how many examples you would need.

Tom Mitchell: In particular, the number of examples you need depends on three things. First, the complexity of your hypothesis space. For example, if you're going to learn decision trees of depth two, that's a lot less complex than if you're learning decision trees of depth twelve.

Tom Mitchell: So it depends on how complex your hypotheses are. It depends on the error rate you're willing to tolerate in the final hypothesis: one percent error, five percent error. And it also depends on the probability you're willing to put up with that, even if you do choose that many randomly provided training examples, you'll still fail.

Tom Mitchell: You can't guarantee that you won't fail, but you can reduce that probability. So this was a breakthrough in the area of theoretical characterization of algorithms. So I asked Les what he thought was the key idea there.

Leslie Valiant: It's a kind of a model of computation.

Leslie Valiant: But, yeah, it makes sense because it's got some applications.

Leslie Valiant: So the particular result which persuaded people that there was something there is this result about conjunctive normal form formulas. From NP-completeness at the time, we already knew there's some hardness in it, because if someone gave you the formula, it was computationally difficult to find out whether it's null, whether it's equivalent to the formula which is always zero, which is never satisfiable.

Leslie Valiant: On the other hand, this kind of conjunctive normal form formula, with three variables in each clause, was PAC learnable. And so it was a bit striking that something which is very hard is learnable.

Leslie Valiant: But then this highlighted the difference between computing and learning, because with the learning model, the idea was that there was a distribution of inputs. And you learn from this distribution, but you only have to be good on this distribution when you have to predict.

Leslie Valiant: So if, for example, in this formula, there were some very rare ones which are so very rare, then the learner wouldn't have to know about that. So in this sense this was easier than the NP-completeness.

Tom Mitchell: So I was actually quite surprised at that answer. What he's saying, put another way, is that what was really interesting there is that for this one kind of hypothesis, conjunctive normal form, which is a kind of logical expression, if your hypotheses are of that form, then it's easier to learn them than it is to compute them.

Tom Mitchell: When he says compute them, what he means is the cost of answering the question, can you find a positive example of this? And it was known at the time that the computational cost of answering that question, is there a positive example of this formula, was exponential in the size of the formula.

Tom Mitchell: And then he discovered that learning a formula, if somebody gives you positive and negative examples, only takes polynomial, less than exponential, time. So I agree with him that that's a fascinating theoretical fact, but that would not be the answer I would give about why this revolutionized the field of machine learning.

Tom Mitchell: It revolutionized the field, in my view, because he was really the first person to come up with a new framing of the machine learning problem that even allowed this kind of theoretical analysis.

Tom Mitchell: In particular, his framing included assumptions like: the training data would come from some source that would give you random examples according to some probability distribution. And then later, when you wanted to test your hypothesis on new data, you would get more random examples from that same source.

Tom Mitchell: And so he reframed the problem in a way that made theory possible. The consequence of that was he catalyzed a huge amount of theoretical work in machine learning, which continues to this day and just keeps branching further and further. There are conferences specifically designed to cover theoretical computer science.
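The three quantities mentioned in the talk (hypothesis space complexity, tolerated error rate, and tolerated failure probability) can be tied together with one standard textbook bound. The sketch below assumes a finite hypothesis space H and a learner that outputs a hypothesis consistent with all training examples; under those assumptions, m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice. This is a commonly taught form of the result, not necessarily the exact statement in Valiant's paper, and the function name is hypothetical.

```python
import math

def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Examples sufficient so that, with probability at least 1 - delta,
    any hypothesis consistent with the data has true error at most epsilon.
    Assumes a finite hypothesis space and a consistent learner."""
    needed = (math.log(hypothesis_count) + math.log(1 / delta)) / epsilon
    return math.ceil(needed)

# Conjunctions over 10 boolean features: each feature appears positively,
# appears negated, or is left out, giving 3**10 hypotheses.
hypotheses = 3 ** 10
print(pac_sample_bound(hypotheses, epsilon=0.1, delta=0.05))   # 140
print(pac_sample_bound(hypotheses, epsilon=0.01, delta=0.05))  # tighter error
print(pac_sample_bound(3 ** 20, epsilon=0.1, delta=0.05))      # richer space
```

The bound grows only logarithmically in the size of the hypothesis space and in 1/delta, but linearly in 1/epsilon, so demanding a lower error rate is the most expensive of the three requirements.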

Tom Mitchell: So the eighties was really a very generative decade. There were a lot of things going on. Another thing that was going on was that some people were looking at human learning and how that might inspire our models of AI and machine learning. One such effort was here at CMU, by Allen Newell and his two PhD students, John Laird and Paul Rosenbloom.

Tom Mitchell: They built a system they called Soar, which was really one of the first AI agents designed to capture the full breadth of what humans do: play games, solve problems, many different tasks. So they framed their machine learning problem as one of getting a general agent to learn.

Tom Mitchell: And their architecture had very interesting properties that I think are relevant today, now that agents are again a topic of hot activity. I won't go into the details, but in the podcast there's an interview with John Laird, who goes into detail on this. Another item that can't be overlooked in the eighties was really the rebirth of neural networks.

Tom Mitchell: Remember, at the end of the sixties, Tom Mitchell: Minsky and Papert published that Tom Mitchell: book that killed off work on Tom Mitchell: perceptrons? Tom Mitchell: Well, in the mid eighties, Tom Mitchell: finally, people came up with an Tom Mitchell: algorithm that could train not Tom Mitchell: just one-layer perceptrons, but Tom Mitchell: multilayer perceptrons. Tom Mitchell: And that allowed learning Tom Mitchell: functions that were highly Tom Mitchell: non-linear.

Tom Mitchell: And Dave Rumelhart, Jay Tom Mitchell: McClelland and Geoff Hinton were Tom Mitchell: three of the ringleaders of this Tom Mitchell: effort. Tom Mitchell: So I asked Geoff about that period. Tom Mitchell: Now we're up to the mid eighties Tom Mitchell: when really neural nets are Tom Mitchell: reborn. Tom Mitchell: Is that the right word? Tom Mitchell: How would you. Geoffrey Hinton: Backprop with backpropagation? Geoffrey Hinton: I mean, we didn't invent it.

Geoffrey Hinton: It was invented by several different Geoffrey Hinton: groups, but we showed that it Geoffrey Hinton: really worked to learn Geoffrey Hinton: representations. Geoffrey Hinton: And as you know, sort of one of the big problems in AI is how do Geoffrey Hinton: you learn new representations? Geoffrey Hinton: How do you avoid having to put them all in by hand?

Geoffrey Hinton: And my particular example, Geoffrey Hinton: which was the family trees Geoffrey Hinton: example, where you take all the Geoffrey Hinton: information in some family Geoffrey Hinton: trees, you convert it into Geoffrey Hinton: triples of symbols like John has Geoffrey Hinton: Father Mary. Geoffrey Hinton: And then you train a neural Geoffrey Hinton: net to predict the last term in Geoffrey Hinton: a triple. Geoffrey Hinton: Given the first two terms.

Geoffrey Hinton: So it's just like the big language models. Geoffrey Hinton: You're predicting the next word given the context. Geoffrey Hinton: It's just much simpler.

Geoffrey Hinton: I had one hundred and twelve Geoffrey Hinton: total examples, of which one Geoffrey Hinton: hundred and four were training Geoffrey Hinton: examples and eight were test Geoffrey Hinton: examples, which is a bit less Geoffrey Hinton: than the trillion examples they Geoffrey Hinton: have nowadays, Geoffrey Hinton: but it was the same idea. Geoffrey Hinton: You convert a symbol into a feature vector.

Geoffrey Hinton: You then have the feature vectors of the context interact Geoffrey Hinton: via a hidden layer. Geoffrey Hinton: They then predict the features Geoffrey Hinton: of the next symbol, and from Geoffrey Hinton: those features you guess what Geoffrey Hinton: the next symbol should be, and Geoffrey Hinton: you try and maximize the Geoffrey Hinton: probability of predicting the Geoffrey Hinton: next symbol.

Geoffrey Hinton: And you then backpropagate Geoffrey Hinton: through the feature interactions Geoffrey Hinton: and through the process of Geoffrey Hinton: converting a symbol into Geoffrey Hinton: features. Geoffrey Hinton: And that way you learn feature vectors to represent the Geoffrey Hinton: symbols and how these vectors should interact to predict the Geoffrey Hinton: features of the next symbol. Geoffrey Hinton: And that's what these big language models do.
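The recipe Geoff describes (look up a feature vector per symbol, let the context features interact through a hidden layer, predict the next symbol, and backpropagate all the way into the feature vectors) can be sketched in a few lines of NumPy. This is only an illustrative toy under stated assumptions: the six triples, the layer sizes, the learning rate, and every variable name are hypothetical, and it is not the network or the data from the original family trees experiment.

```python
import numpy as np

# Hypothetical toy triples: (first symbol, second symbol, symbol to predict).
TRIPLES = [
    ("john", "has_father", "bob"),
    ("john", "has_mother", "mary"),
    ("ann", "has_father", "bob"),
    ("ann", "has_mother", "mary"),
    ("bob", "has_wife", "mary"),
    ("mary", "has_husband", "bob"),
]
symbols = sorted({s for triple in TRIPLES for s in triple})
index = {s: i for i, s in enumerate(symbols)}
V, D, H = len(symbols), 4, 8   # vocabulary, feature, and hidden sizes (assumed)
LR = 0.3                       # learning rate (assumed)

rng = np.random.default_rng(0)
E = rng.normal(0.0, 0.5, (V, D))        # one learned feature vector per symbol
W1 = rng.normal(0.0, 0.5, (2 * D, H))   # context features -> hidden layer
W2 = rng.normal(0.0, 0.5, (H, V))       # hidden layer -> score per next symbol

X = np.array([[index[a], index[b]] for a, b, _ in TRIPLES])
y = np.array([index[c] for _, _, c in TRIPLES])
rows = np.arange(len(y))

def forward(X):
    ctx = E[X].reshape(len(X), -1)      # symbols -> concatenated feature vectors
    h = np.tanh(ctx @ W1)               # features interact via a hidden layer
    logits = h @ W2
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p = p / p.sum(axis=1, keepdims=True)  # probability of each next symbol
    return ctx, h, p

losses = []
for _ in range(1000):
    ctx, h, p = forward(X)
    losses.append(float(-np.log(p[rows, y]).mean()))  # negative log-likelihood
    d_logits = p.copy()
    d_logits[rows, y] -= 1.0
    d_logits /= len(y)
    d_W2 = h.T @ d_logits
    d_pre = (d_logits @ W2.T) * (1.0 - h ** 2)        # back through tanh
    d_W1 = ctx.T @ d_pre
    d_ctx = (d_pre @ W1.T).reshape(len(X), 2, D)
    d_E = np.zeros_like(E)
    np.add.at(d_E, X, d_ctx)   # backpropagate into the symbol feature vectors
    E -= LR * d_E
    W1 -= LR * d_W1
    W2 -= LR * d_W2

predicted = [symbols[i] for i in forward(X)[2].argmax(axis=1)]
print(round(losses[0], 3), round(losses[-1], 3), predicted)
```

The point of the sketch is the last gradient step: the feature vectors in `E` are trained by the same error signal as the weights, which is the sense in which the representations are learned rather than put in by hand.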

Tom Mitchell: So there's Geoff's mid Tom Mitchell: nineteen eighties work on Tom Mitchell: backpropagation. Tom Mitchell: Another personal note: in Tom Mitchell: nineteen eighty six, while this Tom Mitchell: was going on, I came to spend a Tom Mitchell: year at CMU as a visiting Tom Mitchell: professor. Tom Mitchell: And I got to meet Allen Newell at the time. Tom Mitchell: And Allen said, hey, do you want to team teach a course?

Tom Mitchell: We'll teach a course on Tom Mitchell: architectures for intelligent Tom Mitchell: agents. Tom Mitchell: And of course I said yes. Tom Mitchell: The opportunity to teach with Allen. Tom Mitchell: And he said, by the way, there Tom Mitchell: will be another, uh, an Tom Mitchell: assistant professor working with Tom Mitchell: us. Tom Mitchell: The three of us will team teach it. Tom Mitchell: That's Geoff Hinton.

Tom Mitchell: So Allen, Geoff and I team Tom Mitchell: taught in spring of nineteen Tom Mitchell: eighty six. Tom Mitchell: Uh, this course was one of the best experiences of my career up Tom Mitchell: to that point. Tom Mitchell: And so it was a large part of the reason why I ended up Tom Mitchell: staying at CMU. Tom Mitchell: But when I came, I was here Tom Mitchell: for about a year, and then Geoff Tom Mitchell: moved on.

Tom Mitchell: He moved up to the University of Toronto and started Tom Mitchell: building up a group there. Tom Mitchell: One of the people who joined his group was a person named Yann Tom Mitchell: LeCun, who went on to win the Turing Award jointly with Geoff Tom Mitchell: and Yoshua Bengio for their work in neural networks. Tom Mitchell: So I asked Yann about this period.

Yann LeCun: And then, mid nineteen Yann LeCun: eighty seven, I moved to Toronto Yann LeCun: to do a postdoc with Geoff, and I Yann LeCun: completed this, the Yann LeCun: simulator. Yann LeCun: Geoff thought I was not doing Yann LeCun: anything because I was just Yann LeCun: basically hacking, you know, all Yann LeCun: the time, Yann LeCun: and this, this system was kind of Yann LeCun: interesting because we had to build a front end language to Yann LeCun: interact with it.

Yann LeCun: And that language was the Lisp Yann LeCun: interpreter that Leon and I Yann LeCun: wrote. Yann LeCun: And so we were using Lisp as a front end to kind of Yann LeCun: a neural net simulator. Yann LeCun: And I, you know, implemented Yann LeCun: weight-sharing abilities Yann LeCun: and all that stuff and started Yann LeCun: experimenting with what became Yann LeCun: convolutional nets.

Yann LeCun: You know, when I was a postdoc in Toronto, early nineteen Yann LeCun: eighty eight, roughly, and started to get really good Yann LeCun: results on, you know, very simple shape recognition, like, Yann LeCun: handwritten characters that I had drawn with my mouse or Yann LeCun: something like that.

Yann LeCun: Right. Tom Mitchell: So, as you just heard, Yann was Tom Mitchell: experimenting with whether we can apply Tom Mitchell: neural networks to the problem Tom Mitchell: of character recognition, Tom Mitchell: written characters. Tom Mitchell: People were experimenting with many different uses of neural Tom Mitchell: nets at the time. Tom Mitchell: My favorite, the one I would vote application of the decade, Tom Mitchell: was done in the area, Tom Mitchell: surprisingly, of self-driving cars.

Tom Mitchell: There was a PhD student here at CMU named Dean Pomerleau. Tom Mitchell: He trained a neural network Tom Mitchell: where the input was an image Tom Mitchell: taken by a camera looking out Tom Mitchell: the front windshield of a Tom Mitchell: vehicle. Tom Mitchell: And the output of the neural Tom Mitchell: network was the steering command Tom Mitchell: telling the car which direction Tom Mitchell: to steer. Tom Mitchell: So I asked Dean about that work.

Tom Mitchell: How much training data did you have? Dean Pomerleau: So the interesting thing was, to Dean Pomerleau: begin with, it was all batch Dean Pomerleau: training. Dean Pomerleau: So I'd drive, I'd have a person drive the vehicle along Schenley Dean Pomerleau: Park, uh, Flagstaff Hill Path, and then I would go off and Dean Pomerleau: crunch it overnight. Dean Pomerleau: But in the end, what we were Dean Pomerleau: able to do is, uh, real time Dean Pomerleau: learning.

Dean Pomerleau: So one drive up the hill with a Dean Pomerleau: human behind the wheel steering, Dean Pomerleau: and the neural network learning Dean Pomerleau: to pair camera Dean Pomerleau: images with the steering command Dean Pomerleau: that the human was giving, was Dean Pomerleau: able to, uh, train it in about Dean Pomerleau: five minutes to, uh, take over Dean Pomerleau: and steer on its own from there Dean Pomerleau: on, on that road and on similar Dean Pomerleau: roads.

Dean Pomerleau: So it was one of the first real time, real world vision Dean Pomerleau: applications of, uh, of artificial neural networks. Going Dean Pomerleau: beyond just Flagstaff Hill, you know, the little paths on there, Dean Pomerleau: we went out on, on real roads, first through the golf Dean Pomerleau: course, Schenley Golf Course, on the, uh, on the road there.

Dean Pomerleau: And then we, we went on, you know, the local highways, in Dean Pomerleau: fact, the longest as part of my PhD, the longest trip we did Dean Pomerleau: was, I think, about one hundred miles at the time from basically Dean Pomerleau: up, uh, I-79 from Pittsburgh all the way up to Erie. Dean Pomerleau: Uh, and it drove basically the, the whole way. Dean Pomerleau: And it was getting up to fifty five miles per hour after Dean Pomerleau: we got a faster vehicle.

Tom Mitchell: It turns out he didn't ask for permission. Tom Mitchell: So this was all happening in the nineteen eighties. Tom Mitchell: Really, it was a decade of Tom Mitchell: amazing invention and innovation Tom Mitchell: and exploration. Tom Mitchell: Another important thing that Tom Mitchell: happened in that decade was the Tom Mitchell: development of reinforcement Tom Mitchell: learning.

Tom Mitchell: The way to understand that is to first realize that supervised Tom Mitchell: learning was the kind of standard way of framing the Tom Mitchell: machine learning question. Tom Mitchell: When Dean talked about training Tom Mitchell: his system, he would input an Tom Mitchell: image. Tom Mitchell: He had people drive the car, so he got a lot of training Tom Mitchell: examples of the form: Tom Mitchell: here's the image and here's the correct steering command.

Tom Mitchell: So he could tell the neural network, for this input, Tom Mitchell: here's the correct output. Tom Mitchell: That's called supervised learning. Tom Mitchell: But reinforcement learning reframes the problem. Tom Mitchell: It takes into account that sometimes we don't know what the Tom Mitchell: right output is.

Tom Mitchell: For example, if you're learning to play chess, you might not Tom Mitchell: have a person who tells you at every step given this board Tom Mitchell: position, here's the right move. Tom Mitchell: Instead, you might have to wait until the end of the game after Tom Mitchell: you've made many moves to get the feedback signal that says Tom Mitchell: you lost or you won, and then you have to figure out what to Tom Mitchell: do about that because you actually took many moves.
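To make the delayed-feedback problem concrete, here is a minimal sketch of tabular Q-learning, one standard reinforcement learning method (the talk does not name a specific algorithm). The "game" is a hypothetical seven-position corridor standing in for chess: the learner hears nothing until the episode ends, when it is told it won (+1) or lost (-1). The environment, the constants, and all names are assumptions chosen for illustration.

```python
import random

N_STATES = 7              # positions 0..6; reaching 0 is a loss, 6 is a win
ACTIONS = (-1, +1)        # step left or step right
GAMMA, ALPHA = 0.9, 0.5   # discount factor and learning rate (arbitrary)

def step(state, action):
    """Move one position; a reward arrives only when the game ends."""
    nxt = state + action
    if nxt == N_STATES - 1:
        return nxt, 1.0, True     # you won
    if nxt == 0:
        return nxt, -1.0, True    # you lost
    return nxt, 0.0, False        # no feedback yet

def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action index]
    for _ in range(episodes):
        state, done = N_STATES // 2, False
        while not done:
            a = rng.randrange(2)                # explore with random moves
            nxt, reward, done = step(state, ACTIONS[a])
            # Credit assignment: pull this move's value toward what followed.
            target = reward if done else reward + GAMMA * max(q[nxt])
            q[state][a] += ALPHA * (target - q[state][a])
            state = nxt
    return q

q = train()
# Preferred move in each non-terminal position after learning.
greedy = [ACTIONS[row.index(max(row))] for row in q[1:-1]]
print(greedy)
```

Because each move's value is updated from the value of the position it leads to, the win signal at the far end gradually propagates back to earlier moves, so after training the preferred move in every interior position should be the step toward the winning end.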

Tom Mitchell: So that's what reinforcement learning is about. Tom Mitchell: And Rich Sutton and Andy Barto were instrumental in kind of Tom Mitchell: framing that problem and, and working on it. Tom Mitchell: They recently won the Turing Award for this work. Tom Mitchell: So I asked Rich how Tom Mitchell: reinforcement learning fit into Tom Mitchell: the field.

Rich Sutton: The field of machine learning Rich Sutton: has always been dominated Rich Sutton: by the more straightforward Rich Sutton: supervised approach. Rich Sutton: There was, as I mentioned at the very beginning, Rich Sutton: the rewards and penalties were very much a part of it.

Rich Sutton: But then the focus, as Rich Sutton: things became clearer and Rich Sutton: better defined, Rich Sutton: the learning Rich Sutton: problem then became pattern Rich Sutton: recognition and supervised Rich Sutton: learning. Rich Sutton: And this fellow, the strange, uh, fellow Harry Klopf, Rich Sutton: recognized this more than other people and Rich Sutton: wrote some reports and ultimately a book, saying Rich Sutton: that something had been lost.

Rich Sutton: And Andy Barto and I picked up on his work Rich Sutton: and eventually realized that he was right, that something had Rich Sutton: been left out, and in some sense it was obvious that something Rich Sutton: had been left out Rich Sutton: from the point of view of Rich Sutton: psychology, where I'd been Rich Sutton: studying how animals learn.

Rich Sutton: And animals learn, really, in both ways, in both a Rich Sutton: supervised way and a Rich Sutton: reinforcement way. Rich Sutton: And so, we picked up on that and made that into a well Rich Sutton: defined area in the... Tom Mitchell: When was that? Rich Sutton: That would have been in the eighties. Tom Mitchell: And then finally, you wrote a book on it in ninety eight. Tom Mitchell: So then it became a clear, uh, subfield of machine learning.

Rich Sutton: Yeah. Rich Sutton: But the key thing, the way I say it to Rich Sutton: myself, is: why is reinforcement learning powerful? Rich Sutton: Potentially powerful. Rich Sutton: It's powerful because it's learning. Rich Sutton: It's really learning from experience. Rich Sutton: Learning from the normal data Rich Sutton: that an animal or a person would Rich Sutton: get.

Rich Sutton: And it doesn't require Rich Sutton: specially prepared data like you Rich Sutton: of course do in supervised Rich Sutton: learning. Tom Mitchell: So during the eighties, there were a lot of other really Tom Mitchell: interesting things going on. Tom Mitchell: Uh, people experimenting with Tom Mitchell: the idea that maybe machines Tom Mitchell: should learn by simulating Tom Mitchell: evolution.

Tom Mitchell: There was an entire set of conferences on something called Tom Mitchell: genetic algorithms, genetic programming, which had to do Tom Mitchell: with that sort of thing. Tom Mitchell: Uh, a cluster of work on Tom Mitchell: studying human learning and Tom Mitchell: other areas. Tom Mitchell: But we don't have time for all of those.

Tom Mitchell: Let's move on to the nineteen Tom Mitchell: nineties, when, again, there was Tom Mitchell: a, I would say, a sea change in Tom Mitchell: terms of the style of work that Tom Mitchell: went on. Tom Mitchell: The theme of the nineteen nineties was really the Tom Mitchell: integration of statistical and probabilistic methods into the Tom Mitchell: field of machine learning.

Tom Mitchell: And a lot of that took the Tom Mitchell: grounded form of learning a new Tom Mitchell: kind of object, which people Tom Mitchell: called either graphical models Tom Mitchell: or Bayes nets. Tom Mitchell: But what got learned in that Tom Mitchell: case was, again, a network where Tom Mitchell: each node would represent a Tom Mitchell: variable. Tom Mitchell: For example, maybe you would be interested in predicting whether Tom Mitchell: somebody has lung cancer.

Tom Mitchell: You'd make that a variable and maybe you'd have evidence like Tom Mitchell: are they a smoker? Tom Mitchell: Do they have a normal or abnormal X-ray result? Tom Mitchell: You'd make those variables. Tom Mitchell: And then the edges in the graph represent probabilistic Tom Mitchell: dependencies among the variables in a way such that in the end, Tom Mitchell: the whole graph represents the full joint probability Tom Mitchell: distribution over the entire collection of variables.
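The lung cancer example can be written down as a three-node network, smoker → cancer → X-ray, whose edges carry conditional probabilities and whose product gives the full joint distribution. The sketch below shows only the representation and one inference query, not the learning algorithms discussed next; the graph shape and every probability in it are made-up numbers chosen for illustration.

```python
from itertools import product

# Hypothetical conditional probability tables (all numbers are assumptions).
P_SMOKER = {True: 0.3, False: 0.7}                   # P(smoker)
P_CANCER_GIVEN_SMOKER = {True: 0.10, False: 0.01}    # P(cancer | smoker)
P_ABNORMAL_GIVEN_CANCER = {True: 0.90, False: 0.05}  # P(abnormal X-ray | cancer)

def bernoulli(p_true, value):
    """Probability of a True/False value when P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(smoker, cancer, abnormal):
    """Joint probability factorized along the edges of the graph."""
    return (P_SMOKER[smoker]
            * bernoulli(P_CANCER_GIVEN_SMOKER[smoker], cancer)
            * bernoulli(P_ABNORMAL_GIVEN_CANCER[cancer], abnormal))

def posterior_cancer(smoker, abnormal):
    """P(cancer | smoker, X-ray result), by summing out the joint."""
    numerator = joint(smoker, True, abnormal)
    evidence = sum(joint(smoker, c, abnormal) for c in (True, False))
    return numerator / evidence

bools = (True, False)
total = sum(joint(s, c, x) for s, c, x in product(bools, repeat=3))
prior = sum(joint(s, True, x) for s, x in product(bools, repeat=2))
print(total, prior, posterior_cancer(smoker=True, abnormal=True))
```

With these made-up tables, three small conditional tables stand in for the eight-entry joint table, and observing a smoker with an abnormal X-ray raises the probability of cancer well above its prior.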

Tom Mitchell: So that's what got learned, and how it got learned Tom Mitchell: waited for some algorithms to be discovered.

Tom Mitchell: Although Judea Tom Mitchell: Pearl came up with the idea of Tom Mitchell: how to represent these, Tom Mitchell: one of the key people who was Tom Mitchell: involved in inventing those Tom Mitchell: algorithms was Tom Mitchell: Daphne Koller, a professor at Tom Mitchell: Stanford, who was one of the most Tom Mitchell: active researchers in terms of Tom Mitchell: designing algorithms for Tom Mitchell: learning these. Tom Mitchell: So I asked her, why do we need graphical models?

Daphne Koller: Graphical models, for me, emerged by realizing that the Daphne Koller: problems that we needed to solve to address most real world Daphne Koller: applications went beyond. Daphne Koller: You have a vector representation Daphne Koller: of an input and a single, Daphne Koller: oftentimes binary or at best Daphne Koller: continuous output. Daphne Koller: There was so much more opportunity to think about Daphne Koller: richly structured environments, richly structured problems.

Daphne Koller: So even if you think about problems like understanding what Daphne Koller: is in an image, that's not a single label problem of there is Daphne Koller: a dog, because images are complex and there's Daphne Koller: interrelationships between the different objects. You wanted to Daphne Koller: get beyond the yes/no "is there a dog in this image?" to something Daphne Koller: that is much more rich.

Daphne Koller: There's a dog and a Frisbee and Daphne Koller: a beach and three kids building Daphne Koller: a sandcastle. Daphne Koller: You have a rich input and a rich output. Daphne Koller: Thinking about these richly Daphne Koller: structured domains gave rise to Daphne Koller: we have to think about multiple Daphne Koller: variables.

Daphne Koller: We have to think about the Daphne Koller: interactions between those Daphne Koller: variables and leverage that Daphne Koller: structure both in our input and Daphne Koller: output space in order to get to Daphne Koller: much better conclusions and deal Daphne Koller: with problems that really Daphne Koller: matter.

Tom Mitchell: So this work on training Tom Mitchell: graphical models was really part Tom Mitchell: of a bigger theme that decade, Tom Mitchell: which was just the integration Tom Mitchell: of statistical methods with what Tom Mitchell: had been pretty much statistics Tom Mitchell: free machine learning up to that Tom Mitchell: point. Tom Mitchell: Another person who was Tom Mitchell: instrumental in that was a Tom Mitchell: Berkeley professor named Mike Tom Mitchell: Jordan.

Tom Mitchell: I asked him about the Tom Mitchell: relationship between statistics Tom Mitchell: and machine learning. Michael I. Jordan: So anyway, by the time I wanted to move to Berkeley, I Michael I. Jordan: was realizing that I was missing the whole statistics community, Michael I. Jordan: that, uh, it was just separate from machine learning, as maybe Michael I. Jordan: you kind of remember, there was occasionally a little leakage, Michael I. Jordan: but it was way too separate.

Michael I. Jordan: And nowadays we're often seeing, you know, people will Michael I. Jordan: run a machine learning method, but then it's not calibrated. Michael I. Jordan: It, you know, has bias and all that. Michael I. Jordan: And that's the thing statisticians have talked about Michael I. Jordan: for a long, long time. Michael I. Jordan: And so nowadays I think it's a given that, yeah, they're, Michael I. Jordan: they're kind of two parts, two sides of the same coin.

Michael I. Jordan: Machine learning is maybe a little more engineering in order Michael I. Jordan: to build a system and make it do great things in the world. Michael I. Jordan: And statistics is a little bit more, well, let's be cautious. Michael I. Jordan: Let's say we're going to do like clinical trials.

Michael I. Jordan: Let's make sure that the answer is really trustable, but Michael I. Jordan: those are two sides of the same coin, and I think that's Michael I. Jordan: probably pretty much clear now. Michael I. Jordan: But for a long time there was a resistance. Michael I. Jordan: Everyone said this is a brand new field, this is different. Michael I. Jordan: And I kept, again, annoying colleagues by saying, no, I Michael I. Jordan: don't believe it is.

Michael I. Jordan: So anyway, long story short, it is. Tom Mitchell: It is remarkable to me that Tom Mitchell: the field of machine learning Tom Mitchell: went through most of the Tom Mitchell: nineteen eighties kind of Tom Mitchell: without even noticing that Tom Mitchell: statistics existed. Michael I. Jordan: I mean, people like Leo Breiman Michael I. Jordan: were around to help make the Michael I. Jordan: passage.

Michael I. Jordan: So ensemble methods, they were kind of invented by Leo in the stat Michael I. Jordan: literature, but they were independently invented in the Michael I. Jordan: machine learning literature. Michael I. Jordan: And is that machine learning or statistics? Michael I. Jordan: Well, clearly it's both and it needs both perspectives.

Michael I. Jordan: And yes, in the nineteen nineties, the EM algorithm, Michael I. Jordan: you know, the graphical models, they were they had, they had uh, Michael I. Jordan: so yeah, the nineties, it was a real flourishing of that. Tom Mitchell: So Mike mentioned that one of the themes was ensemble methods.

Tom Mitchell: So anyway, I think that's Tom Mitchell: actually a very nice example of Tom Mitchell: how machine learning theory and Tom Mitchell: statistical theory kind of Tom Mitchell: intertwined. Tom Mitchell: The idea of ensemble learning is Tom Mitchell: instead of learning one Tom Mitchell: hypothesis, let's learn multiple Tom Mitchell: ones.

Tom Mitchell: For example, instead of learning Tom Mitchell: a decision tree, you might learn Tom Mitchell: a whole forest of decision Tom Mitchell: trees. Tom Mitchell: And then when it comes to Tom Mitchell: classifying a new example, you Tom Mitchell: give it to all of the Tom Mitchell: classifiers and you let them Tom Mitchell: vote and you take the vote of Tom Mitchell: the classifiers.
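The vote-of-many-classifiers idea can be sketched with the standard library alone. As a stated simplification, each "tree" below is a one-split decision stump, and the ensemble is built by bagging (training each stump on a bootstrap resample of the data); the noisy 1-D data set, the number of models, and all function names are hypothetical choices for illustration.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Pick the threshold and direction that best split a 1-D sample."""
    best = None
    for t, _ in sample:
        for sign in (1, -1):
            errors = sum((sign * (x - t) > 0) != label for x, label in sample)
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    _, t, sign = best
    return lambda x: sign * (x - t) > 0

def fit_ensemble(data, n_models=25, seed=0):
    """Train each stump on its own bootstrap resample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(fit_stump(bootstrap))
    return models

def vote(models, x):
    """Give x to every classifier and return the majority label."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

# Hypothetical data: the label is (x > 0.5), with 10% of labels flipped.
rng = random.Random(1)
data = []
for _ in range(60):
    x = rng.random()
    label = x > 0.5
    if rng.random() < 0.1:
        label = not label
    data.append((x, label))

models = fit_ensemble(data)
test_points = [i / 20 for i in range(21)]
accuracy = sum(vote(models, x) == (x > 0.5) for x in test_points) / len(test_points)
print(accuracy)
```

An odd number of models avoids tied votes for a two-class problem. The resampling step is where the link to statistics shows up: each classifier sees a slightly different version of the data, and the vote averages out their individual quirks.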

Tom Mitchell: Well, that turned out to be very Tom Mitchell: successful and commercially very Tom Mitchell: important. Tom Mitchell: But it also is a beautiful Tom Mitchell: example where there's a Tom Mitchell: pretty interesting theory around Tom Mitchell: that. Tom Mitchell: And initially, Yoav Freund and Robert Schapire, uh, in the early Tom Mitchell: nineties, uh, started working on a theory and methods for doing Tom Mitchell: this kind of ensemble.

Tom Mitchell: Leo Breiman, who was a statistician, recognized that Tom Mitchell: this echoed some of the themes of resampling in statistics. Tom Mitchell: And those two things, uh, kind Tom Mitchell: of came together in a very Tom Mitchell: successful way. Tom Mitchell: So in the nineties and the first Tom Mitchell: decade of the two thousands, Tom Mitchell: there were many other things Tom Mitchell: going on.

Tom Mitchell: The development of things called support vector machines, Tom Mitchell: kernel methods, which were mathematical techniques for Tom Mitchell: learning very nonlinear classifiers that were actually Tom Mitchell: commercially important and opened the door in many cases to Tom Mitchell: machine learning for non-numerical data, data like Tom Mitchell: images or text. Tom Mitchell: There was work on manifold learning.

Tom Mitchell: There was also growing Tom Mitchell: commercialization during that Tom Mitchell: decade. Tom Mitchell: More and more companies were Tom Mitchell: starting to use machine learning Tom Mitchell: commercially.

Tom Mitchell: But for me, the theme of that first decade of the two thousands Tom Mitchell: was really a growing awareness by many people that, you know, Tom Mitchell: maybe we have good enough machine learning algorithms that Tom Mitchell: the bottleneck to more accuracy is not the algorithm. Tom Mitchell: Maybe we need more data and more computation.

Tom Mitchell: And this idea was crystallized in this beautiful paper written Tom Mitchell: in two thousand and nine by three authors at Google, called Tom Mitchell: The Unreasonable Effectiveness of Data, which really Tom Mitchell: highlighted cases where, if you want better Tom Mitchell: results, keep your same algorithm, get more data. Tom Mitchell: And that was kind of a theme of what was going on at the time, Tom Mitchell: but things really broke open in the year twenty twelve.

Tom Mitchell: In twenty twelve, the computer vision community had Tom Mitchell: been using a data set created by Fei-Fei Li called ImageNet to Tom Mitchell: test out different vision algorithms, see who could do the Tom Mitchell: best job of labeling which object was the primary object in Tom Mitchell: an image, and the ImageNet data set was very large. Tom Mitchell: In twenty twelve, Geoff Hinton and some of his students entered Tom Mitchell: the competition and they blew away the competition.

Tom Mitchell: What's interesting is they were the only neural network approach Tom Mitchell: in the competition. By that time, Tom Mitchell: by the way, neural networks were Tom Mitchell: very scarce in the field of Tom Mitchell: machine learning. Tom Mitchell: They had been displaced really Tom Mitchell: by more recent probabilistic Tom Mitchell: methods, and only a smallish Tom Mitchell: number of researchers were even Tom Mitchell: still working on neural Tom Mitchell: networks.

Tom Mitchell: But, nevertheless, this happened. Tom Mitchell: So I asked Geoff about that. Geoffrey Hinton: And Yann realized when Fei-Fei came up with the ImageNet Geoffrey Hinton: dataset, Yann realized they could win that competition, and Geoffrey Hinton: he tried to get graduate students and postdocs in his lab Geoffrey Hinton: to do it, and they all declined. Geoffrey Hinton: And Ilya, Ilya Sutskever realized that, backprop Geoffrey Hinton: would just kill ImageNet.

Geoffrey Hinton: He wanted Alex to work on it, and Alex actually didn't really Geoffrey Hinton: want to work on it. Geoffrey Hinton: Alex had already been Geoffrey Hinton: working on small images and Geoffrey Hinton: recognizing small images in Geoffrey Hinton: CIFAR-10, and Ilya pre-processed Geoffrey Hinton: everything for Alex to make it Geoffrey Hinton: easy. Geoffrey Hinton: And I bought Alex two Nvidia Geoffrey Hinton: GPUs to have in his bedroom at Geoffrey Hinton: home.

Geoffrey Hinton: Alex then got on with it, and he was an Geoffrey Hinton: absolute wizard programmer. Geoffrey Hinton: He wrote amazing code on Geoffrey Hinton: multiple GPUs to do convolution Geoffrey Hinton: really efficiently. Geoffrey Hinton: Much better code than anybody else had ever written, Geoffrey Hinton: I believe. And so it's a combination of Ilya realizing we Geoffrey Hinton: really had to do this.

Geoffrey Hinton: Ilya was involved in the design of the net and so on, but Geoffrey Hinton: Alex's programming skills. Geoffrey Hinton: And then I added a few ideas, like use rectified linear units Geoffrey Hinton: instead of sigmoid units and use little patches of the images. Geoffrey Hinton: I mean, big patches of the images.

Geoffrey Hinton: So you can translate things Geoffrey Hinton: around a bit to get some Geoffrey Hinton: translation invariance, as well Geoffrey Hinton: as using convolution, and Geoffrey Hinton: use dropout. Geoffrey Hinton: So that was one of the first applications of dropout. Geoffrey Hinton: And that helped about one percent. Geoffrey Hinton: It helped. Geoffrey Hinton: And then we beat the best vision systems.

Geoffrey Hinton: The best vision systems were sort of plateauing at twenty Geoffrey Hinton: five percent errors. Geoffrey Hinton: That's errors for getting the right answer in your Geoffrey Hinton: top five bets. Geoffrey Hinton: And we got like fifteen percent, fifteen or sixteen, Geoffrey Hinton: depending on how you count it. Geoffrey Hinton: So we got almost half the error rate.

Geoffrey Hinton: And what happened then was what Geoffrey Hinton: ought to happen in science but Geoffrey Hinton: seldom does. Geoffrey Hinton: So our most vigorous opponents, like Jitendra Malik and Geoffrey Hinton: Zisserman, Andrew Zisserman, looked at these results and Geoffrey Hinton: said, okay, you were right. Geoffrey Hinton: That never happens in science. Geoffrey Hinton: And slightly irritating. Andrew Zisserman then switched Geoffrey Hinton: to doing this.

Geoffrey Hinton: He had some very good postdocs or students working with him. Geoffrey Hinton: Simonyan, after about Geoffrey Hinton: a year, they were making better Geoffrey Hinton: networks than us, but that was Geoffrey Hinton: really, Geoffrey Hinton: as far as the general public was concerned, Geoffrey Hinton: that was the start of this big Geoffrey Hinton: swing towards deep learning in Geoffrey Hinton: twenty twelve.

Tom Mitchell: So that event, that competition Tom Mitchell: and the fact that the neural Tom Mitchell: network approach totally Tom Mitchell: dominated all the other Tom Mitchell: approaches, really was a wake up Tom Mitchell: call to the computer vision Tom Mitchell: community, where within a couple Tom Mitchell: of years everybody was using Tom Mitchell: neural networks.

Tom Mitchell: But it was also a wake up call to the machine learning Tom Mitchell: community, who had kind of scoffed at neural networks for Tom Mitchell: several decades, that neural networks were back. Tom Mitchell: And so people started again, now Tom Mitchell: experimenting with this new Tom Mitchell: generation of deep neural Tom Mitchell: networks.

Tom Mitchell: That just meant that instead of having two layers, they could Tom Mitchell: have many layers, dozens of layers, because training Tom Mitchell: algorithms were available and so was the computation. Tom Mitchell: People started experimenting with these, primarily on Tom Mitchell: perceptual style problems.

Tom Mitchell: In fact, by twenty sixteen, Tom Mitchell: neural nets had taken over not Tom Mitchell: only computer vision, but in Tom Mitchell: twenty sixteen, some scientists Tom Mitchell: from Microsoft showed that they Tom Mitchell: had been able to train a neural Tom Mitchell: network to finally reach human Tom Mitchell: level Tom Mitchell: speech recognition performance for individual words in a widely Tom Mitchell: used data set called the Switchboard data set.

Tom Mitchell: So people were experimenting with neural nets for visual Tom Mitchell: data, speech data, radar, lidar, all kinds of sensory data. Tom Mitchell: People started also asking, Tom Mitchell: well, can we apply these to text Tom Mitchell: data? Tom Mitchell: And the answer was yes.

Tom Mitchell: And people started inventing various architectures, things Tom Mitchell: with names like long short term memory and others to analyze Tom Mitchell: sequences of text and applying them to problems like machine Tom Mitchell: translation, translating English into French, and so forth. Tom Mitchell: And, uh, that kind of worked. Tom Mitchell: And then in twenty seventeen, Tom Mitchell: a very important paper was Tom Mitchell: published.

Tom Mitchell: The name of the paper was Attention is All You Need. Tom Mitchell: And what that was referring to was a subcircuit in a Tom Mitchell: neural network called an attention mechanism that had Tom Mitchell: recently been invented and developed and was trainable. Tom Mitchell: That attention mechanism Tom Mitchell: was used in this paper, and it Tom Mitchell: advanced the state of the art in Tom Mitchell: machine translation.

Tom Mitchell: But even more importantly for us today, it introduced the Tom Mitchell: transformer architecture based on this attention mechanism. Tom Mitchell: And it's that transformer Tom Mitchell: architecture that underlies GPT Tom Mitchell: and pretty much all of the large Tom Mitchell: language models that were Tom Mitchell: released around twenty twenty Tom Mitchell: two. Tom Mitchell: So that was a major event.

Tom Mitchell: Now, around the same time, Yann Tom Mitchell: LeCun, remember the guy who was Tom Mitchell: a postdoc with Geoff in nineteen Tom Mitchell: eighty seven? Tom Mitchell: Yann had become the head of AI research at Facebook. Tom Mitchell: And so he was in a very interesting position because he Tom Mitchell: was both an academic.

Tom Mitchell: He retained his NYU professorship and at the same Tom Mitchell: time he had a foot in the commercial world directing the Tom Mitchell: AI strategy for Facebook. Tom Mitchell: So I asked Yann about this period Tom Mitchell: and what it looked like to him Tom Mitchell: from being inside both Tom Mitchell: worlds.

Tom Mitchell: His first part of his answer was Tom Mitchell: that he said for him, a key Tom Mitchell: development was realizing that Tom Mitchell: you didn't have to wait for Tom Mitchell: people to label all your Tom Mitchell: training data, that you could do Tom Mitchell: something called self-supervised Tom Mitchell: learning.

Tom Mitchell: For example, just take data like a string of words and remove a Tom Mitchell: word and force the program to predict what that Tom Mitchell: removed word was. Tom Mitchell: So there's no human labeling you have to do for that. Tom Mitchell: You can use the whole web and Tom Mitchell: you get a lot of training Tom Mitchell: examples. Tom Mitchell: So self-supervised learning was a key development. Tom Mitchell: But then here's his description of what came next.
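
The remove-a-word idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `masked_examples` and the `[MASK]` placeholder are hypothetical choices, not taken from the talk, and real systems work on subword tokens rather than whitespace-split words.

```python
def masked_examples(sentence, mask_token="[MASK]"):
    """Build (masked_input, target_word) pairs, one per word position.

    No human labeling is needed: the target is simply the word removed.
    """
    words = sentence.split()
    examples = []
    for i, word in enumerate(words):
        # Replace the i-th word with the placeholder and keep it as the label.
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), word))
    return examples


pairs = masked_examples("the cat sat on the mat")
for masked, target in pairs:
    print(masked, "->", target)
```

Every sentence of raw text yields as many training examples as it has words, which is why a web-scale corpus gives so many examples for free.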

Yann LeCun: So the idea that self-supervised learning could really kind of Yann LeCun: bring something to the table there, I think was kind of a Yann LeCun: big sort of change of mindset. Yann LeCun: And then there was Transformers, of course.

Yann LeCun: Right. Yann LeCun: Um, that, so, so before that, there was some Yann LeCun: demonstration that, you know, you could basically match Yann LeCun: the performance of classical systems for tasks like Yann LeCun: translation, language translation using large neural Yann LeCun: nets like LSTM. Yann LeCun: So this was the work by Ilya Sutskever when he was at Google.

Yann LeCun: We had this big sequence to sequence model with LSTMs and Yann LeCun: some gigantic model where you can train it to do Yann LeCun: translation. Yann LeCun: And it kind of works at the same Yann LeCun: level, if not better in some Yann LeCun: cases than the Yann LeCun: classical translation Yann LeCun: methods.

Yann LeCun: Then a few months later, Yann LeCun: Yoshua Bengio and Kyunghyun Cho, Yann LeCun: who is now a colleague at NYU, Yann LeCun: uh, showed that you could change Yann LeCun: the architecture and use this Yann LeCun: attention mechanism Yann LeCun: that they proposed to basically get really good Yann LeCun: performance on translation with much smaller models than what Yann LeCun: Ilya had been proposing.

Yann LeCun: Chris Manning's Yann LeCun: group at Stanford, kind of, you Yann LeCun: know, used that architecture and Yann LeCun: basically Yann LeCun: won the WMT competition for a Yann LeCun: particular, uh, type of Yann LeCun: translation. Yann LeCun: And the entire industry jumped on it.

Yann LeCun: So within a few months after that, like, you know, all the Yann LeCun: big players, uh, in translation, were using attention type Yann LeCun: architectures for translation. Yann LeCun: And that's when the transformer paper came out. Yann LeCun: Attention is all you need. Yann LeCun: So basically, if you build a neural net just with that kind Yann LeCun: of attention circuit, you don't need much else. Yann LeCun: And it ends up working super well.
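
As a rough illustration of what an "attention circuit" computes, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. The function names and the toy vectors are assumptions made for this example; in a real transformer the queries, keys, and values come from learned projections, which are omitted here.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Weighted average of `values`, weighted by query-key similarity."""
    scale = math.sqrt(len(query))
    # Dot product of the query with each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


# Toy inputs (hypothetical): the first key matches the query best.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
output, weights = attention(query, keys, values)
print(weights)
print(output)
```

The key most similar to the query receives the largest weight, so the output is pulled toward that key's value.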

Yann LeCun: And that's what started the, you Yann LeCun: know, the transformer Yann LeCun: revolution. Yann LeCun: Uh, and then after that came BERT, that also came out of Yann LeCun: Google, which was this idea of using self-supervised learning, Yann LeCun: where I take a sequence of words, corrupt it, remove some Yann LeCun: of the words, and then train this big neural net to reconstruct Yann LeCun: the words that are missing. Yann LeCun: Predict the words that are missing.

Yann LeCun: And again, people were Yann LeCun: amazed by how good the Yann LeCun: representations learned by the Yann LeCun: system were for all kinds of NLP Yann LeCun: tasks. Yann LeCun: And that really, uh, you know, kind of captured the imagination Yann LeCun: of a lot of people. Yann LeCun: And then after that, the next revolution was, oh, Yann LeCun: actually, the best thing to do is you remove the encoder, you Yann LeCun: just use a decoder.

Yann LeCun: And you just train a system, you feed it a sequence, and you Yann LeCun: just train it to reproduce the input sequence on its output. Yann LeCun: And the architecture of the decoder is strictly causal, Yann LeCun: because a particular output is not connected to the Yann LeCun: corresponding input; it's only connected to the ones to the Yann LeCun: left of it.

Yann LeCun: Implicitly, you're training the Yann LeCun: system to predict the next word Yann LeCun: that comes after a sequence of Yann LeCun: words. Yann LeCun: That's the GPT architecture that Yann LeCun: was, you know, promoted by Yann LeCun: OpenAI. Yann LeCun: And that turned out to be more scalable than BERT.
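
The strictly causal setup described here can be sketched as follows. This is an illustrative toy, not the GPT implementation: the function names are hypothetical, and the mask is written exactly as described above (output position i sees only inputs strictly to its left). Typical implementations instead shift the targets by one position and let position i also see input i, which is an equivalent framing of the same next-word objective.

```python
def strictly_causal_mask(n):
    """mask[i][j] is True when output position i may look at input position j."""
    return [[j < i for j in range(n)] for i in range(n)]


def next_word_examples(words):
    """Reproducing the input under a strictly causal mask means that each
    word must be predicted from only the words to its left."""
    mask = strictly_causal_mask(len(words))
    examples = []
    for i, target in enumerate(words):
        context = [w for j, w in enumerate(words) if mask[i][j]]
        examples.append((context, target))
    return examples


examples = next_word_examples("the cat sat".split())
for context, target in examples:
    print(context, "->", target)
```

Because no output position can see its own input word, "copying the input" is impossible, and the only way to succeed is to predict each word from the preceding ones.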

Yann LeCun: And so in the sense that you can Yann LeCun: train gigantic networks on Yann LeCun: enormous amounts of data and you Yann LeCun: get some sort of emergent Yann LeCun: property. Yann LeCun: And that's what gave us LLMs. Tom Mitchell: So that brings us up to today with Transformers. Tom Mitchell: And you can see this very strange evolution and wandering Tom Mitchell: path of, uh, progress and exploration over decades.

Tom Mitchell: So before we leave, Tom Mitchell: let's just take a look Tom Mitchell: at that history and say, what if Tom Mitchell: this is a case study of how Tom Mitchell: scientific progress was made in Tom Mitchell: this field? Tom Mitchell: What are the main themes we see? Tom Mitchell: Well, I think the first one is progress happens in waves. Tom Mitchell: It's paradigm after paradigm, right?

Tom Mitchell: First there were perceptrons, Tom Mitchell: but that got, uh, thrown away Tom Mitchell: and replaced by symbolic Tom Mitchell: representations being learned, Tom Mitchell: eventually to be replaced by Tom Mitchell: neural nets, which were replaced Tom Mitchell: by probabilistic methods and so Tom Mitchell: forth. Tom Mitchell: So there's wave after wave of paradigm. Tom Mitchell: Another theme is that a lot of Tom Mitchell: these ideas really came from Tom Mitchell: other fields.

Tom Mitchell: Even the very notion of Tom Mitchell: perceptrons came from somebody Tom Mitchell: who was fundamentally a Tom Mitchell: neuroscientist interested in how Tom Mitchell: neurons in the brain could even Tom Mitchell: learn stuff. Tom Mitchell: PAC learning: Tom Mitchell: you heard Les Valiant talk. Tom Mitchell: He's very much a Tom Mitchell: computational complexity Tom Mitchell: researcher who found that this Tom Mitchell: was an interesting theoretical Tom Mitchell: result.

Tom Mitchell: Bayesian networks were heavily Tom Mitchell: influenced by statistics and so Tom Mitchell: forth. Tom Mitchell: Many of these advances really Tom Mitchell: were new framings of the Tom Mitchell: problem. Tom Mitchell: So, uh, Winston's work on Tom Mitchell: symbolic learning was really a Tom Mitchell: reframing of what the problem Tom Mitchell: was.

Tom Mitchell: The work on reinforcement Tom Mitchell: learning is really changing the Tom Mitchell: definition of what the training Tom Mitchell: signal even is for these Tom Mitchell: systems. Tom Mitchell: So that's another theme that you see. Tom Mitchell: And finally, I think like a lot Tom Mitchell: of scientific fields, machine Tom Mitchell: learning is really a blend of Tom Mitchell: technical forces and social Tom Mitchell: forces.

Tom Mitchell: Certainly in the long term, Tom Mitchell: the cold, hard facts of what Tom Mitchell: works best come out and those Tom Mitchell: methods win. Tom Mitchell: But in the shorter term, the Tom Mitchell: question of who works on what Tom Mitchell: kinds of problems is very much Tom Mitchell: influenced by the personalities Tom Mitchell: of people.

Tom Mitchell: Their ability to persuade other Tom Mitchell: people to jump in and start Tom Mitchell: working with them on their Tom Mitchell: problems. Tom Mitchell: So these are some of the themes you see. Tom Mitchell: And I think if you look around at other fields, sometimes you Tom Mitchell: see similar themes. Tom Mitchell: Finally, what are the lessons from all this for researchers? Tom Mitchell: I think the first lesson really is question authority.

Tom Mitchell: Because really, if you think Tom Mitchell: about the major advances, many Tom Mitchell: of those came from just, uh, Tom Mitchell: going against what was currently Tom Mitchell: the conventional wisdom in the Tom Mitchell: field. Tom Mitchell: Inventing a new framing or Tom Mitchell: taking a radically different Tom Mitchell: approach. Tom Mitchell: Another lesson: don't drag your feet.

Tom Mitchell: I've seen decade after decade, new paradigms emerge in the Tom Mitchell: field, and every single time that happens, existing Tom Mitchell: researchers take longer than they need to to recognize the Tom Mitchell: benefits of the new paradigm. Tom Mitchell: And the most guilty people are the senior researchers.

Tom Mitchell: You can probably explain that by Tom Mitchell: taking into account who has the Tom Mitchell: most to lose if there's a new Tom Mitchell: paradigm replacing the current Tom Mitchell: approach. Tom Mitchell: Another lesson: learn to Tom Mitchell: communicate and learn to follow Tom Mitchell: through. Tom Mitchell: You heard Geoff Hinton when he Tom Mitchell: was talking about the Tom Mitchell: development of Tom Mitchell: backpropagation in the mid eighties.

Tom Mitchell: You heard him say we didn't invent backpropagation, but we Tom Mitchell: showed that it was important. Tom Mitchell: And actually, to be fair, they Tom Mitchell: thought they were inventing Tom Mitchell: backpropagation.

Tom Mitchell: They actually reinvented Tom Mitchell: it, but they had no idea that Tom Mitchell: somebody had invented it before, Tom Mitchell: because whoever did that didn't Tom Mitchell: succeed in waking up the Tom Mitchell: research community to the fact Tom Mitchell: that they had a really good Tom Mitchell: idea. Tom Mitchell: I don't know why. Tom Mitchell: Maybe they didn't put in the Tom Mitchell: effort or succeed in Tom Mitchell: communicating.

Tom Mitchell: Maybe they dropped it after they Tom Mitchell: did it and went some other Tom Mitchell: direction so that they didn't Tom Mitchell: follow through to provide the Tom Mitchell: evidence. Tom Mitchell: But that kind of thing happens Tom Mitchell: frequently. And successful Tom Mitchell: researchers are good Tom Mitchell: communicators, and they follow Tom Mitchell: through to push the field to Tom Mitchell: pay attention.

Tom Mitchell: The final lesson, I think, is Tom Mitchell: the philosophers were actually Tom Mitchell: right. Tom Mitchell: Today, despite these amazing capabilities of our Tom Mitchell: learning systems, we don't have a proof or anything like a Tom Mitchell: rational justification of why you can generalize from examples Tom Mitchell: to get these general rules that work well.

Tom Mitchell: Despite the success that we have, we don't really understand at this very fundamental level why. Tom Mitchell: And I think that if we did pay more attention to that question, Tom Mitchell: we might have a better chance to develop algorithms that Tom Mitchell: outperform what we have today. Tom Mitchell: So I'll stop there. Tom Mitchell: Thank you very much. Speaker 12: Tom Mitchell is the Founders Speaker 12: University Professor at Carnegie Speaker 12: Mellon University.

Speaker 12: Machine Learning: How Did We Get Here? Speaker 12: is produced by the Stanford Digital Economy Lab. Speaker 12: If you enjoyed this episode, Speaker 12: subscribe wherever you listen to Speaker 12: podcasts.

Transcript source: Provided by creator in RSS feed