Tom Mitchell: Welcome to Machine Learning: How Did We Get Here? I'm Tom Mitchell, your podcast host. Now, many people ask, how did we get to this point where today we have these amazing AI systems? I have a one sentence answer to that question. We tried for fifty years to write intelligent programs by hand, but we discovered about a decade ago that it was actually much easier and much more successful to use machine learning methods to instead train them to become intelligent. So the real question is, how did machine learning get here? What were the successes along the way and the failures? Who were the people involved? What were they thinking? What even made them want to get into this field in the first place?
Tom Mitchell: This first episode will set the stage for the podcast. It is a recording of a lecture I gave this month, in February twenty twenty six, at Carnegie Mellon University, and it attempts to cover in one hour a seventy five year history of the field of machine learning. Most of the rest of the episodes in the podcast involve interviews with various pioneers in the field who made very significant contributions along the way.
Tom Mitchell: Before we start, I want to thank Carnegie Mellon University and also the Stanford University Digital Economy Lab for supporting the podcast. And I want to thank Maddie Smith, our podcast producer.
Tom Mitchell: I hope you enjoy the podcast.
Tom Mitchell: If we're going to talk about machine learning, it's only fair to start with the first people who talked about how on earth learning is possible, which were the philosophers. As early as Aristotle, he was talking about the question of how it is that people could look at examples of things and learn their general essence, in his words. About a century later, there was a school of philosophers called the Pyrrhonists, who really zeroed in on the problem of induction and how it can be justified. When we say induction, what we really mean is the process of coming up with a general rule from looking at specific examples. And so they talked about questions like, well, if all of the swans we've seen so far in our life are white, should we conclude that all swans are white? What would be the justification for that? Maybe there's a black swan out there that we haven't seen. And that debate went on for some time.
Tom Mitchell: Around thirteen hundred, William of Ockham suggested something that we now call Occam's razor, the policy that we should prefer the simplest hypothesis. So, indeed, if all the swans we've seen so far are white, then the simplest hypothesis is all swans are white. That was his prescription.
Tom Mitchell: Later on, around sixteen hundred, Francis Bacon brought up the importance of data collection, of actively experimenting to collect data that could falsify hypotheses that weren't correct. And then in the seventeen hundreds, the philosopher David Hume really kind of nailed the problem of induction. He argued very persuasively that it's really impossible to generalize from examples if you don't have some additional assumption that you're making. And he pointed out that even the assumption that the future will be like the past is itself not a provable assumption. It is just a guess that we use. So his point was that people do induction, but it's a habit. It's not a justified, rational, provable, correct process. So they had plenty to say.
Tom Mitchell: Around the nineteen forties, when computers became available, Alan Turing, who's often called the father of computing, suggested that maybe computers could learn. He said, "Instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates a child's? If this were then subjected to an appropriate course of education, one would obtain the adult brain." So he had the idea that maybe computers could learn.
Tom Mitchell: But he did not have an algorithm by which they would learn. That waited until the nineteen fifties, when there were two important seminal events. One was a computer program written by an IBM researcher named Art Samuel, and his program learned to play checkers. I'll just read you a couple of sentences from the abstract of this paper. He said, "Two machine learning procedures have been investigated in some detail using the game of checkers. Enough work has been done to verify the fact that a computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program." And then he went on to point out, "The principles of machine learning verified by these experiments are, of course, applicable to many other situations." So he had really maybe one of the first demonstrations of a program that learned to do something interesting.
Tom Mitchell: And he understood that the techniques he was using were very general. Now, how did he get the computer to learn to play checkers? His program learned an evaluation function that would assign a numerical score to any checkers position, and that score would be higher the better the checkers position was from your point of view as you're playing the game. And then you would use that to control a look-ahead search for which move to take. That evaluation function was a linear weighted combination of board features that he made up, things like how many checkers are on the board that are mine, how many are on the board that are yours, and so forth.
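A linear evaluation function of the kind described here can be sketched in a few lines of Python. The feature names, weights, and candidate positions below are hypothetical, chosen only for illustration; they are not Samuel's actual features or values, and the search shown looks only one move ahead.

```python
# Minimal sketch: score a position as a weighted sum of hand-made features.
# Feature names and weights are illustrative assumptions, not Samuel's.
WEIGHTS = {"my_pieces": 1.0, "your_pieces": -1.0, "my_kings": 1.5, "your_kings": -1.5}

def evaluate(features, weights=WEIGHTS):
    """Return a numerical score; higher means better for the player to move."""
    return sum(weights[name] * value for name, value in features.items())

def choose_move(candidates, weights=WEIGHTS):
    """One-step look-ahead: pick the move whose resulting position scores highest."""
    return max(candidates, key=lambda move: evaluate(candidates[move], weights))

# Hypothetical positions reachable by two different moves.
candidates = {
    "capture": {"my_pieces": 8, "your_pieces": 6, "my_kings": 1, "your_kings": 1},
    "retreat": {"my_pieces": 8, "your_pieces": 7, "my_kings": 1, "your_kings": 1},
}
best = choose_move(candidates)
```

In this picture, learning means adjusting the weights; the sketch leaves that step out and only shows how a fixed set of weights turns board features into a move choice.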
Tom Mitchell: So his program learned, and what it learned was that evaluation function. How did it learn it? By playing games against itself. And he points out that in eight to ten hours, it could learn well enough to beat him. Those ideas persisted through the decades. They were reused over and over, including in the computer programs that finally beat the world chess champion, the world backgammon champion, and the world Go champion. So those ideas were really seminal.
Tom Mitchell: A second thing that happened in the fifties was the invention of the first early version of neural networks by Frank Rosenblatt from Cornell. He was interested in neuroscience: how can the neurons in the brain be used to learn? And he ended up building a simple, at least by today's standards, neural network that consisted of one layer of neurons, where there would be a receptive field input, say an image, and then the neurons would respond to that and produce an output set of neuron firings. What got learned in that case were the connection strengths between the input to the neuron and the probability that it would fire. And the way he trained it was what we now call supervised learning. You show an input and what the output should be. And he had schemes for updating those weights to fit the data.
Tom Mitchell: Now, the importance of this work is that it catalyzed a whole bunch of work in the nineteen sixties, for the next decade, looking at different algorithms for tuning the weights of perceptron-style systems.
Tom Mitchell: That work proceeded for a decade or so, and at the end of the nineteen sixties, two MIT scientists, Marvin Minsky and Seymour Papert, wrote a book called Perceptrons. Unfortunately, that book proved that a single-layer perceptron, which is the only thing we knew how to train at that point, could never even represent many, many functions that we wanted to learn. It could only represent linear functions, not even exclusive or, where the output would be one if input one is a one and the other is a zero, or if it's a zero and a one, but the output would have to be zero if they were both one. You can't even represent that simple function with a perceptron, no matter how you train it. So this really kind of put the kibosh on work on perceptrons following the publication of this book.
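The exclusive-or limitation can be seen in a small experiment. The sketch below uses the classic perceptron update rule on a single threshold unit with two inputs; the function name, learning rate, and epoch limit are assumptions made for illustration. It trains the same unit on AND, which a single unit can represent, and on exclusive or, which it cannot.

```python
def train_perceptron(examples, epochs=100, lr=1.0):
    """Train one threshold unit; return (weights, bias, converged)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), target in examples:
            output = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - output
            if error != 0:
                mistakes += 1
                # Perceptron rule: nudge the weights toward the correct answer.
                w[0] += lr * error * x1
                w[1] += lr * error * x2
                b += lr * error
        if mistakes == 0:
            return w, b, True   # a full pass with no errors
    return w, b, False          # never classified all four inputs correctly

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_converged = train_perceptron(AND)[2]
xor_converged = train_perceptron(XOR)[2]
```

Raising the epoch limit does not help on exclusive or: no single line separates the two positive inputs from the two negative ones, so an error-free pass never happens.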
Tom Mitchell: Now, if we're not going to be able to, or don't want to, spend our time figuring out how to learn perceptrons, then what's next? Well, it turned out one of Minsky's PhD students, Patrick Winston, published his thesis the next year, and Winston suggested that instead of learning perceptron-type representations of information, we should learn symbolic descriptions. In his thesis, he showed how his program could learn descriptions of different physical structures like an arch or a tower. He would train the program by showing it line drawings of positive and negative examples of, in this example, arches. And then the program would process those incrementally arriving examples to produce a symbolic description that would describe the different parts and the relations among them. For example, an arch could be two rectangles which don't touch each other, but which jointly support a roof of any shape.
Tom Mitchell: So this was an important step because it shifted the focus onto learning a much richer kind of representation, symbolic descriptions. And this became the new paradigm which dominated the nineteen seventies.
Tom Mitchell: So during the seventies, there were a number of people working on learning symbolic descriptions. My favorite is the Meta-DENDRAL program, developed by Bruce Buchanan at Stanford. This program, again, was a symbolic learning program. What it learned was rules that would predict how molecules would shatter inside a mass spectrometer, and therefore predict what the mass spectrum of a new molecule would be. And those rules again symbolically described a subgraph of atoms within the molecular graph. The rules would say, if you find this subgraph, then specific bonds in that subgraph are likely to fragment when you put this in a mass spectrometer. And this was an important step forward. I asked Bruce Buchanan how well it worked. What was this program able to do?
Bruce Buchanan: Well, for one small class of steroid molecules, the ketoandrostanes, if you will, we had fewer than a dozen spectra, and we were able to tease out the rules that determine how a new ketoandrostane would fragment in a mass spectrometer. And we were able to publish that set of rules in a refereed chemistry journal. And it was, to our knowledge, the first time that the result of a machine learning program, symbolic learning, had been published in a refereed journal.
Tom Mitchell: So that was an important milestone for machine learning, really the first time that a program discovered some knowledge that was useful enough to get published in that domain.
Tom Mitchell: Now, a personal note: I was a PhD student at Stanford at the time, and Bruce became my PhD advisor, so my PhD thesis was also built around this same data set. For my thesis I developed a system called Version Spaces that was the first symbolic learning algorithm where you could prove that it would converge, and furthermore, that the learner would know when it had converged, so it would know it was done. And it did that by maintaining not just one hypothesis that it would modify, but by keeping track of every hypothesis consistent with the data that it had seen. This also opened up the possibility of what we today call active learning. It made it easy for the system to play twenty questions with the teacher.
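The idea of keeping every consistent hypothesis, and using that set to pick the next question, can be illustrated with a brute-force sketch. This is not how the thesis system was implemented: the toy hypothesis language (conjunctions over three binary attributes), the explicit enumeration, the query heuristic, and the hidden target concept are all assumptions made to keep the example small.

```python
from itertools import product

N_ATTRIBUTES = 3
# A hypothesis is a conjunction: each slot requires 0, requires 1, or is None (don't care).
ALL_HYPOTHESES = list(product([0, 1, None], repeat=N_ATTRIBUTES))
ALL_INSTANCES = list(product([0, 1], repeat=N_ATTRIBUTES))

def matches(hypothesis, instance):
    """True if the hypothesis labels the instance as a positive example."""
    return all(c is None or c == v for c, v in zip(hypothesis, instance))

def consistent(hypothesis, examples):
    """True if the hypothesis agrees with every labeled example seen so far."""
    return all(matches(hypothesis, x) == label for x, label in examples)

def best_query(version_space):
    """Pick the instance on which the remaining hypotheses disagree most evenly."""
    half = len(version_space) / 2
    return min(ALL_INSTANCES,
               key=lambda x: abs(sum(matches(h, x) for h in version_space) - half))

target = (1, None, 0)   # hypothetical concept known only to the teacher
version_space = list(ALL_HYPOTHESES)
examples = []
while len(version_space) > 1:                          # more than one candidate left
    query = best_query(version_space)                  # ask the teacher about this one
    examples.append((query, matches(target, query)))   # teacher supplies the label
    version_space = [h for h in version_space if consistent(h, examples)]
```

When the loop ends, a single hypothesis remains, which is how the learner in this toy setting knows it is done.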
Tom Mitchell: It could ask the teacher, please label this example, chosen so that it could reduce the set of hypotheses as quickly as possible.
Tom Mitchell: So by the end of the seventies, there seemed to be enough work going on in the field that it was time to hold a meeting. And so the first workshop in machine learning was held here at CMU at Wean Hall, a couple of buildings in that direction. It was organized by Jaime Carbonell, who was an assistant professor here at the time, Richard Michalski, who was a more senior professor at Illinois, and myself. I was at the time an assistant professor at Rutgers University. And so we held this meeting, pulled together some people. One of the people who attended was a student of Richard Michalski named Tom Dietterich.
Tom Mitchell: And Tom went on to make many contributions in the field of machine learning. And so I asked Tom, what was the field like in nineteen eighty?
Tom Dietterich: I'd say it was really chaotic. You know, I attended that very first machine learning workshop that was organized. I think you were one of the core organizers at CMU, and there were probably thirty people in the room, and probably thirty completely different talks. You know, I remember I was talking about a sort of algorithm comparison paper I had done that I published at IJCAI seventy nine, I think, so just before that workshop, in which I was executing by hand these very simple algorithms for this kind of subgraph learning problem and comparing how many subgraph isomorphism calculations they had to do. But it was like the first attempt to actually compare multiple machine learning algorithms that were more or less trying to do the same thing. There were a couple of them there, and, you know, I think John Anderson was there talking about cognitive models.
Tom Dietterich: You were there talking about the beginnings of EBL and the LEX system for calculus, symbolic integration. You know, I remember the most interesting talk, I thought, was Ross Quinlan's talk on ID3, where he was trying to take these reverse-enumerated chess endgames and learn decision trees that would completely, exactly, losslessly, basically compress those giant tables into a small decision tree.
Tom Dietterich: A really important thing people should understand is that in those days we believed there was a right answer for our machine learning problems. And it would often happen that I would run the algorithms and they would not get the right answer. They would not get the logical expression that we thought was the right answer. They would get something that was actually equally accurate on the training data. And actually it worked pretty well, although we didn't really have a set idea of a separate test set in those days. I mean, it was not a field of statistics. The idea was, right, we were coming out of really the John McCarthy program of Programs with Common Sense, which didn't have a lot to do with common sense, but was about, we're going to represent everything in logic, and we're going to use logical inference as the execution engine.
Tom Mitchell: So there's Tom's take on what things were like.
Tom Mitchell: He mentioned that he thought the most interesting talk was Ross Quinlan's talk. I agree, I thought that was the most interesting talk. Ross's talk presented the idea that we should learn decision trees. A decision tree is something where you classify your example by putting it at the root of the tree, and then you sort it down to a leaf in the tree based on its features, and the leaf tells you what the output classification label should be. That's what gets learned. So I asked Ross how he came up with this idea.
JR Quinlan: I had done a PhD under a psychologist, Earl Hunt.
JR Quinlan: And part of his work involved decision trees, which I learned about, of course, as a student, but then put in the back of my mind for fifteen years or so. And then I was at Stanford on sabbatical at the same time as Donald Michie was teaching a course on learning, and he had a challenge for the class. I sat in on the class, and the challenge was to work out a way of predicting a win in a very simple chess end game, king rook versus king knight. So I remembered Earl Hunt's work on decision trees, and I thought, well, maybe that would be the way to go. So I developed a thing called ID3, which was just a simple decision tree program. No pruning, just straight decision tree. And that seemed to solve the problem pretty well, up to about ninety five percent. And then I got that up to one hundred the next year. And then, remember, the first real time I talked about this was at that conference you organized, the workshop in nineteen eighty at Pittsburgh, at Carnegie Mellon. You, Richard and Jaime all set up that workshop. And then I gave a talk on decision tree learning.
Tom Mitchell: So there's Ross's story.
Tom Mitchell: He got the idea of decision trees from his thesis advisor many years earlier, but it turns out Ross was the one who came up with the algorithm that actually successfully discovered useful decision trees. And that whole idea of decision tree learning became very important in the field.
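The two ideas in this part of the lecture, sorting an example from the root down to a leaf and learning the tree from labeled examples, can be sketched as follows. This is an ID3-style illustration with no pruning that picks the split with the highest information gain; the toy features, labels, and tuple-based tree representation are assumptions, not Quinlan's chess data or code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def build_tree(rows, labels, features):
    """Return a leaf label, or a (feature, {value: subtree}) node."""
    if len(set(labels)) == 1:
        return labels[0]                                # pure node becomes a leaf
    if not features:
        return Counter(labels).most_common(1)[0][0]     # fall back to the majority label

    def gain(feature):
        remainder = 0.0
        for value in set(row[feature] for row in rows):
            subset = [l for row, l in zip(rows, labels) if row[feature] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder

    best = max(features, key=gain)                      # split with highest information gain
    branches = {}
    for value in set(row[best] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree([rows[i] for i in keep],
                                     [labels[i] for i in keep],
                                     [f for f in features if f != best])
    return (best, branches)

def classify(tree, example):
    """Sort the example from the root down to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[example[feature]]
    return tree

# Hypothetical training data: two features, two classes.
rows = [{"color": "red", "size": "big"}, {"color": "red", "size": "small"},
        {"color": "blue", "size": "big"}, {"color": "blue", "size": "small"}]
labels = ["pos", "neg", "neg", "neg"]
tree = build_tree(rows, labels, ["color", "size"])
```

Calling `classify(tree, {"color": "red", "size": "big"})` walks the learned tree to a leaf. A feature value never seen in training would raise a `KeyError` in this sketch, since nothing here handles unseen branches.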
Tom Mitchell: By twenty ten, it was probably one of the most commercially used approaches in machine learning.
Tom Mitchell: So in the early eighties, there were various experiments like these trying to build machine learning systems, but really no theory, no theory that could tell us, for example, how many examples we would have to present to a learner in order for it to reliably learn. And that changed in nineteen eighty four, when Les Valiant published a paper on what he calls probably approximately correct learning. It really was the first practical theory to tell us how many examples you would need. In particular, the number of examples you need depends on three things. First, the complexity of your hypothesis space. For example, if you're going to learn decision trees of depth two, that's a lot less complex than if you're learning decision trees of depth twelve. So it depends on how complex your hypotheses are. Second, it depends on the error rate you're willing to tolerate in the final hypothesis: one percent error, five percent error. Third, it depends on the probability you're willing to put up with that, even if you do choose that many randomly provided training examples, you'll still fail. You can't guarantee that you won't fail, but you can reduce that probability. So this was a breakthrough in the area of theoretical characterization of algorithms. So I asked Les what he thought was the key idea there.
Leslie Valiant: It's a kind of a model of computation.
Leslie Valiant: But it makes sense because it's got some applications. So the particular result which persuaded people that there was something there is this result that if you take a conjunctive normal form formula, which, you know, from NP-completeness at the time, we already knew there's some hardness in it, because if someone gave you the formula, it was computationally difficult to find out whether it's equivalent to a formula which is always zero, which is never satisfiable. On the other hand, this conjunctive normal form formula with three variables in each clause, this was PAC learnable. And so this was a bit striking, that something which is very hard is learnable. But then this highlighted the difference between computing and learning, because with the learning model, the idea was that there was a distribution of inputs. And you learn from this distribution, but you only have to be good on this distribution when you have to predict.
Leslie Valiant: So if, for example, in this formula, there were some cases which are very rare, then the learner wouldn't have to know about them. So in this sense this was easier than the NP-completeness.
Tom Mitchell: So I was actually quite surprised at that answer. What he's saying, put another way, is that what was really interesting there is that for this one kind of hypothesis, conjunctive normal form, which is a kind of logical expression, if your hypotheses are of that form, then it's easier to learn them than it is to compute them. When he says compute them, what he means is the cost of answering the question, can you find a positive example of this? And it was known at the time that the computational cost of answering that question, is there a positive example of this formula, was exponential in the size of the formula. And then he discovered that learning a formula, if somebody gives you positive and negative examples, only takes polynomial, less than exponential, time. So I agree with him that that's a fascinating theoretical fact, but that would not be the answer I would give about why this revolutionized the field of machine learning.
Tom Mitchell: It revolutionized the field, in my view, because he was really the first person to be able to come up with a new framing of the machine learning problem that even allowed this kind of theoretical analysis. In particular, his framing included assumptions like the training data would come from some source that would give you random examples according to some probability distribution. And then later, when you wanted to test your hypothesis on new data, you would get more random examples from that same source. And so he reframed the problem in a way that made theory possible. The consequence of that was he catalyzed a huge amount of theoretical work in machine learning, and it continues to this day and just keeps branching further and further. There are conferences specifically designed to cover theoretical computer science.
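The three dependencies mentioned in the lecture (hypothesis space complexity, tolerated error, and tolerated failure probability) show up directly in one standard textbook bound. For a learner that outputs any hypothesis consistent with the training data, drawn from a finite hypothesis space H, it suffices to see m >= (1/epsilon) * (ln|H| + ln(1/delta)) random examples. The sketch below evaluates that bound; it illustrates the framework and is not necessarily the form of the result in Valiant's paper, and the function name and example numbers are assumptions.

```python
import math

def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Examples sufficient for a consistent learner over a finite hypothesis space.

    hypothesis_count: size of the hypothesis space |H|
    epsilon: error rate tolerated in the final hypothesis
    delta: tolerated probability that the learner still fails
    """
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# Hypothetical settings: a small and a large hypothesis space,
# each with five percent error and five percent failure probability.
small_space = pac_sample_bound(1_000, epsilon=0.05, delta=0.05)
large_space = pac_sample_bound(1_000_000, epsilon=0.05, delta=0.05)
stricter = pac_sample_bound(1_000, epsilon=0.01, delta=0.05)
```

The bound grows only logarithmically with the size of the hypothesis space and with 1/delta, but linearly with 1/epsilon, so demanding a lower error rate is the expensive part.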
Tom Mitchell: So the eighties was really a very generative decade. There were a lot of things going on. Another thing that was going on was some people were looking at human learning and how that might inspire our models of AI and machine learning. One such effort was here at CMU by Allen Newell and his two PhD students, John Laird and Paul Rosenbloom. They built a system they called Soar, which was really one of the first AI agents designed to capture the full breadth of what humans do: play games, solve problems, many different tasks. So they framed their machine learning problem as one of getting a general agent to learn. And their architecture had very interesting properties that I think are relevant today, now that agents are again a topic of hot activity. I won't go into the details, but in the podcast there's an interview with John Laird, who goes into detail on this.
Tom Mitchell: Another item that can't be overlooked in the eighties was really the rebirth of neural networks.
Tom Mitchell: Remember, at the end of the sixties, Minsky and Papert published that book that killed off work on perceptrons? Well, in the mid eighties, finally, people came up with an algorithm that could train not just one-layer perceptrons, but multilayer perceptrons. And that allowed learning functions that were highly non-linear. Dave Rumelhart, Jay McClelland and Geoff Hinton were three of the ringleaders of this effort. So I asked Geoff about that period. Now we're up to the mid eighties, when really neural nets are reborn. Is that the right word? How would you...
Geoffrey Hinton: With backprop, with backpropagation. I mean, we didn't invent it.
Geoffrey Hinton: Invented by several different Geoffrey Hinton: groups, but we showed that it Geoffrey Hinton: really worked to learn Geoffrey Hinton: representations. Geoffrey Hinton: And as you know, sort of one of the big problems in AI is how do Geoffrey Hinton: you learn new representations? Geoffrey Hinton: How do you avoid having to put them all in by hand?
Geoffrey Hinton: And my particular example, Geoffrey Hinton: which was the family trees Geoffrey Hinton: example, where you take all the Geoffrey Hinton: information in some family Geoffrey Hinton: trees, you convert it into Geoffrey Hinton: triples of symbols like John has Geoffrey Hinton: Father Mary. Geoffrey Hinton: And then you train a neural Geoffrey Hinton: net to predict the last term in Geoffrey Hinton: a triple. Geoffrey Hinton: Given the first two terms.
Geoffrey Hinton: So it's just like the big language models. Geoffrey Hinton: You're predicting the next word given the context. Geoffrey Hinton: It's just much simpler.
Geoffrey Hinton: I had one hundred and twelve Geoffrey Hinton: total examples, of which one Geoffrey Hinton: hundred and four were training Geoffrey Hinton: examples and eight were test Geoffrey Hinton: examples, which is a bit less Geoffrey Hinton: than the trillion examples they Geoffrey Hinton: have nowadays, Geoffrey Hinton: but it was the same idea. Geoffrey Hinton: You convert a symbol into a feature vector.
Geoffrey Hinton: You then have the feature vectors of the context interact Geoffrey Hinton: via a hidden layer. Geoffrey Hinton: They then predict the features Geoffrey Hinton: of the next symbol, and from Geoffrey Hinton: those features you guess what Geoffrey Hinton: the next symbol should be, and Geoffrey Hinton: you try and maximize the Geoffrey Hinton: probability of predicting the Geoffrey Hinton: next symbol.
Geoffrey Hinton: And you then backpropagate Geoffrey Hinton: through the feature interactions Geoffrey Hinton: and through the process of Geoffrey Hinton: converting a symbol into Geoffrey Hinton: features. Geoffrey Hinton: And that way you learn feature vectors to represent the Geoffrey Hinton: symbols and how these vectors should interact to predict the Geoffrey Hinton: features of the next symbol. Geoffrey Hinton: And that's what these big language models do.
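The pipeline Geoff describes (symbol to feature vector, feature vectors interacting through a hidden layer, a prediction of the next symbol, and backpropagation all the way into the feature vectors) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the six triples, the layer sizes, the learning rate, and the variable names are all hypothetical choices, not the original family trees network or data.

```python
import numpy as np

# Hypothetical toy triples (person, relation, person); not the original data set.
triples = [
    ("john", "has_father", "bob"),
    ("john", "has_mother", "mary"),
    ("ann", "has_father", "bob"),
    ("ann", "has_mother", "mary"),
    ("bob", "has_wife", "mary"),
    ("mary", "has_husband", "bob"),
]
symbols = sorted({s for t in triples for s in t})
index = {s: i for i, s in enumerate(symbols)}
V, D, H = len(symbols), 4, 8  # vocabulary size, feature size, hidden size (assumed)

rng = np.random.default_rng(0)
E = rng.normal(0, 0.5, (V, D))       # symbol -> learned feature vector
W1 = rng.normal(0, 0.5, (2 * D, H))  # context features -> hidden layer
W2 = rng.normal(0, 0.5, (H, V))      # hidden layer -> score for each next symbol

ctx = np.array([[index[a], index[r]] for a, r, _ in triples])  # first two terms
tgt = np.array([index[b] for _, _, b in triples])              # last term
rows = np.arange(len(tgt))

losses = []
lr = 0.2
for step in range(1000):
    # Forward pass: look up features, let them interact, predict the next symbol.
    x = E[ctx].reshape(len(ctx), 2 * D)
    h = np.tanh(x @ W1)
    logits = h @ W2
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    losses.append(-np.log(p[rows, tgt]).mean())

    # Backward pass: through the output, the hidden layer, and the feature lookup.
    d_logits = p.copy()
    d_logits[rows, tgt] -= 1
    d_logits /= len(tgt)
    dW2 = h.T @ d_logits
    d_pre = (d_logits @ W2.T) * (1 - h ** 2)
    dW1 = x.T @ d_pre
    dx = d_pre @ W1.T
    dE = np.zeros_like(E)
    np.add.at(dE, ctx.reshape(-1), dx.reshape(-1, D))  # accumulate per symbol

    E -= lr * dE
    W1 -= lr * dW1
    W2 -= lr * dW2
```

The rows of `E` play the role of the learned feature vectors: they start random and are adjusted only because the prediction error is propagated back through the lookup step.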
Tom Mitchell: So that's Geoff on the mid Tom Mitchell: nineteen eighties work on Tom Mitchell: backpropagation. Tom Mitchell: Another personal note: in Tom Mitchell: nineteen eighty six, while this Tom Mitchell: was going on, I came to spend a Tom Mitchell: year at CMU as a visiting Tom Mitchell: professor. Tom Mitchell: And I got to meet Allen Newell at the time. Tom Mitchell: And Allen said, hey, do you want to team teach a course?
Tom Mitchell: We'll teach a course on Tom Mitchell: architectures for intelligent Tom Mitchell: agents. Tom Mitchell: And of course I said yes. Tom Mitchell: The opportunity to teach with Allen. Tom Mitchell: And he said, by the way, there Tom Mitchell: will be another, uh, an Tom Mitchell: assistant professor working with Tom Mitchell: us. Tom Mitchell: The three of us will team teach it. Tom Mitchell: That's Geoff Hinton.
Tom Mitchell: So Allen, Geoff and I team Tom Mitchell: taught in spring of nineteen Tom Mitchell: eighty six. Tom Mitchell: Uh, this course was one of the best experiences of my career up Tom Mitchell: to that point. Tom Mitchell: And so it was a large part of the reason why I ended up Tom Mitchell: staying at CMU. Tom Mitchell: But when I came, I was here Tom Mitchell: for about a year, and then Geoff Tom Mitchell: moved on.
Tom Mitchell: He moved up to the University of Toronto and started Tom Mitchell: building up a group there. Tom Mitchell: One of the people who joined his group was a person named Yann Tom Mitchell: LeCun, who went on to win the Turing Award jointly with Geoff Tom Mitchell: and Yoshua Bengio for their work in neural networks. Tom Mitchell: So I asked Yann about this period.
Yann LeCun: And then, mid nineteen Yann LeCun: eighty seven, I moved to Toronto Yann LeCun: to do a postdoc with Geoff, and I Yann LeCun: completed this, the Yann LeCun: simulator. Yann LeCun: Geoff thought I was not doing Yann LeCun: anything because I was just Yann LeCun: basically hacking, you know, all Yann LeCun: the time, Yann LeCun: and this, this system was kind of Yann LeCun: interesting because we had to build a front end language to Yann LeCun: interact with it.
Yann LeCun: And that language was the Lisp Yann LeCun: interpreter that Leon and I Yann LeCun: wrote. Yann LeCun: And so we're using Lisp, even though as a front end to kind of Yann LeCun: a neural net simulator. Yann LeCun: And I, you know, implemented Yann LeCun: weight sharing abilities Yann LeCun: and all that stuff and started Yann LeCun: experimenting with what became Yann LeCun: convolutional nets.
Yann LeCun: You know, when I was a postdoc in Toronto, early nineteen Yann LeCun: eighty eight, roughly, and started to get really good Yann LeCun: results on, you know, very simple shape recognition, like, Yann LeCun: handwritten characters that I had drawn with my mouse or Yann LeCun: something like that.
Yann LeCun: Right. Tom Mitchell: So, as you just heard, Yann was Tom Mitchell: experimenting with can we apply Tom Mitchell: neural networks to the problem Tom Mitchell: of character recognition, Tom Mitchell: written characters. Tom Mitchell: People were experimenting with many different uses of neural Tom Mitchell: nets at the time. Tom Mitchell: My favorite, the one I would vote application of the decade, Tom Mitchell: was done in the area, Tom Mitchell: surprisingly, of self-driving cars.
Tom Mitchell: There was a PhD student here at CMU named Dean Pomerleau. Tom Mitchell: He trained a neural network Tom Mitchell: where the input was an image Tom Mitchell: taken by a camera looking out Tom Mitchell: the front windshield of a Tom Mitchell: vehicle. Tom Mitchell: And the output of the neural Tom Mitchell: network was the steering command Tom Mitchell: telling the car which direction Tom Mitchell: to steer. Tom Mitchell: So I asked Dean about that work.
Tom Mitchell: How much training data did you have? Dean Pomerleau: So the interesting thing was, to Dean Pomerleau: begin with, it was all batch Dean Pomerleau: training. Dean Pomerleau: So I'd drive, I'd have a person drive the vehicle along Schenley Dean Pomerleau: Park, uh, Flagstaff Hill Path, and then I would go off and Dean Pomerleau: crunch it overnight. Dean Pomerleau: But in the end, what we were Dean Pomerleau: able to do is, uh, real time Dean Pomerleau: learning.
Dean Pomerleau: So one drive up the hill with a Dean Pomerleau: human behind the wheel steering Dean Pomerleau: and the neural network learning Dean Pomerleau: to pair camera Dean Pomerleau: images with the steering command Dean Pomerleau: that the human was giving was Dean Pomerleau: able to, uh, train it in about Dean Pomerleau: five minutes to, uh, take over Dean Pomerleau: and steer on its own from there Dean Pomerleau: on, on that road and on similar
Dean Pomerleau: roads. Dean Pomerleau: So it was one of the first real time, real world vision Dean Pomerleau: applications of, uh, of artificial neural networks going Dean Pomerleau: beyond just Flagstaff Hill, you know, the little paths on there. Dean Pomerleau: And we went out on, on real roads first through the golf Dean Pomerleau: course, Schenley Golf Course, on the, uh, on the road there.
Dean Pomerleau: And then we, we went on, you know, the local highways. In Dean Pomerleau: fact, the longest as part of my PhD, the longest trip we did Dean Pomerleau: was, I think, about one hundred miles at the time from basically Dean Pomerleau: up, uh, I-79 from Pittsburgh all the way up to Erie. Dean Pomerleau: Uh, and it drove basically the, the whole way. Dean Pomerleau: And it was getting up to fifty five miles per hour after Dean Pomerleau: we got a faster vehicle.
Tom Mitchell: It turns out he didn't ask for permission. Tom Mitchell: So this was all happening in the nineteen eighties. Tom Mitchell: Really, it was a decade of Tom Mitchell: amazing invention and innovation Tom Mitchell: and exploration. Tom Mitchell: Another important thing that Tom Mitchell: happened in that decade was the Tom Mitchell: development of reinforcement Tom Mitchell: learning.
Tom Mitchell: The way to understand that is to first realize that supervised Tom Mitchell: learning was the kind of standard way of framing the Tom Mitchell: machine learning question. Tom Mitchell: When Dean talked about training Tom Mitchell: his system, he would input an Tom Mitchell: image. Tom Mitchell: He had people drive the car, so he got a lot of training Tom Mitchell: examples of the form: Tom Mitchell: here's the image and here's the correct steering command.
Tom Mitchell: So he could tell the neural network, for this input, Tom Mitchell: here's the correct output. Tom Mitchell: That's called supervised learning. Tom Mitchell: But reinforcement learning reframes the problem. Tom Mitchell: It takes into account that sometimes we don't know what the Tom Mitchell: right output is.
Tom Mitchell: For example, if you're learning to play chess, you might not Tom Mitchell: have a person who tells you at every step given this board Tom Mitchell: position, here's the right move. Tom Mitchell: Instead, you might have to wait until the end of the game after Tom Mitchell: you've made many moves to get the feedback signal that says Tom Mitchell: you lost or you won, and then you have to figure out what to Tom Mitchell: do about that because you actually took many moves.
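One simple way to act on a single win-or-lose signal that only arrives at the end of a game is to spread it backward over every position that was visited, trusting later positions more than earlier ones. The sketch below illustrates that idea with a discounted, Monte Carlo style value update; the function name, the state names, and the `alpha` and `gamma` values are hypothetical, and this is only one of several ways the credit assignment problem is handled.

```python
def update_from_outcome(values, episode, final_reward, alpha=0.1, gamma=0.9):
    """Spread one end-of-game reward back over every position in the episode.

    values: dict mapping a position to its current estimated value.
    episode: positions in the order they were visited.
    final_reward: e.g. +1.0 for a win, -1.0 for a loss.
    """
    credit = final_reward
    for state in reversed(episode):
        old = values.get(state, 0.0)
        values[state] = old + alpha * (credit - old)  # move estimate toward credit
        credit *= gamma  # earlier moves receive a smaller share of the outcome
    return values


# Hypothetical three-position game that ended in a win.
values = update_from_outcome({}, ["opening", "middlegame", "endgame"], final_reward=1.0)
```

No move in the episode was ever labeled right or wrong; the only feedback is the final outcome, which is the contrast with the supervised setting described above.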
Tom Mitchell: So that's what reinforcement learning is about. Tom Mitchell: And Rich Sutton and Andy Barto were instrumental in kind of Tom Mitchell: framing that problem and, and working on it. Tom Mitchell: They recently won the Turing Award for this work. Tom Mitchell: So I asked Rich how Tom Mitchell: reinforcement learning fit into Tom Mitchell: the field.
Rich Sutton: The field of machine learning Rich Sutton: has always been dominated Rich Sutton: by the more straightforward Rich Sutton: supervised approach. Rich Sutton: There was, as I mentioned at the very beginning, Rich Sutton: the rewards and penalties were very much a part of it.
Rich Sutton: But then the focus, as Rich Sutton: things became more clear and Rich Sutton: better defined, and it Rich Sutton: became a more clear learning Rich Sutton: problem, then became pattern Rich Sutton: recognition and supervised Rich Sutton: learning. Rich Sutton: And this fellow, the strange, uh, fellow Harry Klopf, Rich Sutton: recognized this more than other people and Rich Sutton: wrote some reports and ultimately a book, saying Rich Sutton: that something had been lost.
Rich Sutton: And Andy Barto and I picked up on his work Rich Sutton: and eventually realized that he was right, that something had Rich Sutton: been left out, and in some sense it was obvious that something Rich Sutton: had been left out Rich Sutton: from the point of view of Rich Sutton: psychology, where I'd been Rich Sutton: studying how animals learn, and Rich Sutton: animals learn,
Rich Sutton: really, in both ways, in both a Rich Sutton: supervised way and a Rich Sutton: reinforcement way. Rich Sutton: And so, we picked up on that and made that into a well Rich Sutton: defined area in the. Tom Mitchell: When was that? Rich Sutton: That would have been in the eighties. Tom Mitchell: And then finally, you wrote a book on it in ninety eight. Tom Mitchell: So then it became a clear, uh, subfield of machine learning.
Rich Sutton: Yeah. Rich Sutton: But the key thing is, the way I say it to Rich Sutton: myself, is: why is reinforcement learning off? Rich Sutton: Why is it powerful? Rich Sutton: Potentially powerful. Rich Sutton: It's powerful because it's learning. Rich Sutton: It's really learning from experience. Rich Sutton: Learning from the normal data Rich Sutton: that an animal or a person would Rich Sutton: get.
Rich Sutton: And it doesn't require Rich Sutton: specially prepared data like you Rich Sutton: of course do in supervised Rich Sutton: learning. Tom Mitchell: So during the eighties, there were a lot of other really Tom Mitchell: interesting things going on. Tom Mitchell: Uh, people experimenting with Tom Mitchell: the idea that maybe machines Tom Mitchell: should learn by simulating Tom Mitchell: evolution.
Tom Mitchell: There was an entire set of conferences on something called Tom Mitchell: genetic algorithms, genetic programming, which had to do Tom Mitchell: with that sort of thing. Tom Mitchell: Uh, a cluster of work on Tom Mitchell: studying human learning and Tom Mitchell: other areas. Tom Mitchell: But we don't have time for all of those.
Tom Mitchell: Let's move on to the nineteen Tom Mitchell: nineties, when, again, there was Tom Mitchell: a, I would say, a sea change in Tom Mitchell: terms of the style of work that Tom Mitchell: went on. Tom Mitchell: The theme of the nineteen nineties was really the Tom Mitchell: integration of statistical and probabilistic methods into the Tom Mitchell: field of machine learning.
Tom Mitchell: And a lot of that took the Tom Mitchell: grounded form of learning a new Tom Mitchell: kind of object, which people Tom Mitchell: called either graphical models Tom Mitchell: or Bayes nets. Tom Mitchell: But what got learned in that Tom Mitchell: case was, again, a network where Tom Mitchell: each node would represent a Tom Mitchell: variable. Tom Mitchell: For example, maybe you would be interested in predicting whether Tom Mitchell: somebody has lung cancer.
Tom Mitchell: You'd make that a variable and maybe you'd have evidence like Tom Mitchell: are they a smoker? Tom Mitchell: Do they have a normal or abnormal X-ray result? Tom Mitchell: You'd make those variables. Tom Mitchell: And then the edges in the graph represent probabilistic Tom Mitchell: dependencies among the variables in a way such that in the end, Tom Mitchell: the whole graph represents the full joint probability Tom Mitchell: distribution over the entire collection of variables.
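To make the "whole graph represents the full joint distribution" point concrete, here is a minimal sketch of a three-variable network for the lung cancer example. The chain structure (smoker, then cancer, then X-ray result) and every probability in it are made-up assumptions for illustration, not values from the lecture or from any medical source.

```python
from itertools import product

# Hypothetical conditional probability tables, one per node in the graph.
p_smoker = {True: 0.3, False: 0.7}                 # P(smoker)
p_cancer_given_smoker = {True: 0.1, False: 0.01}   # P(cancer=True | smoker)
p_abnormal_given_cancer = {True: 0.9, False: 0.2}  # P(xray abnormal | cancer)


def joint(smoker, cancer, abnormal):
    """P(smoker, cancer, xray) as the product of each node given its parent."""
    p_c = p_cancer_given_smoker[smoker]
    p_x = p_abnormal_given_cancer[cancer]
    return (p_smoker[smoker]
            * (p_c if cancer else 1 - p_c)
            * (p_x if abnormal else 1 - p_x))


def posterior_cancer(smoker, abnormal):
    """P(cancer=True | smoker, xray), computed from the joint distribution."""
    with_cancer = joint(smoker, True, abnormal)
    without_cancer = joint(smoker, False, abnormal)
    return with_cancer / (with_cancer + without_cancer)


# The eight joint entries sum to one, so the graph defines a full distribution.
total = sum(joint(s, c, x) for s, c, x in product([True, False], repeat=3))
posterior = posterior_cancer(smoker=True, abnormal=True)
```

Three small tables are enough to answer any query over the three variables, which is the economy that the edges in the graph buy.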
Tom Mitchell: So that's what got learned, and how it got learned Tom Mitchell: waited for some algorithms to be discovered.
Tom Mitchell: Although Judea Tom Mitchell: Pearl came up with the idea of Tom Mitchell: how to represent these, Tom Mitchell: one of the key people who was Tom Mitchell: involved in inventing those Tom Mitchell: algorithms was Tom Mitchell: Daphne Koller, a professor at Tom Mitchell: Stanford, who was one of the most Tom Mitchell: active researchers in terms of Tom Mitchell: designing algorithms for Tom Mitchell: learning these. Tom Mitchell: So I asked her, why do we need graphical models?
Daphne Koller: Graphical models, for me, emerged by realizing that the Daphne Koller: problems that we needed to solve to address most real world Daphne Koller: applications went beyond: Daphne Koller: you have a vector representation Daphne Koller: of an input and a single, Daphne Koller: oftentimes binary or at best Daphne Koller: continuous output. Daphne Koller: There was so much more opportunity to think about Daphne Koller: richly structured environments, richly structured problems.
Daphne Koller: So even if you think about problems like understanding what Daphne Koller: is in an image, that's not a single label problem of there is Daphne Koller: a dog, because images are complex and there's Daphne Koller: interrelationships between the different objects. You wanted to Daphne Koller: get beyond the yes/no, is there a dog in this image, to something Daphne Koller: that is much more rich.
Daphne Koller: There's a dog and a Frisbee and Daphne Koller: a beach and three kids building Daphne Koller: a sandcastle. Daphne Koller: You have a rich input and a rich output. Daphne Koller: Thinking about these richly Daphne Koller: structured domains gave rise to Daphne Koller: we have to think about multiple Daphne Koller: variables.
Daphne Koller: We have to think about the Daphne Koller: interactions between those Daphne Koller: variables and leverage that Daphne Koller: structure both in our input and Daphne Koller: output space in order to get to Daphne Koller: much better conclusions and deal Daphne Koller: with problems that really Daphne Koller: matter.
Tom Mitchell: So this work on training Tom Mitchell: graphical models was really part Tom Mitchell: of a bigger theme that decade, Tom Mitchell: which was just the integration Tom Mitchell: of statistical methods with what Tom Mitchell: had been pretty much statistics Tom Mitchell: free machine learning up to that Tom Mitchell: point. Tom Mitchell: Another person who was Tom Mitchell: instrumental in that was a Tom Mitchell: Berkeley professor named Mike Tom Mitchell: Jordan.
Tom Mitchell: I asked him about the Tom Mitchell: relationship between statistics Tom Mitchell: and machine learning. Michael I. Jordan: So anyway, by the time I wanted to move to Berkeley, I Michael I. Jordan: was realizing that I was missing the whole statistics community, Michael I. Jordan: that, uh, it was just separate from machine learning. As maybe Michael I. Jordan: you kind of remember, there was occasionally a little leakage, Michael I. Jordan: but it was way too separate.
Michael I. Jordan: And nowadays we're often seeing, you know, people will Michael I. Jordan: run a machine learning method, but then it's not calibrated. Michael I. Jordan: It's not, you know, has bias and all that. Michael I. Jordan: And that's the thing statisticians have talked about Michael I. Jordan: for a long, long time. Michael I. Jordan: And so nowadays I think it's a given that, yeah, they're, Michael I. Jordan: they're kind of two parts, two sides of the same coin.
Michael I. Jordan: Machine learning is maybe a little more engineering in order Michael I. Jordan: to build a system and make it do great things in the world. Michael I. Jordan: And statistics is a little bit more, well, let's be cautious. Michael I. Jordan: Let's say we're going to do like clinical trials.
Michael I. Jordan: Let's make sure that the answer is really trustable, but Michael I. Jordan: those are two sides of the same coin, and I think that's Michael I. Jordan: probably pretty much clear now. Michael I. Jordan: But for a long time there was a resistance. Michael I. Jordan: Everyone said this is a brand new field, this is different. Michael I. Jordan: And I kept again annoying colleagues by saying, no, I Michael I. Jordan: don't believe it is.
Michael I. Jordan: So anyway, long story short, it is. Tom Mitchell: It is remarkable to me that Tom Mitchell: the field of machine learning Tom Mitchell: went through most of the Tom Mitchell: nineteen eighties, kind of Tom Mitchell: without even noticing that Tom Mitchell: statistics existed. Michael I. Jordan: I mean, people like Leo Breiman Michael I. Jordan: were around to help make the Michael I. Jordan: passage.
Michael I. Jordan: So ensemble methods, they were kind of invented by Leo in the stat Michael I. Jordan: literature, but they were independently invented in the Michael I. Jordan: machine learning literature. Michael I. Jordan: And is that machine learning or statistics? Michael I. Jordan: Well, clearly it's both and it needs both perspectives.
Michael I. Jordan: And yes, in the nineteen nineties, the EM algorithm, Michael I. Jordan: you know, the graphical models, they had, uh, Michael I. Jordan: so yeah, the nineties, it was a real flourishing of that. Tom Mitchell: So Mike mentioned that one of the themes was ensemble methods.
Tom Mitchell: So anyway, I think that's Tom Mitchell: actually a very nice example of Tom Mitchell: how machine learning theory and Tom Mitchell: statistical theory kind of Tom Mitchell: intertwined. Tom Mitchell: The idea of ensemble learning is Tom Mitchell: instead of learning one Tom Mitchell: hypothesis, let's learn multiple Tom Mitchell: ones.
Tom Mitchell: For example, instead of learning Tom Mitchell: a decision tree, you might learn Tom Mitchell: a whole forest of decision Tom Mitchell: trees. Tom Mitchell: And then when it comes to Tom Mitchell: classifying a new example, you Tom Mitchell: give it to all of the Tom Mitchell: classifiers and you let them Tom Mitchell: vote and you take the vote of Tom Mitchell: the classifiers.
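The voting step can be sketched as follows. The three classifiers here are hypothetical hand-written rules standing in for learned decision trees, and the thresholds are invented purely for illustration; how each member of the ensemble gets trained is a separate question that this sketch leaves out.

```python
from collections import Counter


def majority_vote(classifiers, example):
    """Give the example to every classifier and return the most common label."""
    votes = [classify(example) for classify in classifiers]
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical stand-ins for three learned trees, over (height_cm, weight_kg).
classifiers = [
    lambda x: "adult" if x[0] > 150 else "child",
    lambda x: "adult" if x[1] > 45 else "child",
    lambda x: "adult" if x[0] + x[1] > 190 else "child",
]

prediction = majority_vote(classifiers, (160, 40))
```

For the example `(160, 40)` the second rule disagrees with the other two, and the ensemble still returns the majority label, which is the sense in which individual errors can be outvoted.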
Tom Mitchell: Well, that turned out to be very Tom Mitchell: successful and commercially very Tom Mitchell: important. Tom Mitchell: But it also is a beautiful Tom Mitchell: example where there's a Tom Mitchell: pretty interesting theory around Tom Mitchell: that. Tom Mitchell: And initially, Yoav Freund and Robert Schapire, uh, in the early Tom Mitchell: nineties, uh, started working on a theory and methods for doing Tom Mitchell: this kind of ensemble.
Tom Mitchell: Leo Breiman, who was a statistician, recognized that Tom Mitchell: this echoed some of the themes of resampling in statistics. Tom Mitchell: And those two things, uh, kind Tom Mitchell: of came together in a very Tom Mitchell: successful way. Tom Mitchell: So in the nineties and the first Tom Mitchell: decade of the two thousands, Tom Mitchell: there were many other things Tom Mitchell: going on.
Tom Mitchell: The development of things called support vector machines, Tom Mitchell: kernel methods, which were mathematical techniques for Tom Mitchell: learning very nonlinear classifiers that were actually Tom Mitchell: commercially important and opened the door in many cases to Tom Mitchell: machine learning for non-numerical data, data like Tom Mitchell: images or text. Tom Mitchell: There was work on manifold learning.
Tom Mitchell: There was also growing Tom Mitchell: commercialization during that Tom Mitchell: decade. Tom Mitchell: More and more companies were Tom Mitchell: starting to use machine learning Tom Mitchell: commercially.
Tom Mitchell: But for me, the theme of that first decade of the two thousands Tom Mitchell: was really a growing awareness by many people that, you know, Tom Mitchell: maybe we have good enough machine learning algorithms that Tom Mitchell: the bottleneck to more accuracy is not the algorithm. Tom Mitchell: Maybe we need more data and more computation.
Tom Mitchell: And this idea was crystallized in this beautiful paper written Tom Mitchell: in two thousand and nine by three authors at Google, called Tom Mitchell: The Unreasonable Effectiveness of Data, which really Tom Mitchell: highlighted cases where, if you want better Tom Mitchell: results, keep your same algorithm, get more data. Tom Mitchell: And that was kind of a theme of what was going on at the time, Tom Mitchell: but things really broke open in the year twenty twelve.
Tom Mitchell: In twenty twelve, the computer vision community had Tom Mitchell: been using a data set created by Fei-Fei Li called ImageNet to Tom Mitchell: test out different vision algorithms, see who could do the Tom Mitchell: best job of labeling which object was the primary object in Tom Mitchell: an image, and the ImageNet data set was very large. Tom Mitchell: In twenty twelve, Geoff Hinton and some of his students entered Tom Mitchell: the competition and they blew away the competition.
Tom Mitchell: What's interesting is they were the only neural network approach Tom Mitchell: in the competition. Tom Mitchell: By that time, by the way, neural networks were Tom Mitchell: very scarce in the field of Tom Mitchell: machine learning. Tom Mitchell: They had been displaced really Tom Mitchell: by more recent probabilistic Tom Mitchell: methods, and only a smallish Tom Mitchell: number of researchers were even Tom Mitchell: still working on neural Tom Mitchell: networks.
Tom Mitchell: But, nevertheless, this happened. Tom Mitchell: So I asked Geoff about that. Geoffrey Hinton: And Yann realized when Fei-Fei came up with the ImageNet Geoffrey Hinton: dataset, Yann realized they could win that competition, and Geoffrey Hinton: he tried to get graduate students and postdocs in his lab Geoffrey Hinton: to do it, and they all declined. Geoffrey Hinton: And Ilya, Ilya Sutskever realized that backprop Geoffrey Hinton: would just kill ImageNet.
Geoffrey Hinton: He wanted Alex to work on it, and Alex actually didn't really Geoffrey Hinton: want to work on it. Geoffrey Hinton: Alex had already been Geoffrey Hinton: working on small images and Geoffrey Hinton: recognizing small images in CIFAR-10, Geoffrey Hinton: and Ilya pre-processed Geoffrey Hinton: everything for Alex to make it Geoffrey Hinton: easy. Geoffrey Hinton: And I bought Alex two Nvidia Geoffrey Hinton: GPUs to have in his bedroom at Geoffrey Hinton: home.
Geoffrey Hinton: Alex then got on with it, and he was an Geoffrey Hinton: absolute wizard programmer. Geoffrey Hinton: He wrote amazing code on Geoffrey Hinton: multiple GPUs to do convolution Geoffrey Hinton: really efficiently. Geoffrey Hinton: Much better code than anybody else had ever written, Geoffrey Hinton: I believe. And so it's a combination of Ilya realizing we Geoffrey Hinton: really had to do this.
Geoffrey Hinton: Ilya was involved in the design of the net and so on, but Geoffrey Hinton: Alex's programming skills. Geoffrey Hinton: And then I added a few ideas, like use rectified linear units Geoffrey Hinton: instead of sigmoid units and use little patches of the images. Geoffrey Hinton: I mean, big patches of the images.
Geoffrey Hinton: So you can translate things Geoffrey Hinton: around a bit to get some Geoffrey Hinton: translation invariance, as well Geoffrey Hinton: as using convolution, and Geoffrey Hinton: use dropout. Geoffrey Hinton: So that was one of the first applications of dropout. Geoffrey Hinton: And that helped about one percent. Geoffrey Hinton: It helped. Geoffrey Hinton: And then we beat the best vision systems.
Geoffrey Hinton: The best vision systems were sort of plateauing at twenty Geoffrey Hinton: five percent errors. Geoffrey Hinton: That's errors for getting the right answer in your Geoffrey Hinton: top five bets. Geoffrey Hinton: And we got like fifteen percent, fifteen or sixteen, Geoffrey Hinton: depending on how you count it. Geoffrey Hinton: So we got almost half the error rate.
Geoffrey Hinton: And what happened then was what Geoffrey Hinton: ought to happen in science but Geoffrey Hinton: seldom does. Geoffrey Hinton: So our most vigorous opponents, like Jitendra Malik and Geoffrey Hinton: Zisserman, Andrew Zisserman, looked at these results and Geoffrey Hinton: said, okay, you were right. Geoffrey Hinton: That never happens in science. Geoffrey Hinton: And, slightly irritatingly, Andrew Zisserman then switched Geoffrey Hinton: to doing this.
Geoffrey Hinton: He had some very good postdocs or students working with him, Geoffrey Hinton: like Simonyan. After about Geoffrey Hinton: a year, they were making better Geoffrey Hinton: networks than us. But that was Geoffrey Hinton: really the. Geoffrey Hinton: As far as the general public was concerned, Geoffrey Hinton: that was the start of this big Geoffrey Hinton: swing towards deep learning in Geoffrey Hinton: twenty twelve.
Tom Mitchell: So that event, that competition Tom Mitchell: and the fact that the neural Tom Mitchell: network approach totally Tom Mitchell: dominated all the other Tom Mitchell: approaches, really was a wake up Tom Mitchell: call to both the computer vision Tom Mitchell: community, in which within a couple Tom Mitchell: of years everybody was using Tom Mitchell: neural networks.
Tom Mitchell: But it was also a wake up call to the machine learning Tom Mitchell: community, who had kind of scoffed at neural networks for Tom Mitchell: several decades, that neural networks were back. Tom Mitchell: And so people started again, now Tom Mitchell: experimenting with this new Tom Mitchell: generation of deep neural Tom Mitchell: networks.
Tom Mitchell: That just meant that instead of having two layers, they could Tom Mitchell: have many layers, dozens of layers, because training Tom Mitchell: algorithms were available and so was computation. Tom Mitchell: People started experimenting with these, primarily on Tom Mitchell: perceptual style problems.
Tom Mitchell: In fact, by twenty sixteen, Tom Mitchell: neural nets had taken over not Tom Mitchell: only computer vision, but in Tom Mitchell: twenty sixteen, some scientists Tom Mitchell: from Microsoft showed that they Tom Mitchell: had been able to train a neural Tom Mitchell: network to finally reach human Tom Mitchell: level Tom Mitchell: speech recognition performance for individual words in a widely Tom Mitchell: used data set called the Switchboard data set.
Tom Mitchell: So people were experimenting with neural nets for visual Tom Mitchell: data, speech data, radar, lidar, all kinds of sensory data. Tom Mitchell: People started also asking, Tom Mitchell: well, can we apply these to text Tom Mitchell: data? Tom Mitchell: And the answer was yes.
Tom Mitchell: And people started inventing various architectures, things Tom Mitchell: with names like long short term memory and others to analyze Tom Mitchell: sequences of text and applying them to problems like machine Tom Mitchell: translation, translating English into French, and so forth. Tom Mitchell: And, uh, that kind of worked. Tom Mitchell: And then in twenty seventeen, Tom Mitchell: a very important paper was Tom Mitchell: published.
Tom Mitchell: The name of the paper was Attention is All You Need. Tom Mitchell: And what that was referring to was a subcircuit in a Tom Mitchell: neural network called an attention mechanism that had Tom Mitchell: recently been invented and developed and was trainable. Tom Mitchell: That attention mechanism Tom Mitchell: was used in this paper, and it Tom Mitchell: advanced the state of the art in Tom Mitchell: machine translation.
Tom Mitchell: But even more importantly for us today, it introduced the Tom Mitchell: transformer architecture based on this attention mechanism. Tom Mitchell: And it's that transformer Tom Mitchell: architecture that underlies GPT Tom Mitchell: and pretty much all of the large Tom Mitchell: language models that were Tom Mitchell: released around twenty twenty Tom Mitchell: two. Tom Mitchell: So that was a major event.
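For readers who want to see what such an attention subcircuit computes, below is a minimal NumPy sketch of scaled dot-product attention, the core operation of the transformer. It is a simplified, single-head illustration under assumed shapes: the learned projection matrices, multiple heads, masking, and everything else in a full transformer layer are omitted, and the random inputs are placeholders.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Each query row takes a weighted average of the value rows.

    Q: (num_queries, d) queries, K: (num_keys, d) keys, V: (num_keys, d_v) values.
    The weights come from a softmax over query-key similarity scores.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # similarity of each query to each key
    scores = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # each row sums to one
    return weights @ V, weights


# Placeholder inputs: 3 query positions attending over 5 key/value positions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Every step is differentiable, which is why the mechanism can be trained with the same backpropagation used for the rest of the network.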
Tom Mitchell: Now, around the same time, Yann Tom Mitchell: LeCun, remember the guy who was Tom Mitchell: a postdoc with Geoff in nineteen Tom Mitchell: eighty seven? Tom Mitchell: Yann had become the head of AI research at Facebook. Tom Mitchell: And so he was in a very interesting position because he Tom Mitchell: was both an academic.
Tom Mitchell: He retained his NYU professorship and at the same Tom Mitchell: time he had a foot in the commercial world directing the Tom Mitchell: AI strategy for Facebook. Tom Mitchell: So I asked Yann about this period Tom Mitchell: and what it looked like to him Tom Mitchell: from being inside both Tom Mitchell: worlds.
Tom Mitchell: The first part of his answer was Tom Mitchell: that, for him, a key Tom Mitchell: development was realizing that Tom Mitchell: you didn't have to wait for Tom Mitchell: people to label all your Tom Mitchell: training data, that you could do Tom Mitchell: something called self-supervised Tom Mitchell: learning.
Tom Mitchell: For example, just take data like a string of words and remove a Tom Mitchell: word and force the program to predict what that Tom Mitchell: removed word was. Tom Mitchell: So there's no human labeling you have to do for that. Tom Mitchell: You can use the whole web and Tom Mitchell: you get a lot of training Tom Mitchell: examples. Tom Mitchell: So self-supervised learning was a key development. Tom Mitchell: But then here's his description of what came next.
Yann LeCun: So the idea that self-supervised learning could really kind of Yann LeCun: bring something to the table there, I think was kind of a Yann LeCun: big sort of change of mindset. Yann LeCun: And then there was Transformers, of course.
Yann LeCun: Right. Yann LeCun: Um, that, so, so before that, there was some Yann LeCun: demonstration that, you know, you could basically match Yann LeCun: the performance of classical systems for tasks like Yann LeCun: translation, language translation using large neural Yann LeCun: nets like LSTM. Yann LeCun: So this was the work by Ilya Sutskever when he was at Google.
Yann LeCun: He had this big sequence to sequence model with LSTMs and Yann LeCun: some gigantic model where you can train it to do Yann LeCun: translation. Yann LeCun: And it kind of works at the same Yann LeCun: level, if not better in some Yann LeCun: cases, than the then-classical Yann LeCun: translation Yann LeCun: methods.
Yann LeCun: Then a few months later, Yoshua Bengio and Kyunghyun Cho, who is now a colleague at NYU, showed that you could change the architecture and use this attention mechanism that they proposed to basically get really good performance on translation with much smaller models than what Ilya had been proposing.
Yann LeCun: Chris Manning's group at Stanford used that architecture and basically won the WMT competition for a particular type of translation. And the entire industry jumped on it.
Yann LeCun: So within a few months after that, all the big players in translation were using attention-type architectures for translation. And that's when the transformer paper came out, "Attention Is All You Need." So basically, if you build a neural net just with those kinds of attention circuits, you don't need much else. And it ends up working super well.
Yann LeCun: And that's what started the transformer revolution. And then after that came BERT, which also came out of Google, which was this idea of using self-supervised learning, where I take a sequence of words, corrupt it, remove some of the words, and then train this big neural net to reconstruct the words that are missing, to predict the words that are missing.
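The corrupt-and-reconstruct idea described here can be sketched in a few lines of Python. This is only an illustrative sketch, not BERT's actual preprocessing: the `[MASK]` placeholder string, the `make_masked_example` helper, and the 15% default masking fraction are all assumptions made for the example. It shows how a training input and its prediction targets can be built from raw text with no human labeling.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token for a removed word


def make_masked_example(words, mask_fraction=0.15, rng=None):
    """Corrupt a word sequence by masking some positions.

    Returns (corrupted, targets), where targets maps each masked
    position to the original word the model should reconstruct.
    """
    if not words:
        return [], {}
    rng = rng or random.Random()
    # Always mask at least one word so every sequence yields a target.
    n_mask = min(len(words), max(1, round(len(words) * mask_fraction)))
    positions = sorted(rng.sample(range(len(words)), n_mask))
    corrupted = list(words)
    targets = {}
    for i in positions:
        targets[i] = corrupted[i]  # the label comes from the text itself
        corrupted[i] = MASK
    return corrupted, targets


sentence = "the cat sat on the mat".split()
corrupted, targets = make_masked_example(sentence, rng=random.Random(0))
print(corrupted)
print(targets)
```

The network would be trained to map `corrupted` back to the words in `targets`; the point is that the supervision signal is produced by the corruption step rather than by a human annotator.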
Yann LeCun: And again, people were amazed by how good the representations learned by the system were for all kinds of NLP tasks. And that really captured the imagination of a lot of people. And then after that, the next revolution was, oh, actually, the best thing to do is you remove the encoder, you just use a decoder.
Yann LeCun: And you just train a system: you feed it a sequence, and you just train it to reproduce the input sequence on its output. The architecture of the decoder is strictly causal: a particular output is not connected to the corresponding input, it's only connected to the ones to the left of it.
Yann LeCun: Implicitly, you're training the system to predict the next word that comes after a sequence of words. That's the GPT architecture that was promoted by OpenAI. And that turned out to be more scalable than BERT.
Yann LeCun: More scalable in the sense that you can train gigantic networks on enormous amounts of data and you get some sort of emergent property. And that's what gave us LLMs.
Tom Mitchell: So that brings us up to today with Transformers. And you can see this very strange evolution, a wandering path of progress and exploration over decades.
Tom Mitchell: So before we leave, let's just take a look at that history and say, what if this is a case study of how scientific progress was made in this field? What are the main themes we see? Well, I think the first one is progress happens in waves. It's paradigm after paradigm, right?
Tom Mitchell: First there were perceptrons, but that got thrown away and replaced by symbolic representations being learned, eventually to be replaced by neural nets, which were replaced by probabilistic methods, and so forth. So there's wave after wave of paradigms. Another theme is that a lot of these ideas really came from other fields.
Tom Mitchell: Even the very notion of perceptrons came from somebody who was fundamentally a neuroscientist interested in how neurons in the brain could even learn stuff. PAC learning: you heard Les Valiant talk. He's very much a computational complexity researcher who found that this was an interesting theoretical result.
Tom Mitchell: Bayesian networks were heavily influenced by statistics, and so forth. Many of these advances really were new framings of the problem. So Winston's work on symbolic learning was really a reframing of what the problem was.
Tom Mitchell: The work on reinforcement learning is really changing the definition of what the training signal even is for these systems. So that's another theme that you see. And finally, I think, like a lot of scientific fields, machine learning is really a blend of technical forces and social forces.
Tom Mitchell: Certainly in the long term, the cold, hard facts of what works best come out and those methods win. But in the shorter term, the question of who works on what kinds of problems is very much influenced by the personalities of people and their ability to persuade other people to jump in and start working with them on their problems.
Tom Mitchell: So these are some of the themes you see. And I think if you look around at other fields, sometimes you see similar themes. Finally, what are the lessons from all this for researchers? I think the first lesson really is: question authority.
Tom Mitchell: Because really, if you think about the major advances, many of those came from just going against what was currently the conventional wisdom in the field, inventing a new framing or taking a radically different approach. Another lesson: don't drag your feet.
Tom Mitchell: I've seen, decade after decade, new paradigms emerge in the field, and every single time that happens, existing researchers take longer than they need to to recognize the benefits of the new paradigm. And the most guilty people are the senior researchers.
Tom Mitchell: You can probably explain that by taking into account who has the most to lose if there's a new paradigm replacing the current approach. Another lesson: learn to communicate and learn to follow through. You heard Geoff Hinton when he was talking about the development of backpropagation in the mid-eighties.
Tom Mitchell: You heard him say, "We didn't invent backpropagation, but we showed that it was important." And actually, to be fair, they thought they were inventing backpropagation.
Tom Mitchell: They actually reinvented it, but they had no idea that somebody had invented it before, because whoever did that didn't succeed in waking up the research community to the fact that they had a really good idea. I don't know why. Maybe they didn't put in the effort or succeed in communicating.
Tom Mitchell: Maybe they dropped it after they did it and went in some other direction, so that they didn't follow through to provide the evidence. But that kind of thing happens frequently, and successful researchers are good communicators, and they follow through to push the field to pay attention.
Tom Mitchell: The final lesson, I think, is that the philosophers were actually right. Today, despite these amazing capabilities of our learning systems, we don't have a proof or anything like a rational justification of why you can generalize from examples to get these general rules that work well.
Tom Mitchell: Despite the success that we have, we don't really understand at this very fundamental level why. And I think that if we did pay more attention to that question, we might have a better chance to develop algorithms that outperform what we have today. So I'll stop there. Thank you very much.
Speaker 12: Tom Mitchell is the Founders University Professor at Carnegie Mellon University.
Speaker 12: "Machine Learning: How Did We Get Here?" is produced by the Stanford Digital Economy Lab. If you enjoyed this episode, subscribe wherever you listen to podcasts.
