
The data black hole at the center of AI

Jun 19, 2026, 12 min

Summary

This episode explores AI's surprising reliance on vast, specialized datasets, contrasting its "data black hole" approach with human sample efficiency. It debunks common objections regarding evolution, multimodal data, and scaling, revealing a fundamental difference in learning curves. Ultimately, the discussion highlights why addressing AI's sample inefficiency is crucial for achieving goals like white-collar automation and accelerating AI research itself.

Episode description

Read the transcript here.

Thanks to Mercury for sponsoring this essay!

Mercury just released a new feature called Command, which gives me AI right in my banking platform. And since I use Mercury to run basically my entire business, Command has access to all the info it needs to get real work done. I can ask it to send invoices, or categorize expenses, or even transfer money… and Command just handles it. Learn more at mercury.com/command

Timestamps:

(00:00:00) – What is really driving AI progress?

(00:03:11) – Comparing human vs AI sample efficiency

(00:08:46) – Does sample efficiency matter?



Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe

Transcript

What is really driving AI progress?

A

So one definition of intelligence is sample efficiency. That is to say, how much data do you need in a given domain to operate fluently and competently? And it's actually not clear that we've made that much progress in training sample efficiency over the last few years.

It seems like more so we've just dramatically widened and improved the data distribution. The main way that AIs have been getting better is from adding more and better data and scaling the compute required to develop that data in the first place.

Obviously RL is the main way that this has happened. You can think of RL as basically a kind of synthetic data generation where you dump a ton of compute against a verifier, or a rubric if you have an LLM as a judge, and you do this in order to find out what the good data is in the first place. And then you train your model to predict these correct rollouts, much in the same way that you might train that model to predict the next word on the internet.
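The "RL as synthetic data generation" framing can be illustrated with a toy filter. This is only a sketch under stated assumptions: `toy_policy`, `toy_verifier`, and `collect_verified_rollouts` are hypothetical names invented here, the task is adding two integers, and a real pipeline would sample from a language model and then train on the kept rollouts.

```python
import random

def collect_verified_rollouts(sample_rollout, verifier, task, n_rollouts, rng):
    """Sample n_rollouts attempts and keep only those the verifier accepts."""
    kept = []
    for _ in range(n_rollouts):
        attempt = sample_rollout(task, rng)
        if verifier(task, attempt):
            kept.append(attempt)
    return kept

# Hypothetical stand-in for a model: guesses the sum, sometimes off by one.
def toy_policy(task, rng):
    a, b = task
    return a + b + rng.choice([-1, 0, 1])

# Hypothetical stand-in for a policy that never finds the right answer.
def hopeless_policy(task, rng):
    a, b = task
    return a + b + 1

# Hypothetical verifier: accepts only the exact sum.
def toy_verifier(task, attempt):
    a, b = task
    return attempt == a + b

rng = random.Random(0)
good = collect_verified_rollouts(toy_policy, toy_verifier, (2, 3), 300, rng)
none = collect_verified_rollouts(hopeless_policy, toy_verifier, (2, 3), 300, rng)
print(len(good), "verified rollouts kept;", len(none), "from the hopeless policy")
```

The compute goes into sampling, and the verifier decides which samples become training data. A policy that never produces a passing attempt yields an empty set, however many rollouts are spent on it.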

For this process to work, the model must have at least some prior probability to anticipate the correct solution in the first place, which is why you need mind-stretching amounts of human expert trajectories in every single field and skill that you want the model to eventually be competent in. It's hard to overstate how task-specific and bespoke this human expert data is. If you want some intuition, I recommend checking out the job descriptions on Mercor's or Surge's website.

There are listings for Word specialists who will convert legacy documents into polished Word files, and legal experts who will write realistic M&A diligence or securities filings, and management consultants who will write up template market research. And it's not only that the data has to be so domain-specific, but there has to be so much of it.

Each skill corresponds to at least hundreds of human experts who are generating example completions, writing rubrics, and explaining their chain of thought. There's a reason that the data industry that is producing these expert labels and the RL environments in which these meticulously cataloged skills can congeal is earning billions a year in revenue, soon to be decabillion.

Now imagine if it took a couple decades' worth of courses with hundreds of concurrent professors and millions of practice tasks for you to learn how to polish a Word document. Even the task count difference here understates the gap, because the models have to grind through each of their far more numerous tasks far harder: whereas a human student might practice a textbook problem once or twice, with GRPO these models are generating hundreds to thousands of rollouts per task, and they need to in order to solve the credit assignment problem.

The correct way to think about these models is not like a human who has learned all these different skills that you see these models displaying. It's more like a Frankenstein's monster, which has been built out of a billion grafts of carefully constructed examples all sewn together.
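For intuition on why GRPO leans on many rollouts per task, here is a minimal sketch of its group-relative advantage: each rollout's reward is scored against the mean and spread of the other rollouts sampled for the same task. It assumes scalar rewards and uses the population standard deviation. The function name is made up for illustration, and the clipped policy-gradient objective and KL penalty of the full algorithm are omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to the other rollouts for the same task."""
    baseline = mean(rewards)      # group mean acts as the baseline
    spread = pstdev(rewards)      # group spread normalizes the signal
    return [(r - baseline) / (spread + eps) for r in rewards]

# One success among four rollouts: the success is pushed up, the rest down.
sparse = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# No successes at all: every advantage is zero, so nothing is learned.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(sparse)
print(flat)
```

When every rollout in a group gets the same reward, the advantages are all zero. A group only carries signal when some rollouts succeed and some fail, which is one reason the model needs some prior chance of success and a large number of samples.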

Epoch recently reported that open models lag state-of-the-art frontier models by four months. I think the reason it is relatively easy for open source and previous laggards to catch up to within months of the frontier is that data is the real driver of progress. And data can be easily distilled from public APIs, whereas hyperparameters and training tricks and architectural optimizations cannot.

And if the latter were driving most of the progress, then catching up would be far harder than we are observing it to be. It is easy to forget how much data these models are trained on, and how much more it is than what we humans see in our lifetime. We see these AIs as a galaxy glittering with capability.

But at their center, invisible to the naked eye, holding all the constellations together, is an unimaginably massive black hole of data. Just a couple of points of comparison to help drive home how big this difference is.

Comparing human vs AI sample efficiency

Here's one. If a person sees and hears on average, let's say generously, 2,000 words an hour, then between the time they're born and the time they're an adult, they'll see about 200 million tokens. Now by contrast, these frontier models are trained on somewhere between tens to hundreds of trillions of tokens. That is close to a million-fold difference.

Here's another point of comparison. If you wanted to, you could learn to teleoperate any random humanoid or robot arm within an hour. And if we could get AIs to learn just as fast, robotics would be a decatrillion-dollar industry and you'd have an endless army of Unitree G1s doing all kinds of useful work in the world. But the reason we can't do this is that our AIs learn much less efficiently than we do. And even the millions of hours of demonstrations that we've collected are not enough to allow them to perform complex open-ended tasks.
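The language comparison above (2,000 words an hour versus trillions of training tokens) can be sanity-checked with quick arithmetic. The 16 waking hours a day and 18 years to adulthood are assumptions added here, as are the 10 and 100 trillion endpoints standing in for "tens to hundreds of trillions". Like the episode, the sketch treats one word as roughly one token.

```python
WORDS_PER_HOUR = 2_000        # figure quoted in the episode
WAKING_HOURS_PER_DAY = 16     # assumption, not from the episode
YEARS_TO_ADULTHOOD = 18       # assumption, not from the episode

# Treat one word as roughly one token, as the episode does.
human_tokens = WORDS_PER_HOUR * WAKING_HOURS_PER_DAY * 365 * YEARS_TO_ADULTHOOD

MODEL_TOKENS_LOW = 10e12      # assumed low end of "tens of trillions"
MODEL_TOKENS_HIGH = 100e12    # assumed high end of "hundreds of trillions"

ratio_low = MODEL_TOKENS_LOW / human_tokens
ratio_high = MODEL_TOKENS_HIGH / human_tokens

print(f"human lifetime tokens: {human_tokens:,}")
print(f"model/human ratio: {ratio_low:,.0f}x to {ratio_high:,.0f}x")
```

Under these assumptions the human total comes to about 210 million tokens, and the gap runs from tens of thousands of times at the low end to several hundred thousand times at the high end, which is where "close to a million-fold" comes from.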

And a final point of comparison: a teenager can learn to drive a car with about twenty hours of practice. And even if we include their 16 years of growing up, understanding how the world works, and building physical intuition, that's still three to four orders of magnitude less data than Waymo and Tesla are using to train their self-driving car models.

Now I want to deal with a couple of common responses and objections that people have to these kinds of comparisons.

One thing people will say, and I think Karpathy said this when he came on my podcast, is that for humans, many billions of years of evolution had to go into basically pre-training us. And so we're being unfair when we're comparing how little data we see within our lifetimes to what these cold-started LLMs, which are just starting off with a totally random initialization, have to learn from.

I think this is not the right way to think about it. Our genome is only three gigabytes big, and only one to two percent of it is protein coding. That is simply not enough space to store the parameters of this network that evolution has supposedly pre-trained. I think the closer analogy is that evolution found the right hyperparameters and the right loss functions, and that within our lifetime we are still building up the connectome in our brain from scratch. That is to say, the analogous thing to the weights and parameters of the neural network.

And even if you granted this comparison and said, yes, the hundreds of trillions of tokens that these models see in pre-training are similar to just catching up to evolution, that still doesn't explain why any new marginal capability that you want to give these models takes so much data. Once you have been educated, again, you don't need a hundred different professors to teach you a new programming language. But these AIs, even once they're pre-trained, still require enormous amounts of data to learn the next marginal skill and the next marginal skill after that.
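To put rough numbers on the genome-size argument in this response, the sketch below divides the genome's storage by the synapse count. It uses the three-gigabyte figure quoted above and the roughly 100 trillion synapses the episode cites for the human brain. Taking two percent as the protein-coding share is the upper end of the quoted range, and treating a synapse as the thing a "parameter" would have to describe is a simplifying assumption.

```python
GENOME_BYTES = 3e9               # "three gigabytes", as quoted in the episode
PROTEIN_CODING_FRACTION = 0.02   # upper end of the quoted "one to two percent"
SYNAPSES = 100e12                # "about 100 trillion synapses", per the episode

genome_bits = GENOME_BYTES * 8
bits_per_synapse = genome_bits / SYNAPSES
coding_bits_per_synapse = bits_per_synapse * PROTEIN_CODING_FRACTION
synapses_per_bit = SYNAPSES / genome_bits

print(f"bits of genome per synapse: {bits_per_synapse:.1e}")
print(f"protein-coding bits per synapse: {coding_bits_per_synapse:.1e}")
print(f"synapses per bit of genome: {synapses_per_bit:,.0f}")
```

On these figures the whole genome offers a small fraction of a bit per synapse, thousands of synapses for every bit, so it cannot be storing per-synapse weights directly.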

Another objection to this kind of comparison is that we're not including the multimodal data that we're seeing in our lifetimes. So if you include all the sensory information that we take in from birth to adulthood, that's probably tens to hundreds of billions of tokens of data.

And my response to this objection is simply that blind and deaf people who have been cut off from all this sensory information still have general intelligence. And that suggests to me that all these billions of sensory tokens are not really the thing that is making humans smart.

And in fact, deaf people who don't have the ability to hear any tokens, who just have to consume them via sign language and reading, are probably ingesting far less than the two hundred million language tokens that we ballparked earlier, which suggests that even the million-fold difference that we calculated earlier might be an understatement.

Okay, the third common objection people make is that we just haven't scaled enough. We have these scaling laws, and they tell us that bigger models are more sample efficient. The human brain, we know, has about 100 trillion synapses. And we have frontier models that are currently around five trillion parameters.

And so maybe we could just achieve human-level sample efficiency if we made these models one to two orders of magnitude bigger. The reason this objection is off the mark is actually quite interesting. If you look at the way the scaling law equations work, they tell you that the parameter and data terms are added to the loss independently. So suppose you have a model and you've trained it compute-optimally and you say, I want to be sample efficient. I want to use as little data as possible, and I'll throw in as many parameters as necessary to make that happen. Take the constants from the Chinchilla scaling law paper. Even if you increased the number of parameters to infinity, that would only decrease by a factor of 10 the amount of data that you need in order to keep the same loss.
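The factor-of-10 claim can be reproduced from the fitted loss in the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β, using its published constants. This is a sketch under stated assumptions: "compute-optimal" is taken to mean minimizing that fitted loss with compute proportional to N times D, and the function names are invented for illustration.

```python
# Fitted constants from the Chinchilla paper (Hoffmann et al., 2022):
# L(N, D) = E + A / N**ALPHA + B / D**BETA
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def compute_optimal_tokens(n_params):
    # Assumption: compute is proportional to N * D. Minimizing the fitted
    # loss at fixed compute gives ALPHA * A / N**ALPHA == BETA * B / D**BETA.
    return (BETA * B * n_params**ALPHA / (ALPHA * A)) ** (1 / BETA)

def tokens_needed_with_infinite_params(target_loss):
    # As N -> infinity the parameter term vanishes, so the data term
    # B / D**BETA alone has to account for target_loss - E.
    return (B / (target_loss - E)) ** (1 / BETA)

def data_savings_factor(n_params):
    d_opt = compute_optimal_tokens(n_params)
    d_inf = tokens_needed_with_infinite_params(loss(n_params, d_opt))
    return d_opt / d_inf

for n in (1e9, 70e9, 5e12):
    print(f"N = {n:.0e}: infinite parameters cut data by {data_savings_factor(n):.2f}x")
```

Under these assumptions the saving works out to (1 + β/α)^(1/β), roughly 8.5, whatever the starting model size. That matches the "factor of 10" above and is far short of the thousands-to-millions-fold gap discussed in the episode.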

Humans are somewhere between thousands to millions of times more sample efficient than these models. So scaling the size of current models simply can't make up for that discrepancy. And this really does suggest that humans are on a different scaling curve altogether.

As soon as I earn money, I want to put it to work. But I also need to save for things like upcoming expenses and estimated taxes.

So to figure out exactly how much I need to set aside, I ask Command. Command is AI that is built into Mercury, which is my banking platform. And since I already use Mercury to run my entire business, Command has access to all the information it needs to get work done. I just tell Command the date I'm interested in and it does the rest.

It takes my current balance and adds whatever invoices will be due by the cutoff. Then it reviews my last six months of transaction history, so it can subtract out my monthly average expenses along with any scheduled payments. And if there's anything relevant coming up that's not in Mercury yet, I can just flag it.

Things like, heads up, there's a $12,000 contractor payment that's slated for July. And that gets included in the final output. Because this is all happening in chat and every answer has links to the underlying data, I can easily double-check Command's work.

And once I'm convinced, I can just tell Command, all right, that looks good, just transfer the surplus to my personal account. And it will immediately draft the transfer for me to approve. Command is live now. Visit mercury.com/command to learn more.

Mercury is a fintech company, not an FDIC-insured bank. Banking services provided through Choice Financial Group and Column N.A., Members FDIC. AI-generated responses and suggested actions may vary and are not guaranteed.

Okay, all these nerdy comparisons aside, you might ask:

Does sample efficiency matter?

Why do we even care about sample efficiency? Is this actually necessary for the labs to achieve the two overarching objectives they have, which are, one, to automate white-collar work, and two, to automate AI research itself?

The bet that the labs are making with white-collar work is that the tasks that a software engineer or analyst or accountant needs to do are common. And as a result, you can bring them into the training distribution quite easily. If you look at the revenue curves of these labs over the last few months, it does suggest that there's an enormous amount of value from bringing these kinds of common tasks into distribution, even if we can't replicate whatever is making human learning so special.

And it might be more inefficient to train AIs to do these kinds of tasks than it is to train humans. But so what? Human lifespan simply does not allow for the quantity and the breadth of training that these models experience.

If you, as a human, had some weird learning disability where you needed to read through every public repository on GitHub before you could be a competent software engineer, then it would simply not make sense to train you up. You'd be on Social Security by the early stages of your education. And even once you were trained, you would only be able to work on one project at a time.

But AIs can learn these skills by firehosing gigawatts of training at a time. And what they learn can be amortized across billions of sessions at once. So we can be ludicrously inefficient in training them up and still be wildly in the green.

And then there's a question of, well, how much out-of-distribution thinking do white-collar employees need to do that you simply can't train for in advance? This is more a question about the nature of different jobs than it is a question about AI research.

And it also depends on which job you're talking about. Some jobs are so mechanical and predictable that we were able to automate them long before the modern era of AI. For example, bank tellers or travel agents. But there are other jobs which require dealing on a daily basis with problems that are quite distant from the data distribution. I think software engineering is probably one such. This is the job that AIs are supposed to take first.

But I would be willing to bet that there's overall more demand for human software engineers in 2027 than there is right now, largely due to the complementary input of AI. The labs' plan for this latter category of jobs is first to automate AI research and then have the automated AI researchers solve the sample efficiency problem.

So then the question is, can AIs, which do not have human-level sample efficiency, nonetheless solve the remaining research problems that stand in the way of human-like intelligence and learning? This is a very complicated question, and I'll have to address it in a much longer future blog post.

But just to tease it a bit, I think that the way that people currently think about an intelligence explosion is very clumsy, because either people dismiss the possibility of AI speeding up AI progress altogether, or they assume that some kind of god pops out the other end. They don't reason carefully about what it looks like to have a period where AI progress is much faster than usual, but have that happen atop LLMs and the particular kinds of intelligences that LLMs are.

But I'll save that for next time. In the meanwhile, if you want to read this blog post or all the other blog posts I write, or be alerted when I write a future blog post, go sign up for my newsletter at my website, dwarkesh.com. All right, I'll see you later.

This transcript was generated by Metacast using AI and may contain inaccuracies. Learn more about transcripts.