We just launched this experiment, and we were very surprised to see that these hugely over-parameterized models not only train out of the box, with very nice training curves, but also don't overfit aggressively at all. What we found empirically is that we can just use typical supervised training out of the box. We don't have to play with the hyperparameters or the optimizer, and you get very, very stable training.
So this brings us to the question: is it worth it to spend so much money to gather a gigantic pre-training dataset and spend months and many GPUs to produce those models, when at least for some applications it seems to not be much better than random?

MLST is sponsored by Tufa AI Labs. They are the DeepSeek based in Switzerland. They have an amazing team; you've seen many of the folks on the team. They acquired MindsAI, who of course did a lot of great work on ARC, and they're now working on o1-style models, reasoning, thinking, and test-time computation. The reason you want to work for them is that you get loads of autonomy, you get visibility, and you can publish your research. They are hiring: as well as ML engineers, they're hiring a chief scientist. They really, really want to find the best possible person for this role, and they're prepared to pay top dollar as a joining bonus. So if you're interested in working for them as an ML engineer or their chief scientist, get in touch with Benjamin Crouzier. Go to tufalabs.ai and see what happens.
Originally, the main motivation was to see how much information you gain by doing pre-training. Is this next-token prediction really making your network learn something about language and reasoning? So then we said, okay, one way to compare this, at least empirically, is to just take a randomly initialized model and train it from scratch on a supervised task like sentiment analysis. In theory, because we have a very, very small training set, say 20,000 samples, and because those models have seven billion parameters, the pre-trained one should perform very nicely with a little bit of LoRA fine-tuning, because it already knows how to reason about the world; maybe you just adjust it a little bit to the specific task you want, but since it has so much knowledge it will solve the task very easily. The random one, on the other hand, should either overfit completely, because you have seven billion parameters and only 20,000 training samples, or maybe not learn at all, because the training dynamics will be completely chaotic. So we just launched this experiment, and we were very surprised to see that the hugely over-parameterized model not only trains out of the box, with very nice training curves, almost as if you were training on MNIST, but also doesn't overfit aggressively at all; it overfits less than if you just train an MLP on MNIST, basically. And this was very surprising.
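For readers who want to picture the setup, here is a minimal sketch of this kind of experiment (not the authors' actual code): take a decoder-only LLM architecture, keep its random initialization instead of loading pre-trained weights, and train it as an ordinary sequence classifier on a small labeled dataset. The model name, optimizer, and learning rate are illustrative assumptions.

```python
# Sketch: supervised classification from a *randomly initialized* LLM backbone.
import torch
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # any decoder-only config would do; chosen for illustration
config = AutoConfig.from_pretrained(name, num_labels=2)
model = AutoModelForSequenceClassification.from_config(config)  # random weights, no pre-training
tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token = tokenizer.eos_token          # Llama tokenizers have no pad token by default
model.config.pad_token_id = tokenizer.pad_token_id

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(texts, labels):
    """One plain supervised step: cross-entropy on the (e.g. sentiment) labels."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```

The surprising observation described above is that a plain loop like this trains stably and does not overfit badly, even with only on the order of 20,000 examples.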
So we said, okay, maybe there is a deeper question here, which is how much implicit bias you have in those language models. We already knew from computer vision that, for example on ImageNet, you can have a 50-million-parameter model on a one-million-sample dataset, so you have this 50-to-1 ratio, and the implicit bias prevents you from overfitting and lets you just solve the task, right? But still, it's 50 to 1, which may already sound like a lot to a statistician; now it's seven billion to twenty thousand, so the ratio is gigantic. And yeah, to me it was very surprising that a ratio of this size still allows you to learn something that does not overfit. It is very surprising because in vision, for example, transformers are known to overfit more easily than ResNets, so they seem, at least in vision, to have actually less
implicit bias or implicit regularization. But at least with this type of causal, next-token LLM architecture, you don't seem to overfit easily to your data, so this was quite surprising. Yeah, and we should bring in the name: this was your workshop paper at the self-supervised learning workshop here at NeurIPS, "For perception tasks, is LLM pre-training by next-token prediction worth the cost?" So this is absolutely fascinating, right? We've been given this belief that we need to have these huge
pre-trained models, trained on all the data on the internet. And it turns out that, certainly for discriminative tasks, so things like classification rather than generation, you can actually just start from scratch with a fairly small model and sometimes get even better results. Yeah, and even with a small or a large model: you just start from scratch and do this very simple supervised classification task, right? Okay, given this prompt, is it good or bad sentiment, or what type of job is the prompt describing? This type of, we will not call it reasoning, but more semantic classification. And it turns out that you can start from random, and even if you have a small training dataset, you will have performance that is sometimes as good as a pre-trained model. So this brings us to the question: is it worth it to spend so much money to gather a gigantic pre-training dataset, and months and many GPUs, to produce those models? For some cases, for generation, all right, there is no question that this is what you need to do: you have your next-token prediction and you learn how to generate samples. But at least for some applications, it seems to not be much better than random. So it's quite, quite interesting. So what are the differences
in the learned representations? That's something we did not really look at, like a low-dimensional representation of what you learn. It's possible: some work tries to look at the attention entropy and the like, you know, those mechanistic-interpretability viewpoints on LLMs. So it will be interesting to see if you have this sort of neural-collapse phenomenon happening, where even if you have seven billion parameters, maybe you end up learning a very, very simple sub-network that does the task, a bit like the lottery ticket hypothesis, and that naturally emerges from the training dynamics. Or is it really exploiting all the parameters? I think that's one thing. So to extend the workshop paper, we want to probe more into what the useful parameters are and what they learned: is each layer actually learning something, or maybe the first layers don't really learn anything and only the last few are?
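One way to probe this, layer by layer, would be a simple linear probe on each hidden state; the sketch below is illustrative, not taken from the workshop paper, and assumes a Hugging Face-style model that can return hidden states.

```python
# Sketch: fit a linear classifier on every layer's pooled hidden states
# to see where task-relevant information first appears.
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def layer_features(model, tokenizer, texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch, output_hidden_states=True)
    # Mean-pool over tokens at every layer -> one (num_samples, hidden_dim) matrix per layer.
    return [h.mean(dim=1).float().cpu().numpy() for h in out.hidden_states]

def probe_each_layer(train_feats, y_train, test_feats, y_test):
    scores = []
    for layer, (f_tr, f_te) in enumerate(zip(train_feats, test_feats)):
        clf = LogisticRegression(max_iter=1000).fit(f_tr, y_train)
        scores.append((layer, clf.score(f_te, y_test)))
    return scores  # linear-probe accuracy per layer
```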
So, yes, lots of open questions here. What does it tell us about the nature of understanding and maybe even intelligence? Because we think that the reason these things understand is that they have all of these representations of all the different things in their experience, and now we can shortcut that, for want of a better word. What does that tell us? Yeah, I think that's a good question. In this case, we must look at very specific classification tasks, for example: a description of what job it is, or is this good or bad sentiment? And this you are able to solve well, but you are not able to go out of distribution to solve a new type of question. For example, for this job description, you cannot answer, okay, is this job paying you more than that job, because this was not present in the training data, right? So I think you get very good models cheaply and quickly from random initialization, but they will be very specialized, and the benefit of pre-training may come if you want to do more open-ended classification or reasoning. So I think it really depends on the type of application you want to solve, what your downstream task is, and how much you want to generalize to new scenarios.
But at least now it shows that it's not the case that pre-training with next-token prediction is better for everything. So, I mean, going back five years, data scientists used to build specific classification models for everything. And now we're in this regime where we need these really big models, we do in-context learning and maybe even some fine-tuning, and we get them to do fairly specific discriminative tasks. But now you're saying we should almost go back to where we were five years ago and start building specialized models again, only now, rather than building classification models, we're still using the transformers and the LLM architectures, but we're making them do specific tasks. Yeah, yeah, exactly. I think if you only need
to solve a few specific tasks, use that prior knowledge to pick a nice architecture and a supervised dataset for them, and just train from scratch; this is probably going to work much better. But again, you need to make sure that the downstream application will never go too far out of distribution. That's why it really depends on the application and the type of use cases you have. But I think, at least here, it shows that there exist some tasks where next-token prediction is not the answer. And in fact, it's not just not the answer; it's not better than random initialization, which is really sort of the worst-case scenario.
Interesting. From a fairness and bias point of view, a lot of people say that large language models are bad in a way because of the dominance of North American culture and so on. But you could also argue the converse, which is that the good thing about them is that they do have some awareness of values, so we can fine-tune them to have guardrails and to sort of say the right thing. Is that harder to do with this approach? Yeah, so here, because you are in a fully supervised setting, you don't have as much flexibility to, let's say, change the behavior of your model, or it will have to take the form of supervised fine-tuning. Because you don't have a generative capability, it certainly restricts the type of interaction you can have with the model and how you can improve it, right? The output is just a good or bad sentiment; it's not a full answer that you can then try to argue against and generate a fine-tuning dataset from. It's just, okay, good or bad, and that's it. Another thing is training strategy. You know, the big players building these LLMs, they have lots of internalized knowledge around
you know, even the order in which you train the language models; everything is important. Certainly in the old days of basic models, you just stuck a load of data in there and no one really cared. So do people now need to be thinking about this specialized knowledge, maybe thinking about curriculum learning and all that kind of stuff? Yeah, so this is a good point. We did a paper recently called "The Fair Language Model Paradox" where we show that when you do this next-token prediction, because you have some tokens that are very low frequency, it's very hard to train on them and it takes very long training, so it's very wasteful, right? The problem is that because you do next-token prediction, you need to really capture your whole distribution of tokens, and so you spend a lot of time on that. But in our case, if the low-frequency tokens are not useful for solving your task, you actually don't need to capture them at all. So in terms of training dynamics, this is actually a much simpler problem in many cases.
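To make the low-frequency-token point concrete, one simple diagnostic is to bucket the per-token cross-entropy by how often each token appears in the corpus; rare buckets staying at high loss for a long time is the wasteful regime described above. This is an illustrative sketch, not the paper's exact protocol.

```python
# Sketch: average next-token loss, bucketed by token frequency (rare -> frequent).
import torch
import torch.nn.functional as F
from collections import Counter

@torch.no_grad()
def loss_by_frequency(model, token_ids, corpus_counts: Counter, n_buckets=5):
    """token_ids: LongTensor (batch, seq_len); corpus_counts: token id -> frequency."""
    logits = model(token_ids).logits[:, :-1]          # predict token t+1 from the prefix
    targets = token_ids[:, 1:]
    losses = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                             targets.reshape(-1), reduction="none")
    freqs = torch.tensor([corpus_counts[t.item()] for t in targets.reshape(-1)])
    order = freqs.argsort()                           # sort token positions from rare to frequent
    return [losses[idx].mean().item() for idx in torch.chunk(order, n_buckets)]
```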
And what we found empirically is that, out of the box, using typical supervised training, we don't have to play with the hyperparameters or the optimizer, and you get very, very stable training. So one thing that could also be interesting for future work is to see whether this is something that is easier to optimize, and maybe that's why those seven-billion-parameter models can learn and not overfit on 10,000 samples. And then it also suggests other things: maybe this, on its own, could be a better initialization for next-token prediction as well. This is very much up in the air, but maybe you could think of a simpler supervised objective that would be a better pre-training solution, which you could then use for next-token prediction if you wanted to; at least it would be a better starting point than random. So you would almost reverse the trend.
So we've spoken about two extremes: on one extreme we have pre-training, which you can use for any downstream task, and on the other extreme you start from scratch with just one task. Is there an intermediate solution? What if I did this new approach but for multi-task, let's say for five tasks? Yeah, yeah, so that's a great question. If you really think about it, in the limit you could formulate next-token prediction as multi-task learning, where each task is predicting: is the next token this one or not? So in the extreme case you could just recover next-token prediction on one hand, and on the other hand you have what we have here, just one task, very coarse, high level: predict if it's good or bad sentiment, or whatever. In between you have a huge spectrum that you can exploit, and if you can find, as you said, maybe five very different, representative tasks, this could be enough to learn a representation that is as general as possible, and then you can use it for new tasks that come along the way. So I think the research question is how to design the minimum number of tasks so that you have as diverse a representation as possible. And of course, you don't want to go to the extreme of just doing, again, next-token prediction. But this is a very, very nice research question, because if you have this spectrum and you can control where you want to be on it, then you can really make a per-use-case choice. It's not, okay, you're always here or always there; rather, tell me what you want to do and how many new tasks you expect your model to be exposed to, and I'll tell you where you need to be on this spectrum. So this could be very interesting as well.
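A hypothetical sketch of that middle ground: one shared backbone with a handful of coarse task heads. With a single head you recover the single-task setting discussed above; pushing the number of heads toward the vocabulary size heads back toward something like next-token prediction. Names and shapes are my own, not from the discussion.

```python
import torch.nn as nn

class MultiTaskClassifier(nn.Module):
    """Shared encoder with one small linear head per coarse task."""
    def __init__(self, backbone: nn.Module, hidden_dim: int, classes_per_task: list):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, c) for c in classes_per_task)

    def forward(self, x):
        h = self.backbone(x)                       # pooled features, shape (batch, hidden_dim)
        return [head(h) for head in self.heads]    # one logit tensor per task

def multitask_loss(logits_per_task, labels_per_task, weights=None):
    ce = nn.CrossEntropyLoss()
    weights = weights or [1.0] * len(logits_per_task)
    return sum(w * ce(lg, lb) for w, lg, lb in zip(weights, logits_per_task, labels_per_task))
```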
Very cool, very cool. It does make me think, though, that these models understand through naive statistical alignment; is it possible that the benchmarks we use just don't capture, you know, the gap in understanding that we've lost by moving away from the pre-trained models? Yeah, I think, especially in recent years, we focus a lot on generative, decoder-only methods, so the evaluation and the type of objectives we set for ourselves are really about good generation, right? Even if you want to answer a question, you need to generate a good explanation; you need to understand what the intermediate steps are. And I think the fact that we focus on generative models means that we've completely biased the evaluation and the way we approach this thing. Maybe you could still have knowledge that is learned without being able to generate anything. So I think this is also something that could be interesting to look at, or at least keep in mind
when we explore those models. But philosophically, though, isn't generation analogous to thinking in some sense? So models that generate, aren't they smarter in some deep way? Probably what you want to do is imagine what could be, but I don't think you want to do generation at very granular detail, like next-token generation. Because if you think about it, even just in terms of a classification task, you have a lot of uncertainty depending on the token. If I start the sentence, okay, "I saw this movie for...", there is no way you can tell what the next token after "for" is, right? A priori, you know it would be some time component: maybe it's one hour, ten minutes, two hours. But do you really need to be able to generate the, I don't know, "52 minutes" or whatever the answer was, to actually understand that I was watching a movie and therefore I was staying in one place for at least more than five seconds? So I think the token is way too granular, and if you had maybe concept tokens, that's where you could start saying, okay, this is meaningful, because that's closer to what we do. But right now we are very, very low level, because tokenization is a lossless compression, right? So this is too close to the raw data. And yet we have it easy compared to computer vision, because you already work in language, which is a very compressed representation of knowledge; but still, the token is probably too low-level.
Well, that was a fascinating paper. Let's move on to your next one: "The Birth of Self-Supervised Learning: A Supervised Theory", and that was with Yann LeCun. Yes. And basically you said that the observed differences between self-supervised learning and supervised learning are not due to the loss functions themselves, but rather to the labeling of the dataset used in training. Give us the elevator pitch. Yeah, so basically what we show in this paper is that you can have a supervised objective, let's say least squares to make it simple, so you have the inputs, you have your network,
and you have the labels, and you can turn this objective, which tries to map sample x_n to prediction y_n, into a self-supervised learning objective which tries to compare samples with each other. So basically you go from saying, okay, this image is a car or a dog, to saying, are those two images the same or not, which is the self-supervised, joint-embedding world. And you can show that whether you have the labels or you have knowledge of this pairwise relationship, you are actually learning the same representation, up to some symmetry that is irrelevant if you do linear probing. So the loss functions in themselves, the SSL one or the supervised one, try to do the same thing; they just operate on a different view of the labeling:
whether this image is of that class, or whether those two images or two samples represent the same thing. Given that, the next question is: how come self-supervised learning is able to generalize better than supervised learning? From this perspective, what you can say is that it's as if SSL were solving a supervised task where the labels are not about mapping all the cars to "car" but are very, very fine-grained labels, where in the limit each image is its own class. So if you think about supervised learning in this extreme setting, you also don't overfit to the task, because you don't collapse any image onto another one, and so, theoretically speaking, you can solve as many downstream tasks as you want. This equivalence of losses at least brings a slightly new perspective: it's not really about the objective, it's more about how you design the SSL pipeline, how you say, okay, this sample is related to that sample; it's not the objective that makes you learn a better representation.
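Schematically, and in my own notation rather than necessarily the paper's, the correspondence looks like this: the supervised loss only sees the labels $Y$ through the pairwise relation matrix $G = YY^\top$, which is exactly the "are these two samples the same?" information that a joint-embedding SSL objective consumes.

```latex
\underbrace{\min_{W}\ \big\| f_\theta(X)\,W - Y \big\|_F^2}_{\text{supervised least squares}}
\quad\longleftrightarrow\quad
\underbrace{\big\| f_\theta(X)\,f_\theta(X)^\top - G \big\|_F^2}_{\text{SSL-style pairwise matching}},
\qquad G = Y Y^\top .
```

Roughly, minimizers of the two sides agree up to a symmetry of the embedding that a linear probe cannot distinguish, which is the sense in which "they learn the same representation" above.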
OK. And in the paper, you were talking about how SSL can maximize the worst-case downstream-task performance. Can you sketch that? Yeah. So basically, if you think about all the possible realizations of downstream tasks, you could have some very coarse-scale ones: we have different pictures of cars and buses and you just want to say whether it's a car or a bus, so no details need to be encoded to solve this. But then you can have downstream tasks where you want to say, okay, which brand of car is it, or which color of car is it? So you have a distribution of downstream tasks, right? And the point is that you want to learn a representation such that, if you look at the distribution of downstream-task performance, you are as good as possible on most of them. You don't want to be very good on some and then, in the tail, very bad on the majority. So from this you can try to say, okay, what would be the labeling that makes your worst case as good as possible, and from this you can say, okay, this is actually the labeling that self-supervised learning is implicitly using. How does the class balance affect the difference in the losses? Oh yeah, this is a very good point. Actually, in a follow-up
we are doing right now, we show that current SSL objectives assume class balance. This is something we already highlighted briefly in the paper on SSL's uniform cluster prior that we did a couple of years ago: current SSL objectives assume a balanced representation of classes or concepts. This means that if you train on ImageNet, things work out very well, because concepts are roughly equally represented. But if you go to other datasets like iNaturalist, which are very heavy-tailed, then you have a huge bias in your representation. Until now, people did not really know how to solve this. One way people approach it is through data curation: they say, okay, I'm just going to remove the over-sampled concepts to try to make the dataset more uniform, and then I do self-supervised learning on that. But because we now have this theoretical formulation and this equivalence of losses, we can use the exact same techniques that people used in supervised learning to re-weight depending on the frequency of classes. We can use that to come up with a new self-supervised learning loss that takes this imbalance into account.
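A minimal sketch of what such a frequency-aware reweighting could look like, here grafted onto a simple invariance (VICReg-style) term; this is illustrative, not the exact loss from the follow-up work, and in a pure SSL setting the concept frequencies would themselves have to be estimated (e.g. from metadata or clustering).

```python
# Sketch: reweight positive pairs by inverse concept frequency, mirroring the
# class-reweighting trick used for imbalanced supervised learning.
import torch
import torch.nn.functional as F

def reweighted_invariance_loss(z_a, z_b, concept_ids, concept_counts):
    """z_a, z_b: (batch, dim) embeddings of two views of the same samples.
    concept_ids: (batch,) estimated concept index for each pair.
    concept_counts: (num_concepts,) how often each concept occurs in the data."""
    w = 1.0 / concept_counts[concept_ids].float()   # rarer concept -> larger weight
    w = w * (len(w) / w.sum())                      # renormalize to keep the loss scale
    per_pair = F.mse_loss(z_a, z_b, reduction="none").mean(dim=1)
    return (w * per_pair).mean()
```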
So this type of thing is enabled by the mathematical formulation, and it's principled: for the way we do this weighting, you can prove that it is the right way to do it from the supervised theory. And this is really nice, because suddenly, from this seemingly naive connection, you can come up with a new generation of self-supervised learning models that actually match what the real-world data distribution is like. So a non-uniform distribution of classes, or maybe even some samples that are noisier than others: you can include that information as part of the SSL objective as well. Suddenly you have a whole new world of possibilities, and because there is this connection, you can actually prove, okay, this is the right way to do it, at least from the supervised-theory viewpoint. You also pointed out a connection to VICReg. Exactly. Basically, what we do in the paper is show that if you have a least-squares supervised objective and you turn it into an SSL one, what you obtain is basically VICReg. Then you have a few variations: it could be VICReg or W-MSE, depending on how you go
from supervised to SSL. You can show that, depending on the type of supervised loss, you recover different types of SSL losses. If you look more at cross-entropy supervised learning, it's going to be more like a SimCLR-type loss, but you have this one-to-one correspondence. And this is also very nice because, in supervised learning at least, you know when one loss may be preferred over another; this has been studied for a long time, right, because supervised learning has been around forever. So now we can reuse those insights for self-supervised learning. To me, this is also a very, very strong benefit of this work: suddenly all the theory and the thousands of papers that have been done in supervised learning, we can just take and apply in SSL. Another example is neural collapse, which has been proven in the supervised setting: now it applies, in like five lines, in an SSL setting as well. So this connection really goes beyond just saying, okay, it's not the objective that makes SSL better; it's really
tying those two huge communities together towards a goal where you have a single, unified objective to learn representations. And this is nice too, because if you speak to people, they will think, okay, you have supervised learning on one side and SSL on the other side, and basically you are either in one camp or the other. But what we show is that SSL is pretty much everything in representation learning, and supervised learning is just one realization of SSL; VICReg without labels is another one, and this one is yet another. So you really have a better understanding of this relationship and of what representation learning is trying to do.
Galaxy-brain question incoming. Could you combine SSL and supervised objectives in some way to improve generalization? Yes, yes. There is one paper, supervised contrastive learning, where they use the labels within a SimCLR framework to basically do fully supervised learning but with a simpler objective. First of all, we can show that this indeed makes sense, and we can basically explain the empirical results they had. But actually we can do a little
bit more than that. If you are in a semi-supervised setting, for example, it may not be clear how to combine those two losses anymore; maybe you could say, okay, I have the two and a coefficient to weight them, but then you need to do cross-validation and so on. From this perspective, you can combine them in a very principled way and understand which weighting makes sense depending on how many samples you have in one regime or the other, and you can again use all the literature from supervised learning for this setting as well. So this is something you can do very easily with this formulation. Okay, so if SSL and supervised learning are two sides of the same coin, of course we can use this theoretical framework to design new forms of SSL, but is the distinction even relevant
if they are the same thing? I think it's not just two sides of the same coin; SSL is more general than supervised learning, right? SSL could be the more general objective to learn representations. The more prior knowledge you have, the more you know about your downstream tasks and about your labels, and then SSL slowly becomes supervised learning through the labels that you use in the SSL objective. But because, as you said, you have this hierarchy, it does not really make sense anymore to say you do either supervised learning or SSL; rather, what makes sense is to ask, okay, what is this relation matrix, this pairwise matrix? If you build it from labels, it's supervised learning. If you build it from other a priori knowledge, for example that two consecutive frames in a video basically have the same class, then you are more in an unsupervised SSL setting. But it's all about how you build this pairwise relation matrix.
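For illustration, here are two ways of building that pairwise relation matrix: one from labels (the supervised realization) and one from temporal adjacency in a video (an SSL-style prior). The function names are mine, not from the paper.

```python
import numpy as np

def relation_from_labels(labels):
    """Supervised case: G[i, j] = 1 iff samples i and j share a class label."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def relation_from_temporal_adjacency(frame_times, window=1):
    """SSL-style case: frames within `window` steps of each other are treated
    as positives, i.e. assumed to depict the same underlying content."""
    t = np.asarray(frame_times)
    return (np.abs(t[:, None] - t[None, :]) <= window).astype(float)
```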
That's the main question. Very cool. Right, let's move on to the next paper: "No Location Left Behind: Measuring and Improving the Fairness of Implicit Representations for Earth Data". So there are loads and loads of modeling frameworks now that do these implicit neural representations of geospatial earth data, for things like climate modeling, resource allocation, and environmental modeling. I was actually interviewing Johannes from NXAI yesterday; I don't know if you know him, but he's working on similar stuff. The problem is, you've studied this, and you found that there are loads of biases and fairness problems.
Yeah, exactly. So basically, what we show is that when you want to model, let's say, temperature or precipitation to make it simple, and you want to learn an implicit neural representation, it means you want a model such that, if you give it a location and a date, it can predict what the temperature was there. If you have this type of implicit neural representation, it's very good, because if you learn a nice model you can actually interpolate those values, so maybe estimate what the temperature was in a part of the globe where you did not have a sensor. But you can also do extrapolation: if you assume you have really learned the true physical model of the world, you could start saying, okay, what the temperature will be two years from now, right? So it's very nice to have this type of model for all sorts of applications.
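For concreteness, an implicit neural representation here is just a coordinate network: a small model mapping (latitude, longitude, time) to the quantity of interest. This is a toy sketch of the idea, not the paper's architecture.

```python
# Toy implicit neural representation: (latitude, longitude, time) -> temperature.
import torch.nn as nn

class ClimateINR(nn.Module):
    def __init__(self, hidden=256, depth=4):
        super().__init__()
        layers, d = [], 3                 # inputs: lat, lon, time
        for _ in range(depth):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, 1))    # output: temperature at that coordinate
        self.net = nn.Sequential(*layers)

    def forward(self, coords):            # coords: (n_points, 3)
        return self.net(coords)

# Fit on observed (location, date, temperature) triples; query at unobserved
# coordinates to interpolate, or at future dates to extrapolate.
```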
The thing is that when you do this nowadays, depending on the architecture and the various design choices you make, you will maybe have a very good prediction on average. So when you look at the average performance over the whole globe it looks good, but if you look, for example, around islands or coastal areas, your prediction is going to be very bad, almost random. This can be very concerning, because if you use this type of model to decide on a policy that will affect a specific island, using the model's prediction is as good as using random guesses. So it can be very detrimental, and people need to be aware of those biases. What we found is that, for this type of climate data, islands are often disregarded areas, basically regions where you have a big gradient in the kind of data you are trying to model. How much of a responsibility do modelers have, you know, to
detect these kinds of biases in the data? So I think there are two components, as you said. One could be that the dynamics of the data you are trying to model are just harder near islands, or maybe even unpredictable, because you don't have enough observations; so you have some uncertainty that you can probably never recover from with good design. But still, what we found here is that a lot of the bias comes from the architecture: how you encode those positions, the type of basis you use to do the prediction. So right now it seems that a big chunk of the bias comes from the architecture. But I totally agree; I don't think we can remove the bias entirely, because there may just be different types of uncertainty at different parts of the planet as well.
I mean, the world is a very, very complicated place. Realistically, to what extent can we mathematically model it? Yeah, that's a good question. I think it depends on the type of horizon you have and the type of data you want to model. If you have a system that is much more chaotic, or that can change very, very quickly without much change in the past observations, that's something current models have a very hard time with. If you want to predict something else, for example temperature in North America, not near the coastal areas but inland, maybe there you have fewer gradients in the dynamics; things are a bit more stationary spatially and through time, so then it can work much better. But I think at this point we don't have an architecture that is really able to understand that you have different physics, different dynamical models, at different parts of the globe, and because of this you just
end up with what's best on average, and it means you miss out on a lot of details. Can you tell us about some of the technical framework? So one thing we showed, at least for this type of globe-data representation, is that people use a Fourier basis to do the prediction. This is better than not using any basis at all, but it implies that the type of signal you're predicting is very stationary and not localized at all, and this is a very strong prior, right? It may be true for some things, but for others, like precipitation or temperature, where you have localized, very high gradients, it is a strong bias. And if you come from the signal-processing community, you know very well that to get better localization you go from Fourier to wavelets. So that's one thing we did in this paper: we showed that using a wavelet basis to encode the data gives you better localization, and this removes some of the biases. Here it's more of a proof of concept that different design choices give you different bias trade-offs.
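A rough sketch of the two positional encodings being contrasted: global Fourier features, which respond everywhere, versus a localized, multi-scale (Gabor/wavelet-like) encoding that can represent sharp structure such as coastlines. The exact basis construction in the paper may differ; this is only illustrative.

```python
import numpy as np

def fourier_features(coords, freqs):
    """Global sinusoids: every feature responds over the whole globe."""
    proj = coords @ freqs.T                                   # (n_points, n_freqs)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=1)

def wavelet_like_features(coords, centers, scales):
    """Localized oscillations: each feature is significant only near its center."""
    feats = []
    for c in centers:
        d2 = ((coords - c) ** 2).sum(axis=1, keepdims=True)   # squared distance to center
        for s in scales:
            envelope = np.exp(-d2 / (2 * s ** 2))             # Gaussian window
            feats.append(envelope * np.cos(np.sqrt(d2) / s))  # windowed oscillation
    return np.concatenate(feats, axis=1)
```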
Not that wavelets are the answer to everything, right? But I think the next step is to encode less and less a priori which basis to use, and let the model learn it from the data on its own. And we are not at that point yet, at least for this type of climate data.
How could it handle noisy or missing data? That really depends on the type of model you use. For example, if you have an INR, then you simply don't use the missing data as part of your training pipeline, and that's one of their benefits. So if one of your sensors stops recording for some years, you just don't use that period as training data, because you fully control where and when you have data and what the prediction should be there. So these earth models are now informing policy around the world.
Who should we hold accountable? I mean, is it the technology? Is it the scientists who design the models? Is it the policymakers who interpret the results? I think it's very hard for the person who designs the model to know a priori what it is going to be used for. So I think it's more downstream, when you know clearly what you want to do with it, that you should first set up a good evaluation pipeline to make sure it's something you can actually use to make those decisions. And then you can report any type of failure mode you observe so people can improve on the design. But a priori, it's very hard to imagine what a model will be used for. In the ideal setting you would wish there were no bias at all, but in practice, with the space of possibilities being so large, it needs to be more of a feedback loop: you iterate until you have something you can really trust, and then you can act on it. Earth-modeling data is very anthropocentric, right? We focus on human populations and so on. Should we also focus on, you know, just ecosystems and places?
Oh yeah, that's a great question, and in fact that's one of the big issues with a lot of the datasets that are crowdsourced: by definition, the amount of data you get is proportional to the number of users you have in each location, and this means you have a huge bias in what your model is learning and focusing on, which means you miss out on a lot of things. So that's one thing: okay, crowdsourcing can give you a lot of data quickly, but it's very biased data. So then the question is how much of this biased data versus, maybe, paying a lot more and capturing other parts of the globe, how much of the two you should have. And maybe you could show that, under some specific conditions, just having 10% of the data that is high quality and uniformly sampled, and then 90% that is crowdsourced, you can use those 10% to anchor your representation and then use all the data together. But there is a huge amount of research questions in that, because it's a very big source of bias. And there's a bit of a policy question, because we are using these things, you know, to do resource allocation, right? So
giving more resources to some populations might be taking them away from others. And then there's the fairness-over-time thing as well, which is that what is fair now might not be fair in a hundred years' time. So how should we think about it? Yeah, that's a good question. I think this is also very application-specific. For example, if you want to predict where to build a house to solve some specific problem, maybe you don't really mind having bad predictions where there is no population anyway, because you are not going to build a house there. So in this case, maybe the crowdsourced
type of data is actually good, but this really depends on the type of application. And just one thing I will say regarding the point you made before: this type of bias is actually something you also have in computer vision. There is a nice paper by Mark Ibrahim where they showed that most of the data we have in ImageNet is from North America, and so maybe you reach something like 90% state-of-the-art performance at predicting, for example, types of chairs or cars, but only for the North American versions of them. When you start looking at the types of cars or chairs in Central Africa or East Asia, suddenly the model's performance is extremely bad. So this type of problem is something you have across modalities, and it's a very, very big issue. Randall, it's always a pleasure and an honor to have you on the show. Thank you so much. Likewise, likewise. Thank you so much.