I am deeply humbled to be here tonight, and I'd like, if I do well, which is a big if, to dedicate this lecture tonight to my high school teacher. And here is why. Many years ago, more than three decades ago, when I was in high school, I was selected to prepare for and then to compete in the Mathematics Olympics, because people in Austria thought that I was smart enough to be a good contestant.
And to my embarrassment, I didn't do very well. So my high school teacher, Hans Kraus, came to me and he said, You're just not smart enough for a really good mathematician, but maybe you can be an applied mathematician. So how about the Physics Olympics? So I did. And I did very well. With that, I shall start, and dedicate this lecture to Hans Kraus, who gave me the guts to speak to an audience of mathematicians today.
I'm starting to talk about big data with a case study, a particular case that has to do with something that you're all familiar with, and that is the flu. Every year, tens of thousands of people around the world die of the seasonal flu. But in 2009, a new flu virus was detected, the H1N1 virus. And at that stage, there was no vaccine available. So the best that public health authorities around the world could do was to try to limit the spread of the virus.
But for that, they needed to know where the virus was. In the United States, this role is taken on by the Centres for Disease Control in Atlanta, and they have doctors, general practitioners from around the country, tell them each and every H1N1 case that they have. And based on that and some hard labour over days and days and days, they're able to tell the policymakers where the flu is at any given point in time, two weeks later. Which is an eternity if you have a pandemic underway.
Just around the same time, engineers at a little start-up company called Google in Mountain View had a very different idea of how to predict the spread of the flu. They said, We'd like to predict the spread of the flu by just using Google searches.
That is, search requests sent to Google. Google receives about 5 billion search requests every single day and has stored and saved all of them, including information where they came from and so forth, over the last 15 years. So the idea was to take the last five years or so of Google search requests that it received, and to take official figures from the Centres for Disease Control, and to see whether there would be a correlation, some kind of a connection.
And in that process, Google's engineers tested 50 million different search terms for 150 million mathematical models. And then they found one that provided a pretty darn good prediction. Here you see the official CDC data and the Google Trends predictions. But importantly, where the CDC prediction was always about 10 to 14 days late, Google could do it almost in real time. And the idea behind this, the direction that this points towards, is big data.
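The screening step described above, checking many search terms against official case counts and keeping the ones that track them best, can be sketched roughly as follows. This is only a minimal illustration, not Google's actual method: the function names `pearson` and `rank_terms`, the search terms, and all of the weekly numbers are invented for the example, and a plain Pearson correlation stands in for whatever models were really used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_terms(term_counts, official_cases):
    """Sort search terms by how closely their weekly counts track the official figures."""
    scores = {term: pearson(counts, official_cases)
              for term, counts in term_counts.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weekly data, invented purely for illustration.
official_cases = [10, 12, 30, 55, 40, 20]
term_counts = {
    "fever remedy": [100, 120, 300, 560, 400, 200],
    "football scores": [300, 310, 305, 295, 300, 310],
    "cough medicine": [50, 70, 160, 250, 220, 90],
}

ranking = rank_terms(term_counts, official_cases)
best_term = ranking[0][0]
for term, score in ranking:
    print(f"{term}: {score:.3f}")
```

Note that the ranking only says which terms move together with the official figures. It says nothing about why they do.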
Now, big data is not just about public health. We find it everywhere, including in finance, in financial services. The vice chancellor has already alluded, I think, in his speech to the importance of this matter, especially in light of the Great Recession of 2008. When we look at September of 2008, and when we now read the protocols of the Open Market Committee of the Federal Reserve, we find that they were clueless about what was going on because the data was not there.
In fact, the deflationary impact after the Lehman Brothers default, after the Lehman Brothers bankruptcy, was only felt in the CPI, the Consumer Price Index, many weeks later, because it takes many weeks to do the Consumer Price Index. There's a small start-up company out of a research project at the Massachusetts Institute of Technology. The start-up company is now called PriceStats. And what they do is they go out on the Internet and every single day, every single hour, they suck down over a billion price points of consumer goods from hundreds of thousands of different offerings.
And they do this in order to get very early indications of inflationary or deflationary trends. And so they were able to see the deflationary impact of Lehman Brothers and what came afterwards much earlier. If in fact policymakers had known, if in fact Federal Reserve policymakers had known, perhaps some of the policy decisions would have been made differently.
This is big data: the idea that we can gain insights from a large number of data points that we couldn't otherwise. Now, all through human history, we human beings have tried to make sense of the world around us. And we did that mostly by observing it. And in order to do that, we needed to collect data. But for all of human history, until extremely recently, the collection of data was hard, difficult, time-consuming, expensive.
And so in this world of small data, we collected as little data as absolutely necessary in order to answer the question that we had. We defined, designed, created the processes, the institutions, the structures that we used in order to make sense of the data, knowing that we had very little of it, and so that we needed to squeeze every last drop of meaning out of it. What if that changes? What if we could envision a world in which there is a lot of data available?
Then we would have to rethink, perhaps, the processes and the institutions and the structures that we use to make sense of the world, much like this gentleman who is peeking out of the old world view and into a new one. It all began in earnest perhaps some 15, 20 years ago in the natural sciences. Consider astronomy. When the Sloan Digital Sky Survey came online in the year 2000, it collected more data in its first few weeks of operation than had been amassed in the entire history of astronomy.
Since then, since the year 2000, it has amassed over 200 terabytes of astronomy data. 200 terabytes of astronomy data. But a successor telescope, of which you see a rendering, that is going to go on stream in the year 2016, is going to collect that much data every five days. Or take genetics. In April of 2003, the world celebrated a tremendous achievement.
That is, after ten years, $1 billion and a global effort, we, the world, were able to have a full sequence of one person's DNA: 3 billion base pairs. Fast forward to today, and sequencing my DNA, not just one person's DNA, my DNA, costs less than $1,000 and takes 2 to 3 days in one lab. And that generates another 3 billion base pairs of data. Internet companies, too, are drowning in data: 500 million tweets on Twitter every single day.
800 million YouTube users upload an hour of video every single second. Even if you stop sleeping, you could never watch that amount of video. And on Facebook, 10 million photos are uploaded every single hour. Google processes dozens of petabytes of data every single day. Dozens of petabytes. Petabyte. Petabyte. Petabyte. Petabyte. What did you have for lunch? A couple of petabytes. How much is a freaking petabyte of data?
If you take all of the characters in a book and all of the books and magazines and all of the other holdings of the largest library in the world, the Library of Congress together, and then multiply it by 100. That's about a petabyte. If we look at the growth of data in the world, the best guesstimate that we have over time tells us that the amount of data in the world from 1987 to 2007 grew 100 fold. Now. That's quite amazing, isn't it?
A 100 times increase. As Elizabeth Eisenstein maintains, if we go back in human history to find a time when data increased as much, we have to go back to 1450 to roughly 1506. In these 56 years of the Gutenberg revolution, the amount of data in the world doubled. Here, in 20 years, we have 100x. But that's only half the story, and it's quite a powerful one already. The other half of the story is denoted by the different colours. The light pink denotes analogue data.
The dark purple denotes digital data. And if you look at this white vertical line here, that is the year 2000. In the year 2000, three quarters of the data in the world was still analogue. Today, it's less than 1%. Within 15 years, we have moved from an analogue world to a digital world. And that, of course, means that it is easier to collect, easier to store, easier to analyse, easier to retrieve. What does this do? How can we imagine?
What are the results of this? The consequences? Well, think about it. The real element, the essence here that I want to convey, is that if you increase something radically in quantity, it can take on a new quality. Consider photography. If I take a photo of a rider on a horse, that's a photo of a rider on a horse. If I take a photo every second of a rider on a horse,
that's a lot of photos of a rider on a horse. But what if I take 16 photos per second of a rider on a horse and show them in fast succession? Then this added quantity of data, the quantity of images that I have, translates into a new quality: moving pictures. In essence, what big data provides us with is a new perspective on reality. Now, how can we characterise that perspective on reality? Let me try with three words here: more, messy, and correlations. First, more.
More means that we have more data available today relative to the problem we are studying or the phenomenon that we are trying to investigate. It's not necessary to have a billion data points. You can have 60,000 data points and still be doing big data, if that encapsulates all, or quite close to all, of the phenomenon that you are trying to study.
For instance, if you have that amount of data, that comprehensive dataset available, then you can let the data speak rather than have it answer questions that you already had in mind when you collected the data. What do I mean by that? Let me use photography again. This is the moment here that I like the most. It is the moment when I take a picture of you. Now, would you please smile? Okay. Now, as I take this picture, I have to make a decision.
The decision is, who do I focus on? Do I focus on the dapper Bill Nie in the first row? Or do I focus on you back there in the last row? If I focus on you, Bill, unfortunately, you guys back there, you'll be out of focus. You'll be blurred. And afterwards, I can't bring you back into focus. It's gone. The data isn't there. So I need to know at the moment of collecting what really is important for me and what isn't. But what if I don't? Start over again?
Good luck. What if we could have something that would be better than that? Well, take a photo again as an example. This is a photo of a toothbrush. It's in focus. Back in the blurry part of the photo you find, actually, my four-year-old son. I can't put him back into focus, right? He's blurred. Too bad. But this is not a normal photo. This is a photo taken with a big data camera called a Lytro light field camera. Here it is. And so when I take a photo, it's a huge file.
It takes in all of the focal planes. It takes all of the light rays in. And I can click on my son and there he goes and comes into focus, because all of the data is present in the photo. Or I can click, of course, on the toothbrush, if I'm more interested in that, and that comes into focus. I therefore can let the data speak and ask it questions that I didn't know that I wanted to ask when I collected it. The second element is messy.
And let me just very briefly say that in the big data age, we will combine datasets of varying quality. And that requires us to give up a little bit on our desire, coming from the small data age, to focus on exactitude of data and data quality. What we gain in volume, we can trade off in quality, at least to a certain extent. More and messy together lead to insights through correlations.
Now, in the first class in statistics, what they tell you is data does not give you causality, only correlations. That's true. It's incredibly hard for human beings to understand. So take the example of the global supermarket chain Walmart. Walmart captures all kinds of transactional data about what people buy and when and so forth. So they did a big data analysis to find out more.
And they discovered that just before a hurricane hits a Walmart location, people go to the Walmart and they buy batteries and flashlights. Of course, I would have thought so. But then they discovered that they also buy Pop-Tarts. Actually, strawberry Pop-Tarts. Pop-Tarts are a sugary American snack (please note that I do not call it food) that is being sold at Walmart. When they found this out through correlational analysis, these researchers immediately said, Oh my God, why is this the case?
So why is this the case? And they came up with all kinds of hypotheses, right? You buy these Pop-Tarts to feed your kids. You buy the Pop-Tarts because if you eat a lot of Pop-Tarts, you basically hallucinate the hurricane away, or whatever. Until a researcher said, Stop, timeout. We don't know. The data doesn't tell us. And guess what? We don't need to know. All that we need to know is what is happening, not why. And that is good enough in this particular instance. And so the others said, That's right.
So since then, Walmart, before a hurricane, moves the Pop-Tarts from the back of the shop to the front and sells even more of them. But for us human beings, that's really hard to understand, because we human beings are almost hard-coded to see the world as a sequence of cause and effect. We can't escape that. Daniel Kahneman, the Nobel laureate in economics, said that this is the fast thinking with which we think we make sense of the world, even though oftentimes it just makes us feel comfortable.
It gives us the feeling that we understand the world, even though we don't. And so, for example, suppose I had had dinner with you in hall last night, at one of the colleges that shall remain unnamed here, just a hypothetical. And suppose I had a stomach bug this morning.
I would immediately have connected what I ate in hall with the stomach bug, even though it is far more likely that I would have gotten my gastroenteritis bug by shaking hands with some of you. Our brain cannot stop creating these causal linkages. But when we do so, we must be incredibly careful, because oftentimes we are just following the wrong path. So rather than jumping to a quick conclusion about why things are the way they are, it would be better to just first know what things are.
Or in other words, we need to learn to walk before we can run. Now, one way of looking at this is through the example of machine translation. In the 1950s, the US government had amassed a lot of documents in Russian, but they didn't have enough translators, so they said, We'll translate them with the help of computers. In went the computer scientists, and the computer scientists said, Oh, this is easy.
We teach the 200 or so grammatical rules to the computer, add a dictionary to it, and in three months we are done and we have machine translation. 12 years later and about $1 billion later, that project was declared a failure. Nothing happened in machine translation until the 1980s, when engineers at IBM had a very different idea. They said, Why don't we give up on teaching the computer why one word in one language translates into one word in another language, and just go with what is, with statistical correlations, with what word is most likely going to be translated from one language into another language?
And they had a great training text, the proceedings of the Canadian Parliament, available in English and French, and they did it and it worked beautifully. It was the first machine translation that actually was useful.
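To make the idea of translating by statistical correlation a little more concrete, here is a deliberately tiny sketch. It is not the IBM approach itself, which the talk does not spell out. Instead, it swaps in a much simpler technique: counting which words co-occur in aligned sentence pairs and scoring candidates with the Dice coefficient. The function names `train` and `best_translation` and the four-sentence English-French corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train(sentence_pairs):
    """Count how often each source word appears alongside each target word."""
    cooccurrence = defaultdict(Counter)
    source_count = Counter()
    target_count = Counter()
    for source, target in sentence_pairs:
        source_words = set(source.lower().split())
        target_words = set(target.lower().split())
        source_count.update(source_words)
        target_count.update(target_words)
        for word in source_words:
            cooccurrence[word].update(target_words)
    return cooccurrence, source_count, target_count


def best_translation(word, cooccurrence, source_count, target_count):
    """Pick the target word with the highest Dice coefficient for `word`."""
    candidates = cooccurrence.get(word)
    if not candidates:
        return None  # word never seen in the training text
    return max(
        candidates,
        key=lambda t: 2 * candidates[t] / (source_count[word] + target_count[t]),
    )


# A hypothetical four-sentence parallel corpus, invented for illustration.
pairs = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
    ("a car", "une voiture"),
]

model = train(pairs)
for english in ["house", "car", "the", "a"]:
    print(english, "->", best_translation(english, *model))
```

No grammar rule or dictionary is given to the program. With only four sentences the statistics are fragile, which is the point made in the talk: the more parallel text there is, the better such counts become.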
Then they thought that they could make it even better by changing, by tweaking the algorithm. But it didn't really matter very much, so they gave up. Ten years later, this start-up company in Mountain View that I already mentioned, Google, got into the fray. A German Google engineer by the name of Franz Och said, The problem is not the algorithm. The problem is the amount of data. We just need more data.
And so they sucked in the entire World Wide Web, you know, all the different language versions of the websites of the European Union. Finally, they're good for something. They sucked in all of the multilingual websites of these multinational corporations, with all of the PDF user manuals that can be downloaded, from your router to your ironing board to your VCR. And when I read this and when I heard this and when Franz told it to me, Ken and myself, we said, You must be kidding.
I can't even read the English version of the manual of my VCR because it was written in Shenzhen or somewhere. And he said, It doesn't matter. If you have so much of it, that little blip in the quality really doesn't matter. And so machine translation in Google Translate is a wonderful example of more, messy, and correlations. But if you think now that this is all just about the Internet and Internet companies, think again.
We heard about health just a little earlier. And health is an area where big data will make tremendous inroads. Think, for example, of a particularly vulnerable group of human beings, premature babies like this one. Dr. Carolyn McGregor at the University Hospital in Toronto had the idea to use big data to help the premature babies. Premature babies are particularly vulnerable because we often discover too late that they have an infection.
Symptoms manifest themselves too late. So what did they do? They had digital sensors that measured the vital signs of these premature babies to the tune of 1200 data points a second. And they collected them over hours and days and weeks and dozens of babies, and then looked for patterns that, with a high degree of likelihood, would predict the onset of an infection later on. And they found the pattern.
And so they now can predict the likely onset of an infection 24 hours before the symptoms manifest themselves, just by looking at the patterns. And guess what? There are two kickers in the story. First, what is the pattern? The pattern is that suddenly the vital signs stabilise, rather than going haywire. Who would have thought of that? You know, every doctor that I ask tells me, If the vital signs stabilise, it's time to go home.
Patients are doing well. It's exactly the other way around with premature babies. And the other kicker is that Dr. Carolyn McGregor, who is saving babies there, has a doctorate that is not in medicine. She's a computer scientist. So what this tells us is that we need to be humble vis-à-vis the world around us, because we understand less than we think we understand. And we need to take this on board, and we need to see whether we can let the data speak in order to understand the world with more of its complexity.
Now go back to the Google Flu Trends prediction. You may have heard in the media that Google Flu Trends was off, was wrong, in December of 2012. Here on the very right-hand side, you see the spike. Google Flu Trends predicted more flu cases than there actually were. What was going on? What was going on was actually something that is crucial and brings in two Europeans, Mr. Bull and Mr. Bayes.
Because if you ask somebody, What's the chance when you throw a coin that it lands heads up? Most people will tell you 50%. That's a really good approximation, but it is actually wrong. Because if you throw a particular coin, it's not 50/50. It never is. Everybody throws slightly differently. Every coin is slightly different. We just have an approximation. 50/50 works pretty okay, like Newton's law of gravitation was pretty okay until we needed to have GPS, and then we needed to go beyond that.
But it's the same here. And so we need to, whenever we have more data available and new data available, feed that in and rethink our view of the world, our perspective of the world. That's why the Bayesian idea of using priors is so fundamental to what big data is all about. And the mistake that Google made is that they once developed a model for flu trends in 2009 and stuck to it, never updating it based on the additional data that it had. When they updated it with 2010 and 2011 data, their December 2012 forecast was almost spot on.
So we need to abandon our idealised worlds and approximate reality through big data. Now, this means that what is happening here at the Mathematical Institute is core, is essential, because we need to develop almost a new language, a new way of thinking, a new toolset for this big data world. And you are doing it here.
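The coin example and the point about feeding in new data can be illustrated with a small Bayesian update. The sketch below assumes a Beta prior on the coin's heads probability, which is a standard textbook choice and not something stated in the talk. The prior strength of 50/50 and the batches of throws are invented numbers, and `update_beta` and `posterior_mean` are hypothetical helper names.

```python
def update_beta(alpha, beta, heads, tails):
    """Conjugate update of a Beta(alpha, beta) prior with new coin throws."""
    return alpha + heads, beta + tails


def posterior_mean(alpha, beta):
    """Current best estimate of the probability of heads."""
    return alpha / (alpha + beta)


# Assumed prior: the coin is roughly fair, with the weight of 100 imagined throws.
alpha, beta = 50.0, 50.0

# Hypothetical batches of (heads, tails) observed for one particular coin.
batches = [(58, 42), (61, 39), (57, 43)]

estimates = [posterior_mean(alpha, beta)]
for heads, tails in batches:
    alpha, beta = update_beta(alpha, beta, heads, tails)
    estimates.append(posterior_mean(alpha, beta))

for step, estimate in enumerate(estimates):
    print(f"after {step} batches: P(heads) is estimated at {estimate:.4f}")
```

A model that stops after the first line of output is the one that was built once and never updated. Each new batch moves the estimate away from the idealised 50/50 and towards what this particular coin actually does.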
And when we look at it, most of the tools that we have in the small data world for making sense of data, you know, like R-squared, most of the software tools that we have like are developed here at Oxford, were originally developed in a small data context. Now we need to find their equivalents in the big data context. And that is, as I understand it, what the Oxford-Nie Lab will provide us:
cutting-edge research, trying to help us enter this big data world with the right tools so that we can make sense of it. For that, of course, we need datafication, that is, to render ever more aspects of our world into data form. We already did that with location, right? I still remember when you would take a car and drive into a new city and somebody would have a map on the lap sitting next to you. None of my students here in the audience would ever do that anymore.
You did that? they ask me. Didn't you have satnav? No. Location has been datafied and therefore we can analyse it. But it's not just that. In Japan, researchers have datafied human behinds through 30 sensors. They measured the size of the bum. Why? Never ask researchers that question. Why do they do that? Because they think that every person has a different bum. They discovered that the bum is as unique as a fingerprint. And so the idea is to use this as a car anti-theft device.
You get into your car. Your car measures your bum. You can drive off. The thief gets into the car. Bum is measured. Bum is way too big. Car stops in its tracks. That's the datafication of another aspect of our reality. Now, you know, of course, this is Google Glass. In this version, Google Glass doesn't do what I think Google Glass is going to do and where Google is investing in it, and that is to datafy the human gaze. What are we looking at?
Can you imagine how valuable it is to know what people, what human beings are looking at? Like, what advertisements they are looking at, what they are looking at in the shopping window, what men are looking at when they walk down the street? Well, skip the last one. We know that. Datafication, that is, rendering ever more aspects of our reality into data form, permits us to then extract value.
And if you take anything about the value proposition of big data away from this talk, please take away the fact that in the small data age, we used data for the particular purpose for which it was collected, and then we threw it away. In the big data age, we understand that the value of data is not exhausted by using it once, but that we can reuse it and reuse it and reuse it and reuse it for multiple purposes, extracting more and more value. Much like an iceberg, with data, most of the latent value has been untapped so far.
But we can tap into it through reuse. For example, the global financial payment company SWIFT, which is transferring money across borders. SWIFT discovered that it can use its data to predict the health of local economies, because of the correlation that it found. That's the reuse of data that it has. Or take the start-up company INRIX in Seattle, which helps over 100 million users every single working day to find their way to work or back with their car around traffic jams, creating heat maps like this, telling them where there is heavy traffic.
Now, satnavs do that, right? But my satnav is stupid. My satnav tells me that there is heavy traffic when I'm already in it. This knows when heavy traffic is forming. Why? How? Because they have the data. How do they get the data?
Because every one of their 100 million users is a sensor sending back data on where they are, how fast they are going, and so on. And then, you know what INRIX found out? That they can reuse the data. And they teamed up with a hedge fund, because it turns out that there is a correlation between heavy traffic on weekends around shopping malls and revenue numbers of the shops in the shopping malls. So they are buying or selling stocks before quarterly earnings results based on the predictions.
That, too, is a reuse of data. And when you have so much reuse happening, you can rethink your business model. Take Rolls-Royce, not the luxury car company, but the jet engine producer, the world's number two jet engine producer. Rolls-Royce used to produce and sell jet engines. These jet engines have lots of sensors in there that measure vibration and temperature and pressure and so forth and send it to a computer in the jet engine that manages the jet engine.
And then the data were thrown away. With the Airbus 380 engine that you see here, they discovered that they can actually capture the data and then send it back to Rolls-Royce headquarters once the plane has landed. They do that. It's an enormous amount of data, a couple of gigabytes per plane, per plane ride. But what do they do with it? They do an analysis to find patterns that show them when a part in the jet engine is breaking, before it actually breaks.
So they can do what is called predictive maintenance, which is great because you can do the maintenance before the part breaks, that is, when the plane is on the ground. That helps, but it also helped Rolls-Royce tremendously to change its business model around, from a company selling jet engines to a company selling fixed-fee maintenance contracts and going into the service sector. Today, 70% of their revenue is derived from services.
Many of you will now look at this and say this means only that the big get bigger, the Googles, the Facebooks and so forth. And there is some truth to it because the big ones are buying up data ingestion platforms. Remember that Google bought a thermostat company earlier this year called Nest for almost $3 billion. A thermostat company. Give me a break. Thermostat company? But Google didn't buy a thermostat company.
Google bought a data ingestion platform that collects data about how cold or warm people want the rooms in their houses to be. And that was worth, in their mind, $3 billion. So in this context, we will have big data in the financial services industry to improve forecasting and decision making in the industry. We already see that happening. We also have it to stimulate innovation generally, because with big data there will be many businesses coming up with many new ideas.
But also it will make the financial information sector itself into a data platform, which is precisely what Bill Nie and FDT have been doing. So it's not just that the big get bigger, but that there is a place for start-up companies, for vibrant new entrepreneurs. Think of Decide.com, a company that predicts, for 50,000 consumer goods, whether prices are going up or down. And they are so convinced that their prediction is right that if you buy the product based on their prediction and it's wrong, they will refund you the difference.
This is the brainchild of Oren Etzioni, and they, at Decide.com, have hundreds of thousands of customers every day, and billions of data points in order to do the calculation of the prediction for 50,000 consumer products. You know how many employees they have? 30, including the cleaning lady. How many servers? Zero, because they do it all in the cloud.
And that means that there isn't a huge investment necessary anymore to start this up. So we'll see a lot of activity on the small end of the spectrum as well, with nimble companies being extremely successful. Now, let me briefly tell you that this also changes how we operate internally in organisations. I'm not talking about 50 Shades of Grey. I'm talking about 41 Shades of Blue. That is, a few years back, Google had to decide what kind of colour to use for its Google search box.
And the designer used a particular colour and his supervisor said, Why this colour? And he said, Because I'm the designer. And she said, Did you do a test? And he said, No. If you want me to do a test, I will resign. She said, Resignation accepted. And then they did a test. And they found out, testing 41 different shades of blue, that there was a slightly different shade of blue that worked better and gave Google $12 million more in ad revenues per year.
So the person who fired that chief designer, Marissa Mayer, now chief of Yahoo!, said this was the best business decision she ever made. It means that self-styled experts will be questioned: What's your evidence? What's the factual basis? They will be asked for the basis of their assumptions, of their suggestions. You know, and that especially is valid in the financial services industry.
Companies like Imaginative in Silicon Valley are looking at creating interesting trading platforms for specialised products. But then there is FDT, which creates, with its big data set, not just a platform for people to learn, but one for those who teach to learn as well what people are learning, to learn about the learning, and therefore create an environment that keeps people engaged and interested. Just amazing. But then I'm sure you have also heard of Target.
This superstore in the United States was able to predict with a relatively good degree of likelihood, based on shopping transactions, that one of their customers was pregnant, even though that customer might not herself know of the pregnancy. And so there is a dark side of big data, too, and we need to mention that. Many of you now think of George Orwell's 1984 and the surveillance society. Yes, yes, yes, yes. Snowden, I get it. But this is only half the problem.
The other half is that with predictions, we are also able, perhaps, to predict human behaviour. And as we predict human behaviour, we might be tempted to hold people responsible not for what they have done but for what they are only predicted to do. Now if you think of the Hollywood movie Minority Report, that's precisely what I'm thinking of, too. In 30 states in the United States, the decision of whether or not you are coming free out of prison on parole is being made in part by a big data algorithm that is predicting whether or not you're going to be a criminal in the next 12 months.
Is this going to be the end of free will, because everything will be predicted about human behaviour? We need to be very aware of those ethical dilemmas, but we need to also keep in mind that the problem here is not big data. The problem is how we use the results of big data. Correlations are not causality. Using correlational insights for causal purposes, for example to assign responsibility, is abusing big data. It's stupid, it's wrong.
You know, in the United States, they did a big data analysis to decide what kind of car has the least repairs, and it turns out it's a car of the colour orange. Half of you are already thinking, Why? Right? Is it because an orange car is more visible at night? Is it because the owner of an orange car drives more carefully because it's a special car?
Is it because it was manufactured specially? Time out, guys and gals, time out. The data doesn't tell us. The moment you start imbuing data with more meaning than it has, that moment you succumb to the dictatorship of data. And we need to be aware of that. So let me come to a conclusion. Big data is going to change how we make decisions, how we live, work and think, from how we learn to what kind of medications we are getting to how cars drive themselves.
But big data also has a number of challenges. So what is really important, as we enter this big data age, is that we remain fully in control of that technology, aware of its constraints and its limitations. That, as much as we will utilise the insights from big data, we also protect and preserve a space for the human: for creativity, originality, for irrationality, for sometimes acting in defiance of what the data says. Because at the end of the day, data is just a shadow of reality.
And therefore it is always incomplete. And always a little bit incorrect. And so we need to approach this new big data world with a lot of humility rather than hubris. And we need to do so with a lot of humanity. Thanks very much.
