#175 - How to Solve Real-World Data Analysis Problems - David Asboth | Tech Lead Journal podcast

00:00

All data scientists and all analysts should spend more time in the business outside of the data sets, just in the actual business to see how it works. They should be shadowing their colleagues who are in charge of either entering the data or just doing business operations because then you have the context and then you understand the columns that you're seeing in the data. So just understanding that data generating process is really important.

00:31

Hey everyone, my name is Henry Suryawan and you're listening to the Tech Lead Journal podcast, the show where I'll be bringing you the greatest technical leaders, practitioners and thought leaders in the industry to discuss their journey, ideas and practices that we all can learn and apply to build a highly performing technical team and to make an impact in your personal work. So let's dive into our journal. Hello guys. Welcome to another episode of the Tech Lead Journal podcast.

01:07

Today I have David Asboth here. He is the author of Solve Any Data Analysis Problem. So, data analysis. I'm sure you know that the topic we are going to cover today is about data analytics or maybe data science or whatever data problems that you're facing right now, right. So I hope today David will be able to give some insights on how we can actually learn from his experience. David also has a podcast, so probably we will touch a little bit on his podcast.

01:35

So welcome to the show, David. Hi Henry. It's great to be here. Right. David, in the beginning, I always love to ask my guests to maybe share a little bit more about yourself, right. So if you can mention any highlights or turning points that you think we all can learn from you. Sure. So I've changed careers a little bit a few times. I mean, it's always been in the tech space. I started off as a software

01:57

developer. My undergraduate degree was actually in video games programming because I thought that's what I want to do. I like video games, so I thought, well, obviously I'd love to make them. And it turns out it's really difficult. You have to program some really difficult things around, like graphics and there's actually a lot more maths involved. And so you know, that was, it was a very interesting degree to do. And the thing I definitely learned from it is that I really like coding.

02:21

That's something I wasn't really exposed to before that degree. And so I became a software developer and I did that for a few years. I was really enjoying it, writing sort of enterprise software. As it sort of happens if you're in a small team, I was also in charge of the reporting. Eventually that became one of my roles as well, in that we didn't have a data team as such in the company.

02:43

And so I took on a lot of that responsibility and ended up delivering answers to internal customers by pulling data from our database. And over time, I sort of started to prefer doing that part of the job because I found that I was closer to the value-generating aspect of the business. I was closer to real business problems, whereas software development in a lot of cases is a little bit divorced somehow

03:11

from the business. Like if you're a software developer, you don't necessarily have to understand how the business runs, right? You don't necessarily have to understand how does the business make profit? Who are the customers? How did the customers make profit? You know, what is the operating model? Those things are not that relevant to software developers a lot of the time. Whereas if you're a data person, I mean, you can't provide any value unless you know how the business works.

03:34

And that's something I learned in that role. And so I thought, Oh well, maybe I should make a career of working with data instead. So that was my, I guess that was my second pivot from games into software and then from software into data. And then I did a masters in data science because turns out it's called data science. That's what I found at the time. So that's something that I should do. I was promised it was the sexiest job of the 21st century and all that kind of stuff.

03:59

And so I thought, OK, I'll study that. And so I sort of did a master's degree and then transitioned into being a data scientist in industry. And I did that for a few years, learned some very important, very interesting things about the difference between data science education and data science in practice, which I'm sure we can talk about. And then again, the sort of undercurrent this whole time in my career changes was that I

04:23

always wanted to teach. Like, education is one of the things I'm really passionate about. And I was trying to find various opportunities over the years. But it wasn't until I landed in the data world that I found my niche of data science and education. And when the pandemic hit, the part of it that was fortuitous for me

04:41

was that a lot of teaching, well, all teaching, became online, and so I suddenly had all these teaching opportunities that meant I didn't have to leave the house. It just made things logistically a lot easier. And so in late 2020 I quit my job and started teaching full time as a consultant. And so that's sort of what I do these days. I call myself a data generalist because I've done all these different things and I haven't really pigeonholed

05:03

myself into any particular role. Mostly these days I do educational work, so designing and delivering workshops, anything from half-day lectures to 10-week accelerators in data science and Python and things like that. And I'm really enjoying it because there's a variety of clients to work with and a variety of problems that people are trying to solve that I can help with. That's where I've sort of landed. I mean I don't know if anybody can learn from that.

05:28

I mean, what I've learned is to just follow your interests. Like, whatever has interested me, I just put everything else down and just went towards that. And that's how I've sort of ended up in my, like, fourth job. Hey, thank you for being part of the Tech Lead Journal community. This show wouldn't be the same without your ears, and you are the reason this show exists. If you're loving TLJ and want to see it keep on growing,

05:52

consider becoming a patron at techleadjournal.dev/patron or buying me a coffee at techleadjournal.dev/coffee. Every little bit helps fuel the research, editing, and sleepless nights that go into making this show the best it can be. Thanks for being the best listeners any podcast could ask for. And now let's get back to our episode. Thank you for sharing your story. I think it is very interesting,

06:18

right? I think many people started their computer science study because of an interest in gaming, right. So maybe people love playing games. I also did a computer graphics course back then. I think it was, yeah, difficult if you didn't get the math, I guess. So I think the career path that you took probably is also quite common for some people, right? So they started by being a generalist software developer, but found their way into a specific area, dove deep into that and became a specialist.

06:44

And you also host a podcast, Half Stack Data Science. So maybe tell us a little bit more about that. What can we learn from that podcast? Yeah, so that podcast grew from my first real data science job after finishing my degree, and my co-host Sean was actually the guy who hired me. He was the hiring manager who hired me into that role at the time. And you know, very early on in that job I realized, and Sean was the same, he sort of came from academia

07:10

into this kind of job. And we both quickly realized that what we thought the job was going to be is not at all what the job is like in reality. You know, a lot of data science education is focused on tools, techniques, algorithms. And so you get this picture that, OK, well, I'm going to be deriving formulas and doing all this complicated machine

07:29

learning at work. And then you go in and it turns out, you know, a lot of the job is navigating the complexities of an enterprise environment, so office politics and things like, oh, the data's not actually available and no one actually has any solid research questions. So we have to find those as well. And so we very, very quickly hit on this difference

07:51

between education and reality. And so we started having these conversations internally about, OK, what are the things we've learned about how industry is different? What are the skills that people should actually be trained on or at least be warned about upfront so people have a better picture of

08:06

what the job looks like. And one day we just said, well, we've had these conversations quite a lot, and every time we went to a meetup, we would talk to like-minded people and have the same conversations, and we thought, well, we might as well just put them on the Internet for other people to learn from. And so initially the podcast started off with the two of us booking a meeting room at work and taking our work Samsung phone, putting it on the table and just having a chat.

08:28

And then eventually we became a little more professional and got some proper tooling and proper microphones and things. But that's how it started. And so currently we're running a season where we're talking to educators in the data space, because I think at this point in time, getting the education of future analysts and future data scientists right is really important. So we've spoken to people who are Python trainers, but also people who are spreading the

08:54

idea of data literacy. You know, we've talked to a variety of people and it's been very interesting to see what their perspective is. And there is a lot of commonality with our philosophy, which is trying to morph education into something that is much more applied and much more ready for the real world. I'm quite interested in the name itself, like half stack. Why are you calling it half stack? I mean, is that the opposite of full stack, right? Why data science is half stack?

09:20

So probably you can explain a little bit. No, that's a good question. It's a response to the full stack idea. I mean, one thing we never liked was this idea that a single data scientist has to be this unicorn who does everything in a company. I mean, that's just not the reality. That would never function in a business, especially a solid decades-old enterprise. You can't just drop a single data science unicorn in there and hope that they'll make the company lots of money with a

09:47

machine learning model immediately. I just remember there was one presentation Sean and I gave somewhere where we talked about things like the difference between academic data science and business data science. And just somewhere on those slides he coined the phrase half stack data science. He just put it on the slide and it just sort of

10:03

stuck. And the idea is that because data science is so different in the real world as opposed to in education, you need people with this sort of hybrid skill set, a more generalist skill set. And we don't really think that having someone be full stack is realistic, at least, you know, being an expert in everything from data cleaning to statistics to software development to

10:28

business strategy development. I think these days it's pretty hard to be full stack, and the stack even gets deeper and deeper, right? So I think there are so many technologies that people need to learn these days. So I think, hence, half stack or quarter stack probably makes sense. Yeah, exactly. Fractional stack. So let's go into the topic of today's conversation, which is about data analysis.

10:51

So I think in the beginning you mentioned you realized there is a big gap between what data analysts or data scientists normally learn throughout their education or maybe boot camp courses, whatever that is, compared with real-life problems, right. So what are the typical gaps that you see, the challenges for people coming from academia or maybe from their studies and thrown into the deep end of real-world problems? So what are the typical gaps or challenges that we have to think

11:19

about? Yeah, so I've taught accelerators and boot camps and I was faced with this problem as well of trying to teach the right skills, but also within the framework of what I was expected to teach. And so you know there's a list of technical topics that you absolutely have to teach in order to make someone a data analyst, right? You need to be able to read data, combine data, clean it, identify missing values, outliers, all this kind of

11:43

technical stuff. You know, we usually teach, like, SQL, some kind of business intelligence tool like Power BI or Tableau, maybe Python, ideally have some kind of programming language in there, or maybe R. So these are sort of technical skills that you just have to teach in the foundational training, because otherwise you can't do the job, right. Excel counts as well, so all the different things you can do in Excel. And then sometimes, depending on the course, you'd also teach

12:07

machine learning. You know, how to build a machine learning model, how to do some of the sort of practitioner things like cross-validation, other things like that. But technical skills are not the whole job, right? And I'm sure multiple guests on your podcast have probably said the same thing, that software engineers don't spend most of their time writing code and data analysts don't necessarily spend most of their time analysing

12:29

data. There's all sorts of other things to do, like identifying problems to solve in the first place. That's not something we teach in foundational training. How do you have a conversation with another human and read between the lines of what problems they're actually trying to solve? And you don't even necessarily realize that that's a skill

12:48

you're going to need. And my problem with that kind of thing is that we often hand-wave it away and say, oh well, of course people will just learn that on the job, but how, right? There's no one to actually teach them in any sort of formal way. It's just sort of, oh, you just pick up these skills as you go. So anything about navigating, like, priorities in a company where five different stakeholders ask you for seven different projects. How do you

13:12

know what to work on? How do you generate a value statement for any analytical work? So how do you think about the actual quantifiable value that this project is going to have, what kind of impact it's going to have? And then some of the other things is, so you've got building all these machine learning models, but then how are these models going to be

13:31

used by the company? And so sort of planning after the first model component as well, that's something that you just sort of have to learn on the job rather than being prepared for it. And then finally, the other thing that I try and teach as much as possible is like the most realistic data sets that you can work on rather than the toy examples that we teach. Now almost everyone who's taken some kind of data science course has predicted the survivors of

13:57

the Titanic, right? That's one of the classic data science machine learning examples. And I think the real-world applicability of that problem is not that high. So the fact that the data sets in education are not that realistic is one of the problems. And the other one is that often they're actually quite clean. You know, there's a meme in data science which is that 80% of data science is spent cleaning data and the other 20% is spent complaining about the fact that we have to clean data.

14:23

But it's true that often, you know, we're the ones who have to either collect the data or find where it is in the first place, combine it, document it for the first time. And these are all things that take time. And these are all things that we don't think about much in the classroom. And you know, one reason for that is we just don't have the time, we have to teach these

14:41

technical skills. But I think there is room to make education better at the foundational level by incorporating more of these real-world elements. Yeah, I didn't study data science or data analytics, but I did some kinds of data projects, like data reporting, real-time analytics and things like that. I think the level of complexity and ambiguity, I guess, increases with the number of data sources, right. And it also depends on the cleanliness of the data, right.

15:08

So I think when you study in maybe, I don't know, a boot camp or a course, right, typically you're given a data set which is kind of like clean enough and, you know, well defined and things like that. But I think as soon as you hit the real industry, right, you realize that it's actually not as simple as that. Sometimes also, like, identifying the problems to solve, right, the question becomes much more

15:28

abstract probably, right? There's no, like, real formula and maybe there's not even like a 100% solution that you can actually come up with. And hence the first thing that you mentioned is actually identifying the problem. So how can people who learn more about technical stuff do that, right? Because typically it's quite straightforward in the boot camp, like, OK, here's the data set, here's what I want you to find. And it's kind of like

15:50

straightforward. But actually in the real business world, sometimes identifying the problem is a challenge. Not to mention as well that you may not understand the domain of the business. So maybe from your experience there are some tips that you can teach us here. Yeah, I mean I was lucky in the company I worked for. So we worked in the used car industry and I think my first day was at an actual used car auction that the company was holding.

16:13

So I actually got to see the company in operation. There was no mention of data on that day. It was all about, like, this is how the business operates. And I think that's a really good model. I think all data scientists and all analysts should spend more time in the business outside of the data sets, just in the actual business to see how it works.

16:33

They should be shadowing their colleagues who are in charge of either entering the data or just doing business operations, the sales people, the customer engagement people, everyone who's contributing in some way to the company. I think data scientists should understand all those functions because then you have the context and then you understand the columns that you're seeing in the data.

16:55

Again, in education, we normally say, OK, here's the data set and then here's something called the data dictionary where each column is labeled and then we tell you what each of the columns means, which is great, right? For educational purposes, great. That thing doesn't exist in the real world. Usually we have to write our own data dictionary, but even then that's a very narrow view of looking at it just as a table of

17:15

numbers. What we should look at it as is the wider context in the business, right. So where does this data come from? Who enters it, when do these records get generated? So just understanding that data generating process is really important. And yeah, you said about understanding the domain, it's really important for a data scientist to understand the domain they're working in.

17:34

And that's the kind of thing you can learn on the job for sure, like you don't need to spend a year as a used car salesman before you go into a data science job in the used car industry. But once you're in there, you know, having that additional curiosity and context about the domain is definitely important and not something that you can delegate to other colleagues. Yeah, sometimes. I think also, like, data scientists or data analysts typically, right?

18:01

They love playing with data or their tools, their SQL or BI tools, whatever. Don't forget that you should also collaborate, right? You're good with crunching the data, cleaning the data. But if you collaborate with the domain expert, maybe sitting side by side, show the data and ask questions about why this matters or what kind of data you

18:18

are dealing with, right? So maybe if you collaborate more, you'll get to learn much better, because otherwise you'll probably be stuck identifying the problem and also giving a solution. Which brings me to the next

18:30

topic of discussion. Right, in your book you mention this results-driven approach, being pragmatic when you come up with data analysis, because sometimes I can see we crunch data, we use sophisticated tools and techniques, right, but it doesn't necessarily bring results. You know, after a few weeks you come up with the result, and maybe the business or the stakeholders don't really get it, right, or don't feel satisfied with the answers.

18:54

So tell us a little bit more about your results-driven approach so that we can be more pragmatic in our data analysis. Yeah, I'm glad you mentioned collaboration because that's partly what it comes down to. As data people, we need to remember that our primary goal in any company is to provide value of some sort, whether that's clearly monetary or saving time through automation. Whatever it is, that is our

19:21

primary goal. And something I say to students is, if you can solve a stakeholder's problem, answer the question with a single bar chart, then fine. It's the problem solving that matters. It's not the level of sophistication in your tools. And that's what I'm trying to get across in the book, that the key thing is to have an end goal to start with. Like, one of the first things you need to do is define an end goal that you want to reach.

19:47

And it might be like the simplest version of the problem. It might be the smallest possible answer. I call it the minimum viable answer in the book, which is, you know, what is the absolute minimum amount of work that you can do to get something, a result that will take you to

20:02

the next step. And then you know, it's usually then a conversation, as you said, a collaboration with stakeholders to say, look, this is what I did based on our conversation, what do you think, what direction should we take this in? And just having that at the forefront of your mind throughout the analysis I think is really helpful because as data people, we can go down the rabbit holes. If you're working on the data set, usually you get more questions generated than

20:25

answers. Every time you look at a new column, you're like, oh, there's missing values here, oh, there's a relationship between these columns, and so you can keep exploring it forever and, as you say, never get to a result. So if you already know upfront what your result should be, the whole exploration has a goal that you're directing towards, which you know makes it quicker to get to an answer, and it will also make the answer more useful

20:46

and impactful. So it's very interesting that you mentioned we should start with an end goal in mind, right. So I think typically, maybe from my experience, I didn't see many data analyses work that way, you know, giving an end goal first, and typically it's like stages, right. So you start with the first crunching of the data, the milestone, and then you give the preliminary result, right. But I think, understanding the end goal, how should probably

21:10

the result look like? Confirming with stakeholders. That's what you advise in your book actually, right? So you do the first iteration, do as little as possible, which is the minimum viable answer that you have, and ask for feedback and then shape it from there, right? Rather than going down the rabbit hole. I think for many people, especially for me when I work with data, it's always fun to crunch data, you know, firing different SQL statements and waiting for the results and storing them

21:37

somewhere, right? It's always fun, and maybe generating reports or charts which are fancy, but I think sometimes it doesn't solve the problem. So I think it's very, very important: have a goal in mind, the end goal in mind, clarify that with stakeholders, and iterate, right. The iteration here, how would you suggest people do it, right? How short should the iteration be? Or what is too long for you, so that you have to be wary about it? So maybe a few tips on the iteration.

22:03

Yeah, that's a great question and something that we often wrangle with on our podcast. We had a whole episode dedicated to estimating time, because in the software world what I was used to is that we got pretty good at estimating how long a task would take, right? It's like, oh, we need to add two buttons to a web page and you're like, OK, I need to write some functions in the background, maybe I need to, like, create a new database column, I need to write this code.

22:27

It's probably going to take me a day, and we were pretty close most of the time for these little chunks of work. In the world of data analysis, there's so much uncertainty, partly because the problem is ill defined, partly because we don't know the data very well, partly because we could find anything of interest and go down all these rabbit holes that it becomes very difficult to estimate how long something will take. But you can't just say that to a

22:50

stakeholder, right? You can't just say, I have absolutely no idea when I'm going to get back to you on this. It's not, unfortunately, not viable in the real world. So what we usually said was, again, we didn't call it a minimum viable answer at the time. We just said we'll do some work to get towards this particular answer, which we've all agreed on. Looks like a plausible first step. So in a week's time we'll report back or in a few days we'll report back.

23:15

And then rather than saying when the work will be finished, we would just check in at intervals. That's one way to do it is to just say we're going to tackle this problem, we're going to work on it for a while. This is the end goal we have in mind and we'll check in in a few days and the check in might be here it is. Here's your minimum viable answer or the check in is this question that you had actually is very difficult to answer with

23:35

the data we have. Here's the other kind of data that we need to collect before we can give you an answer. Again, it's about collaborating and keeping your stakeholders in the loop, so shorter iteration cycles are basically what you're after. So I think in the typical software development project or maybe product development, right, many teams actually don't start with a good data design so

23:57

to speak, right. So they come from the operational point of view, you know, they just do transactions, store the data and that's it, right. So there will be tables, mostly relational tables probably, and then these tables will be given to the data analyst to derive some insights, right. So I think in your book, in all the problems that you have in each chapter, right, you always come up with the data dictionary.

24:19

I know it's probably a bit of a luxury in the real world to actually see a good data dictionary, but tell us really the importance of this data dictionary. Is it just defining, you know, table column types and column descriptions, or is there something else beyond just that? Yeah, in the real world it is rare to have a document like that. I mean, you might have pieces of it scattered around but not collected together in something as coherent as a data dictionary.

24:47

I mean, the purpose of the data dictionary is at a surface level, to record all the different columns in the data and what they mean, what data type they should be and then what they represent. Sometimes it's because the column names are abbreviated, so it just tells you sort of what the column actually means or what what the abbreviation stands for. But deeper than that, what you want a data dictionary to tell you as well is the process that

25:09

generates those columns. So for example, just to give you an example, we had sale data from the used car industry. Every time there was a sale at an auction that was recorded and we had a couple of different columns. One of them was called the sold date and the other one was called the date sold, which just sounds like we have just this redundancy of two columns that measure the same thing, but it turned out they don't measure the same thing.

25:33

What it turned out was one of them was the date that the sale happened, so when the auction happened, but there was another date that could be in the future because sometimes there's a dispute around a used car. You know, like they buy it and they look at it and there's a scratch and they didn't see it before. And so they dispute with the vendor. Maybe they'll take the negotiation offline and agree on a different price in a couple of days, and then that date gets stamped separately.

25:57

And so if you have a very high-level, surface-level data dictionary that just says this is the date sold, and then for the second one it also says something like this is the sold date, that's not useful. It needs to give that deeper context of why we have these columns in the first place and what the different ways are that the values could be filled in.
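The point about the two date columns can be made concrete with a small sketch. The Python below is only an illustration: the `ColumnEntry` structure, the snake_case column names, and the assignment of which column holds which date are all assumptions, since the conversation does not say which of the two columns was which.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnEntry:
    """One data dictionary entry, including how the value gets generated."""
    name: str
    dtype: str
    meaning: str
    generated_by: str  # the business process that fills in this column


# Hypothetical entries: which column carries which date is an assumption.
DATA_DICTIONARY = {
    "sold_date": ColumnEntry(
        name="sold_date",
        dtype="date",
        meaning="Date of the auction at which the sale happened.",
        generated_by="Stamped when the sale is recorded at the auction.",
    ),
    "date_sold": ColumnEntry(
        name="date_sold",
        dtype="date",
        meaning="Date the final price was agreed.",
        generated_by=(
            "Stamped separately when a post-auction dispute is settled; "
            "can be a few days after the auction date."
        ),
    ),
}


def describe(column: str) -> str:
    """Return a one-line description that includes the generating process."""
    entry = DATA_DICTIONARY[column]
    return f"{entry.name} ({entry.dtype}): {entry.meaning} Generated by: {entry.generated_by}"
```

Calling `describe("date_sold")` then tells an analyst not just what the column is called but why it can differ from the auction date, which is the context a surface-level description leaves out.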

26:16

So ideally the data dictionary also talks about the data generating process and how the business operations translated to this particular data set. And I think in particular it's becoming more important if, let's say, you have a multi-stage kind of process that derives the data, right? Hence probably these days people refer to it as data lineage, right?

26:35

Where you start from a typical business process that generates the data, but then it goes through different transformations, maybe different systems, different processes, until it gets to the final sink, the last place where the data gets stored, right. So I think the data dictionary will be much more important because you don't just see, you know, the column names and the values, right? Which is sometimes misleading, just like what you mentioned, right? I think date.

26:58

So date is not just a common thing, right? But probably you can also see it all over different data sources, because maybe different people are creating the column, different teams are creating a column, and also maybe different departments are using the same term but actually meaning different things, right? Hence, probably the Domain-Driven Design kind of practice makes more sense. And I think not just the data dictionary. In your book you also mentioned

27:21

this thing called data modeling. So you mentioned this is probably the most important step as well. Before you start data analysis, tell us a bit more about data modeling. What is this step and what should people do in the data modeling exercise? So one of the ways I think about this is that there's a good definition of data science. It's been around for a while, which is the process of turning data into information and information into insights. And it's this first step of

27:47

turning data into information. And again the terminology is very sort of blurred. But when I think of data turning into information, it means the data is whatever you have lying around. As you said earlier, you alluded to operational transactions that happen have to be stored in a database. They weren't collected with analytics in mind necessarily. They're just database records that power some kind of customer facing application. And as analysts we want to come in and analyse that data.

28:15

But in its raw form, it's almost never usable. We need to do something to it. And when we do things like what we might call data cleaning as part of an analysis, what we would like to do is to do that data cleaning once and have a cleaned version of that data stored somewhere. Part of data modelling is doing that data cleaning once, so that the logic of the cleaning is encoded already in the data that

28:36

we use. I mean, lots of companies will have this problem where, for example, you have a bunch of Tableau dashboards and in every dashboard there's a formula that calculates some relevant metric, but that calculation is duplicated across every dashboard, so if you ever make a change to it, you need to remember which dashboards it's also in, and it becomes the sort of mess that you don't necessarily have a handle on.

28:57

If you have a clean data model where that metric is already pre-calculated and the dashboards just read from that clean data model, then you've got that problem in one place. So if you need to change the metric, all the dashboards will update, and that's the very sort of simplistic way to look at it. And the other reason to do data modelling is to sort of capture business entities in the right way.

29:18

So one of the problems we had to work with was: what is a customer in our business? And that sounds like a very simple question, like obviously a business knows what their customer is, but we had different business areas that worked with individuals. So individual used car dealers. So they were people. But then we also had customers who were entities. And they might be, again, they might be a single dealership, or they might be like a parent

29:42

group. And so of those different entities, which one is a customer? Well, that depends on who you ask, and it depends on the purpose that you want to use the data for. And so what you need to do then is have some sort of customer data model where everybody agrees on the definition of a customer, and then if somebody needs to know how many customers do we have, ideally all they have to do is just count star from that table.
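
A minimal sketch of the customer data model idea described above, using Python's built-in sqlite3 module. The table, the column names, and the rule for who counts as a customer are all made up for illustration; they are not taken from the book or from the business discussed in the conversation.

```python
import sqlite3

# Hypothetical raw operational records: individuals, dealerships, and
# parent groups all sit in one table. (id, kind, parent_id)
RAW_ACCOUNTS = [
    (1, "individual", None),
    (2, "dealership", 4),
    (3, "dealership", 4),
    (4, "parent_group", None),
    (5, "dealership", None),
]


def build_customer_model(conn):
    """Load the raw records and encode the agreed customer definition once."""
    conn.execute(
        "CREATE TABLE raw_accounts (id INTEGER, kind TEXT, parent_id INTEGER)"
    )
    conn.executemany("INSERT INTO raw_accounts VALUES (?, ?, ?)", RAW_ACCOUNTS)
    # Assumed business rule: a customer is an individual, a parent group,
    # or a dealership that does not belong to a parent group.
    conn.execute(
        """
        CREATE VIEW customers AS
        SELECT id, kind
        FROM raw_accounts
        WHERE kind IN ('individual', 'parent_group')
           OR (kind = 'dealership' AND parent_id IS NULL)
        """
    )


def count_customers(conn):
    """The 'count star from that table' query."""
    return conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]


conn = sqlite3.connect(":memory:")
build_customer_model(conn)
print(count_customers(conn))
```

Because the definition lives in one view, anyone who needs a customer count runs the same one-line query, and a change to the rule happens in a single place.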

30:05

That's the sort of dream scenario where you've done all the business logic and all the work up front to have a clean data model that can then be analyzed much more easily. And this again, is something that we don't really talk about in foundational training. We say, yes, you need to clean your data, but we don't say once you've cleaned your data, you should probably have a clean version of it somewhere and stop cleaning your data multiple times, stop repeating yourself.

30:27

And so that's why I dedicated a whole project in the book to data modelling to sort of practice this idea of taking raw data and turning it into a specific structure which is tailored, again specifically tailored to the questions you're going to ask in the business. Specifically about data cleaning, actually this is probably like what you said, the

30:46

meme, right? 80% of your effort probably is spent first understanding how dirty your data is and then like because there are many variations, right? Sometimes it could be the user input that is probably not clean. Second thing is there's no validation in the software that captures it.

31:02

And the third thing, for whatever reason, right, people put different formats, you know, like from different systems, probably they don't have a uniform format, so they just use whatever format that makes sense for them. So data cleaning probably is something that is really, really

31:14

hard. And first of all, right, if you have millions of records, for example, you probably won't understand how clean because you might look at the first, I don't know, 100 rows and you just deduce, OK, this is typically the data, but actually there are many other columns or many other

31:28

data that you don't see, right? So maybe a little bit of tips, how can we actually do this data cleaning much more efficiently so that we don't fall into the gotcha where actually you clean maybe 50% of the data, but the other 50% is something else, you know, like a different rubbish altogether. So maybe from your practical world example. Yeah, I think what's funny is we used to have this term, big data, right, to describe data that cannot be processed on your

31:56

laptop. And you don't see that term around very much. And that's because processing power, even on individual laptops, has grown so much. And even getting access to a remote cluster that has a lot more resources than your laptop is pretty easy these days. So I don't think we often have that problem where you can only clean half of your data and the other half you can only revisit when you run some code later or something.

32:21

I mean, some companies obviously have data that's so huge that they need special methods, but I think in most cases that's not the case anymore. But when it comes to data cleaning, I mean one thing I would tell students is don't try to clean the whole thing at once before you do anything with it because you're going to find some issues down the line

32:38

anyway. So again, it's just this pragmatism of figure out what part of the data you need right now and have a look at it. And there are some checks you can do, right? There are some surface level checks you can do, like what are the unique values in this column, are there missing values, are there outliers? You can do those things, but some of the more complex problems or patterns in the data that will either invalidate your analysis or require you to redo some of your work,

33:06

you won't notice them until you start working. And so again, don't be wedded to this idea that you have to make a perfectly clean data set before you start working. Do, again, the minimum that you need and just start doing the analysis with the knowledge that you're probably going to have to go back to step one again and again. So if you would ever watch me doing an analysis, it's never like I have the analysis in my head and I just have to type it out.
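
The surface-level checks mentioned above (unique values, missing values, outliers) might look something like the following sketch. It uses only the Python standard library; the two columns and their values are invented for illustration, and the 1.5 × IQR outlier rule is one common convention chosen here, not something the speaker prescribes.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical columns from a small product table; None marks a missing value.
categories = ["toys", "Toys", "books", "books", "toys",
              "books", "toys", "books", "toys", "books"]
prices = [8.0, 9.0, 9.5, 10.0, None, 10.5, 11.0, 11.5, 12.0, 950.0]


def unique_values(column):
    """Count each distinct value, which exposes variants like 'toys' vs 'Toys'."""
    return Counter(column)


def missing_count(column):
    """Number of missing (None) entries in the column."""
    return sum(1 for value in column if value is None)


def iqr_outliers(column):
    """Values outside 1.5 * IQR of the quartiles (an assumed, common rule)."""
    values = sorted(v for v in column if v is not None)
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


print(unique_values(categories))
print(missing_count(prices))
print(iqr_outliers(prices))
```

Checks like these are quick to run on whatever slice of the data is needed right now; they will not catch the deeper problems that only show up once the analysis is under way.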

33:29

It's an active process where you do some stuff and you go, oh no, this doesn't make sense at all because I found something in the column. I have to go right the way back to the top and start again. And so you might have to rewrite some of your code. You might just have to rewrite a bit at the top and keep going. And there are cases in the book where, you know, I have example solutions for each project and I go down the particular rabbit hole that I followed to

33:52

get my particular answer. And you'll see things in there where I say, oh, it turns out this is the case. There's an e-commerce data set in there and then we have some products that are miscategorized. And in the way that I wrote the example solution, it's trying to be as realistic as possible. So I haven't done the analysis and then written it up cleanly. I sort of write it up as the real process. And so halfway through, you're like, oh, these labels don't make sense.

34:16

We need to go back and fix this quality issue in the data before we can carry on. That's the realistic way to think about it: you will find data issues throughout the process, so don't worry about getting it perfect the first time. Yeah, I think it's worth emphasizing, right, probably 100% accuracy, sometimes it's not possible, right, especially if you're dealing with really large

34:36

data, right. So maybe some kind of percentage where you would accept, OK, maybe this is the normal kind of data, maybe state that assumption or maybe state that signal that you can see from the data. And I think there are plenty of useful tools these days that can actually give you a sense of like a distribution, for example, in a column, how different or how is the variance

34:54

of the data inside, right. It can give you a statistical distribution or it can give you some kind of patterns that you can probably deduce what kind of data is inside. So use that kind of tools. I think speaking about big data, I think these days people want to build like a data lake in a company where you put everything together into a data lake.

35:14

Maybe in your practical experience, is there some kind of different challenges that people have to deal with dealing with data lake or maybe big data in general as well? Yeah. I think it's very tempting to take absolutely any data that you have lying around and dumping it somewhere and saying, oh, we'll come back to it when we need it. And so the technology is there to allow people to do that quite easily.

35:39

And the problem with that is that there's no thought given, again, to the end product of, like, what are we going to use this data for? And dumping stuff into a data lake because at some point in the future we might need it is not necessarily the best approach. It can create a lot of problems down the line. And if you think about the phrase data science, half of it is the word science. So that's not how science works, right? In science,

36:02

when you need to collect data, you have a hypothesis, you set up an experiment, you actually have a sort of theoretical framework to build around before you even think about the data part. And I think some of that could be applied to, sort of, the business world, where we don't dump stuff into a data lake for the sake of it; we think a bit more about, you know, what is the problem we're actually trying to solve. Therefore, what is the data that we need?

36:26

Therefore, where should we store what information? Thanks for the tips. I think, yeah, because of these cloud technologies and potentially storage cost is cheap, right. So they will just dump everything and maybe think about it later how we can use the data. But I think sometimes it's not wise, simply because, yeah, the amount of data is just large, right. And how to deal with it, what kind of insights, probably is just difficult if you start with

36:48

that big amount of data. And I think these kind of challenges are quite typical in a day-to-day world. But maybe from your experience, what are the typical business problems that a data analyst should know about and should equip themselves with? Maybe in your book you mentioned things like categorization, dealing with time series, or maybe what are some of the favorite typical problems that

37:10

people should be aware of? Yeah, I picked the projects in the book specifically to address topics that I thought were missing from foundational data training but that actually come up a lot in the business world. You mentioned time series

37:26

forecasting. That's one of the things I talk about a lot with students is that, you know, we usually have like maybe one session on time series forecasting and we'll teach them a little bit about how to reshape time data, how to think about time data differently from tabular data and how some of the methods are different. We don't spend a lot of time talking about like econometrics or anything, which is where there's a lot of time series forecasting problems.

37:49

But I think the opportunity to forecast things in the real world is actually that there's a lot of those opportunities and it's actually much bigger than we let on in basic training. So that's why I have a project dedicated to time series data. And then there's also this other idea of working with categorical data. Now that's something that we mentioned as an aside in foundational training, we'll say, yes, sometimes your data is categorical.

38:12

And here are a couple of methods that you can use to transform that data into something else. But if you work with operational data like people filling in forms and entering things in records into a system, anytime there's a drop down, you've got categorical data. And so it's actually a lot more prevalent than we let on. We spend a lot of time talking

38:32

about correlation. We spend a lot of time looking at distributions and things for continuous data, but we don't talk about methods for categorical data enough. And one problem with that is then you're not equipped to deal with all these columns that you'll actually see in the real world. But the other problem is that people accidentally shoehorn continuous methods into categorical data. So I even break down an example

38:56

in that chapter in the book. There's a famous heart disease data set where you're trying to predict whether someone has heart disease based on various different measurements. And like the entire data set is numeric. So it looks like, oh great, we have all this continuous data, we can just throw correlation at it. We can throw all these continuous methods at it. But if you actually read the data dictionary and going back to what we said before, you actually see that most of those

39:21

values are categories. And they're like one of them is something like the slope of, I guess like a table or treadmill or something that was during the test. And it's not a measurement, it's not an angle of the slope, it's just one of some values. So they're not on a continuous scale. So if you start applying methods that are meant for continuous data on that column, you're going to make incorrect inferences from it. And the really difficult thing about this is that you don't get

39:47

an error message if you do that. The data analysis tools are not going to tell you, "Are you sure your methodology is correct?" No, because you've just said I want an average of this column, but it doesn't make sense. It doesn't make sense in that context to average that column. And so the difficulty here is to remember that you need to think through your methodology harder, because the computer is not going to tell you otherwise. Yeah.
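
A small sketch of the point about categories stored as numbers: averaging the codes runs without any complaint, while counting the labels is what actually answers a question. The codes, the label mapping, and the sample values below are hypothetical stand-ins, not the real data dictionary of the heart disease data set.

```python
from collections import Counter
from statistics import mean

# Hypothetical coding for a "slope" column: the numbers are category codes,
# not measurements on a continuous scale.
SLOPE_LABELS = {1: "upsloping", 2: "flat", 3: "downsloping"}
slope_codes = [1, 1, 2, 3, 2, 1, 2, 2]

# This runs happily and returns a number, but "1.75" is not a slope category
# and has no meaningful interpretation.
average_code = mean(slope_codes)

# Treating the column as categorical: count how often each label occurs.
label_counts = Counter(SLOPE_LABELS[code] for code in slope_codes)

print(average_code)
print(label_counts.most_common())
```

Nothing in the first calculation signals a problem, which is why the data dictionary has to be read before choosing a method.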

40:09

Hence, I think the data dictionary again that you mentioned is very, very important, right. Understand where the data gets generated, right, which business process, which system, what kind of inputs that can be possible, not just looking at the data and create your own assumption. So I think that's pretty dangerous.

40:25

And I think when you mentioned about prediction, there are a lot of problems that a data analyst has to come up with, which is to actually derive predictions or maybe models to actually predict a result, right. And this is typically an unknown problem where you don't actually know the accuracy of what you come up with. So how do you deal with that kind of ambiguity, first of all? And second thing is, how can you come up with a much better

40:49

prediction. So maybe something in the typical real world, do you do much more rapid iteration and test it in the production before you actually come back and derive a second derivation of what you did? So maybe some tips here as well. Yeah, that's a great question because prediction is obviously something everybody says they want. The question is, you know, what is the output of that work? That's something we found out

41:13

the hard way on a project. You know, I built a predictive model for something that, from a technical point of view, was accurate enough to use. And when we tried to put it into production, we found various organizational barriers to it, like the data that the predictive model requires doesn't arrive in time, so we can only make the prediction when

41:32

it's too late. And then the clients that we would use this with didn't actually have the levers in their business to change anything based on our predictions. So it was a sort of twofold failure from an organizational point of view. And from then on, we were much more strict about, again, starting with the end of like why do you want us to make these

41:51

predictions. And I think as a data person, that's a question you should ask immediately when somebody says, I want you to build a predictive model, or we should be predicting this thing: OK, but what are you going to do with the predictions? What is going to change in the business? How are you going to respond to these predictions?

42:11

And so it's nice to have that conversation up front because then, you know, your stakeholders are forced to think about, OK, if we had this predictive model, what would we actually do with it. So again, it's not a technical challenge because I think the technical challenge of prediction is pretty well catered for. There's lots of libraries to do machine learning. There's lots of tips and tricks out there, but the organizational side of it is really where these projects are won or lost.

42:37

So my biggest advice would be again to have that human conversation of what are you actually going to do with your predictions first and foremost? Yeah, you mentioned something very interesting, right. So organizational challenge. So not necessarily all the time is a technical problem or data problem, but actually organizational challenge. And I think what you mentioned also very very insightful in my

42:57

opinion, right? Don't just build any predictive model as if like you just want to learn different algorithms and tools right, And use whatever fancy techniques. But actually thinking about how is the model going to be used in the real life scenario, what kind of value can be derived from there? Is it even possible to be used by the business? And speaking about predictive model machine learning, I mean the topic of AI. These days there are so many discussions about using AI to do

43:24

some kind of mundane analysis. Can AI be used also for data analysis? What have you seen in the industry, typically, of how AI is going to change the landscape of data analysis? That's a very interesting question. I played around with the data analysis capabilities of these various tools. Some of them are more sophisticated at this point in time. For anyone listening in a few months' time, it's going to change anyway, so there's no point naming tools

43:50

specifically. But, you know, some tools are more advanced in data analysis than others. And on the one hand, it's great to democratize the ability to say, Here's a somewhat messy spreadsheet, give me some information about it, give me some insights, give me the biggest drivers to success or to a sale or something to, you know, what drives property prices based on the spreadsheet of retail transactions, that kind of thing.

44:16

On the one hand, that's great because you don't have to put people through technical training to get there. But I think what it does create is the necessity that everybody understands how data analysis is done from a more sort of theoretical point of view, to understand what is possible, what are the limitations, what are the biases to look out for? What are the biases, societal biases that will be baked into

44:42

the data. And this is true for any output these AI tools generate, but also for any analysis that comes out, and also any analysis you do, regardless of AI or not, as you know, there's going to be these biases in there. One very, very basic example that I showed on a course was uploading survey results, right. So imagine you've done some kind of online survey and you've got a CSV or a spreadsheet of some sort of the responses that you can download from this tool.

45:10

And so yes, you can upload it to one of these AI tools and say, you know, give me some information. And one of the things I demoed was telling the AI to give me the average response time. So what is the average time that people took to fill in this survey? And so what the AI tool did is it identified that there is a start time and end time column. It understood that those are

45:31

supposed to be dates. It understood that they should be different so you can tell what the difference is and then that difference should be averaged. And so the answer we got from this, it was a realistic, it was a real survey data set. And the answer, it was like, oh, the average response time was something like 49 minutes from

45:46

this survey data. And you know, I said to the participants, you should never believe what comes out from the data analysis because you should think about, does that answer make sense in context? We were all very sceptical. It shouldn't be 49 minutes. It was a very short survey. And you go into the data and sure enough, there's one outlier, somebody who left the computer on all day. And so their particular response time was 8 hours. Everybody

46:09

else's was like 5 to 10 minutes. But when you said to the computer, you know, I want the average, it just took the mean of that column. And so that was heavily skewed by that one outlier. And obviously as an analyst you should have maybe taken the median, something that's more robust to outliers, and got a more realistic value. On the one hand, it's good that there is transparency in these tools. It actually gives you the code it ran to get to that

46:31

analytical answer. So you can double check and you can check its homework. So in that sense it's less of a black box. But if you don't have the required data literacy or statistical training, even that little bit of statistical training to understand what does an outlier mean, what is the difference between a mean and a median, you wouldn't necessarily see what was wrong, and you might be trapped into just believing the

46:53

answer that comes out. So just like with any response from an AI tool, people just need the healthy skepticism of just reviewing the answer and checking whether it makes sense. Very interesting example that you gave, right? So I typically use AI these days to generate code, you know, like a coding assistant. I think it's a much more well defined problem, right? You can actually test it straight away. Given an input you can see the

47:17

output. But dealing with data, I think it's a different kind of a problem because maybe you don't even know the answer, right? So if you just believe what AI is giving you, I think that's a danger. That's the first thing. Second thing, I think all these LLM tools are not deterministic. So maybe you ask the first time it gives you this answer, maybe second time it's a different thing. How can you actually work with that kind of request response model, which is probably

47:40

different every time you ask? And the third thing is the assumption that is baked into how AI actually processes the data, right? Which is why the critical thinking aspect is very, very important. Do you think that data analysts should feel that their job is safer because of this? Or how should they equip themselves with the AI so that they can have an AI assistant that can power their data analysis process much better? I like the way you phrased it, which is, is the job safer?

48:08

Most of the time people will ask whether the jobs are going to be replaced by AI. I think it's just the same as with programming: for very basic tasks, I can see automation happening and I can see some of these tools taking some of the work away, potentially generating boilerplate code, generating, like, I want to create a chart to do this, I'm not familiar with this particular visualization library, getting up to speed with that and getting the right chart out of the other end.

48:39

You can definitely see AI accelerating that process, but just like with writing code, if you don't understand fundamentally how that task is done, you can't check the outputs and you can't debug any problems. And the problems with data analysis, as we said earlier, are even trickier than with programming because you don't get an error message; you'll get some kind of answer. You can average a numeric column and still get an answer, whether or not that makes sense from a

49:05

methodological point of view. So I think the future might be that we have AI built in to help us accelerate those little bits of code and little bits of manual tasks that we might want to automate away. But I don't see an analyst's job changing fundamentally because we're still supposed to be trusted business advisors, we're still supposed to be generating value for the business, and AI is just going to be another tool in our toolbox to do that.
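
The survey example from earlier in the conversation can be sketched in a few lines. The response times below are invented (they do not reproduce the 49-minute figure from the real survey); they only show how one 8-hour outlier drags the mean away from what a typical respondent did, while the median stays put.

```python
from statistics import mean, median

# Hypothetical survey response times in minutes. Nine people took 5 to 10
# minutes; one left the survey open for 8 hours (480 minutes).
response_minutes = [5, 6, 7, 8, 9, 10, 6, 7, 8, 480]


def summarize(values):
    """Return both the mean and the median so they can be compared."""
    return {"mean": mean(values), "median": median(values)}


summary = summarize(response_minutes)
print(summary)  # mean is 54.6 minutes, median is 7.5 minutes
```

Asking for "the average" yields the mean without any warning, so it is up to the analyst to notice that the answer is implausible and to reach for a statistic that is more robust to outliers.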

49:33

Yeah, I think in the, I don't know like in the non tech world some people predict that they can replace all people by AI maybe like a question and answer model right, where you can just give a data and then you start questioning them and they just give you an answer. But I think this is quite dangerous if you actually don't really understand the analysis and like for example it's a very simple example, you know survey

49:54

average response time, right. While you have outlier, the result can be really really different. So I think here the critical thinking is very very important, right? Don't just assume that everything AI generates is actually valid. So maybe from your experience, I don't know how much you have applied AI. So any kind of problems, like your favorite problems that can be solved by AI more effectively, Maybe you can share some of your power AI user

50:19

experience I guess. Yeah, I do have some AI examples in the book as well. I try to identify places where AI will actually accelerate the process. One of the examples, there's a chapter where the data set is actually a bunch of PDF files and so the task is to extract data from these PDFs and then do the analysis. And that's not something that we usually train people on. It's quite a niche thing. Although PDFs are everywhere, it is quite a niche thing to have to analyse data from PDFs.

50:50

Coming into that problem with an AI assistant means that you can accelerate the process of finding, for example, the right Python library. That's one of the examples I have in the book: I want to extract data from PDFs, what are my options in Python? Like, I could go away and Google it, but it'd take me a lot more time, so I do like using it for that if it's a domain I'm not familiar with.

51:10

If there's a problem I'm trying to solve where I think there must be a Python library for this, so somebody else must have solved this problem in a way that I can use it, AI acts as like a super powered search because it's not just a search query, you can actually give it some context about what you're trying to do. That's definitely one aspect I see. And then just, you know, accelerating smaller tasks. I mentioned creating charts.

51:33

If you want to figure out how to do a specific kind of data visualization, and again, you're unfamiliar with the library, you

51:39

need a bit of help. You know, AI can give you the starter code for it. Although one problem, and this is I think true for Copilot and other kinds of tools that generate code, is that it's only trained on what it's found on the Internet and there's a lot of sort of not even incorrect code, but maybe inefficient code or maybe not quite the right way to do it. Again, a very specific example is there's a Python plotting library called matplotlib, and

52:03

there's sort of two different ways to use it. One of them is the old MATLAB style. If anybody listening has ever used MATLAB, there's a specific way to create plots in MATLAB, which is how matplotlib was originally written, and that's how you create plots in it. But now there is a much more modern object-oriented interface for building with matplotlib, where you create your chart object and you assign the properties and stuff, like more

52:24

how you would sort of write software in general. But unfortunately a lot of the code on the Internet uses the old style, and so the AI might perpetuate the use of the outdated, less modern style of code. So even for tasks like using a library, you've got to be careful about what the training data is out there. That's practical tips for people, maybe some creative ideas for how you apply AI in your day-to-day job, right? So I think extracting text from PDF, also like coming up with

52:51

different charts. I could also imagine like for example downloading data from a typical API, right? Because if you're not familiar with the API, maybe AI can help accelerate that. Or maybe just like transforming data into like a different format. That is probably also another small task that you can use AI to solve. So David, it's been quite a great, insightful conversation about data analysis.
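
For readers who have not seen the two matplotlib styles contrasted earlier, here is a minimal side-by-side sketch. It assumes matplotlib is installed; the data, function names, and labels are made up for illustration. Both functions draw the same bar chart.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar"]
sales = [10, 14, 9]  # made-up numbers


def plot_pyplot_style():
    """Older MATLAB-like style: calls act on an implicit 'current' figure."""
    plt.figure()
    plt.bar(months, sales)
    plt.title("Sales by month")
    plt.ylabel("Units")
    return plt.gcf()


def plot_object_style():
    """Object-oriented style: create the Figure and Axes, then set properties."""
    fig, ax = plt.subplots()
    ax.bar(months, sales)
    ax.set_title("Sales by month")
    ax.set_ylabel("Units")
    return fig


fig_old = plot_pyplot_style()
fig_new = plot_object_style()
fig_new.savefig("sales.png")
```

The object-oriented version keeps an explicit handle on each chart, which tends to matter once a script produces more than one figure or a grid of subplots.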

53:15

As we wrap up the conversation, I have one last question that I would like to ask you, which is something that I call three technical leadership wisdom. So if you can think of it just like an advice that you want to give to the listeners here, maybe if you can share your version of three technical leadership wisdom. Yeah, sure. So one of the things that I think differentiates like good analysts from the best analysts is curiosity. Just wanting to find out the answer to something.

53:42

I don't know if that's good advice because I don't know if you can teach that to someone. I don't know if you can learn to be more curious. But even if just mechanically you try to find the answer to things and persevere beyond the first answer or beyond the obvious answer, that is really such an important skill in life. But it's particularly in data analysis, like really wanting to dig in to find the answer and not resting until you're satisfied with the answer is a

54:07

particularly good skill. And just from that, I think practicing your skills, just whatever your tech skills are, just constantly practicing them is so important. I mean, I'm just writing, you know, this book is a project based book full of practice opportunities for people. Because I think that's really once you've got the foundations, the best way to learn is to apply your knowledge and practice. And so just keep doing that in

54:29

the data realm. That just might mean solving problems for yourself, even if it's optimizing your fantasy football team or whatever. It doesn't have to be a money making business opportunity every time. Just something that solves a problem with data is great. So practicing your skills to stay relevant and to keep them fresh and to learn new things is vital. And just on the sort of more organizational side of things, I think it's very important to have a good reason

54:56

to do data work in the first place. I think there is a maybe mistaken belief in business that doing data analysis is inherently useful. It's just inherently a good thing that we should be doing. But if you don't have a purpose, if you don't have an end goal, if you don't have a good reason to do it, it's not going to be successful. And that's true for analysts as much as entire organizational strategies: just have a good reason to do data work in the first place.

55:23

Very interesting last wisdom there. So I think I can actually relate with that kind of advice that you just gave, right? Because many people think, OK, we have the data, we have the data. So let's just come up with the insights. What insights are? Maybe they don't know what kind of insights they want to derive from it. So I think, yeah, know the reason why you want to tackle a data problem. I think it's really, really

55:42

important. So for people who love this conversation, they want to learn from you further. Or maybe they just want to find more about yourself and your book, maybe. Is there a place where they can find you online? Yeah, I think probably following me on LinkedIn is probably the best place to look. So my name is pretty uncommon, so you can put it in the search

56:00

bar and find it pretty easily. Yeah, I don't post much on other social media outlets anymore, so I think LinkedIn is probably the way to see what I'm doing more day-to-day. I also have a website where you can check out the book and there's a link to the podcast and other things that I've written. And yeah, the book is available on the Manning website, and when it comes out, it'll be available on Amazon. Thank you.

56:20

I wish you good luck in the process of publishing that, and I hope people who listen today are much better equipped to solve any data analysis problem, just like the title of your book. Thanks, Henry.

Transcript source: Provided by creator in RSS feed