166: Hugo Bowne-Anderson – The Trader’s Guide for Learning to Code (with a Data Scientist)

Sep 27, 2018 · 58 min · Ep. 166

Summary

Aaron Fifield speaks with data scientist Hugo Bowne-Anderson about the compelling reasons for traders to learn programming. They explore the advantages of coding over traditional spreadsheets, offering insights into language selection (R vs. Python) and practical first steps for beginners. The discussion also covers overcoming common coding frustrations, the importance of asking the right data questions, and Hugo's perspective on building a basic quant trading strategy. The episode delves into data cleaning challenges, predictive model pitfalls, and the ethical considerations in data science.

Episode description

For this episode, I speak with Hugo Bowne-Anderson, a data scientist at DataCamp (an educational platform for learning to code) and host of the DataFramed podcast.

The idea for asking Hugo to appear on this episode was to chat about learning a programming language. For some traders, the ability to write code can have great advantages, such as being able to collect stats on market behavior, perform research in a robust, data-driven way, visualize large amounts of data, backtest and analyse trading ideas, and implement algorithmic strategies.

Plus more professional trading firms and finance related positions now require applicants to have some programming skills. And the same goes for many industries, which should be no surprise, considering a recent IBM study revealed that ‘90% of the world’s data has been created in the last two years alone.’

Hugo and I discuss when someone should consider learning to code, determining what’s relevant, the time it takes to become fluent in a programming language, working with new datasets, what to be wary of when using predictive models. And for fun, I ask Hugo (as a data scientist) how he’d go about creating a basic strategy…


Transcript

Episode Sponsors and Welcome

Chat with Traders is brought to you by Trade the Pool. Did you know that every decade the market reinvents itself? Online brokers opened the doors, mobile apps made trading seamless, and commission-free trading erased barriers. Now a new era has begun. Meet Trade the Pool: limited-risk trading. And now you also have unlimited time to reach the profit target. From now on, your trading risk is capped and your trading opportunities are limitless.

Trade the Pool funds home-based stock traders with up to $200,000 in buying power. That means you can trade larger positions and scale your strategies without risking your own savings. It's time to trade with more capital, making it truly worth your time and effort. Ready to trade the pool? Click the link in the description and join the stock trading revolution today.

Are you ready to get serious about trading? Then join Tasty Trade, Investopedia's best platform for options trading in 2026. Options, futures, and more. Tasty Trade has everything you trade, all in one platform. Get low commissions, including zero commissions on stocks, so you can keep more of what you earn. Trade smarter with advanced charting tools, a pre-built strategy selector, risk analysis tools, and more features. Visit Tastytrade.com/chat for more information. Tasty Trade Inc. is a registered broker dealer and member of FINRA, NFA, and SIPC.

Introduction to Coding for Traders

Hey, hey, what's up crew? Welcome back. Or if this is your first time listening (and I anticipate this episode may attract a few first-time listeners), then welcome. It's nice to have you tuned in. I'm your host, Aaron Fifield. For this episode I got to speak with Hugo Bowne-Anderson, who, as far as I know, has never traded a single share in his life, but he is a brilliant programmer.

More specifically, Hugo is a data scientist at DataCamp, which is an online educational platform for learning data science. Hugo also hosts DataFramed, which is a podcast by DataCamp. So the idea for asking Hugo to appear on this episode was essentially to chat about learning to code. Because for some traders, having the ability to write code can have great advantages, such as being able to collect statistics on market behavior, perform research in a robust, data-driven way, visualize large amounts of data, backtest and analyze trading ideas, and go as far as implementing algorithmic strategies.

Plus, as some of you will already be aware, more professional trading firms and even broader finance-related positions are requiring applicants to have some programming skills. And this spreads across all industries too, which should be no surprise considering a recent IBM study revealed that about 90% of the world's data has been created in the last two years alone.

So some of the topics we dig into during this episode include: when someone should actually consider learning to code, how to determine what's relevant for you, how long until you become somewhat fluent in a programming language, how to work with new data sets, and what to be wary of when using predictive models.

And just for the fun of it, I also asked Hugo, as a non-trader, how he'd go about creating a basic quant strategy. Now if you like the episode and you're keen to learn more about this type of thing, I do have an affiliate link for DataCamp, which you can use: chatwithtraders.com/datacamp.

Unfortunately, you won't get a special discount for using this link, but you will be helping to support Chat With Traders. I myself have been a paying subscriber and user of DataCamp for the past couple of years, so I've got no hesitation recommending this resource to anyone else who wants to learn how to code. So again, that's chatwithtraders.com/datacamp. Thank you very much. And I think that's everything. So here is Hugo Bowne-Anderson for episode 166.

Hugo's Data Science Journey

Well, obviously most of the things we're going to be chatting about on this particular podcast are going to be around programming. But that aside, I'm interested to hear a little bit about your own personal background. I know you're at DataCamp now. What did you do leading up to this?

Yeah, that's a good question, because my passage to programming, I think, is one of many routes that people take these days. My background is actually in science and humanities, science and arts. For my undergrad I did a combined Bachelor of Arts and Bachelor of Science, with a double major in pure and applied maths. So I finished the science part. This was at the University of Sydney. Then I went to grad school to do a PhD in pure maths at the University of New South Wales.

I tried to finish my arts degree at the same time, but they kicked me out because purportedly you can't be enrolled in two different degrees at different institutions, and I wasn't aware of that. So I got kicked out of my second undergrad. When I finished my pure maths PhD, which was in an area called algebraic geometry, I realized I wanted to get more into applied mathematics: applied math, or applied maths, as we say.

And so I actually started working in biology. I started a postdoc in Germany, in Dresden. There's a big Max Planck Institute for Cell Biology there, and I was doing applied mathematics in a biology lab, thinking about cell growth, cell dynamics, and the biophysics behind these types of complex systems. This is the first time I started working with a lot of data. And I actually started calling it "day-ta" then, because my American colleagues thought I was talking about a surrealist movement when I said "dah-ta" too often.

But I started working with a lot of data at this point, and biologists kept asking me the same questions about statistical tests, about cleaning their data, about learning R and Python and this type of stuff. So I essentially taught myself data analysis, statistics, and a bunch of machine learning in order to stay ahead of the curve working in these fields.

I then wrapped up my postdoc. My postdoc boss accepted a professorship in the US at Yale University in New Haven, Connecticut, so I spent two years there doing research. I was also doing a lot of teaching and education there, because biologists kept coming and asking me the same questions. I'd started to run what I called practical statistics and data science workshops.

At that point I met the DataCamp team, who were around eight or nine people then, and they were really trying to get moving on their Python curriculum. So they put the hard sell on me, and I jumped on board to come and build online education systems that teach data science through in-browser experiences. That's really how I ended up at DataCamp. A lot has happened since then, though.

Okay. And how long have you been with DataCamp for now?

So I've been here for two and a half years now, and we've grown from ten people to seventy-five in that time. We've moved from teaching essentially R and Python to now teaching R, Python, SQL, command line stuff, and Git, which we may get to, which is version control: versioning your code for reproducibility. And spreadsheets now as well. We've got our first several spreadsheets courses, which we're really excited about.

Yeah, I did see that, which is kind of interesting. Just going back to your background, it sounds as though (and I might be wrong here) you took it upon yourself to learn how to code. It wasn't something that was expected of you in your role. You just saw it as something that would be very beneficial to what you were doing. Would that be a fair statement?

Yeah, that's absolutely right.

You hit the nail on the head. My job was to do applied mathematical modeling, whatever that may involve. And I realized at that point that it actually involved learning a lot about the data I was trying to model, learning a lot about the data generation process, and thinking about it statistically. The way we do that these days is using programming languages. People used to flip coins and use pen and paper, but if you want to deal with large amounts of complex data from different sources these days, you need to do it using programming. So I did self-start in that way, and I think that's a lot of people's journey to coding as well. A lot of data scientists, programmers, and coders don't have computer science or software engineering degrees; they learned on the job.

I'm a little bit surprised that you didn't learn any of this type of thing with the studies you were doing at university.

Yeah, that's a good point. In pure maths I did a small amount of MATLAB and a bit of a language called Maple, but very minimal. The applied math stuff was really more focused around differential equations and real analysis that leans towards applications. But this is a big question in how we're teaching people mathematics and statistics now: maybe we haven't necessarily done our job, with too much of a focus on calculus, for example, and not enough on getting people analyzing data, simulating stuff, looking at distributions, and all of this fun stuff.

Why Traders Embrace Programming

Well, let's turn the conversation now to maybe help some of the people who are listening to this podcast. Most of the people who tune in are traders and investors who are participating and betting in financial markets. When or why should someone consider learning to code?

Firstly, I want to say that one of the alternatives is spreadsheets. And I think spreadsheets are amazing in a lot of ways. Tens of millions, if not hundreds of millions, of people are using spreadsheets to do their job. I do think there are challenges involved in using spreadsheets, and I'll discuss this in a variety of ways. There are even online resources cataloguing the types of mistakes people have made with spreadsheets. We don't have access to all of this information, but I've met a significant number of people in finance, working for big companies, who've lost a significant amount of money due to spreadsheet errors, for example.

So if you're working in spreadsheets, I definitely think it can serve you well. But in today's job market, and even for a lot of traders, if you want to have an edge or a differentiating factor for your practice and the type of work you do, I think programming is definitely a very viable option. We see now companies and organizations like Citigroup, Bank of America, and JP Morgan; all of these places put Python first now.

Who was it? It was Robin Wigglesworth, in the Financial Times a few years back, who wrote that it used to be traders who were first-class citizens of the financial world, but it's technologists now who are the priority: people who can work robustly with large amounts of data coming from disparate sources. And the way to do this now is with programming, whether that's Python or R or databases.

We can have that conversation. But I think the real conversation needs to revolve around whether you can do this stuff in a GUI, a graphical user interface, of which Excel or spreadsheets are an example, or whether you want to do it in a programming language. And if I may, I'd like to give you a few things about programming that I think really help with this type of process.

The first is that it's reproducible. When I'm using a spreadsheet or some sort of GUI, I do a bunch of clicks and I can send you my results, but you can't reproduce what I did. All you have is the results. Whereas if it's text-based, I can share it with you straight away, and you can reproduce it on your own operating system. It's readable. And on top of that, it's what we call diffable: you can see the diffs. What I mean by that is, if you write some code to automate a process and then you change it, because it's plain text we can see exactly what changes have been made, and we can see the workflow and the process there.

On top of that, one of the strongest aspects of coding and programming is that it automates stuff, and I've hinted at this throughout what I've said already. If you're doing pivot tables a few days a week or whatever, and you can write three lines of code that do it for you, that's just common sense to me. Pointing and clicking isn't scalable in that way. So I think these provide general arguments for why programming is good. We can talk about why open source programming in particular is good, depending on how you'd like to frame the rest of this conversation.

Okay. Yeah, cool. Well, I'll make note of that and we might get to it. Right now I'd like to ask you: you spoke there about spreadsheets, and I presume you're talking mostly about Excel, right?

Yeah, or Google Sheets. These are the two big ones that I know of.
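To make the "three lines of code" point about pivot tables concrete, here is a minimal sketch in Python using pandas. The data, the column names (`symbol`, `side`, `qty`), and the helper `daily_summary` are all made up for illustration; nothing here comes from the episode itself.

```python
import pandas as pd


def daily_summary(trades: pd.DataFrame) -> pd.DataFrame:
    """Total traded quantity per symbol (rows) and side (columns)."""
    return trades.pivot_table(
        index="symbol", columns="side", values="qty", aggfunc="sum", fill_value=0
    )


# Two made-up data sets with the same layout, e.g. Monday's and Tuesday's fills.
monday = pd.DataFrame({
    "symbol": ["AAA", "AAA", "BBB", "BBB"],
    "side": ["buy", "sell", "buy", "buy"],
    "qty": [100, 40, 10, 5],
})
tuesday = pd.DataFrame({
    "symbol": ["AAA", "BBB"],
    "side": ["buy", "sell"],
    "qty": [50, 15],
})

# The same function is reused on each new data set: no re-clicking required.
for name, trades in [("monday", monday), ("tuesday", tuesday)]:
    print(name)
    print(daily_summary(trades))
```

Because the pivot lives in a plain-text function, it can be rerun on any new data set with the same columns, and any change to it shows up in a diff.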

So it's probably fair to say that there's a large majority of people who are very familiar with Excel or Google Sheets and spreadsheets in general. When is it beneficial for someone like that to actually take the next step and learn an actual programming language?

I think it's beneficial when they're doing the same stuff all the time. Doing pivot tables was one example, but if someone finds themselves in Excel on Monday, Tuesday, and Wednesday with different data sets doing a similar thing, it makes sense for them to learn to write some code in order to automate that. When it wouldn't be beneficial is if it's a one-off job, doing a bunch of data entry, for example. But in terms of automating time-consuming rote tasks, it definitely makes sense.

It also makes sense when they want to do more robust modeling and understand why their models are saying what they're saying. You can model stuff in Excel; Excel's relatively powerful with basic modeling. But if you want to build models that you can dig into and find out why they're doing what they're doing, I think both Python and R are exceptional at this as well.

Another aspect is if you want to share your workflow with other people. As I said, when you're doing a lot of pointing and clicking, collaborators and colleagues can't really get insight into what you're doing. Whereas if you're working on a team of people writing code together, you can see exactly each step, especially when it's well commented, which I'm a huge fan of. So these are several reasons.

I would also say one of the problems with spreadsheets is that your data source, your logic, your functions, and your formatting are all intertwined. You see nightmare spreadsheets where people highlight a row in order to mean something. And I think separation of data from this type of logic and from formatting is incredibly important to do robust data analysis and data science in the end.

I like that last point you gave there. I hadn't really considered that, actually.
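The point about separating data from logic and formatting can be sketched in a few lines of Python with pandas. This is only an illustration under assumed names: the `flagged` column stands in for whatever a highlighted row was meant to convey, and `add_return` and `render` are hypothetical helpers.

```python
import pandas as pd

# Data: plain values only. What a spreadsheet might encode as a highlighted
# row is stored as an explicit column instead.
positions = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC"],
    "entry": [10.0, 20.0, 50.0],
    "last": [11.0, 18.0, 50.5],
    "flagged": [False, True, False],  # hypothetical "needs review" marker
})


def add_return(df: pd.DataFrame) -> pd.DataFrame:
    """Logic: return a copy with the percentage return per position."""
    out = df.copy()
    out["return_pct"] = (out["last"] / out["entry"] - 1) * 100
    return out


def render(df: pd.DataFrame) -> str:
    """Formatting: presentation only; the underlying values are untouched."""
    return df.to_string(index=False, float_format=lambda x: f"{x:.1f}")


report = add_return(positions)
print(render(report))
# The meaning of the "highlight" is now queryable like any other column.
print(report.loc[report["flagged"], "symbol"].tolist())
```

Keeping the three concerns apart means the raw table can be replaced, the calculation changed, or the display restyled without touching the other two.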

Now, you mentioned Python and you've mentioned R. Obviously there are many programming languages which exist and which are open source, which anyone can access easily enough.

R vs. Python for Traders

How do they decide which one they're going to focus their time and energy on learning, as a trader?

Yeah, so I think this is a deeply personal question, and I need to be as sensitive as possible, because there are very strong views on each side of the argument. If your main interest is in doing statistics, exploratory data analysis, data visualization, and these types of things, R is a really good place to do this. Particularly because there's a set of packages referred to as the tidyverse, by Hadley Wickham, who's from New Zealand as well, and his colleagues. This set of packages allows for very thoughtful exploration of data and data visualization. One of the great things is that they've written these packages so that you write code the way you think about the data. They refer to the functions in these packages not as functions but as verbs. So you'll filter something, then select something, then arrange something, and it's very easy to read in that respect.
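The filter, select, arrange sequence described above is R's tidyverse style. As a rough analogue only (this is pandas method chaining in Python, not tidyverse code, and the table is made up), the same read-it-like-a-sentence flow might look like this:

```python
import pandas as pd

# Made-up table; symbols and numbers are illustrative only.
prices = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC", "DDD"],
    "sector": ["tech", "energy", "tech", "tech"],
    "close": [12.5, 40.0, 7.25, 99.0],
    "volume": [1_000, 250, 4_000, 300],
})

result = (
    prices
    .query("sector == 'tech'")              # "filter": keep only matching rows
    .loc[:, ["symbol", "close"]]            # "select": keep only these columns
    .sort_values("close", ascending=False)  # "arrange": order the rows
    .reset_index(drop=True)
)
print(result)
```

Each step reads as one action on the data, which is the readability property being praised; the R original uses its own verb names and piping syntax.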

So for these types of things I think R is incredibly strong. If you want to do machine learning on large data sets, in production, and serious automation of tasks, which I think is perhaps more what traders are interested in, Python is probably a win there, particularly in the machine learning and deep learning space, although R is making strides there also. I would also add that the barrier to entry for Python is slightly higher, I think. But I like to refer to Python as the Swiss Army knife of programming languages, because rarely will it be the best tool for the job, but you can do anything with it.

Yeah, okay. So what are some of the limitations of R which you're going to be able to get past with Python? Obviously you mentioned machine learning there. Is that the key one, or are there a few other things which separate the two?

Yeah, I think probably productionized machine learning models, as opposed to getting a machine learning model working on your laptop, which you can submit to an online machine learning competition. For getting something productionized, so coming in with your streaming data source and getting all the results out, or feeding it into another API or something like that, Python is a lot stronger. I do think, though, we are seeing massive strides in R on that side. But in all honesty, I think the deep learning packages are probably more mature in Python, and so is the way these packages all interface with each other. So I think those are the places that Python wins. Having said that, the open source software community on both fronts is moving so quickly that what I say could be outdated tomorrow.

Yeah, well, that's the world we live in, right?

Exactly.

Practical Steps for Learning Code

Let's say that someone wants to begin learning R or Python. They've never looked at a programming language before, they've never written any code, maybe some basic Excel. Where should they even begin? What are some of the best first steps you can take when learning to code?

Right. So I'm definitely biased. I would say DataCamp is a fantastic place, and I'll mention a few others afterwards. One of the reasons I initially joined DataCamp, and one of the reasons I'm still here, is that DataCamp has helped to lower the barrier to entry for so much of the learning process, in particular the initial steps. When you come to DataCamp, you don't need to install anything locally on your machine. We spin it up for you; it's all in the browser. You can start learning and writing Python code straight away and feel functional straight away. That's one of the most important things for our learners, and for learners in general: to feel like you're doing something, not spending the first 90 minutes installing something and then having an error returned, right?

So I've been there.

Exactly, we've all been there, right? So I definitely think that's very helpful and motivating, in all honesty.

For people interested in finance, we have a finance curriculum here. You've also spoken with Yves Hilpisch in the past, and his Python quant stuff is really, really interesting, and he has online courses for those very interested in the finance stuff.

Depending on whether you've chosen R or Python, I'd also figure out where you want to write your code. What I mean by that is, there are these things called IDEs, integrated development environments, and that sounds like a mouthful. All it is is a piece of software that you click on and open, and it's where you write code, save it, execute it, this type of stuff.

For Python in data science (and you can do R and Julia and all of these things in this infrastructure as well), there's an incredible project called Project Jupyter. Jupyter notebooks, and now JupyterLab, are really good for getting started running code, because a notebook is like an interactive data science notebook where you can write text, store images and videos, and write code and execute it in your browser, and it's all going straight away. On top of that, on the internet there are lots of galleries of interesting Jupyter notebooks that you can get up and running with. There's one actually called "A gallery of interesting Jupyter Notebooks", where there's a whole bunch on finance, among other things.

The two other things I'd really suggest are, first, to read as widely as possible: blogs, for example. We've got a lot of blogs and tutorials on our DataCamp community. Fast Forward Labs has a lot of great stuff on data science. ODSC, which is a conference, the Open Data Science Conference, has a lot of finance stuff as well. Second, wherever you are, go to meetups and hackathons and meet people, because in all honesty, doing this after work in your own bedroom can get hardcore. We've all been there. But seriously, coding with other people and pair programming helps the learning curve and makes it a lot more fun.

And the last piece of advice I'd give is to try to get your day job to give you time and/or money to do this. For example, if you want to subscribe to DataCamp (we've got a whole bunch of free stuff that you can check out first), tell your boss that this is learning that's going to help your job. Or try to get Friday afternoons from your work to invest in your future and make you more efficient at your job, because wherever you're working is going to win in the end, in all honesty, as well.

Yeah. That's a really cool tip, actually. I like it.

And obviously I recommend DataCamp as well, because that's one of the reasons I got you on. I'm a paying subscriber of DataCamp, and I'm always on there checking out new courses and material. It's certainly helped my learning curve in learning how to program. So that's why I thought it'd be really cool to get you on the podcast, because I've seen some of the courses which you've been a lecturer on.

Appreciate that. One thing I'd like to add: as you know, I host a podcast called DataFramed, which puts out weekly episodes for DataCamp, and I'd urge you to listen to that as well. The premise of DataFramed is to interview thought leaders and working data scientists from all over the place, whether it be finance or tech. The most recent one was health: it was data scientists from Doctors Without Borders. So I think this type of stuff will definitely give you insight into how people think about data science and what's really current and modern in the space at the moment.

Yeah. And I said this to you when we first made the connection, which I think was a few months ago: I thought it was really cool when you guys started a podcast, because prior to that I hadn't really come across any other really cool podcasts on programming. I'm not saying they don't exist, but I just hadn't found them. So I was kind of excited to see yours pop up.

Yeah, I appreciate that.

Mid-Episode Sponsors

Are you ready to get serious about trading? Then join Tasty Trade, Investopedia's best platform for options trading in 2026. Stocks, options, futures, and more. Tasty Trade has everything you trade, all in one platform. Get low commissions, including zero commission on stocks, so you can keep more of what you earn. Tasty Trade is packed with advanced charting tools, backtesting, a pre-built strategy selector, risk analysis tools, and more features to help you trade smarter.

See equities and derivatives with high trading volumes, dividends, upcoming earnings reports, and more with their pre-built watchlists. Or create a custom watchlist to keep an eye on the companies and sectors that matter to you. Manage your positions with speed and precision using Active Trader Mode, one-click trading, and smart order tracking. Plus, Tasty Trade's stellar trade desk team offers live support during trading hours if you need it. Visit Tastytrade.com/chat for more info. Tasty Trade Inc. is a registered broker dealer and member of FINRA, NFA, and SIPC.

Have you ever watched a stock explode and thought, if only I had the capital? Or sat on the sidelines because your account balance felt too small to matter? Good news: with Trade the Pool's limited-risk platform, you don't need millions or even thousands to start trading the U.S. stock market. Bypass the PDT and tap into over twelve thousand U.S.-listed equities, from penny stocks to big caps, ETFs, even the newest IPOs, and short anything you like, with zero locate or hard-to-borrow fees. Start your evaluation and get funded with up to $200,000 in buying power so you can go big without risking your own savings. And now you can also have unlimited time to reach the profit target. It's a game changer. Not ready to trade yet? Trade the Pool offers a free demo and educational resources. Practice on live data, master the platform, and build confidence risk-free before you even pay a cent. Click the link in the show notes to start trading with Trade the Pool's capital.

Project-Based Learning for Relevance

Now obviously there's a lot that you can explore. There are plenty of subjects and areas that you can go into, and I guess it's very easy to get led astray in some ways when you're new to programming, because you don't really know what's relevant. You don't really know what you should be learning. As a trader who's learning to code, you're mostly going to want to learn for the purpose of exploring data, getting some statistics and probabilities, visualizing data, and backtesting certain strategies, trade ideas, and market behavior. How do you determine what's relevant? Is this something I should be learning early on, while learning to program, or is it more of an advanced topic that would be nice to know a little further down the track?

Yeah, that's such a good question, Aaron. Because on top of all the good reasons you've just said, if you go and look at the number of Python packages out there for time series and data analysis and this type of stuff, it is so overwhelming that you can just freak out and probably run away forever. So I definitely think this type of conversation is super valuable.

The way I generally approach this is: take a bunch of courses, sure, and get introduced to the main players in the space of Python packages. For those who haven't heard the term before, a package is a set of functions and utilities that someone has developed that will help you to do stuff.

And of course you'll come across the main ones like pandas, which is seminal now, and absolutely, undeniably necessary for all data science and all financial data analysis in Python. It's really a way of structuring data in order to facilitate analysis. The basic unit in pandas is a DataFrame, which looks like a matrix of numbers or whatever, with column names, essentially.
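For readers who have not seen one, here is a minimal sketch of a pandas DataFrame as described above: a table of values with named columns. The price bars are invented for illustration and do not refer to any real instrument.

```python
import pandas as pd

# Made-up daily bars; the values are illustrative only.
bars = pd.DataFrame(
    {
        "open": [100.0, 101.5, 103.0],
        "close": [101.5, 103.0, 102.0],
        "volume": [1200, 900, 1500],
    },
    index=pd.to_datetime(["2018-09-24", "2018-09-25", "2018-09-26"]),
)

print(bars.shape)            # (number of rows, number of columns)
print(list(bars.columns))    # the column names
print(bars["close"].mean())  # a column is selected by its name

# A new column computed from existing ones, row by row.
bars["range"] = bars["close"] - bars["open"]
print(bars)
```

The rows here are indexed by date, which is the usual shape for market data, and most analysis in pandas is expressed as operations on whole columns like the last step.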

But that having been said, besides taking courses and becoming familiar with the main players in the space, I always advise learners to do a project on something they're interested in. And of course, in this case, if you've got a trading project that you want to do, definitely jump in. And what will happen then is you'll learn all the tools you need for that particular project, right? And you'll be drawing on a lot of stuff that will be relevant in future for you, whether it's pandas or some nice visualization libraries, or, if you start doing statistical modeling, packages around that or time series analysis, or even machine learning, right? You'll meet scikit-learn among other packages then. So if you do a project from go to whoa,

I'm not saying it's gonna be easy. It'll actually probably at some point be super frustrating. But if you're invested in the goal, right, you'll learn what you need to. The last thing you want is to do a project that you're kind of half-assed about, right? And you get to a roadblock and you're like, "Oh, I don't care about it so much." So pick something that you're super passionate about and get to the end. And by the end, I don't only mean getting out a result or something actionable, I also mean talking about it with mates, or giving a talk about it at a meetup, or writing a blog post about it. I think these types of things definitely get you far more acquainted with everything you've used.

And on top of that, it doesn't only have to be a trading project. I mean, I encourage people to get out of their professional zone when doing this. So I met a student who is all about their fitness tracker. It wasn't a Fitbit, but it was something like that. And I was like, dude, just get all your fitness tracker data and, you know, analyze that and see if you can predict what's gonna happen in the next three months and that type of stuff. And they learned a whole bunch of cool tools around that. If you're a foodie, you can scrape Yelp reviews, right? And do some analysis there.

I met a law student who wanted to get into data science, and they did a project of... I'm gonna bugger this up, but it was something like looking at Supreme Court judgments that were handed down and trying to cluster them according to whether the judge was, like, Republican or Democrat or in the middle or whatever it was, right? So find something you absolutely love and get in there and get into it.

Coding: Frustrations, Solutions, Timelines

You said something really key in your answer there: that there are going to be points, when you try to step out on your own and do an exercise, where it's going to be very frustrating and you're going to get stuck. What's the best thing to do when you do get stuck? Because, as you said earlier, you're just at home and you're trying to work this out and it's not something you've done before. I've had it before where I've gotten stuck and I've literally been stuck on it for two days. I just can't work out why this won't work and it just keeps spitting out the same error. And often some of the errors that you get in Python, especially for new programmers, are quite confusing and not always very clear. What's the best thing to do if you're in that situation, where you're trying something, you get stuck, and you don't really know where to go?

Yeah, so firstly, I agree completely. The amount of frustration you can experience with really idiosyncratic, esoteric error messages is wild. And I do think open source package developers are kind of getting onto this, and they realize that if they want more users, they need to provide more helpful feedback messages. One example is Keras, which is a deep learning library developed by François Chollet, who's at Google, and it provides some error messages that are along the lines of, "Hey, you used this function, but why don't you try this instead?" Right? As opposed to "error, syntax error, backtrace," whatever it is, right?

So if you're getting wacky error messages that you don't understand, my number one answer to that question is: Google it. Google knows, or your search engine of choice. I don't know if many people use Ask Jeeves anymore, but I would Google it.

And a lot of the time you'll end up on a website called Stack Overflow, which is essentially a Q&A forum for answering these types of questions. So if you copy and paste your error message into Google, a lot of the time you will find something. You might end up in a Stack Overflow question that links to an issue on the repository (a repository is where you store code for the particular package), and then you need to go through a few other things. But definitely search engines will be your best friend, I think.

Yeah, agreed. I know this is gonna depend kind of person to person. It's, again, a very personal thing.

But I know there will be some people listening to this who are kind of wondering, what is a reasonable timeframe for someone to learn the basics of programming, let's say in Python, for example?

... what you want to do. But if you want to get a sense of how to analyze data, of how to get a result, and of the basic packages you need in a space, and you work full time, let's say you've got a couple of nights a week, let's say five hours a week or something like that, I'd say in six months you could get up and running pretty well. Maybe less, but I would say six months is pretty good if you're efficient and scrappy and resourceful yourself. But just always remember that it's hard work as well. And it's, as we said, frustrating. It can be infuriating at points, in all honesty, particularly when you work full time and you've got a family and all of this stuff. You're taking a lot of time out of your full-time life in order to learn something new. But just remember that this has a pretty serious payoff as well. And I do think that,

you know, we all know this, right? But computation is becoming so much more important, and the ability to code and have these conversations will make you a far more valuable member of the workforce, for example. And I think there are really big wins to be made if you put in the time here.

Yeah. And on the flip side, when you do get some code working exactly how you wanted it to, it's also very rewarding. Like, it feels awesome.

Yeah, it's beautiful. Couldn't agree more.

Formulating Effective Data Questions

I'd like to ask you a couple of questions, Hugo, about what questions to ask your data. I think this is interesting because obviously you're not a trader. You don't come from a trading background; you're a data scientist. I'm interested to hear a little bit about your thoughts on how you perceive some of the challenges that traders may come up against.

So I guess one of the first questions on this topic would be: how do you think about what questions to ask your data? Because essentially that's what you're doing as a data scientist. You get a data set and then you're trying to use that to solve problems or to answer certain questions. You're trying to find answers within the data.

Yeah, that's a really interesting question, and there are actually several ways to approach this. A very common way is actually to flip that entire process: a lot of the time you ask the question first, before you even have the data. One great example is, I think, the modern age of data science emerged in tech, right? So LinkedIn is a really good example, where one of their first MVPs of data science was trying to figure out how to get more people active on LinkedIn and how to get more connections made. And they realized, for example, that if you and I are connected on LinkedIn and you're connected with someone else on LinkedIn, then it would make sense for them to suggest that I connect with that other person on LinkedIn, right? And so then they decided that the data they needed to work with was a network graph of people, a social network or professional network, with nodes as people and edges as professional connections between people. So all this is to say that their question existed before any data, and then they figured out what it was that they needed to look at.

So essentially, in this framework, the data science question emerges from a real-world business question. Then you kind of transition down to this level of data science and data, and then, to get the business solution, you transition back up to this world of business questions. Another prevalent example is fraud detection for credit cards, where you can ask a question straight away, like: how do we detect whether

a particular credit card transaction is fraudulent, right? And that doesn't involve any data at all. And then you're like, oh, okay, damn it, we need to look into the data and figure out some patterns and do some prediction. And wait, that's machine learning. That's probably supervised learning: fraud or not, a binary classification problem. And suddenly you're speaking the language of data and data science.
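As a rough illustration of what "supervised learning, fraud or not" means in code (this is not from the episode), the sketch below learns a single cut-off on transaction amount from labelled examples. The data is invented, the one-feature threshold rule is a deliberately toy stand-in for a real model, and the function names are hypothetical.

```python
# Toy labelled history: (transaction amount in dollars, is_fraud).
# All values are invented; a real system would use many more features.
history = [
    (12.0, False), (25.0, False), (40.0, False), (60.0, False),
    (75.0, False), (900.0, True), (1200.0, True), (50.0, False),
    (1500.0, True), (30.0, False),
]


def accuracy(threshold, rows):
    """Share of rows where 'amount > threshold' matches the fraud label."""
    hits = sum((amount > threshold) == is_fraud for amount, is_fraud in rows)
    return hits / len(rows)


def fit_threshold(rows):
    """Supervised 'training': pick the cut-off that best fits the labels."""
    candidates = sorted(amount for amount, _ in rows)
    return max(candidates, key=lambda t: accuracy(t, rows))


threshold = fit_threshold(history)


def predict(amount):
    """Binary classification: True means 'flag as possibly fraudulent'."""
    return amount > threshold


print(threshold, predict(2000.0), predict(20.0))
```

The labels are what make this supervised: the cut-off is chosen because it agrees with known outcomes, and the prediction is binary because there are only two possible answers.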

And then you can answer it in that realm, and then you transition back up to this business question of, you know, is this fraudulent or not? And the reason I like to frame at least part of data science this way is because it actually speaks to this huge challenge of getting decisions made in organizations, right? Because people hire data scientists to turn data into insight.

But in the end, if you're not making business decisions around it, how do you see the value of it? And who makes business decisions? It's rarely a data scientist. It's usually someone they report to, or a team of people they need to convince, whether they're C-level or management or whatever it may be, right? So in the end, data scientists need to get non-technical people on board.

So the point there is that the question can exist before data's even involved. Having said that, of course, there is a lot of data already existing within tech companies, for example. So figuring out which questions to ask of it definitely means, you know, exploring it. So an example that we're having at DataCamp at the moment, and this is something I'm actively thinking about, is that we can look at learner activity, right? So we can think about how often someone logs onto DataCamp, how many exercises they do, and the duration of each visit to DataCamp, right? And suddenly, if you start looking at this data, you start to see clusters emerge. You see the students who come on once a week for a long time, and then you see the learners who come on every day

for half an hour, right? And so I didn't ask a question there at all to start, but suddenly I'm seeing some sort of segmentation emerge, and you can then ask questions like, "Oh, would different parts of our product and our platform be more beneficial to some students than others?", for example. And I think these two approaches are definitely complementary. And a lot of the most thoughtful data scientists, I think, work in both, across many organizations, if that makes sense.

Yeah, no, it does. And I imagine this is probably quite an important step, that you actually do think about what kind of questions you want answered from your data instead of just sitting down and then getting lost in the data.

Yeah, absolutely. And as I said before, companies are writing serious paychecks to data scientists, right? So they need to see the value. They need to have their business questions answered. And they need it early on as well. I mean, to get a company to invest in data science and analytics as a defining aspect of their business, they need to see value demonstrated early on, right?

A Data Scientist's Quant Strategy

Hugo, I want to ask you this question. It might be a little bit of a curveball, but I think it'll be a little bit of fun, so just answer it the best you can. Given the fact that you're not someone who is a trader, let's say you were thrown some stock data. Let's just say open, high, low, close data for the stock price. And you were challenged to develop a strategy or to gather some insight from that. Preferably a strategy, so where you would buy, where you would sell, etc. How would you go about that?

Right. So I like this because I actually would have no idea how to do this, but the way I'd probably think about it is I'd get the data. I'd load it into Python in a Jupyter notebook

using the pandas package, which, as we discussed, I think was literally developed for these types of financial questions by Wes McKinney. And then I'd start to visualize it. I mean, I'd literally start eyeballing the time series of what's happening here, just to get a sense of what it looks like. And I think, you know, I don't know a lot about trading and finance in this sense, but when thinking about time series, I suppose the three things that I usually think about are trend, to see if there's any general upward trend or downward trend; seasonality, so looking at those types of fluctuations; and then whatever random noise is associated.
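The trend, seasonality and noise split can be sketched in a few lines of pandas. This is an illustration rather than anything Hugo wrote: the price series is synthetic, the five-day weekly pattern is an assumption chosen so the example has something to find, and a centred moving average is only one of several ways to estimate a trend.

```python
import random

import pandas as pd

random.seed(0)  # make the synthetic noise repeatable

# Synthetic closing prices on business days: an upward drift, a made-up
# weekly pattern, and bounded random noise. None of this is real market data.
dates = pd.date_range("2018-01-01", periods=30, freq="B")
weekly_pattern = [1.0, 0.5, 0.0, -0.5, -1.0]  # Monday..Friday, sums to zero
close = pd.Series(
    [
        100 + 0.5 * t + weekly_pattern[t % 5] + random.uniform(-0.2, 0.2)
        for t in range(30)
    ],
    index=dates,
    name="close",
)

# Trend: a centred 5-day moving average smooths out the weekly cycle.
trend = close.rolling(window=5, center=True).mean()

# Seasonality: the average detrended value for each day of the week.
detrended = close - trend
seasonal = detrended.groupby(detrended.index.dayofweek).transform("mean")

# Noise: whatever is left once trend and seasonality are removed.
noise = detrended - seasonal

summary = pd.DataFrame(
    {"close": close, "trend": trend, "seasonal": seasonal, "noise": noise}
)
print(summary.round(2))
```

The first and last two rows of the trend are empty because a centred five-day window needs two days on either side. With real prices, any "seasonal" pattern found this way should be treated with suspicion until it holds up on data the estimate has not seen.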

And then I think after doing that, I'd probably try to develop some sort of basic model to predict what's gonna happen in the future, to see when to buy and sell. And in time series forecasting, which is essentially what we're talking about, I think, the two orthogonal approaches to that are, first, some sort of ARIMA model, like autoregressive moving average stuff, to predict what's going to happen in the future with these time series. And once again, I don't know whether this is used a lot in trading. The other approach, which is becoming a lot more prevalent, is using machine learning approaches. So getting a whole bunch of machine learning models, having your in-sample training and your out-of-sample testing, and seeing which models do the best there with their prediction. And then basing my decisions around what I predict the movements will look like in the future.

And just a fun fact: you mentioned Wes McKinney, who created the pandas package.

He actually works at Two Sigma. Some listeners will know that's a major quant fund. So yeah, I think that's kinda cool, actually.

Absolutely. And in fact, I think he may have left Two Sigma very recently, but he still collaborates with them. He started his own company called Ursa Labs to develop open source software. But Two Sigma also works with and pays a number of other developers of pandas and the open source community. So Two Sigma, in open source land, they do a lot of good work to support open source. And this is one of the other things, you know, we haven't really talked a lot about open source stuff, but if you've got organizations like Two Sigma realizing the power of open source software development, I mean, the writing's on the wall, if you ask me.

Yeah. Two Sigma also did... what's that website where they run a lot of machine learning competitions? Kaggle?

Yeah.

Yeah, Two Sigma, I don't know, it was a little while ago, ran a competition on Kaggle as well.

That's right. And in fact, another fun fact: Two Sigma, in the West Village in downtown Manhattan, hosted our launch party for the DataFramed podcast in January this year, with the PyData NYC meetup as well.

Oh, very cool. What are some of the common problems that data scientists encounter? I know that's a bit of a vague question. I guess one of the things I have in mind around that is actually the data itself: being able to clean data, access data, et cetera. Are there any efficient data cleaning methods which you can suggest, or even, like I said, broader common problems that data scientists encounter?

Data Cleaning: Challenges and Solutions

Oh dude, this is an absolute nightmare. This is, you know, a minefield of pain and suffering. I mean, there's the old joke that eighty percent of data analysis is cleaning and importing and that type of stuff. I actually saw a while ago, and we should include this in the show notes, there's a Twitter account called Big Data Borat. It's like Borat's take on stuff. And I'm not gonna try to do the accent, but Big Data Borat says, you know, eighty percent of time is cleaning data, twenty percent of time is complaining about eighty percent of time being cleaning data.

So no, you're absolutely right. And even, you know, cleaning missing values, the fact that you'll have data corrupted, the fact that sometimes you'll get a data source in which, for some reason, some of the values have, like, three more zeros on the end. It'll be the millions instead of the hundred thousands. And I think there aren't necessarily a lot of good automated ways to deal with this.
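The two problems mentioned here, missing values and values with extra zeros on the end, can at least be flagged with a short hand-written check. The sketch below is an illustration only: the volume figures are invented, and the "100 times the median" cut-off is an arbitrary assumption rather than a standard rule.

```python
import pandas as pd

# Invented example: one missing volume and one value with three extra zeros.
volume = pd.Series(
    [120_000, 135_000, None, 128_000, 131_000_000, 126_000],
    name="volume",
)

# Flag gaps rather than silently filling them.
is_missing = volume.isna()

# Flag values far above the typical level. The 100x multiple is an
# arbitrary choice for this sketch and would need tuning on real data.
typical = volume.median()
is_suspect = volume > 100 * typical

print(volume[is_missing | is_suspect])
```

Flagging rather than fixing is a deliberate choice in this sketch: whether a suspicious row should be dropped, corrected or kept usually depends on knowing where the data came from.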

Yet there are, you know, some things in development, but essentially a lot of data scientists are siloed in different companies and have reinvented the wheel. I mean, everyone writes their own scripts to deal with this type of stuff. Pandas, for example, and a bunch of packages in both Python land and the R ecosystem deal with this, but you need to kind of hand-code it yourself, essentially. The other thing, though, that I find super cool is something we get from engineering, which is testing principles. So you've got software testing, right? You can write tests to make sure your software, or the code that you're writing, does what you think it does. Now you can also do this with data. So if you load a data set into your Python environment, you can write a set of tests to make sure that the values of your price column are actually numeric, right?
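A hand-coded data test of the kind described might look like the following. This is a sketch, not a method from the episode: `check_prices` is a hypothetical helper, the column name `price` and the sample values are invented, and it assumes pandas is installed.

```python
import pandas as pd


def check_prices(frame):
    """Return a list of problems found in the 'price' column (empty if none)."""
    problems = []
    # Entries that cannot be read as numbers become NaN under errors="coerce".
    as_numbers = pd.to_numeric(frame["price"], errors="coerce")
    bad_rows = as_numbers[as_numbers.isna()].index.tolist()
    if bad_rows:
        problems.append(f"non-numeric or missing price in rows {bad_rows}")
    if (as_numbers <= 0).any():
        problems.append("price is zero or negative")
    return problems


# Invented data: the third price in the second frame was entered as text.
clean = pd.DataFrame({"price": [10.5, 10.7, 10.6]})
dirty = pd.DataFrame({"price": [10.5, 10.7, "n/a"]})

print(check_prices(clean))
print(check_prices(dirty))
```

Because the check is an ordinary function, it can be reused across projects and run every time a new file is loaded.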

And you can flag it if not. So having these types of automated tests, which once again you would hand-code yourself, but which you can use across a variety of projects, will allow you to do this type of stuff. We are seeing increasing automation around this stuff, though, and I do think that in the next... this is pretty broad, pretty vague... in the next two to five to ten years, what data scientists do on a daily basis will be a fundamentally different thing. I don't necessarily know the time scale, but definitely in the next ten years, a lot of the cleaning, the basics of writing statistical models, selecting features and tuning hyperparameters, which happens in machine learning a lot, all of this stuff will be automated, and it'll free up a lot of our time for far more creative, interesting work.

How about on the subject of

Predictive Models: Pitfalls and Ethics

you know, creating predictive models, what are some of the pitfalls that data scientists or traders who are learning to code could fall into? Is there anything you need to be mindful of when you're working with data and you're trying to create predictive models? I'm sure that if you're not careful you can get results that are actually gonna trick you and have very little value when you try to implement them in the real world.

Yeah, yeah, there are, and I think the two big ones are, first, if you have a small amount of data, you'll be prone to overfitting. What that means is, with every data set, if you're trying to model it, you'll be trying to model some sort of signal, and there'll be noise, there'll be fluctuations around it. And overfitting means that you're really modeling the noise in your data. So if you try it on new data, you won't just be getting the signal, you'll be getting the noise as well, so it won't generalize well to new data sets. So when you've got a small amount of data, you may be overfitting quite a bit. If you've got a large amount of data, this won't be a problem though, because essentially the task there is to see how well it performs by building the model, otherwise known as training, on one subset of your data set and then testing it to see how well it performs on what's called an out-of-sample or holdout test.
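To see the gap between in-sample and out-of-sample performance, here is a small sketch that is not from the episode. The data is synthetic (a straight line plus random noise), the `memoriser` is a deliberately overfit model invented for the illustration, and alternating points are held out only to keep the example short; with real time series the holdout would usually be the most recent period.

```python
import random

random.seed(1)  # repeatable synthetic data

# Synthetic data: a straight-line signal (y = 2x) plus random noise.
points = [(x, 2 * x + random.uniform(-3, 3)) for x in range(20)]
train = points[0::2]  # build ("train") the models on these points
test = points[1::2]   # hold these out for testing


def memoriser(x):
    """Overfit model: repeat the y of the nearest training point, noise and all."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def fit_line(rows):
    """Ordinary least-squares line, returned as a prediction function."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in rows)
        / sum((x - mean_x) ** 2 for x, _ in rows)
    )
    return lambda x: mean_y + slope * (x - mean_x)


def mean_squared_error(model, rows):
    """Average squared gap between the model's predictions and the actual values."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


line = fit_line(train)
for name, model in [("memoriser", memoriser), ("line", line)]:
    print(
        name,
        "in-sample:", round(mean_squared_error(model, train), 2),
        "out-of-sample:", round(mean_squared_error(model, test), 2),
    )
```

The memoriser scores perfectly on the data it was built from because it has stored the noise along with the signal, and that perfect score disappears on the held-out points. Judging a model only on its training data hides exactly this problem.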

And you've had people on the podcast before who've explained this far better than I have. But that's the basic idea there. So with a small sample size, overfitting can be an issue. I personally think one of the biggest issues in the machine learning and deep learning space now is model interpretability.

What I mean by that is, we discussed before this idea that as data scientists we need to communicate our findings to non-technical stakeholders, whether they be managers, decision makers, or citizens, if we're influencing, you know, legislation or parole hearings, right? Like, literally, there are black box models. I'll explain that term in a second and then give that example. But the point is that when you have a model that predicts something, there'll be people who wanna know why it predicts what it does. So if I'm telling my manager that we should make a business decision based around a model, he or she may wanna know why it makes that prediction and what the drivers are there.

So there's this idea of model interpretability as opposed to black box models. A black box model, and deep learning models are a famous example of these, is one where you just can't peer into it. You can't see why the prediction is what it is, right?

And so the famous example... there are a series of fantastic books. There's one in particular called Weapons of Math Destruction, which is not only a great title but a great book about black box models, among others, having deleterious effects on society. And I think one of the most telling examples is in the United States, in the justice system. There's a model that's used in parole hearings to predict the recidivism rate, so how likely the person on parole is to reoffend, right? And this is a black box model, and the judge uses the output of this model to influence their decision on whether to grant parole or not, right? So this is an obvious case in which we would want to be able to peek into this model to see the inner workings, to understand what's happening there.

So I think this is one of the biggest challenges moving forward, which in all honesty speaks to a far greater challenge of developing a system of ethics around the practice of data science as well.

Community Engagement and Final Tips

Yeah, it's funny you mentioned that title, actually, because I have read that book. I think that's about it. I don't know, is there anything else that you wanted to discuss?

Not really. I mean, I just wanna encourage people. I wanna reiterate that I understand, you know, we all work more than we should, we've all got so many obligations in the world. So take your time, have some fun, do stuff you enjoy with coding. If you've got kids and you want to learn to code with your kids, that's also awesome. The other thing about learning to program that I haven't mentioned yet is that Twitter is an incredible place to ask questions and to receive answers. So for R in particular, there's the hashtag #rstats. And it's the letter R; I know, because I'm Australian, it sounds like what the doctor makes me say. But it's hashtag rstats. A lot of people will jump on board and answer your questions there. And feel free to ping me on Twitter. It's @hugobowne, B-O-W-N-E, E for Edward. And, you know,

be part of the community as well, because it's getting more and more welcoming. I know there are some gatekeepers out there who'll shake their fists at you, but in the end we're all trying to help each other.

Yeah, too right. And I will throw it out there as well: I'm an affiliate for DataCamp. So if you do want to sign up and check out the courses, as Hugo's already mentioned, some of the courses are free. There are also some paid courses if you want to take a deeper dive. So, chatwithtraders.com/datacamp. I don't think it gives you any special discounts or whatever, but if you do sign up, please use that link, because it will help to support Chat With Traders. I'd appreciate it if you could use that link.

Now Hugo, you've already given out your Twitter handle. You also mentioned earlier that you host a podcast of your own for DataCamp, and that's called DataFramed. Where's the best place to listen to that?

Probably wherever they're listening to this podcast, right? It's on iTunes and Google Play and all of that. It'll be on all your apps. You can also check it out at datacamp.com/community/podcast. We host it on our website there. But yeah, DataFramed, and it's one word. And it'd be great if you checked it out and gave some feedback as well. We always welcome as much critical feedback as possible.

Cool. Okay. Well, let's leave it there. I'm glad we could tee this up, Hugo. I know it's been a couple of months in the making, so

I appreciate you making the time to make it happen. Oh look, this has been a load of fun, Aaron, and I really appreciate the invitation. Cool. Thank you very much.

Episode Outro

You've reached the end of this episode of Chat With Traders, but rest assured there are more episodes loaded soon. ... if you leave a rating. Chat With Traders.

This transcript was generated by Metacast using AI and may contain inaccuracies.