This week’s episode is a special one, as we’re welcoming a guest: Joel Grus is a data scientist with a strong software engineering streak, and he does an impressive amount of speaking, writing, and podcasting as well. Whether you’re a new data scientist just getting started, or a seasoned hand looking to improve your skill set, there’s something for you in Joel’s repertoire.
Jun 10, 2019•40 min
What do you get when you cross a support vector machine with matrix factorization? You get a factorization machine, and a darn fine algorithm for recommendation engines.
Jun 03, 2019•20 min
We've already talked about neural nets in some detail (links below), and in particular we've been blown away by the way that image recognition from convolutional neural nets can be fed into recurrent neural nets that generate descriptions and captions of the images. Our episode today tells a similar tale, except today we're talking about a blog post where the author fed in wireframes of a website design and asked the neural net to generate the HTML and CSS that would actually build a website tha...
May 27, 2019•20 min
We often hear from folks wondering what advice we can give them as they search for their first job in data science. What does a hiring manager look for? Should someone focus on taking classes online, doing a bootcamp, reading books, something else? How can they stand out in a crowd? There’s no single answer, because so much depends on the person asking in the first place, but that doesn’t stop us from giving some perspective. So in this episode we’re sharing that advice out more widely, so hopef...
May 19, 2019•18 min
This week, we've got a fun paper by our friends at Google about the hidden costs of maintaining machine learning workflows. If you've worked in software before, you're probably familiar with the idea of technical debt: the inefficiencies that crop up in the code when you're trying to go fast. You take shortcuts, hard-code variable values, skimp on the documentation, and generally write not-that-great code in order to get something done quickly, and then end up paying for it later on. This ...
May 12, 2019•22 min
If you’re like most software engineers and, especially, data scientists, you find it really hard to make accurate estimates of how long a project will take to complete. Don’t feel bad: statistics is most likely actively working against your best efforts to give your boss an accurate delivery date. This week, we’ll talk through a great blog post that digs into the underlying probability and statistics assumptions that are probably driving your estimates, versus the ones that maybe should be drivi...
May 05, 2019•19 min
53.5 million light-years away, there’s a gigantic galaxy called M87 with something interesting going on inside it. Between Einstein’s theory of relativity and the motion of a group of stars in the galaxy (the motion is characteristic of there being a huge gravitational mass present), scientists have believed for years that there is a supermassive black hole at the center of that galaxy. However, black holes are really hard to see directly because they aren’t a light source like a star or a super...
Apr 29, 2019•20 min
As artificial intelligence algorithms get applied to more and more domains, a question that often arises is whether to somehow build structure into the algorithm itself to mimic the structure of the problem. There’s usually some amount of knowledge we already have of each domain, an understanding of how it usually works, but it’s not clear how (or even if) to lend this knowledge to an AI algorithm to help it get started. Sure, it may get the algorithm caught up to where we already were on solvin...
Apr 21, 2019•19 min
It’s not news that data scientists are expected to be capable in many different areas (writing software, designing experiments, analyzing data, talking to non-technical stakeholders). One thing that has been changing, though, as the field becomes a bit older and more mature, is our ideas about what data scientists should focus on to stay relevant. Should they specialize in a particular area (if so, which one)? Should they instead stay general and work across many different areas? In either case,...
Apr 15, 2019•14 min
If you work in data science, you’re well aware of the sheer volume of high-risk, high-reward projects that are hypothetically possible. The fact that they’re high-reward means they’re exciting to think about, and the payoff would be huge if they succeed, but the high-risk piece means that you have to be smart about what you choose to work on and be wary of investing all your resources in projects that fail entirely or starve other, higher-value projects. This episode focuses mainly on Google X, ...
Apr 08, 2019•19 min
When you are running an A/B test, one of the most important questions is how much data to collect. Collect too little, and you can end up drawing the wrong conclusion from your experiment. But in a world where experimenting is generally not free, and you want to move quickly once you know the answer, there is such a thing as collecting too much data. Statisticians have been solving this problem for decades, and their best practices are encompassed in the ideas of power, statistical significance, ...
Apr 01, 2019•23 min
OpenAI recently created a cutting-edge new natural language processing model, but unlike all their other projects so far, they have not released it to the public. Why? It seems to be a little too good. It can answer reading comprehension questions, summarize text, translate from one language to another, and generate realistic fake text. This last case, in particular, raised concerns inside OpenAI that the raw model could be dangerous if bad actors had access to it, so researchers will spend the ...
Mar 25, 2019•21 min
Imagine you have two choices of how to build something: top-down and controlled, with a few people playing a master designer role, or bottom-up and free-for-all, with nobody playing an explicit architect role. Which one do you think would make the better product? “The Cathedral and the Bazaar” is an essay exploring this question for open source software, and making an argument for the bottom-up approach. It’s not entirely intuitive that projects like Linux or scikit-learn, with many contributors...
Mar 17, 2019•33 min
It’s time for our latest installment in the series on artificial intelligence agents beating humans at games that we thought were safe from the robots. In this case, the game is StarCraft, and the AI agent is AlphaStar, from the same team that built the Go-playing AlphaGo AI last year. StarCraft presents some interesting challenges though: the gameplay is continuous, there are many different kinds of actions a player must take, and of course there are the usual complexities of playing strategy ga...
Mar 11, 2019•22 min
For many data scientists, maintaining models and workflows in production is both a huge part of their job and not something they necessarily trained for if their background is more in statistics or machine learning methodology. Productionizing and maintaining data science code has more in common with software engineering than traditional science, and to reflect that, there’s a new-ish role, and corresponding job title, that you should know about. It’s called machine learning engineer, and it’s w...
Mar 04, 2019•21 min
You’d be hard-pressed to find a field with bigger, richer, and more scientifically valuable data than particle physics. Years before “data scientist” was even a term, particle physicists were inventing technologies like the world wide web and cloud computing grids to help them distribute and analyze the datasets required to make particle physics discoveries. Somewhat counterintuitively, though, deep learning has only really debuted in particle physics in the last few years, although it’s making ...
Feb 25, 2019•36 min
K Nearest Neighbors is an algorithm with secrets. On one hand, the algorithm itself is as straightforward as possible: find the labeled points nearest the point that you need to predict, and make a prediction that’s the average of their answers. On the other hand, what does “nearest” mean when you’re dealing with complex data? How do you decide whether a man and a woman of the same age are “nearer” to each other than two women several years apart? What if you convert all your monetary columns fr...
Feb 17, 2019•16 min
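A minimal sketch of the idea in the K Nearest Neighbors episode above: predict by averaging the answers of the closest labeled points, and watch how the meaning of "nearest" shifts when a column is rescaled. The function name `knn_predict`, the Euclidean metric, the toy data, and the dollars-to-thousands rescaling are all assumptions made for illustration, not details from the episode.

```python
import math


def knn_predict(train_points, train_targets, query, k=3):
    """Average the targets of the k training points closest to `query`."""
    # Euclidean distance is an assumed choice; picking the metric is the hard part.
    scored = [
        (math.dist(point, query), target)
        for point, target in zip(train_points, train_targets)
    ]
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical rows: (age in years, income in dollars) -> some numeric outcome.
points_dollars = [(25, 50_000), (30, 52_000), (60, 51_000), (28, 90_000)]
targets = [1.0, 2.0, 10.0, 4.0]

# Same rows with income expressed in thousands of dollars.
points_thousands = [(age, income / 1000) for age, income in points_dollars]

# In dollars, income dominates the distance, so the 60-year-old counts as "near".
print(knn_predict(points_dollars, targets, query=(27, 51_000), k=2))

# In thousands, age matters again and the two young neighbors are chosen.
print(knn_predict(points_thousands, targets, query=(27, 51.0), k=2))
```

The algorithm is identical in both calls; only the units changed, which is why feature scaling and the choice of distance deserve as much attention as the value of k.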
Deep learning is a field that’s growing quickly. That’s good! There are lots of new deep learning papers put out every day. That’s good too… right? What if not every paper out there is particularly good? What even makes a paper good in the first place? It’s an interesting thing to think about, and debate, since there’s no clean-cut answer and there are worthwhile arguments both ways. Wherever you find yourself coming down in the debate, though, you’ll appreciate the good papers that much more. R...
Feb 11, 2019•18 min
Ordinary least squares (OLS) is often used synonymously with linear regression. If you’re a data scientist, machine learner, or statistician, you bump into it daily. If you haven’t had the opportunity to build up your understanding from the foundations, though, listen up: there are a number of assumptions underlying OLS that you should know and love. They’re interesting, force you to think about data and statistics, and help you know when you’re out of “good” OLS territory and into places where ...
Feb 03, 2019•25 min
Linear regression is a great tool if you want to make predictions about the mean value that an outcome will have given certain values for the inputs. But what if you want to predict the median? Or the 10th percentile? Or the 90th percentile? You need quantile regression, which is similar to ordinary least squares regression in some ways but with some really interesting twists that make it unique. This week, we’ll go over the concept of quantile regression, and also a bit about how it works and w...
Jan 28, 2019•22 min
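One way to see the twist behind the quantile regression episode above is through the loss function. Quantile regression is usually fit by minimizing the pinball (quantile) loss rather than squared error; the sketch below covers only the simplest case, a single constant prediction with no inputs, and the helper names and data are hypothetical.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss for a quantile level q between 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        error = y - p
        # Under-predictions are weighted by q, over-predictions by (1 - q).
        total += q * error if error >= 0 else (q - 1) * error
    return total / len(y_true)


def best_constant(y, q):
    """Return the observed value that minimizes pinball loss as a constant guess."""
    return min(y, key=lambda c: pinball_loss(y, [c] * len(y), q))


# Hypothetical data with one large outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

print(sum(data) / len(data))     # the mean, pulled upward by the outlier
print(best_constant(data, 0.5))  # minimizer at q = 0.5 (the median)
print(best_constant(data, 0.9))  # minimizer at q = 0.9
print(best_constant(data, 0.1))  # minimizer at q = 0.1
```

Swapping the constant for a linear function of the inputs, and minimizing the same loss, gives a regression line for the chosen quantile instead of for the mean.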
When data scientists use a linear regression to look for causal relationships between a treatment and an outcome, what they’re usually finding is the so-called average treatment effect. In other words, on average, here’s what the treatment does in terms of making a certain outcome more or less likely to happen. But there’s more to life than averages: sometimes the relationship works one way in some cases, and another way in other cases, such that the average isn’t giving you the whole story. In ...
Jan 20, 2019•17 min
When you build a model for natural language processing (NLP), such as a recurrent neural network, it helps a ton if you’re not starting from zero. In other words, if you can draw upon other datasets for building your understanding of word meanings, and then use your training dataset just for subject-specific refinements, you’ll get farther than just using your training dataset for everything. This idea of starting with some pre-trained resources has an analogue in computer vision, where initiali...
Jan 14, 2019•28 min
Facial recognition being used in everyday life seemed far-off not too long ago. Now it’s being deployed widely and advancing quickly, which means that our technical capabilities are starting to outpace (if they haven’t already) our consensus as a society about what is acceptable in facial recognition and what isn’t. The threats to privacy, fairness, and freedom are real, and Microsoft has become one of the first large companies using this technology to speak out in specifi...
Jan 07, 2019•43 min
Bringing you another old classic this week, as we gear up for 2019! See you next week with new content. Word2Vec is probably the go-to algorithm for vectorizing text data these days. Which makes sense, because it is wicked cool. Word2Vec has it all: neural networks, skip-grams and bag-of-words implementations, a multiclass classifier that gets swapped out for a binary classifier, made-up dummy words, and a model that isn't actually used to predict anything (usually). And all that's before we get...
Dec 31, 2018•18 min
We’re taking a break for the holidays, chilling with the dog and an eggnog (Katie) and the cat and some spiced cider (Ben). Here’s an episode from a while back for you to enjoy. See you again in 2019! You might sometimes find that it's hard to get started doing something, but once you're going, it gets easier. Turns out machine learning algorithms, and especially recommendation engines, feel the same way. The more they "know" about a user, like what movies they watch and how they rate them, the ...
Dec 23, 2018•16 min
Convex optimization is one of the keys to data science, both because some problems straight-up call for optimization solutions and because popular methods like the gradient descent solution to ordinary least squares are built on optimization techniques. But there are all kinds of subtleties, starting with convex and non-convex functions, why gradient descent is really solving an optimization problem, and what that means for your average data scientist or statistician.
Dec 17, 2018•20 min
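To make the convex optimization episode above concrete, here is a minimal sketch of gradient descent applied to ordinary least squares with one input. Mean squared error is a convex function of the slope and intercept, so stepping downhill heads toward the single global minimum. The function name, learning rate, step count, and data are assumptions chosen for illustration.

```python
def ols_gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y ~ slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current line on every data point.
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of mean squared error with respect to each parameter.
        grad_slope = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_intercept = 2.0 / n * sum(residuals)
        # Step against the gradient.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Hypothetical noiseless data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = ols_gradient_descent(xs, ys)
print(slope, intercept)
```

On a non-convex loss the same loop could settle in a local minimum that depends on the starting point, which is one of the subtleties the episode description alludes to.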
When you think about it, it’s pretty amazing that we can draw conclusions about huge populations, even the whole world, based on datasets that are comparatively very small (a few thousand, or a few hundred, or even sometimes a few dozen). That’s the power of statistics, though. This episode is kind of a two-for-one but we’re excited about it—first we’ll talk about the Normal or Gaussian distribution, which is maybe the most famous probability distribution function out there, and then turn to the...
Dec 09, 2018•27 min
Neural nets are a way you can model a system, sure, but if you take a step back, squint, and tilt your head, they can also be called… software? Not in the sense that they’re written in code, but in the sense that the neural net itself operates under the same set of general requirements as does software that a human would write. Namely, neural nets take inputs and create outputs from them according to a set of rules, but the thing about the inside of the neural net black box is that it’s written ...
Dec 02, 2018•17 min
Deep neural nets have a deserved reputation as the best-in-breed solution for computer vision problems. But there are many aspects of human vision that we take for granted but where neural nets struggle—this episode covers an eye-opening paper that summarizes some of the interesting weak spots of deep neural nets. Relevant links: https://arxiv.org/abs/1805.04025
Nov 18, 2018•27 min
At many places, data scientists don’t work solo anymore—it’s a team sport. But data science teams aren’t simply teams of data scientists working together. Instead, they’re usually cross-functional teams with engineers, managers, data scientists, and sometimes others all working together to build tools and products around data science. This episode talks about some of those roles on a typical data science team, what the responsibilities are for each role, and what skills and traits are most impor...
Nov 12, 2018•25 min