Linear Digressions - podcast cover

Linear Digressions

Ben Jaffe and Katie Malonelineardigressions.com
Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.

Episodes

Rock the ROC Curve

This week: everybody's favorite WWII-era classifier metric! But it's not just for winning wars, it's a fantastic go-to metric for all your classifier quality needs.

Jan 30, 201716 min

Ensemble Algorithms

If one machine learning model is good, are two models better? In a lot of cases, the answer is yes. If you build many ok models, and then bring them all together and use them in combination to make your final predictions, you've just created an ensemble model. It feels a little bit like cheating, like you just got something for nothing, but the results don't like: algorithms like Random Forests and Gradient Boosting Trees (two types of ensemble algorithms) are some of the strongest out-of-the-bo...

Jan 23, 201713 min

How to evaluate a translation: BLEU scores

As anyone who's encountered a badly translated text could tell you, not all translations are created equal. Some translations are smooth, fluent and sound like a poet wrote them; some are jerky, non-grammatical and awkward. When a machine is doing the translating, it's awfully easy to end up with a robotic-sounding text; as the state of the art in machine translation improves, though, a natural question to ask is: according to what measure? How do we quantify a "good" translation? Enter the BLEU...

Jan 16, 201717 min

Zero Shot Translation

Take Google-size data, the flexibility of a neural net, and all (well, most) of the languages of the world, and what you end up with is a pile of surprises. This episode is about some interesting features of Google's new neural machine translation system, namely that with minimal tweaking, it can accommodate many different languages in a single neural net, that it can do a half-decent job of translating between language pairs it's never been explicitly trained on, and that it seems to have its o...

Jan 09, 201726 min

Google Neural Machine Translation

Recently, Google swapped out the backend for Google Translate, moving from a statistical phrase-based method to a recurrent neural network. This marks a big change in methodology: the tried-and-true statistical translation methods that have been in use for decades are giving way to a neural net that, across the board, appears to be giving more fluent and natural-sounding translations. This episode recaps statistical phrase-based methods, digs into the RNN architecture a little bit, and recaps th...

Jan 02, 201718 min

Data and the Future of Medicine : Interview with Precision Medicine Initiative researcher Matt Might

Today we are delighted to bring you an interview with Matt Might, computer scientist and medical researcher extraordinaire and architect of President Obama's Precision Medicine Initiative. As the Obama Administration winds down, we're talking with Matt about the goals and accomplishments of precision medicine (and related projects like the Cancer Moonshot) and what he foresees as the future marriage of data and medicine. Many thanks to Matt, our friends over at Partially Derivative (hi, Jonathon...

Dec 26, 201635 min

Special Crossover Episode: Partially Derivative interview with White House Data Scientist DJ Patil

We have the pleasure of bringing you a very special crossover episode this week: our friends at Partially Derivative (another great podcast about data science, you should check it out) recently interviewed White House Chief Data Scientist DJ Patil. We think DJ's message about the importance and impact of data science is worth spreading, so it's our pleasure to bring it to you today. A huge thanks to Jonathon Morgan and Partially Derivative for sharing this interview with us--enjoy! Relevant link...

Dec 18, 201646 min

How to Lose at Kaggle

Competing in a machine learning competition on Kaggle is a kind of rite of passage for data scientists. Losing unexpectedly at the very end of the contest is also something that a lot of us have experienced. It's not just bad luck: a very specific combination of overfitting on popular competitions can take someone who is in the top few spots in the final days of a contest and bump them down hundreds of slots in the final tally.

Dec 12, 201617 min

Attacking Discrimination in Machine Learning

Imagine there's an important decision to be made about someone, like a bank deciding whether to extend a loan, or a school deciding to admit a student--unfortunately, we're all too aware that discrimination can sneak into these situations (even when everyone is acting with the best of intentions!). Now, these decisions are often made with the assistance of machine learning and statistical models, but unfortunately these algorithms pick up on the discrimination in the world (it sneaks in through ...

Dec 05, 201623 min

Recurrent Neural Nets

This week, we're doing a crash course in recurrent neural networks--what the structural pieces are that make a neural net recurrent, how that structure helps RNNs solve certain time series problems, and the importance of forgetfulness in RNNs. Relevant links: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Nov 28, 201613 min

Stealing a PIN with signal processing and machine learning

Want another reason to be paranoid when using the free coffee shop wifi? Allow us to introduce WindTalker, a system that cleverly combines a dose of signal processing with a dash of machine learning to (potentially) steal the PIN from your phone transactions without ever having physical access to your phone. This episode has it all, folks--channel state information, ICMP echo requests, low-pass filtering, PCA, dynamic time warps, and the PIN for your phone.

Nov 21, 201617 min

Neural Net Cryptography

Cryptography used to be the domain of information theorists and spies. There's a new player now: neural networks. Given the task of communicating securely, neural networks are inventing new encryption methods that, as best we can tell, are unlike anything humans have ever seen before. Relevant links: http://arstechnica.co.uk/information-technology/2016/10/google-ai-neural-network-cryptography/ https://arxiv.org/pdf/1610.06918v1.pdf

Nov 14, 201616 min

Deep Blue

In 1997, Deep Blue was the IBM algorithm/computer that did what no one, at the time, though possible: it beat the world's best chess player. It turns out, though, that one of the most important moves in the matchup, where Deep Blue psyched out its opponent with a weird move, might not have been so inspired after all. It might have been nothing more than a bug in the program, and it changed computer science history. Relevant links: https://www.wired.com/2012/09/deep-blue-computer-bug/

Nov 07, 201620 min

Organizing Google's Datasets

If you're a data scientist, there's a good chance you're used to working with a lot of data. But there's a lot of data, and then there's Google-scale amounts of data. Keeping all that data organized is a Google-sized task, and as it happens, they've built a system for that organizational challenge. This episode is all about that system, called Goods, and in particular we'll dig into some of the details of what makes this so tough. Relevant links: http://static.googleusercontent.com/media/researc...

Oct 31, 201615 min

Fighting Cancer with Data Science: Followup

A few months ago, Katie started on a project for the Vice President's Cancer Moonshot surrounding how data can be used to better fight cancer. The project is all wrapped up now, so we wanted to tell you about how that work went and what changes to cancer data policy were suggested to the Vice President. See lineardigressions.com for links to the reports discussed on this episode.

Oct 24, 201626 min

The 19-year-old determining the US election

Sick of the presidential election yet? We are too, but there's still almost a month to go, so let's just embrace it together. This week, we'll talk about one of the presidential polls, which has been kind of an outlier for quite a while. This week, the NY Times took a closer look at this poll, and was able to figure out the reason it's such an outlier. It all goes back to a 19-year-old African American man, living in Illinois, who really likes Donald Trump... Relevant Links: http://www.nytimes.c...

Oct 17, 201612 min

How to Steal a Model

What does it mean to steal a model? It means someone (the thief, presumably) can re-create the predictions of the model without having access to the algorithm itself, or the training data. Sound far-fetched? It isn't. If that person can ask for predictions from the model, and he (or she) asks just the right questions, the model can be reverse-engineered right out from under you. Relevant links: https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_tramer.pdf

Oct 09, 201614 min

Regularization

Lots of data is usually seen as a good thing. And it is a good thing--except when it's not. In a lot of fields, a problem arises when you have many, many features, especially if there's a somewhat smaller number of cases to learn from; supervised machine learning algorithms break, or learn spurious or un-interpretable patterns. What to do? Regularization can be one of your best friends here--it's a method that penalizes overly complex models, which keeps the dimensionality of your model under co...

Oct 03, 201617 min

The Cold Start Problem

You might sometimes find that it's hard to get started doing something, but once you're going, it gets easier. Turns out machine learning algorithms, and especially recommendation engines, feel the same way. The more they "know" about a user, like what movies they watch and how they rate them, the better they do at suggesting new movies, which is great until you realize that you have to start somewhere. The "cold start" problem will be our focus in this episode, both the heuristic solutions that...

Sep 26, 201616 min

Open Source Software for Data Science

If you work in tech, software or data science, there's an excellent chance you use tools that are built upon open source software. This is software that's built and distributed not for a profit, but because everyone benefits when we work together and share tools. Tim Head of scikit-optimize chats with us further about what it's like to maintain an open source library, how to get involved in open source, and why people like him need people like you to make it all work.

Sep 19, 201620 min

Scikit + Optimization = Scikit-Optimize

We're excited to welcome a guest, Tim Head, who is one of the maintainers of the scikit-optimize package. With all the talk about optimization lately, it felt appropriate to get in a few words with someone who's out there making it happen for python. Relevant links: https://scikit-optimize.github.io/ http://www.wildtreetech.com/

Sep 12, 201616 min

Two Cultures: Machine Learning and Statistics

It's a funny thing to realize, but data science modeling is usually about either explainability, interpretation and understanding, or it's about predictive accuracy. But usually not both--optimizing for one tends to compromise the other. Leo Breiman was one of the titans of both kinds of modeling, a statistician who helped bring machine learning into statistics and vice versa. In this episode, we unpack one of his seminal papers from 2001, when machine learning was just beginning to take root, a...

Sep 05, 201617 min

Optimization Solutions

You've got an optimization problem to solve, and a less-than-forever amount of time in which to solve it. What do? Use a heuristic optimization algorithm, like a hill climber or simulated annealing--we cover both in this episode! Relevant link: http://www.lizsander.com/programming/2015/08/04/Heuristic-Search-Algorithms.html

Aug 29, 201620 min

Optimization Problems

If modeling is about predicting the unknown, optimization tries to answer the question of what to do, what decision to make, to get the best results out of a given situation. Sometimes that's straightforward, but sometimes... not so much. What makes an optimization problem easy or hard, and what are some of the methods for finding optimal solutions to problems? Glad you asked! May we recommend our latest podcast episode to you?

Aug 22, 201618 min

Multi-level modeling for understanding DEADLY RADIOACTIVE GAS

Ok, this episode is only sort of about DEADLY RADIOACTIVE GAS. It's mostly about multilevel modeling, which is a way of building models with data that has distinct, related subgroups within it. What are multilevel models used for? Elections (we can't get enough of 'em these days), understanding the effect that a good teacher can have on their students, and DEADLY RADIOACTIVE GAS. Relevant links: http://www.stat.columbia.edu/~gelman/research/published/multi2.pdf

Aug 15, 201624 min

How Polls Got Brexit "Wrong"

Continuing the discussion of how polls do (and sometimes don't) tell us what to expect in upcoming elections--let's take a concrete example from the recent past, shall we? The Brexit referendum was, by and large, expected to shake out for "remain", but when the votes were counted, "leave" came out ahead. Everyone was shocked (SHOCKED!) but maybe the polls weren't as wrong as the pundits like to claim. Relevant links: http://www.slate.com/articles/news_and_politics/moneybox/2016/07/why_political_...

Aug 08, 201615 min

Election Forecasting

Not sure if you heard, but there's an election going on right now. Polls, surveys, and projections about, as far as the eye can see. How to make sense of it all? How are the projections made? Which are some good ones to follow? We'll be your trusty guides through a crash course in election forecasting. Relevant links: http://www.wired.com/2016/06/civis-election-polling-clinton-sanders-trump/ http://election.princeton.edu/ http://projects.fivethirtyeight.com/2016-election-forecast/ http://www.nyt...

Aug 01, 201629 min

Machine Learning for Genomics

Genomics data is some of the biggest #bigdata, and doing machine learning on it is unlocking new ways of thinking about evolution, genomic diseases like cancer, and what really makes each of us different for everyone else. This episode touches on some of the things that make machine learning on genomics data so challenging, and the algorithms designed to do it anyway.

Jul 25, 201620 min

Climate Modeling

Hot enough for you? Climate models suggest that it's only going to get warmer in the coming years. This episode unpacks those models, so you understand how they work. A lot of the episodes we do are about fun studies we hear about, like "if you're interested, this is kinda cool"--this episode is much more important than that. Understanding these models, and taking action on them where appropriate, will have huge implications in the years to come. Relevant links: https://climatesight.org/

Jul 18, 201620 min

Reinforcement Learning Gone Wrong

Last week’s episode on artificial intelligence gets a huge payoff this week—we’ll explore a wonderful couple of papers about all the ways that artificial intelligence can go wrong. Malevolent actors? You bet. Collateral damage? Of course. Reward hacking? Naturally! It’s fun to think about, and the discussion starting now will have reverberations for decades to come. https://www.technologyreview.com/s/601519/how-to-create-a-malevolent-artificial-intelligence/ http://arxiv.org/abs/1605.02817 https...

Jul 11, 201628 min