In this Data Skeptic episode, Kyle is joined by guest Ruggiero Cavallo to discuss his latest efforts to mitigate the problems presented in this new world of online advertising. Working with his collaborators, Ruggiero reconsiders the search ad allocation and pricing problems from the ground up and redesigns a search ad selling system. He discusses a mechanism that optimizes an entire page of ads globally based on efficiency-maximizing search allocation and a novel technical approach to computing...
Mar 17, 2017•42 min
Today's episode gives an overview of the perceptron algorithm. This rather simple approach is characterized by a few particular features. It updates its weights after seeing every example, rather than as a batch. It uses a step function as an activation function. It's only appropriate for linearly separable data, and it will converge to a solution if the data meets this criterion. Being a fairly simple algorithm, it can run very efficiently. Although we don't discuss it in this episode, multi-layer percept...
Mar 10, 2017•15 min
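For readers who want to see the perceptron described above in code, here is a minimal sketch in plain Python. It is not taken from the episode: the function names, the learning rate, and the logical AND toy dataset are all illustrative assumptions. It shows the two features the description highlights, a step activation and a weight update after every single example.

```python
def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Online perceptron: weights are adjusted after each example, not per batch."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            prediction = 1 if activation > 0 else 0  # step activation function
            error = y - prediction
            if error != 0:
                # Nudge the decision boundary toward the misclassified example.
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
                bias += learning_rate * error
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors means we have converged
            break
    return weights, bias


def predict(weights, bias, x):
    """Apply the learned linear boundary with the same step function."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Hypothetical toy data: logical AND, which is linearly separable.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
weights, bias = train_perceptron(samples, labels)
```

On data that is not linearly separable (XOR, for example), the loop would simply run out of epochs without reaching a mistake-free pass.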
DataRefuge is a public, collaborative, grassroots effort around the United States in which scientists, researchers, computer scientists, librarians and other volunteers are working to download, save, and re-upload government data. The DataRefuge Project, which is led by the UPenn Program in Environmental Humanities and the Penn Libraries group at University of Pennsylvania, aims to foster resilience in an era of anthropogenic global climate change and raise awareness of how social and political e...
Mar 03, 2017•25 min
If a CEO wants to know the state of their business, they ask their highest ranking executives. These executives, in turn, should know the state of the business through reports from their subordinates. This structure is roughly analogous to a process observed in deep learning, where each layer of the business reports up different types of observations, KPIs, and reports to be interpreted by the next layer of the business. In deep learning, this process can be thought of as automated feature engin...
Feb 24, 2017•16 min
In this episode, I speak with Raghu Ramakrishnan, CTO for Data at Microsoft. We discuss services, tools, and developments in the big data sphere as well as the underlying needs that drove these innovations.
Feb 17, 2017•31 min
In this episode, we talk about a high-level description of deep learning. Kyle presents a simple game (pictured below), which is more of a puzzle really, to try and give Linh Da the basic concept. Thanks to our sponsor for this week, the Data Science Association. Please check out their upcoming Dallas conference at dallasdatascience.eventbrite.com
Feb 10, 2017•14 min
Versioning isn't just for source code. Being able to track changes to data is critical for answering questions about data provenance, quality, and reproducibility. Daniel Whitenack joins me this week to talk about these concepts and share his work on Pachyderm. Pachyderm is an open source containerized data lake. During the show, Daniel mentioned the Gopher Data Science GitHub repo as a great resource for any data scientists interested in the Go language. Although we didn't mention it, Daniel al...
Feb 03, 2017•40 min
Logistic Regression is a popular classification algorithm. In this episode, we discuss how it can be used to determine if an audio clip represents one of two given speakers. It models the log-odds of an output variable (isLinhda) as a linear combination of available features, which are spectral bands in the discussion on this episode. Keep an eye on the dataskeptic.com blog this week as we post more details about this project. Thanks to our sponsor this week, the Data Science Association. Please check out thei...
Jan 27, 2017•21 min
Prior work has shown that people's response to competition is in part predicted by their gender. Understanding why and when this occurs is important in areas such as labor market outcomes. A well-structured study is challenging due to numerous confounding factors. Peter Backus and his colleagues have identified competitive chess as an ideal arena to study the topic. Find out why and what conclusions they reached. Our discussion centers around Gender, Competition and Performance: Evidence from Re...
Jan 20, 2017•34 min
Deep learning can be prone to overfitting a given problem. This is especially frustrating given how much time and computational resources are often required to converge. One technique for fighting overfitting is to use dropout. Dropout is the method of randomly selecting some neurons in one's network to set to zero during iterations of learning. The core idea is that each particular input in a given layer is not always available and therefore not a signal that can be relied on too heavily....
Jan 13, 2017•16 min
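The dropout idea from the episode above can be sketched in a few lines of plain Python. This is an illustration rather than anything from the show: the function name, the drop probability, and the sample activations are hypothetical, and the rescaling of surviving values (often called inverted dropout) is an added assumption that the description does not mention.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob (used during training only)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    # Assumption: survivors are scaled by 1 / keep_prob so the expected
    # magnitude of the layer output stays the same ("inverted dropout").
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
layer_output = [1.0, 2.0, 3.0, 4.0]  # hypothetical activations from one layer
dropped = dropout(layer_output, drop_prob=0.5, rng=rng)
```

Because a different random subset is zeroed on each training iteration, downstream neurons cannot lean too heavily on any single input.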
In this episode I speak with Clarence Wardell and Kelly Jin about their mutual service as part of the White House's Police Data Initiative and Data Driven Justice Initiative respectively. The Police Data Initiative was organized to use open data to increase transparency and community trust as well as to help police agencies use data for internal accountability. The PDI emerged from recommendations made by the Task Force on 21st Century Policing. The Data Driven Justice Initiative was organized ...
Jan 06, 2017•49 min
We close out 2016 with a discussion of a basic interview question which might get asked when applying for a data science job. Specifically, how a library might build a model to predict if a book will be returned late or not.
Dec 30, 2016•35 min
Today's episode is a reading of Isaac Asimov's Franchise. As mentioned on the show, this is just a work of fiction to be enjoyed and not in any way some obfuscated political statement. Enjoy, and happy holidays!
Dec 23, 2016•40 min
Classically, entropy is a measure of disorder in a system. From a statistical perspective, it is more useful to say it's a measure of the unpredictability of the system. In this episode we discuss how information reduces the entropy in deciding whether or not Yoshi the parrot will like a new chew toy. A few other everyday examples help us examine why entropy is a nice metric for constructing a decision tree.
Dec 16, 2016•17 min
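To make the entropy discussion above concrete, here is a small sketch that computes Shannon entropy and the information gain of a split. The toy observations about Yoshi's chew toys, and the idea of splitting on whether a toy has a bell, are invented purely for illustration and do not come from the episode.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: how unpredictable a list of outcomes is."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent_labels, groups):
    """Entropy removed by splitting parent_labels into the given groups."""
    total = len(parent_labels)
    remaining = sum(len(group) / total * entropy(group) for group in groups)
    return entropy(parent_labels) - remaining


# Hypothetical observations: did Yoshi like each toy?
all_toys = ["like", "like", "dislike", "dislike"]
# Hypothetical split on a single question: does the toy have a bell?
with_bell = ["like", "like"]
without_bell = ["dislike", "dislike"]
gain = information_gain(all_toys, [with_bell, without_bell])
```

A decision tree builder would compute this gain for every candidate question and ask the one that removes the most uncertainty first.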
Cloud services are now ubiquitous in data science and more broadly in technology as well. This week, I speak to Mark Souza, Tobias Ternström, and Corey Sanders about various aspects of data at scale. We discuss the embedding of R into SQL Server, SQL Server on Linux, open source, and a few other cloud topics.
Dec 09, 2016•42 min
Today's episode is all about Causal Impact, a technique for estimating the impact of a particular event on a time series. We talk to William Martin about his research into the impact releases have on apps, and we also chat with Karen Blakemore about a project she helped us build to explore the impact of a Saturday Night Live appearance on a musician's career. Martin's work culminated in a paper, Causal Impact for App Store Analysis. A shorter summary version can be found here. His company helping...
Dec 02, 2016•34 min
The bootstrap is a method of resampling a dataset, with replacement, to estimate the accuracy of a statistic and produce useful metrics on the result. The bootstrap is a useful statistical technique and is leveraged in Bagging (bootstrap aggregation) algorithms such as Random Forest. We discuss this technique related to polling and surveys.
Nov 25, 2016•11 min
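As a rough illustration of the bootstrap applied to polling, the sketch below resamples a set of poll responses with replacement and reads a percentile interval off the resampled means. The poll data, the function names, the number of resamples, and the choice of a percentile interval are all assumptions made for this example, not details from the episode.

```python
import random


def mean(values):
    return sum(values) / len(values)


def bootstrap_interval(data, statistic, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a statistic of the data."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    estimates = sorted(
        # Each resample draws len(data) items from data WITH replacement.
        statistic(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    low_index = int((alpha / 2) * n_resamples)
    high_index = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[low_index], estimates[high_index]


# Hypothetical poll: 1 means the respondent supports a measure, 0 means not.
poll = [1] * 52 + [0] * 48
low, high = bootstrap_interval(poll, mean)
```

The width of the interval gives a sense of how much the poll result could move if the same survey were run again on a similar sample.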
The Gini Coefficient (as it relates to decision trees) is one approach to choosing the decision that best splits your dataset when building a decision tree. To pick the right feature to split on, it considers the frequency of the values of that feature and how well the values correlate with specific outcomes that you are trying to predict.
Nov 18, 2016•16 min
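The split selection described above can be sketched as follows. This is an illustrative example only: the feature names, the tiny dataset, and the helper functions are hypothetical, and the measure computed here is the Gini impurity commonly used by decision tree implementations.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity: chance that two random draws from labels disagree."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def split_impurity(rows, labels, feature):
    """Weighted Gini impurity after grouping rows by one feature's values."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[feature]].append(label)
    total = len(labels)
    return sum(len(group) / total * gini(group) for group in groups.values())


def best_feature(rows, labels):
    """Pick the feature whose split leaves the least impurity."""
    return min(rows[0], key=lambda feature: split_impurity(rows, labels, feature))


# Hypothetical data: which feature better predicts the outcome?
rows = [
    {"weather": "sunny", "weekend": "yes"},
    {"weather": "sunny", "weekend": "no"},
    {"weather": "rainy", "weekend": "yes"},
    {"weather": "rainy", "weekend": "no"},
]
labels = ["play", "play", "stay", "stay"]
chosen = best_feature(rows, labels)
```

Here the weather feature separates the outcomes perfectly, so its split has zero impurity, while the weekend feature leaves each group evenly mixed.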
Financial analysis techniques for studying numeric, well-structured data are very mature. While using unstructured data in finance is not necessarily a new idea, the area is still very greenfield. On this episode, Delia Rusu shares her thoughts on the potential of unstructured data and discusses her work analyzing Wikipedia to help inform financial decisions. Delia's talk at PyData Berlin can be watched on YouTube (Estimating stock price correlations using Wikipedia). The slides can be found h...
Nov 11, 2016•34 min
AdaBoost is a canonical example of the class of AnyBoost algorithms that create ensembles of weak learners. We discuss how a complex problem like predicting restaurant failure (which is surely caused by different problems in different situations) might benefit from this technique.
Nov 04, 2016•11 min
Platform as a service is a growing trend in data science where services like fraud analysis and face detection can be provided via APIs. Such services turn the actual model into a black box to the consumer. But can the model be reverse engineered? Florian Tramèr shares his work in this episode showing that it can. The paper Stealing Machine Learning Models via Prediction APIs is definitely worth your time to read if you enjoy this episode. Related source code can be found in https://github.com/f...
Oct 28, 2016•37 min
For machine learning models created with the random forest algorithm, there is no obvious diagnostic to inform you which features are more important in the output of the model. Some straightforward but useful techniques exist revolving around removing a feature and measuring the decrease in accuracy or Gini values in the leaves. We broadly discuss these techniques in this episode.
Oct 21, 2016•13 min
As cities provide bike sharing services, they must also plan for how to redistribute bicycles as they inevitably build up at more popular destination stations. In this episode, Hui Xiong talks about the solution he and his colleagues developed to rebalance bike sharing systems.
Oct 14, 2016•30 min
Random forest is a popular ensemble learning algorithm which leverages bagging both for sampling and feature selection. In this episode we make an analogy to the process of running a bookstore.
Oct 07, 2016•13 min
Jo Hardin joins us this week to discuss the ASA's Election Prediction Contest. This is a competition aimed at forecasting the results of the upcoming US presidential election. More details are available in Jo's blog post found here. You can find some useful R code for getting started automatically gathering data from 538 via Jo's GitHub, and official contest details are available here. During the interview we also mention Daily Kos and 538....
Sep 30, 2016•22 min
The F1 score is a model diagnostic that combines precision and recall to provide a single evaluation metric for model comparison. In this episode we discuss how it applies to selecting an interior designer.
Sep 23, 2016•9 min
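For reference, the F1 score mentioned above is the harmonic mean of precision and recall. The sketch below computes it from two lists of labels; the function name, the convention of returning 0.0 when there are no true positives, and the example labels are assumptions made for illustration.

```python
def f1_score(actual, predicted, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    false_pos = sum(1 for a, p in pairs if a != positive and p == positive)
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)
    if true_pos == 0:
        return 0.0  # assumption: treat the undefined case as the worst score
    precision = true_pos / (true_pos + false_pos)  # how many flagged were right
    recall = true_pos / (true_pos + false_neg)  # how many positives were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 is the positive class.
actual = [1, 1, 1, 0, 0]
predicted = [1, 1, 0, 1, 0]
score = f1_score(actual, predicted)
```

Because it is a harmonic mean, the score stays low unless both precision and recall are reasonably high, which is what makes it handy for comparing models with a single number.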
Urban congestion affects every person living in a city of any reasonable size. Lewis Lehe joins us in this episode to share his work on downtown congestion pricing. We explore topics of how different pricing mechanisms affect congestion as well as how data visualization can inform choices. You can find examples of Lewis's work at setosa.io. His paper which we discussed during the interview is Distance-dependent congestion pricing for downtown zones. On this episode, we discuss State of Califor...
Sep 16, 2016•35 min
Heteroskedasticity is a term used to describe a relationship between two variables which has unequal variance over the range. For example, the variance in the length of a cat's tail almost certainly changes (grows) with age. On the other hand, the average amount of chewing gum a person consumes probably has a consistent variance over a wide range of human heights. We also discuss some issues with the visualization shown in the tweet embedded below.
Sep 09, 2016•9 min
Our guest today is Michael Cuthbert, an associate professor of music at MIT and principal investigator of the Music21 project, which we focus our discussion on today. Music21 is a Python library making analysis of music accessible and fun. It supports integration with popular formats such as MIDI, MusicXML, Lilypond, and others. It's also well integrated with The Elvis Project, enabling users to import large volumes of music for easy analysis. Music21 is a great platform for musicologists and m...
Sep 02, 2016•35 min
Paxos is a protocol for arriving at a consensus in a distributed computing system which accounts for unreliability of the nodes. We discuss how this might be used in the real world in the event of a massive disaster.
Aug 26, 2016•15 min