Data Skeptic - podcast cover

Data Skeptic

Kyle Polichdataskeptic.com
The Data Skeptic Podcast features interviews and discussion of topics related to data science, statistics, machine learning, artificial intelligence and the like, all from the perspective of applying critical thinking and the scientific method to evaluate the veracity of claims and efficacy of approaches.
Last refreshed:
Follow this podcast in the Metacast mobile app to refresh it and see new episodes.
Download Metacast podcast app
Podcasts are better in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Episodes

[MINI] The Bonferroni Correction

Today's episode begins by asking how many left handed employees we should expect to be at a company before anyone should claim left handedness discrimination. If not lefties, let's consider eye color, hair color, favorite ska band, most recent grocery store used, and any number of characteristics could be studied to look for deviations from the norm in a company. When multiple comparisons are to be made simultaneous, one must account for this, and a common method for doing so is with the Bonferr...

Jan 22, 201614 min

Detecting Pseudo-profound BS

A recent paper in the journal of Judgment and Decision Making titled On the reception and detection of pseudo-profound bullshit explores empirical questions around a reader's ability to detect statements which may sound profound but are actually a collection of buzzwords that fail to contain adequate meaning or truth. These statements are definitively different from lies and nonesense, as we discuss in the episode. This paper proposes the Bullshit Receptivity scale (BSR) and empirically demonstr...

Jan 15, 201638 min

[MINI] Gradient Descent

Today's mini episode discusses the widely known optimization algorithm gradient descent in the context of hiking in a foggy hillside.

Jan 08, 201615 min

Let's Kill the Word Cloud

This episode is a discussion of data visualization and a proposed New Year's resolution for Data Skeptic listeners. Let's kill the word cloud.

Jan 01, 201615 min

2015 Holiday Special

Today's episode is a reading of Isaac Asimov's The Machine that Won the War . I can't think of a story that's more appropriate for Data Skeptic.

Dec 25, 201514 min

Wikipedia Revision Scoring as a Service

In this interview with Aaron Halfaker of the Wikimedia Foundation, we discuss his research and career related to the study of Wikipedia. In his paper The Rise and Decline of an open Collaboration Community , he highlights a trend in the declining rate of active editors on Wikipedia which began in 2007. I asked Aaron about a variety of possible hypotheses for the phenomenon, in particular, how automated quality control tools that revert edits automatically could play a role. This lead Aaron and h...

Dec 18, 201543 min

The Hunt for Vulcan

Early astronomers could see several of the planets with the naked eye. The invention of the telescope allowed for further understanding of our solar system. The work of Isaac Newton allowed later scientists to accurately predict Neptune, which was later observationally confirmed exactly where predicted. It seemed only natural that a similar unknown body might explain anomalies in the orbit of Mercury, and thus began the search for the hypothesized planet Vulcan. Thomas Levenson 's book "The Hunt...

Dec 04, 201542 min

[MINI] The Accuracy Paradox

Today's episode discusses the accuracy paradox. There are cases when one might prefer a less accurate model because it yields more predictive power or better captures the underlying causal factors describing the outcome variable you are interested in. This is especially relevant in machine learning when trying to predict rare events. We discuss how the accuracy paradox might apply if you were trying to predict the likelihood a person was a bird owner.

Nov 27, 201517 min

Neuroscience from a Data Scientist's Perspective

... or should this have been called data science from a neuroscientist's perspective? Either way, I'm sure you'll enjoy this discussion with Laurie Skelly . Laurie earned a PhD in Integrative Neuroscience from the Department of Psychology at the University of Chicago. In her life as a social neuroscientist, using fMRI to study the neural processes behind empathy and psychopathy, she learned the ropes of zooming in and out between the macroscopic and the microscopic -- how millions of data points...

Nov 20, 201540 min

[MINI] Bias Variance Tradeoff

A discussion of the expected number of cars at a stoplight frames today's discussion of the bias variance tradeoff. The central ideal of this concept relates to model complexity. A very simple model will likely generalize well from training to testing data, but will have a very high variance since it's simplicity can prevent it from capturing the relationship between the covariates and the output. As a model grows more and more complex, it may capture more of the underlying data but the risk tha...

Nov 13, 201514 min

Big Data Doesn't Exist

The recent opinion piece Big Data Doesn't Exist on Tech Crunch by Slater Victoroff is an interesting discussion about the usefulness of data both big and small. Slater joins me this episode to discuss and expand on this discussion. Slater Victoroff is CEO of indico Data Solutions, a company whose services turn raw text and image data into human insight. He, and his co-founders, studied at Olin College of Engineering where indico was born. indico was then accepted into the "Techstars Accelarator ...

Nov 06, 201532 min

[MINI] Covariance and Correlation

The degree to which two variables change together can be calculated in the form of their covariance. This value can be normalized to the correlation coefficient, which has the advantage of transforming it to a unitless measure strictly bounded between -1 and 1. This episode discusses how we arrive at these values and why they are important.

Oct 30, 201514 min

Bayesian A/B Testing

Today's guest is Cameron Davidson-Pilon . Cameron has a masters degree in quantitative finance from the University of Waterloo. Think of it as statistics on stock markets. For the last two years he's been the team lead of data science at Shopify . He's the founder of dataoragami.net which produces screencasts teaching methods and techniques of applied data science. He's also the author of the just released in print book Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inferen...

Oct 23, 201530 min

[MINI] The Central Limit Theorem

The central limit theorem is an important statistical result which states that typically, the mean of a large enough set of independent trials is approximately normally distributed. This episode explores how this might be used to determine if an amazon parrot like Yoshi produces or or less waste than an African Grey, under the assumption that the individual distributions are not normal.

Oct 16, 201513 min

Accessible Technology

Today's guest is Chris Hofstader (@gonz_blinko) , an accessibility researcher and advocate, as well as an activist for causes such as improving access to information for blind and vision impaired people. His background in computer programming enabled him to be the leader of JAWS, a Windows program that allowed people with a visual impairment to read their screen either through text-to-speech or a refreshable braille display. He's the Managing Member of 3 Mouse Technology . He's also a frequent b...

Oct 09, 201539 min

[MINI] Multi-armed Bandit Problems

The multi-armed bandit problem is named with reference to slot machines (one armed bandits). Given the chance to play from a pool of slot machines, all with unknown payout frequencies, how can you maximize your reward? If you knew in advance which machine was best, you would play exclusively that machine. Any strategy less than this will, on average, earn less payout, and the difference can be called the "regret". You can try each slot machine to learn about it, which we refer to as exploration....

Oct 02, 201513 min

Shakespeare, Abiogenesis, and Exoplanets

Our episode this week begins with a correction. Back in episode 28 (Monkeys on Typewriters), Kyle made some bold claims about the probability that monkeys banging on typewriters might produce the entire works of Shakespeare by chance. The proof shown in the show notes turned out to be a bit dubious and Dave Spiegel joins us in this episode to set the record straight. In addition to that, our discussion explores a number of interesting topics in astronomy and astrophysics. This includes a paper D...

Sep 25, 201558 min

[MINI] Sample Sizes

There are several factors that are important to selecting an appropriate sample size and dealing with small samples. The most important questions are around representativeness - how well does your sample represent the total population and capture all it's variance? Linhda and Kyle talk through a few examples including elections, picking an Airbnb, produce selection, and home shopping as examples of cases in which the amount of observations one has are more or less important depending on how comp...

Sep 18, 201513 min

The Model Complexity Myth

There's an old adage which says you cannot fit a model which has more parameters than you have data. While this is often the case, it's not a universal truth. Today's guest Jake VanderPlas explains this topic in detail and provides some excellent examples of when it holds and doesn't. Some excellent visuals articulating the points can be found on Jake's blog Pythonic Perambulations, specifically on his post The Model Complexity Myth. We also touch on Jake's work as an astronomer, his noteworthy ...

Sep 11, 201530 min

[MINI] Distance Measures

There are many occasions in which one might want to know the distance or similarity between two things, for which the means of calculating that distance is not necessarily clear. The distance between two points in Euclidean space is generally straightforward, but what about the distance between the top of Mount Everest to the bottom of the ocean? What about the distance between two sentences? This mini-episode summarizes some of the considerations and a few of the means of calculating distance. ...

Sep 04, 201513 min

ContentMine

ContentMine is a project which provides the tools and workflow to convert scientific literature into machine readable and machine interpretable data in order to facilitate better and more effective access to the accumulated knowledge of human kind. The program's founder Peter Murray-Rust joins us this week to discuss ContentMine. Our discussion covers the project, the scientific publication process, copywrite, and several other interesting topics.

Aug 28, 201553 min

[MINI] Structured and Unstructured Data

Today's mini-episode explains the distinction between structured and unstructured data, and debates which of these categories best describe recipes.

Aug 21, 201513 min

Measuring the Influence of Fashion Designers

Yusan Lin shares her research on using data science to explore the fashion industry in this episode. She has applied techniques from data mining, natural language processing, and social network analysis to explore who are the innovators in the fashion world and how their influence effects other designers. If you found this episode interesting and would like to read more, Yusan's papers Text-Generated Fashion Influence Model: An Empirical Study on Style.com and The Hidden Influence Network in the...

Aug 14, 201525 min

[MINI] PageRank

PageRank is the algorithm most famous for being one of the original innovations that made Google stand out as a search engine. It was defined in the classic paper The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Larry Page. While this algorithm clearly impacted web searching, it has also been useful in a variety of other applications. This episode presents a high level description of this algorithm and how it might apply when trying to establish who writes the most ...

Aug 07, 20158 min

Data Science at Work in LA County

In this episode, Benjamin Uminsky enlightens us about some of the ways the Los Angeles County Registrar-Recorder/County Clerk leverages data science and analysis to help be more effective and efficient with the services and expectations they provide citizens. Our topics range from forecasting to predicting the likelihood that people will volunteer to be poll workers. Benjamin recently spoke at Big Data Day LA. Videos have not yet been posted, but you can see the slides from his talk Data Mining ...

Jul 29, 201541 min

[MINI] k-Nearest Neighbors

This episode explores the k-nearest neighbors algorithm which is an unsupervised, non-parametric method that can be used for both classification and regression. The basica concept is that it leverages some distance function on your dataset to find the $k$ closests other observations of the dataset and averaging them to impute an unknown value or unlabelled datapoint.

Jul 24, 20159 min

Crypto

How do people think rationally about small probability events? What is the optimal statistical process by which one can update their beliefs in light of new evidence? This episode of Data Skeptic explores questions like this as Kyle consults a cast of previous guests and experts to try and answer the question "What is the probability, however small, that Bigfoot is real?"

Jul 17, 20151 hr 25 min

[MINI] MapReduce

This mini-episode is a high level explanation of the basic idea behind MapReduce, which is a fundamental concept in big data. The origin of the idea comes from a Google paper titled MapReduce: Simplified Data Processing on Large Clusters . This episode makes an analogy to tabulating paper voting ballets as a means of helping to explain how and why MapReduce is an important concept.

Jul 10, 201513 min

Genetically Engineered Food and Trends in Herbicide Usage

The Credible Hulk joins me in this episode to discuss a recent blog post he wrote about glyphosate and the data about how it's introduction changed the historical usage trends of other herbicides. Links to all the sources and references can be found in the blog post. In this discussion, we also mention the food babe and Last Thursdayism which may be worth some further reading. Kyle also mentioned the list of ingredients or chemical composition of a banana . Credible Hulk mentioned the Mommy PhD ...

Jul 03, 201535 min
Hosted on Libsyn
For the best experience, listen in Metacast app for iOS or Android