
Data Skeptic

Kyle Polich · dataskeptic.com
The Data Skeptic Podcast features interviews and discussion of topics related to data science, statistics, machine learning, artificial intelligence and the like, all from the perspective of applying critical thinking and the scientific method to evaluate the veracity of claims and efficacy of approaches.

Episodes

Preserving History at CyArk

Elizabeth Lee from CyArk joins us in this episode to share stories of the work done capturing important historical sites digitally. CyArk is a non-profit focused on using technology to preserve the world's important historic and cultural locations digitally. CyArk's founder Ben Kacyra, a pioneer in 3D capture technology, and his wife, founded CyArk after seeing the need to preserve important artifacts and locations digitally before they are lost to natural disasters, human destruction, or the pa...

Jun 05, 2015 · 23 min · Transcript available on Metacast

[MINI] A Critical Examination of a Study of Marriage by Political Affiliation

Linhda and Kyle review a New York Times article titled How Your Hometown Affects Your Chances of Marriage. This article explores research about what correlates with the likelihood of being married by age 26 by county. Kyle and Linhda discuss some of the fine points of this research and the process of identifying factors for consideration....

May 29, 2015 · 10 min · Transcript available on Metacast

Detecting Cheating in Chess

With the advent of algorithms capable of beating highly ranked chess players, the temptation to cheat has emerged as a potential threat to the integrity of this ancient and complex game. Yet, there are aspects of computer play that are measurably different from human play. Dr. Kenneth Regan has developed a methodology for looking at a long series of moves and measuring the likelihood that the moves may have been selected by an algorithm. The full transcript of this episode is well annotated and...

May 22, 2015 · 45 min · Transcript available on Metacast

[MINI] z-scores

This week's episode discusses z-scores, also known as standard scores. A z-score describes the distance (in standard deviations) that an observation is away from the mean of the population. A closely related topic is the 68-95-99.7 rule, which tells us that (approximately) 68% of a normally distributed population lies within one standard deviation of the mean, 95% within two, and 99.7% within three. Kyle and Linhda discuss z-scores in the context of human height. If you'd like to calculate your own z-score...

May 15, 2015 · 10 min · Transcript available on Metacast
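
For readers who want to try the z-score idea from this episode, here is a minimal sketch in Python. The height figures (a mean of 170 cm and a standard deviation of 10 cm) are made-up illustration values, not numbers from the episode, and the function names are hypothetical.

```python
from statistics import NormalDist


def z_score(observation, mean, std_dev):
    """Distance of an observation from the mean, in standard deviations."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (observation - mean) / std_dev


def fraction_within(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Assumed (illustrative) population: mean height 170 cm, std dev 10 cm.
print(z_score(185, mean=170, std_dev=10))

# The 68-95-99.7 rule, recovered from the normal distribution itself.
for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
```

The second function is only there to show where the 68-95-99.7 rule comes from; it applies to normally distributed data and says nothing about populations with other shapes.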

Using Data to Help Those in Crisis

This week Noelle Sio Saldana discusses her volunteer work at Crisis Text Line - a 24/7 service that connects anyone with crisis counselors. In the episode we discuss Noelle's career and how, as a participant in the Pivotal for Good program (a partnership with DataKind), she spent three months helping find insights in the messaging data collected by Crisis Text Line. These insights helped give visibility into a number of different aspects of Crisis Text Line's services. Listen to this episode to ...

May 08, 2015 · 35 min · Transcript available on Metacast

The Ghost in the MP3

Have you ever wondered what is lost when you compress a song into an MP3? This week's guest Ryan Maguire did more than that. He worked on software to isolate the sounds that are lost when you convert a lossless digital audio recording into a compressed MP3 file. To complete his project, Ryan worked primarily in Python using the pyo library as well as the Bregman Toolkit. Ryan mentioned humans having a range of hearing from 20 Hz to 20,000 Hz; if you'd like to hear those tones, check the...

May 01, 2015 · 35 min · Transcript available on Metacast

Data Fest 2015

This episode contains coverage of the 2015 Data Fest hosted at UCLA. Data Fest is an analysis competition that gives teams of students 48 hours to explore a new dataset and present novel findings. This year, data from Edmunds.com was provided, and students competed in three categories: best recommendation, best use of external data, and best visualization.

Apr 28, 2015 · 27 min · Transcript available on Metacast

[MINI] Cornbread and Overdispersion

For our 50th episode we indulge a bit by cooking Linhda's previously mentioned "healthy" cornbread. This leads to a discussion of the statistical topic of overdispersion, in which the variance of some distribution is larger than what one's underlying model will account for.

Apr 24, 2015 · 16 min · Transcript available on Metacast
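
One common way to make overdispersion concrete is to assume a Poisson model, under which the variance should roughly equal the mean, and then compare the two. The sketch below does that; the Poisson assumption, the function name, and the count data are illustrative choices, not details taken from the episode.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Ratio of sample variance to sample mean.

    Under an assumed Poisson model this ratio should be close to 1;
    values well above 1 suggest overdispersion.
    """
    average = mean(counts)
    if average == 0:
        raise ValueError("mean of counts is zero; the ratio is undefined")
    return variance(counts) / average


# Made-up count data: mostly small values with a few large bursts.
bursty_counts = [0, 0, 1, 9, 0, 12, 2, 0]
print(dispersion_index(bursty_counts))
```

A ratio far above 1 is a hint, not a formal test; with only a handful of observations the sample variance is itself very noisy.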

[MINI] Natural Language Processing

This episode overviews some of the fundamental concepts of natural language processing, including stemming, n-grams, part-of-speech tagging, and the bag-of-words approach.

Apr 17, 2015 · 13 min · Transcript available on Metacast
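
Two of the concepts named in this episode, n-grams and the bag-of-words approach, are simple enough to sketch with the Python standard library. The tokenizer here is a deliberately naive assumption (lowercase and split on whitespace); stemming and part-of-speech tagging need more machinery and are not shown.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption for this sketch): lowercase, split on whitespace."""
    return text.lower().split()


def ngrams(tokens, n):
    """All runs of n consecutive tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text):
    """Word counts that ignore word order entirely."""
    return Counter(tokenize(text))


sentence = "the cat sat on the mat"
print(ngrams(tokenize(sentence), 2))
print(bag_of_words(sentence))
```

The contrast is the point: the bigrams keep local word order, while the bag of words throws order away and keeps only frequencies.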

Computer-based Personality Judgments

Guest Youyou Wu discusses the work she and her collaborators did to measure the accuracy of computer-based personality judgments. Using Facebook "like" data, they found that machine learning approaches could be used to estimate users' self-assessments of the "big five" personality traits: openness, agreeableness, extraversion, conscientiousness, and neuroticism. Interestingly, the computer-based assessments outperformed some of the assessments of certain groups of human beings. Listen to the episo...

Apr 10, 2015 · 32 min · Transcript available on Metacast

[MINI] Markov Chains

This episode introduces the idea of a Markov Chain. A Markov Chain has a set of states describing a particular system, and a probability of moving from each state to every state it is connected to. Markov Chains are memoryless, meaning they don't rely on a long history of previous observations. The current state of a system depends only on the previous state and the results of a random outcome. Markov Chains are a useful method for describing non-deterministic systems. They are use...

Mar 20, 2015 · 11 min · Transcript available on Metacast
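
The memoryless property described in this episode can be sketched in a few lines of Python. The two weather states and their transition probabilities below are invented for illustration; the only thing the simulation ever looks at is the current state.

```python
import random

# Hypothetical transition table: TRANSITIONS[current][next] = probability.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def next_state(state, rng):
    """Pick the next state using only the current state (memorylessness)."""
    options = TRANSITIONS[state]
    return rng.choices(list(options), weights=list(options.values()), k=1)[0]


def simulate(start, steps, rng):
    """Return the sequence of states visited, including the start state."""
    states = [start]
    for _ in range(steps):
        states.append(next_state(states[-1], rng))
    return states


print(simulate("sunny", 10, random.Random(0)))
```

Passing in a seeded `random.Random` keeps a run repeatable, which is handy when comparing different transition tables.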

Oceanography and Data Science

Nicole Goebel joins us this week to share her experiences in oceanography studying phytoplankton and other aspects of the ocean and how data plays a role in that science. We also discuss Thinkful where Nicole and I are both mentors for the Introduction to Data Science course. Last but not least, check out Nicole's blog Data Science Girl and the videos Kyle mentioned on her YouTube channel featuring one on the diversity of phytoplankton and how that changes in time and space....

Mar 13, 2015 · 33 min · Transcript available on Metacast

NYC Speed Camera Analysis with Tim Schmeier

New York State approved the use of automated speed cameras within a specific range of schools. Tim Schmeier did an analysis of publicly available data related to these cameras as part of a project at the NYC Data Science Academy. Tim's work leverages several open data sets to ask the question: are the speed cameras succeeding in their intended purpose of increasing public safety near schools? What he found using open data may surprise you. You can read Tim's write-up titled Speed Cameras: Re...

Feb 27, 2015 · 17 min · Transcript available on Metacast

[MINI] k-means clustering

The k-means clustering algorithm partitions an n-dimensional dataset into a given number "k" of clusters, assigning each point the label of its nearest cluster center. This mini-episode explores how the biological processes of Yoshi, our lilac-crowned amazon, might be a useful way of measuring where she sits when there are no humans around. Listen to find out how!

Feb 20, 2015 · 14 min · Transcript available on Metacast
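
As a rough illustration of k-means, here is a small pure-Python sketch of the usual iterative procedure (often called Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points. The 2-D coordinates are made up, and the fixed iteration count and seeded random start are simplifying assumptions.

```python
import math
import random


def nearest(point, centers):
    """Index of the center closest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def k_means(points, k, iterations=20, seed=0):
    """Return (labels, centers) after a fixed number of update rounds."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old center if a cluster is empty
                centers[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    labels = [nearest(p, centers) for p in points]
    return labels, centers


# Made-up 2-D locations forming two well-separated groups.
spots = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(spots, k=2)
print(labels)
print(centers)
```

Which group ends up as cluster 0 depends on the random starting centers, so only the grouping is meaningful, not the label numbers themselves.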

Shadow Profiles on Social Networks

Emre Sarigol joins me this week to discuss his paper Online Privacy as a Collective Phenomenon. This paper studies data collected from social networks and how the sharing behaviors of individuals can unintentionally reveal private information about other people, including those who have not even joined the social network! For the specific test discussed, the researchers were able to accurately predict the sexual orientation of individuals, even when this information was withheld during the tra...

Feb 13, 2015 · 39 min · Transcript available on Metacast

[MINI] The Chi-Squared Test

The Chi-Squared test is a methodology for hypothesis testing. When one has categorical data, in the form of frequency counts or observations (e.g. Vegetarian, Pescetarian, and Omnivore), split into two or more categories (e.g. Male, Female), a question may arise such as "Are women more likely than men to be vegetarian?" or put more accurately, "Does the frequency with which women report being vegetarian differ in a statistically significant way from the frequency men rep...

Feb 06, 2015 · 18 min · Transcript available on Metacast
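
To see what a chi-squared test of independence actually computes, here is a minimal sketch. The survey counts are invented for illustration and are not from the episode. The closed-form p-value used at the end holds only when there are exactly 2 degrees of freedom, which happens to be the case for a 2-by-3 table.

```python
import math


def chi_squared_statistic(table):
    """Chi-squared statistic and degrees of freedom for a table of counts.

    Assumes every row and column total is positive, so that every
    expected count is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Made-up counts. Columns: vegetarian, pescetarian, omnivore.
survey = [
    [10, 10, 80],  # men
    [20, 10, 70],  # women
]
statistic, dof = chi_squared_statistic(survey)

# With exactly 2 degrees of freedom, P(X >= x) = exp(-x / 2).
p_value = math.exp(-statistic / 2) if dof == 2 else None
print(statistic, dof, p_value)
```

For tables of other shapes the p-value needs the general chi-squared distribution, which a statistics library would normally supply.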

Mapping Reddit Topics with Randy Olson

My guest this week is noteworthy AI researcher Randy Olson, who joins me to share his work creating the Reddit World Map - a visualization that illuminates clusters in the reddit community based on user behavior. Randy's blog post on creating the Reddit World Map is well complemented by a more detailed write-up titled Navigating the massive world of reddit: using backbone networks to map user interests in social media. Last but not least, an interactive version of the results (which leverages G...

Jan 30, 2015 · 30 min · Transcript available on Metacast

[MINI] Partially Observable State Spaces

When dealing with dynamic systems that are potentially undergoing constant change, it's helpful to describe what "state" they are in. In many applications the manner in which the state changes from one to another is not completely predictable; thus, there is uncertainty over how it transitions from state to state. Further, in many applications, one cannot directly observe the true state, and thus we describe such situations as partially observable state spaces. This episode explores what this mea...

Jan 23, 2015 · 13 min · Transcript available on Metacast

Easily Fooling Deep Neural Networks

My guest this week is Anh Nguyen, a PhD student at the University of Wyoming working in the Evolving AI lab. The episode discusses the paper Deep Neural Networks are Easily Fooled [pdf] by Anh Nguyen, Jason Yosinski, and Jeff Clune. It describes a process for creating images that a trained deep neural network will misclassify. If you have a deep neural network that has been trained to recognize certain types of objects in images, these "fooling" images can be constructed in a way which the net...

Jan 16, 2015 · 28 min · Transcript available on Metacast

[MINI] Data Provenance

This episode introduces a high-level discussion on the topic of Data Provenance, with more MINI episodes to follow to get into specific topics. Thanks to listener Sara L who wrote in to point out the Data Skeptic Podcast has focused a lot on using data to be skeptical, but not necessarily on being skeptical of data. Data Provenance is the concept of knowing the full origin of your dataset. Where did it come from? Who collected it? How was it collected? Does it combine independent sources or one si...

Jan 09, 2015 · 11 min · Transcript available on Metacast

Doubtful News, Geology, Investigating Paranormal Groups, and Thinking Scientifically with Sharon Hill

I had the chance to speak with the well-known Sharon Hill ( @idoubtit ) for the first episode of 2015. We discuss a number of interesting topics including the contributions Doubtful News makes to getting scientific and skeptical information ranked highly in search results, sinkholes, why earthquakes are hard to predict, and data collection about paranormal groups via the internet....

Jan 03, 2015 · 31 min · Transcript available on Metacast

[MINI] Belief in Santa

In this quick holiday episode, we touch on how one would approach modeling the statistical distribution over the probability of belief in Santa Claus given age.

Dec 26, 2014 · 10 min · Transcript available on Metacast

Economic Modeling and Prediction, Charitable Giving, and a Follow Up with Peter Backus

Economist Peter Backus joins me in this episode to discuss a few interesting topics. You may recall Linhda and I previously discussed his paper "The Girlfriend Equation" on a recent mini-episode. We start by touching base on this fun paper and get a follow-up on where Peter stands, years after writing it, with respect to a successful romantic union. Additionally, we delve into some fascinating economics topics. We touch on questions of the role models, for better or for worse, played in the ~2008 ec...

Dec 19, 2014 · 24 min · Transcript available on Metacast

[MINI] The Battle of the Sexes

Love and Data is the continued theme in this mini-episode as we discuss the game theory example of The Battle of the Sexes. In this textbook example, a couple must strategize about how to spend their Friday night. One partner prefers football games while the other partner prefers to attend the opera. Yet, each person would rather be at their non-preferred location so long as they are still with their spouse. So where should they decide to go?...

Dec 12, 2014 · 18 min · Transcript available on Metacast
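
The Battle of the Sexes setup can be written down as a small payoff table and checked for pure-strategy Nash equilibria, meaning outcomes where neither partner gains by switching alone. The payoff numbers below are hypothetical; they only encode the ordering described in the episode (being together beats being apart, and each partner has a favorite venue).

```python
ACTIONS = ["football", "opera"]

# Hypothetical payoffs: PAYOFFS[(choice_a, choice_b)] = (payoff_a, payoff_b).
# Partner A prefers football, partner B prefers opera, and both get
# nothing if they end up in different places.
PAYOFFS = {
    ("football", "football"): (2, 1),
    ("football", "opera"): (0, 0),
    ("opera", "football"): (0, 0),
    ("opera", "opera"): (1, 2),
}


def pure_nash_equilibria(payoffs, actions):
    """Outcomes where neither player can do better by deviating alone."""
    equilibria = []
    for a in actions:
        for b in actions:
            payoff_a, payoff_b = payoffs[(a, b)]
            a_is_best = all(payoff_a >= payoffs[(alt, b)][0] for alt in actions)
            b_is_best = all(payoff_b >= payoffs[(a, alt)][1] for alt in actions)
            if a_is_best and b_is_best:
                equilibria.append((a, b))
    return equilibria


print(pure_nash_equilibria(PAYOFFS, ACTIONS))
```

With these payoffs both "together" outcomes are equilibria, which is exactly why the couple's decision is hard: the game itself does not say which one to pick. Mixed strategies are not covered by this sketch.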

The Science of Online Data at Plenty of Fish with Thomas Levi

Can algorithms help you find love? Many happy couples successfully brought together via online dating websites show us that data science can help you find love. I'm joined this week by Thomas Levi, Senior Data Scientist at Plenty of Fish, to discuss some of his work which helps people find one another as efficiently as possible. Matchmaking is a truly non-trivial problem, and one that's dynamically changing all the time as new users join and leave the "pool of fish". This episode explores the a...

Dec 05, 2014 · 59 min · Transcript available on Metacast

[MINI] The Girlfriend Equation

Economist Peter Backus put forward "The Girlfriend Equation" while working on his PhD - a probabilistic model attempting to estimate the likelihood of him finding a girlfriend. In this mini-episode we explore the soundness of his model and also share some stories about how Linhda and Kyle met.

Nov 28, 2014 · 16 min · Transcript available on Metacast

The Secret and the Global Consciousness Project with Alex Boklin

I'm joined this week by Alex Boklin to explore the topic of magical thinking, especially in the context of Rhonda Byrne's "The Secret", and the similarities it bears to The Global Consciousness Project (GCP). The GCP puts forward the hypothesis that random number generators exhibit statistically significant changes as a result of major world events.

Nov 21, 2014 · 42 min · Transcript available on Metacast

[MINI] Monkeys on Typewriters

What is randomness? How can we determine if some results are randomly generated or not? Why are random numbers important to us in our everyday life? These topics and more are discussed in this mini-episode on random numbers. Many readers will be vaguely familiar with the idea of "X number of monkeys banging on Y number of typewriters for Z number of years" - the idea being that such a setup would produce random sequences of letters. The origin of this idea was the mathematician Borel who was inte...

Nov 14, 2014 · 3 min · Transcript available on Metacast