Data Skeptic

‌

Episodes

‌

Data Skeptic

The Data Skeptic Podcast features interviews and discussion of topics related to data science, statistics, machine learning, artificial intelligence and the like, all from the perspective of applying critical thinking and the scientific method to evaluate the veracity of claims and efficacy of approaches.

Episodes

pix2code

In this episode, Tony Beltramelli of UIzard Technologies joins our host, Kyle Polich, to talk about the ideas behind his latest app that can transform graphic design into functioning code, as well as his previous work on spying with wearables.

Jul 28, 2017•27 min•Transcript available on Metacast

[MINI] Conditional Independence

In statistics, two random variables might depend on one another (for example, interest rates and new home purchases). We call this conditional dependence. An important related concept exists called conditional independence. This phrase describes situations in which two variables are independent of one another given some other variable. For example, the probability that a vendor will pay their bill on time could depend on many factors such as the company's market cap. Thus, a statistical analysis...

Jul 21, 2017•15 min•Transcript available on Metacast

Estimating Sheep Pain with Facial Recognition

Animals can't tell us when they're experiencing pain, so we have to rely on other cues to help treat their discomfort. But it is often difficult to tell how much an animal is suffering. The sheep, for instance, is the most inscrutable of animals. However, scientists have figured out a way to understand sheep facial expressions using artificial intelligence. On this week's episode, Dr. Marwa Mahmoud from the University of Cambridge joins us to discuss her recent study, " Estimating Sheep Pain Lev...

Jul 14, 2017•27 min•Transcript available on Metacast

CosmosDB

This episode collects interviews from my recent trip to Microsoft Build where I had the opportunity to speak with Dharma Shukla and Syam Nair about the recently announced CosmosDB. CosmosDB is a globally consistent, distributed datastore that supports all the popular persistent storage formats (relational, key/value pair, document database, and graph) under a single streamlined API. The system provides tunable consistency, allowing the user to make choices about how consistency trade-offs are ma...

Jul 07, 2017•34 min•Transcript available on Metacast

[MINI] The Vanishing Gradient

This episode discusses the vanishing gradient - a problem that arises when training deep neural networks in which nearly all the gradients are very close to zero by the time back-propagation has reached the first hidden layer. This makes learning virtually impossible without some clever trick or improved methodology to help earlier layers begin to learn.

Jun 30, 2017•15 min•Transcript available on Metacast

Doctor AI

hen faced with medical issues, would you want to be seen by a human or a machine? In this episode, guest Edward Choi, co-author of the study titled Doctor AI: Predicting Clinical Events via Recurrent Neural Network shares his thoughts. Edward presents his team’s efforts in developing a temporal model that can learn from human doctors based on their collective knowledge, i.e. the large amount of Electronic Health Record (EHR) data.

Jun 23, 2017•42 min•Transcript available on Metacast

[MINI] Activation Functions

In a neural network, the output value of a neuron is almost always transformed in some way using a function. A trivial choice would be a linear transformation which can only scale the data. However, other transformations, like a step function allow for non-linear properties to be introduced. Activation functions can also help to standardize your data between layers. Some functions such as the sigmoid have the effect of "focusing" the area of interest on data. Extreme values are placed close toge...

Jun 16, 2017•14 min•Transcript available on Metacast

MS Build 2017

This episode recaps the Microsoft Build Conference. Kyle recently attended and shares some thoughts on cloud, databases, cognitive services, and artificial intelligence. The episode includes interviews with Rohan Kumar and David Carmona.

Jun 09, 2017•28 min•Transcript available on Metacast

[MINI] Max-pooling

Max-pooling is a procedure in a neural network which has several benefits. It performs dimensionality reduction by taking a collection of neurons and reducing them to a single value for future layers to receive as input. It can also prevent overfitting, since it takes a large set of inputs and admits only one value, making it harder to memorize the input. In this episode, we discuss the intuitive interpretation of max-pooling and why it's more common than mean-pooling or (theoretically) quartile...

Jun 02, 2017•13 min•Transcript available on Metacast

Unsupervised Depth Perception

This episode is an interview with Tinghui Zhou . In the recent paper " Unsupervised Learning of Depth and Ego-motion from Video ", Tinghui and collaborators propose a deep learning architecture which is able to learn depth and pose information from unlabeled videos. We discuss details of this project and its applications.

May 26, 2017•24 min•Transcript available on Metacast

[MINI] Convolutional Neural Networks

CNNs are characterized by their use of a group of neurons typically referred to as a filter or kernel. In image recognition, this kernel is repeated over the entire image. In this way, CNNs may achieve the property of translational invariance - once trained to recognize certain things, changing the position of that thing in an image should not disrupt the CNN's ability to recognize it. In this episode, we discuss a few high-level details of this important architecture.

May 19, 2017•15 min•Transcript available on Metacast

Multi-Agent Diverse Generative Adversarial Networks

Despite the success of GANs in imaging, one of its major drawbacks is the problem of 'mode collapse,' where the generator learns to produce samples with extremely low variety. To address this issue, today's guests Arnab Ghosh and Viveka Kulharia proposed two different extensions. The first involves tweaking the generator's objective function with a diversity enforcing term that would assess similarities between the different samples generated by different generators. The second comprises modifyi...

May 12, 2017•29 min•Transcript available on Metacast

[MINI] Generative Adversarial Networks

GANs are an unsupervised learning method involving two neural networks iteratively competing. The discriminator is a typical learning system. It attempts to develop the ability to recognize members of a certain class, such as all photos which have birds in them. The generator attempts to create false examples which the discriminator incorrectly classifies. In successive training rounds, the networks examine each and play a mini-max game of trying to harm the performance of the other. In addition...

May 05, 2017•10 min•Transcript available on Metacast

Opinion Polls for Presidential Elections

Recently, we've seen opinion polls come under some skepticism. But is that skepticism truly justified? The recent Brexit referendum and US 2016 Presidential Election are examples where some claims the polls "got it wrong". This episode explores this idea.

Apr 28, 2017•53 min•Transcript available on Metacast

OpenHouse

No reliable, complete database cataloging home sales data at a transaction level is available for the average person to access. To a data scientist interesting in studying this data, our hands are complete tied. Opportunities like testing sociological theories, exploring economic impacts, study market forces, or simply research the value of an investment when buying a home are all blocked by the lack of easy access to this dataset. OpenHouse seeks to correct that by centralizing and standardizin...

Apr 21, 2017•26 min•Transcript available on Metacast

[MINI] GPU CPU

There's more than one type of computer processor. The central processing unit (CPU) is typically what one means when they say "processor". GPUs were introduced to be highly optimized for doing floating point computations in parallel. These types of operations were very useful for high end video games, but as it turns out, those same processors are extremely useful for machine learning. In this mini-episode we discuss why.

Apr 14, 2017•11 min•Transcript available on Metacast

[MINI] Backpropagation

Backpropagation is a common algorithm for training a neural network. It works by computing the gradient of each weight with respect to the overall error, and using stochastic gradient descent to iteratively fine tune the weights of the network. In this episode, we compare this concept to finding a location on a map, marble maze games, and golf.

Apr 07, 2017•15 min•Transcript available on Metacast

Data Science at Patreon

In this week's episode of Data Skeptic, host Kyle Polich talks with guest Maura Church, Patreon's data science manager. Patreon is a fast-growing crowdfunding platform that allows artists and creators of all kinds build their own subscription content service. The platform allows fans to become patrons of their favorite artists- an idea similar the Renaissance times, when musicians would rely on benefactors to become their patrons so they could make more art. At Patreon, Maura's data science team...

Mar 31, 2017•32 min•Transcript available on Metacast

[MINI] Feed Forward Neural Networks

Feed Forward Neural Networks In a feed forward neural network, neurons cannot form a cycle. In this episode, we explore how such a network would be able to represent three common logical operators: OR, AND, and XOR. The XOR operation is the interesting case. Below are the truth tables that describe each of these functions. AND Truth Table Input 1 Input 2 Output 0 0 0 0 1 0 1 0 0 1 1 1 OR Truth Table Input 1 Input 2 Output 0 0 0 0 1 1 1 0 1 1 1 1 XOR Truth Table Input 1 Input 2 Output 0 0 0 0 1 1...

Mar 24, 2017•16 min•Transcript available on Metacast

Reinventing Sponsored Search Auctions

In this Data Skeptic episode, Kyle is joined by guest Ruggiero Cavallo to discuss his latest efforts to mitigate the problems presented in this new world of online advertising. Working with his collaborators, Ruggiero reconsiders the search ad allocation and pricing problems from the ground up and redesigns a search ad selling system. He discusses a mechanism that optimizes an entire page of ads globally based on efficiency-maximizing search allocation and a novel technical approach to computing...

Mar 17, 2017•42 min•Transcript available on Metacast

[MINI] The Perceptron

Today's episode overviews the perceptron algorithm. This rather simple approach is characterized by a few particular features. It updates its weights after seeing every example, rather than as a batch. It uses a step function as an activation function. It's only appropriate for linearly separable data, and it will converge to a solution if the data meets these criteria. Being a fairly simple algorithm, it can run very efficiently. Although we don't discuss it in this episode, multi-layer percept...

Mar 10, 2017•15 min•Transcript available on Metacast

The Data Refuge Project

DataRefuge is a public collaborative, grassroots effort around the United States in which scientists, researchers, computer scientists, librarians and other volunteers are working to download, save, and re-upload government data. The DataRefuge Project, which is led by the UPenn Program in Environmental Humanities and the Penn Libraries group at University of Pennsylvania, aims to foster resilience in an era of anthropogenic global climate change and raise awareness of how social and political e...

Mar 03, 2017•25 min•Transcript available on Metacast

[MINI] Automated Feature Engineering

If a CEO wants to know the state of their business, they ask their highest ranking executives. These executives, in turn, should know the state of the business through reports from their subordinates. This structure is roughly analogous to a process observed in deep learning, where each layer of the business reports up different types of observations, KPIs, and reports to be interpreted by the next layer of the business. In deep learning, this process can be thought of as automated feature engin...

Feb 24, 2017•16 min•Transcript available on Metacast

Big Data Tools and Trends

In this episode, I speak with Raghu Ramakrishnan, CTO for Data at Microsoft. We discuss services, tools, and developments in the big data sphere as well as the underlying needs that drove these innovations.

Feb 17, 2017•31 min•Transcript available on Metacast

[MINI] Primer on Deep Learning

In this episode, we talk about a high-level description of deep learning. Kyle presents a simple game (pictured below), which is more of a puzzle really, to try and give Linh Da the basic concept. Thanks to our sponsor for this week, the Data Science Association. Please check out their upcoming Dallas conference at dallasdatascience.eventbrite.com...

Feb 10, 2017•14 min•Transcript available on Metacast

Data Provenance and Reproducibility with Pachyderm

Versioning isn't just for source code. Being able to track changes to data is critical for answering questions about data provenance, quality, and reproducibility. Daniel Whitenack joins me this week to talk about these concepts and share his work on Pachyderm. Pachyderm is an open source containerized data lake. During the show, Daniel mentioned the Gopher Data Science github repo as a great resource for any data scientists interested in the Go language. Although we didn't mention it, Daniel al...

Feb 03, 2017•40 min•Transcript available on Metacast

[MINI] Logistic Regression on Audio Data

Logistic Regression is a popular classification algorithm. In this episode, we discuss how it can be used to determine if an audio clip represents one of two given speakers. It assumes an output variable (isLinhda) is a linear combination of available features, which are spectral bands in the discussion on this episode. Keep an eye on the dataskeptic.com blog this week as we post more details about this project. Thanks to our sponsor this week, the Data Science Association. Please check out thei...

Jan 27, 2017•21 min•Transcript available on Metacast

Studying Competition and Gender Through Chess

Prior work has shown that people's response to competition is in part predicted by their gender. Understanding why and when this occurs is important in areas such as labor market outcomes. A well structured study is challenging due to numerous confounding factors. Peter Backus and his colleagues have identified competitive chess as an ideal arena to study the topic. Find out why and what conclusions they reached. Our discussion centers around Gender, Competition and Performance: Evidence from Re...

Jan 20, 2017•34 min•Transcript available on Metacast

[MINI] Dropout

Deep learning can be prone to overfit a given problem. This is especially frustrating given how much time and computational resources are often required to converge. One technique for fighting overfitting is to use dropout. Dropout is the method of randomly selecting some neurons in one's network to set to zero during iterations of learning. The core idea is that each particular input in a given layer is not always available and therefore not a signal that can be relied on too heavily....

Jan 13, 2017•16 min•Transcript available on Metacast

The Police Data and the Data Driven Justice Initiatives

In this episode I speak with Clarence Wardell and Kelly Jin about their mutual service as part of the White House's Police Data Initiative and Data Driven Justice Initiative respectively. The Police Data Initiative was organized to use open data to increase transparency and community trust as well as to help police agencies use data for internal accountability. The PDI emerged from recommendations made by the Task Force on 21st Century Policing . The Data Driven Justice Initiative was organized ...

Jan 06, 2017•49 min•Transcript available on Metacast