This week features an insightful discussion with Claudia Perlich about situations in machine learning where models can be built, perhaps by well-intentioned practitioners, to appear highly predictive despite being trained on random data. Our discussion covers some novel observations about ROC and AUC, as well as an informative discussion of leakage. Much of our discussion is inspired by two excellent papers Claudia authored: Leakage in Data Mining: Formulation, Detection, and Avoidance and ...
Jul 22, 2016 • 37 min

An ROC curve is a plot that compares the trade-off between true positives and false positives of a binary classifier under different thresholds. The area under the curve (AUC) is useful in determining how discriminating a model is. Together, ROC and AUC are very useful diagnostics for understanding the power of one's model and how to tune it.
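As a rough illustration of these diagnostics (toy labels and scores, not anything from the episode), here is a minimal sketch using scikit-learn, assuming it is installed:

```python
# Minimal ROC/AUC sketch on invented data.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                      # actual binary labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.55]   # model scores

# Each threshold yields one (false positive rate, true positive rate) point on the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

# AUC summarizes the whole curve: 1.0 is perfect discrimination, 0.5 is chance.
print("AUC:", roc_auc_score(y_true, y_score))
```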
Jul 15, 2016 • 11 min

I'm joined by Chris Stucchio this week to discuss how deliberate or uninformed statistical practitioners can derive spurious and arbitrary results via multiple comparisons. We discuss p-hacking and a variety of other important lessons and tips for proper analysis. You can enjoy Chris's writing on his blog at chrisstucchio.com and you may also like his recent talk Multiple Comparisons: Make Your Boss Happy with False Positives, Guaranteed ...
Jul 08, 2016 • 30 min

If you'd like to make a good prediction, your best bet is to invent a time machine, visit the future, observe the value, and return to the past. For those without access to time travel technology, we need to avoid including information about the future in our training data when building machine learning models. Similarly, any other feature whose value would not actually be available in practice at the time you'd want to use the model to make a prediction is a feature that can introduce leakage...
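As a rough sketch of the pitfall (the column names below are invented, not from the episode), compare a feature computed from future information with one restricted to what is known at prediction time, and note the date-based split:

```python
# Illustrative sketch of temporal leakage; assumes pandas is installed,
# and all column names and values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2016-01-05", "2016-02-10", "2016-03-20"]),
    "purchases_before_signup": [0, 2, 1],   # known at prediction time: safe
    "purchases_next_90_days": [5, 0, 3],    # observed only later: leaky
    "churned": [0, 1, 0],
})

# Leaky feature set: "purchases_next_90_days" would not exist when the model runs.
X_leaky = df[["purchases_before_signup", "purchases_next_90_days"]]

# Safer: restrict to features available at prediction time, and split
# train/test by date rather than at random to avoid peeking into the future.
X_safe = df[["purchases_before_signup"]]
cutoff = pd.Timestamp("2016-02-01")
train, test = df[df["signup_date"] < cutoff], df[df["signup_date"] >= cutoff]
```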
Jul 01, 2016 • 12 min

Kristian Lum (@KLdivergence) joins me this week to discuss her work at @hrdag on predictive policing. We also discuss Multiple Systems Estimation, a technique for inferring statistical information about a population from separate sources of observation. If you enjoy this discussion, check out the panel Tyranny of the Algorithm? Predictive Analytics & Human Rights which was mentioned in the episode...
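As a very rough illustration of the idea behind multiple systems estimation (my own toy example of the classic two-list Lincoln-Petersen estimator, not a calculation from the episode):

```python
# Two-list capture-recapture (Lincoln-Petersen), the simplest case of
# multiple systems estimation. All numbers are invented for illustration.
def lincoln_petersen(n1, n2, overlap):
    """Estimate total population size from two overlapping lists of observations."""
    if overlap == 0:
        raise ValueError("Estimator is undefined when the lists share no individuals.")
    return n1 * n2 / overlap

# List A documents 200 individuals, list B documents 150, and 30 appear on both.
print(lincoln_petersen(200, 150, 30))  # estimates roughly 1000 individuals in total
```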
Jun 24, 2016 • 36 min

A distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance. Most system architects need to think carefully about how they should appropriately balance the needs of their application across these competing objectives. Linh Da and Kyle discuss the CAP Theorem using the analogy of a phone tree for alerting people about a school snow day.
Jun 17, 2016 • 11 min

A startup is claiming that they can detect terrorists purely through facial recognition. In this solo episode, Kyle explores the plausibility of these claims.
Jun 10, 2016 • 33 min

Goodhart's law states that "When a measure becomes a target, it ceases to be a good measure". In this mini-episode we discuss how this affects SEO, call centers, and Scrum.
Jun 03, 2016 • 11 min

I'm joined this week by Jon Morra, director of data science at eHarmony, to discuss a variety of ways in which machine learning and data science are being applied to help connect people for successful long-term relationships. Interesting open source projects mentioned in the interview include Face-parts, a web service for detecting faces and extracting a robust set of fiducial markers (features) from the image, and Aloha, a Scala-based machine learning library. You can learn more about these an...
May 27, 2016 • 43 min

Mystery shoppers and fruit cultivation help us discuss stationarity - a property of some time series that are invariant to time in several ways. Differencing is one approach that can often convert a non-stationary process into a stationary one. If you have a stationary process, you get the benefits of many known statistical properties that can enable you to do a significant amount of inference and prediction.
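A minimal sketch of first differencing, on an invented trending series rather than anything from the episode:

```python
# First differencing a non-stationary (trending) series; illustrative only.
import numpy as np

np.random.seed(0)
trend = 0.5 * np.arange(100)                             # deterministic upward trend
series = trend + np.random.normal(scale=1.0, size=100)   # mean grows over time

diff = np.diff(series)                                   # y[t] - y[t-1]

# The differenced series has a roughly constant mean, which is what many
# classical time-series methods assume.
print("original halves:", series[:50].mean(), series[50:].mean())
print("differenced halves:", diff[:50].mean(), diff[50:].mean())
```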
May 20, 2016 • 14 min

I'm joined by Wes McKinney (@wesmckinn) and Hadley Wickham (@hadleywickham) on this episode to discuss their joint project Feather. Feather is a file format for storing data frames along with some metadata, to help with interoperability between languages. At the time of recording, libraries are available for R and Python, making it easy for data scientists working in these languages to quickly and effectively share datasets and collaborate...
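A minimal sketch of the Python side, assuming pandas with Feather support (via pyarrow) is installed; the data frame here is invented, and the original feather package exposed equivalent read/write functions:

```python
# Write a data frame to a Feather file and read it back; toy data for illustration.
import pandas as pd

df = pd.DataFrame({"city": ["LA", "NYC"], "temp_f": [75.0, 41.0]})
df.to_feather("weather.feather")              # write to disk in the Feather format

same_df = pd.read_feather("weather.feather")  # read it back unchanged
print(same_df)
```

On the R side, the same file can be read with feather::read_feather (now provided by the arrow package), which is the cross-language interoperability the episode highlights.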
May 13, 2016 • 23 min

Bargaining is the process of two (or more) parties attempting to agree on the price for a transaction. Game theoretic approaches attempt to find two strategies from which neither party is motivated to deviate. These strategies are said to be in equilibrium with one another. The equilibria available in bargaining depend on the transaction mechanism and the information of the parties. Discounting (how long parties are willing to wait) has a significant effect in this process. This episode di...
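As one hedged illustration of how discounting shapes the outcome, the textbook Rubinstein alternating-offers model (not necessarily the mechanism discussed in the episode) gives the first proposer an equilibrium share of (1 - δ₂) / (1 - δ₁δ₂), where δ₁ and δ₂ are the two parties' discount factors:

```python
# Equilibrium split in the Rubinstein alternating-offers bargaining model.
# delta1 and delta2 are discount factors in (0, 1); closer to 1 means more patient.
def proposer_share(delta1, delta2):
    """Share of the surplus the first proposer receives in equilibrium."""
    return (1 - delta2) / (1 - delta1 * delta2)

# A patient responder extracts more, shrinking the proposer's share.
print(proposer_share(0.9, 0.50))  # ~0.91: impatient responder
print(proposer_share(0.9, 0.95))  # ~0.34: patient responder
```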
May 06, 2016 • 15 min

Deepjazz is a project from Ji-Sung Kim, a computer science student at Princeton University. It is built using Theano, Keras, music21, and Evan Chow's project jazzml. Deepjazz is a computational music project that creates original jazz compositions using recurrent neural networks trained on Pat Metheny's "And Then I Knew". You can hear some of deepjazz's original compositions on soundcloud ...
Apr 29, 2016 • 30 min

When working with time series data, there are a number of important diagnostics one should consider to help understand more about the data. The autocorrelation function, plotted as a correlogram, helps explain how a given observation relates to recent preceding observations. A very random process (like lottery numbers) would show very low values, while temperature (our topic in this episode) does correlate highly with recent days. See the show notes with details about Chapel Hill, NC weather d...
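A rough sketch of estimating autocorrelation at a few lags, using an invented daily temperature series rather than the Chapel Hill data from the show notes:

```python
# Sample autocorrelation of a toy daily temperature series at several lags.
import numpy as np

np.random.seed(1)
days = np.arange(365)
temps = 60 + 20 * np.sin(2 * np.pi * days / 365) + np.random.normal(0, 3, 365)

def autocorr(x, lag):
    """Sample autocorrelation of x at the given positive lag."""
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

for lag in (1, 2, 7, 30):
    print(f"lag {lag:>2}: {autocorr(temps, lag):.2f}")
# Nearby days are highly correlated; a lottery-number series would hover near zero.
```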
Apr 22, 2016 • 15 min

This week I spoke with Elham Shaabani and Paulo Shakarian (@PauloShakASU) about their recent paper Early Identification of Violent Criminal Gang Members (also available on arXiv). In this paper, they use social network analysis techniques and machine learning to provide early detection of known criminal offenders who are in a high-risk group for committing violent crimes in the future. Their techniques outperform existing techniques used by the police. Elham and Paulo are part of the Cyber-So...
Apr 15, 2016 • 27 min

A dinner party at Data Skeptic HQ helps teach the uses of fractional factorial design for studying 2-way interactions.
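As a rough sketch of the idea (not the dinner-party design from the episode), a half fraction of a four-factor, two-level experiment can be generated by confounding the fourth factor with the three-way interaction of the others, cutting 16 runs down to 8:

```python
# Generate a 2^(4-1) fractional factorial design with defining relation D = ABC.
# Factor names are hypothetical; levels are coded -1 / +1.
from itertools import product

runs = []
for a, b, c in product((-1, 1), repeat=3):
    d = a * b * c                 # fourth factor aliased with the ABC interaction
    runs.append((a, b, c, d))

print(" A  B  C  D")
for run in runs:
    print(" ".join(f"{x:+d}" for x in run))
# 8 runs instead of the 16 a full factorial would require.
```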
Apr 08, 2016 • 11 min

Cheng-tao Chu (@chengtao_chu) joins us this week to discuss his perspective on common mistakes and pitfalls that are made when doing machine learning. This episode is filled with sage advice for beginners and intermediate users of machine learning, and possibly some good reminders for experts as well. Our discussion parallels his recent blog post Machine Learning Done Wrong. Cheng-tao Chu is an entrepreneur who has worked at many well-known Silicon Valley companies. His paper Map-Reduce for M...
Apr 01, 2016 • 25 min

Co-host Linh Da was in a biking accident after hitting a pothole. She sustained an injury that required stitches. This is the story of our quest to file a 311 complaint and track it through the City of Los Angeles's open data portal. My guests this episode are Chelsea Ursaner (LA City Open Data Team), Ben Berkowitz (CEO and founder of SeeClickFix), and Russ Klettke (Editor of pothole.info)...
Mar 25, 2016 • 41 min

Certain data mining algorithms (including k-means clustering and k-nearest neighbors) require a user-defined parameter k. A user of these algorithms is required to select this value, which raises the question: what is the "best" value of k that one should select to solve their problem? This mini-episode explores the appropriate value of k to use when trying to estimate the cost of a house in Los Angeles based on the closest sales in its area.
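One common way to pick k is cross-validation. The sketch below uses invented square-footage and price data (not the Los Angeles sales from the episode) and assumes scikit-learn is installed:

```python
# Choose k for k-nearest-neighbors regression by cross-validation; toy data only.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3500, size=200)
price = 300 * sqft + rng.normal(0, 50_000, size=200)   # invented relationship
X, y = sqft.reshape(-1, 1), price

for k in (1, 3, 5, 10, 25):
    model = KNeighborsRegressor(n_neighbors=k)
    score = cross_val_score(model, X, y, cv=5).mean()   # mean R^2 across folds
    print(f"k={k:<2}  mean CV R^2 = {score:.3f}")
# Too small a k chases individual noisy sales; too large a k oversmooths the market.
```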
Mar 18, 2016 • 15 min

Today on Data Skeptic, Lachlan Gunn joins us to discuss his recent paper Too Good to be True. This paper highlights a somewhat paradoxical / counterintuitive fact about how unanimity is unexpected in cases where perfect measurements cannot be taken. With large enough data, some amount of error is expected. The "Too Good to be True" paper highlights three interesting examples which we discuss in the podcast. You can also watch a lecture from Lachlan on this topic via youtube here ...
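As a back-of-the-envelope illustration of why unanimity becomes suspicious (my own toy calculation, not one taken from the paper): if each of n independent witnesses is correct with probability 0.9, the chance that all of them agree on the correct answer shrinks rapidly with n.

```python
# Probability that n independent witnesses, each correct with probability p,
# unanimously give the correct answer. Numbers are purely illustrative.
p = 0.9
for n in (3, 10, 20, 50):
    print(f"{n:>2} witnesses: P(unanimous and correct) = {p ** n:.3f}")
# With 50 witnesses the probability is about 0.005, so observed unanimity hints
# at something systematic (shared bias or error) rather than independent accuracy.
```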
Mar 11, 2016 • 35 min

How well does your model explain your data? R-squared is a useful statistic for answering this question. In this episode we explore how it applies to the problem of valuing a house. Aspects like the number of bedrooms go a long way in explaining why different houses have different prices. There's some amount of variance that can be explained by a model, and some amount that cannot. R-squared is the ratio of the explained variance to the total variance. It's not a measure of ...
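A minimal sketch of the calculation, using invented house prices and predictions rather than figures from the episode:

```python
# Compute R^2 by hand: one minus the unexplained share of the total variance.
import numpy as np

actual    = np.array([450_000, 600_000, 320_000, 510_000, 700_000])  # sale prices
predicted = np.array([430_000, 630_000, 340_000, 500_000, 660_000])  # model output

ss_residual = np.sum((actual - predicted) ** 2)         # variance left unexplained
ss_total    = np.sum((actual - actual.mean()) ** 2)     # total variance in prices

r_squared = 1 - ss_residual / ss_total
print(f"R^2 = {r_squared:.3f}")   # fraction of price variance the model explains
```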
Mar 04, 2016 • 13 min

Jessica Hamrick joins us this week to discuss her work studying mental simulation. Her research combines machine learning approaches with behavioral methods from cognitive science to help explain how people reason and predict outcomes. Her recent paper Think again? The amount of mental simulation tracks uncertainty in the outcome is the focus of our conversation in this episode. Lastly, Kyle invited Samuel Hansen from the Relatively Prime podcast to mention the Relatively Prime Season 3 kickstarter...
Feb 26, 2016 • 40 min

This episode is a discussion of multiple regression: the use of observations that are a vector of values to predict a response variable. For this episode, we consider how features of a home such as the number of bedrooms, number of bathrooms, and square footage can predict the sale price. Unlike a typical episode of Data Skeptic, these show notes are not just supporting material, but are actually featured in the episode. The site Redfin graciously allows users to download a CSV of results they a...
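A minimal sketch of fitting such a model, with invented homes rather than the Redfin CSV used in the episode, assuming scikit-learn is installed:

```python
# Multiple regression: predict sale price from several home features at once.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: bedrooms, bathrooms, square footage (all values invented).
X = np.array([
    [2, 1,  900],
    [3, 2, 1500],
    [4, 3, 2200],
    [3, 2, 1700],
    [5, 4, 3000],
])
y = np.array([350_000, 520_000, 710_000, 560_000, 950_000])  # sale prices

model = LinearRegression().fit(X, y)
print("coefficients (bedrooms, bathrooms, sq ft):", model.coef_)
print("prediction for a 3 bd / 2 ba / 1600 sq ft home:",
      model.predict([[3, 2, 1600]])[0])
```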
Feb 19, 2016 • 18 min

Samuel Mehr joins us this week to share his perspective on why people are musical, where music comes from, and why it works the way it does. We discuss a number of empirical studies related to music and musical cognition, and dispel a few myths about music along the way. Some of Sam's work discussed in this episode includes Music in the Home: New Evidence for an Intergenerational Link, Two randomized trials provide no consistent evidence for nonmusical cognitive benefits of brief preschool mus...
Feb 12, 2016 • 42 min

This episode reviews the concept of k-d trees: an efficient data structure for holding multidimensional objects. Kyle gives Linh Da a dictionary and asks her to look up words as a way of introducing the concept of binary search. We actually spend most of the episode talking about binary search before getting into k-d trees, but this is a necessary prerequisite.
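A minimal sketch of binary search, the dictionary-lookup idea, using an invented word list:

```python
# Binary search over a sorted list: repeatedly halve the search interval.
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        elif sorted_items[mid] < target:
            lo = mid + 1          # target must be in the upper half
        else:
            hi = mid - 1          # target must be in the lower half
    return -1

words = ["apple", "banana", "cherry", "kiwi", "mango", "pear", "plum"]
print(binary_search(words, "mango"))  # 4
print(binary_search(words, "grape"))  # -1
```

A k-d tree applies the same divide-and-conquer idea to points in several dimensions, alternating which coordinate it splits on at each level of the tree.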
Feb 05, 2016 • 14 min

Algorithms are pervasive in our society and make thousands of automated decisions on our behalf every day. The possibility of digital discrimination is a very real threat, and it is very plausible for discrimination to occur accidentally (i.e. outside the intent of the system designers and programmers). Christian Sandvig joins us in this episode to talk about his work and the concept of auditing algorithms. Christian Sandvig (@niftyc) has a PhD in communications from Stanford and is currently ...
Jan 29, 2016 • 43 min

Today's episode begins by asking how many left-handed employees we should expect at a company before anyone should claim left-handedness discrimination. If not lefties, consider eye color, hair color, favorite ska band, most recent grocery store used, or any number of other characteristics that could be studied to look for deviations from the norm in a company. When multiple comparisons are to be made simultaneously, one must account for this, and a common method for doing so is the Bonferroni correction...
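A minimal sketch of applying the correction to a set of made-up p-values (not figures from the episode):

```python
# Bonferroni correction: test each p-value against alpha divided by the number of tests.
p_values = [0.04, 0.20, 0.008, 0.03, 0.65]   # one hypothetical test per characteristic
alpha = 0.05
threshold = alpha / len(p_values)             # corrected significance threshold

for i, p in enumerate(p_values):
    verdict = "significant" if p < threshold else "not significant"
    print(f"test {i}: p={p:.3f} -> {verdict} at threshold {threshold:.3f}")
# Only the p=0.008 test survives; the naive 0.05 cutoff would have flagged three of them.
```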
Jan 22, 2016 • 14 min

A recent paper in the journal Judgment and Decision Making titled On the reception and detection of pseudo-profound bullshit explores empirical questions around a reader's ability to detect statements which may sound profound but are actually a collection of buzzwords that fail to contain adequate meaning or truth. These statements are definitively different from lies and nonsense, as we discuss in the episode. This paper proposes the Bullshit Receptivity scale (BSR) and empirically demonstr...
Jan 15, 2016 • 38 min

Today's mini episode discusses the widely known optimization algorithm gradient descent in the context of hiking on a foggy hillside.
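A minimal sketch of the algorithm on a toy one-dimensional function (my own example, not one from the episode): keep stepping in whichever direction goes downhill.

```python
# Gradient descent on f(x) = (x - 3)^2: step opposite the gradient until near the minimum.
def gradient(x):
    return 2 * (x - 3)           # derivative of (x - 3)^2

x = 10.0                          # starting point (wherever we happen to be standing)
learning_rate = 0.1               # size of each downhill step

for _ in range(50):
    x -= learning_rate * gradient(x)

print(f"ended near x = {x:.4f}")  # converges toward the minimum at x = 3
```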
Jan 08, 2016 • 15 min

This episode is a discussion of data visualization and a proposed New Year's resolution for Data Skeptic listeners. Let's kill the word cloud.
Jan 01, 2016 • 15 min