Data science is typically done by engineers writing code in Python, R, or another scripting language. Lots of engineers know these languages, and their ecosystems have great library support. But these languages have some issues around deployment, reproducibility, and other areas. The programming language Golang presents an appealing alternative for data scientists. Daniel Whitenack transitioned from doing most of his data science work in Python to writing code in Golang. In this episode, Daniel ...
Feb 09, 2017•56 min
Translation is a classic problem in computer science. How do you translate a sentence from one human language into another? This seems like a problem that computers are well-suited to solve. Languages follow well-defined rules, and we have lots of sample data to train our machine learning models. And yet, the problem has not been solved, largely because languages don’t always follow rules. We have idioms and subtle contextual clues that make it hard to provide a computer with hard and fast rules for ...
Jan 25, 2017•51 min
Medical imaging is used to understand what is going on inside the human body and prescribe treatment. With new image processing and machine learning techniques, the traditional medical imaging techniques such as CT scans can be enriched to get a more sophisticated diagnosis. HeartFlow uses data from a standard CT scan to model a human heart and understand blockages of blood flow using simulations of fluid dynamics. In today’s episode, Razik Yousfi and Leo Grady from HeartFlow describe the data p...
Jan 17, 2017•53 min
Data visualization tools are required to translate the findings of data scientists into charts, graphs, and pictures. Understanding how to utilize these tools and display data is necessary for a data scientist to communicate with people in other domains. In this episode, Srini Kadamati hosts a discussion with Jake VanderPlas about the Python ecosystem for data science and the different attempts at creating a data visualization library. Jake VanderPlas is the Director of Research for Physical Sci...
Jan 16, 2017•44 min
Data engineering is the software engineering that enables data scientists to work effectively. In today’s episode, we explore the different sides of data engineering–the data science algorithms that need to be processed and the implementation of software architectures that enable those algorithms to run smoothly. The PANCAKE STACK is a 12-letter acronym that Chris Fregly gave to a collection of data engineering technologies including Presto, Cassandra, Kafka, Elasticsearch, and Spark. In his cu...
Oct 17, 2016•55 min
Scikit-learn is a set of machine learning tools in Python that provides easy-to-use interfaces for building predictive models. In a previous episode with Per Harald Borgen about Machine Learning For Sales, he illustrated how easy it is to get up and running and productive with scikit-learn, even if you are not a machine learning expert. Srini Kadamati hosts today’s show and interviews Andreas Mueller, a core committer to scikit-learn. Srini and Andreas discuss the background and implementation o...
Sep 27, 2016•31 min
Machine learning can be used to generate music. In the case of Feynman Liang’s research project BachBot, the machine learning model is seeded with the music of famous composer Bach. The music that BachBot creates sounds remarkably similar to Bach, although it has been generated by an algorithm, not by a human. BachBot is a research project on computational creativity. Feynman Liang created BachBot using Python machine learning tools to build a long short-term memory (LSTM) model. Our conversation explo...
Sep 02, 2016•44 min
You have probably read a news article that was written by a machine. When earnings reports come out, or a series of sports events like the Olympics occurs, there are so many small stories that need to be written that a news organization like the Associated Press would have to use all of its resources to write enough content to cover it all. Wordsmith is a tool for automated content generation, and today’s guest Robbie Allen is the CEO of Automated Insights, the company that makes Wordsmith. He t...
Sep 01, 2016•48 min
Research in artificial intelligence takes place mostly at universities and large corporations, but both of these types of institutions have constraints that cause the research to proceed a certain way. In a university, basic research might be hindered by lack of funding. At a big corporation, the researcher might be encouraged to study a domain that is not squarely in the interest of public good–such as targeted advertising. Oren Etzioni is the CEO of the Allen Institute for Artificial Intellige...
Aug 29, 2016•1 hr 2 min
TensorFlow is Google’s open source machine learning library. Rajat Monga is the engineering director for TensorFlow. In this episode, we cover how to use TensorFlow, including an example of how to build a machine learning model to identify whether a picture contains a cat or not. TensorFlow was built with the mission of simplifying the process of deploying a machine learning model from research to production, so we also talk about that, as well as how TensorFlow can be used effectively in combin...
Aug 18, 2016•43 min
Data validation is the process of ensuring that data is accurate. In many software domains, an application pulls in large quantities of data from external sources. That data will eventually be exposed to users, and it needs to be correct. Radius Intelligence is a company that aggregates data on small businesses. To verify business addresses and phone numbers, Radius uses human data validation to check its machine-gathered data. On today’s episode...
Aug 17, 2016•40 min
Machine learning has become simplified. Similar to how Ruby on Rails made web development approachable, scikit-learn takes away many of the frustrating aspects of machine learning, and lets the developer focus on building functionality with high-level APIs. Per Harald Borgen is a developer at Xeneta. He started programming fairly recently, but has already built a machine learning application that cuts down on the time his sales team has to spend qualifying leads. What I found most interesting ab...
Aug 16, 2016•43 min
The war against spam has been going on for decades. Email spam blockers and ad blockers help protect us from unwanted messages in our communication and browsing experience. These spam prevention tools are powered by machine learning, which catches most of the emails and ads that we don’t want to see. TrueCaller is a company that is bringing this quality of spam detection to our phone call systems. Umut Alp is the CTO of TrueCaller, and he joins the show today to break down the engineering proble...
Jun 08, 2016•53 min
“Building a model to predict disease and deploying that in the wild – the bar for success is much higher there than, say, deciding what ad to show you.” Diagnosing illness today requires the trained eye of a doctor. With machine learning, we might someday be able to diagnose illness using only a data set. Today on Software Engineering Daily, we are joined by David Kale, a researcher at the intersection of machine learning and clinical data. We discuss the machine learning and research techniques...
Mar 08, 2016•57 min
“Nothing’s cool unless you call it ‘as a service.’ ” Monsanto is a company that is known for its chemical and biological engineering. It is less well known for its data science and software engineering teams. Tim Williamson is a data scientist at Monsanto, and on today’s show he talked about how he and a small group of engineers at Monsanto dramatically shifted the culture around data science-driven genetic engineering. In this episode, Tim explains how useful graph databases are for modeling th...
Feb 29, 2016•55 min
“I definitely think we can try to abstract away the first principles of intelligence and then try to go from these principles to an intelligent machine that might look nothing like the brain.” Keras is a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation. In this episode, François Chollet, the creator of Keras, discusses the state of deep learning, and explains why the field is experi...
Jan 29, 2016•52 min
“You’ve got software engineers who are interested in machine learning, and think what they need to do is just bring in another module and then that will solve their problem. It’s particularly important for those people to understand that this is a different type of beast.” Machine learning is something that many businesses are starting to tack onto their existing processes. Yet, to add machine learning capabilities after the fact is often a fool’s errand. Joshua argues that machine learning cannot...
Jan 19, 2016•56 min
“You don’t mind if failures slow things down, but it’s very important that failures do not stop forward progress.” TensorFlow is an open source machine learning library intended to bring large-scale, distributed machine learning and deep learning to everyone. Google recently released the framework to the public as a second-generation API, having learned from the successes and failures of DistBelief. Greg Corrado is a senior research scientist and tech lead at Google, where he focuses on the rese...
Dec 15, 2015•40 min
“I normally try to sit together or very close to a product team or engineering team. And by doing so, I get very close to the source of all kinds of challenging problems.” Spotify is a streaming music service that uses data science and machine learning to implement product features such as recommendation systems and music categorization, but also to answer internal questions. Boxun Zhang is a data scientist at Spotify where he focuses on understanding user behavior within the product. Questions ...
Dec 11, 2015•56 min
“When I was a graduate student, I was sitting in the office of my advisor in electrical engineering and he said, ‘Look out that window – you see a Volkswagen, I see a realization of a random variable.’ ” Richard Golden is the host of Learning Machines 101, a podcast that covers artificial intelligence and machine learning topics. Dr. Golden is also a full-time Professor of Cognitive Science and Electrical Engineering at UT Dallas. Questions What is machine learning? What are the fundamental con...
Dec 08, 2015•56 min
“Changing anything changes everything.” Technical debt, referring to the compounding cost of changes to software architecture, can be especially challenging in machine learning systems. D. Sculley is a software engineer at Google, focusing on machine learning, data mining, and information retrieval. He recently co-authored the paper Machine Learning: The High Interest Credit Card of Technical Debt. Questions How do you define technical debt? Why does technical debt tend to compound like financi...
Nov 17, 2015•32 min
Current infrastructure makes it difficult for data scientists to share analytical models with the software engineers who need to integrate them. Yhat is an enterprise software company tackling the challenge of how data science gets done. Their products enable companies and users to easily deploy data science environments and translate analytical models into production code. Greg Lamp is the Co-founder and CTO of Yhat and previously worked as a product manager in financial services. Yhat was part...
Oct 05, 2015•47 min
Data science competitions are an effective way to crowdsource the best solutions for challenging datasets. Kaggle is a platform for data scientists to collaborate and compete on machine learning problems with the opportunity to win money from the competitions’ sponsors. Ben Hamner is the co-founder and CTO of Kaggle. Questions What is Kaggle? How does the experience of an individual competitor compare to the experience of a data science team? What is Kaggle’s tech stack? Do companies collect too...
Oct 03, 2015•50 min
There is a need for more data scientists to make sense of the vast amounts of data we produce and store. Dataquest is an in-browser platform for learning data science that is tackling this problem. Vik Paruchuri is the founder of Dataquest. He was previously a machine learning engineer at EdX and before that a U.S. diplomat. Questions What is data science? How does data science compare to software engineering? How does someone new to data science go about starting off at Kaggle? In machine learn...
Sep 30, 2015•45 min