In this episode, Marie Sadler talks about her recent Cell Genomics paper, Multi-layered genetic approaches to identify approved drug targets. Previous studies have found that drugs targeting a gene linked to a disease are more likely to be approved. Yet there are many ways to define what it means for a gene to be linked to a disease. Perhaps the most straightforward approach is to rely on genome-wide association study (GWAS) data, but that data can also be integrated with quanti...
Dec 21, 2023•52 min•Ep. 70
Today on the podcast we have Tomasz Kociumaka and Dominik Kempa, the authors of the preprint Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space. The suffix array is one of the foundational data structures in bioinformatics, serving as an index that allows fast substring searches in a large text. However, in its raw form, the suffix array occupies space proportional to (and several times larger than) the original text. In their paper, Tomasz an...
Sep 29, 2023•57 min•Ep. 69
In this episode, David Dylus talks about Read2Tree, a tool that builds alignment matrices and phylogenetic trees from raw sequencing reads. By leveraging the database of orthologous genes called OMA, Read2Tree bypasses traditional, time-consuming steps such as genome assembly, annotation and all-versus-all sequence comparisons. Links: Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree (David Dylus, Adrian Altenhoff, Sina Majidian, Fritz J. Sedlazeck, Christophe ...
Aug 28, 2023•49 min•Ep. 68
This is the third and final episode in the AlphaFold series, originally recorded on February 23, 2022, with Amelie Stein, now an associate professor at the University of Copenhagen. In the episode, Amelie explains what 𝛥𝛥G is, how it informs us whether a particular protein mutation affects its stability, and how AlphaFold 2 helps in this analysis. A note from Amelie: Something that has happened in the meantime is the publication of methods that predict 𝛥𝛥G with ML methods, so much faster th...
Jul 29, 2023•35 min•Ep. 67
This is the second episode in the AlphaFold series, originally recorded on February 14, 2022, with Janani Durairaj, a postdoctoral researcher at the University of Basel. Janani talks about how she used shape-mers and topic modelling to discover classes of proteins assembled by AlphaFold 2 that were absent from the Protein Data Bank (PDB). The bioinformatics discussion starts at 03:35. Links: A structural biology community assessment of AlphaFold2 applications (Mehmet Akdel, Douglas E. V. Pires,...
Jul 10, 2023•21 min•Ep. 66
In this episode, originally recorded on February 9, 2022, Roman talks to Pedro Beltrao about AlphaFold, the software developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. Pedro is an associate professor at ETH Zurich and the coordinator of the structural biology community assessment of AlphaFold2 applications project, which involved over 30 scientists from different institutions. Pedro talks about the origins of the project, its main findings, the importance ...
Jun 21, 2023•52 min•Ep. 65
In this episode, Jacob Schreiber interviews Žiga Avsec about a recently released model, Enformer. Their discussion begins with life differences between academia and industry, specifically about how research is conducted in the two settings. Then, they discuss the Enformer model, how it builds on previous work, and the potential that models like it have for genomics research in the future. Finally, they have a high-level discussion on the state of modern deep learning libraries and which ones th...
Nov 09, 2021•1 hr•Ep. 64
The Bioinformatics Contest is back this year, and we are back to discuss it! This year’s contest winners Maksym Kovalchuk (1st prize) and Matt Holt (2nd prize) talk about how they approach participating in the contest and what strategies have earned them the top scores. Timestamps and links for the individual problems: 00:10:36 Genotype Imputation 00:21:26 Causative Mutation 00:30:27 Superspreaders 00:37:22 Minor Haplotype 00:46:37 Isoform Matching Links: Matt’s solutions Max’s solutions If you ...
Sep 27, 2021•1 hr 1 min•Ep. 63
In this episode, Apostolos Chalkis presents sampling steady states of metabolic networks as an alternative to the widely used flux balance analysis (FBA). We also discuss dingo, a Python package written by Apostolos that employs geometric random walks to sample steady states. You can see dingo in action here. Links: Dingo on GitHub Searching for COVID-19 treatments using metabolic networks Tweag open source fellowships This episode was originally published on the Compositional podcast. If you ...
Jul 28, 2021•38 min•Ep. 62
In this episode, Jacob Schreiber interviews Da-Inn Erika Lee about data and computational methods for making sense of 3D genome structure. They begin their discussion by talking about 3D genome structure at a high level and the challenges in working with such data. Then, they discuss a method recently developed by Erika, named GRiNCH, that mines this data to identify spans of the genome that cluster together in 3D space and potentially help control gene regulation. Links: GRiNCH: simultaneous s...
Jun 23, 2021•1 hr 10 min•Ep. 61
In this episode, Michael Love joins us to talk about differential gene expression analysis of bulk RNA-Seq data. We talk about the history of Mike’s own differential expression package, DESeq2, as well as other packages in this space, like edgeR and limma, and the theory they are based upon. Mike also shares his experience of being the author and maintainer of a popular bioinformatics package. Links: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 (Love, ...
May 12, 2021•1 hr 31 min•Ep. 60
In this episode, Lindsay Pino discusses the challenges of making quantitative measurements in the field of proteomics. Specifically, she discusses the difficulties of comparing measurements across different samples, potentially acquired in different labs, as well as a method she has developed recently for calibrating these measurements without the need for expensive reagents. The discussion then turns more broadly to questions in genomics that can potentially be addressed using proteomic measure...
Apr 21, 2021•48 min•Ep. 59
In this episode, we learn about B cell maturation and class switching from Hamish King. Hamish recently published a paper on this subject in Science Immunology, where he and his coauthors analyzed gene expression and antibody repertoire data from human tonsils. In the episode Hamish talks about some of the interesting B cell states he uncovered and shares his thoughts on questions such as «When does a B cell decide to class-switch?» and «Why is the antibody isotype correlated with its affinity?...
Mar 31, 2021•1 hr 29 min•Ep. 58
In this episode, Jacob Schreiber interviews Molly Gasperini about enhancer elements. They begin their discussion by talking about Octant Bio, and then dive into the surprisingly difficult task of defining enhancers and determining the mechanisms that enable them to regulate gene expression. Links: Octant Bio Towards a comprehensive catalogue of validated and target-linked human enhancers (Molly Gasperini, Jacob M. Tome, and Jay Shendure) If you enjoyed this episode, please consider supporting th...
Mar 10, 2021•47 min•Ep. 57
Polygenic risk scores (PRS) rely on genome-wide association studies (GWAS) to predict the phenotype from the genotype. However, prediction accuracy suffers when GWAS from one population are used to calculate PRS within a different population, which is a problem because the majority of GWAS are done on cohorts of European ancestry. In this episode, Bárbara Bitarello helps us understand how PRS work and why they don’t transfer well across populations. Links: Polygenic Scores for He...
Feb 17, 2021•1 hr 30 min•Ep. 56
In this episode, we chat about phylogenetics with Xiang Ji. We start with a general introduction to the field and then go deeper into the likelihood-based methods (maximum likelihood and Bayesian inference). In particular, we talk about the different ways to calculate the likelihood gradient, including a linear-time exact gradient algorithm recently published by Xiang and his colleagues. Links: Gradients Do Grow on Trees: A Linear-Time O(N)-Dimensional Gradient for Statistical Phylogenetics (X...
Jan 13, 2021•57 min•Ep. 55
In this episode, Markus Schmidt explains how seeding in read alignment works. We define and compare k-mers, minimizers, MEMs, SMEMs, and maximal spanning seeds. Markus also presents his recent work on computing variable-sized seeds (MEMs, SMEMs, and maximal spanning seeds) from fixed-sized seeds (k-mers and minimizers) and his Modular Aligner. Links: A performant bridge between fixed-size and variable-size seeding (Arne Kutzner, Pok-Son Kim, Markus Schmidt) MA the Modular Aligner Calibrating Se...
Dec 16, 2020•1 hr 1 min•Ep. 54
In this episode, Jacob Schreiber interviews Devin Schweppe about the analysis of mass spectrometry data in the field of proteomics. They begin by delving into the different types of mass spectrometry methods, including MS1, MS2, and MS3, and the reasons for using each. They then discuss a recent paper from Devin, Full-Featured, Real-Time Database Searching Platform Enables Fast and Accurate Multiplexed Quantitative Proteomics, that involved building a real-time system for quantifying proteomic s...
Nov 18, 2020•1 hr 3 min•Ep. 53
In this episode, Will Freyman talks about identity-by-descent (IBD): how it’s used at 23andMe, and how the templated positional Burrows-Wheeler transform can find IBD segments in the presence of genotyping and phasing errors. Links: Fast and robust identity-by-descent inference with the templated positional Burrows-Wheeler transform (William A. Freyman, Kimberly F. McManus, Suyash S. Shringarpure, Ethan M. Jewett, Katarzyna Bryc, the 23andMe Research Team, Adam Auton) 23andMe research If you en...
Oct 27, 2020•43 min•Ep. 52
In this episode, Jacob Schreiber interviews David Kelley about machine learning models that can yield insight into the consequences of mutations on the genome. They begin their discussion by talking about Calico Labs, and then delve into a series of papers that David has written about using models, named Basset and Basenji, that connect genome sequence to functional activity and so can be used to quantify the effect of any mutation. Links: Calico Labs Basset: Learning the regulatory code of the ...
Oct 07, 2020•1 hr 14 min•Ep. 51
In this episode, Jacob Schreiber interviews Jill Moore about recent research from the ENCODE Project. They begin their discussion with an overview of the ENCODE Project and its goals, and then discuss a bundle of papers that were recently published in various Nature journals and the flagship paper, Expanded encyclopaedias of DNA elements in the human and mouse genomes. They conclude their discussion by talking about the challenges with managing a large project as a trainee in a consortium setting....
Sep 10, 2020•56 min•Ep. 50
In systems biology, Boolean networks are a way to model interactions such as gene regulation or cell signaling. The standard interpretations of Boolean networks are the synchronous, asynchronous, and fully asynchronous semantics. In this episode, Loïc Paulevé explains how the same Boolean networks can be interpreted in a new, “most permissive” way. Loïc proved mathematically that his semantics can reproduce all behaviors achievable by a compatible quantitative model, whereas the traditional inte...
Aug 19, 2020•1 hr 4 min•Ep. 49
In this episode, Jacob Schreiber interviews Marinka Zitnik about applications of machine learning to drug development. They begin their discussion with an overview of open research questions in the field, including limiting the search space of high-throughput testing methods, designing drugs entirely from scratch, predicting ways that existing drugs can be repurposed, and identifying likely side-effects of combining existing drugs in novel ways. Focusing on the last of these areas, they then dis...
Jul 29, 2020•1 hr 25 min•Ep. 48
NGLess is a programming language specifically targeted at next generation sequencing (NGS) data processing. In this episode we chat with its main developer, Luis Pedro Coelho, about the benefits of domain-specific languages, pros and cons of Haskell in bioinformatics, reproducibility, and of course NGLess itself. Links: NGLess on GitHub NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language (Luis Pedro Coelho, Renato Alves, Paulo Monteiro, Jaime Huerta-Cepas, ...
Jun 24, 2020•58 min•Ep. 47
In this episode, I continue to talk (but mostly listen) to Sergey Koren and Sergey Nurk. If you missed the previous episode, you should probably start there. Otherwise, join us to learn about HiFi reads, the tradeoff between read length and quality, and what tricks HiCanu employs to resolve highly similar repeats. Links: HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads (Sergey Nurk, Brian P. Walenz, Arang Rhie, Mitchell R. Voll...
May 27, 2020•1 hr 9 min•Ep. 46
In this episode, Sergey Nurk and Sergey Koren from the NIH share their thoughts on genome assembly. The two Sergeys tell the stories behind their amazing careers as well as behind some of the best known genome assemblers: Celera assembler, Canu, and SPAdes. Links: Canu on GitHub SPAdes on GitHub If you enjoyed this episode, please consider supporting the podcast on Patreon ....
May 20, 2020•1 hr 17 min•Ep. 45
Porcupine is a molecular tagging system: a way to tag physical objects with pieces of DNA called molecular bits, or molbits for short. These DNA tags can then be rapidly sequenced on an Oxford Nanopore MinION device without any need for library preparation. In this episode, Katie Doroschak explains how Porcupine works: how molbits are designed and prepared, and how they are directly recognized by the software without an intermediate basecalling step. Links: Porcupine: Rapid and robust tagging of ...
Apr 29, 2020•45 min•Ep. 44
Will Townes proposes a new, simpler way to analyze scRNA-seq data with unique molecular identifiers (UMIs). Observing that such data is not zero-inflated, Will has designed a PCA-like procedure inspired by generalized linear models (GLMs) that, unlike the standard PCA, takes into account statistical properties of the data and avoids spurious correlations (such as one or more of the top principal components being correlated with the number of non-zero gene counts). Also check out Will’s paper for...
Mar 27, 2020•1 hr•Ep. 43
In this episode, we hear from Amatur Rahman and Karel Břinda, who independently of one another released preprints on the same concept, called simplitigs or spectrum-preserving string sets. Simplitigs offer a way to efficiently store and query large sets of k-mers or, equivalently, large de Bruijn graphs. Links: Simplitigs as an efficient and scalable representation of de Bruijn graphs (Karel Břinda, Michael Baym, Gregory Kucherov) Representation of k-mer sets using spectrum-preserving string se...
Feb 28, 2020•53 min•Ep. 42
Kris Parag is here to teach us about the mathematical modeling of infectious disease epidemics. We discuss the SIR model, the renewal models, and how insights from information theory can help us predict where an epidemic is going. Links: Optimising Renewal Models for Real-Time Epidemic Prediction and Estimation (KV Parag, CA Donnelly) Adaptive Estimation for Epidemic Renewal and Phylogenetic Skyline Models (KV Parag, CA Donnelly) The listener survey If you enjoyed this episode, please consider s...
Jan 27, 2020•1 hr 8 min•Ep. 41