Algorithms for Next-Generation Sequencing (Chapman & Hall/CRC Computational Biology Series)

Speaker 1

00:00

Right now, sitting inside almost every single cell of your body is a three-billion-letter instruction manual.

Speaker 2

00:07

Which is just, I mean, it's a staggering scale to even try to picture.

Speaker 1

00:12

Yeah, think about that scale for a second. If a doctor wants to find out why you're sick or you know, why a medication isn't working, they essentially have to find a single microscopic typo in a book that is three million pages long.

Speaker 2

00:26

Right, and they need to find it fast. I mean, thirty years ago, doing that was a biological impossibility. It took over a decade and literally billions of dollars just to do it once.

Speaker 1

00:36

Wow.

Speaker 2

00:36

But today we expect those answers in a matter of days. It's moved from being this purely biological challenge to what is essentially a computational miracle.

Speaker 1

00:46

Okay, let's unpack this, because if you've ever wondered how a simple cheek swab or, like, a vial of blood drawn at a clinic actually turns into a highly personalized medical profile, well, this is exactly the breakdown you need.

Speaker 2

00:57

Yeah. It's a fascinating journey.

Speaker 1

00:58

It really is. We're not just talking about the biology today. We are taking a deep dive into the journey from the wet, messy chemistry of a human cell to the digital data on a computer screen. And more importantly, we're looking at the mind bending mathematical tricks that allow a standard, cheap laptop to search your entire genetic code without instantly catching fire.

Speaker 2

01:22

It really is a collision of two completely different worlds. I mean, you have to physically extract the data from the molecule first, and only then can the algorithms do their heavy lifting.

Speaker 1

01:32

Right, so let's start with that physical extraction. We've got this invisible DNA in a tube. How did we go from painstakingly reading one genetic sentence at a time to basically scanning the entire three-million-page library in an afternoon? I mean, it didn't happen overnight.

Speaker 2

01:46

No, not at all. It's a story of, well, constant, aggressive problem solving. It started back in nineteen seventy seven with what we now call first generation sequencing, or Sanger sequencing. The foundational idea was brilliant, honestly. They used a natural enzyme to copy a strand of DNA, but they spiked the chemical soup with these modified nucleotides. You know, the A, C, G, and T building blocks.

Speaker 1

02:07

Right. The sources mentioned these modified blocks have fluorescent glowing tags on them. They act like molecular stop signs.

Speaker 2

02:14

Yeah, that's exactly it. That's the key. Imagine you're copying a sentence, but every time you write the letter A, your pen freezes.

Speaker 1

02:21

Oh, weird, okay.

Speaker 2

02:22

Right, so you'd end up with a fragment ending in A. By running this process over and over, you end up with a massive mixture of DNA fragments of all different lengths. You sort them by size using electrical charge, a technique called electrophoresis, and then a camera reads the glowing colors at the end of each fragment one by one to spell out the sequence.

Speaker 1

02:42

Which sounds incredibly accurate, but, I mean, practically agonizing.
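The chain-termination idea described above can be mimicked in a few lines of Python. This is only a toy sketch under loud assumptions: the function name `sanger_read` is hypothetical, every prefix of the template is assumed to appear exactly once as a terminated fragment, and real chemistry details (complementary strands, signal noise, the roughly eight-hundred-letter ceiling) are ignored.

```python
import random

def sanger_read(template: str) -> str:
    """Toy model of chain termination: not real chemistry, just the logic."""
    # Assume a copy can be halted at every position, giving one fragment
    # per possible length. Each fragment ends in a tagged "stop sign" letter.
    fragments = [template[:i] for i in range(1, len(template) + 1)]

    # The tube holds the fragments in no particular order.
    random.Random(0).shuffle(fragments)

    # Electrophoresis stand-in: sort the fragments from shortest to longest.
    fragments.sort(key=len)

    # The "camera" only sees the glowing final letter of each fragment.
    return "".join(fragment[-1] for fragment in fragments)

print(sanger_read("GATTACA"))
```

Reading the last letter of each fragment in order of length spells the template back out, which is why sorting by size is the key step.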

Speaker 2

02:47

Oh it's painfully slow. Yeah.

Speaker 1

02:49

The sources say this method maxes out at reading fragments about eight hundred letters long. If I'm trying to read a three billion letter genome, that makes me think of like a medieval monk painstakingly copying an encyclopedia by hand, letter by single letter.

Speaker 2

03:05

It's a great analogy.

Speaker 1

03:06

It works, but you aren't mass producing anything that way.

Speaker 2

03:09

No, it was a massive bottleneck, and that specific speed limit is what triggered a complete rethinking of the process. Companies like Illumina came along and essentially disrupted the biological space like a Silicon Valley tech company. They introduced second generation sequencing.

Speaker 1

03:25

The massively parallel approach. So if Sanger was the medieval monk, Illumina is like taking that three-million-page book, tossing it into a wood chipper, and turning it into millions of tiny pieces of confetti. And then you read every single shred of confetti at the exact same moment and basically force a computer to paste the book back together.

Speaker 2

03:44

That is essentially what they do. Yeah, but to read millions of tiny shreds at once, the signal has to be loud enough for a camera sensor to actually pick it up. A single DNA molecule is just too faint. So they use techniques like emulsion PCR and bridge PCR.

Speaker 1

03:59

Hold on, I see bridge PCR mentioned everywhere in the sources, but what does that actually mean in this context?

Speaker 2

04:06

Think of it as microscopic photocopying. They wash the DNA fragments over a tiny glass slide. The fragments attach to the slide, and enzymes duplicate them right there in place.

Speaker 1

04:15

Okay, right there on the glass.

Speaker 2

04:17

Yeah, they bend over, forming a bridge and copy themselves again and again. Suddenly, instead of one faint DNA molecule, you have a dense little cluster of thousands of identical clones standing up like a tiny forest on the glass.

Speaker 1

04:30

Oh wow.

Speaker 2

04:31

So when you attach a glowing chemical tag to them, that entire cluster flashes brightly enough for a digital camera to photograph.

Speaker 1

04:37

That is wild. So you take a picture, wash the chemicals away, add the next letter, and take another picture. Just millions of clusters flashing in sequence. But you know, reading the sources, there's another second-gen variation that completely blew my mind: Ion Torrent. They don't use lasers, they don't use cameras.

Speaker 2

04:55

Because they aren't looking at light at all. They are literally measuring the acidity of the chemical soup.

Speaker 1

05:00

Hold on, how do you read a genetic code by checking the pH level?

Speaker 2

05:03

It comes down to basic chemistry, really. Every single time a new nucleotide successfully attaches to a growing DNA strand, the chemical bond naturally releases a single positively charged hydrogen ion. Ion Torrent machines use a semiconductor chip layered with millions of microscopic wells. It's basically a massive grid of tiny pH meters.

05:27

It detects that microscopic drop in pH when the hydrogen ion pops off. So it's translating a biological event directly into a digital electronic signal.

Speaker 1

05:34

You're just listening for the electrical pop of a hydrogen atom. Unbelievable.

Speaker 2

05:37

It is pretty incredible.

Speaker 1

05:38

But even with that speed, second generation sequencing still relies on tearing the DNA into tiny confetti, which brings us to the third generation technologies like PacBio and Oxford Nanopore. This reads like pure science fiction. They don't chop it up, they don't pause to take pictures, they just read it continuously.

Speaker 2

05:58

Yeah, it's called single-molecule real-time sequencing. With nanopore, imagine a microscopic hole, a literal pore punctured through a synthetic membrane. They apply a steady electrical current across that membrane. Then they physically pull a single long strand of DNA through that hole.

Speaker 1

06:15

Like threading a needle.

Speaker 2

06:17

Exactly like that. And because the molecular shapes of an A, C, G, and T are slightly different, they each block the hole in a uniquely different way as they pass through. That physically alters the electrical current. The machine reads these specific changes in the current to spell out the letters as the strand zips through.

Speaker 1

06:35

Which means you can read massive uninterrupted stretches. The sources say up to twenty thousand letters in a single read. It's like feeding the entire intact book through a high speed ticker tape scanner.

Speaker 2

06:45

Yep, it's a huge leap in read length.

Speaker 1

06:47

But I've got to pause you here. I'm looking at the data from the sources. If this third generation tech is so revolutionary and reads so fast, why are we still using the second generation confetti method at all?

Speaker 2

07:00

Right. Well, what's fascinating here is a very stubborn, hidden trade-off between length and accuracy. When you are violently pulling a molecule through a microscopic hole at high speed, the sensor occasionally blinks. It might miss a letter entirely, or accidentally read the same letter twice. These are called insertion and deletion errors. Third generation tools historically sit at an error rate of about seventeen point eight to seventeen point nine percent.

Speaker 1

07:26

Almost an eighteen percent error rate in a medical context? I mean, if I'm looking for a single cancer-causing mutation, an eighteen percent failure rate sounds absolutely terrifying.

Speaker 2

07:35

It does sound alarming, for sure, but scientists realized something brilliant about those errors. They're completely random. The nanopore doesn't systematically struggle with the letter C, for example. It's just random static.

Speaker 1

07:47

Okay, So how do you fix random static?

Speaker 2

07:49

The workaround is actually quite elegant. You just sequence the exact same strand of DNA twenty or thirty times.

Speaker 1

07:55

Oh, I see, because the odds of the machine making the exact same random mistake on the exact same letter twenty times in a row are basically zero.

Speaker 2

08:03

Precisely, you layer the thirty reads on top of each other, the random glitches mathematically cancel out, and the true underlying sequence emerges clearly.
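The "layer the reads and let the glitches cancel out" idea can be sketched as a per-position majority vote. This is a simplified illustration, not how a production pipeline works: it assumes the reads are already lined up and the same length, and it only simulates substitution errors, whereas the insertion and deletion errors mentioned above would need an alignment step first. The names `noisy_read` and `consensus` are made up for this sketch.

```python
import random
from collections import Counter

def noisy_read(truth, error_rate, rng):
    """Copy `truth`, swapping each base for a wrong one with probability `error_rate`."""
    read = []
    for base in truth:
        if rng.random() < error_rate:
            read.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            read.append(base)
    return "".join(read)

def consensus(reads):
    """Majority vote per column; assumes reads are aligned and equal length."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )

rng = random.Random(0)  # fixed seed so the run is repeatable
truth = "GATTACAGATTACAGATTACA"
reads = [noisy_read(truth, 0.18, rng) for _ in range(30)]

print("first read matches truth:", reads[0] == truth)
print("consensus matches truth: ", consensus(reads) == truth)
```

Because each error is independent, a wrong letter at a given position has to win a vote against roughly four out of five correct copies, which becomes vanishingly unlikely as the number of reads grows.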

Speaker 1

08:12

Okay, so that leads us directly to the next massive problem. If third generation sequencing has a nearly eighteen percent raw error rate, just dumping all that text into a computer file is completely useless. The computer needs to know which letters are biological facts and which letters are just machine hallucinations. So how do we tag the trustworthy data?

Speaker 2

08:32

That's where specialized file formats come in. The most basic format used to be called FASTA. It was just a plain text file, literally just a string of A's, C's, G's, and T's. But as you pointed out, FASTA isn't enough anymore. We needed a way to track the confidence of every single letter.

Speaker 1

08:47

Enter the FASTQ format, where the Q literally stands for quality.

Speaker 2

08:52

Exactly. FASTQ attaches a crucial piece of metadata called the Phred quality score, or Q score. The sequencing machine actually grades its own homework. For every single letter it outputs, it calculates a mathematical probability that it made a mistake.

Speaker 1

09:07

I found the engineering behind this fascinating. A Q score is a number, right? Say, a score of thirty means a ninety-nine point nine percent accuracy rate.

Speaker 2

09:15

Right, it's a logarithmic scale.
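The logarithmic scale mentioned here follows the standard Phred definition, Q = -10 * log10(p), where p is the probability that the letter is wrong. A minimal sketch of the conversion in both directions (the function names are just illustrative):

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred score for a given error probability (p must be > 0)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:.4%}, accuracy {1 - p:.4%}")
```

Every ten points on the scale is a tenfold drop in the error probability, so Q30 corresponds to one expected error in a thousand letters, the ninety-nine point nine percent figure quoted above.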

Speaker 1

09:16

But if you have to store a two digit number next to every single letter of a three billion letter genome, you instantly double or triple your file size. Our hard drives would fill up immediately. So instead, the algorithms take that Q score number, add exactly thirty three to it, and map it to a keyboard symbol.

Speaker 2

09:34

It's an incredibly clever compression hack.

Speaker 1

09:36

But wait, why add exactly thirty three? Why not just use the number itself?

Speaker 2

09:40

Well, it's because of how computers read text using the ASCII standard. The first thirty two characters in a computer's language aren't printable. They are invisible commands like escape or return.

Speaker 1

09:50

Oh right, okay.

Speaker 2

09:51

By mathematically adding thirty three to the Q score, you jump past those invisible commands and land perfectly on standard printable characters. So instead of storing the number thirty, the computer stores a single question mark. You fit complex probability data into a single byte of memory.
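The add-thirty-three trick is easy to try out. The sketch below encodes a list of Q scores into one printable character each and decodes them back. The helper names and the example record are invented for illustration; only the offset of 33 comes from the discussion above.

```python
OFFSET = 33  # skips the non-printable ASCII control codes and the space

def encode_quality(scores):
    """Turn a list of Q scores into one printable character per score."""
    return "".join(chr(q + OFFSET) for q in scores)

def decode_quality(symbols):
    """Recover the Q scores from a FASTQ-style quality string."""
    return [ord(ch) - OFFSET for ch in symbols]

# A made-up four-line record: name, bases, separator, qualities.
bases = "GATTACA"
scores = [30, 30, 2, 40, 40, 12, 30]
record = "\n".join(["@example_read", bases, "+", encode_quality(scores)])

print(record)
print(decode_quality(record.splitlines()[3]))
```

Here a score of 30 becomes `?` (character code 63), a score of 2 becomes `#`, and a score of 40 becomes `I`, so the quality line is exactly as long as the sequence line.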

Speaker 1

10:09

That is brilliant. And the stakes here are real, because if a doctor is looking at your file and the sequence shows a genetic marker for a severe disease, they need to know if that marker has a high Q score or if it's just a low-quality machine glitch.

Speaker 2

10:23

Exactly. If we connect this to the bigger picture, we aren't just trusting one read. We look at the read depth and the genotype quality. If fifty reads show the mutation and have high Q scores, the algorithm confidently calls it a true variant.

Speaker 1

10:36

It ignores the low-quality glitches.
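A rough sketch of that idea: at one position of the genome, collect the letter and Q score reported by every overlapping read, throw away the low-quality letters, and only make a call if enough trustworthy reads agree. The thresholds (`min_quality`, `min_depth`, `min_fraction`) and the function name are assumptions chosen for illustration; real variant callers use statistical genotype models rather than fixed cutoffs.

```python
from collections import Counter

def call_base(pileup, min_quality=20, min_depth=10, min_fraction=0.8):
    """Return the agreed base at one position, or None if there is no confident call.

    `pileup` is a list of (base, q_score) pairs, one per overlapping read.
    All three thresholds are illustrative assumptions.
    """
    # Ignore letters the machine itself flagged as unreliable.
    trusted = [base for base, q in pileup if q >= min_quality]
    if len(trusted) < min_depth:
        return None  # not enough read depth to say anything
    base, count = Counter(trusted).most_common(1)[0]
    return base if count / len(trusted) >= min_fraction else None

# Fifty high-quality reads say G; ten low-quality reads say A.
pileup = [("G", 35)] * 50 + [("A", 5)] * 10
print(call_base(pileup))
```

In this toy run the ten low-quality A's are discarded before the vote, so the call rests only on the fifty confident G's.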

Speaker 2

10:38

Yes, and once we trust the letters, we have to figure out what they mean. You take those millions of verified FASTQ shreds and you align them against a standard reference human genome. It's like checking your puzzle pieces against the picture on the front of the box. Once they are aligned, they are saved as a BAM file, which is a highly compressed binary format.

Speaker 1

10:57

But humans are fundamentally ninety nine point nine percent identical. If you sequence my DNA, almost all of it is exactly the same as the reference map. It seems wildly inefficient to store three billion letters just to say yep, still human.

Speaker 2

11:11

Which is why the final piece of this file pipeline is the VCF or variant call format. We don't store your whole genome. The VCF file only stores your mutations, the differences. It's essentially a list of typos. It says, at chromosome four position one million, there should be an A, but in this patient it's a G.
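The "list of typos" idea can be illustrated by storing only the positions where a sample differs from the reference. This sketch is not the actual VCF file format, which has a header and additional columns. It assumes the two sequences are already aligned and the same length, and it only reports single-letter substitutions. The function name and the short sequences are invented for the example.

```python
def list_variants(chrom, reference, sample, start=1):
    """Return (chromosome, position, reference letter, sample letter) for each mismatch.

    Assumes `reference` and `sample` are aligned and equal in length;
    `start` is the genome position of the first letter.
    """
    return [
        (chrom, start + offset, ref, alt)
        for offset, (ref, alt) in enumerate(zip(reference, sample))
        if ref != alt
    ]

# A made-up six-letter window ending just past position one million.
reference = "CCTAGG"
sample    = "CCTGGG"
for chrom, pos, ref, alt in list_variants("4", reference, sample, start=999_997):
    print(f"chromosome {chrom}, position {pos}: expected {ref}, found {alt}")
```

Six letters of sequence collapse into a single record, which is the same saving, at a much larger scale, that makes storing only the differences worthwhile.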

Speaker 1

11:30

Okay, let's step back, because I'm looking at the sheer math of this alignment process. We sort of glossed over how we actually match the puzzle pieces. If I have a hundred-letter fragment, and I have three billion possible places to stick it on the reference genome, wouldn't a standard computer search algorithm just freeze? I mean, how do they avoid a total system crash?

Speaker 2

11:48

This is where we get into the real heavy algorithmic lifting. The first major hurdle is that genetic mutations mean you almost never have an exact match. You might have a missing letter or an extra one, so you can't just hit Ctrl+F and search for the exact string. You have to use something called dynamic programming to calculate the edit distance.

Speaker 1

12:07

I read about this. It's about finding the minimum number of operations, insertions, deletions, or substitutions, to change one string of text into another. The source gave a great simple example, changing the word ants to bent. You substitute the A for an E, insert a B at the front, and delete the S at the end. That takes three steps. But scaling that up to thousands of letters creates an astronomical number of possible operations.

Speaker 2

12:34

Right. If you try to calculate every single possible combination from scratch using standard recursion, which essentially means the computer solves the problem by breaking it into smaller pieces and solving every single piece over and over, the computing time grows exponentially. The universe would literally end before your laptop finished.

Speaker 1

12:52

Okay, so if recursion crashes the computer, how does dynamic programming solve it?

Speaker 2

12:56

By using memory to save time. It builds what's called a dependency graph, or a table. Think of it like getting driving directions. If you want to calculate the absolute fastest route from New York to Seattle, and part of your route goes through Chicago, you calculate the Chicago-to-Seattle leg once. You write that answer down on a sticky note.

13:18

If you were testing a million different routes out of New York and a bunch of them eventually passed through Chicago, you don't mathematically recalculate the western half of the United States every single time. You just look at your sticky note.

Speaker 1

13:31

Right, you've already done that math.

Speaker 2

13:34

Exactly. Dynamic programming does this for DNA. It solves the tiny sub-problems of the text, saves the answers in a massive table, and just looks them up. It drops the computing time from trillions of years down to minutes.
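Here is a minimal sketch of the table-filling approach for edit distance, using the ants-to-bent example from earlier. Each cell of the table is one "sticky note": the cheapest way to turn the first i letters of one word into the first j letters of the other. This is the textbook formulation, written out for illustration rather than taken from any particular alignment tool.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] = edit distance between a[:i] and b[:j]
    table = [[0] * cols for _ in range(rows)]

    for i in range(rows):
        table[i][0] = i  # delete all i letters of a
    for j in range(cols):
        table[0][j] = j  # insert all j letters of b

    for i in range(1, rows):
        for j in range(1, cols):
            substitution_cost = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,                      # delete from a
                table[i][j - 1] + 1,                      # insert into a
                table[i - 1][j - 1] + substitution_cost,  # match or substitute
            )
    return table[-1][-1]

print(edit_distance("ants", "bent"))  # 3
```

Each cell is computed once from three neighbors that are already filled in, so the work is proportional to the product of the two lengths instead of growing exponentially.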

Speaker 1

13:46

It caches the answers. That makes total sense. But even with the sticky notes, searching every edge of a three-billion-letter genome for millions of tiny confetti fragments is still too slow, which brings us to a concept called a Bloom filter. And I've got to admit, this is where the computer science gets really counterintuitive for me.

Speaker 2

14:04

It is a bit mind bending at first.

Speaker 1

14:06

It's a space-efficient probabilistic data structure. Basically, it asks a massive database, does this sequence exist in here, without actually looking through the data.

Speaker 2

14:17

Yeah. It uses mathematical hash functions and a simple bit array, just a microscopic sequence of ones and zeros. When you insert a genetic sequence into the system, it runs it through a math formula that flips specific zeros to ones. When you want to search for a sequence later, you run it through the same formula. If all the corresponding bits are ones, it tells you the item is probably there.

14:38

But if even a single bit is a zero, it guarantees with absolute mathematical certainty that the item is not there.
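A bare-bones sketch of that structure follows. The class name, the array size, the number of hash functions, and the choice of SHA-256 with a seed prefix are all assumptions made for this illustration; practical implementations pack the bits tightly and use faster hashes.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: may say "probably yes" wrongly, never says "no" wrongly."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        # One byte per bit for readability; a real filter would pack these.
        self.bits = bytearray(size_bits)

    def _positions(self, item: str):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for position in self._positions(item):
            self.bits[position] = 1

    def might_contain(self, item: str) -> bool:
        # A single zero bit proves the item was never added.
        return all(self.bits[position] for position in self._positions(item))

# Index every 4-letter window of a made-up reference sequence.
reference = "GATTACAGATTACA"
bloom = BloomFilter()
for i in range(len(reference) - 3):
    bloom.add(reference[i:i + 4])

print(bloom.might_contain("TTAC"))  # True: it was inserted
print(bloom.might_contain("CCCC"))  # almost certainly False; True would be a false positive
```

Adding an item only ever turns bits on, which is why a lookup for something that was inserted can never come back negative.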

Speaker 1

14:45

I was trying to picture this, and it makes me think of a very strict bouncer at a crowded VIP club. The bouncer uses a series of quick, weird rules to check people at the door. Are you wearing red shoes? Do you have a ticket?

Speaker 2

14:57

That's a good way to look at it.

Speaker 1

15:00

Right. Now, the bouncer might mistakenly let a random person in who isn't on the list. That's a false positive. But the bouncer will absolutely never, ever turn away someone who is actually on the list. There are zero false negatives. But let me challenge this directly. Why would computer scientists intentionally design an algorithm that we know for a fact gives false positives? Isn't accuracy the entire point of medical science?

Speaker 2

15:24

This raises an important question about computational trade-offs. It's all about conserving memory and speed. A Bloom filter takes up an unbelievably small amount of memory. By intentionally allowing a tiny, predictable margin of error, say a one or two percent false positive rate, we can achieve near-instantaneous search speeds.

Speaker 1

15:43

Because you aren't using the Bloom filter for the final answer. You use it to instantly discard the ninety-nine percent of the genome where the sequence definitely doesn't belong.

Speaker 2

15:51

Precisely. You use the cheap, fast algorithm to clear away the junk, and then you only perform the slow, rigorous dynamic programming check on the few positive hits. You save your heavy computational artillery for the targets that actually matter.

Speaker 1

16:04

Okay, so Bloom filters tell us if a sequence exists somewhere. But to find exactly where it lives in the genome, we need an index, like the index at the back of a textbook telling you which page a word is on. But when I was looking at the source text, standard computer indexes for something this large are impossibly bloated. A standard index, a suffix tree, for the human genome takes up about forty gigabytes of active memory.

Speaker 2

16:27

Which is a fatal bottleneck. You can't load forty gigabytes of data into the RAM of a standard computer. It means the computer would have to constantly read back and forth from the hard drive, which slows everything to an absolute crawl.

Speaker 1

16:39

And this is where we get to the absolute crown jewel of this whole deep dive. Researchers Ferragina and Manzini created the FM index, and they did it using a mathematical trick called the Burrows-Wheeler transform, or BWT. But honestly, reading the mechanics of this transform broke my brain a little bit. How does BWT actually work?

Speaker 2

16:57

It is notoriously difficult to visualize, but incredibly elegant once you get it. The BWT is a permutation. It reorganizes the text. Imagine taking a sequence of letters, rotating the whole sequence by one letter, writing that down, rotating it again, and listing out all the possible rotations. Then you sort those rows alphabetically.
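The rotate, list, and sort recipe translates almost word for word into code. The sketch below is the naive construction, which is fine for short strings but far too slow and memory-hungry for a genome. It assumes a `$` end marker that sorts before every letter, a common convention that is not mentioned in the conversation itself.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    text += terminator  # unique end marker so the transform can be reversed
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rotations.sort()
    return "".join(rotation[-1] for rotation in rotations)

for rotation in sorted("ACAACG$"[i:] + "ACAACG$"[:i] for i in range(7)):
    print(rotation)

print(bwt("ACAACG"))  # GC$AAAC
print(bwt("banana"))  # annb$aa
```

In both outputs identical letters end up next to each other in the last column (`AAA` and `nn`, `aa`), which is the grouping effect that makes the result easy to compress.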

Speaker 1

17:19

Okay, I'm with you, but why do that? What does alphabetically sorting a bunch of rotated gibberish actually achieve?

Speaker 2

17:25

Because of the underlying structure of human language and DNA, when you sort those rotations alphabetically, a mathematical magic trick happens in the final column of that list. Identical characters suddenly group together. So instead of a random string like ACGTAC, the final column will spit out long runs of the same letter, like AACCGT.

Speaker 1

17:44

Oh, and because they are grouped together, you can compress them.

Speaker 2

17:47

Exactly. It's called run-length encoding.

Speaker 1

17:49

Wait, let me make sure I'm picturing this right. Instead of the computer wasting memory writing out AAAAA, run-length encoding just writes five A.

Speaker 2

17:57

Yes, that single trick allows the FM index to shrink the forty-gigabyte index down to less than two gigabytes.
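Run-length encoding itself fits in a few lines. The sketch below stores each run as a (letter, count) pair and shows that the original string can be rebuilt exactly. The function names are illustrative, and the compression used inside real FM index implementations is more involved than this.

```python
from itertools import groupby

def run_length_encode(text: str):
    """Collapse each run of identical letters into a (letter, count) pair."""
    return [(letter, len(list(run))) for letter, run in groupby(text)]

def run_length_decode(runs) -> str:
    """Rebuild the original string from (letter, count) pairs."""
    return "".join(letter * count for letter, count in runs)

encoded = run_length_encode("AAAAACCG")
print(encoded)                     # [('A', 5), ('C', 2), ('G', 1)]
print(run_length_decode(encoded))  # AAAAACCG
```

The saving depends entirely on how long the runs are, which is why the grouping produced by the sort step matters: a string with no repeats would not shrink at all.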

Speaker 1

18:02

Here's where it gets really interesting. Suddenly the entire searchable map of the human genome fits comfortably into the active memory of a cheap laptop you could buy at a big box store.

Speaker 2

18:12

Yeah, that is just staggering. But the compression isn't even the craziest part. The sources mention it allows for backward search, which sounds impossible. How do you search compressed data without uncompressing it first? This is the true genius of the BWT. Because of how the matrix is mathematically structured, you can jump between the columns to trace a sequence backwards, letter by letter, without ever unpacking the file.

18:38

And here is the kicker. The time it takes to search for a pattern is proportional only to the length of your query string. It completely ignores the massive size of the actual genome.
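A small sketch of backward search makes the "needle, not haystack" claim concrete. It counts how many times a pattern occurs using only two lookup tables built from the sorted rotations, processing the pattern from its last letter to its first. This is a teaching version with loud simplifications: the tables are stored uncompressed and built naively, the helper names are invented, and a real FM index samples and compresses these tables to reach the small memory footprint described above.

```python
from collections import Counter

def build_fm_index(text: str):
    """Build the two lookup tables used by backward search (uncompressed)."""
    text += "$"  # assumed end marker, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    last_column = "".join(rotation[-1] for rotation in rotations)

    # first[c] = row where rotations starting with c begin in the sorted list
    counts = Counter(text)
    first, total = {}, 0
    for letter in sorted(counts):
        first[letter] = total
        total += counts[letter]

    # occ[c][i] = how many times c appears in last_column[:i]
    occ = {letter: [0] for letter in counts}
    for symbol in last_column:
        for letter in occ:
            occ[letter].append(occ[letter][-1] + (symbol == letter))
    return first, occ, len(text)

def count_occurrences(pattern: str, first, occ, n: int) -> int:
    """Count matches of `pattern`, one table lookup pair per pattern letter."""
    low, high = 0, n  # half-open range of sorted rotations still matching
    for letter in reversed(pattern):
        if letter not in first:
            return 0
        low = first[letter] + occ[letter][low]
        high = first[letter] + occ[letter][high]
        if low >= high:
            return 0
    return high - low

first, occ, n = build_fm_index("ACAACG")
print(count_occurrences("AC", first, occ, n))   # 2
print(count_occurrences("CAA", first, occ, n))  # 1
print(count_occurrences("GG", first, occ, n))   # 0
```

The loop in `count_occurrences` runs once per letter of the pattern and does a fixed amount of work each time, so the query cost does not grow with the length of the indexed text. Building the tables is where the size of the genome is paid for, and that happens only once.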

Speaker 1

18:47

Hold on, you're saying that if I want to search for a fifty-letter sequence, it takes the exact same amount of time whether I am searching the tiny genome of a fruit fly or the three-billion-letter human genome?

Speaker 2

18:57

Exactly. The size of the haystack no longer matters. The time it takes only depends on the size of the needle. It completely democratized genomic research overnight.

Speaker 1

19:05

Okay, so we've gone from wet chemistry to massive raw data, to error-correcting files, to mind-blowing compression algorithms. So what does this all mean? What does all this computational heavy lifting actually do for the person listening right now? If you are a patient in a hospital, what are the stakes?

Speaker 2

19:24

The stakes are your life. Before these algorithms, the genome was a black box. Today, because we can search it so quickly, we've discovered incredible things. We found out that humans only have about twenty thousand protein-coding genes. That's a mere three percent of our total DNA.

Speaker 1

19:38

Really? Just three percent actually builds the proteins. The rest is essentially regulatory instructions.

Speaker 2

19:44

Yes, and because we can quickly map a patient's DNA against the reference, we can find the exact microscopic typos causing their illness. The sources highlight a perfect example, the BCR-ABL1 fusion gene.

Speaker 1

19:55

Right, that's a structural variation. It's when a piece of chromosome nine accidentally breaks off and attaches to chromosome twenty two.

Speaker 2

20:01

Right, and that specific structural typo is present in ninety five percent of patients with chronic myelogenous leukemia. Before this technology, we just knew a patient had cancer, and we threw toxic chemotherapy at them, hoping it worked. Today, we sequence the genome, find the exact broken gear in the cellular machinery, and use highly targeted drugs designed specifically to block that mutated protein.

Speaker 1

20:25

And it isn't just about static DNA either, right? The sources talk about RNA-seq and ChIP-seq.

Speaker 2

20:30

Yes. If DNA is the architectural blueprint of a house, RNA-seq is watching the construction workers actually build it. It tells us the transcriptome, which specific genes are actively turned on or off in a cell at any given second. And ChIP-seq maps out the specific proteins, the transcription factors, that are flipping those switches.

Speaker 1

20:48

It's watching the engine run in real time, which really means the era of one size fits all medicine is dying. If you get sick, doctors won't be guessing your treatment based on population averages anymore. They are going to use these algorithms to read your specific genetic typos and prescribe personalized, stratified medicine designed exactly for your unique biology.

Speaker 2

21:06

It is a complete paradigm shift. We've moved from observing symptoms to observing the fundamental code of life in real time.

Speaker 1

21:13

And the technology is accelerating. Our source text mentions that Oxford Nanopore, the company pulling DNA through microscopic holes, has created a device called the MinION. It's a disposable DNA sequencer, the exact size and shape of a standard USB flash drive. You just plug it right into a laptop.

Speaker 2

21:31

Just think about the implications of that for a moment.

Speaker 1

21:34

It completely flips the power dynamic. Building on everything we've talked about today, the error correction, the dynamic programming, the heavily compressed FM index, we are rapidly approaching a world where you could sequence your own DNA at home. If the code that dictates whether your cells live, die, or mutate can be read as easily as scanning a grocery store barcode in your living room, how is that going to change our relationship with our own biology?

21:59

What does it mean for our privacy or our insurance if anyone can plug a thumb drive in and read our biological destiny?

Speaker 2

22:04

It's a profound frontier. The diagnostic muddy waters are finally clearing, but what we find underneath is going to challenge us as a society in entirely new ways.

Speaker 1

22:13

It really is something for you to ponder long after this deep dive ends, because the ultimate instruction manual is no longer hidden. Thank you so much for joining us as we unpacked the invisible architecture of your own biology. Keep questioning, keep learning, and join us next time as we continue to explore the absolute edges of human knowledge.

Transcript source: Provided by creator in RSS feed