Ideas: The journey to DNA data storage

Nov 19, 2024 | 43 min

Episode description

Research manager Karin Strauss and members of the DNA Data Storage Project reflect on the path to developing a synthetic DNA–based system for archival data storage, including the recent open-source release of its most powerful algorithm for DNA error correction.

Get the Trellis BMA code: microsoft/TrellisBMA on GitHub (Trellis BMA: coded trace reconstruction on IDS channels for DNA storage)

Transcript

[MUSIC PLAYS UNDER DIALOGUE]

JAKE SMITH

This really starts from the  fundamental data production–data storage gap,   where we produce way more data nowadays than  we could ever have imagined years ago. And it's   more than we can practically store in magnetic  media. And so we really need a denser medium   on the other side to contain that. DNA is  extremely dense. It holds far, far more   information per unit volume, per unit mass than  any storage media that we have available today.  

This, along with the fact that DNA is itself a  relatively rugged molecule—it lives in our body;   it lives outside our body for thousands  and thousands of years if we, you know,   leave it alone to do its thing—makes  it a very attractive media.

BICHLIEN NGUYEN

It's such  a futuristic technology,   right? When you begin to work on the tech, you  realize how many disciplines and domains you   actually have to reach in and leverage. It's  really interesting, this multidisciplinarity,   because we're, in a way, bridging software  with wetware with hardware. And so you,   kind of, need all the different disciplines  to actually get you to where you need to go.

SERGEY YEKHANIN

We all work for Microsoft;  we are all Microsoft researchers. Microsoft   isn’t a startup. But that team, the team  that drove the DNA Data Storage Project,   it did feel like a startup, and it was  something unusual and exciting for me.

SERIES INTRO

You’re listening to Ideas, a  Microsoft Research Podcast that dives deep into   the world of technology research and the profound  questions behind the code. In this series, we’ll   explore the technologies that are shaping our  future and the big ideas that propel them forward.

[MUSIC FADES]

GUEST HOST KARIN STRAUSS

I'm your guest host Karin Strauss, a senior principal research manager at Microsoft. For nearly a decade, my colleagues and I—along with a fantastic and talented group of collaborators from academia and industry—have been working together to help close the data creation–data storage gap. We're producing far more digital information than we can possibly store. One solution we've explored uses synthetic DNA as a medium, and over the years, we've contributed to steady and promising progress in the area. We've helped push the boundaries of how much a DNA writer can simultaneously store, shown that full automation is possible, and helped create an ecosystem for the commercial success of DNA data storage. And just this week, we've made one of our most advanced tools for encoding and decoding data in DNA open source.

Joining me today to discuss the state of DNA data storage and some of our contributions are several members of the DNA Data Storage Project at Microsoft Research: Principal Researcher Bichlien Nguyen, Senior Researcher Jake Smith, and Partner Research Manager Sergey Yekhanin. Bichlien, Jake, and Sergey, welcome to the podcast.

BICHLIEN NGUYEN

Thanks for having us, Karin.

SERGEY YEKHANIN

Thank you so much.

JAKE SMITH

Yes, thank you.

STRAUSS

So before getting into the details of DNA  data storage and our work, I'd like to talk about   the big idea behind the work and how we all got  here. I've often described the DNA Data Storage   Project as turning science fiction into reality.  When we started the project in 2015, though, the   idea of using DNA for archival storage was already  out there and had been for over five decades.   Still, when I talked about the work in the area,  people were pretty skeptical in the beginning,  

and I heard things like, “Wow, why are you  thinking about that? It's so far off.” So, first,   please share a bit of your research backgrounds  and then how you came to work on this project.   Where did you first encounter this idea, what do  you remember about your initial impressions—or the   impressions of others—and what made you want  to get involved? Sergey, why don’t you start.

YEKHANIN

Thanks so much. So I’m a coding  theorist by training, so, like, my core areas   of research have been error-correcting codes  and also computational complexity theory. So   I joined the project probably, like, within half  a year of the time that it was born, and thanks,   Karin, for inviting me to join. So, like,  that was roughly the time when I moved from   a different lab, from the Silicon Valley lab  in California to the Redmond lab, and actually,  

it just so happened that at that moment, I  was thinking about what to do next. Like,   in California, I was mostly working on coding  for distributed storage, and when I joined here,   that effort kept going. But I had some free  cycles, and that was the moment when Karin   came just to my office and told me about the  project. So, indeed, initially, it did feel a  

lot like science fiction. Because, I mean, we  are used to coding for digital storage media,   like for magnetic storage media, and here, like,  this is biology, and, like, why exactly these   kind of molecules? There are so many different  molecules. Like, why that? But honestly, like,   I didn't try to pretend to be a biologist and make  conclusions about whether this is the right medium  

or the wrong medium. So I tried to look into these  kinds of questions from a technical standpoint,   and there was a lot of, kind of, deep, interesting  coding questions, and that was the main attraction   for me. At the same time, I wasn’t convinced  that we will get as far as we actually got,   and I wasn't immediately convinced about the  future of the field, but, kind of, just the depth   and the richness of the, what I’ll call, technical  problems, that's what made it appealing for me,  

and I, kind of, enthusiastically joined. And  also, I guess, the culture of the team. So, like,   it did feel like a startup. Like, we all work  for Microsoft; we’re all Microsoft researchers.   Microsoft isn’t a startup. But that team, the  team that drove the DNA Data Storage Project,   it did feel like a startup, and it was  something unusual and exciting for me.

NGUYEN

Oh, I love that, Sergey. So my background  is in organic chemistry, and Karin had reached out   to me, and I interviewed not knowing what Karin  wanted. Actually … so I took the job kind of   blind because I was like, “Hmm, Microsoft  Research? … DNA biotech? ...” I was very,   very curious, and then when she told me that this  project was about DNA data storage, I was like,  

this is a crazy, crazy idea. I definitely was  not sold on it, but I was like, well, look,   I get to meet and work with so many interesting  people from different backgrounds that, one,   even if it doesn't work out, I’m  going to learn something, and, two,   I think it could work, like it could work. And so  I think that's really what motivated me to join.

SMITH

The first thing that you think when you hear about we're going to take what is our hard drive and we're going to turn that into DNA is that this is nuts. But, you know, it didn't take very long after that. I come from a chemistry, biotech-type background where I've been working on designing drugs, and there, DNA is this thing off in the nethers, you know. You look at it every now and then to see what information it can tell you about, you know, what maybe your drug might be hitting on the target side, and it's, you know, that connection—that the DNA contains the information in the living systems, the DNA contains the information in our assays, and why could the DNA not contain the information that we, you know, think more about every day, that information that lives in our computers—is an extremely cool idea.

STRAUSS

Through our work, we've had years to  wrap our heads around DNA data storage. But,   Jake, could you tell us a little bit about   how DNA data storage works and why we're  interested in looking into the technology?

SMITH

So you mentioned it earlier, Karin,  that this really starts from the fundamental   data production–data storage gap, where we  produce way more data nowadays than we could   ever have imagined years ago. And it's more than  we can practically store in magnetic media. This   is a problem because, you know, we have data;  we have recognized the value of data with the  

rise of large language models and these other big  generative models. The data that we do produce,   our video has gone from, you know, substantially  small, down at 480 resolution, all the way up to   things at 8K resolution that now take orders of  magnitude more storage. And so we really need a   denser medium on the other side to contain that.  DNA is extremely dense. It holds far, far more   information per unit volume, per unit mass than  any storage media that we have available today.  

This, along with the fact that DNA is itself a  relatively rugged molecule—it lives in our body;   it lives outside our body for thousands  and thousands of years if we, you know,   leave it alone to do its thing—makes  it a very attractive media,   particularly compared to the traditional  magnetic media, which has lower density   and a much shorter lifetime on the,  you know, scale of decades at most.

So how does DNA data storage actually work? Well, at a very high level, we start out in the digital domain, where we have our information represented as ones and zeros, and we need to convert that into a series of A's, C's, T's, and G's that we could then actually produce, and this is really the domain of Sergey. He'll tell us much more about how this works later on. For now, let's just assume we've done this. And now our information, you know, lives in the DNA base domain. It's still in the digital world. It's just represented as A’s, C’s, T’s, and G’s, and we now need to make this physical so that we can store it. This is accomplished through large-scale DNA synthesis. Once the DNA has been synthesized with the sequences that we specified, we need to store it. There's a lot of ways we can think about storing it. Bichlien’s done great work looking at DNA encapsulation, as well as, you know, other more raw just DNA-on-glass-type techniques. And we've done some work looking at the susceptibility of DNA stored in this unencapsulated form to things like atmospheric humidity, to temperature changes and, most excitingly, to things like neutron radiation.

So we've stored our data in this physical form, we've archived it, and coming back to it, likely many years in the future because the properties of DNA match up very well with archival storage, we need to convert it back into the digital domain. And this is done through a technique called DNA sequencing. What this does is it puts the molecules through some sort of machine, and on the other side of the machine, we get out, you know, a noisy representation of what the actual sequence of bases in the molecules were. We have one final step. We need to take this series of noisy sequences and convert it back into ones and zeros. Once we do this, we return to our original data and we've completed, let's call it, one DNA data storage cycle.
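The first step of the cycle Smith describes, going from ones and zeros to A's, C's, G's, and T's and back, can be pictured with a minimal Python sketch. The two-bits-per-base table and the function names here are assumptions chosen for illustration only; the project's actual encoding adds redundancy for error correction and is not described in this episode.

```python
# Hypothetical 2-bits-per-base table, chosen only for illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_bases(data: bytes) -> str:
    """Turn digital data into a string of bases, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def bases_to_bytes(strand: str) -> bytes:
    """Invert bytes_to_bases, assuming the strand was read back without errors."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


encoded = bytes_to_bases(b"hello")   # 5 bytes become 20 bases
decoded = bases_to_bytes(encoded)    # round trip back to the original bytes
```

A direct mapping like this has no protection against the noisy reads mentioned above, which is why the decoding step in a real system depends on error-correcting codes.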

STRAUSS

We'll get into this in more detail  later, but maybe, Sergey, we dig a little bit   on encoding-decoding end of things and how DNA is  different as a medium from other types of media.

YEKHANIN

Sure. So, like, I mean, coding is an important aspect of this whole idea of DNA data storage because we have to deal with errors—it’s a new medium—but talking about error-correcting codes in the context of DNA data storage, so, I mean, usually, like … what are error-correcting codes about? Like, on the very high level, right, I mean, you have some data—think of it as a binary string—you want to store it, but there are errors. So usually, like, in most, kind of, forms of media, the errors are bit flips. Like, you store a 0; you get a 1. Or you store a 1; you get a 0. So these are called substitution errors. The field of error-correcting codes, it started, like, in the 1950s, so, like, it’s 70 years old at least. So we, kind of, we understand how to deal with this kind of error reasonably well, so with substitution errors.

In DNA data storage, the way you store your data is that given, like, some large amount of digital data, you have the freedom of choosing which short DNA molecules to generate. So in a DNA molecule, it’s a sequence of the bases A, G, C, and T, and you have the freedom to decide, like, which of the short molecules you need to generate, and then those molecules get stored, and then during the storage, some of them are lost; some of them can be damaged. There can be insertions and deletions of bases on every molecule. Like, we call them strands. So you need redundancy, and there are two forms of redundancy. There's redundancy that goes across strands, and there is redundancy on the strand. And so, yeah, so, kind of, from the error-correcting side of things, like, we get to decide what kind of redundancy we want to introduce—across strands, on the strand—and then, like, we want to make sure that our encoding and decoding algorithms are efficient. So that's the coding theory angle on the field.

NGUYEN

Yeah, and then, you know, from there, once you have that data encoded into DNA, the question is how do you make that data on a scale that's compatible with digital data storage? And so that's where a lot of the work came in for really automating the synthesis process and also the reading process, as well. So synthesis is what we consider the writing process of DNA data storage. And so, you know, we came up with some unique ideas there. We made a chip that enabled us to get to the densities that we needed. And then on the reading side, we used different sequencing technologies. And it was great to see that we could actually just, kind of, pull sequencing technologies off the shelf because people are so interested in reading biological DNA. So we explored the Illumina technologies and also Oxford Nanopore, which is a new technology coming on the horizon. And then preservation, too, because we have to make sure that the data that’s stored in the DNA doesn't get damaged and that we can recover it using the error-correcting codes.

STRAUSS

Yeah, absolutely. And it's clear  that—and it's also been our experience that—DNA   data storage and projects like this require more  than just a team of computer scientists. Bichlien,   you’ve had the opportunity to collaborate with  many people in all different disciplines. So   do you want to talk a little bit about  that? What kind of expertise, you know,   other disciplines that are relevant to  bringing DNA data storage to reality?

NGUYEN

Yeah, well, it's such a futuristic  technology, right? When you begin to work   on the tech, you realize how many disciplines  and domains you actually have to reach in and   leverage. One concrete example is that in order  to fabricate an electronic chip to synthesize DNA,   we really had to pull in a lot of material science  research because there's different capabilities   that are needed when trying to use liquid on a  chip. We, you know, have to think about DNA data  

storage itself. And that's a very different beast  than, you know, the traditional storage mediums.   And so we worked with teams who literally create,  you know, these little tiny micro- or nanocapsules   in glass and being able to store that there. It's  really interesting, this multidisciplinarity,   because we're, in a way, bridging software  with wetware with hardware. And so you,   kind of, need all the different disciplines  to actually get you to where you need to go.

STRAUSS

Yeah, absolutely. And, you know,  building on, you know, collaborators,   I think one area that was super interesting,  as well, and was pretty early on in the project   was building that first end-to-end system that  we collaborated with University of Washington,  

the Molecular Information Systems Lab there,  to build. And really, at that point, you know,   there had been work suggesting that DNA data  storage was viable, but nobody had really shown   an end-to-end system, from beginning to end, and  in fact, my manager at the time, Doug Carmean,   used to call it the “bubble gum and shoestring”  system. But it was a crucial first step because  

it shows it was possible to really fully  automate the process. And there have been   several interesting challenges there in the  system, but we noticed that one particularly   challenging one was synthesis. That first system  that we built was capable of storing the word   “hello,” and that was all we could store. So  it wasn't a very high-capacity system. But in   order to be able to store a lot more volumes of  data instead of a simple word, we really needed  

much more advanced synthesis systems. And this is  what both Bichlien and Jake ended up working on,   so do you want to talk a little bit about that  and the importance of that particular work?

SMITH

Yeah, absolutely. As you said, Karin,  the amount of DNA that is required to store   the massive amount of data we spoke  about earlier is far beyond the amount   of DNA that's needed for any, air quotes,  traditional applications of synthetic DNA,   whether it's your gene construction or it's your  primer synthesis or such. And so we really had   to rethink how you make DNA at scale and  think about how could this actually scale  

to meet the demand. And so Bichlien started out  looking at a thing called a microelectrode array,   where you have this big checkerboard of small  individual reaction sites, and in each reaction   site, we used electrochemistry in order to  control base by base—A, C, T, or G by A, C,   T, or G—the sequence that was growing at that  particular reaction site. We got this down to   the nanoscale. And so what this means practically  is that on one of these chips, we could synthesize  

at any given time on the order of hundreds of  millions of individual strands. So once we had the   synthesis working with the traditional chemistry  where you're doing chemical synthesis—each base   is added in using a mixture of chemicals that are  added to the individual spots—they're activated.   But each coupling happens due to some energy you  prestored in the synthesis of your reagents. And  

this makes the synthesis of those reagents costly  and themselves a bottleneck. And so taking, you   know, a look forward at what else was happening  in the synthetic biology world, the, you know,   next big word in DNA synthesis was and still is  enzymatic synthesis, where rather than having to,   you know, spend a lot of energy to chemically  pre-activate reagents that will go in to make   your actual DNA strands, we capitalize on  nature's synthetic robots—enzymes—to start  

with less-activated, less-expensive-to-get-to,  cheaply-produced-through-natural-processes   substrates, and we use the enzymes themselves,  toggling their activity over each of the   individual chips, or each of the individual  spots on our checkerboard, to construct DNA   strands. And so we got a little bit into this  project. You know, we successfully showed that  

we could put down selectively one base at a  given time. We hope that others will, kind of,   take up the work that we've put out there, you  know, particularly our wonderful collaborators   at Ansa who helped us design the enzymatic  system. And one day we will see, you know,   a truly parallelized, in this fashion, enzymatic  DNA system that can achieve the scales necessary.

NGUYEN

It's interesting to note that even  though it's DNA and we're still storing data   in these DNA strands, chemical synthesis and  enzymatic synthesis provide different errors   that you see in the actual files, right, in  the DNA files. And so I know that we talked   to Sergey about how do we deal with these new  types of errors and also the new capabilities   that you can have, for example, if you don't  control base by base the DNA synthesis.

YEKHANIN

This whole field of DNA data storage,  like, the technologies on the biology side are   advancing rapidly, right. And there are different  approaches to synthesis. There are different   approaches to sequencing. And, presumably,  the way the storage is actually done, like,   is also progressing, right, and we had works on  that. So there is, kind of, this very general,   kind of, high-level error profile that you can  say that these are the type of errors that you  

encounter in DNA data storage. Like, in DNA molecules—just the sequence of these bases, A, G, C, T, in maybe a length of, like, 200 or so and you store a very, very large number of them—the errors that you see is that some of these strands, kind of, will disappear. Some of these strands can be torn apart like, let’s say, in two pieces, maybe even more. And then on every strand, you also encounter these errors—insertions, deletions,

substitutions—with different rates. Like, the  likelihood of all kinds of these errors may differ   very significantly across different technologies  that you use on the biology side. And also there   can be error bursts somehow. Maybe you can get  an insertion of, I don’t know, 10 A’s, like, in a  

row, or you can lose, like, you know, 10 bases in  a row. So if you don't, kind of, quantify, like,   what are the likelihoods of all these bad events  happening, then I think this still, kind of,   fits at least the majority of approaches to DNA  data storage, maybe not exactly all of them,   but it fits the majority. So when we design  coding schemes, we are trying also, kind of,   to look ahead in the sense that, like,  we don't know, like, in five years, like,  

how will these error profiles, how will it look  like. So the technologies that we develop on the   error-correction side, we try to keep them very  flexible, so whether it's enzymatic synthesis,   whether it's Nanopore technology, whether it’s  Illumina technology that is being used, the   error-correction algorithms would be able to adapt  and would still be useful. But, I mean, this makes   also coding aspect harder because, [LAUGHTER] kind  of, you want to keep all this flexibility in mind.
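The general error profile Yekhanin outlines (strands that disappear, plus insertions, deletions, and substitutions at technology-dependent rates) can be mimicked with a toy simulator. This is a rough Python sketch under stated assumptions: the function names, the independent per-base error model, and the example rates are all hypothetical, and burst errors and broken strands are left out.

```python
import random

BASES = "ACGT"


def ids_channel(strand, p_ins, p_del, p_sub, rng):
    """Pass one strand through a toy insertion/deletion/substitution channel."""
    out = []
    for base in strand:
        if rng.random() < p_ins:        # a spurious base appears before this one
            out.append(rng.choice(BASES))
        r = rng.random()
        if r < p_del:                   # this base is lost
            continue
        if r < p_del + p_sub:           # this base is read as a different one
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def sequence_pool(strands, reads_per_strand, p_loss, rng, **rates):
    """Return noisy reads; a strand disappears entirely with probability p_loss."""
    reads = []
    for strand in strands:
        if rng.random() < p_loss:
            continue
        reads.extend(ids_channel(strand, rng=rng, **rates)
                     for _ in range(reads_per_strand))
    return reads


rng = random.Random(0)
# Made-up rates; real values differ widely across synthesis and sequencing technologies.
reads = sequence_pool(["ACGTACGTAC"], 5, p_loss=0.0, rng=rng,
                      p_ins=0.02, p_del=0.05, p_sub=0.03)
```

Keeping the rates as parameters mirrors the flexibility point in the discussion: a decoder tested against such a simulator can be checked under several error profiles instead of one.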

STRAUSS

So, Sergey, we are  at an interesting moment now   because you’re open sourcing the  Trellis BMA piece of code, right,   that you published a few years ago. Can  you talk a little bit about that specific   problem of trace reconstruction and then  the paper specifically and how it solves it?

YEKHANIN

Absolutely, yeah, so this Trellis BMA  paper for that we are releasing the source code   right now, this is, kind of, this is the latest in  our sequence of publications on error-correction   for DNA data storage. And I should say that, like,  we already discussed that the project is, kind of,   very interdisciplinary. So, like, we have experts  from all kinds of fields. But really even within,   like, within this coding theory, like,  within computer science/information theory,  

coding theory, in our algorithms, we use ideas  from very different branches. I mean, there are   some core ideas from, like, core algorithm space,  and I won’t go into these, but let me just focus,  

kind of, on two aspects. So when we just faced  this problem of coding for DNA data storage and we   were thinking about, OK, so how to exactly design  the coding scheme and what are the algorithms   that we’ll be using for error correction, so,  I mean, we’re always studying the literature,   and we came up on this problem called trace  reconstruction that was pretty popular—I mean,   somewhat popular, I would say—in computer science  and in statistics. It didn’t have much motivation,  

but very strong mathematicians had been looking  at it. And the problem is as follows. So, like,   there is a long binary string picked at random,  and then it’s transmitted over a deletion channel,   so some bits—some zeros and some ones—at certain  coordinates get deleted and you get to see, kind  

of, the shortened version of the string. But you  get to see it multiple times. And the question is,   like, how many times do you need to see it so that  you can get a reasonably accurate estimate of the  

original string that was transmitted? So that was  called trace reconstruction, and we took a lot of   motivation—we took a lot of inspiration—from the  problem, I would say, because really, in DNA data   storage, if we think about a single strand, like,  a single strand which is being stored, after we   read it, we usually get multiple reads of this  string. And, well, the errors there are not just  

deletions. There are insertions, substitutions,  and, like, inversive errors, but still we could   rely on this literature in computer science that  already had some ideas. So there was an algorithm   called BMA, Bitwise Majority Alignment. We  extended it—we adopted it, kind of, for the needs   of DNA data storage—and it became, kind of, one  of the tools in our toolbox for error correction.
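To make the trace reconstruction idea concrete, here is a minimal Python sketch of Bitwise Majority Alignment in its simplest, deletion-only form. It is not the extended version the team built for DNA data storage: it assumes the original length is known, ignores insertions and substitutions, and the function name is an illustrative choice.

```python
from collections import Counter


def bitwise_majority_alignment(traces, length):
    """Estimate the original string from several deletion-corrupted copies.

    Each trace has a pointer. At every output position the pointed-to symbols
    vote; traces that agree with the majority advance, while the others are
    assumed to have suffered a deletion here and stay where they are.
    """
    pointers = [0] * len(traces)
    estimate = []
    for _ in range(length):
        votes = Counter(t[p] for t, p in zip(traces, pointers) if p < len(t))
        if not votes:                   # every trace is exhausted
            break
        symbol = votes.most_common(1)[0][0]
        estimate.append(symbol)
        for i, trace in enumerate(traces):
            if pointers[i] < len(trace) and trace[pointers[i]] == symbol:
                pointers[i] += 1
    return "".join(estimate)


# Original "10110" seen three times: one deletion, a different deletion, no deletion.
recovered = bitwise_majority_alignment(["1110", "1011", "10110"], length=5)
```

The same voting loop works on strings over A, C, G, and T, though with higher error rates or long runs of one symbol the pointers can drift out of alignment.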

So we also started to use ideas from  literature on electrical engineering,   what are called convolutional error-correcting  codes and a certain, kind of, class of algorithms   for decoding errors in these convolutional  error-correcting codes called, like, I mean,   Trellis is the main data structure, like,  Trellis-based algorithms for decoding   convolutional codes, like, Viterbi algorithm or  BCJR algorithm. Convolutional codes allow you to  

introduce redundancy on the string. So, like, with  algorithms kind of similar to BMA, like, they were   good for doing error correction when there was no  redundancy on the strand itself. Like, when there  

is redundancy on the strand, kind of, we could do  some things, but really it was very limited. With   Trellis-based approaches, like, again inspired  by the literature in electrical engineering,   we had an approach to introduce redundancy on the  strand, so that allowed us to have more powerful   error-correction algorithms. And then in the end,  we have this algorithm, which we call Trellis BMA,  

which, kind of, combines ideas from both  fields. So it's based on Trellis, but it's   also more efficient than standard Trellis-based  algorithms because it uses ideas from BMA from   computer science literature. So this is, kind of,  this is a mix of these two approaches. And, yeah,   that’s the paper that we wrote about three years  ago. And now we're open sourcing it. So it is the  

most powerful algorithm for DNA error correction  that we developed in the group. We’re really happy   that now we are making it publicly available  so that anybody can experiment with the source   code. Because, again, the field has expanded a  lot, and now there are multiple groups around   the globe that work just specifically on error  correction apart from all other aspects, so, yeah,   so we are really happy that it’s become publicly  available to hopefully further advance the field.
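For readers unfamiliar with the trellis-based decoding mentioned here, the following Python sketch shows the textbook case: a rate-1/2 convolutional code with generators (7, 5) in octal, decoded with a hard-decision Viterbi algorithm. It handles substitution errors on a single read only. It is not Trellis BMA, which works on insertion, deletion, and substitution channels with multiple reads, and the code parameters and function names are assumptions for illustration.

```python
def step(state, bit):
    """One encoder transition: returns (next_state, two output bits)."""
    s1, s0 = state                      # s1 is the most recent past input bit
    out = (bit ^ s1 ^ s0, bit ^ s0)     # generators 111 and 101
    return (bit, s1), out


def encode(bits):
    """Add redundancy along the sequence: two coded bits per input bit."""
    state, coded = (0, 0), []
    for bit in bits:
        state, out = step(state, bit)
        coded.extend(out)
    return coded


def viterbi_decode(coded):
    """Find the input sequence whose encoding is closest to the received bits."""
    paths = {(0, 0): (0, [])}           # state -> (mismatches so far, decoded bits)
    for i in range(0, len(coded), 2):
        received = tuple(coded[i:i + 2])
        new_paths = {}
        for state, (cost, bits) in paths.items():
            for bit in (0, 1):
                nxt, out = step(state, bit)
                new_cost = cost + sum(a != b for a, b in zip(out, received))
                if nxt not in new_paths or new_cost < new_paths[nxt][0]:
                    new_paths[nxt] = (new_cost, bits + [bit])
        paths = new_paths               # keep only the best path into each state
    return min(paths.values(), key=lambda path: path[0])[1]


message = [1, 0, 1, 1, 0, 0, 1, 0]
received = encode(message)
received[3] ^= 1                        # one substitution error in the stored bits
decoded_message = viterbi_decode(received)
```

The trellis here is the set of states tracked at each step. Keeping only the best path into each state is what makes the search efficient.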

STRAUSS

Yeah, absolutely, and I'm always amazed by, you know, how, it is really about building on other people's work. Jake and Bichlien, you recently published a paper in Nature Communications. Can you tell us a little bit about what it was, what you exposed the DNA to, and what it was specifically about?

NGUYEN

Yeah. So that paper was on the effects of neutron radiation on DNA data storage. So, you know, when we started the DNA Data Storage Project, it was really a comparison, right, between the different storage medias that exist today. And one of the issues that have come up through the years of development of those technologies was, you know, hard errors and soft errors that were induced by radiation. So we wanted to know, does that maybe happen in DNA? We know that DNA, in humans at least, is affected by radiation from cosmic rays. And so that was really the motivation for this type of experiment. So what we did was we essentially took our DNA files and dried them and threw them in a neutron accelerator, which was fantastic. It was so exciting. That's, kind of, the merge of, you know, sci fi with sci fi at the same time. [LAUGHS] It was fantastic. And we irradiated for over 80 million years—

STRAUSS

The equivalent of …

NGUYEN

The equivalent of 80 million years.

STRAUSS

Yes, because it's a lot of radiation all at the same time, …

NGUYEN

It’s a lot of radiation …

STRAUSS

… and it's  accelerated radiation exposure?

NGUYEN

Yeah, I would say it's accelerated  aging with radiation. It's an insane amount   of radiation. And it was surprising that  even though we irradiated our DNA files   with that much radiation, there wasn't that much  damage. And that's surprising because, you know,   we know that humans, if we were to be irradiated  like that, it would be disastrous. But in,   you know, DNA, our files were able  to be recovered with zero bit errors.

STRAUSS

And why that difference?

NGUYEN

Well, we think there's a few reasons.  One is that when you look at the interaction   between a neutron and the actual elemental  composition of DNA—which is basically carbons,   oxygens, and hydrogens, maybe a phosphorus—the  neutrons don't interact with the DNA much.   And if it did interact, we would  have, for example, a strand break,  

which based on the error-correcting codes,  we can recover from. So essentially,   there's not much … one, there's not much  interaction between neutrons and DNA,   and second, we have error-correcting  codes that would prevent any data loss.

STRAUSS

Awesome, so yeah, this is another milestone that contributes towards the technology becoming a reality. There are also other conditions that are needed for technology to be brought to the market. And one thing I've worked on is to, you know, create the DNA Data Storage Alliance; this is something Microsoft co-founded with Illumina, Twist Bioscience, and Western Digital. And the goal there was to essentially provide the right conditions for the technology to thrive commercially. We did bring together multiple universities and companies that were interested in the technology. And one thing that we've seen with storage technologies that's been pretty important is standardization and making sure that the technology’s interoperable. And, you know, we've seen stalemate situations like Blu-ray and high-definition DVD, where, you know, really we couldn't decide on a standard, and the technology, it took a while for the technology to be picked up, and the intent of the DNA Data Storage [Alliance] is to provide an ecosystem of companies, universities, groups interested in making sure that this time, it's an interoperable technology from the get-go, and that increases the chances of commercial adoption.

As a group, we often talk about how amazing it is to work for a company that empowers us to do this kind of research. And for me, one of Microsoft Research’s unique strengths, particularly in this project, is the opportunity to work with such a diverse set of collaborators on such a multidisciplinary project like we have. How do you all think where you've done this work has impacted how you've gone about it and the contributions you’ve been able to make?

NGUYEN

I'm going to start with if we look around this table and we see who's sitting at it, which is two chemists, a computer architect, and a coding theorist, and we come together and we're like, what can we make that would be super, super impactful? I think that's the answer right there, is that being at Microsoft and being in a culture that really fosters this type of interdisciplinary collaboration is the key to getting a project like this off the ground.

SMITH

Yeah, absolutely. And we should acknowledge the gigantic contributions made by our collaborators at the University of Washington. Many of them would fall in not any of these three categories. They’re electrical engineers, they're mechanical engineers, they're pure biologists that we worked with. And each of them brought their own perspective, and particularly when you talk about going to a true end-to-end system, those perspectives were invaluable as we were trying to fit all the puzzle pieces together.

STRAUSS

Yeah, absolutely. We've had great collaborations over time—University of Washington, ETH Zürich, Los Alamos National Lab, ChipIr, Twist Bioscience, Ansa Biotechnologies. Yeah, it’s been really great and a great set of different disciplines, all the way from coding theorists to the molecular biology and chemistry, electrical and mechanical engineering.

One of the great things about research is there's never a shortage of interesting questions to pursue, and for us, this particular work has opened the door to research in adjacent domains, including sustainability fields. DNA data storage requires small amounts of materials to accommodate the large amounts of data, and early on, we wanted to understand if DNA data storage was, as it seemed, a more sustainable way to store information. And we learned a lot. Bichlien and Jake, you had experience in green chemistry when you came to Microsoft. What new findings did we make, and what sustainability benefits do we get with DNA data storage? And, finally, what new sustainability work has the project led to?

NGUYEN

As a part of this project, if we're going to bring new technologies to the forefront, you know, to the world, we should make sure that they have a lower carbon footprint, for example, than previous technologies. And so we ran a life cycle assessment—which is a way to systematically evaluate the environmental impacts of anything of interest—and we did this on DNA data storage and compared it to electronic storage medium, and we noticed that if we were able to store all of our digital information in DNA, that we would have benefits associated with carbon emissions. We would be able to reduce that because we don't need as much infrastructure compared to the traditional storage methods. And there would be an energy reduction, as well, because this is a passive way of archival data storage. So that was, you know, the main takeaways that we had. But that also, kind of, led us to think about other technologies that would be beneficial beyond data storage and how we could use the same kind of life cycle thinking towards that.

SMITH

This design approach that you've, you know, talked about us stumbling on, not inventing but seeing other people doing in the literature and trying to implement ourselves on the DNA Data Storage Project, you know, is something that can be much bigger than any single material. And where we think there's a, you know, chance for folks like ourselves at Microsoft Research to make a real impact on this sustainability-focused design is through the application of machine learning, artificial intelligence—the new tools that will allow us to look at much bigger design spaces than we could previously to evaluate sustainability metrics that were not possible when everything was done manually and to ultimately, you know, at the end of the day, take a sustainability-first look at what a material should be composed of.

And so we've tried to prototype this with a few projects. We had another wonderful collaboration with the University of Washington where we looked at recyclable circuit boards and a novel material called a vitrimer that it could possibly be made out of. We've had another great collaboration with the University of Michigan, where we've looked at the design of charge-carrying molecules in these things called flow batteries that have good potential for energy smoothing in, you know, renewables production, trying to get us out of that day-night, boom-bust cycle. And we had one more project, you know, this time with collaborators at the University of Berkeley, where we looked at, you know, design of a class of materials called a metal organic framework, which have great promise in low-energy-cost gas separation, such as pulling CO2 out of the, you know, plume of a smokestack or, you know, ideally out of the air itself.

STRAUSS

For me, the DNA work has made me much more open to projects outside my own research area—as Bichlien mentioned, my core research area is computer architecture, but we've ventured in quite a bit of other areas here—and going way beyond my own comfort zone and really made me love interdisciplinary projects like this and try, really try, to do the most important work I can. And this is what attracted me to these other areas of environmental sustainability that Bichlien and Jake covered, where there's absolutely no lack of problems. Like them, I'm super interested in using AI to solve many of them. So how do each of you think working on the DNA Data Storage Project has influenced your research approach more generally and how you think about research questions to pursue next?

YEKHANIN

It definitely expanded the horizons a lot, like, just, kind of, just having this interactions with people, kind of, whose core areas of research are so different from my own and also a lot of learning even within my own field that we had to do to, kind of, carry this project out. So, I mean, it was a great and rewarding experience.

NGUYEN

Yeah, for me, it's kind of the opposite of Karin, right. I started as an organic chemist and then now really, one, appreciate the breadth and depth of going from a concept to a real end-to-end prototype and all the requirements that you need to get there. And then also, really the importance of having, you know, a background in computer science and really being able to understand the lingo that is used in multidisciplinary projects because you might say something and someone else interprets it very differently, and it's because you're not speaking the same language. And so that understanding that you have to really be … you have to learn a little bit of vocabulary from each person and understand how they contribute and then how your ideas can contribute to their ideas has been really impactful in my career here.

SMITH

Yeah, I think the key change in approach that I took away—and I think many of us took away from the DNA Data Storage Project—was rather than starting with an academic question, we started with a vision of what we wanted to happen, and then we derived the research questions from analyzing what would need to happen in the world—what are the bottlenecks that need to be solved in order for us to achieve, you know, that goal? And this is something that we've taken with us into the sustainability-focused research and, you know, something that I think will affect all the research I do going forward.

STRAUSS

Awesome. As we close, let's reflect a bit on what a world in which DNA data storage is widely used might look like. If everything goes as planned, what do you hope the lasting impact of this work will be? Sergey, why don’t you lead us off.

YEKHANIN

Sure, I remember that, like, when … in the early days when I started working on this project actually, you, Karin, told me that you were taking an Uber ride somewhere and you were talking to the taxi driver, and the taxi driver—I don't know if you remember that—but the taxi driver mentioned that he has a camera which is recording everything that's happening in the car. And then you had a discussion with him about, like, how long does he keep the data, how long does he keep the videos. And he told you that he keeps it for about a couple of days because it's too expensive. But otherwise, like, if it weren't that expensive, he would keep it for much, much longer because, like, he wants to have these recordings if later somebody is upset about the ride and, I don’t know, he is getting sued or something. So this is, like, this is one small narrow application area where DNA data storage would clearly, kind of, if it happens, then it will solve it. Because then, kind of, this long-term archival storage will become very cheap, available to everybody; it would become a commodity basically.

There are many things that will be enabled, like this helping the Uber drivers, for instance. But also one has to think of, of course, like, about, kind of, the broader implications so that we don't get into something negative because again this power of recording everything and storing everything, it can also lead to some use cases that might be, kind of, morally wrong. So, again, hopefully by the time that we get to, like, really wide deployments of this technology, the regulation will also be catching up and the, like, we will have great use cases and we won’t have bad ones. I mean, that's how I think of it. But definitely there are lots of, kind of, great scenarios that this can enable.

SMITH

Yeah. I'll grab onto the word you use there, which is making DNA a commodity. And one of the things that I hope comes out of this project, you know, besides all the great benefits of DNA data storage itself is spillover benefits into the field of health—where if we make DNA synthesis at large scale truly a commodity thing, which I hope some of the work that we've done to really accelerate the throughput of synthesis will do—then this will open new doors in what we can do in terms of gene synthesis, in terms of, like, fundamental biotech research that will lead to that next set of drugs and, you know, give us medications or treatments that we could not have thought possible if we were not able to synthesize DNA and related molecules at that scale.

NGUYEN

So much information gets lost because of just time. And so I think being able to recover really ancient history that humans wrote in the future, I think, is something that I really hope could be achieved because we're so information rich, but in the course of time, we become information poor, and so I would like for our future generations to be able to understand the life of, you know, an everyday 21st-century person.

STRAUSS

Well, Bichlien, Jake, Sergey, it's been fun having this conversation with you today and collaborating with you in all of this amazing project [MUSIC] and all the research we've done together. Thank you so much.

YEKHANIN

Thank you, Karin.

SMITH

Thank you.

NGUYEN

Thanks.

[MUSIC FADES]

Transcript source: Provided by creator in RSS feed: download file