Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. And go to dataengineeringpodcast.com
to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. I've got a couple of announcements before we start the show. There's still time to register for the O'Reilly Strata Conference in San Jose, California
happening from March 5th to 8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% off your tickets. The O'Reilly AI conference is also coming up, happening April 29th to 30th in New York. It will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% off the tickets.
Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1st through 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective. To save 60% off your tickets, go to dataengineeringpodcast.com/od-east-2018 and register.
Your host is Tobias Macey. And today, I'm interviewing Danielle Robinson and Joe Hand about the DAT project, the distributed data sharing protocol for building applications of the future. So Danielle, could you start by introducing yourself? Sure. My name is Danielle Robinson, and I'm the co-executive director of Code for Science and Society, which is the nonprofit that supports the DAT project. I've been working on DAT-related projects, first as a partnerships director, for about a year now,
and I'm here with my colleague, Joe Hand. Take it away, Joe. Yeah. I'm Joe Hand, and I'm the other co-executive director and the director of operations at Code for Science and Society. And I've been a core DAT contributor for about 2 years now. And, Danielle, starting with you again, can you talk about how you first got involved and interested in the area of data management? Sure. So, I have a PhD in neuroscience. I finished that about a year and a half ago.
And what I did during my PhD, my research was focused on cell biology, really, without getting into the weeds too much on that. A lot of time at microscopes, collecting kind of medium-sized imaging data. And during that process, I became pretty frustrated with the academic and publishing systems that seem to be limiting
people's access to the results of taxpayer-funded research. So publications are behind paywalls, and data is either not published along with the paper or sometimes is published but not well archived, and becomes inaccessible over time. So, sort of compounding this, traditionally code has not really been thought of as a scholarly work. And that's a whole other conversation.
But even though these things are changing, data and code aren't shared consistently and are pretty inconsistently managed within labs. I think that's fair to say. And what that does is it makes it really hard to reproduce or replicate other people's research, which is important for the scientific process.
So during my PhD, I got really active in the OpenCon and Mozilla Science communities, which I encourage your listeners to check out. These communities build interdisciplinary connections between the open source world and open education, open access, and open data communities.
And that's really important in order to, like, build things that people will actually use and make big cultural and policy changes that will make it easier to access research and share data. So it's sort of I got involved
partly because of the technical challenge, but also I'm interested in the people problems. So the changes to the incentive structure and the culture of research that are needed to make data management better on a day-to-day basis and make our research infrastructure stronger and more long lasting. And, Joe, how did you get involved in data management?
Yeah. I've sort of gone back and forth between the more academic or research data management and the more traditional software side. So I really got involved in data management when I was at a data visualization agency, and we basically built, you know, pretty web-based interactive visualizations for a variety of clients. This was cool because it sort of allowed me to see, like, a large variety of data management techniques.
So there was, like, the small scale, manually updating data in spreadsheets and then sending that off to visualize, up to, like, big Fortune 500 companies that had data warehouses and full internal APIs that we got access to. So it was really cool to see that sort of variety of data collection and data usage between all those organizations. So that was also good because it sort of helped me understand
how to use data effectively, and that really means, like, telling a story around it. So, you know, in order to sort of use data, you have to either use some math or some visual representation, and the best stories around data combine sort of a bit of both of those. And then from there, I moved to a research institute, and we were tasked with building a data platform for an international NGO. And that group basically does census data collection in slums all over the world.
So as a research group, we were interested in using that data for research, but we also had to help them figure out how to collect that data. So before we came in with that project, they had basically been doing 30 years of data collection on paper and then sometimes manually entering that data into spreadsheets
and then trying to sort of share that around through thumb drives or Dropbox or sort of whatever tools they had access to. So this was cool because it really gave me a great opportunity to see the other side of data management and analysis. So, you know, we worked with the corporate clients, which sort of have lots of resources, computer resources and cloud servers. And this was sort of the other side, where there's very few resources.
Most of the data analysis happens offline, and a lot of the data transfer happens offline. So it's really cool and interesting to see that a lot of the tools I'd been taking for granted sort of couldn't be applied in those areas. And then on the research side of things, I saw that, you know, scientists and governments were just sort of haphazardly organizing data in the same way.
So I was sort of trying to collect and download census data from about 30 countries, and we had to email, write, fax people. We got different CDs and paper documents and PDFs in other languages. So that really illustrated that there's, like, a lot of data managed out there in a way that I wasn't totally familiar with. And it's just very crazy how everybody manages their data in a different way. That's sort of what I like to call the long tail of data management.
People managing data in that way probably wouldn't call it data, but it's just sort of what they use to get their job done. And so once I started to sort of look at alternatives to managing that research data, I found DAT, basically, and was hooked and started to contribute. So that's sort of how I found DAT. So that leads us nicely into talking about what the DAT project is
and as much of the origin story as each of you might be aware of. And, Joe, you already mentioned how you got involved in the DAT project. But, Danielle, if you could also share your involvement or how you got started with it as well. Yeah. I can tell the origin story. So the DAT project is an open source community building a protocol for peer-to-peer data sharing.
And as a protocol, it's similar to HTTP and how that protocol is used today, but DAT adds extra security and automatic versioning and allows users to connect to a decentralized network. In a decentralized network, you can store the data anywhere, either in a cloud or on a local computer, and it does work offline.
And so DAT is built to make it easy for developers to build decentralized applications without worrying about moving data around. And the people who originally developed it, and that'll be Mathias and Max and Karissa, they were scratching their own itch for building software to share and archive public and research data. And this is how Joe got involved, like he was saying before.
And so it originally started as an open source project, and then DAT got a grant from the Knight Foundation in 2013 as a prototype grant focusing on government data. And then that was followed up in 2014 by a grant from the Alfred P. Sloan Foundation. And that grant focused more on scientific research and allowed the project to put a little more effort into working with researchers.
And since then, we've been working to solve research data management problems by developing software on top of the DAT protocol. And the most recent project is funded by the Gordon and Betty Moore Foundation, and that project started in 2016. It's called DAT in the Lab, and I can get you a link to it on our blog.
It supports us to work with California Digital Library and research groups in the University of California system to make it easier to move files around, version datasets, and support researchers through automating archiving. And so that's a really cool project because we get to work directly with researchers and do the kind of participatory software design that's very different from the research I did in my PhD. One of the labs we're working with studies sea star wasting disease.
So it's really fascinating stuff, and we get to work right with them to make things that are gonna fit into their workflows. So I started working with DAT in the summer right before that grant was funded. So I guess maybe 6 months before that grant was funded. And so I came on as a consultant initially to help write grants and start talking about how to work directly with researchers and what to build that would really help researchers
move their data around and version control it. So, yeah, that's how I became involved. And then in the fall, I transitioned to a partnerships position and then the ED position in the last month. And you mentioned that a lot of the sort of boost to the project has come in the form of grants from a few different foundations. So I'm wondering if you can talk a bit about how those different grants have influenced the focus and pace of the development that was possible for the project?
Yeah. I mean, DAT really occupies a unique position in the open source world with that grant funding. So, you know, for the first few years, it was closer to sort of a research project than a traditional product-focused startup. And other open source projects like that might be done part time as a side project or just sort of for fun. But the grant funding really allowed the original developers to sign on and work full time, really solving
harder problems than they might be able to otherwise. So since we sort of got those grants, we've been able to toe the line between a more user-facing product and some research software. And the grant really gave us the opportunity to toe that line, but also get in the field and connect with researchers and end users, so we can sort of innovate with technical solutions but really ground those in reality with specific scientific use cases.
So, you know, this balance is really only possible because of that grant funding, which sort of gives us more flexibility and might have a little longer timeline than VC money or just, like, an open source side project. But now we're really at a critical juncture, I'd say, where grant funding's not quite enough to cover what we want to do.
But we're lucky because the protocol is really getting into a more stable position, and we're starting to look at those user-facing products on top and starting to build those around the core protocol. And the fact that you have received so many different rounds of grant funding sort of lends credence to the fact that you're solving a critical problem that lots of people are coming up against.
And I'm wondering if there are any other projects or companies or organizations that are trying to tackle similar or related problems that you sort of view as collaborators or competitors in the space? Or do you think that the DAT project is fairly uniquely positioned to solve the specific problems that it's addressing? Yeah. I mean, I would say, you know, there are other similar use cases and tools. And, you know, a lot of that is around
sharing open datasets and sort of the publishing of data, which Danielle might be able to talk more about. But on the sort of technical side, I guess the biggest competitor or similar thing might be IPFS, which is another sort of decentralized protocol for sharing and storing data in different ways. But we're actually, you know, excited to work with these various companies. So, you know, IPFS is more of
a storage-focused format. So it basically allows content-based storage on a distributed network. And DAT is really more about sort of the transfer protocol and being very interoperable with all these other solutions. So, yeah, you know, that's what we're more excited about, is trying to understand how we can use DAT
in collaboration with all these other groups. Yeah. I think I'd just plus-one what Joe said. Through my time coming up in the OpenCon community and the Mozilla Science community, there are a lot of people trying to improve access to data broadly, and everyone in the space really takes a collaboration, not competition, sort of approach, because there are a lot of different ways to solve the problem depending on what the end user wants.
And there's a lot of great projects working in the space. I would agree with Joe, I guess, that IPFS is the thing that people sometimes, you know, like, I'll be at an event and someone will say, what's the difference between DAT and IPFS? And I answer pretty much how Joe just answered, but it's important to note that we know those people, and we have good relationships with them. And we've actually just been emailing with them about some kind of collaboration
over the next year. So there's a lot of really great projects in the open data and improving access to data space, and I, like, basically support them all. So there's so much work to be done that I think there's room for all the people in the space. And now that you have established a nonprofit organization around DAT, are there any particular plans that you have to support future sustainability and growth for the project?
Yes. Future sustainability and growth for the project is what we wake up and think about every day. Sometimes in the middle of the night. It's the most important thing. And incorporating the nonprofit was a big step that happened, I think, at the end of 2016. And so it's critical as we move DAT towards a self-sustaining future. And importantly, it will also allow us to continue to support and incubate other open source projects in the space, which is something that I'm really excited about.
For DAT, our goal is to support a core group of DAT contributors through grants, revenue sharing, and donations. And so over the next 12 months, we'll be pursuing grants and corporate donations, as well as rolling out an Open Collective page to help facilitate smaller donations, and continuing to develop products with an eye towards things that can generate revenue and support the DAT ecosystem. At the same time, we're also focusing on sustainability
within the project itself. And what I mean by that is, you know, governance, community management. And so we are right now working with the DAT developer community to formalize the technical process on the protocol through a working group. And those are really great calls. Lots of great people are involved in that. And we really wanna make sure the protocol decisions are made transparently, and that it can involve a wider
group of the DAT community in the process. And we also want to make the path to participation, involvement, and community leadership clear for newcomers. So by supporting the DAT developer community, we hope to encourage, like, new and exciting implementations of the DAT protocol. Some of the stuff that happened in 2017, you know, from my perspective working on the science end,
sort of came out of nowhere, and people were building, you know, amazing new social networks based on DAT. And it was really fun and exciting. And so just keeping the community healthy and making sure that the technical process and how decisions get made is really clear and transparent, I think, is going to facilitate even more of that. And just another comment about being a nonprofit: because Code for Science and Society is a nonprofit, we also act as a fiscal sponsor.
And what that means is that like minded projects who get grant funding that are not nonprofits, so they can't accept the grant, they run their grant through us. And then we take a small percentage of that grant, and we use that to help those projects by linking them up with our community. I work with them on grant writing and fundraising and strategy.
We'll support their own community engagement efforts and sometimes offer technical support. And we see this as really important to the ecosystem and a way to help smaller projects develop and succeed. So right now, we do that with 2 projects. One of them is called Stencila, and I can send a link for that. And the other one is called ScienceFair. Stencila is an open source reproducible document software funded by the Alfred P. Sloan Foundation.
It's looking to support researchers all the way from data collection to document authoring. And ScienceFair is a peer-to-peer library built on DAT, which is designed to make it easy for scholars to curate collections of research on a certain topic, annotate them, and share it with their colleagues. And so that project was funded by a prototype grant from a publisher called eLife, and they're looking for additional funding. So we're working with both of them.
And in Q1 of this year, Joe and I are working to formalize the process of how we work with these other projects and what we can offer them. And hopefully, we'll be in a position to take on additional projects later this year.
But I really enjoy that work. So I went through a Mozilla Fellowship, which was like a 10-month-long crazy period where Mozilla invested a lot in me and making sure I was meeting people and learning how to write grants and learning how to give good talks, and all kinds of awesome investment.
And so for a person who goes through a program like that or a person who has a side project, there's a need for groups in the space who can incubate those projects and help them as they develop from the incubator stage to the, you know, middle stage before they scale up. So as a fiscal sponsor, we're hoping to be able to support projects in that space.
And digging into the DAT protocol itself, when I was looking through the documentation, it mentioned that the actual protocol itself is agnostic to the implementation, and I know that the current reference implementation
is done in JavaScript. So I'm wondering if you can describe a bit about how the protocol itself is designed, how the reference implementation is done, and how the overall protocol has evolved since it was first started, and what your approach is to versioning the protocol itself to ensure that people who are implementing it in other technologies or formats are able to ensure that they're compliant with specific versions of the protocol as it evolves?
Yeah. So DAT's basically a combination of ideas from Git, BitTorrent, and just the web in general. And so there are a few key properties in DAT that basically any implementation has to recreate. And those are content integrity, decentralized mirroring of the datasets, network privacy, incremental versioning, and then random access to the data. So we have a white paper that sort of explains all these in depth, but I'll sort of explain how they work maybe in a basic use case.
So let's say I want to send some data to Danielle, which I do all the time, and I have a spreadsheet where I keep track of my coffee intake. So I wanna live sync that to Danielle's computer so she can make sure I'm not over-caffeinating myself. So sort of similar to how you get started with Git, I would put my spreadsheet in a folder and create a new DAT. And so whenever I create a new DAT, it makes a new key pair. So one's a public key and one's a private key.
And the public key is basically the DAT link, so kind of like a URL. So you can use that in anything that speaks the DAT protocol, and you can just sort of open that up and look at all the files inside of that. And then the private key allows me to write files to the DAT and is used to sign any of the new changes.
And so that signature allows Danielle to verify that the changes actually came from me and that somebody else wasn't trying to fake my data, or somebody wasn't trying to man-in-the-middle my data when I was transferring it to Danielle. So I add my spreadsheet to the DAT, and then what DAT does is break that file into little chunks. It hashes all those chunks and creates a Merkle tree with that. And that Merkle tree basically has lots of cool properties and is one of the key
features of DAT. So the Merkle tree allows us to sparsely replicate data. So if we had a really big dataset and you only wanted one file, we can sort of use the Merkle tree to download one file and still verify the integrity of that content with that incomplete dataset. And the other part that allows us to do that is the register. So all the files are stored in one register, and all the metadata is stored in another register.
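As a rough illustration of the chunk, hash, and Merkle tree idea Joe describes here, the following is a minimal Python sketch. It is not DAT's actual tree layout or hash function (those are covered in the project's white paper); SHA-256, the 8-byte chunk size, and every function name below are assumptions made for the example. It shows how a receiver holding only a trusted root hash can check a single chunk without downloading the rest of the dataset.

```python
import hashlib

CHUNK_SIZE = 8  # bytes; deliberately tiny so the example stays readable (assumption)


def split_chunks(data, size=CHUNK_SIZE):
    """Break a byte string into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def leaf_hash(chunk):
    # The prefix byte keeps leaf hashes distinct from parent hashes.
    return hashlib.sha256(b"\x00" + chunk).digest()


def parent_hash(left, right):
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(chunks):
    """Return every level of the tree, leaf hashes first and the root last."""
    if not chunks:
        raise ValueError("need at least one chunk")
    level = [leaf_hash(c) for c in chunks]
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2 == 1:
            level = level + [level[-1]]  # pad odd levels by repeating the last hash
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [parent_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]


def prove(levels, index):
    """Collect the sibling hashes needed to recompute the root from one chunk."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(chunk, proof, root):
    """Recompute the root from a single chunk plus its proof and compare."""
    digest = leaf_hash(chunk)
    for sibling, sibling_is_left in proof:
        if sibling_is_left:
            digest = parent_hash(sibling, digest)
        else:
            digest = parent_hash(digest, sibling)
    return digest == root


spreadsheet = b"coffee,cups\nmon,3\ntue,5\n"
chunks = split_chunks(spreadsheet)
levels = build_levels(chunks)
root = levels[-1][0]

# A peer that only wants the last chunk needs that chunk, its proof, and the root.
proof = prove(levels, 2)
print(verify(chunks[2], proof, root))
print(verify(b"tampered", proof, root))
```

In a real system the root hash would itself be signed with the private key mentioned earlier, so checking one signature plus a short proof is enough to trust any individual chunk.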
And these registers are basically append-only ledgers. They're also sort of known as secure registers. Google has a project called Certificate Transparency that has similar ideas. And with these registers, basically, whenever a file changes, you append that to the metadata register, and that register stores basic information about the structure of the file system, what version it is, and then any
other metadata, like the creation time or the change time of that file. And so right now, you know, as you said, Tobias, we sort of are very flexible on how things are implemented. But right now, we basically store the files as files. So that sort of allows for people to see the files normally and interact with them normally. But the cool part about that is that the on-disk file storage can be really flexible. So as long as the implementation has random access, basically,
then they can store it in any different way. So we have, for example, a storage model built for the server that stores all of the files as a single file. So that sort of allows you to have fewer file descriptors open and sort of gets the file I/O all constrained to one file. So once my file gets added, I can share my link privately with Danielle, and I can send that over
chat or something or just paste it somewhere. And then she can clone my DAT using our command line tool or the desktop tool or the Beaker browser. And when she clones my DAT, our computers basically connect directly to each other. So we use a variety of mechanisms to try and do that connection.
That's been one of the challenges that I can talk about later, sort of how to connect peer to peer and the challenges around that. But then once we do connect, we'll transfer the data either over TCP or UDP. So those are the default network protocols that we use right now. But, yeah, that can be implemented basically on any other protocol. I think
Mathias once said that if you could implement it over carrier pigeon, that would work fine as long as you had a lot of pigeons. So we're really open to sort of how the protocol information gets transferred, and we're working on a DAT-over-HTTP implementation too. So this wouldn't be peer to peer, but it would allow basically a traditional server fallback if no peers are online
or for services that don't want to run peer to peer for whatever reason. Once Danielle clones my DAT, she can open it just like a normal file and plug it into R or Python or whatever and use her equation to measure my caffeine level. And then let's say I drink another cup of coffee and update my spreadsheet. The changes will basically automatically be synced to her as long as she's still connected to me, and it'll be synced throughout the network
to anybody else that's connected to me. So the metadata register stores that updated file information, and then the content register stores just the changed file blocks. So Danielle only has to sync the diff of that content change rather than the whole dataset again. So this is really useful for the big datasets, so you don't have to sync the whole thing. And, yeah, we've tried to design basically each of these pieces to be as modular as possible, both within our JavaScript
implementation but also in the protocol in general. So right now, developers can swap in other network protocols or data storage. So for example, if you want to use DAT in the browser, you can use WebRTC for the network and discovery, and then use IndexedDB for data storage. So IndexedDB has random access, so you can just plug that directly into DAT. And we have some modules for those, and that should be working.
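To make the "anything with random access can be the storage" point concrete, here is a small Python sketch of the idea. It does not mirror DAT's real JavaScript modules; the class names (`MemoryStorage`, `SingleFileStorage`, `ChunkRegister`) and the two-method `read`/`write` interface are hypothetical, chosen only to show an append-only register running unchanged on two different backends, one of which keeps everything in a single file as in the server example mentioned earlier.

```python
import os
import tempfile


class MemoryStorage:
    """Random-access storage held in memory (hypothetical backend)."""

    def __init__(self):
        self._buf = bytearray()

    def write(self, offset, data):
        end = offset + len(data)
        if end > len(self._buf):
            self._buf.extend(b"\x00" * (end - len(self._buf)))
        self._buf[offset:end] = data

    def read(self, offset, length):
        return bytes(self._buf[offset:offset + length])


class SingleFileStorage:
    """Random-access storage that keeps every chunk in one file on disk."""

    def __init__(self, path):
        self._path = path
        open(path, "ab").close()  # create the file if it does not exist yet

    def write(self, offset, data):
        with open(self._path, "r+b") as f:
            f.seek(offset)
            f.write(data)

    def read(self, offset, length):
        with open(self._path, "rb") as f:
            f.seek(offset)
            return f.read(length)


class ChunkRegister:
    """Append-only list of chunks on top of any backend with read/write at an offset."""

    def __init__(self, storage):
        self._storage = storage
        self._offsets = [0]  # _offsets[i] is where chunk i starts; the last entry is the end

    def append(self, chunk):
        self._storage.write(self._offsets[-1], chunk)
        self._offsets.append(self._offsets[-1] + len(chunk))
        return len(self._offsets) - 2  # index of the chunk just written

    def get(self, index):
        start, end = self._offsets[index], self._offsets[index + 1]
        return self._storage.read(start, end - start)

    def __len__(self):
        return len(self._offsets) - 1


with tempfile.TemporaryDirectory() as tmp:
    backends = [MemoryStorage(), SingleFileStorage(os.path.join(tmp, "content.bin"))]
    for backend in backends:
        register = ChunkRegister(backend)
        register.append(b"mon,3\n")
        register.append(b"tue,5\n")
        # Fetch the second chunk directly, without reading the first.
        print(type(backend).__name__, register.get(1))
```

The register never needs to know which backend it is talking to, which is the property that lets a browser build swap in something like IndexedDB.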
We did have a WebRTC implementation we were supporting for a while, but we found it a bit inconsistent for our use cases, which are, you know, more around, like, large file sharing. But it still might be okay for chat and other more text-based things. So, yeah, all of our implementation is in Node right now. I think that was both for usability and developer friendliness and also just being able to work in the browser and across platforms.
So we can distribute a binary of DAT pretty easily now, and you can run DAT in the browser or build DAT tools on Electron. So it sort of allows a wide range of developer tools built on top of DAT. But we have a few community members now working on different implementations, in Rust and C, I think, are the 2 that are going right now. And so as far as the protocol versioning,
that was actually one of the big conversations we were having in the last working group meeting, and that's to be decided, basically. But through the stages we've gone through, we've broken it quite a few times, and now we're finally in a place where we want to make sure not to break it moving forward. So there's sort of space in the protocol for information like version history or version of the protocol.
So we'll probably use that to signal the version and just figure out how the tools that are implementing it can fall back to the latest version. So before all this sort of file-based stuff, DAT went through a few different stages. It started really as a more, like, versioned decentralized database. And then as Max and Mathias and Karissa sort of moved to the scientific use cases, they sort of removed
more and more of the database architecture as it moved on and matured.
So that transition was really driven by, like, user feedback and watching how researchers work. And we realized that so much of research data is still kept in files and basically moved manually between machines. So even if we were gonna build, like, a special database, a lot of researchers still wouldn't be able to use that because that sort of requires more infrastructure than they have time to support.
So we really just kept working to build a general purpose solution that allows other people to build tools to solve those more specific problems. And the last point is that right now, all DAT transfer is basically one way. So only one person can update the source. This is really useful for a lot of our research use cases, where they're getting data from lab equipment, where there's, like, a specific source and you just want to disseminate that information to
various computers, but it really doesn't work for collaboration. So that's sort of the next thing that we're working on. But we really want to make sure to solve the sort of one-way problem before we move to the harder problem of collaborative datasets. And this last major iteration is sort of the hardest, and that's
what we're working on right now, but it sort of allows multiple users to write to the same DAT. And with that, we sort of get into problems like conflict resolution and duplicate updates and other sort of harder distributed computing problems.
And that partially answers one of the next questions I had, which was to ask about conflict resolution, but if there's only one source that's allowed to update the information, then that solves a lot of the problems that might arise by syncing all these datasets between multiple machines, because there aren't gonna be multiple parties changing the data
concurrently, so you don't have to worry about how to handle those use cases. And another question that I had from what you were talking about is the cryptography aspect of DAT. It sounds as though when you initialize the DAT, it just automatically generates the private key, and so that private key is canonically linked with that particular dataset.
But is there any way to use, for instance, Keybase or GPG to sign the source DAT in addition to the generated key, to establish your identity for when you're trying to share that information publicly and not necessarily via some channel that already has established trust? Yeah. I mean, you could, like, do that within the DAT. We don't really have any mechanism for doing that on top of DAT.
So, you know, we're sort of gonna throw that into userland right now. But, yeah, I mean, that's a good question. And we've had some people, like, I think, experimenting with different identity systems and how to solve that problem. And I think we're pretty excited about the new Wire app, because that's open source and it uses end-to-end encryption and it has some identity system, and we're sort of trying to see if we can build that on top of Wire.
So that's one of the things that we're just sort of experimenting with. And one of the primary use cases that is mentioned in the documentation and the website for DAT is being able to host and distribute open datasets, with a focus being on researchers and academic use cases.
So I'm wondering if you can talk some more about how DAT helps with that particular effort and what improvements it offers over some of the existing solutions that researchers were using prior to your introduction of DAT. So there are solutions for both hosting and distributing data. In terms of hosting and distribution, there's a lot of great work focused on data publication
and making sure that data associated with publications is available online. I'm thinking about Zenodo and Dryad or Dataverse. There are also other data hosting platforms, such as CKAN or data.world. And we really love the work these people do, and we've collaborated with some of them. We're involved in, like, organizations of friendly people, like the Open Source Alliance for Open Scholarship, which
has some people from Dryad who are involved in it. And so it's nice to work with them, and we'd love to work with them to use DAT to upload and distribute data. But right now, if researchers need to share files between many machines and keep them updated and versioned, for example if there's a large live updating dataset, there really aren't great solutions to address data versioning and sharing.
So in terms of sharing and transferring, lots of researchers still manually copy files between machines and servers, or use tools like rsync or FTP, which is how I handled it during my PhD. Other software, such as Globus or even Dropbox or Box, can require more IT infrastructure than a small research group may have. Researchers are, you know, all operating on limited grant funding. And they also depend on the IT infrastructure of their institution to get them access to certain things.
So a researcher like me might spend all day collecting a terabyte of data on a microscope and then wait for hours or wait overnight to move it to another location. And the ideal situation from a data management perspective is that those raw data are automatically archived to the lab server and sent to the researcher's computer for processing. So you have an archived copy of the raw data that came off of the equipment.
And then the processed files also need to be archived. So you need archives of the imaging files in this case at each step in processing. And then when a publication is ready, given the data and the computer or the cluster where the analysis was done, a person should be able to repeat that. And I say ideally because this isn't really how it's happening now.
Some of the things that stop archiving at different steps from happening are just cost of storage and the availability of storage and researcher habits. So I ever went off, which isn't really like a long term solution. True facts. So DAT can automate these archiving steps at different checkpoints and make the backups easier for researchers. As a former researcher, I'm interested in anything that makes better data management automatic for researchers.
And so we're also interested in versioned compute environments to help labs avoid the drawer full of Jaz drives problem, which is sadly a quote from a senior scientist who was describing a bunch of data collected by her lab that she can no longer access. She has the drawer, she has the Jaz drives, she can't get in them, that data is essentially lost. And so researchers are really motivated to make sure
when things are archived, they're archived in a form where they can actually be accessed. But I think because researchers are so busy, it's really hard to know, like, when that is. So, I think because we're so focused on, essentially, like, filling in the gaps between the services that researchers use and that work well for them and automating things, I think that DAT's in a really good position to solve some of these problems.
And, you know, some of the researchers that we're working with now, I'm thinking of 1 person who has a large dataset and a bioinformatics pipeline. He's at a UC lab, and he wants to get all the information to his collaborator in Washington state. And it's taken months, and he has not been able to do it. He just can't move that data across institutional lines. So,
and that's a much longer conversation as to, like, why exactly that isn't working. But we're working with him to try to just make it possible for him to move the data and create a versioned iteration or a versioned emulation of his compute environment so that his collaborator can just do what he was doing and not need to spend 4 months worrying about sysadmin stuff. So, yeah, hopefully, that answers the question.
And 1 of the other difficult aspects of building a peer to peer protocol is the fact that in order for there to be sufficient value in the protocol itself, there needs to be a network behind it of people to be able to share that information with and share the bandwidth requirements for being able to distribute that information. So I'm wondering how you have approached the effort of building up that network and how much progress you feel you have made in that effort.
Yeah. I'm not sure we really view DAT as that traditional peer to peer protocol, using that model of sort of relying on network effects to scale. So, you know, as Danielle said, we're just trying to get data from a to b, and so our critical mass is basically 2 users on a given dataset.
So, obviously, we wanna first build something that offers better tools for those 2 users over a traditional cloud or client server model. So if I'm transferring files to another researcher using Dropbox, you know, we have to transfer files via a third party and a third computer,
before it can get to the other computer. So rather than going direct between 2 computers, we have to go through a detour. And this has implications for speed, but also security, bandwidth usage, and even something like energy usage.
So by cutting out that 3rd computer, we feel like we're already adding value to the network, and we're sort of hoping that when researchers are doing this a to b transfer, they can sort of see the value of going directly and using something that is versioned and can be live synced, over existing tools like rsync or FTP or other commercial services that might store data in the cloud. And you know, we really don't have anything against these centralized services.
We sort of recognize that they're very useful sometimes, but they also aren't the answer to everything. And so depending on the use case, this decentralized system might make more sense than a centralized 1. And so we sort of want to offer developers and users that option to make that choice, which we don't really have right now. But in order to do that, we really have to start with peer to peer tools first. And then once we have that decentralized network, we can basically limit the network
to 1 server peer and many clients, and then all of a sudden it's centralized. So we sort of understand that it's easy to go from decentralized to centralized, but it's harder to go the other way around. So we sort of have to start with a peer to peer network in order to solve all these different problems. And the other thing is that we sort of know file systems are not going away. We know that web browsers will continue to support static files.
And we also know that people basically want to move these things between computers, back them up, archive them, share them to different computers. So we sort of know files are gonna be transferred a lot in the future, and that's something we can depend on. And they probably even wanna do this in a secure way sometimes and maybe in an offline environment or a local network. And so we're basically trying to build from those basic principles,
using sort of peer to peer transfer as the sort of bedrock of all that. And that's sort of how we got to where we are now with the peer to peer network. But we're not really worried that we need a certain critical mass of users to add value because we just sort of feel like by building the right tools with these principles, we can start adding value whether it's a decentralized network or a centralized network.
And 1 of the other use cases that's been built on top of DAT is being able to build websites and applications that can be viewed via web browsers and distributed peer to peer in that manner. So I'm wondering how much uptake you've seen in usage for that particular application of the protocol and how much development effort is being focused on that particular use case.
Yeah. So, you know, if I open my Beaker browser right now, which is the main web implementation we have that Paul Frazee and Tara Vancil are working on, I think I usually have 50 to 100 or sometimes 200 peers that I connect to right away. So that's through some of the social network copies like Rotonde or Fritter,
and then just some, like, personal sites. And, you know, we've sort of been working with the Beaker Browser folks probably for 2 years now, sort of codeveloping the protocol and seeing what they need support for in Beaker. But, you know, it sort of comes back to that basic principle that we can recognize that a lot of websites
are static files. And if we can just sort of support static files in the best way possible, then you can browse a lot of websites. And that even gives you the benefit of things that are more interactive. We know that they have to be developed so they work offline too. So both Rotonde and Twitter can work offline. And then once you get back online, you can just sync the data sort of seamlessly. So that's sort of the most exciting part about those. You mean Fritter, not Twitter? Sorry.
Fritter is the Twitter clone that Tara Vancil and Paul made. Beaker is a lot of fun, and if you've never played around with it, I would encourage you to download it at, I think it's just at beakerbrowser.com. And I'm not a developer by trade, but I have seriously enjoyed playing around on Beaker.
And I think some of the more frivolous things, like Fritter, that have come out of it are a lot of fun and really speak to the potential of peer to peer networks in today's era as people are becoming increasingly frustrated with the centralized platforms. And the fact that the content that's being distributed
via DAT using the Beaker Browser is primarily static in nature. I'm wondering how that affects the sort of architecture of applications that have been built on it. You've already mentioned a couple of social network applications that have been built on top of it, but I'm wondering if there are any others that are built on top of and delivered via DAT that you're aware of that you can talk about that speak to
some of the ways that people are taking advantage of DAT in more of the consumer space? Yeah. I mean, I think, you know, 1 of the big shifts that has made this easier is having databases in the browser. So things like IndexedDB or other local storage databases, and then being able to sync those to other computers. So as long as you sort of know that I'm writing to my database, you know, I think people are trying to build
games off this. So, you know, you could build a chess game where I write to my local database, and then you have some logic for determining if a move is valid or not, and then syncing that to your competitor. You know, it's a more constrained environment, but I think that also gives you a benefit of sort of being able to constrain your development and not requiring
these external services or external database calls or whatever. I know that I've tried a few times to sort of develop projects or just, like, fun little things, and it is a challenge. It's a challenge because you sort of have to think differently about how those things work, and you can't rely necessarily on external services, you know, whether that's something as simple as, like, loading fonts from an external service or CSS styles or whatever, external JavaScript.
You sort of want that all to be packaged within 1 DAT if you wanna ensure it's all gonna work. So you definitely, you know, think a little differently even on those simple things. But, yeah, it does constrain the sort of bigger applications. And, you know, I think the other area that we could see development is more in Electron applications.
So maybe not in Beaker, but using Electron, that sort of framework, as a platform for other types of applications that might need those more sort of flexible models. So ScienceFair, which is 1 of our hosted projects, is a really good example of how to use DAT in a way to distribute data, but still sort of have a full application. So, basically, you can distribute all the data for the application over DAT and keep it updated through the live syncing.
And users can basically download the PDFs that they need to read or the journals or the figures they wanna read and just download whatever they want. So it's sort of allowing developers to have that flexible model where you can distribute things peer to peer and have both the live syncing, but also just downloading whatever data that users need and just providing that framework for that data management.
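The chess example from earlier in this answer can be sketched roughly as follows: each player appends only to their own log, and either side rebuilds the game by interleaving the two logs and checking every move. This is a hedged illustration, not Beaker or DAT API code. `MoveLog`, `replay`, and the `is_valid` callback are hypothetical names, and the validity rule in the demo is a trivial stand-in for real chess rules.

```python
from dataclasses import dataclass, field


@dataclass
class MoveLog:
    """One player's local, append-only record of (turn, move) pairs."""
    player: str                      # "white" or "black"
    moves: list = field(default_factory=list)

    def append(self, turn, move):
        self.moves.append((turn, move))


def replay(white, black, is_valid):
    """Rebuild the shared game from two single-writer logs.

    White owns even turns and black owns odd turns. Replay stops at the
    first turn that has not been synced yet, and raises ValueError if a
    log contains a turn it does not own or a move that fails is_valid.
    """
    by_turn = {}
    for log in (white, black):
        for turn, move in log.moves:
            owner = "white" if turn % 2 == 0 else "black"
            if log.player != owner:
                raise ValueError(f"{log.player} wrote turn {turn}")
            by_turn[turn] = move

    history = []
    for turn in range(len(by_turn)):
        if turn not in by_turn:
            break                    # the opponent's move has not arrived
        move = by_turn[turn]
        if not is_valid(history, move):
            raise ValueError(f"invalid move {move!r} at turn {turn}")
        history.append(move)
    return history


if __name__ == "__main__":
    white, black = MoveLog("white"), MoveLog("black")
    white.append(0, "e2e4")
    black.append(1, "e7e5")
    white.append(2, "g1f3")
    # Stand-in rule for the demo: a move string must be 4 characters long.
    print(replay(white, black, lambda history, move: len(move) == 4))
```

Because each log has exactly one writer, syncing never has to reconcile competing edits to the same log; the only shared logic is the validity check that both players run locally.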
And 1 of the other challenges that's posed particularly for this public distribution use case is that of content discovery because, by default, the DAT URLs that are generated are private and unguessable because they're essentially just hashes of the content.
So I'm wondering if there are any particular mechanisms that you either have built or planned or started discussing for being able to facilitate content discovery of the information that's being distributed by these different DAT networks.
Yeah. This is definitely an open question. I sort of fall back on my common answer, which is it depends on the tool that we're using and the different communities. And there's gonna be different approaches. Some might be more decentralized and some might be centralized. So for example, with dataset discovery, you know, there's a lot of good centralized services for dataset publishing, as Danielle mentioned, like Zenodo or Dataverse. So these are places that already have discovery
engines, I guess I'll say. And they publish datasets. So, you know, you could sort of similarly publish the DAT URL along with those datasets so that people could sort of have an alternative way to download those datasets. So that's sort of 1 way that we've been thinking about discovery, sort of leveraging these existing solutions that are doing a really good job in their domain and trying to work with them to start using DAT for their data management.
Another sort of hacky solution, I guess I'll say, is using existing domains and DNS. So basically, you can publish a regular HTTP site on your URL and give it a specific well known file. That points to your DAT address, and then the Beaker browser can find that file and tell you that a peer to peer version of that site is available. So we're basically leveraging the existing DNS infrastructure to start to discover content just with existing URLs.
And I think a lot of the discovery will be more community based. So in, for example, Fritter and Rotonde, people are starting to build crawlers or search bots to discover users or search. And so basically, just sort of looking at where there is need and identifying
you know, different types of crawlers to build and how to connect those communities in different ways. So we're really excited to see what ideas pop up in that area, and they'll probably come in a decentralized way, we hope.
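The well known file approach described in this answer can be sketched as a small parser. The layout assumed here, a `dat://` line carrying a 64 character hex key plus an optional `TTL=<seconds>` line, served from a path such as `/.well-known/dat`, is an assumption for illustration; check the Beaker and DAT documentation for the exact format. `parse_well_known` is a hypothetical helper name, and the key in the demo is made up.

```python
import re

# Assumed layout: one "dat://<64 hex chars>" line and an optional "TTL=<seconds>" line.
DAT_LINE = re.compile(r"^dat://([0-9a-f]{64})/?$", re.IGNORECASE)
TTL_LINE = re.compile(r"^ttl=(\d+)$", re.IGNORECASE)


def parse_well_known(text, default_ttl=3600):
    """Return (key, ttl_seconds) parsed from the body of a well known file.

    Raises ValueError when no dat:// line is present. default_ttl is an
    illustrative fallback, not a value taken from any specification.
    """
    key, ttl = None, default_ttl
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = DAT_LINE.match(line)
        if match and key is None:
            key = match.group(1).lower()   # keep the first address found
            continue
        match = TTL_LINE.match(line)
        if match:
            ttl = int(match.group(1))
    if key is None:
        raise ValueError("no dat:// address found")
    return key, ttl


if __name__ == "__main__":
    made_up_key = "ab" * 32                # placeholder, not a real archive
    print(parse_well_known(f"dat://{made_up_key}\nTTL=600\n"))
```

A browser following this pattern would fetch the file over HTTPS from the ordinary domain, parse it, and then offer the peer to peer version of the site at the resulting address.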
And for somebody who wants to start using DAT, what is involved in creating and or consuming the content that's available on the network? And are there any particular resources that are available to get somebody up to speed and understand how it works and some of the different uses that they could put it to? Sure. I can take that. And, Joe, just chime in if you think of anything else. We built a tutorial for our work with the labs and for MozFest this year. That's at try-dat.com.
And this tutorial takes you through how to work with the command line tool and some basics about Beaker. And please tell us if you find a bug. There may be bugs. Warning. But it was working pretty well when I used it last. And it's in the browser. It spins up a little virtual machine, so you can share data with yourself, or you can do it with a friend and share data with your friend.
So, Beaker is also super easy for a user who wants to get started. You can visit pages over DAT just like you would a normal web page. For example, you can go to this website, and we'll give Tobias the link to that, and just change the HTTP to DAT. And so it looks like dat://jhand.space. And Beaker also has this fun thing
that lets you create a new site with a single click. And you can also fork sites and edit them and make your own copies of things, which is fun if you're, like, learning about how to build simple websites. So you can go to beakerbrowser.com and learn about that. And I think we've already talked about Rotonde and Fritter, and we'll add links for people who wanna learn more about that.
And then for data focused users, you can use DAT for sharing or transferring files either with the desktop application or the command line interface. So if you're interested, we encourage you to play around. The community is really friendly and helpful to new people. Joe and I are always on the IRC channel or on Twitter. So if you have questions, feel free to ask. And,
we love talking to new people because that's how all the exciting stuff happens in this community. So. And what have been some of the most challenging aspects of building the project and the community and promoting the use cases and capabilities of the project? I can speak a little bit to promoting it in academic research. So in academic research, probably similar to many of the industries where your listeners work, software decisions are not always made for entirely rational reasons.
There's tension between what your boss wants, what the IT department has approved that meets institutional data security needs, and then the perceived time cost of developing a new workflow and getting used to a new protocol. So we try to work directly with researchers to make sure the things we build are easy and secure.
But it is a lot of promotion and outreach to get scientists to try a new workflow. They're really busy. And the incentives are all, you know, get more grants, do more projects, publish more papers.
And so even if something will eventually make your life easier, it's hard to sink in that time upfront. 1 thing I notice, and this is probably common to all industries, is that I'll be talking to someone and they'll say, oh, you know, archiving the data for my research group is not a problem for me. And then they'll proceed to describe a super problematic data management workflow. And it's not a problem for them anymore, because they're used to it, so it doesn't hurt day to day.
But, you know, doing things like waiting till the point of publication to then try to go back and archive all the raw data, maybe some was collected by a postdoc who's now gone, other data was collected by a summer student who used a nonstandard naming scheme for all the files. You know, there's just a million ways that that stuff can go wrong. So for now, we're focusing on developing real world use cases and participating in, you know, community education around data management.
And we want to build stuff that's meaningful for researchers and others who work with data. And we think that by working with people and doing the nonprofit thing with the grants, it's gonna be the way to get us there. Joe, do you wanna talk a little bit about building? Yeah. Sure. So, you know, in terms of building it, I mean, I haven't done too much work on the core protocol, so I can't say much around
the difficult design decisions there. I'm the main developer on the command line tool, and most of the challenging decisions there are all about sort of user interfaces, not necessarily technical problems. And so, as Danielle said, it's sort of as much about people as it is about software and those decisions. But I think, you know, 1 of the most challenging things that we've run into a lot is basically network issues. So in the peer to peer network,
you know, you have to figure out how to connect 2 peers directly in a network where they might not be supposed to do that. So I think a lot of that is from BitTorrent sort of making different institutions restrict peer to peer networking in different ways. And so we're sort of having to fight that battle against these existing restrictions.
And trying to find out how these networks are restrictive, and how we can continue to have success in connecting peers directly rather than through a third party server. And it's funny, or maybe not funny, but some of the strictest networks we found are actually in academic institutions. And so, you know, for example, at 1 of the UC campuses, I think we found out that
computers can never connect directly to other computers on that same network. So if we wanted to transfer data between 2 computers sitting right next to each other, we basically have to go through an external cloud server just to get it to the computer sitting right next to it or, you know, use something like a hard drive or a thumb drive or whatever.
But, you know, that sort of thing, all these different sort of network configurations, I think, is 1 of the hardest parts, both in terms of implementation but also in terms of testing, since we can't, like, readily get into these UC campuses or sort of see what the network setup is. So we're sort of trying to create more tools around network testing, both testing
networks in the wild, but also just sort of using virtual networks to test different types of network setups and sort of leverage those 2 things combined to try and get around all these network connection issues. So yeah. I think, you know, I would love to ask Mathias this question too, around the design decisions in terms of the core protocol, but I can't really say much about that, unfortunately. And are there any particularly
interesting or inspiring uses of DAT that you're aware of that you'd like to share? Sure. I can share a couple of things that we were involved in. In January 2016, we were involved in the Data Rescue and Libraries+ Network community. And that was the movement to archive government funded research at trusted public institutions, like libraries and archives. And as a part of that, we got to work with some of the really awesome people at California Digital Library.
California Digital Library is really cool because it is a digital library with a mandate to preserve and archive and steward the data that's produced in the UC system, and it supports the entire UC system. And the people are great. And so we worked with them to make the first ever backup of data.gov in January of 2016. And I think my colleague had 40 terabytes of metadata sitting in his living room for a while as we were working up to the transfer.
And so that was a really cool project. And it has produced a useful thing. And, you know, we got to work with some of the data.gov people to make that happen. And they, you know, they were like, oh, it really has never been backed up. But it was a good time to do it. But believe it or not, it's actually pretty hard to find funding for that work. And we have more work we'd like to do in that space.
Archiving copies of federally funded research at trusted institutions is a really critical step towards ensuring the long term preservation of the research that gets done in this country. So hopefully, 2018 will see those projects funded or new collaborations in that space.
Also, it's a fantastic community because it's a lot of really interesting librarians and archivists who have great perspective on long term data preservation, and I love working with them. So hopefully, I can do something else there. Then the other thing that I'm really excited about is working on the DAT in the Lab project, working on the DAT container issue. And I know we're a little over time, so I don't know how much I should go into this.
But we've learned a lot about really interesting research, and so we're working to develop a container based simulation of a research computing cluster that can run on any machine or in the cloud. And then by creating a container that will include the complete software environment of the cluster, researchers across the UC system can quickly get analysis pipelines that they're working on
usable in other locations. And this, believe it or not, is a big problem. I was sort of surprised when 1 researcher told me she had been working for 4 months to get a pipeline running at UC Merced that had been developed at UCLA. And that's like you could drive back and forth between Merced and UCLA a bunch of times in 4 months. But it's this little stuff that really slows research down. And so I'm really excited about the potential there.
And we've written a couple of blog posts on that, so I can add the links to those blog posts in the follow-up. And I'd say the most novel use that I'm sort of excited about is called Hypervision, and it's basically video streaming built on DAT. Mathias Buus, 1 of the lead developers on DAT, is prototyping sort of something similar with the Danish public TV, and they basically want to livestream their channels
over the peer to peer network. So I'm excited about that because I'd really love to get more public television and public radio distributing content peer to peer, so we can sort of reduce their infrastructure costs and hopefully allow for more of that great content to come out. Are there any other topics that we didn't discuss yet which you think we should talk about before we close out the show?
I think I'm feeling pretty good. What about you, Joe? Yeah. I think that that's it for me. Okay. So for anybody who wants to keep up to date with the work you're doing or get in touch, I'll have you each add your preferred contact information to the show notes. And as a final question to give the listener something else to think about, from your perspective, what is the biggest gap in the tooling or technology that's available for data management today?
I'd say transferring files, which it feels really funny to say that, but to me it's still a problem that's not really well solved. Just how do you get files from a to b in a consistent and easy to use manner? I especially want a solution that doesn't really require a command line and is still secure and hopefully doesn't go through a third party service,
because hopefully that means it works offline. So a lot of what I saw in the developing world is the need for data management that works offline. And I think that's 1 of the biggest gaps that we don't really address yet.
So there are a lot of great data management tools out there, but I think they sort of aim more at data scientists or software focused users that might use managed databases or something like Hadoop. But there's really a ton of users out there that don't really have the tools they need. And most of the world is still offline or with inconsistent Internet, and putting everything through servers in the cloud isn't really feasible. But the alternatives now require sort of careful
data management and manual data management if you don't wanna lose all your data. So we really hope to find a good balance between those 2 needs and those 2 use cases. Yeah. I'll plus 1 what Joe said, transferring files. It does feel funny to say that, but it is still a problem in a lot of industries, especially where I come from in research science. From my perspective, I guess the other
issue is that, you know, the people problems are always as hard or harder than the technical problems. So if people don't think that it's important to share data or archive data in an accessible and usable form, we could have the world's best easy to use tool, and it wouldn't impact the landscape or the accessibility of data. And similarly, if people are sharing data that's not usable because it's missing experimental context or it's in a proprietary format
or because it's shared under a restrictive license, it's also not gonna impact the landscape or be useful to the scientific community or the public. So we wanna build great tools, but I also wanna work to change the incentive structure in research to ensure that good data management practices are rewarded and so that data is shared in a usable form. That's really key.
And I'll add a link in the show notes to the FAIR data principles, which say data should be findable, accessible, interoperable, and reusable. It's something that your listeners might wanna check out if they're not familiar with it. It's a framework developed in academia, and I'm not sure actually how much impact it's had outside of that sphere, but it would be interesting to talk to your listeners a little bit about that.
And, yeah, I'll put my contact info in the show notes, and I'd love to connect with anyone and answer any further questions about DAT and what we're gonna try to do with Code for Science and Society over the next year. So thanks a lot, Tobias, for inviting us. Yeah. Absolutely. Thank you both for taking the time out of your days to join me and talk about the work you're doing. It's definitely a very interesting project with a lot of useful potential,
and so I'm excited to see where you go from now into the future. So thank you both for your time, and I hope you enjoy the rest of your evening. Thank you. Thank you.
