Hello, and welcome to the Data Engineering Podcast, the show about modern data management. If you lead a data team, you know this pain. Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work.
Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like "build me a self-service reporting tool that lets teams query customer metrics from Databricks," and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos.
Check out Retool at dataengineeringpodcast.com/retool today, that's R-E-T-O-O-L, and see how other data teams are scaling self-service. Because let's be honest, we all need to retool how we handle data requests.
Your host is Tobias Macey, and today I'm interviewing Rowan Cockett about building data systems that make scientific research easier to reproduce. So, Rowan, can you start by introducing yourself?
Awesome. I'm Rowan Cockett. I'm the cofounder and CEO of Curvenote, as well as one of the founders of the Continuous Science Foundation.
And do you remember how you first got started working in the overall space of data?
So I have a background in geoscience. I have an undergrad in geology. I started creating 3D visualization tools for my peers, just in undergrad classrooms. After I graduated I put those online, and they got picked up and went viral in that space, so I started a company and my master's degree at the same time. I ran that company for about two and a half years, and then we were acquired by a data science company that was trying to go from desktop software in the geoscience space to collaborative online software with version control and data management practices. That was in the mining industry and civil engineering. So we were doing data management for, like, all of Australia's underground mines, and making it possible for multiple scientists to work together on algorithms and visualizations there and share them with their managers.
So that was sort of my introduction to this space.
And so now you're working on Curvenote and at the Continuous Science Foundation, both of which are focused on the overall challenge of how to make the discovery and collaboration of scientific research a little bit more manageable. And one of the key challenges in the overall scientific publication ecosystem is that of reproducibility, which can be due to things such as not having access to the datasets that were used or not being able to replicate the software environments that were used. And I'm just wondering if you can talk through some of the ways that this overall aspect of reproducibility and discoverability and collaboration around scientific research practices has struck your fancy and why you're investing your time in that space?
Yeah. I think reproducibility is certainly the entry point, one of those things. I think underneath reproducibility there are maybe two things. There's integrity of the datasets and of the data processing pipelines: can you trust it, can you see it? And there's reuse as well: can you actually get hands on the data to reproduce it yourself? Reuse has all sorts of other benefits, including from an educational standpoint: can you actually understand that method and reuse it in a different context?
So I think from that perspective, if you have improved integrity in science as well as the ability to reuse results that you trust because they're reproducible, then with those two things together, science can move faster. And in terms of my goal when I wanted to start a company, back in 2019 when I was thinking about what to do next, I'd seen the challenges. I was doing my PhD at the time and doing all sorts of reproducible work and environment mapping, then sharing those with students and having them get up and running with an interactive widget to do data exploration or scientific modelling. And then there was the mismatch when I was publishing papers on that: take a screenshot, have no place to share all of your data or all of your code, and just have an advertisement about the work.
And so there's that mismatch: we know that we can do these things, but the systems and tools that we have available are just not fit for purpose at all.
In that space of being able to manage that collaboration and the friction involved in either reusing certain areas of research or just verifying the findings, what is the role that data plays? And what are the reasons that data can be such a challenging aspect of the overall research workflow?
Yeah, if you think back even twenty years, data could fit inside of a page in a PDF, and you were talking about small data tables. That is so not the world we live in today. There are terabytes of data and complex data processing pipelines in a lot of disciplines, and that's the space that we live in. The scientific systems for communication just have not kept up with that at all, and so we need better ways to share research, and better ways to incentivise the more modular and continuous sharing of research as well.
To that point of the scale of data, obviously, that's a challenge just in terms of the cost factor alone, not to mention the variability in terms of the tooling and systems that can be brought to bear to handle the storage and retrieval and processing of that information. But there are also numerous areas of research that don't require large volumes of data and are still stymied by this aspect of not being able to share or reuse the datasets, either because it's not something that is core to the overall research practice, where it's just "I had this dataset, these were my findings," and then it was just never published, or because of some of the friction in the ways that information is published in these journals, where having the dataset available is just an afterthought or not even considered because of the way that it's all been done historically.
Yeah, totally. And a lot of the way that data sharing practices are supported in this space is through a zip file of your data just on Zenodo or Dryad, and it is not curated; it doesn't have the context there that is so necessary.
And so a lot of my focus in this space is that contextual layer where you're putting data, code and visuals all together. There are excellent tools out there like Jupyter Notebooks and other tools that have that more literate programming style to them, and those are the types of ideas that we want to bring into the sharing of scientific research, as well as promoting better ways to put your data into more accessible spaces and formats.
With that format question as well, particularly in various scientific fields, the way that the data is represented or the way that it's stored is not something that is going to align easily with the ways that data is processed in an organizational and analytical context. You're not necessarily going to immediately reach for a Postgres database or an Iceberg data lake; you're going to rely on something like, I think it's the HDF5 file format, or something bespoke, or, given your geology background, some form of GeoJSON or some other type of shapefile format. Those aren't necessarily going to be as conducive to the commercial off-the-shelf style tools that people who are working in an organizational context are going to reach for. And I'm wondering just how that disparity in terms of investment and tooling for the high-value, monetized aspects of data management conflicts with these more niche and bespoke aspects of data management that are focused on a particular research domain?
Yeah, I think this is where some of the open science practices and open source practices are quite prevalent in the scientific space. Because many people are sharing data between institutions, and they may not have access to the same types of tools, there is often that convergence on open data standards and open source tooling to access that. You were talking about HDF5, and there has been an evolution from that format to a Zarr-based format, where you have access to the data in a more cloud-optimized way. Those are the types of shifts that we're starting to see: much more sophisticated tooling for storing the data just in cloud buckets with metadata, for example, but then you can bring visualizations right on top of that. One of the main examples that I show in pitch decks and slides is zooming into a terabyte and a half of microscopy data the same way that you would into Google Maps, and you can just keep on going into this. The way that scientists share that today is they're taking screenshots, they're putting it in a panel, they're saying it's A, B, C, "if you zoom in you'll actually see over there." They often are sharing that data, but it's disconnected; it's one step away, and it's hard to then view or interrogate or connect to the narrative story that you're telling. And so there's again that convergence of open data standards, open source visualization and processing tools that sit on top of that, and then this integration layer that can give you a much more compelling way to tell scientific narratives.
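As an illustration of the Google Maps-style zooming described above, the following is a minimal Python sketch of how a viewer over chunked, multi-resolution data might decide what to fetch. It is not Zarr's API and not the microscopy viewer mentioned in the conversation; the function names pick_level and chunks_for_viewport, the power-of-two resolution pyramid, and the square chunk size are all assumptions made for illustration.

```python
import math


def pick_level(viewport_width: int, screen_width: int, num_levels: int) -> int:
    """Pick a pyramid level for a viewport (hypothetical helper).

    Level 0 is full resolution; each further level halves the resolution.
    The chosen level keeps at least one data pixel per screen pixel.
    """
    if viewport_width <= screen_width:
        return 0
    level = int(math.log2(viewport_width / screen_width))
    return min(level, num_levels - 1)


def chunks_for_viewport(x0: int, y0: int, x1: int, y1: int,
                        level: int, chunk_size: int) -> list:
    """List the (row, col) chunk indices at `level` that cover a viewport.

    The viewport is half-open, [x0, x1) by [y0, y1), in full-resolution pixels.
    """
    scale = 2 ** level  # how many full-resolution pixels one pixel at this level spans
    col_start = (x0 // scale) // chunk_size
    col_end = ((x1 - 1) // scale) // chunk_size
    row_start = (y0 // scale) // chunk_size
    row_end = ((y1 - 1) // scale) // chunk_size
    return [(row, col)
            for row in range(row_start, row_end + 1)
            for col in range(col_start, col_end + 1)]


# A zoomed-in view and a zoomed-out view, both drawn on a 1024-pixel-wide screen.
zoomed_in = chunks_for_viewport(0, 0, 1024, 1024, pick_level(1024, 1024, 5), 512)
zoomed_out = chunks_for_viewport(0, 0, 8192, 8192, pick_level(8192, 1024, 5), 512)
```

In this sketch both views need the same four chunks' worth of data, even though the zoomed-out view covers 64 times the area. That is the property that lets a browser pan and zoom through a dataset far larger than memory by fetching only a few objects from a cloud bucket at a time.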
Data and tooling is one piece of it, but there's also the aspect of how the research is structured, where maybe it's not designed in a way that makes it easy to build on. And there are also, particularly in academia, some perverse incentives as far as maybe making it more difficult for other people to build on your research. Because if you're the one that knows how it works, then it's gonna be easier for you to then continue the research and get the next grant to write the next paper, with the whole publish-or-perish mechanism.
And I'm curious how some of those more social aspects of research are also contributing to the current crisis that we're in as far as being able to reproduce the research that's been done and then reuse it and even commercialise it?
Yeah, this is absolutely the classic mismatch of a social problem and a technical problem, and both of them are not moving at the same time. That was one of the reasons that we started the Continuous Science Foundation: to try and tackle some of those more community-based social problems, and that's through working groups and ways to bring people together and talk about incentives.
The one that I was running this morning was with Creative Commons, on incentivizing reuse and bringing licensing and attribution earlier into the process of the things that you're sharing. That gets exactly at the social challenge of being scooped: if instead you're cited because you shared something earlier, that turns into a form of academic credit that is monetised, maybe in the commercial sense, but also in the promotion sense, in terms of academic tenure and promotion.
And so I think it's a very difficult space to work in, because there are such entrenched social norms about how to share and about what the credit systems are, and then there's the grant funding ecosystem, which is very project based, so often people make a little bit of progress, run out of money, and the effort fails. So yeah, it's a hard problem with so many different angles to it.
What are some of the ways that you and just some of the overall research community have been starting to try and tackle this sociotechnical challenge that has grown up and has had these various misaligned incentives contributing to the current situation?
So I'll talk about a couple of different aspects of what we're doing. One is from the tooling aspect, of just making it possible to share these computational narratives in ways that can be published. That is a project that I started and helped found under the Jupyter organization, called Jupyter Book, that is bringing some of these computational notebooks into narratives, using a Markdown flavor called MyST Markdown (Markedly Structured Text) that brings it all the way to ways that you can actually publish. So it's considering the environment, the code, the data, the narrative, and really thinking about it as a comprehensive package and making that as easy as possible. We're about three or four years into that journey.
There have been tens of thousands of textbooks and educational courses built off of that code base, and we're starting to chip away at bringing that all the way through towards more traditional formats of publishing. And so that I see as one piece of the technical stack.
Another piece is like how do these publishers and societies and institutions actually manage that content because they're used to FTP servers and XML and the technology there is not fit for purpose for any sort of large scale data or computation or the sort of new more modular way of publishing research.
And that's where we're positioning our company, Curvenote, as a scientific content management system that is fit for purpose to bridge that gap, from new ways of creating this knowledge, or even old ways of creating this knowledge like Jupyter Notebooks, and just being able to bridge that gap between the way that research is done today and the way it's published.
And on that publishing aspect, for a long time, various journals were the gatekeepers to whether or not your research would ever see the light of day. And there's obviously the peer review structure in terms of: I am publishing my research, and I need somebody else who is an expert in the field to determine whether or not this is worthy of being published in these different journals. That has been, to some degree, disrupted by preprint servers such as arXiv.
And then there are also other initiatives such as the Journal of Open Source Software that are trying to democratize and commoditize the publishing industry and maybe break some of the stranglehold that the different big journals have. And that's not even getting into the format aspect, where most of these publications are a static PDF. They're a point in time. And so as soon as you get your research published, you're on to the next thing. There's not really a lot of incentive to maintain the code that you used to do your research or to make sure that the data continues to be available, because as soon as your research is done, you have no further use for it. And I'm just wondering how some of that publication ecosystem and the formatting of publication also contributes to some of the challenges that we're facing in the scientific industry.
Yeah. So maybe I'll pick up first on the Journal of Open Source Software that you mentioned. One of the reasons that it came into being was a social challenge: a whole lot of researchers were spending a lot of time on these highly technical software projects but were not able to get credit for them, because the work had to be wrapped up in applied results or some sort of scientific finding. But the scientific software itself has often already been used hundreds of thousands of times, and you don't have any visibility or credit in that space. And so JOSS, the Journal of Open Source Software, came along and created a new journal, a social place to give credit to this unseen labour that was there, and to have something that you can put in a CV and on a promotion resume so that you can actually further your career.
So I think that same lens applies if we're bringing it to some of these new computational ways of working. For example, if you're creating visualization widgets like that microscopy image viewer that I was talking about, that was somebody's whole PhD, multiple years. They definitely published some papers, but actually seeing how that is reused across publications is a different form of impact that you could have if some of these systems for visualising and showing research were better integrated into the data and software and widget ecosystem, these highly specialised ways of explaining software.
So I think there's again that sociotechnical process: there's some technology that we have, but we also need some of these social experiments for new types of journals or new ways of thinking about what a preprint server is, to make it more modular and distribute credit in different ways as well.
And in terms of the technical landscape of investment that you and the broader Continuous Science Foundation are making, what are some of the ways that maybe some of the more modern data engineering tools can be incorporated into that workflow? On the education involved, I know that there's a lot of work going on in terms of things like Data Carpentry and Software Carpentry to bring more software engineering best practices and data engineering best practices into the research ecosystem. And what are some of the ways that the research itself can be structured differently to be more conducive to being collaborative and to other people being able to build on findings, rather than it being this monolithic research effort that is backed by some substantial grant funding and tied to a particular institution and all of the ego and politics that get wrapped up in that?
Yeah. So on the education front, you mentioned Data Carpentry. One of my main collaborators these days is Tracy Teal, who was one of the cofounders of Data Carpentry, and she's currently the CEO of openRxiv, which is the home of bioRxiv and medRxiv, which is the largest biomedical preprint server. That's only happened in the past six months or so, and so I'm very excited for the future there, because there is this lens of data and modularity and curation of the dataset that goes all the way from that advertisement of the work to the utility that you get from the data or the code and the reuse aspects of that. And so that is something that I'm incredibly excited about into the future, along with ways that Curvenote and the Continuous Science Foundation can help in that space and convene and rally around better standards for sharing research.
What are some of the aspects of friction and entrenched interests that are pushing back or fighting against some of the work that you're trying to facilitate?
I mean, I think the main pushback is complacency, especially on the journal side and some of the society side: what they have is good enough for their current business model, and so investing in change is quite difficult there.
Where there have been a lot of shifts these days is in funding and policy, from organisations like the Gates Foundation or Howard Hughes Medical Institute or the Michael J. Fox Foundation, pointing upstream to preprint repositories like arXiv or bioRxiv and using those as a new staging ground, which has slightly different interests and incentives that are more aligned towards sharing research early and sharing it in a more complete form.
And so I think there is a large shift all the way up and down the stack of scientific research that is gonna make some changes in the coming years.
And on the publication side, I know specifically with Curvenote, you're working to make the actual display and interactivity of the research more than just "here's a PDF, good luck, ask me or follow me for questions, or come and find me at some industry or academic conference." And I'm wondering how that contributes to some of the accessibility of the research, but also some of the challenge that that poses as far as the need to continuously maintain the platform in order to be able to keep that research accessible.
Yeah. So in terms of how Curvenote works, the output of what we do is these much more interactive articles. You can come to a Curvenote article, scroll down, and you have an image there, but you actually have a play button on that image. Right now that spins up a server in the cloud using Jupyter, connects to that environment, and then reproduces the results.
Some of the journals that we're working with, like the Microscopy Society of America, are just doing really amazing things with microscopy images, as well as explaining algorithms in different ways, for focusing electron beams and things like that. And so the bar is just going way up for those things. And with the capabilities there, especially with AI coding tools in the loop, you can take your research from "I can create a static picture" to "put some sliders on that and actually make it a little bit more interactive." I think the barrier to those capabilities is just coming down to the floor.
And then to talk about that challenge on the flip side, maintainability: one of the working groups that we ran through the Continuous Science Foundation and FORCE11, which is the Future of Research Communication and e-Scholarship, founded in 2011, was around graceful degradation of these. To your point, environments degrade, datasets erode, links to APIs may not be maintainable into the future. And so one of the ways that we're thinking about that is having alternative fallbacks, so that you can go from that computational, reproducible, interactive widget to a movie to an image, and that image is the thing that you can store in a PDF.
And so over time, those things can and should be designed to degrade gracefully. At the moment of releasing the research, when it's most read and most used, we can have something that has a much higher bar for reuse and reproducibility, but then readership and the world move on. And if it is important, like a climate change model or something that you want to come back to, there are going to be resources to keep those alive. But those are probably going to be in the minority: high-quality and high-profile scientific datasets that are already ongoing projects.
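The fallback chain described here (interactive widget, then movie, then static image) can be sketched in a few lines of Python. This is only an illustration of the graceful degradation idea under stated assumptions, not how Curvenote or the working group implements it; the Rendition class, the best_available function, and the availability checks are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Rendition:
    """One way of presenting a figure (hypothetical structure for this sketch)."""
    kind: str                         # e.g. "interactive", "movie", "image"
    is_available: Callable[[], bool]  # assumed check, e.g. "does the server still start?"


def best_available(renditions: Sequence[Rendition]) -> Optional[str]:
    """Return the kind of the first rendition, richest first, that still works."""
    for rendition in renditions:
        try:
            if rendition.is_available():
                return rendition.kind
        except Exception:
            # A check that fails outright (dead server, missing dataset)
            # is treated the same as "not available".
            continue
    return None


def _environment_gone() -> bool:
    # Simulates a compute environment that has degraded over time.
    raise ConnectionError("compute environment no longer exists")


# Ordered from richest to simplest; the static image is the archival fallback.
figure = [
    Rendition("interactive", _environment_gone),
    Rendition("movie", lambda: False),
    Rendition("image", lambda: True),
]

chosen = best_available(figure)
```

The design choice that matters is the ordering: the richest rendition is tried first, and the simplest one, the image that can also live in a PDF, is the one expected to survive longest.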
And with that more interactive and computationally based publication ecosystem, how does that then simplify or facilitate other people building on top of a particular research finding, where maybe they don't necessarily have to go and rebuild their own entire dataset that may or may not be equivalent to what the other researcher was working from? And what are some of the ways that that facilitates an acceleration of scientific research?
Yeah. I think if you can jump into somebody's research stack with the click of a button, then suddenly you're coming from, like, a month of work: reading into their papers, finding their GitHub repository, digging out their dataset on Zenodo or some other service, and matching it all together in your space, as well as installing their environments and libraries.
If that bar comes to zero, then there's just the possibility to go in, tweak some parameters, and have a look at the code. That ability to build on somebody else's work and reuse it is the space that is standing on the shoulders of giants, and that is the ethos of science that we're leaning into hard.
Because of the fact that you are building this cloud-based environment to make this more interactive, it also brings up the question of viability in the long term, particularly if you have larger datasets that you need to provide access to or that the researcher was relying on. I'm just wondering, architecturally, how you're thinking about some of that storage and optimization process, particularly since science does not have as much of an economic incentive for continuous investment as a lot of the commercial interests that it might be competing with?
Yeah. So Curvenote specifically does not store large-scale datasets. We do work with partners in that space or existing platforms like Zenodo.
One of the folks that is doing a really good job in this space is Source Cooperative. They have all of their storage on just AWS buckets and have a very minimalistic way of thinking about that, so that again you can bring your EC2 instance directly to the datasets, whereas a lot of these other cloud-based storage solutions for archiving really aren't thinking about the compute aspects of that. And so yeah, there are other people who are innovating in this space and also are thinking about that long-term storage and viability as well.
In terms of the overall scientific community, are there particular areas of research or particular domains that are easier to onboard onto something like Curvenote, or that are maybe more advanced in terms of their overall technical acumen to take advantage of these types of platforms versus others? And what are some of the ways that you're seeing that play out as the entire world becomes more computationally based and, you know, by necessity, so does the research ecosystem?
Yeah. I think that's the trend that we're seeing: there's just computation coming into every single scientific field, and some of them are just at a different level right now. We got our start in computational geoscience; that's my background and my cofounder's background.
And then most of our work these days is in computational bioscience, so working with servers like openRxiv and bioRxiv, as well as inside of Howard Hughes Medical Institute, which has some computational neuroscience and ways to share that information. Those are the spaces that we're exposed to right now, but there's a whole lot of fields that are doing many of these things, and over time, yeah, that's just going to be increasing.
As you are working with different research groups and working with people in this community, trying to make research easier and more reliable and repeatable and composable, what does a typical engagement look like as you're starting to understand, okay, what is the research that you're conducting? What are the techniques and technologies that you're relying on? How can I work to help facilitate that, whether by virtue of the educational aspects on the Continuous Science Foundation side or through the technical aspects of Curvenote?
Yeah. If I'm putting my Curvenote hat on, most of the folks who get exposed to what we're doing are coming from a Jupyter-like community, so they're already working in a computational way. They already have skills around datasets and computation and coding, and they just want to have that higher-fidelity, more native way of sharing their research.
And so they're coming into contact with tools like MyST Markdown or Jupyter Book or just Jupyter notebooks, and that's their entry point. I think one of the things that you mentioned there that is really interesting is the education space, with people actually teaching about their field or their science and creating these tutorials or lecture notes. That's a space which can have a lot more flexibility, because it is a single lecturer who's often in control of the entire stack of how they're delivering content. So they can either choose to use a textbook, or they can choose to use some of these open source tools to deliver their research in more computationally enabled ways. And so that's where we're seeing a lot of innovation and crossover between the ways people are teaching and the ways people are wanting to communicate their research.
And especially as you bring that back into science and building upon other people's work, that push and pull between education and teaching and learning about the world is something that we should have a tight feedback loop in.
And so that's again where I'm seeing that you're trying to weave together learning from the research itself and then diving into different fields of research, and trying to make those loops as tight as possible and supporting them with more advanced communication tools.
And as you have been working in this space and working with the research community, what are some of the most interesting or innovative or unexpected ways that you have seen the investment in reusable and composable research applied as far as the ability to unlock new, maybe otherwise unviable research or just some of the ways that you have seen the ecosystem evolve and adopt some of these principles?
With some of the researchers that we're working with, it's just seeing their creativity be unlocked and not be constrained by thinking that the way they deliver science has to be on a piece of paper. That mentality is so ingrained, from the training of scientists to the communication to the evaluation of scientists. So as soon as you give someone this other avenue of, like, "oh my gosh, I could create this widget to explain my research, and it could actually be connected to the data instead of just an advertisement that it was maybe connected," that I think opens up creativity in scientists in such a visceral way.
And I've been in calls with researchers where they lean back in their chair and laugh, because they're just so excited to be able to share their research in that more authentic way.
And in your own experience of working in this ecosystem and facilitating this overall transition to more reusable research and publication, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
I'll start with the most challenging one. When I started Curvenote back in 2019, I thought it was a technical problem of lowering the barrier to entry to just being able to create these computational narratives and share them online.
And so we created a what-you-see-is-what-you-get editor that could integrate directly with Jupyter. You could copy your cell output from that Jupyter notebook and have it exist right in line with your document. We would be working with these PhD students or researchers early on, and they were loving the experience, and then they'd get to the end and say, "How do I download a PDF? Because I need to send this to a journal and get credit for it." And so that was this massive moment of realizing the problem is just so much bigger than the technical tools. That was my awakening into some of the more social problems, and into the fact that we needed to exercise some other muscles in this space and have a broad coalition of people who are taking a step together into this new world.
As you continue to build Curvenote and invest in just the community adoption of some of these newer tools and newer research practices, what are some of the next set of challenges that you're focused on addressing in this overall space of reproducibility and modularity in the research ecosystem?
One of the steps that we're taking at the moment is trying to rally people around a new standard for sharing science that is more computationally enabled and modular, so that you can share the bits and pieces, and that has those principles of graceful degradation in it, so that you can go from interactive widgets down to a picture of the science.
That is an initiative that I've started with Tracy Teal, who's the CEO of openRxiv, called the Open Exchange Architecture, OXA. It's a new initiative where we're trying to come up with a standard that is as widely adopted as the PDF in science, which is a big mission, so that we can unlock something different and move past this ingrained paper-based mentality that permeates the research sharing and promotion culture.
And so that is the biggest challenge that I'm on at the moment.
And are there any other aspects of this ecosystem of research and the challenges that are posed both from the tooling and technology aspect as well as some of the social dynamics that are at play and just the overall work that you're doing in this space that we didn't discuss yet that you'd like to cover before we close out the show?
Yeah. In terms of the way that we're thinking about it, you need to be able to share those bits of research, the modular components, and then have them exist anywhere that you want them. So that's thinking about it from a technical standpoint of interoperability, and then there's a step on top of that, which is composability, so that, much like in a software ecosystem, you can install packages and use them together.
There's nothing like that for science today. Beyond the software aspects, you can't bring some piece of scientific research into yours and actually bring it together in a composable way. And so I think that's the piece that I'm most excited about, because as soon as you get composability into some sort of system, there's just so much that you unlock: the multiplicative effects of compounding progress into the future.
So for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. As the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Yeah. I think the biggest gap, and this is where I'm putting all of my time, is the integration of all of this work together. It's possible to share code, it's possible to share data, it's possible to share narratives, but actually stitching those together in a way that is archivable for the long term, so that you can support the standing-on-the-shoulders-of-giants ethos of science, that's the part that I think is a community and standards and sociotechnical challenge that's ahead of all of us.
All right, well, thank you very much for taking the time today to join me and share the work that you've been doing on Curvenote and the Continuous Science Foundation. It's definitely a very interesting and important area of investment and effort. So I appreciate all the time that you're putting into helping to grease the skids of scientific industry, as it were. So thank you again for that, and I hope you enjoy the rest of your day.
Thank you for hosting.
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
