Your Data, Your Lake: How Observe Uses Iceberg and Streaming ETL for Observability

Jan 18, 2026 | 1 hr 12 min | Ep. 497

Episode description

Summary 
In this episode Jacob Leverich, cofounder and CTO of Observe, talks about applying lakehouse architectures to observability workloads. Jacob discusses Observe’s decision to leverage cloud-native warehousing and open table formats for scale and cost efficiency. He digs into the core pain points teams face with fragmented tools, soaring costs, and data silos, and how a lakehouse approach - paired with streaming ingest via OpenTelemetry, Kafka-backed durability, curated/columnarized tables, and query orchestration - can deliver low-latency, interactive troubleshooting across logs, metrics, and traces at petabyte scale. He also explores the practicalities of loading and organizing telemetry by use case to reduce read amplification, the role of Iceberg (including v3’s JSON shredding) and Snowflake’s implementation, and why open table formats enable “your data in your lake” strategies.
Announcements 
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks' and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest - we all need to Retool how we handle data requests.
  • You’re a developer who wants to innovate—instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps—fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
  • Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
  • Your host is Tobias Macey and today I'm interviewing Jacob Leverich about how data lakehouse technologies can be applied to observability for unlimited scale and orders of magnitude improvement on economics

Interview
 
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of what the major pain points have been in the observability space? (e.g. limited scale/retention, costs, integration fragmentation)
  • What are the elements of the ecosystem and tech stacks that led to that state of the world?
  • What are you building at Observe that circumvents those pain points?
  • What are the major ecosystem evolutions that make this a feasible architecture? (e.g. columnar storage, distributed compute, protocol consolidation)
  • Can you describe the architecture of the Observe platform?
  • How has the design of the platform evolved/changed direction since you first started working on it?
  • What was your process for determining which core technologies to build on top of?
  • What were the missing pieces that you had to engineer around to get a cohesive and performant platform?
  • The perennial problem with observability systems and data lakes is their tendency to succumb to entropy. What are the guardrails that you are relying on to help customers maintain a well-structured and usable repository of information?
  • Data lakehouses are excellent for flexibility and scaling to massive data volumes, but they're not known for being fast. What are the areas of investment in the ecosystem that are changing that narrative?
  • As organizations overcome the constraints of limited retention periods and anxiety over cost, what new use cases does that unlock for their observability data?
  • How do AI applications/agents change the requirements around observability data? (collection, scale, complexity, applications, etc.)
  • What are the most interesting, innovative, or unexpected ways that you have seen Observe/lakehouse technologies used for observability?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Observe?
  • When are Observe/lakehouse technologies the wrong choice?
  • What do you have planned for the future of Observe?

Parting Question
 
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
 
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Transcript

Tobias Macey

Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Composable data infrastructure is great until you spend all of your time gluing it back together. Bruin is an open source framework driven from the command line that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement.

Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you a thousand dollar credit to migrate to Bruin Cloud.

If you lead a data team, you know this pain. Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure.

Type a prompt like build me a self-service reporting tool that lets teams query customer metrics from Databricks, and they get a production ready app with the permissions and governance built in. They can self serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com slash Retool today, that's r e t o o l, and see how other data teams are scaling self-service.

Because let's be honest, we all need to retool how we handle data requests. Your host is Tobias Macey, and today I'm interviewing Jacob Leverich about how data lakehouse technologies can be applied to observability for unlimited scale and orders of magnitude improvement on economics. So Jacob, can you start by introducing yourself?

Jacob Leverich

You betcha. Yeah. Nice to be with you. So I'm Jacob. I'm one of the cofounders at Observe Inc. So we built an observability solution on top of a lakehouse architecture and, you know, my current day job, I guess, is CTO. But, you know, I've kind of worn almost every hat under the sun through our journey over the past several years. And maybe just a little bit more background. So prior to Observe, I was an engineer at Splunk.

So I worked on the core search engine at Splunk, just banging out C++ code, scaling it out to bigger and better use cases. That was a very good experience. I have lots of great things to say about the product and the company and the technology. Prior to Splunk, whirlwind tour, big companies, failed startups. I spent time at Google. Spent time at HP Labs. I spent time at IBM. Did the whole grad school thing. None of it matters.

The thing that's most relevant to what we do now is when I started my career, late nineties, early two thousands, I was a Linux sysadmin, just banging out mountains of Perl scripts to monitor and maintain servers, manage servers in colo facilities, carrying a pager, like a four line alphanumeric Motorola Elite. And so I kind of lived the DevOps and sysadmin dream in the formative years of my career. And so

when I arrived at Splunk, it just kinda dawned on me, you know, holy shit. If I'd had this software as a practitioner, I would have been way better at my job. And I just sort of came to understand, like, a couple of different pieces to that. One is, you know, solving the hard technical challenge of bringing together just, like, mountains of log data and having, you know, some place to go and slice and dice on it. But then the second is solving the business process challenge, the organizational challenges. How do you get everyone access to the same data so that when you're troubleshooting at two in the morning when everything's gone down, you know, you kind of all have, like, the same alert to respond to and the same dashboard to look at and the same data to drill into? And I started to appreciate, like, how valuable

that sort of, like, data centralization platform is for, like, the job of system administration or site reliability engineering or DevOps or whatever. And so I kind of discovered there that I was actually, you know, like, surprisingly passionate about this space. And so it kind of led to founding Observe.

So we kind of do a lot of similar stuff. You know, we target software developers and SREs and IT analysts and security analysts, folks that have these profoundly large volumes of disparate data, but, ultimately, you know, kind of need to be able to bring it together and collaborate to solve challenging problems, answer difficult questions. And so we build a product, that sort of ties a lot of this stuff together, which I'm sure we'll talk a lot more about in a second.

Tobias Macey

And so I actually share the origin story of starting as a sysadmin in the Linux space in particular. So it's interesting, the journeys that you end up taking. And I guess, along the path from where you were to where you are now, what would you say is the point where you actually got started in the data management space?

Jacob Leverich

You know, I'll just riff on this for a second because I think when you phrase it that way, it kind of triggered some memories in me. I mean, obviously, back when I was doing centralized log management, you know, it's like using, you know, rsyslog to centralize stuff, but that's kind of

small stuff. I think, actually, when I got into grad school, you know, I was working on computer architecture and doing lots of, like, large simulations of microprocessors, all these cycle-accurate, you know, microprocessor simulations, and, you know, collecting memory traces and doing all sorts of analytics on that data. And I kind of found, kind of uniquely amongst all my peers in grad school, it's like, wow, I'm generating tons and tons of data, and I'm trying to analyze all of it.

And it doesn't fit on this stupid little desktop that's under my desk. I actually need some hard iron to process this stuff. And so I went through a process of just grant writing to buy server clusters and start to piece together, you know, just like, you know, cruddy,

large, you know, distributed file systems like Gluster to, like, kind of wrangle all of this data and start to build a bunch of crummy tooling to, like, analyze all this data. I was using systems like TORQUE as, like, a cluster workload scheduler to, like, start to process all this stuff. And, eventually, I started playing around with MapReduce, kinda getting into that space, and started to just use that for my distributed data processing. And then a whole bunch of things just started kind of building on top of that. So on the basis of some of the research I was doing about the power proportionality of MapReduce, which is kind of a baroque sort of a topic, I ended up at Google, interning on the MapReduce team there, basically doing performance engineering. That was around 2010.

And that was cool. That was, like, very, very interesting to kind of be, you know, at the locus of, like, where a lot of this very, very interesting large scale data processing work was happening. But there's, like, kind of two things that I discovered during that time at Google, particularly working on MapReduce. And the first was that, by 2010, Google had long since moved on from MapReduce. And, like, no one outside really knew that. And so everyone still thought, like, oh, Hadoop, this is the future of data processing. And it's like, nothing could be further from the truth, you know? So there's, like, all sorts of new things happening in the data processing space, the parallel data processing space, kind of around that time. So that was one thing, starting to learn that there's, like, more going on in this industry than just MapReduce,

and it seems that there was a very, very large reversion back to traditional strategies in the RDBMS space for data management, which is very interesting to me. And then second is that in order to do my job at Google, like, to do this performance engineering work on MapReduce, what I ended up doing all day long was using a tool called Dremel, which was their internal, basically columnar, scale-out SQL analytics query engine, with, you know, separation of storage from compute, columnarizing all the data it works on, multi-stage execution, parallel processing, all that sort of stuff. And I started to learn about that system, and I was like, wow, this thing is amazing for processing petabytes of data.

And hey, it's actually really nice to be able to express all my queries in terms of SQL, even if there's some, like, kind of weird idiosyncrasies dealing with this sort of semi structured nested data. But, like, all things considered, that was an incredibly powerful

tool that was used ubiquitously internally. And so, you know, later on, a lot of that technology got, you know, commercialized in the form of BigQuery. And so I got to kinda see just, like, how a lot of the nuts and bolts of something like that were made and sort of, like, what it's used for and how it sort of blended back into, you know, what's maybe just the more traditional,

you know, data management space. And, you know, if you don't mind, I'm gonna keep going for a second here (Tobias: Absolutely.) because this story kind of, like, just builds like a snowball as time goes on in my career. So when I ended up at Splunk, you know, there's actually a lot of similar aspects to their use case, you know, a large scale log data management system. And, you know, it's kind of roughly architected as a system inspired by MapReduce,

where you have, you know, a cluster of indexers that all have, you know, their portion of the data. When you execute a query, you know, it gets distributed out to all those indexers, they do their partial search, and then it all gets aggregated by a search head. And it's all a very MapReduce style of execution model. And

I think what I could see there was that this was an incredibly powerful commercial tool. The users loved it. I loved it as a user myself, but it was definitely architected kind of in the pre cloud era. You know, Splunk was started in 2003 and was very much architected in the spirit of a shared nothing database where you're heavily reliant on local disk, you know, for storage. It's doing its own replication of data for durability. And,

you know, I think the systems that I sort of saw at Google were very much, you know, cloud native. They were like, hey, we have this thing called, you know, Google File System that we can store

all of the data in. And that means that the storage is separated from the compute. You can bring compute resources online and offline as you wish. Durability is sort of handled for you. You never have to worry about it. And that's very much true of Amazon S3 and, you know, kind of all of the object stores you see in the cloud today, where kind of infinite scalability and durability and low cost are just sort of, like, solved problems.
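
The MapReduce-style execution model described above (each indexer searches only its own portion of the data, and a search head merges the partial results) can be illustrated with a minimal Python sketch. This is not Splunk's implementation; the shard contents and the names `INDEXER_SHARDS`, `partial_search`, and `search_head` are hypothetical, and threads stand in for what would be separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "indexer" holds only its own slice of the log events.
INDEXER_SHARDS = [
    ["ERROR db timeout", "INFO request ok", "ERROR db timeout"],
    ["INFO request ok", "WARN slow response"],
    ["ERROR disk full", "INFO request ok"],
]


def partial_search(shard, term):
    """Map step: one indexer scans its local events and counts matches by log level."""
    return Counter(line.split()[0] for line in shard if term in line)


def search_head(shards, term):
    """Reduce step: fan the query out to every indexer, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: partial_search(shard, term), shards))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


if __name__ == "__main__":
    print(search_head(INDEXER_SHARDS, "db"))  # {'ERROR': 2}
```

In this shape the data lives with the compute that searches it, which is the shared-nothing property discussed here; moving the shards into an object store and letting any worker read any shard is the storage/compute separation being contrasted with it.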

And so the question is, like, you know, how can you take advantage of this very important building block in the architecture of a future data management system? And at Splunk, like, we kinda knew this. Like, we had a bunch of hardcore database folks on the team there, and, you know, me with my experience working with these systems in grad school and at Google, I kinda knew what they should look like to be cloud native. And so we had an awesome road map of, like, all the different things that we were gonna have to build in order to really modernize this platform. And as a technologist, I was like, man, it's my dream. Like, I'd love to spend the next ten years, like, you know, building out all this new technology. But

I had a chance encounter kind of in that time period, around 2016 or so. I had a meeting with the two cofounders at Snowflake. So me and the chief architect at Splunk met with Benoit and Thierry from Snowflake. And it was just the regular meeting of the minds, just chatting about technology and partnership opportunities and blah blah blah. And in preparation for that meeting, I went off and read Snowflake's SIGMOD paper, where they described sort of the architecture of Snowflake,

and it kinda dawned on me like, holy shit. These guys have built our road map. They have built a commercial embodiment of basically all of the best technology ideas that were available at Google in terms of the separation of storage from compute, in terms of how do you do parallel processing of semi structured data, how do you do this in the form of a relational database. I mean, there's so many things there that were just sort of, from

what I saw, it's like, we now had a commercial database that actually solves a lot of the hard scaling challenges and takes advantage of the technologies available now for this use case of, like, the analytics and storage and management of very, very large volumes of semi structured data. And so that kind of sparked the idea for me. It's like, wow. So what would it look like to try to build an observability system on top of

a commodity off the shelf, you know, cloud native data warehouse like Snowflake. And so that's kind of, like, where we started to kind of just, like, work through the nitty gritty details of doing this use case in what's effectively a lakehouse architecture.

Tobias Macey

And that brings us back around to the question of observability. And before we dig too much into some of the benefits of lakehouse architectures for that problem domain, can you start by giving a bit of an overview of some of the key pain points that, in particular, customers of observability products encounter when they get to the decision point of which system am I going to use or which open source components am I going to try to deploy, and how does that act as a speed bump to the solutions that they actually care about, which is I wanna know when my systems are not behaving properly and understand how to get the information I need to be able to fix them?

Jacob Leverich

Yeah. Yeah. Totally. I mean, I guess you and I are both, like, old sysadmin hats, right? So, like, we live and breathe this stuff, but for folks that aren't, like, super familiar with the observability space, I mean, there's maybe a couple of ways people think about it. One is, hey, you know, you're operating large digital infrastructure, whether it's applications or IT infrastructure.

You really just want to understand what's going on with those systems. Are they up or down? Are they performing the way I expect them to? Am I having errors? And so forth. People tend to deal with several different modalities of data, whether it's, you know, system logs and application logs or infrastructure metrics like CPU usage and memory usage and network throughput and stuff like that, but then also request traces.

So, you know, when a message, you know, an API request, comes into, you know, a front end load balancer, it probably talks to five or ten or a hundred different back end systems, and, sort of, how does that message get bounced around? So how do you trace all those different interactions? And so there's just a

large volume of this data in a modern sort of digital environment, and then there's a lot of different modalities of this data. And I think, you know, you also have, like, lots of different, I guess, components to a digital infrastructure, whether it's, say, the front end load balancer, the application server, the database server, you know, a third party API. So there's, like, all these different things that all sort of come together to, like, serve an end user request. And so observability is just really, you know, kind of like, hey, how do we actually, like, cope with this? Like, how do we answer questions about what my end user experience is given this, you know, proliferation of different components and modalities of data? And, you know, what's been very common in the industry for a long time is to have a bunch of best in breed tools for each one of these modalities of data. So folks might put their logs into Splunk or Elasticsearch.

They might put their metrics into Prometheus or Datadog or New Relic or something like that. They might have something like AppDynamics or Dynatrace for their application tracing. And there's, like, a bunch of things that come about as a result of, like, kinda selecting best in breed tools for these different use cases. One is that no single person at the company can answer a question.

Like, whenever there's, like, a real issue, you end up with, like, a war call from hell with, like, 50 people on the line, all looking at the tool that they understand to try to come together to solve a problem. And so that's a very common experience for folks. The second problem is that these tools can be very, very costly. And so they're sort of like premium best in breed tools.

A lot of them were architected kind of in the pre cloud native days, where you're using shared nothing architectures, you're storing all the data on local disk, you're doing replication on your own. So there's all sorts of costs associated with these platforms. And that's whether you buy, you know, like, a SaaS solution or you self host. Like, you kind of bear those costs one way or another. And as a result of, like, the extreme cost of a lot of these tools,

like, people end up actually rationing their usage of them quite a bit. Well, they'll be like, well, you know, I have all these logs, but, like, it's too expensive to put them in this or that, so I might as well just not. And, you know, people end up, like, either not having access to the data that they need to do their job or to, like, kind of actually troubleshoot things, or they end up, you know, creatively misbehaving

by sort of saying, like, well, I'll put some data in this thing, and I'll put all the other data in this other thing that's, like, kind of cruddier but cheaper. And so it just kind of proliferates this challenge that, like, no one on earth can see the big picture. It's really, really difficult to, like, just, like, wrangle all of this data, and as a result, you know, people find it very, very difficult to actually figure out, like, well, how do I answer a question, and how do I, you know, share this kind of dashboard with someone? You know, it

ends up being just a very, very challenging proposition for folks, and it sort of even further exacerbates the cost problem, because it's like, man, if this stuff is so hard to use, even though I'm paying tons of money for Splunk, like, I don't know if it's even worth it, because it's so hard to use and it doesn't have all the data that I need. So, like, what's the point? So, like, there's kind of just all this stuff that gets wrapped up that everyone seems to experience

kind of when we talk to them about these use cases. And it's not just, like, a single cost problem, and it's not, like, a single, you know, consolidation problem. Like, they tend to feed into each other, and as a result of all the constraints that people live with, these things just sort of end up being endemic.

And so those are some of the big, big pain points. And I guess one of the things that's always been a reflection for me is that, you know, I'm describing this from the perspective of the observability user, you know, the SRE or the sysadmin or, say, the application developer, and, you know, they kind of experience all these things. But, I mean, if you talk to a business analyst,

you know, they'll describe all the same stuff. You know? They'll be like, I don't know where any of the data is. And like, oh, some stuff is in this system and other stuff is in this other system, and no one really knows how to do this thing. So I gotta ask the data engineering

team to build a new pipeline for me so I can build this dashboard for my boss, and my boss keeps yelling at me because he never knows what's happening with the business. Like, you know, the exact same issues exist kind of in the traditional

business intelligence space. And so I guess what's funny is that the way that I think about the problem that we solve in the observability space is that we're basically solving the same problem that business analysts have in the BI space, but we're solving it for a different user, and so we kind of need to meet their expectations, which are a little bit different than the typical business analyst's.

Tobias Macey

I think that your observation of a lot of the observability systems having been architected in the pre cloud native

time frame, many of them just at the cusp of cloud native starting to become a thing, but not all of the patterns had been proven out yet. A lot of them, too, were architected with latency being a key consideration: I wanna make sure that I can ingest this data as fast as possible and serve it up as soon as I have it, which is not necessarily the strong suit of many of the

lakehouse architectures. Lakehouses are optimized for: I want to be able to run a query across petabytes or exabytes of data and return an answer to you in something short of twenty four hours, generally in the few-seconds time frame, but not necessarily the millisecond time frame. And as a corollary to the pre cloud native versus cloud native era, I think it's also worth discussing a little bit about some of the juxtaposition

of architectures such as Loki, Cortex, and Mimir and the way that they think about that separation of storage and compute as a means of better economies of scale versus the ways that lakehouse architectures are structured, particularly in the context of columnar data and just some of the ways that cardinality

poses a problem no matter which architecture you're going with. There's not really a cohesive question in there, but I guess I'm most interested in the way that you think about how you're approaching storage, retrieval, querying of

observability data using these lakehouse technologies versus some of the ways that these early era cloud native architectures were designed to take advantage of the economies of scale of S3 and being able to run queries against that in a relatively performant fashion, with Loki, Cortex, and Mimir being the ones that are most top of mind, and I guess Tempo is the other one that I'm aware of.

Jacob Leverich

No. Great. That's a great question and great observation. And so, like, maybe, like, the segue from what I was talking about earlier with, like, kind of the observability

users' expectations and needs being slightly different than the business analyst's needs, you hit the nail on the head. It's like, hey, as a business analyst, I'm okay if it takes thirty minutes for the data to come into the system because I'm gonna run the report at 3 AM, so I don't care. With the observability user, I need to know what's happening right now because the site's down. So, like, the latency

and kind of the ability to do snappy interactive queries is very important. And so what I can say, like, right off the bat is that, yeah, if you try to, like, architect a full data pipeline or data management solution just sort of generically on top of a lakehouse for the observability use case, you're gonna fail. Like, if you just, like, dump all the data into a big table and, like, expect that things are just gonna work, I'm afraid to say that they aren't.

Like, because for exactly the reasons you mentioned, like, these systems are designed for, you know, petabytes of data, and if a lot of my use cases are actually looking at the most recent data very quickly, it's a little bit different. And, you know, it's interesting because when I think about Splunk's architecture, like, Splunk actually had to solve a lot of these problems too, because it also, you know, kind of handles, like, you know, terabytes or petabytes of data in very, very quick interactive queries. And so kind of the trick for us was, like, well, how do we twist a columnar analytics database into

being suitable for a use case like this? And I I think just thinking about it for, like, the the, like, kind of stages of, like, day in the life of the data from, like, data origination through to, like, an end user querying it or doing alerts is, like, kind of the way I think about it. We have to kind of solve problems at each step of the journey there. Starting with like,

you know, data collection, you know, like people have like either log data sitting on disk or they have, you know, infrastructure metrics from like, you know, their their Linux box or their VM or whatever they need to scrape. And so you need an agent to like pull in that data. And so, you know, we're gonna be trying to optimize for like end to end latency. So like, I'm not gonna use something, you know, just totally

generic for data pipelines. I'm probably gonna build something that is, like, purpose-built for more sort of streaming of all this data. And so back in the day, we were just using the open source log collectors and telemetry collectors that are available, like Fluentd, Fluent Bit, Telegraf, that kind of stuff. OpenTelemetry these days has fortunately come into its own and provided a very good answer for the industry of, like, what is a capable,

vendor neutral, open source data collector that's well supported and getting lots of investment? And so we recommend to all of our customers, deploy OpenTelemetry

Collector, you know, sort of here's some golden configurations you can use that you will be successful with. And, we also package our own distribution of OpenTelemetry Collector with a few bells and whistles on, like, troubleshooting and and and setup and stuff like that. But but at the end of the day, it's just OpenTelemetry for kind of collecting the data, which in the simplest case is probably just tailing a log file off a disk, and then sending that data over OTLP,

so the OpenTelemetry Protocol, to a, you know, sort of a sink, you know, some place to, like, collect this data and load it. Your typical lakehouse, you know, doesn't have an OTLP endpoint. You know, there's no OTLP API for, like, S3. And so as part of actually solving the use case, you need some place to actually receive that data and handle authentication. And so we've kind of had to build ingest APIs for serving as a destination for OTLP data.

Then we load that data into Kafka, and just think about, like, loading data into a lake house. Like, you don't do, like, row by row inserts. That doesn't make sense. You're trying to build large, you know, partitions. And so, you know, you kind of kind of need to buffer that data for a couple of reasons. So the first is that you wanna reply

to those data collectors when you have the data, like, durably stored, but you don't wanna rush loading the data into the lakehouse because it's gonna be inefficient to load it row by row. And so what we do is we write the data to Kafka, we commit, and then we reply to the client with a 200. And so our guarantee to collectors like the OTel Collector is when we send that 200, this data is now durable. We've got it. We're gonna make sure it makes it the rest of the way. So that's kind of one of the benefits to having something like Kafka sort of in that ingest path. But then the other benefit is that that's kind of our opportunity to just, like, batch the data and right-size

what we end up loading into the lakehouse. And and there's this this sort of key trade off between latency and efficiency.
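As an illustration of that latency versus efficiency trade-off, here is a minimal Python sketch of a batcher that flushes on whichever comes first: a target batch size or a maximum wait. This is not Observe's loader; the names `BatchPolicy` and `Batcher` and the threshold values are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BatchPolicy:
    # Hypothetical thresholds: flush at ~64 MB or after 10 seconds.
    target_bytes: int = 64 * 1024 * 1024
    max_wait_seconds: float = 10.0


@dataclass
class Batcher:
    policy: BatchPolicy
    _records: List[bytes] = field(default_factory=list)
    _size: int = 0
    _opened_at: Optional[float] = None  # time the current batch was started

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Buffer a record; return a full batch once the size target is hit."""
        if self._opened_at is None:
            self._opened_at = now
        self._records.append(record)
        self._size += len(record)
        if self._size >= self.policy.target_bytes:
            return self._flush()
        return None

    def tick(self, now: float) -> Optional[List[bytes]]:
        """Called periodically; flush a partial batch that has waited too long."""
        if self._opened_at is not None and now - self._opened_at >= self.policy.max_wait_seconds:
            return self._flush()
        return None

    def _flush(self) -> List[bytes]:
        batch = self._records
        self._records, self._size, self._opened_at = [], 0, None
        return batch
```

At high ingest rates the size threshold fires almost immediately, so the timer only matters for low-volume streams.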

You know, if your batch size is too small, then you're not gonna be running your data loading jobs in sort of a throughput-oriented manner. So you're not gonna amortize your overheads, and it's gonna be, you know, expensive. So you really don't wanna do loading too frequently. But then also you don't wanna wait too long, because if you wait too long, then, you know, hey, the site's down, and I don't wanna wait thirty minutes to go see the logs. You know? And so, like, I need to load the data swiftly. And so what we've had to do with our loading is actually build a loader that's very, very dynamic with respect to these aspects. You know, sort of never waiting too long before loading data, so always having,

good timers and being tunable with regard to this. But then also doing everything we can to batch data up before loading it and to try to make sure that we're loading data, you know, tens to hundreds of megabytes at a time so that we can get the efficiencies of sort of throughput-oriented processing and

batch sizes for this kind of lakehouse loading. And I guess one of the things that's kind of interesting is that, you know, that kind of sounds like a terrible trade-off, like, oh, you know, I don't wanna wait five minutes to load data because I'm waiting for, like, a ten megabyte batch. But it turns out that with, like, larger digital operations,

the the volume of data that you generate is so large that you end up getting nice sized partitions very, very quickly, like like, on the order of, like, five to ten seconds. And so I think what we found is that, like, our our solution, our our kind of technology choices, and a sort of, like, lakehouse architecture,

it maybe doesn't make as much sense for, like, very, very low data volumes, like if you're only loading a few tens of gigabytes of data per day. But a lot of the folks that we see that really have these challenges around the high cost of all their existing observability tooling, they have, like, tons of different modalities of data. They have tons of users who have different, you know, needs for this data. They tend to be dealing with surprisingly large volumes of data. And, I mean, on the low end, we see customers that are loading, like, tens of terabytes of data per day. On the higher end, we see people that are loading hundreds of terabytes of data per day or petabytes of data per day into these systems. That's like, you know, 500 terabytes

per day every day, you know, and and wanting, you know, in many cases, needing to keep it for months. You know? And so this is, like, very, very large volumes of data, and you're talking about gigabytes per second of, like, throughput. And so

so even though, like, the natural intuition is that, well, it's gonna be very slow and high latency to load this data into a lakehouse, the reality is that it's a throughput problem. And really, the trick is, like, actually, I need enough parallel processing, you know, throughput, to just even handle the data, and the latency doesn't end up being your biggest problem. So that's kind of one of these counterintuitive

things about this use case, at at least at scale. And so that's kind of the loading of the data, you know, and sort of like kind of the batching, the buffering, you know, kind of the durability, making sure it gets loaded into the lake house efficiently. But but simultaneously,

hey, if you just load all the data into, like, a single big table, and then you run your SQL statement on it, like, yeah, I can't wait twenty four hours for the answer to come back. So, like, you gotta do something about that. And so there's actually, like, two main tricks that, I think, are the key to making all this stuff work. So the first is that when you're dealing with all this observability data in the raw, the OTLP data in the raw, you know, you're talking about unstructured log data. You're talking about just all sorts of these, like, JSON blobs coming out of your distributed tracing data. Like, it's very, very raw. And,

you know, in many cases, like, the the what you wanna do is

either do, like, more of, like, an analytic query on this data. Like, you wanna do, like, a dashboard. I mean, you wanna have a chart on a dashboard. So you wanna, like, have some time series data. And so, like, you know, doing that over the raw data can be very, very expensive, so that that's not ideal. But then also, you're rarely querying, like, all of the data at once. Like, you're usually querying the data, you know, kind of confined to a specific use case or a specific

property, you know, like I'm looking at my database logs, you know, rather than all logs, or I don't wanna look at my VPC flow logs. With a scan-based data architecture, and a lakehouse typically is very much a scan-based thing, the thing you worry about is always read amplification. Like, I don't wanna read a bunch of data that is irrelevant to my query. And so, you know, in a normal data management system, we would have a bunch of indexes,

but, you know, with a lakehouse, you know, you don't really rely as much on random access. That doesn't make sense when you're accessing an object store.

And so so the the trick that we use is actually we just organize the data around use case. So rather than loading data into a single table, we go to the extra effort to sort of, like, curate the data around particular use cases. You know? We put the logs over here. We put the metrics over there. We put the traces over here. Within each of those, we do we make every effort to kinda split it out further, you know, kinda say, like, the VPC flow logs should go into this table.

The database logs should go into that table. And our system is very, very flexible about this. It's basically programmable for doing that kind of data pipeline building. And we've architected this as essentially a big old streaming ETL platform where, you know, data's coming in, and we immediately begin just sort of doing this sort of percolation for, like, the transformation, the curation, the enrichment of the data into these downstream tables

that are the ones that we actually have users query. And so so this this solves, like, two key problems. Like, one is it columnarizes the data so that when I wanna build a dashboard on top of any of these things, that dashboard is only gonna read the column that it concerns. And so that query is gonna be fast. It's gonna, like, make the best use of what you can do with an analytics database, a columnar analytics database.
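The curation step described here, splitting incoming records into per-use-case tables so a query only scans relevant data, can be sketched roughly as follows. This is a toy Python illustration, not Observe's pipeline language; the `source` field, the table names, and the `ROUTES` rules are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]

# Hypothetical routing rules: (destination table, predicate on the raw record).
ROUTES: List[Tuple[str, Callable[[Record], bool]]] = [
    ("vpc_flow_logs", lambda r: r.get("source") == "vpc-flow"),
    ("database_logs", lambda r: r.get("source") == "postgres"),
]
DEFAULT_TABLE = "other_logs"  # catch-all for records no rule matches


def route(record: Record) -> str:
    """Return the name of the first table whose rule matches the record."""
    for table, matches in ROUTES:
        if matches(record):
            return table
    return DEFAULT_TABLE


def curate(records: List[Record]) -> Dict[str, List[Record]]:
    """Group a batch of raw records by destination table."""
    tables: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        tables[route(record)].append(record)
    return dict(tables)
```

A query over `database_logs` then reads only the rows routed there, rather than scanning every log record and filtering.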

But the second is that it solves the read amplification problem. You know, it's sort of like we're we don't have indexes, but we've at least organized the data so that you kind of minimize the degree to which you're getting read amplification. And so, you know, if you don't do those two things, you you end up, yeah, kind of in a very bad profile in terms of query latency and cost when dealing with the system. But if you kind of can amortize

that by doing kind of this one-time data transformation when you load it, you can catch back up with what you would get with a traditional indexed system. And that's kind of the first trick, doing this, like, curation, enrichment, you know, columnarization of the data. But there's actually a second trick that we do that's very important, that is sort of what actually gives us a lot of the interactivity that our users expect, which is that we do not ask our users to go off and just write a random SQL query on this data. We're not doing, like, single shot,

sort of queries on this data. We've built our workflows and UIs and query APIs to abstract SQL away from the user so that we can play a bunch of tricks. And so one of those tricks is that when one of our users queries data, say they query, you know, all the error logs over the past twenty four hours, our back end will actually take that and break it into a series

of SQL queries. And we'll either run those, like, sequentially, like cursoring through time, or it can even run them in parallel. And and the idea is that, you know, hey. If I'm searching for, like, a needle in a haystack, I kinda just wanna see some results quickly.
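A rough Python sketch of that idea, assuming a table with a timestamp column named `ts` (both the column name and the table name below are made up for illustration): split the requested range into windows, newest first, and emit one SQL statement per window so the most recent results can be shown while older windows are still running. The string interpolation is for readability only; real code would use bound parameters.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def time_windows(start: datetime, end: datetime,
                 step: timedelta) -> Iterator[Tuple[datetime, datetime]]:
    """Yield half-open [lo, hi) windows covering [start, end), newest first."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    hi = end
    while hi > start:
        lo = max(start, hi - step)
        yield lo, hi
        hi = lo


def windowed_sql(table: str, predicate: str, start: datetime,
                 end: datetime, step: timedelta) -> Iterator[str]:
    """Yield one query per window instead of one query over the whole range."""
    for lo, hi in time_windows(start, end, step):
        yield (
            f"SELECT * FROM {table} WHERE {predicate} "
            f"AND ts >= '{lo.isoformat()}' AND ts < '{hi.isoformat()}' "
            f"ORDER BY ts DESC"
        )
```

The generated statements can be executed one after another, cursoring back through time, or handed to a pool of workers to run in parallel.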

Like, it's very important to, like, be able to, like, very, very quickly sort of see kind of what hits do I get or what data do I see in the last minute versus the last six hours. And so we've, like, optimized a lot of the query execution around that sort of, like, urgency and timeliness of getting data out, and it's it's forced us to definitely

kind of engineer around a lot of the typical challenges you'd face if you were just trying to run plain, naive SQL queries. But, you know, again, we found that with the volumes of data we're dealing with, doing these tricks, like breaking it into sequences of queries, actually works pretty well. The overhead's

amortized very well. We can get back results to people very, very quickly, and it's not materially more expensive than running the full batch query. And we handle all of, like, the gnarly mechanics of cursoring over all of the, you know, distinct results sets that are being returned by the engine. And so these are the things that we've done to make a lakehouse amenable

to this more real time use case with lots of heterogeneous data. And I'd say just kind of thinking about where the technology is going and where the gaps are, I mean, I think we're clearly operating at the fringe of what lake houses were conceived to be able to do,

and we've certainly built a lot of bespoke solutions to overcome some of the limitations, to great effect for our users. You know? And I guess I would expect over time that, you know, a lot of these strategies will generalize and that people will begin to build, you know, sort of more general purpose query execution engines for lakehouses that incorporate a lot of the ideas that we've incorporated, whether it's handling, you know, sort of efficient

loading of streaming data, fast interactive queries. You know, I think we've at least from the success of our business and our customers and our unit economics and all that sort of stuff, I think, you know, kind of we have, like, the sort of proof that, like, it can be done. It's not easy, but it can be done. You know?

Tobias Macey

And in terms of the overall lakehouse ecosystem, I think one of the key pieces of it that can have maybe an outsized impact on your ability to handle the streaming ingest and performant querying is what table format you build on top of. And given the time frame that you were starting,

that was sort of right at the beginning of the table formats becoming a thing. I think Iceberg may have been started at Netflix around the time that you were starting Observe. Hudi was probably in maybe the early stages of being developed at Uber around that time frame. And I'm just curious how you're thinking about the table metadata layer, the impact that that has on ingest and query capabilities, and then also in particular, some of the ways that these AI native workflows impact

the utility of those table formats. I'm thinking in particular about the Lance format that has native vector indexing capabilities and just some of the ways that the evolution of the space is getting pushed by a lot of these agentic workflows, particularly with things such as AI powered agentic SREs and things like that?

Jacob Leverich

Yeah. You're absolutely right. So when we started, Iceberg wasn't really a thing yet. You know, if I remember correctly, Snowflake went GA, and then the SIGMOD paper came out in 2016. It had a lot of the core ideas in this space. I think Netflix published the Iceberg spec in 2017. And so, yeah, it's kind of like all this stuff was happening right around the time we were founded. Iceberg, in 2017, actually, I remember

learning about it right then. Was like, wow, this is amazing. This is like a lot of the things that Snowflake does, but it was still obviously very early days for it. It's like there's no vendor support clearly at that point. And so, you know, we kind of just as a circumstance of, like, the moment in time, you know, kind of we chose, like, kind of the best solution available at that moment, which was Snowflake, which which it's interesting now when you think about, like, just the the degree of overlap between Snowflake's

internal table formats and and Iceberg, you know, being built with the idea that the data is gonna be stored in commodity object storage, the idea that you're gonna maintain, you know, metadata for columns so that you can do partition pruning efficiently versus what you used to have to do with Hive, the idea that everything's gonna be built around a, kind of snapshot isolation model,

so that, you know, it's very easy to provide consistent views of the data and to handle things like schema evolution, all that sort of stuff. I mean, it's just so funny because you can read the SIGMOD paper from Snowflake and you can read the Iceberg spec, and they, like, hit all of the same bullet points. And so this has always been just, like, totally apparent, you know, for me as like a

solution vendor building on top of a database like this. It's like, yeah, there's like very, very there are commonalities to all this technology and sort of where does this stuff go in the future? And and I think what's very, very exciting now is to see the degree of industry adoption

of these table formats. And I think, you know, with Iceberg, what sealed it for me was when Amazon announced that S3 Tables was gonna be based on Iceberg. That was a clear indication that, like, yeah, Iceberg is going to be

a fixture in the data management space, and we can start to make bets on it. And meanwhile, you know, fortunately, I think that both Snowflake and Databricks saw that, like, kind of trying to build their own, you know, silos... and silos is, like, maybe too partial a term, because I know Databricks has always been open source with respect to this stuff. It was clearly, like, a vendor-specific,

you know, sort of thing. Once we saw both Snowflake and Databricks start to really lean into Iceberg, for us, that meant it was go time. It was go time to start actually figuring out, like, what do these table formats mean for the future of data management for observability? And there's a couple of things that I could see, and that we're actively acting on now. So one is that in all of my conversations with data officers and CIOs and CSOs and stuff like that, like, everyone wants to consolidate around a common

sort of data strategy for a large enterprise. And and, you know, a lot of times, like, the, you know, companies,

you know, basically, they wanna own the data. They want the data to be in their data lake. They want it to be in formats that they can take advantage of. They want to load the data once and use it for multiple use cases. And so the emergence of these open table formats that have, like, real vendor support actually is very, very important because it means that that is now a reality. You actually can begin to ideate about doing things like that. And the second was, well, you know, we've proven out that we can do the lakehouse architecture for observability using Snowflake's proprietary table format. Well, is that gonna hold as we move to a future world where we're using nonproprietary,

you know, sort of table formats? You know, we're using Iceberg or whatever. And so so, fortunately, you know, we've we've had very, very good relationship with Snowflake for for a long time, so we were able to get early access to kind of their their development, features for Iceberg.

And we've been benchmarking the ever living crap out of it and kind of taking all our different use cases and trying to analyze, the different ways in which we process data. And, our experience has been that Snowflake's implementation of Iceberg is essentially at parity with their proprietary table format, which is, like it's just a huge win for for everyone. It's a huge win for us. It's a huge win for our customers. I think it's a win for for Snowflake. They can participate, you know, in this open data ecosystem and and and,

sort of the vision and direction that that, you know, large enterprises wanna have with their data. And so so I think, you know, all kinda goes back to, like, you know, Iceberg and and Snowflake, you know, kind of a lot of the the technological principles underlying those table formats are very, very similar.

And in my experience, it follows through with the actual performance and efficiency outcomes we're getting with it. And so we now have features in our solution where we can selectively store data in Iceberg format rather than Snowflake tables. We can ingest data directly out of Iceberg and just, you know, read it and do searches on it directly.

And we're kind of building towards this world where all data that we load and process, you know, is Iceberg, in the customer's own data lake, their own S3 bucket, in an open table format. You know, our idea is it's your data in your data lake in an open format. That's sort of the vision that we aspire to for this solution.

That also opens up kind of interesting new opportunities. It's like, hey, I described earlier our ingest pipeline, everything we do to load data into Iceberg. That's kind of stuff that we've tuned for our use case and stuff that we've done for loading our data. But also, folks

at the same time, they're going to be building their own data pipelines. They're going to be coming up with their own strategies for loading data into Iceberg. I think as we get into this more, like, open, you know, data lake ecosystem, it opens up opportunities for people to do things like, in the data collection agent, you know, rather than sending OTLP data to some API endpoint to then buffer it and load it into

a lakehouse, well, maybe that data collector can go ahead and make Parquet files directly.

And then you can just load them, like, directly into the S3 bucket, or you can copy it through an S3 endpoint or something like that. I mean, there's probably, like, lots of ways in which to start to even further reimagine how data is loaded into a lakehouse and how you can, you know, either distribute the computation that goes into it or further optimize the cost profile of it or a customer's autonomy

sort of over how data is loaded. And so, as you can tell, I'm very excited about all the different stuff that's going on in the Iceberg ecosystem. I think it's very compatible with our vision for what the observability user needs. Thankfully, it's also very compatible with, you know, the traditional data needs of a business. And so there's lots of synergy and considerable tailwind behind this technology right now, and so we're kind of all over that. I think it's kind of just the trend. It's the direction things are going anyway. And then I wanted to touch on the AI topic you kinda mentioned. You know? Kinda what does this mean for, like, AI and stuff like that? And I think a lot of our perspective on it has been for a long time, you know, you can only use AI to ask questions about data that you have. And, like, all of these, like, sort of challenges

that people have faced with, like, best of breed tools in the observability space in the past, where they have some data over there and some data in this system and some data in that system, and and, you know, they're trying to, like, you know, build together, you know, sort of some agentic reasoning workflow to deal with all that data. It's, like, kind of a disaster if it's, like, all spread out everywhere. You know, I I guess the way I I sort of just dream about this is that, like, as as more and more of this data is centralized

into the lake house, it's sort of like much more practical to imagine that I actually will have the data that I need in order to build this workflow. I don't have to go hunting for it. And then,

you know, the last part of it is, I think, a topic we touched on earlier, which is that, you know, everyone's always concerned about the speed and, you know, maybe the cost of accessing data out of a lakehouse. I guess what I've come to learn, just from the experience of building our product, is that when you go to the effort of actually organizing

the data like we have in a lakehouse, where you sort of, like, split it out on a use case by use case basis and you go to the effort of columnarizing it, it can actually be very, very cost efficient and performant to query data even if, you know, it's sitting at rest in S3. And the last thing that we kinda realized about that strategy of organizing the data is that we did a lot of it for reasons of making the lakehouse architecture work, but then we also did it for the purpose of making it easier for a human to work with this data. I don't typically wanna just ask questions about my raw logs, and I certainly don't wanna query, like, all of my logs. Like, I typically have some use case in mind when I'm going off and dealing with this data. I have a user who reported an issue, and I wanna go see the logs for that user or the health metrics for that user or whatever. And so I always have a use case in mind when I'm going off and interrogating one of these systems. And so when we, you know, started to build this streaming ETL platform and sort of divide data into all these different tables, very much the idea was, you know, let's organize the data around a use case, because that's gonna be both what makes the lakehouse architecture work and what makes it easier for a human to find the data that they're looking for, because it's now organized around concepts that they're familiar with. And, well, when we started building our AI capabilities,

we kind of just, like, already had that, like, curated sort of, like, sort of graph of the data, if you will. And so now now, of course, we call it the context graph. But we do just little tricks. We make sure that when we do our context engineering, we know the schema of all the data. We have descriptions of all the different columns. We load in descriptions of the metrics that are in our system. When we go off and and

run our agentic workflows, like, the first thing it's doing is, like, figuring out, okay. What is the natural language question that a user asked? And then, hey. Like, let's do some vector search on all of the metadata about the data I have in the system. Let's find the right data that may be applicable

to the question they're asking. And then with that whittled-down set, you know, that's the thing that we do our query generation on and then ultimately answer this natural language question. We found that the curation of the data not only made it, you know, easier to use for a human, but it made it more efficient and more accurate

for an LLM agent to work with the data as well. And so I think I think, curiously, like, kind of all the things that that that we do to make a lakehouse effective for a specific use case is also what makes an AI effective on sort of data that's hosted in a lakehouse as well. And then just to kind of, like, I guess, go back full circle with that, it's like you can only answer questions about which you have data. And so so kind of if all these things come together in a world where,

yeah, I actually do have this, like, trusted, centralized repository of all of my data, and it's at least roughly curated so that the AI or the human can go off and navigate it, then we can actually extract some value from it. So sorry. Long-winded way to answer a question.

Tobias Macey

No, that was great. And I guess briefly, one of the key elements in there is "roughly curated," because one of the challenges of data lakes in particular that led to the creation of these lakehouse architectures, which is also a challenge particularly around some of the labeling and cardinality management of logs and metrics and traces, is that entropy will win if you don't fight against it. And so you can have all of the data, but to a point you made earlier, it's completely useless if you can't find it or understand

which data you need for which problem. And I'm wondering if you could just talk to some of the guardrails that you think about factoring in in Observe or just broadly in some of these architectural primitives that help to guard against the effects of entropy in these lakehouse architectures for observability systems?

Jacob Leverich

Yeah. Yeah. Great question. So there's really, like, two things that come to mind right off the bat. So one is that one one of the the thankful things of, sort of building our solution in the context of a specific use case

is that we can kind of be prepared for a lot of, like, use cases right out of the box. And so if someone has, like, you know, like, logs from their AWS, like, Lambda that they wanna, like, do analytics on, like, we we've seen that use case a 100 times, you know, so, like, kind of we already have sort of out of the box pipelines for organizing that data and for making associations

between, you know, sort of the right CloudWatch log groups and the right ARNs for the Lambdas and all that, you know, sort of jazz so that people can, like, work with this data, and no one has to really worry about it. And so I'd say there's, like, an 80%

of use cases that, like, we just sort of handle out of the box, and so people don't need to worry about it. And sort of the benefit of being a vendor in a particular context is that, you know, we can bring our expertise to bear on that. But then naturally there's, yeah, it's what do I do with the other 20%,

and and and more importantly, like, all the infrastructure stuff is great, but, like, my business is unique. You know, one of our customers is Topgolf, and they have tons of I don't know if you know about Topgolf, it's kind of this entertainment venue with golf driving range. And they have venues and they have bays and they have these kiosks with games and all this sort of stuff. And they have a support desk where like, Hey, if one of the kiosks breaks down,

now, oh no, we need to refund this customer, and we need to figure out what to do about it. We need to figure out if it's a common problem with this kiosk or if it's one off. So their questions are all about

kiosks and bays and venues and customers and things like that. And the questions aren't about their low level logs. And so so so, you know, we thankfully kind of have built this system that sort of is designed for doing that kind of curation. We've built into our UIs sort of very, very, like, easy to use tools for building these types of data pipelines

and to build these types of entity models and relationships between data and the curation of the data. And we did that partially to make the lakehouse architecture work, but then also just to try to address these use cases head on. And I think that the thing that gets, you know, kind of typically challenging with data engineering is that, like, when you wanna build a data pipeline like that, it's not a trivial exercise, you know, in terms of

actually, what are you gonna use? Are you gonna use dbt, or how are you gonna actually engineer this whole thing, and how much expertise do you need to do that? And if you're dealing with very, very large volumes of data, how do you handle, like, full refreshes versus incremental updates and all that kind of stuff?

And so our platform sort of being built from the get go to, like, dynamically build these types of pipelines and to handle very, very large volumes of data, we've had to, like, kind of build in a lot of strategies for dealing with those things. And so, for example, first off, when you build a data pipeline in Observe, you use our query language.

So the same query language used for building dashboards is the same language that's used for building a data pipeline. The whole language is, like, very much built around the streaming semantic. And so kind of for whatever expertise someone uses, like kind of interrogating the data, like, they're kind of halfway there to start building data pipelines. And that's very different from the typical experience building like a SQL based data pipeline or a Spark based data pipeline.

You know, this is kind of like a slightly different way of doing that. The second is that, you know, we've kind of built things, again, very much along a streaming semantic, which means that when you build a new data pipeline, you know, it's by default go forward. You know? And so kind of like we'll process, like, data going forward based on sort of this new pipeline. And so, like, the expense of, like, rapid iteration is very low.

And so we find that people kind of, like, get to success very quickly just by, like, kind of building a simple pipeline and just, you know, kind of just, like, continuing to curate it and kind of do extractions in different ways. We try to optimize around just, like, fast interactivity for that stuff, kind of being able to save those pipelines. And so that actually makes kind of the entry cost to building a data pipeline much lower in our solution. And it's very much geared towards, like, the observability user, but, yeah, I think sometimes I show this to traditional

business users, and they're like, I wish I had this. You know? But then but then because everything is built in terms of our query language with the streaming semantic,

we also have the ability to selectively backfill the data. And so kind of after you've arrived at, like, a good shape, kind of like this curated view of the data, you can go back and say, okay, cool. And I'll go backfill this for the past month. And we can, you know, tune, like, how fast do we want it to go, or do we want it to just, like, slowly process that data, and we can decide which size Snowflake warehouses to use to do that stuff. So we've kind of had to just, like, just in terms of practical matter, deal with a lot of the challenges around the evolution

of these data pipelines. And, you know, I think just kind of doing it in the context of a real time interactive use case, like observability has, like, forced us to, like, really reckon with, like, what does it mean to, like, do this stuff quickly? What does it mean to do this stuff without being a data engineering

expert. And sort of, we found, like, a few, I'd say, design points in this space that aren't, you know, readily available kind of throughout the industry, that, yeah, I kind of expect we'll see over time.
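
The go-forward default and the selective backfill described in this answer can be sketched roughly as follows. This is a minimal illustration in plain Python, not Observe's query language or implementation: the `Pipeline` class, `on_event`, `backfill`, `batch_size`, and the `kiosk=...` log format are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

Event = dict  # raw telemetry record; "ts" is an integer timestamp (assumed shape)


@dataclass
class Pipeline:
    """Hypothetical go-forward pipeline with optional backfill."""
    transform: Callable[[Event], Event]
    created_at: int  # events before this timestamp are skipped by default
    output: List[Event] = field(default_factory=list)

    def on_event(self, event: Event) -> None:
        # Streaming path: only data at or after pipeline creation is transformed,
        # so trying out a new pipeline never triggers a full refresh.
        if event["ts"] >= self.created_at:
            self.output.append(self.transform(event))

    def backfill(self, history: Iterable[Event], start: int, batch_size: int = 2) -> int:
        # Selective backfill: reprocess only [start, created_at), in bounded batches.
        # batch_size stands in for the "how fast do we want it to go" knob.
        window = [e for e in history if start <= e["ts"] < self.created_at]
        batches = 0
        for i in range(0, len(window), batch_size):
            self.output.extend(self.transform(e) for e in window[i:i + batch_size])
            batches += 1
        self.output.sort(key=lambda e: e["ts"])  # keep curated output in time order
        return batches


def extract_kiosk(event: Event) -> Event:
    # Curation step: pull a business entity (kiosk id) out of a raw log line.
    return {"ts": event["ts"], "kiosk": event["msg"].split("=")[1]}


history = [{"ts": t, "msg": f"kiosk=k{t}"} for t in range(1, 6)]  # ts 1..5
pipeline = Pipeline(transform=extract_kiosk, created_at=4)
for event in history:
    pipeline.on_event(event)
forward_only = [e["ts"] for e in pipeline.output]

batches_run = pipeline.backfill(history, start=2)
after_backfill = [e["ts"] for e in pipeline.output]
```

In this sketch, iterating on `extract_kiosk` only ever costs the data that arrives after the change, and the historical window is paid for once, when the shape of the curated data has settled.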

Tobias Macey

You're a developer who wants to innovate. Instead, you're stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It's a flexible, unified platform that's built for developers by developers. MongoDB is ACID compliant, enterprise ready with the capabilities you need to ship AI apps fast. That's why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at mongodb.com/build today.

And as you have been building Observe and evolving it and scaling it, what are some of the most interesting or innovative or unexpected ways that you're seeing it applied, or outcomes that you've seen from teams not being constrained by some of these questions of economics or scale?

Jacob Leverich

Good question. You know, I think maybe one of the things that's been more surprising to me is the kind of different users that can all, like, be successful on the solution. And I think, like, maybe there's an obvious version of this, which is security analysts and application developers.

And in many cases, like, they wanna look at the same data. And the difference with the security analysts is that they typically wanna investigate something from, like, you know, a few weeks ago, or they wanna do an evidence check, you know, for a compliance sort of audit. They need to do an evidence check for, like, six months ago. And so, like, they have very different retention requirements.

And so, us offering a solution that kind of does both of those, because it's all stored in a lakehouse and retention's very affordable, you know, it's kind of like maybe an obvious one, but I think the less obvious one is that we see a lot of people with product support teams that get a lot of joy out of using our solution. And I guess a lot of these product support teams, like, you know, they're sort of serving in many cases as sort of a front for sort of engineering teams.

And, you know, the more that, like, a product support analyst can answer a question or can troubleshoot an issue and hand either a well, kind of, you know, designed issue back to an engineering team or can divert an issue from the engineering team altogether,

that's a benefit. And so having multiple teams have access to this data, it definitely seems to be like a productivity boost for everyone, which is really, really cool. And then I think what's been further, I guess, surprising is just, I don't mean to hop on the AI bandwagon because I'm kind of, like, a natural skeptic about most things in life, but, like, what I've seen our users do with very, very simple agentic workflows to answer questions

is kind of starting to become surprising. And I'd say it started, you know, around about the first quarter of last year when we started to see some of the newer foundation models come out, and those things are just a little bit more capable at just, like, sort of reasoning workflows.

And, you know, one of the first things we did is we made an MCP server endpoint that you could hook up Claude to, and we just started asking open ended questions and gave it, like, a few tips about how it could query our data. And it was, like, amazing what it could do in terms of navigating our context graph and sort of generating the queries. And so we've just been continuing to build upon that. And then so now we have support analysts who, you know, know nothing about, like, IT infrastructure or log analytics or anything like that. And they can go off and, like, ask detailed questions about, like, what error did this user experience? And the thing will just go and figure that out. And we've seen now

sort of quantified evidence from our customers

about what that means for them in terms of how many support tickets they don't have to send to the engineering team. You know, it's like, I guess, a ticket diversion kind of question. Like, they can handle more of the tickets in house than they could previously, and we've had customers send us, like, quantified evidence of the lift that they're seeing based on, like, kind of being able to do it, you know, kind of using our system and also using our system with the AI.

And so I guess what I never expected to see was, like, this quantified evidence of the productivity boost and the MTTR impact that our solution would have. And I think it's always been the holy grail in, like, kind of observability and monitoring systems to, like, have this, like, clearly quantified MTTR or productivity outcome. But in many cases, it was very hard to spot it. But we're just seeing it being used in circumstances

where you actually can quantify these things, and we're getting that evidence. And it's, like, sort of, like, wow. This actually did have a lift for this business. Wow. That's like, it's so easy to see what the business value is for our solution, and it's more than just kind of a lot of really cool technical mumbo jumbo that I love and can talk about all day long. It's actually like, oh, I can see why this actually matters. And

it's kind of like just a fun journey kind of, I guess, building a startup and getting to, you know, having a bunch of customers and tens of millions of dollars in ARR, is that we're actually sort of seeing just sort of the real, like, business and personal impact that our technology has on users. And I guess, obviously,

you always wanna get there when you're building a startup, but, like, I'm a technologist at heart. So it wasn't like this, like, obvious thing on my mind, but at this stage, it's just so cool to see it come full circle and kinda see the impact that our solution has in the wild. And

you asked the question, like, what's surprising? I guess it maybe is just, like, gratifying to sort of see our technology work, and I'm just kinda thrilled by all of that.

Tobias Macey

And in your work of building Observe, investing in these lake house patterns for observability data and understanding the technological elements and user experience around that, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?

Jacob Leverich

You know, so I think maybe it's abundantly clear in hindsight, but it wasn't totally obvious to us when we first started. I think when, you know, when we first started, we were very much focused on a lot of the technology challenges of using a lakehouse architecture, and so we had built kind of, like, fairly generic user experiences for querying

data and building charts and building alerts and stuff like that. But, you know, we also kind of built this experience that, like, you know, there's lots of different datasets in the system, and that's, like, a little bit different than the typical user experience when using Elasticsearch or Splunk or something like that, where you just go in and grep all your logs.

And I think, you know, I've kind of mentioned that, like, kind of a lot of what we do is, like, we bring this technology, this, like, kind of modern data management technology, to the observability user, and they've never really had access to this technology before.

But I think the thing that I didn't quite appreciate as much in the very early days is that, well, you also gotta meet your users where they are. You know? So they are used to experiences like Elasticsearch. They're used to query languages like Splunk's language. They're used to query languages like PromQL.

They're used to building dashboards like they do in Datadog. And kind of when you're trying to build a solution that appeals to a user who has probably had experience with some tool like that in the past, you know, anything that you do that is idiosyncratic or sort of, like, doesn't quite meet their just, like, out of the box expectation can be a real burden to their adoption.

And even if you can give them sort of, like, the cost outcome they're looking for, the retention outcome they're looking for, the performance outcome they're looking for, any of those kind of things. Like, you know, if that kind of, like, first impression isn't one that they're expecting,

you know, if the new car smell, you know, isn't kind of a smell that's appealing to them, you know, that can slow down adoption. And so it took us a while to, like, really figure that out, and it took me a while to really talk to enough end users and sort of get hints that people were struggling to do basic things like search for logs. So, you know, the

question for us just came at some point. It's like, okay, so we've got enough feedback now that people are, like, kind of not necessarily seeing us as

a kind of, like, log search experience that they expect to see, so what do we do about it? And so we kinda had just a moment where we had to, like, check our ego for a second. We had to start saying, like, hey. Our solution isn't just about building data pipelines and building this lakehouse architecture. Like, our solution is about making it easy for users to search through logs, and we built, you know, a handful of user experiences in the product that would be very familiar to users coming out of

Splunk or Elasticsearch or Datadog or Grafana. And the idea was to kind of give people this gentle, familiar on ramp into our product so that they could come in and sort of be, you know, like, oh, yeah, oh, there are my logs. I know how to search. This all, like, works as I expect it to. And then, you know, over time,

they would then come to realize it's like, wow, this doesn't just have my logs. It has everyone's logs. Wow. This thing is actually fast. I can query thirty days of data, no problem. Like, we kind of, like, had to set up an environment where people could discover

all of the benefits of what they get out of a lakehouse. And I think just, like, meeting users where they are, you know, again, kinda given their context and what they're expecting to do with this thing and what they've used in the past, turns out to be very, very important for, you know, kind of like making a lakehouse architecture work for a particular user segment. And it's obvious in retrospect, but it wasn't obvious to us when we kinda started at the very beginning. And so

that was a lesson we learned. And as soon as we started to release those user experiences and sort of really position ourselves in that manner, I have our MAU graphs, and I'm always, like, obsessed about our user growth and stuff like that. And there's just, like, a marked, like, inflection point at which, you know, kind of we started to GA these capabilities, and our user adoption went through the roof, particularly

in customers that we were already at. And so just sort of the natural organic user adoption went up. And that was when we knew that, like, oh, wow. This was an important thing to do. And

Tobias Macey

as people are evaluating which observability solution architecture to build on top of, whether commercial or open source, what are the situations where you would advise against using a lakehouse pattern or Observe specifically?

Jacob Leverich

Yeah, yeah, yeah. I mean, I think probably the biggest, like, dimension that this cleaves on is scale. You know, if you're dealing with fairly low volumes of data, it'll fit on your laptop. Don't worry about it. Maybe it fits in memory. Even any of the open source solutions are actually going to work pretty well for this. They're obviously easy to set up. You can kind of begin to understand what value you can get out of your telemetry data.

So, you know, I'm always happy to encourage people, like, yeah, play around with Elasticsearch, play around with Loki. You know, if you wanna throw your logs into ClickHouse, feel free. I mean, like, definitely, there's lots of solutions for kind of starting early on, kind of working with this data. And I think particularly,

like, a lot of, like, kind of the smaller scale deployments, it's also naturally endemic to, like, earlier stage companies and earlier stage businesses where you just, like, don't really have, like, the hundreds of people trying to collaborate or something, or you don't have, like, just all of the different variety of data. You probably have a simpler context. And what goes hand in hand with that, like, kind of earlier stage company or simpler context is that you probably have bigger fish to fry than, like, your observability strategy. Like, you need to make sure that your business works. You need to make sure that your products work and that your users are happy. You know? Like, you might even need, like, sort of user analytics tools before you really need something to do, like, hardcore observability data management. And so I would say, you know, kind of early stage, just pick whatever feels natural. Don't spend too much time on it. Kind of make sure that you're working on the right problem. Make sure that your business is focused on the right things. And then when you start to scale and when you start to, like, hit these, like, hard decision points where you're like, wow. This used to work in Elastic, but, like, I'm spending,

you know, two full time people just keeping the thing running now, and that doesn't seem like it's an efficient use of, like, capital, you know, like, kinda once you start to feel like you have to start rationing things or you're spending too much effort to manage it, that's probably when you ought to start looking at a vendor solution, and you might wanna start considering an architecture that can scale to that data volume.

And that's just kind of a common story I hear from, like, a lot of our customers, is that, like, they are perfectly successful at early stages,

kind of with whatever technology they pick. Because their use cases and needs are very modest, and they probably have more important things to do. But then there's sort of a scale point that they hit where it just, like, becomes overwhelming and they've made hard trade offs or they're already starting to send data into different silos and they're spending just unbelievable

amounts of money on this stuff. And that's usually when, you know, someone like us can come in and just sort of, like, help, I guess, guide people towards, like, maybe there's a different way to do this. We can help you out with, like, all of the negative consequences you're starting to feel with the scale. So scale, I think, is, like, kind of one of the big things, either in terms of data volume or in terms of just sort of, like, the organizational complexity,

you know, lots of different teams, acquisitions, stuff like that.

Tobias Macey

Are there any other aspects of the work that you're doing at Observe or just the overall space of lakehouse architectures for observability data and some of the capabilities that it unlocks that we didn't discuss yet that you would like to cover before we close out the show?

Jacob Leverich

No, not really. I think maybe there's just, like, one, like, on the technology front, you know, I think the one kind of space where we're seeing a little bit of innovation is just how do you build solutions for needle-in-a-haystack search on top of a lakehouse? Sort of like, it's all well and good to columnarize the data, but if you want to pull out just a needle,

you might need something more than just column metadata to do that. And so there's interesting stuff happening in that space. It's something we've developed solutions in. We know other folks are developing solutions in that space. Snowflake's built their search optimization solution. There's sort of all sorts of interesting stuff there. So it's

just such a really interesting space. I think we're definitely, at least for me and Observe, we're living on the fringe of this technology right now. And so I would say just pay attention to all the people working on the fringe and sort of the challenges that they're kind of addressing these days. You know, hopefully in the next, you know, three to five years, a lot of the challenges that we encounter become sort of commodity, generalized solutions, and everyone can take advantage of them. It's sort of when you think about your future planning, you know, like, if people are dealing with these problems now, then you can anticipate we're gonna have good solutions to them in, like, three or five years. And so when you just think about your overall technology strategy and sort of what technologies you wanna incorporate in the future, just keep in mind that this is a fast moving space. A lot of the problems that you might intuitively feel exist today, they tend to get solved sooner than you expect.

So, that'd be it.
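
The point about needing "something more than just column metadata" for needle-in-a-haystack search can be illustrated with a small Python sketch. It compares per-file min/max statistics against a per-file Bloom filter for a point lookup on a high-cardinality value. This is a generic illustration under assumed names (`BloomFilter`, `files`, `needle`); it is not how Observe or Snowflake's search optimization is implemented.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical data files, each holding request IDs in one column.
files = {
    "file0": ["a1", "z9"],
    "file1": ["b7", "m42", "y3"],
    "file2": ["c5", "x8"],
}
needle = "m42"

# Column metadata only: a file is a candidate if min <= needle <= max.
# Random-looking IDs span nearly the whole range in every file, so little is pruned.
minmax_candidates = [
    name for name, ids in files.items() if min(ids) <= needle <= max(ids)
]

# Per-file Bloom filter: built once at write time, consulted at query time.
filters = {}
for name, ids in files.items():
    bloom = BloomFilter()
    for value in ids:
        bloom.add(value)
    filters[name] = bloom

bloom_candidates = [
    name for name, bloom in filters.items() if bloom.might_contain(needle)
]
```

Here the min/max check keeps all three files as candidates, while the Bloom filter always keeps the file that really contains the needle and only rarely keeps any other, which is the kind of read amplification reduction a secondary search structure is meant to buy.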

Tobias Macey

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap of the tooling or technology that's available for data management today.

Jacob Leverich

So, you know, I think it's actually, I guess that's funny. If you'd asked me this question, like, three months ago, I would say it would be the shredding of semi structured data. You know, like, in Iceberg, you know, the Iceberg v1 and v2 specs didn't really have the support for automatically

columnarizing JSON data. And I think that's one of the things that really held Iceberg back for a long time from use cases like observability, because a lot of the data you're dealing with is basically JSON data. Like, metrics labels, it's basically a bag of JSON keys, key values. Same with OpenTelemetry, sort of tracing attributes. You know? It's a bag of key value pairs.

And the fact that those wouldn't be columnarized in Iceberg was, I think, the thing that was holding it back from this use case, and it's one of the things that made Snowflake especially effective for this use case, because it does automatically columnarize JSON data. And the fact that Iceberg v3 now supports shredding, or sub-columnarization, however you wanna call it, that's actually a major unlock for the industry. And so I'm kind of anticipating sort of a surge in

sort of at least observability solutions built on Iceberg as a result of Iceberg v3. And I would anticipate that it's gonna affect a lot of other sort of use cases and segments as well. So it's very exciting stuff. So I think that was the biggest gap that was just plugged.
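
As a rough picture of what shredding a "bag of key value pairs" means, the following Python sketch promotes frequently occurring JSON keys to their own columns and leaves the rest in a residual per-row object. It only illustrates the idea; it is not the Iceberg v3 encoding, and the `shred` function, the `min_fraction` threshold, and the sample attributes are assumptions made for the example.

```python
from collections import Counter


def shred(rows, min_fraction=0.5):
    """Split JSON-like rows into per-key columns plus a residual bag per row.

    Keys present in at least min_fraction of rows become columns; rows that
    lack such a key get None in that column.
    """
    counts = Counter(key for row in rows for key in row)
    hot_keys = sorted(k for k, c in counts.items() if c / len(rows) >= min_fraction)
    columns = {k: [row.get(k) for row in rows] for k in hot_keys}
    residual = [{k: v for k, v in row.items() if k not in columns} for row in rows]
    return columns, residual


# Hypothetical span attributes, in the spirit of OpenTelemetry key/value bags.
rows = [
    {"service": "checkout", "status": 500, "region": "us-east"},
    {"service": "cart", "status": 200},
    {"service": "checkout", "status": 200, "kiosk": "k7"},
]

columns, residual = shred(rows)

# A filter on status now touches one column instead of parsing every JSON blob.
errors = sum(1 for status in columns["status"] if status == 500)
```

With the attributes laid out this way, a query that filters on `status` reads a single contiguous column, which is why columnarized semi-structured data matters for telemetry that is mostly labels and attributes.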

Tobias Macey

Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Observe and some of the technological

challenges and architectural patterns that you found and iterated on to build such an interesting solution for observability data. It's definitely a very critical area that a lot of teams need to be able to have as a reliable substrate for their work. So I appreciate all the time and energy that you're putting into that. I hope you enjoy the rest of your day.

Jacob Leverich

Yeah, thank you so much. I really appreciate this and great questions, so thank you so much.

Tobias Macey

Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com

with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.

Transcript source: Provided by creator in RSS feed: download file