Hello, and welcome to the Data Engineering Podcast, the show about modern data management. This episode is sponsored by Data Driven dot I o, the free data engineering interview prep platform built by data engineers for data engineers. Have you ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill separate from the job.
Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, datadriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idempotency, watermarks, dead letter queues, change data capture, and back pressure. Every question comes from real data engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb.
Go to data engineering podcast dot com slash data driven today to start practicing. Your host is Tobias Macey, and today I'm interviewing Weimo Liu about the engineering behind PuppyGraph's zero copy ETL for querying your lakehouse as a graph. So Weimo, can you start by introducing yourself?
Hello, everyone. This is Weimo, co-founder of PuppyGraph. The name sounds like self driving. Before that, I worked at a graph database startup called TigerGraph and also on the Google F1 team. F1 is a unified SQL query engine inside Google. It can query all the data across Google without ETL, and it's serving billions of queries per day. Yeah. So that's me.
And do you remember how you first got started working in the data space?
Oh, it's a long story. When I was in college, I was working on some research projects on an open source spatial database. After that, I went to GW, the George Washington University, for my PhD on database sampling techniques. After that, I joined TigerGraph as it was closing its Series A, because the CTO and co-founder is a good friend of my PhD advisor, because he was a professor in the database area as well. Later, I joined Google, working on the SQL query engine. And finally, I convinced my friend to found PuppyGraph.
And so digging now into PuppyGraph, can you give a bit of an overview about what it is that you're building and some of the story behind how it got started and why you decided that this is where you want to spend your time and energy?
Yeah. So PuppyGraph is a federated graph engine on your tables, or even Mongo and other data sources. You don't need to load your data somewhere else. You just connect PuppyGraph and run graph queries, graph patterns, and graph algorithms on top of it. The story is that since 2022, ChatGPT was becoming popular, and some of my friends are founders of large language model projects. And they shared with me that no one will write SQL or any other query in the future, and the agent will do everything. And we felt that, oh, this is a big opportunity. So we're trying to build an engine for agents. And then we thought about what the agent needs. We tried to follow first principles, and then we started to build PuppyGraph. Recently there are a lot of buzzwords like agent harness, but at that time we didn't know it yet. But we believed agents need something like that, so we tried to follow the principle. And currently, we collect a lot of customer feedback as well. And we believe there are three hard requirements for a successful data agent. One is the ability to process unlimited data. The second is sub-second, real-time performance. The third is the so-called agent harness. And ourselves, we picked a graph query language to meet this, to enable ontology enforcement, which is why we believe we are pretty unique in this area.
As you mentioned, the introduction of these agentic capabilities and the graph based knowledge retrieval that a lot of them benefit from has caused a pretty substantial resurgence in the overall interest in graph databases and graph query engines.
And I know that fundamentally graph data, particularly if you're using a graph native storage layer, is very challenging to scale horizontally because of the constraints of graph topologies, difficulties of figuring out where and how to shard, the impact of super nodes within the overall graph structure.
And I'm wondering if you can talk to some of the ways that the architecture of PuppyGraph helps to address some of that or just some of the overall challenges in dealing with graph data, particularly as you move to larger volumes and more complex queries.
Yes, yes. So this is a long-standing challenge in the graph space. I think the first project that solved the graph scalability issue is Pregel from Google, and Google published a paper about it. It's called "Pregel: A System for Large-Scale Graph Processing." And TigerGraph is kind of based on that paper. But internally, we had something evolved from it, because Pregel is kind of an iteration, run by run, and we felt it's too static and too consuming, like MapReduce. So we made it more flexible.
When I was at TigerGraph, people liked it, but big banks, for example, spent eighteen months loading data into it. And our customers complained about this a lot. After I joined Google, I figured out what was wrong. At Google, we query everything. If you have logs, because logs follow a certain format, we just define them as a table. We don't need to load them or rewrite them, and then we query them as a table. And then I thought about whether graph can do that as well. In this case, if we can just query, for example, your data lake like Iceberg, then the scalability of storage is no longer a problem, and the problem becomes how we can scale the computation.
And we do something different. Most graph databases are sharded by nodes, so each partition holds different nodes. But then it's highly dependent on the distribution of the graph. We instead shard the data by edges. In this case, even if you have a super node like Justin Bieber, who may have too many followers, we can just shard on the edges. For example, if he has 3,000,000 followers, we can split them into three partitions. And for others like me, I don't have a lot of followers. But, for example, all the edges between my teammates and their followers can be in the same partition. In this case, the partition sizes can be similar to each other, and we can just shard by the edges to solve the hub node situation. And also, we make graph traversal much easier by doing this. Before that, for example, if you do a 10 hop neighbor traversal, every additional hop increases the complexity exponentially. But now it's kind of a linear increase. And we can solve some 10 hop network queries in one or two seconds with a cluster, because we can shard the edges and also shard the computation. During the hops, during the graph traversal, we can do the shuffling between different nodes. So this is a kind of way to solve the problem.
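To make the hub node point concrete, here is a minimal Python sketch comparing vertex-based sharding with edge-based sharding. It is an illustration only and not PuppyGraph's actual partitioner: the hash choice (`zlib.crc32`), the function names, and the toy data are all assumptions made for the example.

```python
import zlib
from collections import Counter

def node_shard(edge, n):
    # Vertex-based sharding (assumed scheme): an edge lives with its destination vertex.
    _, dst = edge
    return zlib.crc32(str(dst).encode()) % n

def edge_shard(edge, n):
    # Edge-based sharding (assumed scheme): hash the whole (src, dst) pair.
    src, dst = edge
    return zlib.crc32(f"{src}->{dst}".encode()) % n

def partition_sizes(edges, shard_fn, n):
    # Count how many edges land in each of the n partitions.
    counts = Counter(shard_fn(e, n) for e in edges)
    return [counts.get(p, 0) for p in range(n)]

# Hypothetical data: one hub followed by 3,000 accounts, plus a small team.
hub_edges = [(f"fan{i}", "hub") for i in range(3000)]
team_edges = [(f"mate{i}", f"mate{i + 1}") for i in range(10)]
edges = hub_edges + team_edges

by_node = partition_sizes(edges, node_shard, 3)
by_edge = partition_sizes(edges, edge_shard, 3)
print("vertex-sharded partition sizes:", by_node)
print("edge-sharded partition sizes:  ", by_edge)
```

With vertex-based sharding, every follower edge of the hub lands in the same partition, so that partition's size is dictated by the hub. Hashing the edge itself spreads the hub's edges across partitions, which is the balancing effect described above.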
You mentioned too that one of the features or one of the capabilities that you're aiming for with PuppyGraph is to have low latency, and lakehouse architectures in particular struggle with that broadly just because of the architectural fundamentals of it, and there's a lot of work going on at all the different layers to help mitigate that.
But I'm wondering, particularly when you're in a data exploration phase or if you are using agent generated graph traversal queries, which might be very complex or require multiple hops, how you mitigate some of those challenges of being able to cut down on latency while still being able to allow for exploratory and discoverable connections?
Yeah. First, there is a certain overhead if we read data from a lakehouse. We don't optimize for ten millisecond or twenty millisecond queries at all. For that case, there are a lot of all-in-memory solutions, and we just kind of give that up. What we are good at is sub-second or single-digit-second queries. In this case, we still have the overhead, but the overhead usually may be fifteen milliseconds or one hundred milliseconds to fetch from S3, for example. But at the same time, we optimize the computation. We have vectorized evaluation and also MPP. In this case, we can handle more nodes and edges at the same time. And also, we can scale out: more machines, better performance. In this case, we can try to optimize for sub-second queries. And at the same time, because Iceberg has metadata, we can do active caching or adaptive caching. We can gather the data and store the cache in memory and also on local disk. And the next time, after we read the metadata and we see that, oh, this Parquet file was not updated in the last several minutes, it is a kind of cache hit, and then we can just load from local disk or even memory, and the performance will be much better.
There have been other attempts at being able to add a graph traversal layer on top of other storage. I think the GraphX package from Apache Spark is probably the most notable one. Some of the recent attempts at that are the KuzuDB project, I think, was aiming for that, and that I think has been taken over by the LadybugDB fork. And then maybe the most notable recent entry is the LanceGraph package for being able to do graph traversals on top of the underlying Lance table format. And I'm wondering what are some of the areas of inspiration or comparison that you would like to highlight between what you're doing with PuppyGraph and some of those other technologies?
Yeah. I think our project is kind of inspired by GraphX and GraphFrames. But the tricky part is that Spark is not optimized for graph at all. And also, we saw some friends build on top of other SQL query engines like Trino. In this case, you highly depend on the compute framework itself, and usually it's not designed for graph. For example, Spark is optimized for Spark jobs and Spark SQL, and Trino is optimized for SQL. In this case, the engine is optimized for something else first, and then on top of that, it's optimized for graph. The directions are not aligned, so the performance is a big bottleneck. And we also see LanceGraph and Kuzu. I think they are very great products, but they are more for small data. Maybe the storage can be scalable because they can just try to read from Iceberg. But I think we proposed this first, and later, both of them supported reading from object stores. But the issue is when the data is big, and even when the data is not very big, like 200 gigabytes, because the computation of graph is very heavy and the data is highly connected.
In this case, shuffle is a necessary feature for this kind of workload. And we are very good at this kind of stuff. We also saw a lot of all-in-memory solutions like Memgraph, and it's very good at small data because all the data is loaded to memory first and then it does the computation. And I think Kuzu and LanceGraph are also kind of single-machine ones. And this is a long-term problem in the graph world. Since the graph data is highly connected, no one wants to do the shuffling. And then the bottleneck is that if it's small data, everyone is publishing benchmarks and it's very fast. But when it really scales to industry data sizes, it will be a potential problem. And we also see most of our customers coming to us for this because their data is too big, because they're already leveraging Databricks, Iceberg, Trino, or Hive. The data is already there, and it's super big. And when they're trying to have a graph solution, I think we are very unique in this position. That's why we are a small company, but we have a lot of big logos as our customers.
Now digging into some of the data modeling questions, your focus is on that zero ETL aspect, where you don't have to move your data into a different layer, but a lot of data maybe doesn't necessarily have that natural graph topology or you need to do some explicit modeling of it. And, also, there are a few different flavors of graph definitions, whether it's a labeled property graph or RDF triples.
And I'm wondering if you can talk to some of the ways that you approach some of that data discovery and data modeling aspect of being able to take the existing data as it is in its natural, probably tabular structure, and be able to represent that as a graph and manage the evolution, particularly as the underlying schemas evolve?
Yeah. Yeah. You definitely have deep expertise in this area. So this is a typical question. First, let me talk about the ideal case. The ideal case is that all the tables are normalized, and then it's natural for them to be a graph. Like you have a customer table, you have a product table, you have order history. Actually, order history is an edge between customer and product, meaning customer A ordered product B. Those are perfect. And at the same time, in database 101 there are some principles everybody follows when doing data modeling: please create the tables in a normalized way, and it saves a lot of storage cost. And we provide very easy and straightforward data modeling in this situation. But of course, this is the ideal case. Some customers already have normalized tables but denormalized them, because denormalizing is actually pre-joining something to get better performance. But after they reach out to us, they realize, oh, maybe we don't need to do the pre-joining, we just run a graph pattern. And it's also very fast and in real time. So some customers either have normalized tables or they are okay with normalizing their current tables, because before that, the denormalized table was for single-table, wide-table performance. And if they can normalize it and still have very fast performance on graph patterns, it means they can still do joins but without the slow performance of joins. So they are pretty happy.
Another case is that because the table is already in production for some other use case, they don't want to change it at all. That is a tricky part for us. One way is that we have a logical view and then define the graph on the logical view. Another possibility is that we have a very flexible mapping. For example, say you already joined the customer profile and product profile into the wide table for order history. In this case, we can define one column, like a ZIP code, as a node. So it's not a node table, it's just an attribute in a wide table. But we will dedupe it for you logically, and then you can have a flexible mapping from the graph schema to your tables.
So this is another way we do it. And of course, for the same 100 tables, for example, we can create a lot of different graph schemas. Different graph schemas have advantages and disadvantages, and usually they highly depend on the use case and the queries being run. And then we can suggest the best graph schema. But usually, our customers, because they want to build some customer-facing agent system, are pretty familiar with their tables. In this case, we will discuss together and build a graph schema that's best for their use case. Of course, sometimes it's not that good, but if it works, it's fine, and there is always room to optimize it.
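The flexible mapping idea, where a column in a wide table is treated as a node and deduplicated logically, can be sketched in a few lines of Python. This is an illustration under assumed names and data (the `orders` rows, `logical_nodes`, and `logical_edges` are all hypothetical) and does not reflect PuppyGraph's schema format.

```python
# Hypothetical denormalized (wide) order-history rows.
orders = [
    {"order_id": 1, "customer": "alice", "zip_code": "94016", "product": "lamp"},
    {"order_id": 2, "customer": "bob", "zip_code": "94016", "product": "desk"},
    {"order_id": 3, "customer": "alice", "zip_code": "94016", "product": "desk"},
    {"order_id": 4, "customer": "carol", "zip_code": "10001", "product": "lamp"},
]

def logical_nodes(rows, column):
    # Treat the distinct values of one column as node ids (logical dedupe, no new table).
    return sorted({row[column] for row in rows})

def logical_edges(rows, src_column, dst_column):
    # Treat each distinct (src, dst) value pair as one edge.
    return sorted({(row[src_column], row[dst_column]) for row in rows})

zip_nodes = logical_nodes(orders, "zip_code")
lives_in = logical_edges(orders, "customer", "zip_code")
print(zip_nodes)  # ['10001', '94016']
print(lives_in)   # [('alice', '94016'), ('bob', '94016'), ('carol', '10001')]
```

The wide table is never rewritten here. The node and edge sets are derived views over existing columns, which is the sense in which the mapping stays logical.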
Beyond the relational structures, there are also potentially document models you mentioned that you're able to execute across MongoDB as a storage layer. And also, increasingly, we're looking to unstructured data sources and doing some transformation of that into semantically
enriched deep data or extracting structured data from free text. And I'm wondering how you're addressing some of those as well and being able to map that into a graph structure and maybe some of the interesting use cases that you unlock because of the fact that you're able to work across these different storage layers.
Yeah. So for Mongo, it really depends on how unstructured the data is. Mongo is actually pretty structured already, since most collections follow the same pattern, something like a JSON file. Even though they are not flattened into a table, logically you can have a flattened table. When I was at Google, we also did a lot of this work. If it's a nested field, like a.b.c, we can just use SQL to select a.b.c from collection A, something like that. And in this case, we can just connect to MongoDB with a JDBC interface. Maybe it's a surprise to some of the audience that MongoDB can be accessed by JDBC.
In this case, we can just run queries similar to the ones on tables. And for Mongo collections, there is still something like a foreign key, and we can use the keys to link collections to each other to form a graph. And for even more unstructured data, like PDFs, we have two partners. One is from the Trino team, one is my old friend at Google. They are building the index of Google Search. What they are doing is that they already have a bunch of documents like PDFs, and they do the entity extraction and store it in, for example, an Iceberg table. This is an interesting part, because a lot of unstructured data like PDFs, say Bank of America bank statements, even though they are unstructured, the unstructured data itself has some potential structure inside. All the PDFs follow the same pattern: they have a bank account number, they have the home address, they have an account holder name, and they also have transaction tables. So there are certain rules, and our partners help us do the extraction from the PDFs to markdown and then to tables. Both of our partners have worked in this area for many years, the Google Search team and the Trino team, so we partner with them closely. And what we are doing is: you already have tables, then you can have graph queries, and you can also have an agent system query it. We already have a lot of joint customers, and it's pretty smooth. It's even better than, for example, just making chunks and then doing the embedding, since when you do chunks and embeddings, you don't leverage the potential structure inside your documents. For example, if every PDF has a different pattern, then maybe the embedding is better. But if you have 1,000,000 PDFs and all of them follow the exact same pattern, it's actually structured data, a structured representation. So in our experience, if we can flatten those into tables, it has a lot of benefits.
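As a rough illustration of the flattening described in this answer, the Python sketch below turns a nested document into a flat row whose column names are dotted paths. The `flatten` helper and the sample document are assumptions made for the example, not part of any MongoDB or PuppyGraph API.

```python
def flatten(doc, prefix=""):
    # Recursively turn nested dicts into one flat row keyed by dotted paths.
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

# Hypothetical document with a nested field and a foreign-key-like reference.
doc = {"_id": 7, "a": {"b": {"c": 42}}, "owner_id": 3}
row = flatten(doc)
print(row)  # {'_id': 7, 'a.b.c': 42, 'owner_id': 3}
```

Once documents look like rows, a key such as the hypothetical `owner_id` can play the role of a foreign key and be mapped to an edge between two collections.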
Digging into PuppyGraph itself, I'm wondering if you can give a bit more detail on the architecture, some of the technology choices that you're investing in to enable this use case and some of the core ecosystem primitives that you're leaning on to be able to manage the complexity of the space that you're working in?
Yeah. So since the beginning, because we wanted to build a system for agents, we considered different ecosystems and which components to pick. First, I think three years ago, even before that, after the Iceberg team left Netflix, we believed Iceberg would be the standard for OLAP in the very near future. So we pinged the founders and showed them how we can run a three hop graph query on Iceberg without any change. It was much faster than most graph databases on the market, and they were surprised because Iceberg is not optimized for graphs. And we work with them closely. At the beginning, we only supported Iceberg, but my teammates questioned, what if Iceberg won't be popular, or doesn't become popular fast enough? Then we would just go bankrupt. So we support other data sources as well, but our favorite is Iceberg.
Another thing is that we were waiting for agents to be capable enough to generate all the queries automatically without a human in the loop. In this case, we needed to pick an interface. We considered SQL at the beginning, but when humans write SQL, there is a lot of context in their mind. When they write something wrong, they know it, because they know the business logic. Like, you won't join a student with the teacher salary table, otherwise the student will have 100K income last year or something like that, but the query won't throw an error. And since I worked at TigerGraph, I think the graph is the best, because the graph itself not just contains data but also contains the ontology. If people are doing ontology, and the ontology resides in the graph, why don't we just query the graph rather than use the graph to generate better SQL? That's against first principles. And we support Cypher and Gremlin at the same time, because they are popular and there is enough public data on GitHub that large models can understand and generate the queries.
Also, we only provide a Docker image to our customers, because then they can just use Kubernetes to deploy it more easily. So this is basically the interface layer, the graph query languages like Cypher and Gremlin. The computation part is what we are doing by ourselves. And for the storage layer, we just leverage the table format, and the best one for us, of course, is Iceberg. We designed the system this way so it will also be easy to leverage by different communities. We connect the community of graph work and the community of common data engineering work. In the last ten years, I think the graph community has been separate from the data engineering community. The data engineering community has evolved a lot, but the graph community has not changed much in the last ten years. And I think if we can bring the capabilities together, it not only benefits the agents we designed for, but also benefits human users as well.
One of the other use cases for graphs, particularly in the data engineering community that I've come across a few times is for master data management and being able to do things like named entity reconciliation to be able to say these two documents that are talking about slightly differently worded entities are actually the same thing and being able to do some of that resolution
there. And I'm wondering what you're seeing as far as applications of PuppyGraph in that more I'm gonna use air quotes and say traditional data warehousing use cases in addition to these more agentic workloads and maybe some of the cases where those two coincide of being able to use agents to do some of that master data management and entity linking.
Yeah. So we saw some users using graph databases to do entity resolution, and there are also some other ways in SQL. And since now, in our theory, every data warehouse and data lake supports graph, we're trying to apply the entity resolution solutions of graph databases to the SQL table ecosystem. Then the user can just apply the existing solution to the SQL one. And also, what we're doing is that we can write the result back to tables. In this case, for example, Iceberg will be the bus of the data pipeline. Like a bus, we write the result back to Iceberg and read results from Iceberg, and other engines like Trino and Spark SQL read tables and write tables too. In this case, we don't need to talk with each other and do the data loading. Everybody just reads from Iceberg and writes to Iceberg. Then your output can be our input, and our output can be others' input. And we're trying to make the data pipeline easier.
Now circling back around to what you were commenting with some of these other more point solution graph engines that are very efficient on smaller scales and volumes of data.
What are some of the ways that you're seeing people maybe use both in concert where they've got a dedicated graph engine for their data that needs to be low latency and in the hot path of a certain workload, but then using PuppyGraph for more of that scale out across larger volumes of data and maybe being able to transfer data to and from the hot path and into the more warm path at the Iceberg layer?
Yes. Exactly. So this is what we're expecting. In the SQL world, it's pretty common. You have PostgreSQL, and you also have Snowflake, Databricks, and Trino. You have Postgres to handle the transactional CRUD with ACID, and for the large streaming data or batch data, you use Trino or Snowflake or Databricks. But before, in the graph world, it seems everything sat in a position similar to PostgreSQL, and no one cared about the OLAP one. And we're trying to be the OLAP one. At the same time, for graph databases, people won't store all the data in the graph database. Usually, they handle the hot data or transactional data.
In the SQL world, this is just done with CDC or some data loading, like the Airbyte jacket you're wearing, right? So you load the data from PostgreSQL to Iceberg, for example, and then do the OLAP. I think, hopefully, in the graph world, this can be the common practice as well. You don't want to store all the historical data in a graph database. That's too expensive. And also, it affects your transactional QPS when you run some heavy analytical query. And it's also very slow, since we see a lot of timeouts and out-of-memory errors when running that kind of query. But you can still use a graph database for transactional updates, and then load the data to Iceberg, and then use, for example, PuppyGraph to run the OLAP queries.
That will have a lot of benefits, which has been proven very well in the SQL world. So hopefully this can be the common practice. And also, we help some customers stay in the graph world. Before that, they were trying to migrate away from Neo4j to PostgreSQL because of ecosystem problems. And we said that you don't have to do it now. Graph has OLAP as well. So they just do the data loading or CDC from Neo4j to Iceberg, and we run queries on top of Iceberg.
I'm interested in digging a little bit more into some of that translation layer and the data modeling and representation of graph structures in the Iceberg ecosystem because Iceberg was designed primarily
with tabular structures in mind. And I know that, for instance, KuzuDB actually uses a columnar representation under the hood, but I'm just curious what are some of the points of impedance mismatch between a graph native representation on disk, for instance, from something like a Memgraph or a Neo4j and how to actually do that translation into Iceberg and mapping to and from the structural semantics?
Yeah. So in our design, we completely decouple the computation and the storage. In this case, the computation is still in graph mode. For the storage, what we're doing is we define the node operator and the edge operator, and the operators' inputs and outputs are collections of nodes and edges. And we're assuming every graph query, graph pattern, or graph algorithm can be a combination of node operators and edge operators. Then we can do cost-based optimization. And for a single operator, because the input and output are collections, we can do MPP and also vectorized evaluation. Of course, because it's a collection, column-based storage is really, really important. Then at the final stage, we still need to fetch the data. In this case, we just run the Parquet file reader, read the Parquet file, translate it into collections, and pass them to the computation.
And I think this is an interesting part. Before that, all the graph databases were trying to speed up and support complex queries, but they are still close to row-based, because they have to support high-QPS transactional updates. But in the SQL world, everybody knows that OLAP and OLTP need different storage: OLTP needs a row-based one and OLAP needs a column-based one. With the column-based one, it's much more memory efficient, and we can handle much larger data. At the same time, the query complexity is no longer a problem since, for example, one CPU instruction can handle a vector of nodes and edges. And we only access the necessary attributes. Even if one node or edge has 100 attributes, maybe for a single query only three or four are relevant. If it's column-based, we can just leave all the other 97 or 96 attributes on disk. In this case, it's much more memory efficient.
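The column pruning argument can be illustrated with a toy column store in Python. This is a sketch under stated assumptions: `ColumnStore` is a hypothetical in-memory stand-in for a columnar file reader (it is not the Parquet API), and the 100 attributes and 5 rows are made-up data.

```python
class ColumnStore:
    """Hypothetical column-oriented table: one list of values per attribute."""

    def __init__(self, columns):
        self._columns = columns
        self.columns_read = 0  # how many columns were actually fetched

    def read(self, names):
        # Fetch only the requested columns; the rest stay untouched ("on disk").
        batch = {}
        for name in names:
            batch[name] = self._columns[name]
            self.columns_read += 1
        return batch

# 100 attributes, 5 rows each (made-up values).
store = ColumnStore({f"attr{i}": [i + row for row in range(5)] for i in range(100)})

# A query that only needs three of the 100 attributes.
batch = store.read(["attr0", "attr1", "attr2"])
totals = [a + b + c for a, b, c in zip(batch["attr0"], batch["attr1"], batch["attr2"])]

print(store.columns_read)  # 3
print(totals)              # [3, 6, 9, 12, 15]
```

A row-based layout would have to pull all 100 attributes of every node or edge into memory to answer the same query, which is the memory cost the column-based design avoids.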
One of the other challenges for a product like PuppyGraph is that graph engines, as we mentioned before, have been somewhat niche for a while. They're not as broadly adopted as a Postgres or a Snowflake. And I'm wondering what are some of the areas of education that you've had to invest in to help people understand the power and benefits of having that native graph structure and graph traversal capability available to the underlying data that they're already investing in?
Well, I think this is a kind of chicken-and-egg problem, since before, the investment required before you run the first graph query was too heavy. So even for some perfect graph use cases, people still want to, for example, write SQL or write a Spark job to do it. Because even if it's slow and complicated, you don't need to have another copy of the data and another pipeline. In this case, I think it is hard for the user to adopt a native graph engine.
At the same time, we feel that there are actually real requirements, and a lot of users just give up the use case after they try different ways. When they have very complex things, they try complex SQL, but very soon it's no longer human readable. There are hundreds of lines of SQL, or it's too slow. Or a lot of customers have 1,000 tables, but in daily work only 20 or 30 tables are used. All the others are just left there and no one accesses them at all. But now we show the possibility to a lot of our customers, and they see that, for example, they can write a very complex query in a short way, like ten lines of query, whose expressive power is more than 100 lines of SQL.
And the interesting part is that when they feel that, oh, this works and just returns the result, they keep trying our capability and write more complex queries. This is more common when they are using an agent, since the agent doesn't care how complex the query will be. People just ask some question and assign a task to the agent, and the agent will decompose it into subtasks. And each subtask can be more and more complex. Sometimes they send the logs to us to let us help debug.
We feel that even a graph query, at one hundred lines, is no longer readable. But the agent can just write the correct one, which is a surprise for us. And we feel that if we show the stronger capability, that a more complex query can be handled in a short time and the response is in real time, then people want to issue more complex queries and assign the agent more complex tasks.
So we believe that the usage will be larger and larger. Before that, maybe it was limited because of the limitation of the tools, so people had to give up some wonderful ideas. But now they can just try, and the agent can help them try. So the cost is pretty low now, and they can try some fancy ideas without heavy investment.
Another element of the overall graph ecosystem that has varying levels of support depending on the underlying engine or the language that you're working within is the core graph query and traversal, and then a lot of engines will add another layer of out of the box graph algorithms or graph machine learning or data science capabilities such as betweenness scores and centrality scores, etcetera. And I'm wondering what the capabilities are around PuppyGraph for being able to do some of those more native graph feature extraction and discovery.
Yeah. I think this is something that makes PuppyGraph different from other graph solutions. In most graph solutions, the query engine and the graph algorithms are implemented separately. Like, they have a query engine, and they also have an independent implementation of each graph algorithm, one by one. But for us, we try to leverage our engine for everything, as I mentioned, the node operator and the edge operator stuff. And in this case, when we implement a new graph algorithm, we don't need to start from zero,
like how to process the data in parallel or how to share the data. We just need to implement an algorithm like decode pretext. And after that, we can deliver a new algorithm within one week, something like that. Some of the customers even implement their own graph algorithms in the query language, since, you know, Gremlin is Turing complete. Before that, people didn't do it just because if you implement it through the query engine, it will be too slow. And like JanusGraph, it's a single-threaded engine.
So if you implement on this, the algorithm will only run on a single thread. So it's a potential issue for the performance. But for us, some of our customers just customize their algorithms based on the query language, and that is another option. And also,
we won't charge additionally for that. People are just charged for their usage of the engine. Whether it's an algorithm or a Cypher query or a Gremlin query, we charge the same. We don't have additional, like, enterprise feature charging for that.
And so for somebody who is interested in using the capabilities of graph engines for doing some of that graph traversal discovery, semantic capture, and ontological representation, what are some of the guiding questions that you would ask them to help them determine whether PuppyGraph or another solution is the appropriate solution or maybe even just say just use NetworkX and Python because it's a one off type of use case?
Usually, if our customer already has data in a data warehouse, data lake, or even a database, we recommend they use PuppyGraph, since it makes the pipeline much shorter and the system complexity is much lower. So in this case, they can just have, for example, all the Iceberg tables, define a graph on top of them, and then run the graph queries and the graph algorithms.
And also the results can be written back to Iceberg and then leveraged by Spark or some other tools, and also even PyTorch, this kind of stuff. But some of them, for example, some of the data scientists, they don't care about the data warehousing. They use Python all the way and all the stuff is already in CSV files. We also support it, but it seems it is easier to use an embedded one, like they can just use Python to read the CSV file with something like DuckDB
and then just run some NetworkX on top of it. And then we feel that the people who are using a data lake and data warehouse like us, because the reason they use a data lake and data warehouse is that their data size is big, and they don't want to handle the distribution things. And so we're by nature a good fit. But if people just use DuckDB
and have a small data set, they can use DuckDB to handle things, and it's embedded in Python, and all the stuff can run on a laptop. So we're trying to recommend our product to some of these users, but I don't think we can convince them because, frankly speaking, just the Python ecosystem with DuckDB and also NetworkX is better and more convenient than PuppyGraph.
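To make the small-data case concrete, here is a minimal sketch of what "CSV plus Python on a laptop" can look like. It uses only the standard library as a stand-in for the DuckDB and NetworkX tools mentioned in the conversation; the edge list, the function names, and the choice of degree centrality are all illustrative assumptions, not anything described by the guest.

```python
import csv
import io
from collections import defaultdict

# Hypothetical edge list. In practice this would be a CSV file on disk.
EDGES_CSV = """src,dst
alice,bob
bob,carol
carol,alice
carol,dave
"""

def load_graph(text):
    """Build an undirected adjacency map from src,dst rows."""
    adj = defaultdict(set)
    for row in csv.DictReader(io.StringIO(text)):
        adj[row["src"]].add(row["dst"])
        adj[row["dst"]].add(row["src"])
    return dict(adj)

def degree_centrality(adj):
    """Degree of each node divided by (n - 1), a common normalization."""
    n = len(adj)
    if n <= 1:
        return {node: 0.0 for node in adj}
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

graph = load_graph(EDGES_CSV)
scores = degree_centrality(graph)
top = max(scores, key=scores.get)  # node connected to the most others
```

With a graph library installed, the two helper functions would typically be replaced by library calls. The point of the sketch is only that at this scale no server, pipeline, or distributed engine is involved.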
One of the other interesting aspects of where we are right now is the proliferation of vector embeddings for some of these unstructured sources. And one of the patterns that I'm seeing is using a vector query to determine the starting point into a graph and then doing traversal from there. And I'm wondering how you're seeing people deal with that, particularly if they're using Iceberg as the underlying storage given that Iceberg doesn't really have native vector indexing capabilities.
Yeah. For ourselves, we can query the Iceberg array as a vector. And also, I hear some news from the Iceberg community that they will support it very soon. And I think, like, the LanceDB guys are also actively working with them. And hopefully, Iceberg can have a vector type very soon. But currently, we just query the array type in Iceberg as a vector. And I think it works fine because it's more like an index on top of an array type, and then we can run vector search on top of it.
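The pattern discussed here, using a vector search to pick the entry point and then traversing the graph from it, can be sketched roughly as follows. This is an illustration under stated assumptions, not PuppyGraph's implementation: the node names, the two-dimensional embeddings, and the helper functions are hypothetical, and the search is a brute-force cosine scan over plain arrays rather than an index.

```python
import math
from collections import deque

# Hypothetical data: each node has an embedding stored as a plain array,
# the way an array column might be read back from a table.
NODES = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [0.7, 0.7],
}
# Hypothetical directed edges between the same nodes.
EDGES = {
    "doc_a": ["doc_c"],
    "doc_b": [],
    "doc_c": ["doc_b"],
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_node(query, nodes):
    """Brute-force scan: the node whose embedding is closest to the query."""
    return max(nodes, key=lambda name: cosine(query, nodes[name]))

def traverse(start, edges, max_hops):
    """Breadth-first traversal, returning hop distance for each reached node."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen

seed = nearest_node([0.9, 0.1], NODES)          # vector step picks the start
reachable = traverse(seed, EDGES, max_hops=2)   # graph step expands from it
```

A linear scan like this only works for small collections. At larger scale the vector step would need an approximate nearest-neighbor index, which is the gap in Iceberg that the question refers to.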
And as you have been building PuppyGraph and helping your customers get up to speed with it and understand its applications, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
One of our customers is Palo Alto Networks. They have several teams using our products. Some teams are using it for posture management, which is a customer-facing project. But the interesting part is that the security research team, while they're doing that, they just have all the logs in Iceberg and use PuppyGraph to connect to all their logs and look back, just to visualize all the data. And then they found a botnet. And the author was arrested already. And people believed that the malware was gone
and no one did the detection anymore. But after they used PuppyGraph to look back at the logs, there were still a lot of bots in the botnet attacking all the stuff. And so the botnet is still active, even though the attacker was arrested. So they felt surprised, but they also feel that this is very helpful. And we also didn't expect it. The usage is more like a Splunk, a more complex Splunk. They can just use PuppyGraph as the log reader and then see what happened in the past.
And in your experience of building this product and platform and investing in this zero ETL capability for graph traversals and graph exploration, what are some of the most interesting, unexpected, or challenging lessons that you've learned in the process?
So one case is that at the beginning, we only supported Iceberg, and later Delta Lake and Hudi and Hive. But then some customers wanted us to support databases. We tried to use a similar way, but it's not a Parquet file reader, it's a projection with filters, like select attribute one from table A with some filter. But the feedback from customers was that if we read too much from the transactional database, its QPS will be affected.
And then what we're doing is that we just add a cache layer and cache all the data from the database. But the lucky thing is that
usually the data in a database is not very big. So we just take a snapshot of the database data and then do CDC, which is very different from our initial design. But I think with design partners and early customers, their feedback is very, very important, since what they care about is not whether it's pure at this stage. It's just how we can fit into their production and how they can leverage the PuppyGraph technology.
So I think it's not just that we design and then everybody follows our pattern, but also, after our early adopters try the product, they provide feedback and we try to follow their requests and then have a different design for different use cases. And we feel this is very valuable. And also, like the cybersecurity guys: one and a half years ago, we were total outsiders to cybersecurity.
But after that, different leading cybersecurity companies reached out to us and taught us how the cybersecurity industry can leverage PuppyGraph, and they also helped us find other companies that may use our product as well. They gave us the names and let us reach out to them. And their insight and the terminology are very helpful for us, since just the engine is not that useful, but when we talk with the users a lot, we know what their pain points are and how we can address them.
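The snapshot-plus-CDC cache mentioned in this answer can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `ChangeEvent` shape, the `SnapshotCache` class, and the event names are hypothetical and do not reflect PuppyGraph's actual design. The idea shown is only that reads are served from a local copy that is seeded once and then kept current by applying change events, so queries stop hitting the transactional database.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    """A hypothetical CDC record: one row-level change from the source."""
    op: str                     # "insert", "update", or "delete"
    key: int                    # primary key of the affected row
    row: Optional[dict] = None  # new row contents (None for deletes)

class SnapshotCache:
    """Local copy of a table: seeded from a snapshot, updated by CDC."""

    def __init__(self, snapshot):
        # Copy the rows so later changes do not mutate the caller's data.
        self.rows = {key: dict(row) for key, row in snapshot.items()}

    def apply(self, event):
        """Apply one change event to the cached rows."""
        if event.op in ("insert", "update"):
            self.rows[event.key] = dict(event.row)
        elif event.op == "delete":
            self.rows.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")

    def scan(self, predicate):
        """Serve a filtered read from the cache, not the source database."""
        return [row for row in self.rows.values() if predicate(row)]

# Seed from a one-time snapshot, then replay the change stream.
cache = SnapshotCache({
    1: {"id": 1, "status": "active"},
    2: {"id": 2, "status": "inactive"},
})
for event in [
    ChangeEvent("update", 2, {"id": 2, "status": "active"}),
    ChangeEvent("delete", 1),
    ChangeEvent("insert", 3, {"id": 3, "status": "active"}),
]:
    cache.apply(event)

active = cache.scan(lambda row: row["status"] == "active")
```

The trade-off matches what is described in the conversation: the cache is only practical because the database tables are usually small, and it gives up the pure zero-copy model in exchange for protecting the source database's throughput.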
And what are the cases where PuppyGraph is the wrong choice and either the problem is just not a good fit for graph data generally or you'd be better served with a different graph engine?
Yeah. So one typical case is that, for example, you want to have some personal AI memory storage. And in this case, PuppyGraph is not a good one since, really, for personal AI memory, it's not big. And an embedded solution like maybe Kuzu or some others is better. Like, you can just run it on your laptop. And at the same time, it can support transactional updates. And a single data stack is good enough. And all the stuff can be embedded in
a Python program, for example. And in this case, I think it's a better solution. And also, we have a lot of similar cases. And
I think when we make the judgment, the data size is very important. If the data size is small, I think a graph database is much better, because you are writing data into it and you are reading data from it, and you don't need a data pipeline at all. In this case, I think, especially an embedded graph database like Kuzu is very good. And then you don't need to have a service right now. You just have,
for example, a Python program with embedded Kuzu in it, and then you can have all the functionality you need. So this is one typical case.
And as you continue to build and iterate on PuppyGraph and its capabilities, what are some of the areas of improvement or new features or projects or problem areas that you're looking to dig into in the near to medium term?
One is definitely the enterprise features. Because most of our customers are very big: either our customers are big or our customers' customers are big. So we are supporting the enterprise features, single sign-on, role-based access, and all of it is in preview already. And another thing is that we want to have better support for data warehouses. For the data lake, because it's an open format, we can have access to all the metadata and table stats
and all the related information, and then do cost-based optimization or some others. But for the warehouses, because some of them are pretty closed, we need to collect all the information by ourselves and then have better pushdown and cost-based optimization. But another piece of good news for us is that currently the founders of Parquet and Apache Arrow are working on a project called Columnar, and their project is ADBC.
And a lot of the warehouses are supporting ADBC now. And then we can read data through Apache Arrow, which is much faster than just pure JDBC. So I think because the ecosystem evolves a lot, sometimes we don't need to implement what we need. We just wait, and the feature we need is coming. And also, Columnar is our good partner. And after they shared the project with us, we felt, oh, it's amazing. Yeah. There's a Chinese saying, something like: when you're trying to sleep, you get a pillow.
Are there any other aspects of the work that you're doing on PuppyGraph or this overall zero copy ETL graph traversal capability that we didn't discuss yet that you'd like to cover before we close out the show?
I think this covered it all. You are a super expert and you asked a lot of questions that even we didn't know before. Only after we engaged with our customers and users did they raise them. But definitely you have super long experience in this area. So I think we covered all the questions.
Thank you. And so for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data and AI management today.
Yeah. I think we want to have a better agentic framework. Currently, it's working, and we have the confidence to put it in production with a lot of customers already. And I think the improvement is to make it easier to use. And also, we want to make it easier to connect with the fine tuning and reinforcement learning things and the tools. And then we can have a better framework to embrace the ecosystem.
All right. Well, thank you very much for taking the time today to join me and share all the work that you're doing on PuppyGraph and the different use cases that it enables and some of the technological and architectural challenges of being able to act as that zero copy representation
on top of customers' underlying data. It's definitely a very interesting project and problem space, and I appreciate all of the work that you're doing to make graphs more available and accessible to a broader variety of use cases. So thank you again for that, and I hope you enjoy the rest of your day.
Yeah. Thank you so much for the opportunity.
Yeah. Have a good one.
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
