Postgres vs. Elasticsearch: The Unexpected Winner in High-Stakes Search for Instacart with Ankit Mittal - podcast episode cover

Postgres vs. Elasticsearch: The Unexpected Winner in High-Stakes Search for Instacart with Ankit Mittal

Sep 17, 202522 minEp. 46
--:--
--:--
Download Metacast podcast app
Listen to this episode in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Summary

This episode details Instacart's strategic shift from Elasticsearch to a self-hosted PostgreSQL cluster for its retailer search, driven by the unique demands of fast-changing grocery inventory. Ankit Mittal explains how consolidating search, ranking, and filtering into Postgres, leveraging extensions like PG Vector and robust data pipelines, drastically reduced network calls and improved efficiency. The discussion also covers the architectural decisions, trade-offs, and future directions for building high-performance search systems.

Episode description

In this episode of The Data Engineering Show, Benjamin Wagner sits down with Ankit Mittal, former Senior Engineer at Instacart, to explore how they revolutionized their search infrastructure by transitioning from Elasticsearch to PostgreSQL. Learn how Instacart tackled the unique challenges of fast-moving grocery inventory, achieved high-performance search capabilities, and leveraged PostgreSQL extensions for complex retrieval operations. Whether you're scaling search functionality or optimizing database performance, this deep dive offers valuable insights into building robust, production-ready search systems using PostgreSQL.
  • Discover why Instacart moved from Elasticsearch to PostgreSQL for retailer search
  • Learn about handling real-time inventory updates and search optimization
  • Explore PostgreSQL extensions, sharding strategies, and data flow architecture
  • Understand the trade-offs between different search infrastructure approaches

What You'll Learn:

  • How Instacart managed fast-moving grocery inventory data by consolidating search, ranking, and filtering into a single PostgreSQL cluster
  • Why pushing compute closer to the data layer can significantly improve search performance and reduce network calls
  • The architecture decisions behind using PostgreSQL extensions like PG Vector and custom solutions for search functionality
  • How to implement efficient data ingestion through S3-based pipelines and bulk writes instead of real-time updates
  • Why table maintenance operations like PGD pack are crucial for optimizing read throughput in production environments
  • The trade-offs between traditional search engines and relational databases for complex search implementations
  •  The challenges of maintaining self-hosted PostgreSQL in a predominantly cloud-managed environment

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.
About the Guest(s)

Ankit is a Software Engineer at ParadeDB and former Senior Engineer at Instacart, where he specialized in PostgreSQL infrastructure and search systems. With extensive experience in database optimization and search architecture, he played a key role in modernizing Instacart's search infrastructure by transitioning from Elasticsearch to a custom PostgreSQL solution. In this episode, Ankit shares deep insights into building and scaling high-performance search systems for e-commerce, particularly focusing on the unique challenges of grocery retail's fast-moving inventory. His work at Instacart revolutionized their single-retailer search functionality, demonstrating how traditional relational databases can be adapted for complex search operations. His expertise in database systems and their practical applications in high-scale environments makes this conversation particularly valuable for engineers interested in modern search architecture and database optimization.
Quotes

"Think about it. If there's a lot of things that you can get the database to do, then the applications become simpler." - Ankit
"My non-Instacart experience has largely been in pre-PMF startups where the approach of abuse your database to its absolute limits works wonders." - Ankit
"Almost everything that we got retrieved had to be filtered out. So we go back to Elasticsearch again." - Ankit

"We traded off the quality of retrieval, hardcore core retrieval, with the whole system reducing the network calls." - Ankit
"It's a place to go to find what item is available, in what store, what item is available, at what price, including full product taxonomy graph and product and ontology." - Ankit
"The grand theme here is that we wanted more control over the cluster, how to spin it off, what kind of disks it would have." - Ankit
"We tell teams who want to have their data in this cluster, create an s3 home, create either a bucket or a home, whatever they want to do, and tell us that we would sync ourselves." - Ankit
"What we found is that the read throughput, we can throw more data if the tables are repacked nicely." - Ankit
"Most engineers who want to work on search, they are more used to the Elasticsearch shape of the query." - Ankit
"The relevance is better because they could join more things in the database. They also saw the cost of the normalized data reduced." - Ankit
Resources

Company Websites:
- Instacart - Grocery delivery platform
- ParadeDB - Database technology company
- Firebolt - Cloud data warehouse (firebolt.io)
Tools & Technologies:

- PostgreSQL - Database system

- Elasticsearch - Search engine

- PG Cat/PG Dog - PostgreSQL proxy tools

- PG Vector - PostgreSQL vector extension

- PG Repack - PostgreSQL table repacking tool

- ClickHouse - Column-oriented DBMS

- TantiVy - Rust-based search engine library

Articles:

- Instacart Search Modernization Blog Posts (Series on hybrid retrieval)

- Target's AlloyDB Migration Blog Post


For Feedback & Discussions on Firebolt Core:


 Primary Speakers:

The Data Engineering Show is brought to you by firebolt.io and handcrafted by our friends over at: fame.so

Previous guests include: Joseph Machado of Linkedin, Metthew Weingarten of Disney, Joe Reis and Matt Housely, authors of The Fundamentals of Data Engineering, Zach Wilson of Eczachly Inc, Megan Lieu of Deepnote, Erik Heintare of Bolt, Lior Solomon of Vimeo, Krishna Naidu of Canva, Mike Cohen of Substack, Jens Larsson of Ark, Gunnar Tangring of Klarna, Yoav Shmaria of Similarweb and Xiaoxu Gao of Adyen.

Check out our three most downloaded episodes:

Transcript

Instacart's Initial Search Challenges

Now, what different Instagart was that our items was so fast moving? It's a grocery store. Things go in and out of stock very frequently. Almost everything that we got retrieved had to be filtered out. This is Benjamin. Before we start with today's episode, I wanted to quickly reach out on a personal note. We've just launched Firebolt Core. Firebolt Core is the free, self-hosted version of our query engine. You can run Core anywhere you want.

from your laptop to your on-camp data center to public cloud environments. Core scales out and you can run it in the multi-node configuration. And best of all, it's free forever and has no usage limits. So you can run as many queries as you want and process as much data as you want. Core is great for running either big data ELT jobs on, for example, iceberg tables or powering high-concurrency customer-facing analytics on big data sets. We'd love for you to give it a spin and send us feedback.

You can either join our Discord, enter our GitHub discussions, or you can just shoot me an email at Benjamin at Firebolt.io. We'd love to hear from you. We added a link to Firebolt Core's GitHub repository to this show notes. And with that, let's jump straight into today's episode. All right. Hi, everyone, and welcome back to the Data Engineering Show. Today, we're super happy to have Ankit on. Ankit is, well, now a software engineer for ATB, I think. But before that...

was a senior engineer at Instacart, did a lot of work on Postgres, their data stack. Excited to have you on the show. Do you quickly want to introduce yourself? Yeah, kind of share your background. Yeah, thanks. Honored to be here. So yeah, as of two weeks ago... I was an engineer at Instacart. Now I've just recently moved. I mainly worked on the Postgres storage infra search infra team. And there's a series of blog posts on how the search was modernized that will more talk in detail.

Today, search was modernized with hybrid retrieval. And there's also talk of how we moved from Elasticsearch to Postgres. And we'll talk about the reasons later. Before that, I had worked in a bunch of different startups in Canada and India as well.

always been interested in databases and special love for Postgres because in my experience, choice of database boosted dev productivity in general. Think about it. If there's a lot of things that you can get the database to do, then the applications become simpler. And my non-Instacart experience has largely been in like, I think it's like pre-PMF startups where the approach of abuse your database to its absolute limits, it works wondrous. Now you can argue that it can be done at scale too.

Well, that's for later. Nice. So take us through search at Instacart, right? And I think one thing that also is interesting to our listeners is like, and was similar for me as well for a long time, you have... in your mind like okay here's my traditional relational sql database then there's search things and i think traditionally there would actually be little overlap between these things and i think what you're seeing

Comparing Instacart's Search Retrievals

Nowadays, and I think Gen AI is actually also driving that even harder. It's like you see this kind of merging of these concepts into individual systems that can do everything. So yeah, kind of maybe take us through how do these problems actually pop up in the Instacart product? Yeah, that makes sense. Let's go one step back. Let's talk about retrieval in general. What is the lay of the land of the retrieval? What was the lay of the land? So there is like...

If you go to Instacart app as a consumer, now again, there's different apps for shoppers, there's consumer, and then there's ads where brands do their thing as well. So I'm sure there are retrievals over there as well. But as a consumer, the first thing that you...

get in touch is auto-suggest retrieval. So you type B-A-N and you expect banana to be there and bunch of banana smoothie and whatever is there in grocery store. So that's one retrieval. Then there is, we call it home search, which means you just Say, I want eggs.

The app will show you a list of retailers and a bunch of items in each retailer. That's one retrieval. And then there's another retrieval, which is you know which grocery store you care about because you're very passionate about. I want my eggs from this specific. And that's me.

My eggs come from one grocery store. I know the chicken. I know the mother of the chicken who raised her and she has great eggs. Right. So that's the single retailer search. And that's the biggest, you can say, search retrieval by volume and by server strategy. importance as well like instacart is a retailer first marketplace compared to the peers and this blog only talks about this line we don't talk about anything and even autosuggest

And you can imagine a bunch of other retrievals exist that I don't even know. I'm sure some teams are still using Elasticsearch for their underlying retrieval, like ads manager. I have never even seen that. So it would have some boxes for search. Someone's team was also using ClickHouse via a foreign data wrapper in Postgres because it works for them. So why not? And one flag that I get after the blog was released by Elasticsearch fans is that they felt that...

The blog claims that we kicked Elasticsearch out of Instacart. No, that's not true. We only removed it for the third kind, which is Singular Tailor Search, and there were good reasons for it. So let's deep dive into Singular Tailor Search. So previously, we had...

Consolidating Search with PostgreSQL

an elastic search that was doing the search retrieval. And the other things was application. So what are the other things? Filtering the retrieval for some use cases, ranking and hydrating whatever was retrieved with more attributes. And then a ranking after hydrating as well. It's a bunch of different steps. And it's a standard model, if you imagine. And that's what everyone does if you have multiple kinds of filtering and ranking.

What different Instagart was that? Our items was so fast moving. It's a grocery store. Things go in and out of stock very frequently. And we also have people who care differently about different things. I might care very specific about this brand I care about. Someone would care. I just care about getting eggs at the point. I just care about ice cream. I just want an ice cream right now. So you can imagine the solution we were...

Moving towards was an ensemble of models and the weights of those models being controlled by the context. The context can be the search query, understanding of search query. Now these days, some of the context is enriched by LLMs as well. Because of the fast-moving nature of the grocery store, like things going in and out of availability, in pathological cases,

almost everything that we got retrieved had to be filtered out. So we go back to Elasticsearch again, hey, give us 200 more results or whatever, X number of more results. You do that five times. Now, this could happen because we get updates from retailers that, hey, this item is...

out of stock right now, or our machine learning models tell us that the item is gone. So for an application engineer, think about it. This situation is exactly what you would see if you, by mistake, do an N plus one query. You have to file multiple queries. So what's the solution? Well, the solution is to avoid going back and forth across the network where the network delay eats up all your latency. So we created a cluster that had everything in it.

It had search index. It had ranking and boosting tables. It had availability, machine learning availability, and all other filter tables that the query may care about. And also, we could add more ranking and boostings as the team wanted. And that's what the Postgres was too. In essence, we traded off the quality of retrieval, hardcore core retrieval, with the whole system reducing the network calls. So I guess another way is to say that we pushed down the compute.

to the data layer, closer to the data layer, which is, I guess, an approach opposite to what you guys are more familiar about. We love pushing computer to the data layer. By the way, the beauty about our space, it's confused. There's a lot of confusion. You know, like you said, something now pushing down the computer, the storage, we can even go further and say to get the speeds we need, we need to offload a lot of the compute from reading to writing. Right.

And it goes into pruning, even into the codecs you apply on your data. Like what we say, you push compute down to the storage is actually, in many cases, 80% of what gets you from A to B and gets you to production. Yes, this is all hidden, done by engineers. Go on. So you came to the realization where, okay, we need to do that. So what did you do?

Yeah, so we traded off the elastic search, like PM25 with a TS vector, and some modifications had to be made to how the TS vector is created. But essentially, this gave us avoiding of the pathological cases. And there were side benefits also, like, you know, data stored in normalized tables was...

cheaper than, you know, full-blown denormalized. And also there's operational benefits and simplification of stack, like the infra engineers, which is like sort of my team. We were not... owning operating and managing two different systems or just the application would call single query that does a lot and we'd get done with it

Also, one thing which we missed out writing in the blog is that the cluster is not just a search cluster. So search is one of the workloads it's offering right now. It is a place to go to find what item is available. in what store what item is available, at what price, including full product taxonomy graph and product and ontology. So you could query products, filter, rank them in any sort of flexible way you want in a single query. Now, does that sound like a silver bullet?

Understanding Instacart's Search Query Flow

Maybe. Yeah, so maybe one or two questions just for kind of like me to also understand better the flow. So there's this big Postgres cluster kind of running. It can do a million different things. It can also do search and ranking.

Now I go into my Instacart app and I start looking for, okay, let's stick with your exit example. It's like, how does the request flow actually look here? Like, do you really send a query for every character I type or you... predict how a single interaction with the app of looking up a certain term, how many queries does this actually cause on the backend?

Right, right. That's a very good question. So the single character goes to the search auto-suggest service, which is a completely independent thing. It's still using Elasticsearch. It is always used, and we didn't change it. It was working beautifully. does what it's supposed to do. You got a lot of hate from Elastic fans. You sprinkle in a lot of love.

Or elastic. I mean, it's good. It's very stable and it does what it's supposed to do. So my intention was never to say that, hey, this is bad. This is good. It was like, hey, this data model and this whole system doesn't work for that use case. So basically you're saying if it worked for spell checking, it shouldn't necessarily work for anything else. And it's good that it still does spell checking and searching. That aspect of searching really well. So all good.

And no disrespect to anyone. Yes. And we're doing more than that. We're doing some like if it's a brand search. Because I do remember there was a time when we were running a Red Bull campaign. And if someone would type red, the red bell pepper would be at the top because guess what? The CDR boosting, the CDR, the click-through rates of red bell pepper, there are so many people who buy it day-to-day that they were making it at the top. So all that ought to suggest.

it still would give you the knobs. Benjamin, top K problem. Yeah, auto-suggestion doesn't go into your service. So at what point, right, like again, like maybe take me just through the user flow, would my interaction start sending queries against your system?

So as soon as you press enter that you choose your search term. At that time, the application would construct a SQL query for Postgres and it directly hits Postgres. There's no memcache. There's no other caching systems. It directly hits the cluster.

The application would do certain things like query, we call it query understanding, to build more parameters for the search query to the Postgres engine. And then when we get the response from the Postgres query, it is ranked again by the application.

So we have like different passes of ranking. The first pass ranking is done by the Postgres itself. The second one and the division is just because whatever we want to move fast, whatever application engineers feel that, hey, this is where we want to iterate.

it's better to keep that in application and not push down to the database. And that's it. Now, it sounds simple from the architecture side, but each component of it has its own complexity. I don't think I'm even aware of what that complexity...

PostgreSQL Infrastructure and Data Management

happens. So I can talk more from the architecture, from the data flow side. Hey, it's all because you're using Postgres. This is why it's complicated. You know, yeah, go on. Right. But the flow is very simple. The application prepares query, sends it to... Postgres gets the result. And then there's some hydration that final clients also do. By clients, I mean, the actual mobile apps or the web application, that also is hitting the same Postgres. Gotcha. It's all.

I mean, this wasn't vanilla Postgres in the sense that you had like use Postgres extensions there as well, right? Like use something like PG vector kind of take us through your modding Postgres kind of journey like What extensions did you use? Did you custom build any extensions? That's a good question. So the shape of the poster has changed throughout the years so that the large team here is having control.

over our own Postgres. So that's the only reason why this was self-hosted. It's a self-hosted Postgres. It uses a very new proxy called pgcat. pgcat has moved on and now there's a pgdog as well. the same author. It's pretty good. It's stable. It works. Would you recommend PGCat or PGDog these days? So I have not used PGDog at scale, but I've played it on with it and I like it. Conditions apply, like try it at scale. There could be bugs, but the...

Devs are like, I know like Lev is a friend and he's very responsive. So what it allows is a work with a sharded database in a way as if it was not sharded. So it adds the sharding layer to the PD bouncer. So coming back to the question, the grand theme of here is that we wanted more control over the cluster, how to spin it off, what kind of disks it would have. So that cluster is purely NVMe based. We don't use any network attached.

okay if the primary goes down for an hour. We would rebuild it back from the backup instead of failing over. Like, I didn't want to be on call for that if the process had to be failover. So all the intelligence... here that handles the failover in a way is an application. Even if a replica goes down, you need to blacklist that replica. All that is happening on the application. So you can imagine that the client for the application is a thick client.

It does a lot more than just the query. It does all the failover, high availability. So that allows us to be simple on our infrasight. How does data flow into this Postgres database in the first place? Was it attached to... Kafka, is it just plain updates and deletes kind of being run all of the time? Take us through the entire journey, basically. It's a first-class citizen of how this cluster operates. So all the writes are pipelined. We don't use Kafka.

The interface is S3. So we tell teams, whoever want to have their data in this cluster, create an S3 home, create either a bucket or a home, whatever they want to do, and tell us that we would sync ourselves. So we'd pull S3 for changes. get the data in bulk and all the writes to the Postgres would happen like a bulk write. So like either 16,000 rows or 32,000 rows, whatever that is. So it's append only or there were updates and deletes as well? It is merge.

The Postgres has merged commands. So we would tell teams to tell us what are the primary keys or what are the keys that would help us identify if it is an insert or an update. And there is no real-time writes or update allows that. We also supported deletes through a very special mechanism. It was like you just configure that these rows are supposed to be deleted and we would do it at a cleanup at a later stage. A big part of the system was also PGDPack.

So PGDPack is something that is like a super vacuum or something. It does it very fast, although it uses a lot of IOPS and it uses a lot of system resources to do that. It essentially creates a copy of your table, real-time syncs it, and at the... opportunity it finds, it swaps, hot swaps, it deletes the old. So the new table that you get is defragged, it's nicely packed, it's clustered again. Suppose Gris has this cluster table.

Cluster is an overloaded term. I hate that. It causes so much confusion. But what we get is a nicely clustered table that has its rows tied in a way that helps query at the read time, essentially. We were running like... tons of repack on, like almost always there is some repack running on. And it was hard, like it was tuned heavily because what we found is that the read throughput, we can throw more data if the tables are repacked nicely.

These are fragmented and they're not clustered. The writes would slow down. More of a write throughput optimization. And it's probably an order of magnitude more by the three-pack tuning. So that's how the writes are going in. From the customization, we had in the past a custom extension for a dot product. One of the rankings we were doing in the Postgres was personalization ranking as well. So you would get your products.

And at the end, the final ranking would be personalized. So this was happening in the Postgres through a very, like it's a 50 line extension in C, just to a dot product and give it back to Postgres. Now that was all deprecated because PG Vector came somewhere in. 2023 and we moved our retrieval to pgvector but also pgvector could do your own default dot product nice okay very cool yeah it's always interesting to hear like these production stories from postgres really like the amount of

moving pieces and extensions being used is really kind of cool. The interesting thing is you hear about more and more workloads, use cases where when engineers say simplicity, they mean like we want that Postgres. interface. We want to be able to normalize. We want to be able to have more complex queries running. There are many moving parts. It's not just a on-the-metal time series, real-time, single denormalized.

five seconds, query span. Like this is something completely different up to the point where some of the data is being batch processed on the side and hot swap because you need it to be super hot. The data. As it's being served to those users, it needs to be super hot. SSDs were mentioned, like you can trick your way around it, but then you need to deal with Postgres backend, which is at a whole different place.

You need to have modes, extensions, which gives you exactly the flexibility you need as an engineer to take the concept, the kind of the strength of Postgres and evolve it over time. As you said, like types change, search patterns change.

Developer Experience, Trade-offs, and Future

Absolutely. I love it. Thank you for sharing with us kind of the whole thing end to end and specifically kind of the whole effort you guys are putting on the back end. It was almost a silver bullet, but not quite. There were certain trade-offs as well, like...

Eventually, what we found out was that the DevEx, the developer experience and how engineers use it, especially search engineers, it was less than ideal because you realize that most engineers who want to work on search, they are more used to the Elasticsearch shape of the query.

Even those who knew Postgres among those, they found the final query to be hard to work with. And scaling, provisioning, sharding, operating this self-hosted Postgres in a company that relies on managed cloud for everything else. Like everything else is on RDS. It's amazing, right? That was, again, a hard thing. So you have this small team that does its own thing that no one understands. And we also noticed that replicating production behavior in a non-production.

It was impossible. So if someone is making a change to really know what's the impact, it had to be tested in production. And that makes everyone hard. Like, you know, slower is the velocity. puts more questions in the queue, like, have you made sure everything is right for this? And this sort of thing steered me to what I'm doing now, like to, sort of, as of last two weeks, I joined ParadeDBN. The promise here is that it's a transactional elastic search alternative.

So we ship a Postgres extension that is built on Tantivi, which is a Rust fork of Lucene, and it uses Postgres storage. And that's my personal sort of observation, is that if I have to rebuild the Instacart-like search... It will be like a pre-DB. A lot of things would be, especially the DEMX, would be much simpler. And taking a state into that conversation in different ways. Yesterday, I saw a blog post by Target. They moved their...

search workload to LIDB in a sort of a Postgres compatible way. And it's not clear what were their motivations. They could be similar motivations, but their outcomes are very similar. Like the relevance is better because they could join more things in the database. They also saw the cost of the normalized data reduced. So that's pretty much matching with what we saw at Instacart. Nice. Very, very cool. Well, Ankit, it was amazing to have you on the show. We appreciate the...

Total, Postgres, Instacart, Deep Dive. Yeah, we'll follow your work at ParadeDB. We're excited about all of that. Thank you for being on the show. It is an honor. Thank you. Thank you. The Data Engineering Show is brought to you by Firebolt, the cloud data warehouse for AI apps and low latency analytics. Get your free credits and start your trial at firebolt.io.

This transcript was generated by Metacast using AI and may contain inaccuracies. Learn more about transcripts.
For the best experience, listen in Metacast app for iOS or Android