Append-only tables

Nikolay

00:00

Hello, hello, this is Postgres.FM. I'm Nikolay, Postgres.AI, and my co-host is Michael, pgMustard. Hi Michael.

Michael

00:09

Hello Nikolay.

Nikolay

00:10

And guess what we are going to discuss today?

Michael

00:14

I'll guess. Is it append-only tables?

Nikolay

00:18

Exactly.

Michael

00:20

Ah, got it right.

Nikolay

00:21

I was surprised to hear we haven't discussed it in the past I'm sure we touched it many times right

Michael

00:29

yeah for sure it's come up in episodes but me too when I saw it in our listener suggested topics I did have a few searches on our site just to make sure we hadn't talked about it already as a whole episode and yeah agreed but it's not something I see all the time like it's it is it's relatively common like to have an events table but even then I mostly see append-mostly tables rather than

Nikolay

00:59

like oh yeah I was thinking we are going to discuss both. Yeah. Do you feel in the episode for append-mostly tables? No?

Michael

01:11

I don't, well, I actually, I don't think this is super complex. I think it's quite nice because almost by definition if we do accept that it's append-only We've got quite a narrow scope and there's only a few things to consider Maybe it gets a bit more complicated with append-mostly. But yeah, we can definitely cover that today. I think it's still not that complicated

Nikolay

01:33

yeah where do we start like definition

Michael

01:40

go for it yeah why not

Nikolay

01:43

well if we just insert that's append-only that's it

Michael

01:46

insert and select don't I would say yeah no updates no deletes.

Nikolay

01:51

Select is allowed right yeah insert and select this is the only 2 things we can allow ourselves from DML data manipulation language and that's it yeah we just select and insert this is append-only If we have occasional deletes and updates, it's append-mostly or insert-mostly, I don't know how to say.

Michael

02:15

Yeah, I like that.

Nikolay

02:17

And why do we care about this particular use case? Because it has characteristics, right? It usually has... If it's append-only, for example, we don't care about dead tuples anymore. No bloat, right? It's good. And usually we talk about huge volumes of data. And at some point we think, oh, we need to compress it, we need to offload it maybe to cheaper storage, or just clean up, because all data is not needed anymore in raw form.

02:53

Sometimes it's aggregated and in raw form we can just remove it from database. Or we just need bottomless. And usually we want inserts to happen very fast, because a lot of volume is huge. So we need to make sure performance of inserts is good. Did I miss anything?

Michael

03:17

No, I think those things aren't necessarily always true for append-only tables but they correlate like a lot of the use cases for very very fast growing data and and by definition append-only means it's never going to decrease in size. It's only going to keep getting larger

Nikolay

03:33

and larger. Unless you clean it up. Well, there are specific cases. For example, imagine we discussed many times the topic of slow count. And if you can allow a synchronous calculation of count, maybe it's like materialized or something. I don't know. So idea is instead of updating the count somewhere on each insert in the original table You can aggregate operations in intermediate table, and then it's append-only.

04:05

So you register events in some table, and then you process chunk and reflect this in count in final storage. And then you can delete it or better drop this partition or truncate or something. Right, in this case, it's append-only, but it grows, grows, and then the size drops. It happens also, right?

Michael

04:29

Yeah, I think dropping partitions definitely pushes the definition of append-only, but it's the thing that makes sense to do in most, or a lot of cases, at huge scale. But yeah, is it still append-only if we're dropping partitions?

Nikolay

04:48

Yes, this is how we should do

Michael

04:50

it no I know but do you see what I mean you mentioned deleting data but

Nikolay

04:56

well we again again again it's append-only we draw partition we never delete we never update it's append-only but we if we don't need like last year data we already processed it somehow, made all calculations we need, we can get rid of raw data. We just dropping partition, It's the best we can do instead of cleaning up somehow using deletes. I think we need to discuss it because I did it many times and participated in huge projects in very large companies.

05:30

The idea, let's offload all archive data. It was e-commerce. Old orders, let's offload it to cheaper storage for longer-term storage, and then we need to delete it in original place, original database, and it was not partitioned. Deletes, it was a project for a couple of months. Because it was, like downtime is not acceptable, it costs a lot of dollars. E-commerce guys know very well, they can calculate it, each second of downtime, how much it costs to company.

06:09

So if I had partitioned table there, it would be magic. And it's append-only. That particular table was not append-only, right? But it can happen with append-only. For example, we have audit log. Some actions are stored in some append-only table, but we have a policy to store only 2 years of data. Then I would prefer to drop partition with all data, that's it. So cleaning up is a very important topic for append-only tables, this is what I was trying to say.

Michael

06:43

Yeah, I completely agree. I think there are other benefits to partitioning with append-only, or append-mostly as well, due to, like, if we do have the occasional update or delete by having

Nikolay

06:54

partitions yeah

Michael

06:57

well partitioning helps with that as well right so let's let's zoom back out maybe we've got we've got inserts and SELECTs so we do have to we might have to if we're talking about a very very high volume, we might have to worry about insert performance and SELECT performance.

Nikolay

07:15

We can also have, sorry for interrupting, we can also have copy. Yeah, sure. In both ways. And

Michael

07:24

I guess that's about to come up if we're talking about optimizing.

Nikolay

07:26

Also, common table, you know, like table, which reads everything. Yeah, but it's kind of SELECT. So yeah, inserts or SELECTs, no UPDATEs, DELETEs and so on. So what use cases? You wanted to discuss use cases, right?

Michael

07:44

Or even, I was actually thinking of diving straight to performance of, like, I think there's a few things that we don't have to worry about, and a few things that we can then optimize for. Like, if we're having to insert at extremely high volumes, which sometimes these use cases do lend themselves towards.

08:04

You know, if we're, I think IoT for example, Internet of Things, if sensors are sending information and we're logging for each second amongst thousands or tens of thousands of sensors that could be that can end up being a lot of data so Inserting can be a bottleneck and you might make design decisions for those tables that you wouldn't make If you had a different type of table a different type of data So there's that side of things but then there's also the read side of things I think you know

08:35

and I think those things maybe sometimes play off against each other so But the fact we've got append-only we have some benefits to like index-only scans for example become even better I know I know you often talk about always trying to get index-only scans, but in a table where the data is often changing, that can be a losing battle. It can be a battle that's not always worth fighting, or it's maybe not always worth including as many columns to the index, for example.

09:02

There's different trade-offs for append-only versus...

Nikolay

09:06

Let's unwrap everything here. You mentioned so many things in just a minute, right? So first, Let's talk about performance of inserts. I would say the ideal situation is we don't have indexes and we don't have triggers, including foreign key triggers, because foreign key in Postgres is internally implemented via system trigger. This trigger is going to slow down inserts, especially if you need to insert a lot of rows.

09:38

If you just have a few foreign keys, it can multiply the duration of this massive insert. So, ideally, we should get rid of foreign keys and keep as few indexes as possible for this particular case. I remember in some cases, I decided to go without primary keys, you know, breaking relational model and so on. There's no relational model in Postgres in any relational database which implements SQL model, data model, which has null and it breaks relational model completely anyway.

10:17

But this is a different topic. Side note. Anyway, so it's not good to be without foreign, without primary keys, but sometimes you think, oh, I just need to dump these to some table reliably. So we have ACID, so Postgres guarantees, it's stored, it's saved, it's replicated, it's backed up. But even 1 index, sometimes you think, oh, it slows me down. And I remember I decided to leave without primary key. It was a weird case, but it was some archive, maybe just for audit purposes.

10:57

I decided to use BRIN at that time. BRIN is actually a good idea to consider if we have append-only because layout, physically, rows don't move. If we have a row, it's a tuple, it's saved in some block, it's there, right? So this is exactly when BRIN indexes work well. And we had an episode, 1 of our first episodes, I remember. It was. BRIN indexes. BRIN is a block range index, right? Yeah. So it's very lightweight.

11:30

It speeds up select performance, not as good as other indexes, especially B-tree, but it's still good, right? Or we might consider hash indexes also, right, because they might be more lightweight than B-tree sometimes. They're smaller,

Michael

11:47

for example, right? Well, I think, Yeah, but when it comes to append-only, I think you make a really good point. Each index we have slows down the inserts.

12:00

So the fewer the better, possibly none if we aren't, let's say it's a table we're never reading from or it's an audit log that we only ever have to read from extremely rarely we might consider 1 or even 0 indexes on that maybe not an audit log because maybe that's not 1 you would actually be writing an insane volume to but I've read a timescale but sometimes they have to worry about this kind of thing that they have whilst they've designed for

12:28

these kind of time series workloads They've written a good blog post on optimizing inserts and they list all the same things as you and go further. So they, as well as foreign key constraints, basically other constraints can add overhead as well. So for example...

Nikolay

12:45

Of course. Checks

Michael

12:46

A unique constraint... yeah check constraint but unique constraints...

Nikolay

12:50

yeah index additional check for sure

Michael

12:53

yeah so not having it basically deciding for each constraint if you really need it or what value it's adding having it and make

Nikolay

13:02

yes that been said I must say like in most cases, I prefer having primary key. Because it's like the center of consistency, of data consistency, right? So it's good to have. But it depends. It's good that you mentioned timescale, but I think we will return to timescale. My question to you is a tricky question, but I think you already know, and I must admit, when 2 years ago we started the podcast, I didn't realize it fully. Now I realize it much better. So we have an index.

13:38

What operations does it slow down? You said it slows down inserts. This is for sure. Does it slow down updates? Well, yes. And there's a mechanism, hot update, which deals with it in a limited number of cases. Does it slow down delete? Well, maybe no, because during delete, index is not updated. Postgres-only updates xmax, as we discussed a couple of times. Does it slow down selects? What do you think?

Michael

14:13

So we've talked about how having a lot of them can.

Nikolay

14:17

Yeah. Yeah. It slows down the selects, especially if we have a lot of them and high frequency of selects, and this is about planning time and a lock manager locks during planning, all indexes are locked. It's some overhead in a very heavily loaded systems to keep in mind. But in general, I would minimize the number of indexes and try not to use foreign keys.

14:45

Foreign keys, in many cases, we can imagine they exist, have maybe routine checks that referential integrity is fine, but drop them intentionally because in this case, we want, for example, good insert performance.

15:04

And as usual, I would like to remind that when I say all this, in many cases when I deal with new system, I have some of these principles, but I never trust myself, I always check again checking should be like consider like sometimes you spend time there right but it's worth doing experiments

Michael

15:27

yeah I well and I would say we're talking about extremely high volumes here if if you can I would much rather normally have primary key have some foreign keys if they make sense and have a unique key if I need it and then test if like can I get better discs if I need to? Are there other ways I can improve, like I can cope with higher write performance instead of...

Nikolay

15:54

Perform checkpoint tuning if you expect huge volumes to let into the store.

Michael

16:00

Yeah, so maybe pay for it in other ways. It's only at

Nikolay

16:03

the apps. Bigger buffer pool. Exactly. Make sure backends don't write all the time. It depends, right? So checkpointer is not crazy, it's not too frequent, and so on. Yeah, yeah. And there's a lot of stuff here. And if we think about Selex now, what's the number 1 problem usually? I think, it makes me so, I'm still wondering how come we lived so many years until I think Postgres 12 or when autovacuum_vacuum_insert_scale_factor was added. I think Darafei initiated it. Darafei.

Michael

16:46

Version 13 I looked it up yeah.

Nikolay

16:49

Okay it's very recently compared to like my my experience with Postgres. So strange. What it adds? Originally, Postgres vacuum, which also maintains Postgres statistics, which is important for good query performance, including selects, right? Originally, it was triggered only after, say, like 10% by default, 10 or 20% of rows are changed. There is some complex formula, not very complex, but some formula. But roughly after 10 or 20% of rows changed, change means deleted or updated.

17:27

It triggers, but not after inserts. And only in Postgres 13, a specific parameter was added. I think by default it's also 20% or 10 which tells what the vacuum to run and process a table after 10 or 20% of rows were added

Michael

17:49

yeah and I look this up and it's it's like, it's because there's 3 jobs, right, of autovacuum. There's the removing, well, there's roughly removing dead tuples.

Nikolay

18:01

4 jobs actually.

Michael

18:03

Freezing and analyze statistics.

Nikolay

18:06

Removing the tuples, maintaining visibility maps.

Michael

18:09

Maintaining visibility maps, of course, yeah. For goals

Nikolay

18:12

maybe actually more but these 4 come come to mind quickly

Michael

18:17

yeah and if you're only doing inserts you don't need the removing their tuples yeah But that isn't the only thing vacuum's doing. So this then enables, though, the visibility map and the freezing to happen.

Nikolay

18:31

Well, freezing will happen regardless of inserts. It will happen... Well we can insert a different table.

Michael

18:38

Yeah okay yeah good point.

Nikolay

18:40

And autovacuum we'll see that xmin or xmax or both xmin right it's very very In the past we have risk of wraparound, so it's time to freeze this table. We can have 0 operations in terms of, like, table can be left unchanged for many, but at some point, we can decide, okay, it's time to freeze.

Michael

19:03

But you're right, visibility map would never be...

Nikolay

19:07

Visibility map is huge. You mentioned index-only scans, the performance of aggregates, counts, right? So we do want to keep it up to date. I think default is not enough as usual with autovacuum. We must tune it and even cloud providers, their defaults are not enough. We must tune it and go down to 1% or smaller and make sure autovacuum maintains statistics and visibility maps more often so performance of SELECTs including index-only scans are good right? Yes,

Michael

19:40

another reason to partition as well so you can keep those yeah yeah that makes sense I was gonna say it's that it is it's 20% So it is quite high still as you say, would you ever switch

Nikolay

19:53

to same? I cannot imagine any OLTP system any website any mobile app which would be okay with Postgres or to vacuum defaults this like ah That's it like I don't know why they are so. They are so for what? We have so many beautiful websites working, Huge systems working with Postgres. It's like it's so cool to see that Postgres main handles so big workloads, but these defaults

Michael

20:27

Well, and the strange thing is this 1 for example if we did reduce it to 1%, it would add overhead on small systems. Sure, if you've only got 100 rows, it runs vacuum every row for a while, you know. But who's running a small system that can't handle a vacuum of 100 row table every row? Like, that's fine. And also,

Nikolay

20:49

with append-only specifically, when some page is already processed, it's marked all visible, all frozen, or whatever. Vacuum just skips it.

Michael

21:01

Yeah, so it wouldn't even be

Nikolay

21:02

much overhead. There were many optimizations in this area, so to not to do work which can be skipped. So it's doing good job skipping and it's many years already. So I think like I never saw any system and I saw maybe already hundreds of them, different sizes, websites, like OLTP, right? I didn't see any time we decided, oh, you know what, we need to increase scale factor. I don't remember this at all.

21:36

We can throttle it if we like, we can balance work among many workers and so on, but deciding let's make work of autovacuum less frequent, 0 cases I had. Maybe my experience is not enough, maybe 1 day I will see such a system.

Michael

21:56

I've not seen 1 either.

Nikolay

21:58

Enough rage about defaults, my usual fun I have with Postgres. Let's talk about partitioning, maybe, right? Why do we want it? I see several ideas here, and TimescaleDB is definitely for a append-only table, so it's a good thing to have in many senses. But unfortunately, it's not available in managed offering except their own Timescale cloud, right? And some others, but those some others usually choose Apache 2.0 version which doesn't have compression. Right?

Michael

22:30

So... Doesn't have a lot of their good features, yeah.

Nikolay

22:33

Yeah, so partitioning is good. Again, there's some rule, empirical rule, we say, like many people say, not only I. Let's consider partitioning if table exceeds 100 gigabytes or has chances to exceed 100 GB. Partitioning adds complexity. It's not as well automated as in Oracle, but it's a very important tool to consider. Many factors here. First, for example, you might say, okay, I have a partition where I insert and then many partitions where it's like my archive.

23:13

And as we decided, we want a very low number of indexes in the main partition, which is receiving inserts, and constraints like foreign keys and so on. But there is no such problem in all archive partitions, right? We might have more indexes there and constraints and so on. This is 1 thing. The second thing is autovacuum. If occasional deletes or updates are happening, the block which contains the raw data basically is out of visibility.

23:49

It's marked not all visible anymore and not all frozen anymore. So a vacuum needs to process it. And it's good to have data localities or archive data is in some partitions and fresh data is in particular partitions. So autovacuum is focusing on fresh data in fresh partitions. It reduces the number of blocks it needs to deal with, right? Because all data is rarely touched, so we... autovacuum visits are very rarely, right? This is another reason. Cleanup is another reason as well, right?

Michael

24:27

I think that's... I think cleanup's the biggest reason. I think... I think maintenance... Partitioning helps so much with maintenance.

24:37

It does have other benefits for sure but it helps so much of maintenance that I can't help but feel like that's the biggest 1 and I actually I've started to say I think I must have stolen this from somebody else because it's too clever for me but partitioning based on how you want to eventually delete data makes sense so if you want to eventually delete old data partitioning based on time makes sense But for example if you're a bit like b2b

25:03

sass and you eventually want to delete data based on a customer quitting the service you probably want to partition based on

Nikolay

25:10

or both a level of partitioning calls

Michael

25:13

yeah exactly but but that being the like a guiding principle for how you partition because it makes that deletion or dropping so easy.

Nikolay

25:23

What will you do with data? And as I said I participated in projects where Delete was a big issue and of course with partitioning it's very different and it's good. Deletes can be a problem. Postgres deletes, like if you have a terabyte, 10 terabyte table and you need to delete 20% of it, it's a big headache because you need to make sure vacuum will be okay. It will, autovacuum will catch up all the time.

25:55

You need to, again, to pay attention to a checkpointer and you need to find a way how to delete so delete doesn't degrade. This was my problem. So I created beautiful queries, but they degraded over time because of that tuple accumulation and bloat accumulation as well. So I needed to adjust them and so on. So there are many problems with delete and it takes time to delete millions of rows. If you rush with it you can put system down or have degradation of performance.

Michael

26:29

Well yeah, And it can really affect even your SELECT performances. So you mentioned, BRIN is probably the 1 where it gets, it used to at least get affected the most with the default way of creating a BRIN index. If you have a row inserted way back in an old, if you don't have partitioning, if it goes miles away and you get some real scattered data, BRIN performance can end up effectively looking like sequential scans.

Nikolay

26:59

And all indexes degrade, B-tree degrades very quickly if you perform deletes and updates and you need to rebuild it. And rebuilding is better with partitioning because the smaller partitions are, the faster rebuilding is, and Xmin horizon is not frozen, right? So autovacuum is not affected in the whole database right

Michael

27:22

yeah so yeah

Nikolay

27:24

building and rebuilding indexes vacuum itself maintenance tasks are good if you have smaller physical tables or partition is great

Michael

27:32

right yes on the BRIN thing I'll link up the old episode we did, but the min-max-multi I think makes a big difference, especially if you don't have to, like, well, it handles loads of outliers, so I do think that's easier. And if you are able to keep on top of autovacuum, I guess the B-tree stuff doesn't degrade that quickly. So I feel like these things aren't as big a problem anymore.

27:55

But yeah, often in these cases, if you're dealing with high volume, like many, many thousands of queries per second, like just extreme volume, anything you can do to help fight on the performance front will be helpful.

Nikolay

28:08

Yeah, and as usual when we touch partitioning, the state of caches and buffer pool, For example, if you have archived data which is touched rarely, those blocks are evicted from the buffer pool, and cache efficiency might grow, hit rate might be better. But yeah, I agree with you. So partitioning is good in many senses. It comes with price of overhead and maintenance as well, but it's worth to have it. But imagine, like all this said, we moved slightly from append-only to append-mostly, right?

28:43

But let's move back to append-only. Imagine we have many partitions where data is not changed. Archive. Indexes created. All frozen, all visible. It's a beautiful state of data, right? So all index-only scans are working well. And that's it. Maintenance not needed. Autovacuum not needed there, and so on. However, what if we have 100 terabytes of data, and this is like heavily loaded cluster, we have many replicas. The data is not changed as good.

29:17

It's evicted from buffer pool, but we still need to keep it on the main storage right and At some point we think oh like we pay a big price because this data is replicated it increases the volume of backups, full backups if we consider, right? So this is like, this legacy, it's a lot. And if we have, for example, 5 replicas, 1 primary, We need 6 times to store the same data and nobody is using it. Like people read it occasionally. At some point you think it's not efficient.

29:55

And you think I would rather store it somewhere else, not on the main disks, not on SSD, fast SSDs or I don't know, NVMes I have, or cloud storage, which is also expensive, right? So this leads to 2 ideas. First idea is it would be good to compress it. Again, TimescaleDB, full version of TimescaleDB is doing a great job, and their blog posts about compression are great.

30:24

I like especially 1, I remember, first big 1 which explained algorithms in row and basically kind of column compression, although we still have row storage, it's great. And also, I think, second topic here, which opens up naturally, is what I know Aurora now offers it, right, and Neon and Timescale as well, in cloud only. Bottomless approach, where all partitions are offloaded to S3 or object storage, GCS on Google Cloud, or blob storage on Azure, or how it's called, I don't remember.

30:59

And Now even Hetzner has it. They just recently released, which is big news. I like it because I like their prices and I worked with them since I think 2006 or so in a few companies. When you bootstrap and you have a small startup, Hetzner is like number 1 in terms of budgets and the hardware they can offer. So they just recently released S3-compatible object storage, right? So we can have normal backups and so on. But what to do with old partitions? It's a natural way of thinking.

31:36

We don't want to keep them on these expensive disks we have, having multiple copies of that. So offloading it somehow, like implicitly, like transparently in the background, to S3 or S3-compatible mini or something, if you have self-managed Postgres, it would be great. So we have it in timescale cloud in I think new 1 also does it right

Michael

32:05

I don't know

Nikolay

32:06

Bottomless bottomless like new 1 they store data on s3 originally anyway So idea is we want to have petabyte size cluster, but don't pay for lots of disks and headache it comes with. And for append-only, it's very natural to decide, okay, we want to store data forever, not to clean up, but we want cheap storage here. So S3 is a good idea to consider, and it has tiers also, right? It can be slow to retrieve, but it's okay because it's rare, right?

Michael

32:45

Well, and it depends what you mean by slow like I think there is cut there can be performance advantages I think when some of this data is fresh we might want to retrieve it row by row like if you do if you're looking at some audit logs you might want to look at some recent ones that you might want all the information about them but if you're looking at data from 2 years ago there's probably a higher chance that you're looking at it in Aggregate you know on average how many audits of this type

33:13

will be having in in 2022 versus 2023 and I think actually the types of queries that happen on older data tend to be these Aggregate ones that often perform better once it's Column store compressed you know these file formats often suit that kind of Query so I could I don't even think I know what you mean

Nikolay

33:34

likes compression good compression it likes it yeah and TimescaleDB compression I have seen how good it is it's gonna be like 20 20 times smaller and and indeed like if they even support data changes for compressed data, which is great.

Michael

33:53

I have seen a project or 2 come up about lately, I think, open sourcing some of this stuff, or at least putting it under the Postgres license. Is it pg_parquet that allows you to...

Nikolay

34:05

Yeah, but it's different. It's for analytics. And actually for analytics, we also might want to consider append-only tables, obviously. But There is a new wave of this and many, I know many people, companies look at it. PgDuckDB, not PgDuckDB.

Michael

34:23

Yeah.

Nikolay

34:23

DuckDB. Right. And the idea let's marry it with Postgres. And there are a few projects I looked at a few once recently. And 1 of them was just released maybe last week. I remember they use logical copy from original tables, regular tables, to these tables, which are stored on, I think, in Parquet format on object storage, and then DuckDB is used as processing for analytics.

34:57

But I remember, I think Álvaro commented on Twitter that I'm not going to consider it until it works with like basically CDC logical replication or something because right now it's only full refresh of like it's not serious but they will do it I think also I think new guys looked at DuckDB and I saw some activities and Hydra, right? They also looked at that.

Michael

35:21

Yeah. But I understand that most of the marketing at the moment is around analytics use cases, but I don't see why it couldn't work for append-only

Nikolay

35:30

data types. I'd be sure. I looked at a couple of extensions because I have a couple of customers with such need to offload all data and Parquet is this format supports only like mapping of data types might be tricky if you have some complex data types, as I remember. And when I looked at some extensions, it didn't work well. And I think right now, I have plans to look at Tembo's extension, which is called pg_tier, for tiered storage.

36:02

The idea is, with this extension, we can have all partitions on object storage. It's a great idea. So if it works, it's great. I just haven't looked at it yet. If somebody from Tembo is listening to us, please let us know how this project works already. Is it already in production or beta stage? I'm very curious. And is it worth trying? And what are the limitations, for example?

36:36

Maybe we should actually have a separate episode, because I think this extension might bring bottomless idea to the Postgres ecosystem for everyone. It's great. It's open source, unlike what other companies do. So kudos to Tembo for making this open source

Michael

36:56

extension. And is it transparent? Do I have to... Because I think some of the DuckDB stuff, I would have to be... I'm not sure if I have to write...

Nikolay

37:07

I don't know. It might be semi-transparent if you need to make some transitions with old partitions, I'm okay with that. And I can even create new partitions in the background and move all the data from 1 old partition to this kind of new old partition, which already has different storage. This is doable. It's not a big deal, as I see it. I think what, for example, Timescale Cloud has, it's transparent and they have some kind of interesting, as I remember, very rough form.

37:44

They have something also with Planner and some algorithms to decide when to bring this data to caches and so on right so it's interesting but idea is we want partitioning to evict blocks with all data from memory 1 thing but then we think we want to evict them from our disks, because disks are also expensive, right? Let's evict it. And this is an alternative idea to cleanup and to compression, or compression goes well with offloading as well. I don't know.

38:18

So parquet format is definitely good with compressing data in terms of column storage, right? So if it's time-serious, it's good. So there are interesting new directions of development of Postgres ecosystem here. And I think we mentioned a few projects, both commercial and open source, which is great. So if someone wants to store petabyte of data, is preparing for it, doesn't want to delete everything, I think there are ideas to consider.

38:47

Well, we didn't mention with partitioning, you can also have some kind of foreign data wrapper and store it on very cheap, also Postgres. It can be Postgres, it can be not Postgres as well, right? But for example, we can consider a cheap Postgres cluster with very slow disks, HDD, for a good price, and we can have foreign data wrappers and offload it from the main cluster to that cluster and live with it.

39:14

There will be some corner cases, maybe, with transactions failing sometimes because of, I don't know, because of... I expected foreign data wrappers, Postgres, FDW, code in terms of how it works with... Basically, you need 2 PC, right? You need a two-phase commit to have a reliable commit because it's distributed system already, but without 2 PC, there are risks to have inconsistency, for example.

Michael

39:47

But not if you're, if it's append-only, you're never going to change that data by the time you're pushing it to a different server.

Nikolay

39:56

I remember I inspected code and found some kind of edge cases, maybe even corner cases, but for inserts it's also possible. You wrote to 1 place, you're trying to write another place, this 1 is already committed, here it's not committed. But I remember code was quite smart to reduce the probability of some cases, but it's not 100%. As I remember, it was like 5 years ago. So I haven't revisited it since.

Michael

40:26

Tell me if I'm wrong, but you're saying new partitions would be on the local Postgres, and it would be old ones that we would move to the second.

Nikolay

40:36

Yeah, if we don't have inserting transactions which deal with multiple partitions, there is no problem at all. Yeah, and old we can move to find it with Postgres FDW to different cluster. So you're right. No inserts happening, no problem. Cool. Select only.

Michael

40:55

I never considered using FDW for partitioning. Or like, you know, the old partition.

Nikolay

41:01

It's natural. Yeah, it's natural. This is how many folks were going to have clustered Postgres, right? I mean, I mean, or sharding, sharding, sharding. Yeah, sharding. Yeah, but this path has issues. Yeah, it's a different, different story.

Michael

41:20

Hereby dragons, is it? Or what's that old phrase?

Nikolay

41:24

And I don't know. So, okay, we discussed many things. I feel we might be missing something, right? But for a good overview, I think it's enough. For a general overview of what's happening with append-only tables. So I think it's an interesting topic, actually. Many people need it. Many companies store logs and so on in Postgres. And I'm looking forward to the future where Postgres at some point will have better compression. Better compression and bottomless feature.

42:01

And as I like to say imagine TimescaleDB was Postgres license. Okay this is I think should happen at some point I know this is there are people who don't like this idea, but they developed so good stuff that Postgres would benefit from it. I know some people would not be happy with these words, I know. But it feels natural to have these features in Postgres itself.

Michael

42:33

Some of them for sure, but I do think some of it's tricky, like the bottomless stuff for example, where would it go?

Nikolay

42:46

There is extension already, right? S3 compatibility is standard. Even Google Cloud GCS is also S3 compatible. Everyone does it. So I think, no, no, no. I don't see why not here. I mean, I see it, but it's manageable. The biggest question is business-wise, I guess. So, TimescaleDB, license, yeah. I understand that business-wise it's not going to happen in the near future, but it could be so great to have good compression and bottomless and false width itself.

43:22

Even if it's extension, I'm okay with extensions. I don't like extensions in general because I like to... All people have some features, but in this case, it's okay to have extensions. Cool. Extensions are great sometimes, yeah. Good, okay. Have a good week. I know we have more than 100 suggestions in our doc. Yeah. We read them. Keep posting. This was 1 of suggestions, right? Yeah. So you chose it. Yeah, good. Thank you for our audience for patience reaching this point, you know.

44:01

Okay, see you next time

Michael

44:03

absolutely take care see you next week bye

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript