Managed service support

Nikolay

00:00

Hello, hello, this is Postgres.FM. I'm Nikolay, Postgres.AI. As usual, my co-host is Michael, pgMustard. Hi, Michael. How was your week?

Michael

00:10

Hi, Nikolay. I'm good, thank you. How was yours?

Nikolay

00:14

Perfect. Very active, and a lot of stuff is happening. We had to skip last week because I had even more going on then, but I'm happy we continue, right? We don't stop.

Michael

00:27

Oh, yeah.

Nikolay

00:27

Yeah. I remember in the beginning I was always against skipping any week, because for me it would be a sign that we are probably stopping, which I don't want. But by now we have already proved it, over a couple of years... almost how many years?

Michael

00:50

Three, maybe.

Nikolay

00:51

Almost three. This July it will be three years. And I already proved to myself, we proved to ourselves, that if we skip one or two weeks it's not game over.

Michael

01:05

Yeah, this is me as the European convincing you it's okay to have a week off every now and again.

Nikolay

01:10

Yeah, exactly, exactly. Okay, if we stop, that's it, and I don't want that. Good. Today's topic was my choice. It's less technical, although we will talk about technical stuff as well. The topic is how managed Postgres services help us or don't help us. Us as customers, I mean. I'm probably in a different situation, but sometimes I'm just a customer, or I'm on the customer's side.

01:45

And there's a problem: we cannot have access to the cluster, so when we have some issue, there's a whole big class of problems around how to deal with it. Maybe we should create some best practices for dealing with support engineers from RDS, Cloud SQL, and all the others, right? Let me start with this. I learned an important lesson, I think in 2015 or 2016, when I first tried RDS. I liked it a lot because of the ability to experiment a lot.

02:30

Before cloud, it was really difficult to experiment, because for full-fledged experiments you usually need machines of the same size, for a very limited amount of time. 15 years ago or so, we were buying servers and putting them in data centers, and experiments were super limited. Cloud brought us this capability, great. And with RDS I quickly learned how cool it is to just create a clone, check how everything works, throw it away, and then rinse and repeat many times.

03:07

And then when you deploy, you have already studied all the behavior. I remember I was creating a clone, an RDS clone, I think it was 2016, maybe 2015, and it was so slow. Why is it slow? Okay, the cluster was maybe 100 gigabytes. Today that's a small cluster, but back in those days it was already quite a big one. I restored it, and somehow it took forever to run some SELECT. Experienced AWS users know this phenomenon very well.

03:50

It's called lazy load: the data is still on S3, and you have an EBS volume which only pretends to have the data, while it is still being lazily loaded in the background. I reached out to support, because we had good support. And the engineer said, oh, let's diagnose, it's some kind of issue. So it was hard to understand what was happening. I spent maybe an hour or so with that support engineer, who was not really helpful.

04:25

Right. And then, I don't know, maybe it was my experience of managing people. By that time I had already created three companies, so I had learned something about psychology and so on. What I did was just close the ticket and open another one. Usually any support would hate that, like, don't duplicate, right? But this helped me solve the problem in a few minutes, because another engineer told me, oh, that's just lazy load.

05:01

And I googled it, I quickly educated myself. Okay, what to do? Oh, just SELECT * from your table to warm it up. Okay. And since then I have a rule, and I share it with my customers all the time. If you are on a managed Postgres service and you need to deal with support sometimes, it's like roulette, right? It's 50/50. It can be helpful, or not.
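The warm-up mentioned here can be sketched in a few lines of Python. This is only an illustration under assumptions: the schema and table names are hypothetical, and the helper just builds the statements rather than running them. Note that a plain full-table read touches the table's heap, not its indexes, so indexes would still be loaded lazily.

```python
def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def warmup_statements(tables):
    """Build one full-table read per (schema, table) pair.

    Reading every row forces the lazily loaded blocks of the table
    to be fetched onto the restored volume.
    """
    return [
        f"SELECT * FROM {quote_ident(schema)}.{quote_ident(table)};"
        for schema, table in tables
    ]


if __name__ == "__main__":
    # Hypothetical table list; replace with the tables you care about.
    for statement in warmup_statements([("public", "orders"), ("public", "users")]):
        print(statement)
```

The statements can then be fed to any client, for example psql with its output discarded.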

05:28

If it's not helpful, don't spend more than 10 minutes. Just close the ticket, say thank you, and open another one, because if it's a big company with a big support team, you will probably find another engineer who is more helpful. Actually, I use this rule in other areas of my life as well, for example when talking to support people at a bank, right? Credit cards, debit cards, anything. It's not helpful?

05:55

Okay, thank you, and you can just call again and another person will probably help you much faster. What do you think about this problem?

Michael

06:04

Yeah, I think you must have different banking services to us because if we need to call the bank, you're guaranteed to be waiting 20 minutes on hold.

Nikolay

06:12

Oh, yes, it's terrible. It can be hours. I think the day will come when someone creates an AI assistant working on the human's side, not on the company's side.

Michael

06:25

Oh, interesting. Yeah.

Nikolay

06:28

Yeah. It should wait on that line and ask me to join only when everything is ready and the small details are already negotiated, when only some approval is needed, and that's it.

Michael

06:40

Yeah, so

Nikolay

06:41

maybe one day we will have such systems.

Michael

06:44

Yeah, I think at big companies that makes a lot of sense, at smaller ones much less so. There are some smaller managed services out there, and maybe this problem happens less with them. I was going to ask, because sometimes they have the ability to escalate, right? Let's say you've got a support engineer who wasn't able to work out the issue.

07:04

Do you have any tips for getting them to escalate the problem to a second tier? Or do you always go to, like, let's open another ticket and hope that they escalate it?

Nikolay

07:13

That's a great, great question. And I think we are in a position where, I don't know about RDS, by the way, but from what I see in many cases, there is no such ladder built yet. In the case of big corporations, banks and so on, there is such an option. You can ask for a senior manager, blah, blah, blah. Especially if you go offline, it's always an option. Right? Please let me speak to another person, and you escalate, and so on.

07:43

But here is what I observe, and what happened recently. We had a client who experienced some weird incidents. And those incidents require low-level access, which you don't have on RDS. You need to see where Postgres spends time, for example using perf or something. But you cannot connect, so it's all in their hands. And you also need to grant them approval to connect to your box, and so on. So there's a lot of bureaucracy here.

08:18

And I told them, you need to escalate. That should be normal, but I don't see this option working. If you say "escalate", it looks like they don't understand what's happening here, right?

Michael

08:34

Really?

Nikolay

08:35

Well, you can try. Bring some difficult problem and try to escalate. Will it work? Is there any official option? Because if it's not official and it only works sometimes, okay, but again, it's like gambling. It's similar to closing and reopening the issue and hoping the next engineer will be more helpful. Escalation is also not guaranteed. In many cases, it's good, right?

09:04

Because they will probably try to solve it. Actually, I have several recent cases, very interesting ones. I cannot share all of them, but let me share another one. Another company, they are on a different platform, not RDS, not Cloud SQL, and they had a bunch of issues, like 10 issues of different kinds. One issue was eventually identified, with mutual effort, as: don't run backup push, or whatever you call it, on the primary. If the system is loaded, do it on replicas.

09:44

We talk about it from time to time when we touch on backups, right? And this was an issue on that platform. But what I observed was the experience of trying to work with support engineers, and also the ultimate escalation, where you go to the CTO or CEO level and say, oh, look, you know, the CTOs are talking, right? That is the ultimate escalation. And it's also not helpful sometimes, right? In that case there was some chunk of disappointment. This was the feedback I heard.

10:21

Right, so escalation is interesting, but my point is that we probably need to learn about escalation ladders and practices from other businesses, obviously, right? And I still think it's not fair that the customer pays a bigger price and doesn't have control.

Michael

10:45

Yeah, sure. Actually, on this topic, I was going to ask: there are managed service providers that give more access. We had an episode on superuser, for example, and it's come up a few times. Obviously that's still not like running perf, which you're talking about, but I'm guessing a whole category of issues just doesn't exist if you've got superuser access. So is it less of an issue on those?

Nikolay

11:17

I will tell you a funny story. It was with Crunchy Bridge. I respect Crunchy Bridge for two reasons. It was one, now it's two. One is superuser. I don't know any other managed service yet which provides you superuser. It's amazing. You can shoot yourself in the foot very quickly if you want. It's freedom, right? And the other thing is that they provide access to physical backups, which is also nice. This is true freedom, honoring your ownership of the database and so on.

11:52

Because without it, maybe you own your data, but not your database. You can dump, but you cannot access PGDATA or physical backups, nothing. And you own your data only conditionally, because if bugs happen, you cannot even dump. Right? And that sucks completely. I'm talking about everyone except Crunchy Bridge. All managed services, they all steal ownership from you. That sucks. So, the funny story...

Michael

12:28

I think there is at least one other, but I think they're quite small. I think maybe Tembo give superuser access. I haven't actually checked.

Nikolay

12:33

Oh, maybe, yeah, maybe. Yeah, apologies if I missed something.

Michael

12:39

Yeah, let us know.

Nikolay

12:39

Of course, I work with a lot of customers and I'm expanding my vision all the time, but of course, it's not 100% coverage.

Michael

12:48

Yeah, of course.

Nikolay

12:49

Definitely not.

Michael

12:51

And definitely the big ones don't.

Nikolay

12:53

Right, exactly. And they say this is for your own good, but it's not. So let me talk a little bit about Crunchy Bridge. It was super funny. We needed to help one customer and restart a standby node. And it turned out Crunchy Bridge doesn't support restarting Postgres on standby nodes. They support it for the primary or the whole cluster, but not for a specific standby node. It was very weird. I think they just didn't get to it somehow. It should be done, it should be provided.

13:28

But we could not afford restarting the whole cluster. We needed just one replica. And then I said, OK, we have superuser.

Michael

13:39

Yeah.

Nikolay

13:39

What can we do? COPY FROM PROGRAM, right?

Michael

13:44

So you crashed the server.

Nikolay

13:46

Not crashed, why crash? pg_ctl restart, it's all good. Just with -m fast. All good, all good. Yeah, there are some nuances there.

Michael

13:58

But on that, let's go back to the topic briefly, because it's relevant.

Nikolay

14:01

Let me finish. COPY FROM PROGRAM doesn't work on replicas, because it's a write operation. Oh!

Michael

14:08

So you had to contact support, right? That's where I was going with this.

Nikolay

14:12

Well, support says this feature is not working. I mean, it's not supported.

Michael

14:15

But they could do it for you, no?

Nikolay

14:17

No, no, no. I needed it as part of the automation we were building. It was part of a bigger picture, and we needed this ability. So what we ended up doing is COPY TO PROGRAM, writing to a log.

14:31

And this worked on the replica, but we were a little bit blind. Then I talked to the developers and realized we had an easier path in our hands. It's PL/Python, plpython3u. Anyway, if you have superuser, you can hack things yourself a little bit. It's your right. If you are going to break something, don't do it.
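The COPY TO PROGRAM trick described here can be sketched as a small statement builder. This is a hedged illustration, not the automation from the episode: the data directory and log path are hypothetical placeholders, the statement needs superuser (or membership in pg_execute_server_program), and how the restart survives its own backend being terminated is one of the nuances left out.

```python
def sql_literal(text: str) -> str:
    """Quote a string as a SQL literal by doubling single quotes."""
    return "'" + text.replace("'", "''") + "'"


def copy_to_program(shell_command: str) -> str:
    """Build a COPY ... TO PROGRAM statement that pipes one row to a command.

    COPY TO only reads from the database, so it is allowed on a
    read-only standby, unlike COPY FROM, which writes into a table.
    """
    return f"COPY (SELECT 1) TO PROGRAM {sql_literal(shell_command)};"


if __name__ == "__main__":
    # Hypothetical paths; output goes to a log file because the
    # command's stdout is not returned to the SQL session.
    restart = "pg_ctl restart -m fast -D /path/to/pgdata >> /tmp/restart.log 2>&1"
    print(copy_to_program(restart))
```

Redirecting output to a log file is what makes the approach "a little bit blind": the only feedback is whatever the command writes there.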

Michael

14:56

Yeah, it's a really good point. So that was kind of my question. If you've got more access, I presume there are fewer issues that you need support for. But that does raise a good question, because there are kind of three times you need to contact support, right? One: we've got an issue right now, maybe urgent, maybe not. Two: I've got a question, how does something work? And the third category is feature requests, like, I'd like to be able to do this, which we can't currently do.

15:25

My experience of feature requests, or of looking at the forums where different managed service providers ask people to go to request and vote on features, is that it looks a little hit and miss. Do you have any advice in terms of how to do that?

Nikolay

15:44

We have two paths here. Advice to whom? To users or to platform...

Michael

15:49

Users. I'm thinking of the people listening, mostly users.

Nikolay

15:54

Well, it's in a bad state right now. Again, I think managed services should stop hiding access. They build everything on top of open source. And they charge for operations and for support, good, good, good. But hiding access to purely open source pieces, it sounds like bullshit to me. Complete bullshit. Actually, it even makes me angry. So, amazing, just yesterday I saw an article from Yugabyte.

16:26

Yugabyte suddenly, and it feels like what Tembo did when they released DBA AI, going outside of their platform. Yugabyte did a similar thing. They went outside of their own product and platform, and they started offering a tool for zero-downtime upgrades, compatible with Postgres running on many managed service providers like RDS, Cloud SQL, Supabase, Crunchy Bridge and so on. And that's great. That's great.

16:56

They got it a little bit wrong, because they called it blue-green deployments while it's not... They made a similar mistake to the one RDS made, we discussed it, right? They, this...

Michael

17:06

I, yeah, but I saw your tweet about this and I'm going to defend them, because I don't think it's their fault. I think the problem is RDS broke people's understanding.

Nikolay

17:16

Look, wait a little bit. I'm going there, exactly. I'm going exactly there. So blue-green deployments, according to Martin Fowler, who published an article 15 years ago, must by nature be symmetric.

Michael

17:29

We did an episode, remember?

Nikolay

17:31

Yes, exactly, criticizing the RDS implementation. And Postgres definitely supports it. We implemented this, and some customers use it. That's great. And my point is, Yugabyte probably hit the same limitations we hit. On RDS you cannot change things, it's not available. Since you don't have low-level access, you cannot change many things. And this limits you so drastically. It feels like some weird vendor lock-in. If you want RDS, okay, good, I understand.

18:06

But you cannot engineer the best approach for upgrades. And you need to wait, how many years? Okay, they released blue-green deployments. I see a better path for blue-green deployments. It's my database, and I cannot do it, I need to go off RDS. At the same time, if they provided more access, opening the gates for additional kinds of changes, it would be possible to engineer blue-green deployments, for me or for third parties.

18:41

Like, okay, Yugabyte is a third party here. They want to offer or sell some product or tool compatible with RDS, but since they don't have access to recovery_target_lsn and so on, they are very limited, right?

Michael

18:59

Yeah, but it might be exactly for that reason. If one of the reasons for needing it is migrating off, migrating out, then you can see the incentives.

Nikolay

19:13

And for upgrades, things are becoming much better in PostgreSQL 17. And blue-green deployments are not only for upgrades. If we set the upgrade idea aside, we can implement blue-green deployments on any platform right now, because you can skip many LSNs in the slot and just... What is it called? Not promote, because promote is different. I forgot.

19:41

Like, shift the position of the logical slot, synchronize its LSN with the position we need, and then from there we can already perform this dance with blue-green deployments. It's doable. But if you want upgrades, OK, we need to wait until 17, because there the risk of corruption is low. You mean 18? 17. 17 has the pg_createsubscriber CLI tool. And it also officially supports major upgrades on logical replicas.
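Shifting a logical slot forward can be sketched as follows. PostgreSQL provides pg_replication_slot_advance(slot_name, upto_lsn) for moving a slot, which is presumably the operation meant here; that mapping is an assumption, as are the slot name and LSN values. The helper only builds the SQL and checks that the target is not behind the current position, since a slot can move forward but not back.

```python
import re


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to a 64-bit integer.

    The part before the slash is the high 32 bits, the part after
    it the low 32 bits, both written in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def advance_slot_sql(slot_name: str, current_lsn: str, target_lsn: str) -> str:
    """Build a call that moves a logical slot forward to target_lsn."""
    if not re.fullmatch(r"[a-z0-9_]+", slot_name):
        raise ValueError("slot names use lowercase letters, digits and underscores")
    if lsn_to_int(target_lsn) < lsn_to_int(current_lsn):
        raise ValueError("target LSN is behind the slot's current position")
    return f"SELECT pg_replication_slot_advance('{slot_name}', '{target_lsn}');"


if __name__ == "__main__":
    # Hypothetical slot name and positions.
    print(advance_slot_sql("blue_green", "0/3000000", "0/5000000"))
```

The current position of a slot can be read from the confirmed_flush_lsn column of pg_replication_slots before calling the helper.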

20:13

So yeah, these two powerful things give us a great path to upgrading really huge clusters with a zero-downtime approach. Well, near-zero downtime, unless you have PgBouncer. If you have PgBouncer, you have PAUSE and RESUME, and then it's purely zero downtime. Anyway, my point is, since they are partially performing this vendor lock-in, they hesitate to open the gates.
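The PgBouncer part of that switchover can be sketched as a context manager around the admin console's PAUSE and RESUME commands. This is a minimal sketch under assumptions: run_admin_command is a hypothetical callable supplied by the caller (for example, a function that executes one statement on a connection to the special pgbouncer admin database), and the database name is made up.

```python
from contextlib import contextmanager


@contextmanager
def paused(run_admin_command, database: str):
    """Hold client traffic for one database while the body runs.

    PAUSE waits for in-flight queries to finish and then holds new
    ones, so clients see a short stall instead of errors.
    """
    run_admin_command(f"PAUSE {database};")
    try:
        yield
    finally:
        # Always resume, even if the switchover step failed.
        run_admin_command(f"RESUME {database};")


if __name__ == "__main__":
    sent = []  # stand-in for a real admin connection
    with paused(sent.append, "appdb"):
        sent.append("-- repoint appdb to the new primary here")
    print(sent)
```

Putting RESUME in the finally block is a deliberate choice: a failed switchover should not leave clients paused indefinitely.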

20:39

Customers cannot diagnose incidents, and they also cannot build tools. And third parties like Yugabyte, or for example Postgres.AI, would probably also build some tools compatible with many other platforms. Not "other", we don't have a platform, right? We help customers regardless of where their Postgres database is located. If it's RDS, okay, Cloud SQL, okay. But building tools for them is very limited right now, because we don't have access to many things, we don't have superuser, and so on.

21:13

So yeah, that's bad. But back to support, if, like, my main advice is just gambling advice. Just gamble, guys.

Michael

21:23

Well, I have some thoughts. I think a lot of people have very high trust when they request features, or a very high belief that people will understand why they're asking for it. I think a lot of people don't include context when they're asking: they don't include why they want the feature, what it's preventing them from doing, what it might cause them to do if they can't get it, or what their alternatives are going to be.

21:50

So I think sometimes when you make products, people just ask for features, and you have to ask them: why do you want this, what are you trying to do? Because without that context it's really hard to know which of your potential solutions could be worth it, or if it's worth doing at all. But most vendors I've seen just don't ask that question. People ask for a feature, you know, or for a new extension to be supported or something, and even if that extension has multiple use cases, there's no question back as to why they want that feature. Like, it's so...

Nikolay

22:22

Value, right? Goals.

Michael

22:23

Yeah, well, exactly. And sometimes five people could want the same feature, but all for different reasons, and that's like...

Nikolay

22:30

Which shows bigger value if there are many different reasons.

Michael

22:34

Yeah, or maybe an issue. Like maybe it's actually less of a good idea because they're actually going to want different things from it. Like it's going to be harder to implement, unless it's an extension and you get them all straight away. But I think in terms of customers asking for things, I've not seen this work from managed service providers specifically, but for products in general, I think it is helpful to give the context as to why you're asking for something.

23:02

The only other thing I had to add from my side is about when you are considering migrating to a managed service provider, either at the beginning or when you've got a project up and running.

23:13

I see quite a few people on Reddit and other places at the moment looking at moving self-hosted things to managed service providers, you know, as they're gaining a little bit of traction. And I've seen at least one case go badly wrong when the person didn't contact support at the beginning of the process. You know, they tried to do everything self-service, and actually it would have been helpful for them to contact support earlier. I think there are two good reasons for that.

23:39

One is to make sure the migration goes smoothly, but the second is to test the support out. How does it work for you? Is it responsive? What kind of answers do you get? Is it helpful, that kind of thing?

Nikolay

23:52

Yeah, we need to write some automation to periodically test all the support teams using LLMs. I'm joking, of course. But you know, it's your database. Even if you consider it cattle, like with microservices, not a pet but cattle, still, whether you are a DBA, DBRE, SRE, or back-end engineer, it doesn't matter, you are very interested in taking proper care of the database. And for support, you're just one of many. Your database is one of many.

24:31

And they also have their own KPIs. Like, your question is closed, okay, goodbye. Or, okay, do this, and so on. And since we don't have access, I just feel this is a big imbalance. If you ask support something, they can help. I saw many helpful support attempts, very helpful, very careful, but it's rare, right? And Postgres experts are also rare. There are not many.

Michael

25:11

Right, yeah.

Nikolay

25:12

And then there is this closing off of the ability for third parties to help. For example, if somebody involves us, we immediately say, okay, for this you need to put pressure on their support, we cannot help.

Michael

25:25

Okay, so what do you mean by putting pressure? Do you mean following up regularly?

Nikolay

25:31

Reopening, escalating and so on, explaining why. For example, a big company can have various support engineers. Yeah. And for example, there was a hanging query on RDS, it's a recent little story. And they suddenly say, okay, we solved it, the query is not hanging. And I wonder, how come? It was hanging because it cannot intercept the signal, blah, blah, blah. It had been hanging for many hours. How did you manage it? OK, RDS support said, we managed to. OK. Did a restart happen?

26:11

Yes, it did. And in the logs, we see the signs of kill -9. So this is what the support engineer did. This support engineer should be fired, right? From the RDS team. This is my opinion. But I'm just saying, it's hard to build a super strong support team, it will always be lacking, and it would be great if the company allowed third-party people to help. Look at other aspects of our life. For example, you have a car, or, recently I replaced the tankless heater in my house.

26:46

If you go to the vendor, sometimes the vendor doesn't exist anymore. For example, my solar is very old. But anyway, you can find a variety of service people who can help and do maintenance. If a company, even RDS, limits maintenance only to its own staff, it will always be very limited, because Postgres expertise is limited on the market. They should find a way to open the gates. This is my... It's already a message to platform builders. What?

Michael

27:18

Well, I mean, I understand where you're coming from as a...

Nikolay

27:22

Not where I'm coming from, where I'm going to. It's the future.

Michael

27:26

I mean, I understand where you're coming from, that they can't hire all of them, and actually there's a benefit in other people being able to provide support. But if Postgres expertise is so limited, where is everyone else going to get their support from? It's not...

Nikolay

27:42

It's an open market and competition.

Michael

27:44

Yeah, exactly. So you're saying there is plenty of Postgres expertise?

Nikolay

27:50

Well, the company should only benefit if they open the gates and allow other people to help while customers are still on the same platform. Because otherwise, concern and disappointment about support can rise to the point where customers leave. Which is actually probably not a bad idea. I also believe that our segment of the market will slowly start to realize that there is self-managed, there is managed, but there is probably something in between.

28:22

And I know some work is happening here that I cannot share. So, something in between, where you truly own the database but still have the benefits of managed services. This should happen, and I think multiple companies are going in this direction.

Michael

28:35

Or, and I'm seeing this more from smaller companies, quite established in terms of their database and team, not necessarily brand new startups: moving to services while factoring in support as one of the main things they're looking for in a service provider. I think in the past, people would look at a lot of things, right?

Nikolay

29:01

Price,

Michael

29:02

ease of use, region,

Nikolay

29:05

like,

Michael

29:06

Yeah, they look for a bunch of features but don't always factor in support as one of those key factors. I like to see when people do factor that in and take it seriously. So that's the alternative, right? Pick your managed service provider partly based on how good their support is.

Nikolay

29:25

I'm talking about an absolutely new approach, where a service is not self-managed, not managed, but very, very well automated. And if you're not satisfied with the company that helps you maintain it, you can switch the provider of this maintenance work, right? This should be

Michael

29:43

like co-managed.

Nikolay

29:45

Yeah. Co-managed. Yes, exactly. It's great, because the market is growing and competition is growing. I just provided a few examples about several managed services, and we see bad examples all the time, so the problem is systematic. It's not just that some company is bad and others are good, or vice versa. It's a systematic problem rooted in the decision to close the gates and not allow others to look inside.

Michael

30:14

I also think providing good support is expensive. Deep Postgres expertise is expensive. I'm a bit surprised by your experience with escalation. Most companies I see do have escalation paths, but I don't deal with Postgres managed service providers support that often. So I'm surprised to hear they don't have good escalation paths. But yeah, if that's the case, I feel like there must be opportunity for people. And I know some do really.

Nikolay

30:46

I have a question about this area. You're also running something on cloud, GCP, right? Yeah. Do you have Kubernetes? Yeah. You use it, okay. So you use GKE, right? Yeah. Google Kubernetes Engine, right? Yeah. So if you go to Compute Engine, they call it Compute Engine, right? Where you can see VMs. Yeah. Do you see the VMs where this Kubernetes cluster is running? I guess yes.

Michael

31:24

See the pods and the, yeah, see the pods.

Nikolay

31:26

No, not the pods, the VMs. Can you SSH to those VMs?

Michael

31:30

Oh, I have a shell, yeah.

Nikolay

31:34

So Google provides Kubernetes Engine, automation, everything, and you still have SSH access. Yeah. Why can't the same thing be done for managed Postgres?

Michael

31:48

Oh, okay. Yeah. Well, good question.

Nikolay

31:52

If you have SSH access, well, you can break things. Okay, I know. If I open up my car, I can break things there as well. This is interesting, right? I know companies that provide services to tune and maintain Kubernetes clusters. And this is a perfect example, because there is great automation from Google. Everything is automated.

32:20

But if customers have specific needs, and Google cannot meet those needs because they still have a limited number of hands, and attention, and so on, the company can hire another company who are experts in this particular topic. They can go in, and they have everything, including SSH access to this fully automated thing. Interesting, right?

Michael

32:45

Yeah. Well, any last advice for people, like actual users?

Nikolay

32:49

Well, yeah, I know I'm biased towards addressing platform builders, because I'm upset and angry, and I hope I explained the origins of my anger. But yeah, put as much pressure on support as possible. Politely, but very firmly, and with explanations. You had a great point that reasons and final goals need to be explained, right? And also risks, like what will happen if we don't achieve this? Sometimes up to, okay, we're considering switching to a different approach, provider or something.

33:29

Yeah, I think people should just be more detailed and put more pressure on support to squeeze details out of them. I'm very interested in this, because more and more managed Postgres users come to us recently and ask for help. And if support is doing its job well, it helps us as well. It's beneficial for all, because we help to level up the health of Postgres clusters, get rid of bloat, add some automation, tuning and so on.

34:05

But if support does a poor job, well, the customer starts looking in a different direction, where to migrate, right? So my advice to users: pressure, details and so on, towards support.

Michael

34:19

Is there anything to be gained in the cases where they give exceptional support? You mentioned rare cases where the support is very good. Is there anything that we can do in those cases, like, not just say thank you, but say this was really good, or give feedback that this...

Nikolay

34:34

What I liked a lot is when support engineers formatted responses very well, and I knew it was not an LLM. Well, maybe partially, but there was a human behind it for sure, because I saw it. Well, actually, who knows these days, right? Yeah, and in this case I would say thank you for the well-formatted, well-explained, well-structured response and so on. Definitely. So you're trying to find good things to mitigate my anger and calm me down. Thank you so much. Thank you for everything.

Michael

35:11

Well, it's good. It's interesting. I found this one interesting. Thank you.

Nikolay

35:15

Yeah, less technical discussion today, but I hope it provokes some thoughts a little bit. I think changes are inevitable. I'm very curious in which direction the whole market will go eventually, but let's see.

Michael

35:30

Me too.

Nikolay

35:31

Good.

Michael

35:31

Well, have a good week and catch you next time.

Nikolay

35:34

Thank you. See you.

Transcript source: Provided by creator in RSS feed
