Navigating SQL Data Changes: Tools and Techniques for Data Recovery - RUBY 645 | Ruby Rogues podcast

00:04

Ah, Hi everybody, welcome to another episode of the Ruby Rogues Podcast. I am your host today, Valentino Soul, and I'm joined by a very special guest today. Hey, I am you, Guinny. Yes, yeah, sorry, Yeah, welcome. I appreciate you taking the time to come on today. I'm gonna just introduce yourself and tell everybody why you've gotten so famous here on the show today. I'm not sure about famous, but I'm very happy to be here. And yes, I'm your Guinny. Longtime Ruby

00:42

Rogues podcast listener, thank you for reminding me. I'm a software developer who worked for many years using Ruby as my primary programming language and created some open source libraries and gems like buch loader for exam. Recently, I started my own company called Demi, and this company helps other companies track data changes in their databases. And I recently wrote a blog post about different audit trail approaches that can be used in Ruby. And yeah, today I'll be happy to

01:19

discuss it and share my knowledge. Awesome, thank you so much. And yeah, I appreciate bashloader. I definitely use that every day. So very cool. Yeah, So, I mean audologging such a popular topic to be honest, like one of those like kind of things that you over time will just end up using, right, it's something that you want to just also drop in and kind of hope that it works right, like and just like

01:52

don't have to think about it again. I think for the most part, what kind of you started down this path of like you know, really really diving into this. Yeah, it's funny that. Yeah, you mentioned that everyone is familiar with this problem or maybe at some point heard about gems like paper trail and things like that. But I noticed that it's mostly in the

02:19

Rugby community. When you step aside and for example, go into like JavaScript world or Golang world, like people have not heard about tools like that, they would use completely different approaches. And because Ruby has rails and active record as a like main or library, there are some lots of unified tools. And how I first encountered like, first of all, I used all these gems maybe for many years. For example, paper trail is one of the

02:50

most famous ones. What it does is basically, at a high level, just connects to like your model. In your model, you have some callback and these callbacks record changes in a separate table. And recently I worked at a company called Angelist, angelust Venture and angelist us Ruby, and one of the main problems that I tried to solve there was related to tracking. And there are many different reasons why people would want to track their data changes.

03:21

At this company, specifically, it was related to compliance and regulation, so the company operated in the financial space. It was important. There are some certain compliance requirements for example, soque socks and many different weird acronyms similar to SoC two which is mostly about security, but those are related to like financially reporting and things like that. And it was super critical to be able to record what happens with data in the company, as well as be able to

03:52

audit it and trace back like why this change was made. But I'm noticing that the full disclaim I'm working at a company called Demi that we try to build a productized solution something like that. And there are many different other reasons, like we're noticing why people care about audit trail in general. The most maybe straightforward one is rollbacks, like maybe you made some changes by accident or maybe by design, and you want to allow people to revert the changes they

04:26

made recently. That's one of the use cases troubleshooting is another one. So there are lots of companies that want to understand what happened. Maybe some companies implement on calls for their engineers and they do something manually, go in and do something for their customers, and that needs to be tracked as well. Regulation as I mentioned, in certain like financial sectors, banking and things like that, it's kind of like required. Some companies build activity feed as a

04:58

security feature or table stay features. For example, if you use like notion, maybe enterprise plant would have like a FTIVT feed for oddmans to see what happens within the organization. And people can also use these data for fraud detection analytics and things like that. Yeah, you make a couple of great kind of crucial points here, primarily like there's activity right like that people like the

05:28

trap, which I feel like often is like separated. And it's funny that you mentioned that this is like mostly like rails and Ruby specific, and I think maybe the other realms like go toward like the log capture approach. We're just like blogs are getting blasted and you try and like resolve all of the the time stamps as as they come, which I think is crazy Yeah, there are many different views on how to use, for example, data and

05:57

interact with the database. In their Ruby community, like convention over configuration, we have active record and many other ms data mappers for example. If you take a look at in the Go community, it comes from simplicity of the language. That is why many people try to stick with like a low level adapters and write sco queries themselves, and that is why it's even like hard

06:24

to build a unified solution like that. In the Jazz community, for example, there are lots of many ORMs that I don't know, work on edge and like work in serverless environments and things like that, So there are lots of lots of crazy technologies, which is also hard to build a unified solution. But yeah, communities like Ruby, Python, I'm noticing that. Yeah, they're more mature and they have already lots of great tools around that.

06:53

Yeah, that's that's really interesting. I mean definitely we could take some learnings

06:58

there. But I would like to just think about, like, you know, why people even like you mentioned a lot of like you know, mostly it seems like reporting and like trying to you know, align with you know, requirements, regulatory requirements, or things like that, even internal you know regulation where you want to be able to confirm that things are happening, why they changed, right, And I feel circling back to this like notion of

07:30

like activity is happening on your app and you want to like be able to isolate how the activity affected your data. Is that kind of like the approach that you're taking with themi of trying to visualize how those two things match up or you know, where do you focus on? Yeah, we take a slightly different approach. So there are many different approaches. So for example,

07:56

I wrote this blog post article that describe some of the approaches. At a high level, there are I call them callback approaches, so paper trail Audited, Mangoid History and gems like that. What they do they basically add callbacks and when you use ORM they would write history of these changes in the separate table. There are some tigger based approaches, so in a database you can write code, which is dangerous, could be dangerous, especially if you're trying

08:26

to implement complex business logic, but for simple triggers it works. So you can try to implement triggers when, for example, a record is being updated, inserted, or deleted, please do something. And there is one gem called like ITZA. It's very hard to pronounce when you see it how it's spelled, but this gem implements this approach with triggers, and I think that for majority of the companies, especially small ones, these approaches are great.

08:54

There is like low overhead. You just add them, maybe at them even just in case maybe you would do something dangerous and delete something by accident, or your logic was not tested well and it performed something crazy and you would want to roll back. That's kind of like good insurance. I would say. At DEMI we take slightly different approach. I can go more into details, but at a high level we use replication locks as a solution. There

09:26

is a pattern called change data capture. There weird like birding, but it's a data pattern used by data engineers very often. What it does is basically, yeah, capturing data changes from the database and moving these data somewhere. The most popular use case is probably ETL or solutions like that, when you would use some kind of like solution that connects to your database for example my squol, Postbose or whatever, and it tracks all changes and moves these changes

10:00

to warehouse or something like that. So we use this similar approach and change data capture CDC in short can be implemented in many different ways. There's like query based when you basically query by timestamp for example, give me all records that were updated from the last like hour when we checked. It could be implemented by using triggers again and beme, we use log based solution. So

10:26

posgress for example, has vowve ride ahead log. My sequel has been log mangodib for example, it has uplog like a skew light has vowalve as well ride ahead log. So all these locks they again briefly describe like why they exist in the world. So these right ahead locks, they're designed to be able to quickly record changes in the database in append only log before writing all the chain just to the disc with data indexes and things like that. That

11:03

are heavier operations and these locks they are crucial for data recovery. For example, if your instant crashes, it would try to recover from the last state, replace certain let's say changes from the wall and get to the later state. These replication logs can also be used for like cluster replication for example, if you have like primary and secondary, Secondary would connect to the ride ahead lock or replication log in a primary and try to read it and get and

11:35

seek the state. We use basically the same solution at Theme. We are sort of connecting like a read replica, but we also decoded. It's called logical decoding. We decode all these changes so it's you can understand like it's

11:50

not on a binary like file blog format. It comes from like decoded on a roll level, and you resee with all the changes and we simply stream and store them as a kind of like yeah in the separate table and people can use it and read it and people sometimes yeah, as I described, there are many different approaches like or use cases why you would use it, but at a high level, what we do we simply write a log of

12:20

all changes that people want to track. They can specify like what table to subscribe to, and there are certain technologies on the postulous level that were recently released that may help with performance as well. Plus we have some nice UIs that people can use to see like what happened to these records and things like that. This is really interesting. I love the article that you have that kind of breaks down all of this on. I've never heard of change data

12:50

capture before. It's kind of wild. There's a lot of pieces to the process here, which I think just comes honestly with a lot of the Avenger and kind of architectures. So I just want to take a step back, Like, let's let's walk through kind of like you know, why would somebody you know, aside from like, let's say, like what what are the primary use cases I guess for having your for having a need for that level

13:22

of like you know, orchestration and like granular rollbacks. Right, Like what you're talking about is having like I guess, a massive amount of data that you want to granularize what you're auditing and be able to recover from those in case anything should happen. Is that your primary focus, like focusing on that recovery when you're trying to do the audit trail or like where do you see a lot of the you know, the most common use cases of having the

13:50

audit trail and where like just having paper trail like starts to fail. Yeah, great question. Uh, we are very like we are noticing that it's a very horizontal solution and people use it for different purposes. That is why it's great when you give a solution for developers and they build something on top of it. We have maybe three main use cases. One use case is

14:16

just for troubleshooting purposes. We have, for example, there is one logistics company that has thirty or forty customer support reps and they're in the logistics business, so there's lots of real time data from analog world, lots of unstructured data, and they leverage AI and things like that to help with it. But they need to see what's happening, maybe help customers, their customers and cargo companies and things like that, help them understand what happened to this shipment,

14:52

for example, and they have lots of different inputs. It could be like constructured just an email. It could be a dog sign or something that changed the status of the shipment and they would need to troubleshoot and understand what's happening. So yeah, I previously like as developers, one solution is just going to a rails console or something like that. And if you use any of these solutions, you would just try to see like Querier record and get

15:20

all recent changes that happen to you. When you need to understand like why is it in this weird state, why it's not in like different state, what happened there, like what API in point for example changed it, and things like that, some additional metadata. The second use case is related to just yeah rollbacks. For example, we have there is like one company related to carbon removal tech for example, and they use POSTS, which is like

15:52

an extension for geolocation data on top of prosgress. So they do lots of changes in budgets and in certain cases when there is an error or something, they want to roll back to the previous state. And it's possible to just track certain selected tables and work with any extensions and things like that. And the third one is for many it's like table stakes, whether it's because of

16:21

regulation or for example, there are in healthcare it's very popular. There are like clinical trial management systems for example, they need to track what happened to patients data. And many of these solutions. Like one fun story like at one company, we use paper trail and we had so many locks accumulated in the versions table that we ran out of integers and we had to Like the

16:48

solution was simple. We basically okay, said start recording all changes in versions V two table, so we would keep the old one it will like cold archival, but all new changes please record in a separate table because we don't have any integers there and player IDs, So that is why it's good to use UIDs when possible. But one issue with for example pay Petrail and solutions like that is reliability. Maybe like people, maybe you heard that it's possible

17:22

to even with rails, to execute certain queries that bypass the callbacks. For example, if you execute like delete all or update all or update column or something, it won't trigger the callback, does it won't record the history, so you will not have the complete history. So that is where like log based solutions are critical and could be very useful because even if someone goes directly to a database executes an SCO query, it will be recorded because it happens

17:55

on their database level. And that's one of the kind of like main reasons. In terms of plus, there are some additional benefits like no performance penalty, so you're in the run time, you don't insert any additional records. It happens asynchroneously, but that's at a high level. Yeah, you made me think about because a lot of a lot of what we've been talking about

18:18

has been surrounding like just auditing very specific processes around the data. Right, Like we get a you know, going back to the logistics approach, right, like somebody's submitted an order and they're tracking the process of that and the order you know, I imagine has a table of orders and various aspects of the order that are being tracked with it, and that gets updated in the database kind of the same way, and it gets tracked, you know,

18:45

as it progresses through whatever business logic that we have. What I feel like is kind of missing from what we're talking about here is like that that overarching activity like almost it makes me think of open telemetry, right where we have like distributed systems right where they don't have to be necessarily you know, technologically right, but they are distributed events that happen throughout the business that fire off

19:15

and some activity happens and that has effect on the database. What do we do for like cases like that, Like are there like tools available to audit that process and how that's affecting the individual audit trails that are happening, and like where how do you approach that aspect of it. Yeah, you're definitely right. There are some like tools for example data Dog and things like that

19:40

that help with some visibility, observability, and tracing. Many of these tools they even implement a like just logging, so they would log what happens and for example, if you use like postcris, there is an extension popular one called pg audit. What it does it basically all queries that were executed. So if for security reasons, someone for example, gained access, you'll be able to, like during like cyber forensics operations or something, we will be

20:11

able to detect what happened and who executed what. So there are some solutions that do just log in off activity that happens, and you can do it manually as well. There are some gems like public Activity or ahoy gems that maybe used like Google Analytics on the front end or something like that, or in the browser it's similar but on the back end, when you can manually trigger certain events that you want to track and they will be stored in structured

20:41

format, not in locks, but in a database for example. There are also solutions that help with just performance, observability, open tracing and things like that. But there are not a lot of solutions that do something on the data level, and it's a little bit weird and funny because yeah, most of the products they use databases and that's the source of all data, and yet we don't track it like developers use geed for tracking and version in their

21:17

code, but there is nothing for like databases. And to be fair, there are some solutions that people sometimes use. For example, there are maybe two technologies that may help you solve this problem. One is temporal tables and another one is event sourcing pattern that some companies and people use, so temporal tables. I'll try to summarize it in the simple words. Basically, it's

21:51

within the SKAL standard since two thy eleven. What this feature does basically what I described it kind of like has a version of all records, So you have your record, plus you have a range and a timestamp basically a range

22:07

of time stamps during which period this record was there. So for example, if you're updating a record two times, there will be two records in a temporal table, and the first record will have like an end time range when the second update happened, and second update will have like infinite time range starting with the time when this update happened. So with the functionality, and that's

22:37

very powerful. It's usually helpful when you do for example, reporting I don't know, you need to generate some files or what happened last year, tax or something like that. That's very like powerful feature. Although many databases including postgars unfortunately don't have it implemented, so you would need to install some extension or build it yourself. So that's one option. And another one for just

23:04

all dating changes is rethinking how you make data changes is event sourcing. Event sourcing in simple words, it's kind of like maybe the best analogy is geit. So when we use GIT, we basically use an event source system. So GIT records all events, commits and all the changes that happened and then and that's the source of truth. So committs become the main kind of like unit of data. And from these commitits, the powerful feature is that you

23:41

can travel in time. So for example, if you want to roll back or see what was the state of like three commits back, you can do that by traveling and basically undoing the changes of each commit. I would say from my perspective, like in software engineering, event sourcing is like a unicorn idea. Everyone is talking how cool that is, but very rarely people implemented or do it successfully because it's a very challenging problem because you deal with mostly

24:15

with immutability of these events. So every change basically needs to be mutable. But what if you need to change the structure, then you need to implement some kind of like versioning or replay the events and append some additional data or like change the things. Plus then you need to somehow get to the actual state. So you'd record all events. For example, shopping card, you'd record an event that somebody added an item to a shopping cart, somebody clicked

24:44

on purchase, somebody purchased and it was paid or something like that. All these events would be recorded, but at the end of the day you need to understand what was purchased, for example, when showing the whole order history. To do that, people, you use projections sqres like for like separating reason rights, and lots of lots of other patterns from the book called the

25:11

Main Driven Design DDITY, and that becomes very complex. I would say that event sourcing makes sense in certain regulated industries, and there are lots of great patterns or like how people do it for example double entry book keeping system in banks or like yeah, accounting and things like that. This is standard where you would probably try to use this approach from the get go. Yeah,

25:37

you make a good point. I mean all of my even rooven design architecture related stuff as always, you know, the complexity raises as you start to spread things out and you try and like capture what you know, collect all of the things and merge you know, the state of things. You know, which pieces do you pick? Right? Like anybody that's had to merge conflict and get you know, can understand the issue here, right, like,

26:04

well, which which changes do you use? Choose right? And that definitely becomes more of a problem as you start to spread things out with you know, your event driven architecture. Yeah, exactly. Yeah, it's much simpler to use, for example, like crowd operations for example, just update

26:22

the record and that's transactional, right and do it or not. But it does make me kind of think like okay, well you know, as you know, it makes me wonder, like I guess if you have a need for right, like the things happening at different times, but really they're all the result of like one thing happening and having a simple audit trail, like it makes sense to just throw a gem in and then just have it handle

26:56

all of the things. But when it's like when it's something like you know, a time series database or something like, are we just solving the wrong

27:10

problem right? Like why not just have something like in a time series database that keeps track of the data as it happens versus spreading it out like what's the where do you see the benefit of having it be distributed like that over having it be more transactional and having things like Okay, well it happened in this series, Like you know, is it worth it what I'm saying, you know, to have that distributed nature versus just like taking the time and

27:41

doing things in series. Yeah, that's a good question, and I think the best answer is favorite developers answer is it all depends. But I think if you need strong guarantees and you have scale, then it totally makes sense to use this system. The question is then whether you buy or do it

28:04

yourself. If you do it yourself, then you would need to start thinking about all these like technologies school technologies like Kafka and maybe storing it in like yeah, time series databases and columnar databases and things like that or warehouses and use it somehow. If you need to use it from the app, then you need to somehow connect to I don't know snowflake to read it if you

28:30

need it for like transactional or like operations. Basically I think like back to this like anecdote with large number of paper trail records, like anecdotically, like from talking to other companies, I saw that for many of them, just managing the database becomes much more expensive when you record it everything in the same database. You would notice that, like this version stable becomes your the most

29:06

biggest table in your entire database, and that affects everything. It affects like backups, the storage, how fast your database would run, how difficult it is to upgrade it at scale, how fast it will start, how much traffic is there plus activity and things like that, and at some like that's probably like totally valid approach at a certain scale, but at some point you

29:37

would probably notice, like why am I storing it in my database? That is being killed by this performance penalty because you're essentially doubling the throughput on every For example, insert you would insert an additional record, or it could be even worse. If you're inserting one hundred records in one query, it will be one to one one hundred plus extra one hundred queries. For each record

30:03

that was inserted, it will be one hundred additional skill statements. And at some point it makes sense to somehow extract these data and it could be still in I don't know the same type of database that you use, and there are some benefits to it. For example, if your main transactional database is postgress, for example, what we do we store these changes and we call it destination database where the changes are stored from the source database. We store

30:32

it still in postgress. The reason why is because one is easier to consume. You just use the same adapter, same or m For example, in rails, you can specify a couple of connections for example, for these models go to these database, but for this model go to the other database. So you're on the application layer. If you want to consume those, we don't need to change anything. And two, especially in postgress, there are lots of cool tech technologies that help you with scalability. One is maybe yeah

31:03

time series. We mentioned, there's like time scale extension or super extension because it dramatically changes the way positives works and it allows you to scale, compress data, and record events in chronological order that it is designed for that with all these charding, partitioning and things like that. Yeah, that's really interesting. You know, that's definitely something I've seen, you know, scaling as

31:30

things go. You you you know, you want to set it and forget it, but then all of a sudden you're like, well, why is the database crashing when when we want to like you know, audit the audit trailing. Yeah, in affects not just the database, even like your run time. So your application is slower as a result because it waits for database,

31:52

but also tests. For example, if you don't disable for example versioning for your models in tests, Uh, imagine like eating a large like database or lots of things and testing things, it will just yeah, kill your test performance as well. That's a great point. So I mean, what

32:12

what can we do to like avoid these pitfalls? Right like aside from like, like you said, you like using some of these other extensions, Like, are there like best practices just like out of the gate that you can do to like maybe avoid some of uh, you know the performance issues upfront without too much complexity. Yeah, I think there is no like silver bullet.

32:35

Unfortunately, at smaller scale. For example, if you're a startup or something and you don't have lots of people or like customers or users, then it's probably not a problem and you can just buy yourself time in terms of velocity by building features and not thinking about optimizing things. But yeah, as you scale, unfortunately, companies hit this kind of like moment when they realize

33:02

they need something and they already build it. Usually like cod basic becomes giant, there are already lots of like team members and engineers on your team, and now you have to migrate everything, and that becomes very difficult to do if you want to like change things dramatically and these migrations. Yeah, I talk to some like people who build similar systems at I don't know, Airbnb, Gusto and other companies. It's all like very difficult when you need reliability,

33:36

performance and things like that. Yeah, I would say, yeah, I think it depends on your business. If you're starting and you think you would need all these kind of like tracking functionality and it's business critical. It makes you a business better. You can help, it can help you win

33:54

customers and things like that. Then I would think from day one, like how to potentially designed this system either use like event sourcing, or use these CDC change data culture or like some products that out there something like that. But if it's just like secondary, then you can either buy at later stages if you're successful some solutions or just it's not like the main problem and it's

34:22

probably will be deprioritized for many many years. You know. Part of me always wish that there was like a log rotate version for database tables or things like that, right where okay, after so many so much time has passed, you know, just take whatever's there and like stash it somewhere else. Yeah, almost like an auto database charter. Yeah right, yeah, yeah, Unfortunately it doesn't exist, but yeah, you need to build it yourself.

34:53

Yeah, and we we plan to do that. We call it like retention period, so how long people need to retain their data and changes and audit Because storing everything in the hot database that can be quered at any time is expensive. That is why people sometimes move it to like some other cold storages and archival or something like that. Yeah, it's three buckets and that's

35:15

awesome. I mean it's definitely needed. Uh. I wish so many times I had just like an option to buy that, right, Yeah, rather than have to build it and manage it and worry about performance of everything. Yeah, for sure. So where are you taking this from here? Like,

35:37

what what are the next problems for you to solve? I mean we've talked about quite a lot of different so I mean we haven't dove in quite to maybe some of the specifics, like you know, doing various layers of logging, right, Like we've you've talked about like HTTP logging versus data based specific auditing, right and console log in which we haven't talked about, but really interesting stuff from like thirty seven signals right with Console eighty four and audits

36:08

for doing that in the reals console. Yeah. There are many layers, Yeah, where so many layers? Yeah, many layers are covered and some are just there are huge gaps. So for example, you mentioned, yeah, HTTP logging, it's all done. For example, you can application execution, it's done. There are services data maybe it's solvable, and now we're trying to solve the dot on the database level console access. It's also I

36:37

would say a little bit specific to the Ruby community. For example, in the JavaScript community, there is there are not usually good like repal consoles or that come with frameworks and things like that. And it's good and bad. I'm noticing that people. It's good that people don't go to this console to do crazy things there, but sometimes they go to scale console directly or connect to and scale and perform changes on the database level. I don't know which

37:09

one is better. Yeah, with a review, Yeah, it's luckily we have great console experience. We have like gems like Pride that allow you to i don't know, go and jump into like classes and expect objects and things like that. And yeah, there are a couple of gems. One is Console nineteen eighty four and another one Audience nineteen eighty four. These are interesting gems. I personally haven't used them, but they seem to solve like very

37:39

popular problem, especially among B to B companies. Probably when you have customer data or customer requests. You need to do something manually, but you want to audit what changes an engineer did. For example, what these gems do. They simply you install them in the rails project for example, when your law to your rails console. They'll prompt you and ask like why are you

38:06

here, like what do you want to do? And engineers can specify like a ticket number and specify some context, I'm solving this customer problem blah blah blah, And you can type all the commands, and the commands may be locked in a separate database table. And the audience Jam nineteen eighty four allows you to inspect all these things. You can implement some additional approvals or like

38:31

retroactive approvals. But these solutions are like and I don't know whether there is any better solution, but these solutions are not bulletproof in terms of reliability accuracy. Because it's Ruby. You can do whatever you want in a console. I can overwrite everything and then after that do crazy things that no one would see and notice. Of course, you're hoping that your colleagues or you won't do that breaking all the rules. But it requires discipline, and yeah,

39:04

it's funny. I tried to implement something similar once at another company when we had a little bit more complex set up with SSOs to a WS accounts. These accounts would run on kumbernities certain process with the Rails console with some pre defined en viireables and these kind of like sets certain context for for example pay per trail or tools like that, so you can see who is making the

39:32

changes. And yeah, I haven't shipped this feature or it was not like released, but it's still I would think that it's a problem for many So if you're listening to this podcast, yeah, that's one opportunity to improve how to improve log in and auditing of their console sessions because many companies just leave there like developers leave there troubleshooting purposes, yeah, and things like that.

39:59

Yeah. Yeah, I remember first saying console nineteen eighty four and the audits nineteen eight for a kind of funny naming, but yeah, really cool. I've used them in a couple of projects. And it does have like you have to use like what is it the active record encryption stuff dataly so if you use maybe lock Box or something else, you're kind of like sol. Yeah, but it's really interesting, Like the interface is definitely pretty smooth,

40:34

and I do like it. But you're right, like it doesn't really prevent anything. It's more just like all right, well, as long as you you know, and you know the goal is to hire people that you trust anyway, So as long as you have like at least a process in place to audit it, it definitely makes things much easier to track that. And

40:51

from your experience, what did you do in these consoles? Usually, Like from my experience, it's usually either checking the data what's there because you don't have like fool UI for everything or something like that, or sometimes editing some stuff there as well. Some curious whether it's a common Yeah, I hate,

41:12

I hate going in production console. So for me, it's mostly querying if anything, right, Like, you have a bunch of related objects and it's like almost not impossible, but like much worse to like go into the you know, a sequel dump or something like that and sift through the queries

41:31

for it. But sometimes like you know, you need to, you know, just update some timestamp somewhere, and it's just like, you know, rather than having to generate a rag task that goes and runs, right, if you just have like one record that you're trying to get to that's problematic. That's like you know, failing a ton of repetitive workers or something like that, and you just want to stop it from doing that, you know,

42:02

but you want to track that. It's that that that you were the one that like updated it, right, It's like it's very rare that you need to do that. But there are those cases. Yeah, in these cases when people need to update data. Yeah, I think there are many also approaches how you do that, like one one off, Yeah, you can write a rake task. Some people run it in migrations, some people

42:30

execute stuff in rails, console or things like that. They think there is also like opportunity to improve things because writing, like it's good when it goes through the code, but the cycle can be in feedback cycle can be still like long and you don't know like whether you wrote everything perfectly or not. You need to test it and things like that. And yeah, I mean I would say in general like probably don't update things, yeah in a production

42:59

box and general like no writing. I feel like that's a good rule of thumb. Many just ben yeah, access like only for example one person, or like certain like deblop steam can access or something for troubistioning or like certain

43:13

emergencies. I know, I know some companies will do like you know only uh you know the console access only gets read replicas, so then it just prevents it in general, right, So then you just you can only read from the database in general anyway with your connected to console, which is really interesting and definitely like prevents it. Yeah yeah, but definitely not as flexible.

43:37

Yeah, Ideally you'd need to tightly isolate everything, like there are certain other distributed systems like readis cash man cash people can also use and drop things by accident. Ideally would also want to isolate those, maybe certain HTTP requests, And we also like logging each TTP requests that are going from your application because I can use all environment viiables and send I don't know, stripe request

44:04

to send myself like one million or something. Ideally it should be also traced like who made these HTTP request and or maybe ideally prevented from doing it in the console. So, I mean you bring up a good point, like I feel like we haven't touched much on like the recovery aspect of all of this, right, Like the whole point about and logging is in case you need to recover something, right like, or you need like next to reporting, right like, so reporting you can gather Okay, well we're doing a

44:34

you know, audit specifically of everything that changed this particular thing. But like most of the times I feel like the reason for that is there's a recovery aspect following that up, right, Like, so where is that in this whole process? I know, like some of these jumps we've talked about have some of this, but I feel like it's still kind of missing, Like what is what is your like approach there, and like how are you working to like resolve this kind of like stuff. Yeah, recovery is very tricky

45:07

problem in general. One issue, for example, is what if you somehow backed up some data or certain rows. You have the record historical version, but now your schema has changed. If you're using a scale like databases, Like, how do you resolve this conflict? What if you added the column you prepopulated it, but in your history you don't have it. Now you have to prepopulate it because the column may not like contain like shouldn't contain null

45:37

values or something like that. So if that becomes a problem. In general, you have gems like paper Trail. They allow you to query the version find the right one. There is like a helper method called Rafi. You call it and it basically instantiates an active record record with the version, and then you can call say again and the tool overwrite and store the historical changes. As I mentioned, Yeah, there are some schema changes that they kind

46:10

of like up to people to decide how to resolve these conflicts. There are some other different approaches that on the database layer people solve. So the one is very straightforward to snapshot or take back ups daily nightly. Ideally you maybe test your disaster recovery procedures and test whether you indeed can recover because very often

46:34

no one tests those and just things that they're safe. And there are also new popular techniques maybe heard about point in time recovery, So that's popular feature that many database hosting providers advertise what they do at a like in terms of implementation, how it works, Okay, how would you use it? You would use it by simply selecting any time you want to recover data, and it will either create a new database instance with the state as of this time

47:09

stamp or it will replace your database state. The way it works it uses these ride ahead locks that I mentioned previously. Because ride aheadlocks are like append

47:22

only logs, these solutions they continuous to archive them. So some companies just take these valves and world records and they move them to S three and then if you need to roll back, they give you like either the timestamp when, for example, a certain blob was archived, or they may also do some binary search within a certain file to find the exact timestamp you want to try to recover to this data point. And there are some other like cool

47:57

tag built on top of this idea. If you have like fowl and it's kind of like stored somewhere reliably, then you can do branching. So there are some serverless lots of server less hosting providers. What they allow you to do is to use basically copy on write copies for for example, your branch.

48:20

When you want to test something on staging, what they allow you to do is basically create a database that will simply point to the state from a shared voeve kind of like log So instead of copying all data again and paying for all x ray resources, it will copy from the main let's say history, but any additional changes that you'll make, it will branch off and it will just incrementally record some additional stuff. So some super cool technologies that out

48:47

there that help with branching as well as data recovery. Yeah, I've definitely seen the branching. It's pretty interesting, like is that a you know, it can just either wipe it off or like integrate it as as it merges down if you want it to, which is pretty interesting. Like, but in reality though, like what what have you seen? Kind of like things evolved too, like you know, what are people actually using? Are they

49:15

using copy or on? Right? Like it seems like a very like complicated process to integrate with, like what what are the where the trend's going? Yeah? I think in general people prefer cloud managed databases usually for these reasons because they can manage backups reliably. So if you run something on RDS or like I don't know, render SAPA, base, HEROC or whatever, you can trust that the backups they take they're probably fine and you'll be able to

49:47

roll back if necessary. That's one first line of defense. It's just relying on your hosting provider and any additional ones. Yeah, related to using additional tools, like the way I think about what we do. For example,

50:04

at BEMI, we provide selective point in time recovery. What it means instead of like it's very hard and dangerous to roll back the whole database state, rather than like rolling back certain data records that were affected, and with tools like that, when you can filter out changes and select the ones that you want and just apply those, that's very useful because usually, yeah, I rarely saw when people would recover from like full database back up, because it

50:42

means you would lose your history what happened after that, if certain other clients used it, and things like that, So it becomes very problematic and not very practical. So I have to ask this mostly because it's my interest, But like, do you see like AI fitting anywhere in this process us? Or is it kind of just like you know, there's this it's two sensitive

51:04

of a topic to integrate with. Yeah, I think there are some, and I would share one idea that the problem we try to solve with AI, maybe we'll solve it, But it's okay if you have a desire to do it, to go for it, because it will be great if you can solve it. But one maybe obvious solution is like AI for SKL. There are many non technical people who want to use like query data in a like business intelligent tool, metabase or something else, but they don't have like

51:42

the knowledge how to stitch things together for complex scenarios that's one idea. Another idea is we mentioned the audit trail and log and for example we have some customers, even non technical people use our UI for troubleshooting. But what could additionally help is data mapping. So usually database structure, it doesn't very well translate to real world language. You'd have some different ideas, maybe especially in

52:15

Mongo. I remember like all days in Mango, it's like document based like database, and it doesn't have a structure, so the column names you may have different columns sort of columns. It was recommended for performance reasons to shorten your column names to like one or two letters or something like that because it would take less space. And imagine seeing like an audit trail with weird like data changes and with m is equal these timestan b is equal weird string.

52:45

People would not be able to understand it. So where AI could feed it here is basically taking the structure and humanize it. Understand the data structure, understand maybe some business logic from the code base and muppets or anyone even non technical like customer support ors like the CEO could understand what's happening if they don't have like knowledge about how data is structured, and that's in general related to

53:13

these ubiquitous language DVD concepts. Ideally everything should be the same, but that releively happens. That's one idea. Another one is like some companies they they maybe you heard about techniques like RAG for providing some additional context with custom data to your like AI system, so it can use internal data. Maybe if you've use perplexity. Perplexity uses other resources that maybe even like updated in real time just an hour ago. Plus they have all these citation links pointing to

53:54

the source, so red techniques can help you to do that. But companies in general, when they deal with internal data, audit can also help. So you want to like have clean data, especially if you operate in the space when there are lots of documents and structured things maybe files, maybe emails. So I'm not sure, like it's for me, I don't know like

54:22

how to productize it or what's the exact like solution. But the problem is that, yeah, people want to take this data, but they also want to understand like what happened, for example, why it was like that, why these AI systems aid it like that, maybe we can find unit and you'll see that with like, yeah, certain conflict changes your response rate and accuracy has improved and things like that, so it's important to audit the results

54:52

of the AI queries as well. Yeah, super interesting takeaways there. Personally, I'm looking forward to, like, you know, just just getting some observation aspects of it out, so you know, like kind of just automate, automatically getting like the you know, bullet gem or something like that that just tells you, ay, like these things are not performing well, but without having to actually capture them. You just look at it sideways. Yeah,

55:23

I hope. Yeah, things will improve and things will be implemented. Yeah, I use the topical pilot and yeah it's some people hate it, some people love it. I like it. When it changes your mind. You have to know and understand how to use it as well to better leverage it. Sometimes I would write a comment explaining what I'm trying to do and do it then so just me some code that more or less much is but then I still need to check sometimes and things like that. But yeah,

55:52

there are lots of opportunities for AI. Very cool. Well, we've talked about so much here. Is there any other aspects you wanted to cover before we move into picks here. No, I think we covered a lot of ground. I think in general, Yeah, that's like tracking changes. Auditing is a big problem. There are tons of tons of ways how to solve this problem. Yeah, and I would recommend people try to evaliate them all instead of trying to just going with one by default because it's the hold this

56:31

one or something like that. See what works best for you. Yeah, well, for sure people should check out bemy and your blog has a lot of great content around audit logging and tracking various activities and the database things you can add. I would definitely recommend people check that out for sure, because I'm already going to dive into quite a few articles on there. You've got a lot of great stuff there. All right, Well, let's dive in the picks. Do you have anything you want to share or I can go

57:06

first if you want some time to think about it. I have two picks.

57:08

One is non technical and one is technical. So the non technical one is maybe if you're old enough, you remember there was a technology called RSS and now I'm trying to use it on a day to day basis the way I'm using it, So there are for example, lots of great technical channels or content out there, and instead of these systems being pushed to me or me checking I don't know subreddits or like hacker news and things like that, I try to aggregate all my content as a kind of like funnel into an

57:50

RSS feed. Some of them I use feedther, but there are many others and there's similar some of them they have help you get email address, so you can also subscribe to substack without polluting your in box. So I'm just organizing it and different subfolders I don't know tech related, I don't know, some engineering related and other topics, and I aggregate everything like stopstacks, top

58:16

subreddits within a day or something. With some ural arguments, you can get to these feeds as well, tech blocks from different companies, top hockering news within a day or something, digest. So lots lots of resources that can be accumulated there, and I have just a queue that I'm going through and reading when I have time on mobile or on a desktop. And my second technical peak is Nix. I'm not sure if people heard about this system,

58:45

but it's basically package manager and system configuration tool. It's sort of like our band r VM, SDF and all these things. But it's more powerful on the almost like operating system level. So it's not like just limited to programming languages for example, but the way we use it, for example, we have lots of projects in different languages. There is a full disclaimer. Nicks has, from my experience, has horrible devicts. It's very hard to use,

59:20

very hard to understand what it does. So there are some wrappers around it, for example death box, but there are many others we use death box. So essentially, what it allows you to do is to bring any tax stack without Docker to your local machine, and it's isolated to your project, meaning it will create a separate folder for your postgress for your readis whatever, for your like jilascript engine or not Jazz or Rugby and things like that.

59:49

You can specify all these dependencies so when you go to your project, you can keep using your I don't know favorite at a Sage orgh or whatever shell uh, and it just automatically enables all these additional dependencies within this project. Yeah, that these are my peeks for today. That's super cool. I'm gonna have to play around with that. Yeah, Nix is definitely a terrible, So thank you for sharing death box because that's pretty cool. So

01:00:30

I only have one pick today. Uh. I my awesome co worker Anna, She wrote this incredible article on the Toroximity blog recently on evaluating GPD outputs and how we do it for medical professionals, uh and using golden data sets and programmatic emails and some human emails as well. Uh, to just make sure that uh, you know, that output you get from large language models is accurate, uh, at least following you know, a certain set of

01:01:07

standards. So I recommend checking that out. It's really awesome and just like goes through a ton of stuff that leave them working out, which is really fun. That's great. Yeah, I hope that all healthcare providers not just blindly used AI without really too. Uh. Yeah, it's been great talking to you. I love to see all the work that you're doing, uh to help make auditing easier for a lot of businesses. And uh, you know, thank you for all the documentation and coming on the show today was

01:01:44

really great. H you know, chatting with you today. If people want to get in contact with you, how can they reach out to you or find you on the web? Yes, I mostly and rarely use Twitter or x it's cold nowadays. I have like medium githop everywhere. My handle is exaspark. You can find me there. And yeah, thanks for having having me here Valentina today. It was nice chatting with you. Yeah, totally.

01:02:16

Yeah. I mean until next time, everybody, I hope you learn a lot and can make use of a lot of this stuff because I know I will. All right, bye now, bye bye

Transcript source: Provided by creator in RSS feed: download file

Navigating SQL Data Changes: Tools and Techniques for Data Recovery - RUBY 645

Episode description

Transcript