Ben Walks A Tree

Matt Godbolt

00:21

Hey, Ben.

Ben Rady

00:22

Hey Matt.

Matt Godbolt

00:23

How you doing, my friend?

Ben Rady

00:24

Uh, I'm working. I'm working today.

Matt Godbolt

00:26

You're working? Yeah. Why

Ben Rady

00:28

Would you do such a thing? Doing the, doing the day job? I am staring at a bunch of, uh, comments in my code that are like, Hey, Ben. Hey, later, Ben.

Matt Godbolt

00:39

Yeah. Future Ben.

Ben Rady

00:40

Comments, written by earlier Ben. Yeah. Past Ben. Hey, future Ben. Uh, you need to solve this problem. So, oh, um, you know, solve this problem. And now I'm looking at these comments being future, Ben going, yeah, I guess I gotta solve this problem now.

Matt Godbolt

00:54

Well, I tell you what, that past Ben has a lot of confidence in you. Misplaced? Was it misplaced confidence?

Ben Rady

01:03

Uh, hopefully not.

Matt Godbolt

01:04

So, this is an interesting, actually in sort of the taxonomy of comments. There are like the say what you're doing, which is al almost always a mistake to, hey, you know, add three, you know, the comment says add three to the blah variable. And you like, then it does, you know, and then you know that underneath it blah variable plus equals four is what it's inevitably going to say, right? 'cause in the comment rots, um, so, you know, one typically says, you know, like, well, if you can't make it obvious by writing the code, uh, so that it's reads like what you want it to say, then you've probably got a problem with your code. But then there's the, the why explanation, which is, you know, like, Hey, I'm doing this because of this thing that's not obvious from the context or the code or whatever. Those are, okay, but then this is the insert code here. This doesn't do anything yet. Version of the comment.

Ben Rady

01:51

Yes. Yes.

Matt Godbolt

01:52

So this, this, what is the problem?

Ben Rady

01:53

This is my, uh, uh, this is my task tracking style of comment. I actually do this thing where I, we might have talked about this actually, where I have a special check in, um, branch builds for TODO comments. So TODO means very, something very special in this project. And it is basically like I've queued up some work that I need to do, and you can't merge to the main branch if you've got any of these things left in the code.

Matt Godbolt

02:17

Interesting. That's a good discpline to have.

Ben Rady

02:19

Yeah. And it's actually, it's actually also a great way to collaborate because you can sit down with somebody and kind of like put some TODO comments in the code and be like, oh, we need to fix this. We need to fix this. We need to fix this. And not lose track of the fact that as a group, you need to do this work.

Matt Godbolt

02:32

That's interesting. I've definitely, I've worked on projects before where we've not allowed todos to be merged into Maine. You know, like, like essentially we can't pro promote anything to production if it has any todos in it. But nowadays I subscribe more to the provided the to-do has a tracking issue associated with where like the, the finesse of like, why you, it's okay to defer it for now and it can be scheduled and put back on like, someone's work to-do list that actual list, then it's okay. But it might be nice to have like both worlds where you're like, this is a long term to do, like, hey, to do, we know that this doesn't scale, but for the next year it's likely to be, okay, eventually we're gonna have to replace this order n squared algorithm with something cleverer. That's the kind of thing you can put in a, to-do comment, uh, obviously until you forget about it

Ben Rady

03:29

It will not accidentally go into production without having looked at this. Yeah.

Matt Godbolt

03:34

Right. Like, so we, on the other end, I have pre-commit hooks that say, if it says do not commit anywhere in my repository, then. That you can't do that, but do not merge is something I have written, but I have nothing that prevents it from being merged into main. It's just like a convention there, and you hope that when you're diffing things you'll go, oh, whoops. Yeah, I should, so maybe we should enforce that. I mean, anyway, that's not why you wanted to talk about this.

Ben Rady

03:55

So, yeah. So what is this comment? What is this TODO comment? So basically the problem that I have here is I have a sort of unknown you can think of as directory structure. This is actually in S3, but you know, we're we're, We're we're, we're treating keys like directories in, in S3, and that all works just fine.

Matt Godbolt

04:10

Which is a directory. As they do really.

Ben Rady

04:16

Yes. It, it all works just fine. So you have this sort of unknown directory structure. Right. Yeah. Um, so there's some root path, some root key that you know, but underneath that you have some heterogeneous mix of files and what you're trying to,

Matt Godbolt

04:34

And, and sub directories as well?

Ben Rady

04:35

Yes. Many like files and sub directories, potentially many layers of sub directories. Right. And, and you have no idea what the structure is. Uh, and what you're looking for is, uh, files, and for the sake of simplicity, let's just say that they're all one type, let's say CSV files. So you're looking for CSV files that, uh, have the same base path. So one of the, the sub directories in the hierarchy of sub directories that is common contains underneath it a homogeneous set of CSV files all with the same schema. Right. So they can be in different sub directories. And an example of this would be like separated out by date. Right. So maybe, yeah, you have something where it's like, you know, you've got a base path, and then under that you've got a, a a a a subdirectory per day.

05:30

And then each, in each of those days you have, you know, a handful of CSV files and you want to be able to detect that that base path has all these different files and they all have the same, uh, schema, and there are no other files in there that have a different schema. So part of this problem that I've already solved is a sort of schema detection. So I have, or I already have, uh, an algorithm where you can, you know, take basically a, a stream and send it in there and be like, Hey, this is a CSV file. I want you to inspect the headers. I want you to inspect the actual values in the CSV file to make sure that they're, you know, detect what the types are,

Matt Godbolt

06:09

Like noneness or otherwise of, of them, or Yeah. Date times or strings or whatever I get, right?

Ben Rady

06:13

Yeah. Yeah. It's sort of like all the, the, it devolves into string, right? It's sort of like, if I can't figure out what this is, I'm just gonna call it a string, right?

Matt Godbolt

06:20

Got, it.

Ben Rady

06:20

But if I can parse it as, you know, a datetime or a float or a 64 bit integer or a 32 bit integer, like that's what the type is gonna be.

Matt Godbolt

06:29

Yeah. That's the derived type. Yeah. But you infer it. Yeah. Anyway, but that's the solved part of the problem.

Ben Rady

06:31

Yeah, that's already done. So that works. So now what I have is I have, uh, the algorithm I need to implement here is you're given a, a, a bucket, you know, an S3 bucket, you're given a prefix within that bucket, and now you have to scan that prefix and come up with a list of schemas where the schema contains the results from the inspector. So like the columns and the types, right. But more importantly, the paths and the, the, the tricky bit here is if you, if you have something, it's entirely possible that you scan this and you come up with an empty list because of the, the structure that you're given is just a hodgepodge of heterogeneous file types, and there is no root key that gives you a homogeneous set, then there's nothing to do here. Because what we're producing from this is basically, uh, uh, a, a queryable table where you can give it that path prefix and say, everything underneath this has this schema, has this type.

Matt Godbolt

07:38

Okay. So you are trying to find the highest point, treating it like a graph with leaf, leaf nodes, where the files are, you wanna find the highest point where all children leaf nodes that you can find from that point have the same schema mm-hmm.

Ben Rady

07:59

Yes.

Matt Godbolt

08:00

Is that fair to say? So like, if you had, um, if, uh, so a directory A and A directory B, and A then under A is all these dated directories, so A/2023/04/14 Slash some other thing slash whatever thing. And then there's a file csv, and if you could say like, well, all of the files for all the days, for all the, the measures under this route, under A have the schema where it's a datetime and a value.

Ben Rady

08:29

Yeah. There's three columns with these names and these types, and they're all the same. Yes, exactly.

Matt Godbolt

08:34

Across the whole of that. Right. And then, but conversely, on the B bucket, there's a, there's a different set of things because it's maybe a completely different, uh, set of data, but you don't know that information a priority.

Ben Rady

08:43

Exactly.

Matt Godbolt

08:45

So you just want just, um, yeah, that's an interesting problem. And there are some sort of S3 specifics that I can think of that would either make this easier or harder, and depending on how efficient it needs to be, that's also a question to, to consider mm-hmm. You know, parallelism or otherwise. Um,

Ben Rady

09:04

Yeah.

Matt Godbolt

09:04

Uh, so for, I'm gonna think out loud at you and see where we go from there. But.

Ben Rady

09:14

Sounds good.

Matt Godbolt

09:14

interrupt me if I'm, if I'm gibbering too much, so logically you can just ask S3, give me a dump of everything you have, and it will just start from a, and give you every single giant long path name of everything. Because really under the hood, it's just a key value store. Yes. Where the key is the path and the value is the contents of the, the thing,

Ben Rady

09:36

So that I have implemented, I have basically a recursive search from the prefix where it will walk recursively and give you, all right,

Matt Godbolt

09:44

Well, that's the trick. You don't Yeah. You don't have to do that recursion, because if you

Ben Rady

09:48

It, oh no, I'm not doing it. Yeah. I'm, I'm just saying the S3 client is doing that for me. Yeah.

Matt Godbolt

09:52

And it's not even doing it. Right. The client's not doing it either. Right. The, the, the way S3 works is if you give it a delimiter, it will use that as a point to stop. Which will then mean that necessarily you'll hit like, well, each level of the directory, but if you don't give it a delimiter at all, it just doesn't flat out listing the, the Yep. The keys with no, so there's no sort of hierarchy except the hierarchy that you put on top of it by saying, by the way, the forward slash is is magic to me. Right,

Ben Rady

10:18

Right. Right. You're just listing all the keys that start with that prefix. It's really nothing more than that. Exactly.

Matt Godbolt

10:24

So you, but you've, so, so you could go the route where you do that, which is efficient from the point of view of just paging through the results from S3, and you get this giant list of things, but it's also, you then kind of want to re infer afterwards the structure of the directory is given this list of things. Um, so you could treat it like a tree. And so for every one of those, uh, nodes, you, you construct a tree in memory for each directory and the directory has children and the directory, the children may be other directories, or they may be files and files have schemas. Right. Yeah. Or you could just treat it as a schema. Right. Um, and then it maybe flows naturally from that. So if you've got it in all in memory, which may be not feasible given the scale that you are, you're working at,

Ben Rady

11:14

Uh, I think it's okay to have it all in memory, actually. Yeah. I think that would be fine.

Matt Godbolt

11:19

Uh, but then, yeah, you could write the, um, the sort of recursive thing of say, give me the set of your children. Oh, let's think about this. One question you could ask is, you know, how do you have homogeneous children? And that can be recursively applied, you know, like, if you don't have, you, you say, like, if all my children have a, the same schema that, that are files, if I ask the same question of my directory children, what is your common,

11:53

Um, schema, and it's the same as my file children, assuming that you are allowing to mix files and directories, which maybe you're not in this instance, maybe the leaves are strictly at the very, very bottom of the tree, but you could generalize both. Right. But then essentially the A directory is, am I homogeneous? Um, do I have a homogeneous schema? Is the answer to are all my files the same? Yeah. And if I have directories, are they also homogeneously the same schema? And then it sort of naturally falls out, and you could memoize that so that it's pretty quick and easy. Um, and then obviously you, what you do is you start at the very, very top of the graph and you say, are you homogeneous? And if it says yes, you're done. Right? There you are, everything's the same at this point. And if it's not, you say, okay, well then let me find all your children and ask them, are you homogeneous? And as soon as you find one that says, yes, it is, that is the leaf, that is the, a root of a new tree that you never need to inspect further after that. Now obviously it has to recurse down.

Ben Rady

11:53

Uh, So you build up the tree, and then you do a breadth first search across that tree looking for nodes whose children are homogeneous, basically.

Matt Godbolt

13:04

Yes, and you can stop at that, but obviously you still are visiting all the nodes because in order to ask the question, are you homogeneous? Right. Internally, it's going all the way down and back up again. But if you memorize it, yeah. Uh, then you don't have to hit the nodes more than once. So that would be a top down way of doing it. Mm-hmm.

Ben Rady

13:20

Is there a way to do this where you sort of do like an order N pass through the keys? 'cause I, you know, sort of the piggybacking off of the, what the API from S3 is giving us is sort of like, all right, I'm gonna iterate through all these keys. Right? Yeah. And then along the way, I'm gonna take each key, I'm gonna do the inspection of the underlying data at that point, which is, that is actually the most expensive part computationally.

Matt Godbolt

13:48

of this whole, yeah.

Ben Rady

13:49

right. Like, you know, there might be a million keys, but like reading all that data is the thing that's gonna take a long time. Right? Yeah. And inspecting the schemas. Um, but you, you sort of inspect the schemas as you go. Um, and my thought here was that maybe there's some efficiency to be gained from once you've detected a heterogeneous sub key stop scanning, there's, there's no point in continuing to do this super expensive schema inspection once you've detected something where it's like, oh yeah, this, this directory has two different types of files in it. It's got a thousand files, but there's no point in checking the other 998 because these two are not the same,

Matt Godbolt

14:26

Are different. Yeah. That's very true. Yeah. So you could definitely, I mean, if, if, if, I guess that's an error con, an error condition or an under specification when we were talking about this to start with, which was, you know, I was assuming that you, you could partition the world at some point. You could find these top rooted keys, but if you find a node for which it, it's very leaves have mixtures, then that node is bad. A bad node. Like you say, you say, well, well, I can't do anything with this.

Ben Rady

14:50

Right. Right. And yeah.

Matt Godbolt

14:52

But yeah, to answer your earlier question, to do this incrementally, I think, I think you, I think, so. Effectively you given one long keystring name, and the results of what schema this is, you can of course create speculatively or look up if they already exist each of the nodes in the graph by splitting on the slash and then saying, okay, well let's find a node that's like called Bob, and then the one that called Ian, and then the one that called Tracy. And then here is the 01/04/03. And then there's, here's, you know, moo.csv, which is where I am. And if there's a node there, if you've just newly created a node, then definitionally, you'll say, well, at the moment, the set of all schemas that I've seen is what I've just passed this schema to be.

15:34

And if not, then, um, I, um, and this is obviously a leaf node, uh, rather a a, a leaf directory has no further directories under it. Um, you're, you're, you can write if you, if you're writing the second, um, directory, sorry, second schema for that node, you know, like you said that okay, this, this node has children, direct children

15:58

That are, stop this, this is silly now. Right. Any more things that, that land in this node, it's already poisoned, don't bother looking at it anymore. Yeah. Um, but if not, then, um, you've got, you've constructed the graph, except the only thing you haven't done is you, you haven't filled in the directories that only have directories as children's schema sets. But that's fine, because that's what the second part of the algorithm's gonna do. Once you finish going through it, the breadth first says, uh, I don't have to look at the data anymore. Uh, 'cause I already have the list of, of all the nodes, the di like the, the, the leaf nodes, and whether or not they are poison because they have a mixture, or they have exactly one schema type.

16:39

and then their parent, and you could potentially do this online as well, where you say, okay, every time I update a nodes schema set.

Ben Rady

16:47

That's what

Matt Godbolt

16:47

You can update his parents to say like, Hey, also by the way, this, this is the second

Ben Rady

16:51

Thing. Yeah. Like, I was thinking about this and I'm like, I could do this by literally building a tree data structure. Right. Like, like, you know, but I almost wonder if you could do this with like a hash map where when you process a single key, you say, I'm gonna create an entry in the hash map for, for this each directory and each Yes. And each of its parents. Right. If I ever, we were

Matt Godbolt

17:21

Doing is you're storing a tree in a hash map, it kind of implied, right?

Ben Rady

17:24

Yeah, yeah, yeah. But

Matt Godbolt

17:25

Where there's no explicit nodes that, that, like, you can go up and down because the way that you go up is to just like truncate your key and look at the hash entry for the bit that's like, yeah. So that would work as well. Right? Right. Obviously the entry were there would be like the set of schemas. Well,

Ben Rady

17:38

It's just the one schema, right. Because if you ever detect a conflict, you, you, well, you'd have to mark it as basically like no good. Right. So it's either the, the schema or a, a marker that says like this, this sub key is no good. Right.

Matt Godbolt

17:53

Right. Okay. But that wouldn't let you answer the question. Like in my original example, if you had the A and the B top level nodes, they would both, the route would be poisoned to say, no, I have a mixture. Which,

Ben Rady

18:04

But that's fine because you'd have an entry in the map for a, that was good, and you'd have an entry in the map for B that was good.

Matt Godbolt

18:09

Okay. But you can't, you, if you just have those entries. Yeah. Uh, how does you, how do you, right. What, so you're having filled in this hash map, how do you interrogate it? You need to know what the keys are, and you don't actually know what the keys are anymore.

Ben Rady

18:24

Uh, right. Uh,

Matt Godbolt

18:25

If you had the tree, you know what the node of the, the re of the tree is and you, and you know it from each node what children it had.

Ben Rady

18:30

Yeah.

Matt Godbolt

18:30

But you can't do that in the hash map without knowing the list of all those,

Ben Rady

18:34

I mean, maybe, maybe I'm missing something here, but my thought was like, okay, at the end of that sort of happy path example, I would have A as an entry and I would have B as an entry, and I would have an entry for each one of B's sub directories. Yeah. And each one of a's sub directories. Right? Right.

Matt Godbolt

18:53

So you've got a map, so in my example, you have A and B and all the, the, the, the dated directories and all those things underneath it. Yeah. So you've got a hash map with many, many thousands of entries where you were just going to walk through the hash map in the random order, the hash map's gonna give you out the, the, the nodes that are in that hash map. Yeah.

Ben Rady

19:09

You're just, and then from there, you look at the keys of the hash map and you find the, the shortest,

Matt Godbolt

19:13

Right.

Ben Rady

19:14

Like the highest level paths.

Matt Godbolt

19:16

Right. Right. Right. So effectively that's how you Yeah. That, that's the bit I was missing was the fact that I, it's not necessarily that straightforward to go from here's a hash map of all of the keys to what's the shortest key? Well, now I have to go over every single entry in the hash map.

Ben Rady

19:29

Yeah. It's like, it's like two

Matt Godbolt

19:31

An order

Ben Rady

19:31

N Yeah. It's like order N twice. It's, it's two N, right? Yeah.

Matt Godbolt

19:35

As opposed to order N and then, uh, uh, login effectively, or n log in for not even end log in. No, it's just log in. I think for having, if you built the tree as you were going along instead of the hash

Ben Rady

19:46

Map.

Matt Godbolt

19:47

Yeah. Yeah. Then you just walk at the top and you get like the answer, is it either the top node or that one of its second children? Either. I think if they're equivalent, um, it depends how many leaves you end up throwing into the hash map. I mean, I guess you don't have to store the, the actual files in the hash map. That's the trick trick here. You don't need to store their schema. You only aggregate 'em at the directory level.

Ben Rady

20:09

Mm-hmm.

Matt Godbolt

20:11

And so you are, you are talking about. Yeah.

Ben Rady

20:13

Yeah. And I think it would let you sort of abandon early where it's like when you check a key, you say, okay, is this, is this parent already in the map marked as bad? 'cause if it is, I don't need to inspect it. Right. Right.

Matt Godbolt

20:28

There you are, you turn an an order, uh, end operation into an N login.

Ben Rady

20:32

Yeah. Uh,

Matt Godbolt

20:33

I think no. Is that right? 'cause like for every N you have to then go, okay, there are potentially log n parents of me.

Ben Rady

20:41

Oh yeah. And I

Matt Godbolt

20:42

Have to check them as I go.

Ben Rady

20:43

Yeah. But I mean, anything that lets you, anything that lets you skip the inspection process is gonna be worthwhile here.

Matt Godbolt

20:49

Oh, but you, the, the inspection process would've happened when you populated the map. Yeah. The reading bit out of the end is, is the trivial part, right?

Ben Rady

20:56

Mm-hmm.

Matt Godbolt

20:57

So I think our algorithm then putative or otherwise is grab the big old list of full keys from from the S3 APIs streaming through mm-hmm. Construct or look up, as you say, in this hash map, have I done some subset of these things already? And maybe if you find it's poisoned already, you just skip it. Mm-hmm.

Ben Rady

21:36

. Yeah.

Matt Godbolt

21:37

But assuming that so far you haven't found, uh, a, a node that has at least two schemas already, you go, well, now I need to see whether or not this schema either matches the one schema that is there. Or if there is no schema there, I need to populate the entry with this schema. So now you know your cursed to go and do a long lookup and read the data and pass through it and whatever, chunk chunk, you come back with a new schema. And this, if, if there was nothing there, you say, well, this is the schema of this. And then you walk up the graph and you say, to all of the nos there, Hey, this is the schema. Or you check against the, the schema, uh, to make sure that it's the same schema as any of the parents. And as soon as you hit two, you know, well, this is poisoned. Now we don't need to do this anything under this point in the tree again.Which obviously means you're gonna run through the a p I kind of going, um, list, list, list list, list list. But very quickly you'll go, yeah, this is, I don't need to look at it. I don't need to look at the expensive part.

Ben Rady

22:31

mm-hmm.

Matt Godbolt

22:33

. Um, okay. So that's, that's the walking process. And then at the very end, you look over the whole map and you find that, and now this is like where I said it gets more tricky. You go through all the keys and you try and find the shortest key that has not been poisoned mm-hmm.

Ben Rady

22:52

Matt Godbolt

22:53

But then if you wanna find the set of the shortest keys that are like separable and not poisoned, it's harder to do that. I think from like just walking through the keys, because you have to essentially infer, again, the graph structure out of like breaking the key on a slash in order to say, Hey, is this, 'cause it's not even shortest. Right. You know, you could have two directories with longer names, one's got a longer name than the other, and you're like, it's not, it's

Ben Rady

23:13

The shortes, it's shorter. It's like number of levels. So

Matt Godbolt

23:16

I don't know if you're saving anything by having it in the hash rather than just, just accepting the tree is gonna be part of your, your life. It's not that difficult. And you can answer the question a little bit. Yeah.

Ben Rady

23:24

I mean, the only reason I'm leaning, I would lean toward a tree is 'cause it's like an out of the box data structure. And I, I need to do I think very little to, I mean, I guess the values are probably gonna be reasonably complex anyway. This is, that's like,

Matt Godbolt

23:38

I don't know that they are even Right. But, but that's that. Yeah. Um, the other thing that's, that's worth thinking about this, if we're just talking in general generalities, is that even scanning over that many files takes a long time. And if you were to do multiple parallel scans, one starting at a one, starting at B one starting at C or whatever the usual tricks that one does to scan over S3s like list function faster.

Ben Rady

24:09

mm-hmm.

Matt Godbolt

24:09

Then I think both of these approaches still work. You might, if you have multiple threads, and if you're thread safe, like data structure in or map mm-hmm.

24:30

This point in the node is passed. But like, as long as you can handle that situation, you only ever do N pieces of extra work in the worst case where you have end threads, right? But on the, in the good case you are, you are mostly able to to to, um, uh, yeah. Parallelize on that. Although, to be honest, probably just scanning through is fast enough and then you, then you're gonna be stuck doing the actual parsing and that can be multithread. 'cause you just have a work queue and say, Hey, here's some things I need you to do. Go and get when you come back to me with the results, then I can, I can fill in the tree. Yeah. Yeah.

Ben Rady

25:04

Yeah. We were talking the other day actually about my efforts to, uh, scale this system horizontally. And that is one of the prime use cases of that work is being able to like, basically delegate work out on a per key basis. Like, Hey, could you scan the schema for the CSV file for me? It's probably gonna take you like three minutes, but, you know

Matt Godbolt

25:24

I see that's

Ben Rady

25:24

Separate that out and then, you know, we'll do that work that way.

Matt Godbolt

25:27

That's, that's perfect. Um, yeah. What a fun, fun question.

Ben Rady

25:31

I, I was gonna say, so, because this is two compliment, the other thing that we can talk about here is the test cases,

Matt Godbolt

25:36

of course

Ben Rady

25:38

So like what tests, What tests would you write for this? And more importantly, maybe what order would you write those tests in? Um, because like the most basic test, if I was gonna sit, I, I am about to sit down and read. So

Matt Godbolt

25:52

, right? This literally I interrupted you from doing this. Yeah,

Ben Rady

25:53

Yeah. Um, but the, the test I'm probably gonna start with is if there are no files in the bucket, return me an empty list.

Matt Godbolt

26:00

Right? Right. This is sort of a zombie approach, zombie's approach there. Zero one many.

Ben Rady

26:05

Mini. Yeah, exactly.

Matt Godbolt

26:06

I forget B is

Ben Rady

26:07

We can do this. B boundary

Matt Godbolt

26:11

S Yeah. B boundaries interface.

Ben Rady

26:13

Uh, I'm just making words

Matt Godbolt

26:14

Yeah. Anyway, remember

Ben Rady

26:16

The

Matt Godbolt

26:16

Trick is the Z O M part is, you know, test with nothing. Test with one test with a few of them. Yeah.

Ben Rady

26:23

Yeah, yeah.

Matt Godbolt

26:24

So yeah, the first thing you do is say, Hey, empty directory.

Ben Rady

26:27

Ah, so yeah. So like, you know, should return and empty, I'm just typing this out list if there are no files, right. Uh, and that's easy enough to write. I just make an empty bucket. Um, using my fancy fake, my fake fake. But real in memory bucket, it's not fake. It's a real thing.

Matt Godbolt

26:47

Are you actually, are you actually creating anything for an empty directory, though? I know you can. In S3, you can create a, essentially an empty file that is the file that ends with the slash that causes it to be kind of, oh, I didn't even

Ben Rady

26:59

Know you could do that. That's terrible.

Matt Godbolt

27:01

That's what I think. That's what happens if, if you're like in the S3 api, sorry, if you're in the a w s console and you make a directory and you then see this directory that's empty. Because otherwise if you make a di an empty directory Yeah. The directory doesn't exist. It's like, you know, it's like that thing in source control where you can try and create a directory and you check it and it's like, no, the file doesn't, there's nothing there. 'cause there's nothing. We're tracking our files not directories.

Ben Rady

27:21

Right, right, right, right. So

Matt Godbolt

27:22

I don't even know how you'd write that test

Ben Rady

27:24

. Well, I mean, I could just make an empty bucket. Right, right.

Matt Godbolt

27:26

Okay. So that's like a just completely empty bucket. Yeah.

Ben Rady

27:30

Mm-hmm.

Matt Godbolt

27:31

And then assert that it returns nothing.

Ben Rady

27:33

Yeah. And then assert that I get back an empty list. And then of course the next obvious one is I'm gonna make one file and I'm gonna make sure that, you know, like it, I don't even, so here's an interesting thing here mm-hmm.

Matt Godbolt

27:51

Into the way that I'm doing it, aren't you,

Ben Rady

27:58

Rookie, the rookie mistake when writing that kind of test would be to, um, basically unnecessarily test through into the schema inspection algorithm. It, the schema inspection algorithm is already written. There's already tests for that. It, and through, you know that it works, right.

Matt Godbolt

28:16

Transitivity of like, as long as you call the schema inspection routine with the right path.

Ben Rady

28:22

Right. Right.

Matt Godbolt

28:23

Trust that it works. Right. And so don't care about that bit. Don't, yes. Yeah. All right. That, so I pass the rookie test.

Ben Rady

28:28

I'm gonna write the test. that's like, that's like, I'm gonna write a file into this thing, and then I'm gonna have it run the, the whole scanning inspection algorithm, and I should get back one schema. I'm gonna, these are the things I'm going to assert and these are the things I'm not going to assert. I'm going to assert that I got one. Yep. I'm gonna assert that the path into the file is what I expected. Yep. I'm not gonna assert the number of columns. I'm not gonna assert the types and names of those columns. Right. Right. Because that behavior's already tested now,

Matt Godbolt

29:01

So that's interesting. I would've mock that out. Right. I dunno if that's where you're going with this. No, I would've said like, okay, I'm gonna completely and utterly I would, so with my C++ mindset here. I would pass in the like, um, and something which implements the interface of test this path. And then I would give it a fake one and say, Hey, yeah, yeah. All I'm gonna do is I'm gonna record that you said that you wanted me to check this schema, and then I'm gonna return you back the schema, Bob, or whatever, anything. Because at that level, I don't care what you're doing with that information because I'm already trusting that it's gonna work. Whereas you are actually talking potentially about calling the schema code, but deliberately not looking at the results of it because you don't care about the results of it. You trust that they work so equivalent. Um, and you don't have to fake anything out, and you don't have to have an interface or anything, and you are already faking out like the S3 at like a very S3-y level.

Ben Rady

29:49

Level. Yeah, yeah. Right. But

Matt Godbolt

29:50

Yeah. Okay.

Ben Rady

29:52

So yeah. And that's, so that is, that is a very sort of interesting point of debate and I think there's lots of people that have different opinions on that. And I used to be more in the camp of like, I'm gonna fake out the schema inspection algorithm. And these days, the way I think about it, it is like, if I can write this so that the test is fast and the thing that I'm mocking out, you know, or I would consider mocking out, the real one doesn't have any outside interactions. It runs quickly. It does the thing, uh, the, the phrase that I've heard a lot of through this is a sociable test or testing through is another way that people describe this. Okay. Yeah. But it's basically like some number of hops, one or two, or maybe three three's a little much, but one or two hops down, uh, into the sort of, you know, uh, dependency graph of, of the thing that you're testing generally fine. So long as you don't make any so long as it has, doesn't have any other bad things, it's not making it unreliable, or anything like that.

Matt Godbolt

30:47

Right. And to be clear, the, the sort of like the, the, the fake-able interface you're passing around here for S3 means that it's not like you are making TCP connections. Exactly. You don't have to and all that stuff. So you've already kind of said, okay, like that's, that's the boundary that I'm prepared to draw. Right. And if I go one or two steps away from that sort of Law of Demeter-y breaking-ly mm-hmm. Then that's okay. I'm testing more of my system, but I'm also not going to rely on it because that is not the purpose of these tests. Exactly. And furthermore, if I break my schema detection system in some subtle way, that doesn't matter for these tests, I don't want these tests to unnecessarily fail.

Ben Rady

31:23

Exactly. Exactly. You don't ever want the scanner, the sort of the, you know, tree walker scanner code tests to fail because like, oh yeah. Um, this csv file doesn't have a comma in the right place or something like that. Right. Like, that's just, that's gonna be a not informative test. You're gonna be very confused as to how you possibly broke that. Right. Right. Um, so yeah, so it's the, the, okay, the, the assertions that you pick here, basically you're gonna determine how much, uh, you're gonna be able to use these tests to refactor versus how much you're basically just pouring concrete around your code

Matt Godbolt

32:16

Wow.

Ben Rady

32:17

It, it's

Matt Godbolt

32:18

Really very emotional. I know it's,

Ben Rady

32:20

Listen,

Matt Godbolt

32:20

You can't see him, but he's almost crying here,

Ben Rady

32:24

I hear that and I'm like, you don't understand what any of those words mean. Any of them. None of the words in that sentence were correct. Um, the, the, the, the trick here is like when you write, when you overspecify, right? And, and it sort of like comes almost from a good place where you're just like, I wanna make sure my code works and I'm gonna write lots of good tests and I'm gonna have lots of assertions and all these other things. And so, but the problem is when you overspecify you are, you're not sort of doing the core job of what the tests are there for the, the tests are therefore to assert that you have the behavior that you want, but they should be agnostic about the behavior that at that level of testing, you don't care about,

Matt Godbolt

33:05

You don't care about, right?

Ben Rady

33:05

Right. And this is exactly that. Right? This tree walker scanner thing shouldn't care at all about whether or not the schema inspector is doing its job correctly. And if you overspecify those tests, you're gonna make it harder when you come back later and you're like, oh, we need to change the way that the schema inspector works. Right? And you go and you change it, and then you have all of these failing unrelated tests and you're like, why are these tests failing? Well, because you overspecified, right? Yeah. So, um, the, the, the dial here that you have to turn and, and you really want to get it in exactly the right spot if you can, is to specify the things that you care about for this code that is under test and not overspecify this thing if you're gonna do this sort of sociable style. Now, the other alternative here is what you were just saying, where it's like, no, I'm just gonna mock that out so you can change the schema inspector all you want. It's not gonna break these tests at all. And I have a hard guarantee that that's not true, and there's lots of situations in which you might actually wanna do that. Um, but the, the, what I'm about to do, um, embark on here in the next, you know, 20 minutes and we're done with this 20 minutes.

34:10

Is trying to set that dial exactly where I want it so that I have the confidence that the code that is un, that I'm writing right now works without having over specified into this thing that I already know works that I already wrote tests for. Yeah. That I've actually already used in other contexts. Yeah. Yeah. And I'm not coupling those things together.

Matt Godbolt

34:30

So, um, what one file, so the, when you said that, or before you went off on this, this ta not even tangent, this is a very relevant thing. It's

Ben Rady

34:37

A major soapbox moment here.

Matt Godbolt

34:38

Yeah. It was though, and I'm glad. You look very happy.

Ben Rady

34:41

Get off my lawn you stupid kids.

Matt Godbolt

34:43

But, um, the thing that immediately sprung to mind after like one file is one file with varying number of sub directories mm-hmm. In the root, and then 12 different things, because again, that's like a zero one many kind of thing where it's like, well, everybody knows the edges, the boundaries. That's the b part of zombies, isn't it? Yeah. Boundaries. Like, well, if, if it's in a subdirectory, that makes sense. I'm sure I've tested that, but like who thinks to put one of these files directly in the root, right? There's probably not what you're ever planning on doing. It'd probably be a mistake to do it, but like that is where I'm, my instinct tells me a body will be buried there and you'll, you'll find something interesting there. Um, just like, you know, when having a file that's empty, which probably doesn't affect this in this particular case because you've already tested your, uh, schema tracker. But like, that would be another thing I'd think of as, oh yeah, someone made a file, but it's not got anything in it. That's kind of like an, an edge. Well, that would be an edge case for your, for your schema validator.

Ben Rady

35:39

Yeah. Yeah. And so the interesting thing about that test is, uh, I was having a conversation about this with our mutual friend, Mr. Pat Farley yesterday actually, Patrick Farley, um, uh, of the, of the London Farleys of the London Farleys, um, uh, talking about how it is possible, especially if I, um, if I, if I'm not sort of like lazy enough

Matt Godbolt

36:27

Would hope so if you did it right, but

Ben Rady

36:28

Well, maybe, so that's, that's the thing is, is if what I did is I took a step too far when I was making the implementation and I basically wrote untested code, right? So I have this test for should handle one file, and I go in there and I write, I don't write the most basic thing that I can think of. I don't write a simple thing. I write like, oh, well this is what this algorithm needs to be, so I'm just gonna type it all out. Right? What I've done is I've written untested code because almost certainly I could change some of that code or delete it or simplify it, and my test would still pass. Right? Um, and I might notice that if I go and I write, that should handle varying sub directories and it passes. And the question that you could ask yourself at that point is, have I written untested code before and now I've tested it, and now I probably wanna make sure that the test, the new test that I wrote actually covers those cases, and I might actually go in and like mutate that code and, and break it and make sure that it fails.

37:26

Or is that test actually unnecessary? Obviously that case is something we need to handle, but if that test doesn't drive out any new code, do I need it? Or maybe more realistically, do I need the first one that I wrote have, can I now delete the should handle one file? Because it's impossible to have an implementation that handles varying sub directories that doesn't also handle one file.

Matt Godbolt

37:50

Right? Right. So that's a really interesting question because there's, there's, there's like the, the code that I have just written, the implementation code that I know that I wrote in a particular way, and I'm testing that with knowledge, like very much of like how I'm driving out, you know, if you were, we're going for a hundred percent code coverage or that kind of nonsense that we don't really put much faith in, but you're like, oh, I have to handle this case because in my implementation I treat them differently or I don't treat them differently. And that would either drive these things or like, oh, I, I didn't get a hundred percent coverage. Because the if statement that deals with the, oh, if it's in the root, it doesn't get covered. So there's that kind of part, but there's sort of like a moral feeling of like, well, I have been doing this for 20 years and I know where the kind of things where if I make a change to the implementation, I bet you I break this because of that, and then that's why those are more speculative tests. Yeah. And maybe you have a stronger opinion about that because of course, reasonable people could disagree about like, well, no one will ever do that thing. Right. Right. You're like, well, what if they do, or what if they change the implementation in a relatively straightforward way that might take advantage of either knowing X or Y? Right.

Ben Rady

38:56

Exactly.

Matt Godbolt

38:56

Yeah. Oh, it's the one file and it's all those more than one file kind of thing. Yep,

Ben Rady

38:59

Yep, yep. And that is, so that is just a judgment call

Matt Godbolt

39:01

Where it's an art form, isn't it? Right. Exactly. Exactly. This is art, not, not science.

Ben Rady

39:05

It's 'cause 'cause you're, you're speculating on what the behavior of future humans will be. That's basically what it is. Right. Like, I'm gonna add these tests here, not because they drive out any code. Right. I could, I could, uh, you know, delete these tests and I would still have all my code very well thoroughly, the existing code tested very well, well thoroughly tested, but I'm adding these tests because I'm scared of future people, maybe future Ben, uh, who had some comments left for him, uh, coming in and changing this code in what would be a reasonable way. But breaking something that wouldn't then be covered, right? Um, and that, like you said, it's an art form, it just comes from experience really. It's just comes from making mistakes on that decision.

Matt Godbolt

39:48

Yeah, and going oh,

Ben Rady

39:51

Then going, ah, I should have done it. That would've been right.

Matt Godbolt

39:51

I mean, I say like, you know, you're writing a string library Yeah. And you're testing, you know, adding two strings together should return, you know, the strings, blah, blah. And you just know yourself that like, oh, I, I've realized that, uh, the empty string, I should probably make sure that I test adding an empty string to a string. So I'm speaking from experience here, this kind of stuff. And like, uh, you know, well, how could you implement string concatenation where you break the empty string concatenation? And the answer is, well, if you're writing an optimized version in assembly, I guarantee

40:29

And so that's the kind of thing where like you are definitely locking in, you're hedging for the future of like, well, this is this, this is like, uh, uh, the kind of mistake I can see myself making down, down the line. And, uh, and or it's, it's almost part of the documentation at some level as well of like, well, these things should work. Yeah. You know, the, the interface allows for you to do this. It's surprising, but it does allow for you to do this. And maybe I should just leave this here as like a, an a boundary condition of the interface to say, this is fine.

Ben Rady

40:59

Yeah. And one of the other things that you need to consider with that in addition to sort of guarding against future mistakes is the thing that we were talking about earlier about like, writing those kinds of tests can also lead you to accidentally overspecify.

Matt Godbolt

41:13

Right? Right. Yeah.

Ben Rady

41:15

Like, like you're, you're testing things that are either behavior that you don't necessarily really care about, um, or testing through into other dependent modules or classes or functions.

Matt Godbolt

41:25

That's a really fascinating one actually. The testing things you don't care about. I have definitely been guilty of that before where I have thought to myself, I really wanna make sure that this thing works, but I don't really know what the answer should be. Right. And by writing a test, I've kind of mandated that that is what the answer needs to always be.

Ben Rady

41:40

Exactly. Exactly.

Matt Godbolt

41:41

And maybe I didn't wanna do that. Right.

Ben Rady

41:43

Right,

Matt Godbolt

41:43

Right, right. But then I also don't wanna have a path through my code that I, you know, but then that maybe that's a, a question for like, well, fix the path through your code, throw an exception instead, you know? Yeah, yeah, yeah. I don't know what do something else to say like, Hey, you're in unchartered, uncharted territory here, don't do this.

Ben Rady

41:57

Right, right. Right. I mean, like, one of like an obviously wrong example of that, like bounding this problem a little bit, like an obviously wrong example of that would be like having something that, that produces a set and then you make an assertion about the order. Right?

Matt Godbolt

42:11

Yep.

Ben Rady

42:11

Right. It's like you don't care about the behavior, you don't care about the order of the set. Why are you writing a test for it? It's like, well, because I wanted to know what it is, right? Yeah. Or I want it to specified, or what is it? It's like, no, like we are intentionally leaving that unspecified and that has value, the fact that it's unspecified has value, right? Yeah. Yeah. Yeah. It allows us to change our mind later. So like, getting that just right, it's

Matt Godbolt

42:31

The hash map thing you talked about. Right, exactly. Is a perfect example of that. Like when you write the code, the test that's gonna iterate through the hash map and, and then build its world from the hash map entries in the order that they just happen to landed in the hash, then you're definitely way into the world of like, well, I don't even know what sequence these things are gonna happen in, so I can't be too critical of, of the Yeah, that's, yeah. Interesting. Well, you better go and write this code now. We've been talking about it for

Ben Rady

42:58

I gotta get back, back to work.

Matt Godbolt

43:01

You gotta actually go back. Yeah. I was gonna crack the whiff on you

Ben Rady

43:11

I, I, I like that idea a lot. Cool.

Matt Godbolt

43:12

Well, we'll, uh, hear about it next time, perhaps.

Ben Rady

43:16

All right. Next time.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript