Practical Binary Analysis: Build Your Own Linux Tools for Binary Instrumentation, Analysis, and Disassembly

Speaker 1

00:00

All right, let's dive into binary analysis. You're interested in getting down to the nitty-gritty of how programs work at a low level. We've got some PDF excerpts and code examples from Practical Binary Analysis to guide us.

Speaker 2

00:13

Sounds like a plan. That's an excellent resource for this kind of deep dive.

Speaker 1

00:18

So no more comfy high-level languages, right? We're going straight to the bare metal, machine code. The PDF mentions that binary analysis is a bit of a challenge, especially since we have to deal with code without type information. What exactly does that mean in practice?

Speaker 2

00:33

Well, think of it like this. Imagine trying to follow a recipe, but instead of ingredient names, you only have quantities like two of this, one of that, but no clue if it's flour, sugar, or eggs.

Speaker 1

00:43

Huh. So it's a bit of a guessing game.

Speaker 2

00:46

Exactly. Compilers strip away all those helpful labels, the types, to create the most efficient machine code possible. So our job is to figure out what those cryptic values represent.

Speaker 1

00:55

That sounds like a puzzle. The PDF also mentions that code and data can get all mixed up in the binary.

Speaker 2

01:00

Yeah, that can get pretty confusing. Compilers are all about optimization, so they sometimes mix up the instructions and data to make things run faster. It's not like the neat and organized code we see in high-level languages.

Speaker 1

01:12

And on top of that, there's the issue of location dependence.

Speaker 2

01:15

Oh, right. Every single instruction and piece of data has a specific address in the compiled binary, so if you shift things around even a tiny bit, those addresses become invalid, which can lead to all sorts of problems.

Speaker 1

01:27

Like the whole program crashing.

Speaker 2

01:29

Yeah, crashes, unexpected behavior, you name it. It's like a delicate Jenga tower. One wrong move and everything falls apart.

Speaker 1

01:37

Okay, so we've got cryptic values, mixed-up code, and a super fragile structure to deal with. This is where it starts to get interesting. The PDF focuses on x86 assembly, which it describes as complex but good practice. What makes x86 so tricky?

Speaker 2

01:51

Well, x86 has been around for ages, and its instruction set has grown pretty complex over the years. It's got all sorts of instructions with varying lengths, and these complicated ways of accessing data. Some instructions even overlap. Take rep movsb, for example. Just one instruction can move huge chunks of data, which is why malware authors love it so much. Makes their code much harder to analyze, you see.

Speaker 1

02:15

So it's a bit of a beast to master.

Speaker 2

02:17

You could say that. Mastering x86 assembly is like learning to play a really complex musical instrument with all sorts of levers and buttons to figure out. But you know, it's also incredibly rewarding once you get the hang of it.

Speaker 1

02:28

Speaking of complex, there's also the whole thing with different assembly syntaxes, Intel and AT&T. I've got to admit, they look like secret codes.

Speaker 2

02:36

Huh, yeah, it can seem that way. Think of them as two different dialects of the same language.

Speaker 1

02:40

Okay, that makes sense.

Speaker 2

02:41

Intel syntax is definitely more common, and it's usually considered more readable. That's why the PDF sticks with it. But don't worry, you don't need to be fluent in assembly to grasp the main concepts.

Speaker 1

02:52

That's a relief to hear. So the PDF walks us through the compilation process using the classic Hello World example. There's preprocessing, compilation, assembly, and linking. It's quite a journey. What's going on at each stage?

Speaker 2

03:08

Right. Each stage refines the code, getting it closer to what the computer can actually understand. It's like an assembly line for code.

Speaker 1

03:15

I like that analogy.

Speaker 2

03:16

So first we have preprocessing. It's like prepping all your ingredients before you start cooking, you know, substituting those placeholder values with the real deal.

Speaker 1

03:24

Okay, so it's about getting everything ready.

Speaker 2

03:27

Exactly. Then comes compilation. This is where the magic happens. It's like translating a recipe into precise instructions for the CPU, like, okay, heat the oven to 350, mix in two cups of flour, et cetera.

Speaker 1

03:38

It's the step-by-step guide for the computer.

Speaker 2

03:40

You got it. Assembly comes next. This stage converts those instructions into the actual binary language that the computer speaks. Think of it like converting a recipe from English into, say, Japanese. Finally, linking brings everything together, including external libraries, and creates the final executable file.

Speaker 1

03:59

So the executable is the finished product, all assembled and ready to run. The PDF mentions symbols, comparing them to a table of contents for the binary. Why are these symbols so important?

Speaker 2

04:10

Well, symbols are like the bridge between the human-readable world of programming and the machine's world of memory addresses. They're like little signposts that point to specific locations in the binary, helping us make sense of the disassembled code. Imagine trying to navigate a city with just street addresses and no street names. It would be a nightmare.
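
To make the signpost idea concrete, here is a minimal Python sketch of address-to-symbol lookup. The symbol table, its addresses, and the function name symbolize are made up for illustration, and the sketch ignores symbol sizes, so any address past the last entry is attributed to the last function.

```python
import bisect

# Hypothetical symbol table: (start address, name), sorted by address.
SYMBOLS = [(0x401000, "_start"), (0x401126, "main"), (0x401150, "helper")]


def symbolize(addr: int) -> str:
    """Map a raw address to 'name+offset' using the nearest symbol below it."""
    starts = [start for start, _ in SYMBOLS]
    index = bisect.bisect_right(starts, addr) - 1
    if index < 0:
        # No symbol starts at or below this address: fall back to the raw value.
        return hex(addr)
    start, name = SYMBOLS[index]
    return f"{name}+{addr - start:#x}"


print(symbolize(0x40112A))  # an address four bytes into main
print(symbolize(0x1000))    # an address with no matching symbol
```

In a stripped binary the table is simply gone, which is why disassembly of stripped code shows bare addresses instead of names.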

Speaker 1

04:30

So symbols help us decode what's happening in the binary.

Speaker 2

04:33

Exactly. They tell us which parts of the code correspond to which variables and functions, making the whole thing much easier to understand. The PDF then dives into the difference between object files and executables. I think Listings 1-8 and 1-10 illustrate this pretty well.

Speaker 1

04:47

Yes, what's the difference between those two?

Speaker 2

04:49

Think of object files like pieces of a puzzle. They contain compiled code but can't run on their own. An executable, on the other hand, is the complete puzzle, ready to be loaded and run by the operating system.

Speaker 1

05:01

So the executable is the final product, the one we actually run on our computer. Figure 1-2 shows how the operating system loads an executable into memory. But it's not just a simple copy-paste operation, is it?

Speaker 2

05:12

Nope, not quite. It's more like packing a suitcase strategically. You don't just throw things in randomly, right? You want to make sure everything fits and is easy to find when you need it. The operating system does something similar. It arranges code and data segments in memory to make everything run smoothly and efficiently.

Speaker 1

05:28

So it's about organizing things in a way that makes sense for the computer. Now, let's talk about file formats. The PDF mentions ELF and PE, which sound like they're the language barriers between different operating systems.

Speaker 2

05:41

You got it. ELF, which stands for Executable and Linkable Format, is the standard format for Linux systems. PE, which is short for Portable Executable, is used by Windows. They both essentially package up the code and data for execution, but with different structures and conventions.

Speaker 1

05:57

Ah, so it's like different operating systems speak different binary languages. The PDF delves into quite a bit of detail about these formats, referencing Listings 2-2, 2-5, and 2-11. What are the key differences between ELF and PE from a practical standpoint?

Speaker 2

06:12

ELF is known for its flexibility and standardized sections, making it relatively easy to analyze. It's like a well-organized library with clear labels on all the shelves. PE, on the other hand, is more tailored for Windows features. It's a bit more complex, almost like a sprawling mansion with hidden rooms and secret passageways.

Speaker 1

06:31

So ELF is the more straightforward one.

Speaker 2

06:33

You could say that. It's definitely friendlier for analysis. Now, there's this concept of lazy binding that I found pretty interesting. Is it like procrastination for programs? Hmm, not quite procrastination, but more like just-in-time efficiency. Lazy binding means a program doesn't resolve references to external libraries until it actually needs them. It's like looking up a phone number only when you're about to make the call, instead of searching through the whole directory beforehand.

Speaker 1

06:59

Ah, so it's about saving time and resources. The PDF mentions PLT and GOT in this context, referencing Listing 2-7. What do those acronyms stand for, and what roles do they play in this lazy loading scheme?

Speaker 2

07:13

Right. So, PLT stands for Procedure Linkage Table. It's basically a table of placeholders for function calls to external libraries. Each placeholder is like a note saying, hey, we'll need to figure out where this function actually lives later. GOT, or Global Offset Table, is like a directory that eventually gets filled with the actual addresses of those functions once they're needed.

Speaker 1

07:35

So the PLT points to the GOT, and the GOT eventually points to the actual functions. Sounds a bit roundabout.

Speaker 2

07:42

It might seem that way, but it's all about efficiency. When the program first calls a function from an external library, it hits the corresponding PLT placeholder. This triggers a search party led by something called the dynamic linker, which finds the actual address of that function and updates the GOT accordingly. From that point on, any calls to that function go directly to the correct address, skipping the whole lookup process.
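
The resolve-once behavior described here can be sketched in a few lines of Python. This is a simplified model, not how an ELF loader is implemented: the names LIBRARY, got, dynamic_linker, and plt_call are illustrative, and in a real binary the GOT entry starts out pointing back into the PLT stub rather than being absent.

```python
# Hypothetical "shared library": function name -> implementation.
LIBRARY = {"puts": lambda text: len(text)}

got = {}             # stand-in for the GOT: filled with real targets on first use
resolver_calls = []  # records every lookup done by the stand-in dynamic linker


def dynamic_linker(name):
    """Find the real function; this is the expensive one-time lookup."""
    resolver_calls.append(name)
    return LIBRARY[name]


def plt_call(name, *args):
    """Stand-in for a PLT stub: go through the GOT, resolving on first use."""
    if name not in got:
        got[name] = dynamic_linker(name)
    return got[name](*args)


plt_call("puts", "hello")
plt_call("puts", "again")
print(resolver_calls)  # the lookup ran only for the first call
```

The second call finds the entry already filled in and never reaches the resolver, which is the whole point of the indirection.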

Speaker 1

08:03

So it's a one-time setup for efficiency. Clever. Now, let's shift gears and talk about the tools of the trade. Chapter five in the PDF introduces a whole toolbox of essential binary analysis tools. It's time to gear up and become digital detectives.

Speaker 2

08:17

Definitely. Each tool gives us a different lens to view the binary through, helping us extract all sorts of valuable information.

Speaker 1

08:25

It's like we're gearing up to be code detectives with all these tools at our disposal. The PDF starts with strings. What's that all about?

Speaker 2

08:32

strings? That's our first line of recon. Imagine sifting through mountains of binary data, just looking for anything, any snippet that looks like readable text. That's what strings does. It extracts any sequence of printable characters, giving us clues about what the program might be doing.
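
As a rough illustration of that idea, here is a minimal Python sketch that pulls runs of printable ASCII out of a byte string. It is not the strings utility itself: the function name extract_strings, the sample blob, and the choice of a four-character minimum are assumptions made for the example.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return every run of printable ASCII that is at least min_len bytes long."""
    # Printable ASCII runs from 0x20 (space) through 0x7e (tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Hypothetical blob: non-printable bytes surrounding two readable strings.
blob = b"\x7fELF\x02\x01\x00\x00/lib64/ld-linux-x86-64.so.2\x00\x01\x02puts\x00\xff"
print(extract_strings(blob))
```

Note that "ELF" is skipped because it is only three characters long, which shows why the minimum length matters: a lower threshold finds more text but also more noise.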

Speaker 1

08:48

So it's like searching for those needles in a haystack, except the needles are words hidden within the code. Exactly. What about xxd? What kind of insights can we get from that?

Speaker 2

08:57

Think of xxd as our magnifying glass to zoom in and examine the raw bytes of the program. Instead of just ones and zeros, it displays the bytes in hexadecimal.

Speaker 1

09:06

So it's a bit easier to digest than just a wall of ones and zeros.

Speaker 2

09:09

Much easier. Still a bit cryptic, but hey, at least it's something.
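
For a feel of what a byte-level view looks like, here is a small Python sketch of a hex dump. The layout (offset, hex bytes, ASCII column) is only in the spirit of xxd and does not reproduce its exact output format; the function name hexdump is an assumption for the example.

```python
def hexdump(data: bytes, width: int = 16) -> list[str]:
    """Format bytes as lines of: offset, hex values, printable-ASCII column."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Non-printable bytes are shown as dots in the ASCII column.
        ascii_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        # Pad the hex column so the ASCII column lines up on short rows.
        lines.append(f"{offset:08x}: {hex_part:<{width * 3 - 1}}  {ascii_part}")
    return lines


for line in hexdump(b"\x7fELF\x02\x01\x01\x00hello, binary world"):
    print(line)
```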

Speaker 1

09:13

So with xxd, we're getting a byte-level view of the program. What if we want to understand a program's social network, so to speak? The PDF talks about ldd for that.

Speaker 2

09:23

Right. ldd helps us map out the program's dependencies. It tells us which shared libraries the program relies on. That can give us a good idea of what the program does and how it functions.

Speaker 1

09:35

So like checking someone's social media to see who they hang out with.

Speaker 2

09:38

Pretty much. Gives you a sense of their interests and activities.

Speaker 1

09:41

Right, definitely. The PDF then introduces these tools, strace and ltrace, which sound like we're getting into some serious digital surveillance. What do those tools do?

Speaker 2

09:50

strace? Well, it's kind of like putting a suspect under surveillance. It intercepts and logs all the system calls made by the program, so we can see how the program is interacting with the operating system, what files it opens, any network connections it makes, all that juicy stuff.

Speaker 1

10:04

So we can spy on what the program is doing.

Speaker 2

10:06

Exactly. And then there's ltrace. It takes things a step further. It zeros in on the program's interactions with external libraries. We can see which functions the program calls and even what parameters it uses.

Speaker 1

10:19

So strace is for external affairs and ltrace is for those internal conversations.

Speaker 2

10:24

Huh, that's a good way to put it.

Speaker 1

10:25

These tools sound incredibly powerful, but are there any limitations we should be aware of?

Speaker 2

10:31

Of course. It's important to remember that these are dynamic analysis tools, so they're observing the program as it's running. This means they can only show us what the program does during a specific run, not everything it's capable of doing. Think of it like observing someone's daily routine. You might learn their habits, but you wouldn't know everything they're capable of, right?

Speaker 1

10:51

Right, we're only seeing a snapshot, not the whole picture. The PDF hints at some more advanced techniques for deeper analysis.

Speaker 2

10:58

Yeah, there are techniques like symbolic execution and data flow analysis that can help us explore a program's potential behavior more thoroughly.

Speaker 1

11:06

Well, that sounds exciting. Speaking of deeper analysis, Chapter six introduces the concept of disassembly, comparing static and dynamic approaches. It's like the difference between studying a musical score and watching a live performance, isn't it?

Speaker 2

11:19

That's a great analogy. Static disassembly, using tools like objdump, is all about analyzing the code without actually running it, like studying sheet music to understand the melody and structure.

Speaker 1

11:29

And dynamic disassembly?

Speaker 2

11:31

Dynamic disassembly, often using debuggers like GDB, lets us observe the instructions as they execute. It's like watching the musicians bring the music to life.

Speaker 1

11:41

So static is the blueprint and dynamic is the performance in action.

Speaker 2

11:45

Precisely. Each approach has its own strengths and weaknesses. Static disassembly gives you a complete view of the code, but it can be fooled by those sneaky obfuscation techniques.

Speaker 1

11:54

Right. Those are designed to throw analysts off track, right?

Speaker 2

11:58

Exactly. Dynamic disassembly is more accurate, but you only see the code that actually gets executed during a specific run.

Speaker 1

12:04

Makes sense. So both techniques have their own place in the binary analyst's toolbox. Now, Chapter seven throws us right into the world of bug hunting. Specifically, those sneaky off-by-one errors. Those always seem to cause trouble.

Speaker 2

12:19

Off-by-one errors? Those are the subtle coding mistakes that can have some major consequences. They often happen when a loop iterates one time too many, or when a program tries to access memory just outside the boundaries of an array.
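
A loop that runs one time too many is easy to show. The following Python sketch models memory as a flat bytearray, with a 4-byte buffer followed directly by a neighboring value; the function names and layout are hypothetical and are not taken from the book's example program.

```python
def fill_buggy(memory: bytearray, start: int, size: int, value: int) -> None:
    """Fill a buffer, but with an off-by-one bound (like 'i <= size' in C)."""
    for i in range(size + 1):  # bug: runs size + 1 times
        memory[start + i] = value


def fill_fixed(memory: bytearray, start: int, size: int, value: int) -> None:
    """Fill a buffer with the correct bound (like 'i < size' in C)."""
    for i in range(size):
        memory[start + i] = value


# Layout: bytes 0-3 are the buffer, byte 4 belongs to a neighboring variable.
buggy_memory = bytearray(b"\x00\x00\x00\x00\x7f")
fixed_memory = bytearray(b"\x00\x00\x00\x00\x7f")
fill_buggy(buggy_memory, 0, 4, 0x41)
fill_fixed(fixed_memory, 0, 4, 0x41)
print(buggy_memory[4], fixed_memory[4])  # the neighbor survives only in the fixed version
```

The fix is a one-character change to the loop bound, which is why this kind of bug can often be patched directly in the binary.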

Speaker 1

12:32

It's like accidentally stepping on a crack in the sidewalk, a small mistake that can lead to a stumble.

Speaker 2

12:36

And in software, those stumbles can create security vulnerabilities that attackers can exploit. The PDF shows how we can spot these errors and even surgically remove them using a tool like hexedit.

Speaker 1

12:46

So it's like we're performing surgery on the code, removing those nasty bugs.

Speaker 2

12:50

Exactly. Now, speaking of vulnerabilities, the PDF then dives into the notorious heap overflows, using a program called heapoverflow.c as an example.

Speaker 1

12:59

Sounds pretty scary. Why are heap overflows so dangerous?

Speaker 2

13:02

Heap overflows happen when a program tries to write more data than it should into a section of memory called the heap. Think of it like overpacking a suitcase. Eventually something is going to burst.

Speaker 1

13:13

Makes sense.

Speaker 2

13:14

In the case of heap overflows, that bursting can overwrite important data or, even worse, allow attackers to inject their own malicious code and take control.

Speaker 1

13:23

That sounds like a recipe for disaster. How can we prevent these overflows from happening?

Speaker 2

13:27

Well, one way is to use tools that can detect and prevent them. The PDF showcases a cool method using LD_PRELOAD to inject a special library called heapcheck.so. This library acts like a safety net, monitoring memory allocations and raising a red flag if something fishy is going on.

Speaker 1

13:44

So heapcheck.so is like a watchdog for memory operations.

Speaker 2

13:47

Precisely. It makes sure everything stays within bounds and prevents those nasty overflows from causing havoc.
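
The overflow and the watchdog idea can both be modeled in a toy Python heap. This is only an analogy under stated assumptions: ToyHeap and its methods are invented for the example, and the real approach interposes on C library functions through LD_PRELOAD rather than wrapping a Python class.

```python
class ToyHeap:
    """A flat byte array with a bump allocator that remembers chunk sizes."""

    def __init__(self, size: int) -> None:
        self.memory = bytearray(size)
        self.chunks = {}  # chunk start offset -> chunk size
        self.top = 0

    def malloc(self, size: int) -> int:
        if self.top + size > len(self.memory):
            raise MemoryError("toy heap exhausted")
        start = self.top
        self.chunks[start] = size
        self.top += size
        return start

    def copy_unchecked(self, dest: int, data: bytes) -> None:
        # Like an unchecked C copy: writes len(data) bytes no matter what.
        self.memory[dest:dest + len(data)] = data

    def copy_checked(self, dest: int, data: bytes) -> None:
        # The safety net: compare the write size with the recorded chunk size.
        if len(data) > self.chunks[dest]:
            raise OverflowError("write is larger than the destination chunk")
        self.copy_unchecked(dest, data)


heap = ToyHeap(32)
first = heap.malloc(8)
second = heap.malloc(8)
heap.copy_unchecked(second, b"SECRET!!")
heap.copy_unchecked(first, b"A" * 12)  # 4 bytes too many: spills into the neighbor
print(bytes(heap.memory[second:second + 8]))
```

Swapping the last copy for copy_checked raises an error instead of silently corrupting the neighboring chunk, which is the kind of red flag the watchdog is meant to raise.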

Speaker 1

13:53

That's reassuring. Now, Chapter eight challenges us to roll up our sleeves and build our own disassembler using the Capstone framework. Why would we bother building our own when there are already so many disassemblers out there?

Speaker 2

14:05

Good question. While those general purpose disassemblers are great for everyday tasks, sometimes you need something more specialized, tailored to a specific need. Building a custom disassembler gives you the flexibility to handle unusual instructions, deal with obfuscated code, or even implement new analysis techniques.

Speaker 1

14:25

So it's about having the right tool for the job, especially when you're dealing with tricky binaries.

Speaker 2

14:30

Exactly. The PDF walks us through creating a simple linear disassembler using Capstone.

Speaker 1

14:36

What exactly does linear disassembly mean?

Speaker 2

14:38

Well, it's like reading a book from start to finish without skipping around. Linear disassembly analyzes instructions sequentially, assuming a straightforward flow of execution. However, as we move towards more complex analysis, we need something more powerful. That's where recursive disassembly comes in.

Speaker 1

14:55

Recursive disassembly?

Speaker 2

14:58

Yep. Recursive disassembly takes into account the program's control flow, following all those jumps and conditional branches to explore every possible path the program could take.
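
The difference between the two strategies shows up as soon as data sits between instructions. The Python sketch below uses a made-up four-instruction toy encoding (not x86, and not the Capstone API) so that it stays self-contained; the opcodes, names, and sample program are all assumptions for illustration.

```python
# Toy instruction set: opcode -> (mnemonic, total size in bytes).
OPCODES = {0x01: ("nop", 1), 0x02: ("jmp", 2), 0x03: ("jz", 2), 0x04: ("ret", 1)}


def decode(code: bytes, addr: int):
    """Decode one instruction at addr, or return None if it is not valid."""
    entry = OPCODES.get(code[addr])
    if entry is None or addr + entry[1] > len(code):
        return None
    name, size = entry
    operand = code[addr + 1] if size == 2 else None  # absolute jump target
    return name, operand, size


def linear(code: bytes):
    """Decode byte after byte from the start, ignoring control flow."""
    out, addr = [], 0
    while addr < len(code):
        ins = decode(code, addr)
        if ins is None:
            addr += 1  # skip a byte that does not decode
            continue
        out.append((addr, ins[0], ins[1]))
        addr += ins[2]
    return out


def recursive(code: bytes, entry: int = 0):
    """Follow jumps and branches, decoding only reachable addresses."""
    seen, work, out = set(), [entry], []
    while work:
        addr = work.pop()
        while addr < len(code) and addr not in seen:
            ins = decode(code, addr)
            if ins is None:
                break
            seen.add(addr)
            name, operand, size = ins
            out.append((addr, name, operand))
            if name == "ret":
                break
            if name == "jmp":
                addr = operand      # unconditional: only the target is reachable
                continue
            if name == "jz":
                work.append(operand)  # conditional: explore the target later
            addr += size
    return sorted(out)


# 0: jmp 3 | 2: one inline data byte (0x02) | 3: nop | 4: ret
program = bytes([0x02, 0x03, 0x02, 0x01, 0x04])
print(linear(program))
print(recursive(program))
```

The linear pass misreads the data byte at offset 2 as a jump and swallows the real nop, while the recursive pass follows the jump and never touches the data.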

Speaker 1

15:06

So it's like exploring a Choose your Own Adventure book, following all the different paths.

Speaker 2

15:10

You got it. It gives us a complete map of the program's execution logic.

Speaker 1

15:14

That's really cool. The PDF then takes recursive disassembly to the next level, using it to build a tool that can find ROP gadgets. What are ROP gadgets, and why are they so interesting to security researchers?

Speaker 2

15:27

ROP gadgets? They're like building blocks for attackers. They're short snippets of code within a program, usually ending with a return instruction, that can be chained together to execute arbitrary code.
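
A first step in hunting for gadgets is simply finding byte sequences that end in a return. The Python sketch below lists candidate windows ending in the x86 ret opcode (0xC3); it is not the book's tool, the function name is hypothetical, and a real scanner would go on to disassemble each window (for example with Capstone) to keep only those that decode to valid instructions.

```python
RET = 0xC3  # x86 near-return opcode


def find_gadget_candidates(code: bytes, max_len: int = 4):
    """Return (offset, bytes) for each window of up to max_len bytes ending in ret."""
    candidates = []
    for index, byte in enumerate(code):
        if byte != RET:
            continue
        # Try every start offset that keeps the window within max_len bytes.
        for start in range(max(0, index - max_len + 1), index + 1):
            candidates.append((start, code[start:index + 1]))
    return candidates


# mov rbp, rsp (48 89 e5); pop rdi (5f); ret (c3); nop (90)
sample = b"\x48\x89\xe5\x5f\xc3\x90"
for offset, window in find_gadget_candidates(sample, max_len=3):
    print(hex(offset), window.hex())
```

Some windows start in the middle of an intended instruction. Because x86 instructions have variable length, such unaligned starts can still decode to something usable, which is part of what makes gadget hunting productive.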

Speaker 1

15:39

So attackers can essentially hijack a program's execution by stringing together these pre-existing pieces of code.

Speaker 2

15:46

Exactly. They can bypass security mechanisms and do all sorts of nasty things. The tool described in the PDF allows security researchers to scan for these gadgets, identify potential vulnerabilities, and hopefully patch them before the bad guys can exploit them.

Speaker 1

16:00

So it's a tool for proactive security.

Speaker 2

16:02

You got it. Now, in Chapter nine, things get really interesting. We're introduced to dynamic binary instrumentation using the Pin framework. This sounds like we're not just passively observing programs anymore. We can actually modify their behavior as they run. That's the power of dynamic binary instrumentation. It lets us insert our own code into a running program, effectively changing the rules of the game on the fly.

Speaker 1

16:26

Wow, it's like we're becoming code wizards. The PDF starts with the simple example of a profiler that counts the number of executed instructions and function calls. What is profiling, and why is it useful?

Speaker 2

16:38

Profiling is essential for understanding how a program performs. By pinpointing the parts of the code that are executed most often, we can identify bottlenecks and make the program run faster and more efficiently.
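
As a loose analogy for a call-counting profiler, here is a Python sketch that uses the interpreter's own profiling hook, sys.setprofile, to count function calls without changing the profiled function. It works at the level of Python functions, not machine instructions, and it is not the Pin API; count_calls and fib are names chosen for the example.

```python
import sys
from collections import Counter


def count_calls(func, *args):
    """Run func(*args) and count how often each Python function is entered."""
    counts = Counter()

    def profiler(frame, event, arg):
        # "call" fires whenever a Python-level function is entered.
        if event == "call":
            counts[frame.f_code.co_name] += 1

    previous = sys.getprofile()
    sys.setprofile(profiler)
    try:
        func(*args)
    finally:
        sys.setprofile(previous)  # always restore the earlier hook
    return counts


def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)


counts = count_calls(fib, 5)
print(counts["fib"])
```

The recursive fib is entered far more often than its tiny input suggests, which is exactly the kind of hotspot a profiler is meant to expose.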

Speaker 1

16:49

So it's like finding the traffic jams in a program and optimizing the flow.

Speaker 2

16:53

Exactly. And the Pin framework gives us the tools to do just that.

Speaker 1

16:57

What other tricks can we do with the Pin framework?

Speaker 2

16:59

Oh, all sorts of things. The PDF shows us how to create a simple unpacker to reveal the hidden code in packed binaries.

Speaker 1

17:06

What are packed binaries?

Speaker 2

17:07

They're programs whose code has been compressed or encrypted, making it much harder to analyze. It's like hiding a valuable object in a series of locked boxes.

Speaker 1

17:16

So how does Pin help us solve these puzzles?

Speaker 2

17:19

With Pin, we can instrument the unpacking process by tracking memory operations and system calls. We can essentially follow the trail as the packer decompresses or decrypts the original code. It's like having a secret decoder ring.

Speaker 1

17:33

Wow, so we can unlock those secrets. Now, Chapter ten takes us into the realm of dynamic taint analysis, or DTA. It sounds like a way to track the flow of data within a program, but with a unique twist.

Speaker 2

17:47

You're on the right track. Imagine pouring dye into a river and watching how it spreads. DTA works in a similar way. We mark specific data, usually something from an untrusted source like user input, as tainted.

Speaker 1

17:59

Okay, so we're labeling the potentially dangerous data.

Speaker 2

18:02

Exactly. Then we track how that taint spreads through the program as it runs.

Speaker 1

18:05

So we're following the trail of breadcrumbs to see where that potentially dangerous data might end up.

Speaker 2

18:10

Precisely. The PDF uses a network server program as an example to show how DTA can help detect and prevent something called control flow hijacking attacks.

Speaker 1

18:19

What are those and why are they so dangerous?

Speaker 2

18:21

Control flow hijacking attacks? Well, they allow attackers to redirect a program's execution to their own malicious code. It's like someone taking over a train and steering it off the tracks.

Speaker 1

18:32

That's a scary thought.

Speaker 2

18:33

DTA helps us monitor how tainted data influences the program's control flow, so we can stop these attacks before they can do any real damage.
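
The three ingredients of taint tracking (a source that marks data, propagation through operations, and a sink that checks) can be sketched with a small Python wrapper class. This is a conceptual model only: the Tainted class and checked_jump function are invented here, and real DTA tools track taint per byte of memory and registers rather than per Python object.

```python
class Tainted:
    """A value plus a flag saying whether it derives from untrusted input."""

    def __init__(self, value: int, tainted: bool = False) -> None:
        self.value = value
        self.tainted = tainted

    def __add__(self, other):
        other = other if isinstance(other, Tainted) else Tainted(other)
        # Propagation rule: the result is tainted if either operand is.
        return Tainted(self.value + other.value, self.tainted or other.tainted)


def checked_jump(target: Tainted) -> int:
    """Sink check: refuse to use attacker-influenced data as a jump target."""
    if target.tainted:
        raise RuntimeError("tainted value used as a control-flow target")
    return target.value


base = Tainted(0x400000)               # trusted constant
user_input = Tainted(0x41, True)       # source: marked as tainted on arrival
computed = base + user_input           # the taint spreads to the result
print(computed.tainted)
```

Calling checked_jump(base) succeeds, while checked_jump(computed) raises, because the computed target carries the dye from the untrusted input.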

Speaker 1

18:41

So it's like a security system that can spot intruders and prevent them from gaining control.

Speaker 2

18:45

Exactly. Now, Chapter eleven zooms in on using DTA to detect another type of vulnerability, format string vulnerabilities.

Speaker 1

18:53

Those sound tricky. Why are they so dangerous?

Speaker 2

18:55

Format string vulnerabilities happen when a program isn't careful about using user-supplied input in functions that expect a specific format. The classic example is the printf function. Attackers can exploit these vulnerabilities by crafting malicious input that tricks the program into doing things it shouldn't be doing, like executing arbitrary code or leaking sensitive information.

Speaker 1

19:17

So attackers can essentially change the rules of the game by messing with how the program handles output.

Speaker 2

19:23

You got it. The PDF shows how we can use a DTA framework called libdft to build a detector that can identify and prevent these format string attacks.

Speaker 1

19:33

So libdft is like a specialized security guard watching out for anything suspicious related to format strings.

Speaker 2

19:39

Exactly. It gives us a way to keep a close eye on how potentially dangerous data flows through the program and make sure it's not being misused in ways that could compromise security.

Speaker 1

19:48

That's impressive. We've covered so much ground in this deep dive, from the basics of binary analysis all the way to dynamic binary instrumentation and taint analysis. It's amazing to see how we can unravel compiled code and really understand how software ticks.

Speaker 2

20:03

It's quite a journey, isn't it? And there's always more to learn.

Speaker 1

20:06

But the journey isn't over yet, right? The PDF hints at even more advanced techniques, like symbolic execution. That sounds almost like science fiction.

Speaker 2

20:15

It is pretty mind blowing.

Speaker 1

20:16

Chapter twelve lays the groundwork for symbolic execution and introduces us to this thing called Z3, a constraint solver. What exactly is a constraint solver, and how does it fit into all of this?

Speaker 2

20:27

Okay, so imagine a constraint solver as a mathematical detective. You give it these logical formulas, and it uses all sorts of fancy algorithms to figure out if those formulas can be true or not.

Speaker 1

20:38

So it's like a logic puzzle solver.

Speaker 2

20:40

Exactly. Now, in the world of symbolic execution, we represent data in the program as these symbolic expressions. They're kind of like placeholders, and Z3 helps us determine if certain conditions can be met as the program runs.
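
To show what "can these formulas be true?" means in practice, here is a deliberately naive Python stand-in for a constraint solver. It is not Z3 and does not use Z3's API: it just tries every value of a few small variables, whereas a real solver reasons about the formulas without enumerating them. The solve function and the example constraints are assumptions for illustration.

```python
from itertools import product


def solve(constraints, names, bits: int = 4):
    """Return the first assignment satisfying every constraint, or None."""
    for values in product(range(2 ** bits), repeat=len(names)):
        env = dict(zip(names, values))
        if all(check(env) for check in constraints):
            return env  # satisfiable: this assignment is a witness
    return None  # no assignment works: unsatisfiable in this value range


# Satisfiable: x + y == 10 and x - y == 4 has exactly one solution.
sat = solve(
    [lambda e: e["x"] + e["y"] == 10, lambda e: e["x"] - e["y"] == 4],
    ["x", "y"],
)

# Unsatisfiable: no x is both greater than 5 and less than 3.
unsat = solve([lambda e: e["x"] > 5, lambda e: e["x"] < 3], ["x"])

print(sat, unsat)
```

In symbolic execution the constraints come from the branch conditions along a path, so a satisfying assignment is a concrete input that drives the program down that path.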

Speaker 1

20:53

So instead of dealing with concrete values like one, two, three, we're working with these symbolic representations that could be anything. That's the idea, and Z3 helps us reason about how these symbolic values might behave as the program executes.

Speaker 2

21:08

Precisely. The PDF also mentions opaque predicates, which sound like they're designed to make life difficult for analysts.

Speaker 1

21:16

They do sound a bit intimidating. What are those all about?

Speaker 2

21:19

Opaque predicates? Oh, they're basically little bits of code that are deliberately obfuscated to throw off static analysis. Think of them as locked doors with hidden keyholes.
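
A common textbook form of an opaque predicate is a condition that looks input-dependent but always evaluates the same way. The Python sketch below uses the fact that the product of two consecutive integers is always even; the function names are hypothetical, and checking every 8-bit input by brute force stands in for what a constraint solver would prove symbolically.

```python
def opaque_branch(x: int) -> str:
    """A branch whose condition looks data-dependent but is always true."""
    # x and x + 1 are consecutive, so one of them is even and so is the product.
    if (x * (x + 1)) % 2 == 0:
        return "real code"
    return "decoy code"  # never reached, but a static tool cannot tell at a glance


def branch_outcomes(bits: int = 8) -> set[str]:
    """Try every input of the given width and collect the branches taken."""
    return {opaque_branch(x) for x in range(2 ** bits)}


print(branch_outcomes())
```

Since only one outcome ever appears, the decoy branch can be discarded, which is the kind of conclusion a solver reaches by showing that the false side of the condition is unsatisfiable.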

Speaker 1

21:28

So they're meant to keep us out.

Speaker 2

21:30

You could say that. But with symbolic execution and a powerful constraint solver like Z3, we can often crack those open and figure out what they're really up to.

Speaker 1

21:38

So we can outsmart those tricky programmers. Now, Chapter thirteen introduces us to Triton, which is described as a dynamic symbolic execution framework. So now we're not just analyzing the code statically anymore. We're simulating its execution with symbolic values.

Speaker 2

21:56

You got it. Triton lets us run the program in a kind of virtual sandbox where we can watch those symbolic values flow through the code as it executes.

Speaker 1

22:05

So we can see how those values interact and influence the program's behavior.

Speaker 2

22:09

Exactly. It's a really powerful way to explore all the possible execution paths the program could take, even those that are rarely encountered in the real world.

Speaker 1

22:17

That's pretty cool. The PDF gives an example of backward slicing using Triton. What is backward slicing, and how does it help us understand a program?

Speaker 2

22:25

Okay, imagine you're trying to trace the origin of a rumor, right? You start with the person who spread it and ask, where did you hear that from? Then you go to that person, and so on, until you find the original source.

Speaker 1

22:37

We're following the trail back to the beginning.

Speaker 2

22:40

Exactly. Backward slicing in Triton is similar. We start with a particular point of interest in the program, like a specific register value, and work our way backward, tracing the flow of data that led to that value.

Speaker 1

22:53

So we can figure out the chain of events that led to a certain outcome.
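
The walk-backward idea can be sketched on a tiny straight-line trace. In this Python example each step records which register it writes and which registers it reads; the trace, register choices, and the backward_slice function are invented for illustration and do not use Triton's API.

```python
# Hypothetical executed trace: (register written, registers read).
TRACE = [
    ("rax", []),              # 0: rax = 5
    ("rbx", []),              # 1: rbx = 7
    ("rcx", ["rax"]),         # 2: rcx = rax + 1
    ("rdx", ["rbx"]),         # 3: rdx = rbx * 2
    ("rsi", ["rcx", "rax"]),  # 4: rsi = rcx + rax
]


def backward_slice(trace, target: str) -> list[int]:
    """Return indices of the steps that contributed to target's final value."""
    wanted = {target}  # registers whose defining step we still need to find
    keep = []
    for index in range(len(trace) - 1, -1, -1):
        dest, sources = trace[index]
        if dest in wanted:
            keep.append(index)
            wanted.discard(dest)    # this step defines it...
            wanted.update(sources)  # ...so now we need what it read
    return sorted(keep)


print(backward_slice(TRACE, "rsi"))
```

Steps 1 and 3 drop out of the slice for rsi because nothing that feeds rsi ever reads rbx or rdx, leaving only the chain of events that actually matters.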

Speaker 2

22:55

That's the idea. The PDF also mentions that Triton can help us achieve better code coverage during dynamic analysis.

Speaker 1

23:02

Right. Why is that important?

Speaker 2

23:03

Well, traditional dynamic analysis can only observe the code that's actually executed during a specific run, right? But with symbolic execution, we can explore all those possible paths, even the ones that are rarely taken. It's like having a map that shows you all the hidden trails and back alleys.

Speaker 1

23:20

We can see the whole picture, not just the main roads.

Speaker 2

23:22

Exactly, and with that we can uncover potential issues that we might have missed otherwise.

Speaker 1

23:28

Wow, that's incredibly powerful. We learned so much in this deep dive, from those basic concepts of binary analysis all the way to this cutting-edge symbolic execution stuff. It's amazing to see how we can peel back those layers of compiled code and really grasp how software works at its core.

Speaker 2

23:44

It is truly fascinating and there's always more to explore, more techniques to discover.

Speaker 1

23:48

It feels like we've gained a super power, like the ability to see the matrix. Thank you so much for guiding us through this incredible world.

Speaker 2

23:55

The pleasure was all mine. And remember, the best way to learn is by doing. So grab your tools, pick a binary that piques your interest, and start exploring. That vast world of binary analysis is waiting to be discovered.

Speaker 1

24:08

And to all the listeners out there, happy analyzing.
