Writing a C Compiler: Build a Real Programming Language from Scratch

Speaker 1

00:00

Welcome to the deep dive. Today, we're tackling something fundamental yet often unseen in the world of coding: the compiler. Think of it as the architect behind the scenes, taking your human-readable instructions and blueprinting them into the precise machine language your computer understands.

Speaker 2

00:19

And for this deep dive, we're not just talking about compilers conceptually. We're getting into the initial stages of building our own. Our mission is to really dissect the core transformations required to compile even the most, well, the simplest C programs.

Speaker 1

00:33

And to guide this architectural exploration, we'll be using excerpts from Writing a C Compiler as our foundational text. It's our guide as we uncover how source code undergoes its initial metamorphosis into executable form. Okay, let's unpack this, let's do it. Our compiler construction starts with establishing a modular pipeline of four key stages.

Speaker 2

00:53

That's right. Yeah, even for the simplest stuff, like maybe not even Hello World, maybe just returning a number, we'll be setting up a four-pass structure. The first of...

Speaker 1

01:01

...these is the lexer, or tokenizer.

Speaker 2

01:04

Right, or tokenizer. Yeah, same thing.

Speaker 1

01:06

So the lexer's fundamental task is to scan our C code and break it down into its essential components, right? Like identifying the individual building blocks before you start assembling them. Exactly.

Speaker 2

01:16

These smallest meaningful units are the tokens. Think about curly braces defining scope, the semicolon ending statements, keywords like int, right, int and return, and then the identifiers you create for functions and variables. All of these are distinct tokens. The lexer reads your code character by character and groups the characters intelligently.

Speaker 1

01:37

I see. So a basic line like int main() { return 32; } would be segmented into the tokens int, main, (, ), {, return, 32, ;, and }.

Speaker 2

01:45

Exactly that sequence.

Speaker 1

01:47

Okay, what's the next step after this initial segmentation?

Speaker 2

01:51

Following tokenization comes the parser. It takes that linear sequence of tokens...

Speaker 1

01:55

Just a list, basically. Just a list.

Speaker 2

01:57

...and imposes a hierarchical structure. It constructs what's known as an abstract syntax tree, or AST.

Speaker 1

02:04

Okay, so instead of just a flat list, we get a structured representation that shows how these tokens relate to each other.

Speaker 2

02:10

Precisely. Think of it as moving from say a list of ingredients to an actual recipe.

Speaker 1

02:15

Oh nice analogy.

Speaker 2

02:17

The AST is this tree-like structure that embodies the grammatical rules of C and reveals the program's operational flow. It's a format that lets the compiler analyze and, well, understand the code's intent much better than just that stream of...

Speaker 1

02:33

...tokens. Right, because just having a list of words doesn't tell you the sentence structure or the meaning. Exactly. So, for our example, int main() { return 32; }, what would the AST look like in its simplest form?

Speaker 2

02:46

In a simplified view, the root might be a Program node. Okay. Descending from that, you'd have a Function node for main. Inside the function there'd be a Return node, and finally, connected to Return, a Constant node holding the value.

Speaker 1

02:59

So it visually organizes the program's components and their relationships. The function contains a return statement, which returns a...

Speaker 2

03:07

...constant. Exactly. It captures that structure.

Speaker 1

03:09

Interesting. Now the compiler has this structured tree. What happens next in the transformation?

Speaker 2

03:14

This is where the translation to a lower-level language begins. The code generation pass takes the AST, the tree we just built, yes, that tree, and translates it into assembly language instructions. Assembly.

Speaker 1

03:26

Okay, that's much closer to the hardware, right? Exactly.

Speaker 2

03:29

And it's important to understand that at this stage the compiler isn't directly writing out, like, a human-readable assembly text file. Oh, okay. It's still manipulating data structures internally. It's creating an in-memory representation of these assembly instructions.

Speaker 1

03:46

So the creation of the actual .s file comes later?

Speaker 2

03:49

Yes, that's the job of the fourth initial pass, code emission. Right. This pass takes the in-memory assembly representation we just created and finally writes it out, serializes it into a file, usually with the .s extension.

Speaker 1

04:03

And that .s file is what the assembler and linker use later on?

Speaker 2

04:07

That's the one. They take that and produce your final executable program.

Speaker 1

04:10

It seems like a considerable number of steps for such a basic program, you know, return 32. Why not a more direct approach?

Speaker 2

04:17

That's a really fair question. Again, it might seem like overkill for these tiny examples, but establishing this multi-pass architecture right from the start gives us huge advantages later. By separating concerns, lexical analysis here, syntax there, codegen over here (modularity, exactly, modularity), we create a more maintainable system. Imagine trying to translate, I don't know, a complex novel straight into another language without first understanding the grammar and structure.

Speaker 1

04:45

Yeah, that would be a mess.

Speaker 2

04:46

It would be incredibly hard. This separation lets us handle more complexity later without having to rip everything up and start again. It's really about building a scalable foundation.

Speaker 1

04:55

That makes a lot of sense, planning for future complexity. Okay, speaking of assembly language, can we actually peek under the hood and see what a real compiler like GCC generates for a simple C program?

Speaker 2

05:06

That's an excellent idea. It helps ground this discussion. For a simple program like int main() { return 2; }, saved as, say, return_2.c, you can use GCC with some specific command-line flags: gcc -S -O -fno-asynchronous-unwind-tables -fcf-protection=none return_2.c. Whoa.

Speaker 1

05:24

Okay, lots of flags there, but the key is -S?

Speaker 2

05:27

This is the main one telling it to stop after compilation and output assembly. The others just simplify the output a bit for our purposes.

Speaker 1

05:33

Got it. And what does the output look like?

Speaker 2

05:35

This command generates a file, probably return_2.s, with the assembly for that C program. You'll likely see something really simple, like .globl main, then main:, then movl $2, %eax, then ret.

Speaker 1

05:48

Okay, that's definitely not C. What's going on here?

Speaker 2

05:50

Let's break it down. The syntax is AT&T assembly syntax, common on Linux and macOS. That first line, .globl main, starts with a period, right? That means it's an assembler directive. It's an instruction for the assembler itself, not the CPU. .globl main just makes the main symbol visible outside this file.

Speaker 1

06:08

A symbol, like a label or a name for a place in memory?

Speaker 2

06:11

Precisely. Here, main is a symbol representing the starting address of our main function. The compiler doesn't know the final address yet. The linker figures that out later. Ah.

Speaker 1

06:19

The linker's job: resolving symbols. Exactly.

Speaker 2

06:23

It resolves them and assigns actual memory locations. If code refers to main, the linker patches in the real address. That's called relocation.

Speaker 1

06:31

So main: on the next line is just marking the spot, the start of the code? Exactly.

Speaker 2

06:35

It's the label. Then movl $2, %eax. That's a real instruction.

Speaker 1

06:40

movl: move long, 32-bit?

Speaker 2

06:43

Yep, a 32-bit integer. $2 means the literal value 2.

Speaker 1

06:47

An immediate value, right. And %eax?

Speaker 2

06:49

%eax is a register, a small, fast storage spot inside the CPU. So this instruction puts the value 2 into the %eax register.

Speaker 1

06:57

Okay, but why %eax specifically? Convention?

Speaker 2

07:01

In many calling conventions, the standard ways functions call each other, the %eax register is designated to hold the function's return value.

Speaker 1

07:08

Oh, okay. So because our C code returns 2, we put...

Speaker 2

07:10

...2 in %eax, so whoever called main can find the result there.

Speaker 1

07:13

Makes sense. And the last line, ret?

Speaker 2

07:16

It just means return. It tells the CPU to go back to where main was called from. So yeah, those four lines are the complete assembly for a tiny C program.

Speaker 1

07:23

That's surprisingly direct.

Speaker 2

07:25

Cool.

Speaker 1

07:26

So when we compile a C program, even with our own simple compiler, what's the typical sequence of operations overall?

Speaker 2

07:31

Right. So while our first compiler focuses mainly on that compilation-to-assembly step...

Speaker 1

07:35

Step two in the usual process?

Speaker 2

07:38

Yeah, the standard C process has a few phases. First, there's preprocessing.

Speaker 1

07:42

Handling #include and macros and stuff.

Speaker 2

07:44

Exactly. Commands like gcc -E do this. It often outputs a .i file.

Speaker 1

07:49

Then comes compilation proper, our focus, generating the .s assembly file. Correct.

Speaker 2

07:53

Then in a full setup you have assembly and linking, usually just gcc ASSEMBLY_FILE -o OUTPUT_FILE.

Speaker 1

08:00

It takes the .s file, makes machine code, links libraries...

Speaker 2

08:03

And gives you the final executable. Right. Our initial compiler will sort of stub out that last step, relying on the system's assembler and linker. Gotcha.

Speaker 1

08:10

And for our own compiler driver, the command-line tool we're building, how should that behave?

Speaker 2

08:14

Good question. It should take the path to a C source file, like ./YOUR_COMPILER /path/to/program.c. If it works, success, it should create an executable in the same directory, same name, but no .c, so /path/to/program, and exit with code zero. And if it fails: a non-zero exit code and, crucially, no output files. No executable. Clean failure.

Speaker 1

08:36

Clear rules. And I saw mentions of --lex and --parse options in the notes.

Speaker 2

08:40

Uh, yeah, those are mostly for testing and debugging. With --lex, our compiler should just run the lexer...

Speaker 1

08:45

And stop. Check tokenizing.

Speaker 2

08:48

Right. And --parse runs the lexer and parser, builds the AST, then stops. Neither should create any output files. They just check those stages internally.
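
The driver rules described above (take a source path, stop early for a stage option, name the executable after the source file minus its .c extension) might be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `parse_args`, `plan`, and `external_commands` are hypothetical, the option spellings `--lex` and `--parse` are assumed, and the stage names are placeholders for passes you would write yourself. A real driver would also run the commands, return a non-zero exit code on failure, and delete any partial output.

```python
import argparse
import os


def parse_args(argv):
    """Parse a hypothetical driver command line."""
    parser = argparse.ArgumentParser(description="Compiler driver sketch")
    parser.add_argument("source", help="path to a .c source file")
    stage = parser.add_mutually_exclusive_group()
    stage.add_argument("--lex", action="store_true", help="stop after lexing")
    stage.add_argument("--parse", action="store_true", help="stop after parsing")
    return parser.parse_args(argv)


def plan(argv):
    """Return (stages to run, executable path or None if nothing is written)."""
    args = parse_args(argv)
    if args.lex:
        return ["lex"], None
    if args.parse:
        return ["lex", "parse"], None
    base, _extension = os.path.splitext(args.source)
    return ["lex", "parse", "codegen", "emit", "assemble_and_link"], base


def external_commands(source):
    """Build the gcc invocations that surround our own compile step."""
    base, _extension = os.path.splitext(source)
    preprocess = ["gcc", "-E", source, "-o", base + ".i"]
    assemble_and_link = ["gcc", base + ".s", "-o", base]
    return preprocess, assemble_and_link
```

For example, `plan(["/path/to/program.c"])` reports the full pipeline and the output path `/path/to/program`, while `plan(["--lex", "/path/to/program.c"])` reports only the lexer and no output file.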

Speaker 1

08:56

Okay, that makes sense for development. All right, we've got a solid high-level picture: the four passes, the standard GCC flow. Let's dive deeper into the lexer and parser now. They're the first big hurdle in building our own, right? Absolutely.

Speaker 2

09:09

Chapter two of the guide digs into building these, starting with the lexer. As we said, its job is finding tokens, and one simplifying assumption we make early on is that our C files only use ASCII characters.

Speaker 1

09:22

Just standard ASCII for now. Sensible starting point. How do we actually test if our lexer is doing the right thing?

Speaker 2

09:28

The guide provides a test_compiler tool, which is super helpful. It comes with test programs. In tests/chapter_2, you'll find directories like invalid_lex...

Speaker 1

09:35

Programs that should fail the lexer.

Speaker 2

09:37

Exactly: bad tokens, weird characters. And then invalid_parse and valid directories for later stages. You test the lexer specifically using ./test_compiler /path/to/your_compiler --chapter 2 --stage lex.

Speaker 1

09:50

Okay, so that command runs our compiler in lex-only mode against those test cases and checks if it accepts or rejects them correctly.

Speaker 2

09:57

That's the idea. It verifies the pass/fail behavior. It doesn't necessarily check the exact stream of tokens for the valid files, though.

Speaker 1

10:04

Ah, so for that level of detail, we'd need our own unit tests.

Speaker 2

10:08

Precisely, you'd write tests to feed it valid code and assert that the token list matches exactly what you expect, and feed it invalid code to check the error messages.

Speaker 1

10:17

Got it. Any key implementation tips for the lexer itself? Yes.

Speaker 2

10:20

A couple of important ones. First, when you see something that looks like an identifier, a sequence of letters, numbers, and underscores like main or my_variable, right, your logic, maybe your regex, will probably also match keywords like int or return. The efficient way is to first recognize it as a generic identifier, then check if that identifier happens to be on the list of reserved keywords.

Speaker 1

10:41

Ah, don't try to make the initial pattern distinguish them. Identify, then classify.

Speaker 2

10:45

Exactly, two steps. The other thing is, don't rely only on whitespace to split tokens. Oh, right. Think about main(). That's three tokens, main, (, and ), with no whitespace separating them. If you just split on spaces, you'd get it wrong.
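
The two tips above (match a generic identifier first and classify keywords afterwards, and scan by pattern rather than splitting on whitespace) can be sketched as a small Python lexer. This is a minimal sketch, not the guide's implementation: the names `tokenize`, `TOKEN_PATTERNS`, and `KEYWORDS`, the token kind labels, and the exact regular expressions are assumptions chosen for this tiny subset.

```python
import re

# Reserved words checked only AFTER something has matched as an identifier.
KEYWORDS = {"int", "return"}

# Hypothetical patterns for the tiny C subset; re.ASCII keeps \w to ASCII.
TOKEN_PATTERNS = [
    ("IDENTIFIER", re.compile(r"[a-zA-Z_]\w*\b", re.ASCII)),
    ("CONSTANT", re.compile(r"[0-9]+\b", re.ASCII)),
    ("PUNCT", re.compile(r"[(){};]")),
]


def tokenize(source):
    """Return a list of (kind, text) pairs; raise ValueError on a bad token."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1  # whitespace is skipped, never used as the splitter
            continue
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(source, pos)
            if match:
                text = match.group(0)
                # Step two: classify identifiers that are reserved words.
                if kind == "IDENTIFIER" and text in KEYWORDS:
                    kind = "KEYWORD"
                tokens.append((kind, text))
                pos = match.end()
                break
        else:
            raise ValueError(
                f"unexpected character {source[pos]!r} at offset {pos}"
            )
    return tokens
```

Because the scanner matches at the current position instead of splitting on spaces, `tokenize("main()")` yields three tokens even though the input contains no whitespace. The trailing `\b` in the constant pattern makes input such as `1foo` fail instead of silently lexing as `1` followed by `foo`.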

Speaker 1

10:59

Good point. Okay, so the lexer spits out tokens. Then the parser steps in to build the AST. Exactly.

Speaker 2

11:06

The parser takes that flat stream and gives it structure hierarchy based on the C grammar. The AST is the data structure holding that hierarchy.

Speaker 1

11:15

We saw the simple AST for return thirty two. What about something slightly more complex, like an if statement. How does the hierarchy show up there?

Speaker 2

11:23

Okay, good example. Let's say you have if (a < b) return 2 + 2;. Right. The top AST node might be an If node. This If node would have, say, two main children, one for the condition...

Speaker 1

11:34

a < b, which itself might be structured?

Speaker 2

11:36

Oh yeah, that condition could be a binary-op node for the less-than comparison, with its own children for the variable a and the variable b.

Speaker 1

11:41

Okay, and the other child of the if?

Speaker 2

11:43

That would be the then block, return 2 + 2. That could be a Return node, and its child would be another binary-op node for the plus, with two Constant children both holding 2. Wow.

Speaker 1

11:54

Okay, so the tree really mirrors the nesting and the logic: if, condition, then. And the condition and the then part have their own little subtrees.
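
To make the nesting concrete, here is a rough Python sketch of that tree plus a tiny indentation-based pretty-printer. The class names (`If`, `BinaryOp`, `Var`, `Return`, `Constant`), their field names, and the `pretty` helper are all hypothetical, and the sketch assumes the condition compares a and b with a less-than operator.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Constant:
    value: int


@dataclass
class Var:
    name: str


@dataclass
class BinaryOp:
    operator: str
    left: object
    right: object


@dataclass
class Return:
    exp: object


@dataclass
class If:
    condition: object
    then: object


def pretty(node, indent=0):
    """Render a node and its children, one per line, indented by depth."""
    pad = "  " * indent
    if not is_dataclass(node):
        return f"{pad}{node!r}"  # leaf value such as 2 or 'a'
    lines = [f"{pad}{type(node).__name__}"]
    for field in fields(node):
        lines.append(pretty(getattr(node, field.name), indent + 1))
    return "\n".join(lines)


# if (a < b) return 2 + 2;
tree = If(
    condition=BinaryOp("<", Var("a"), Var("b")),
    then=Return(BinaryOp("+", Constant(2), Constant(2))),
)
```

Calling `print(pretty(tree))` shows `If` at the left margin with the condition subtree and the `Return` subtree indented beneath it, which is the same shape the conversation describes.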

Speaker 2

12:02

Precisely. It captures that structure directly, which is what the next stages need. Now, to define these AST structures formally and, importantly, in a language-neutral way, the guide introduces something called ASDL.

Speaker 1

12:13

ASDL, the Zephyr Abstract Syntax Description Language?

Speaker 2

12:16

That's the one. It's just a formal way to write down what our AST nodes look like.

Speaker 1

12:20

Okay, So what does the ASDL look like for our super simple C subset in chapter two?

Speaker 2

12:25

It's pretty minimal. It's like: program = Program(function_definition); function_definition = Function(identifier name, statement body); statement = Return(exp); exp = Constant(int).

Speaker 1

12:34

Okay, let's decode that. A program is just a Program node containing one function definition. Yep. A function definition is a Function node. It has a name, which is an identifier type, and a body, which is a statement type.

Speaker 2

12:47

Right, and those words name and body are just field names helpful labels.

Speaker 1

12:51

Gotcha. Then a statement can only be a Return node containing an exp, an expression, for now. Yes. And an exp can only be a Constant node holding an int.

Speaker 2

13:00

That's it for chapter two. Identifier and int are like built in ASDL types.

Speaker 1

13:05

So when we implement this, say in Python or Rust or Java, we'll create classes or data types that match this ASDL structure. Exactly.

Speaker 2

13:15

Functional languages might use algebraic data types. OOP languages might use abstract classes and inheritance. The guide mentions some idioms and points to more reading if you want to go deeper into implementation strategies.
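
As one possible rendering of the four node types discussed above (a program holding one function, a function with a name and a body, a return statement, and an integer constant), here is a minimal sketch using Python dataclasses. The choice of dataclasses and the field name `value` are assumptions for illustration; the other field names follow the labels mentioned in the conversation.

```python
from dataclasses import dataclass


@dataclass
class Constant:  # exp: a Constant holding an int
    value: int


@dataclass
class Return:  # statement: a Return containing an exp
    exp: Constant


@dataclass
class Function:  # function definition: identifier name, statement body
    name: str
    body: Return


@dataclass
class Program:  # program: exactly one function definition
    function_definition: Function


# The AST for a main function that returns 32.
tree = Program(Function(name="main", body=Return(Constant(32))))
```

Dataclasses give structural equality and a readable `repr` for free, which makes it easy to assert in unit tests that a parser produced exactly the expected tree.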

Speaker 1

13:25

Okay, but the ASDL defines the structure. It doesn't tell the parser which tokens, in what order, make up, say, a function definition, right? It doesn't mention the int keyword or the parentheses or braces.

Speaker 2

13:37

That is a crucial distinction. You're absolutely right. The AST is abstract. It leaves out syntactic details like semicolons and braces. The parser needs a concrete map of the token sequences.

Speaker 1

13:45

Which is where the formal grammar comes in.

Speaker 2

13:47

Exactly, using a notation called Backus-Naur form, or BNF.

Speaker 1

13:52

BNF. Okay, what's the BNF for this simple C?

Speaker 2

13:55

It mirrors the ASDL pretty closely: <program> ::= <function>; <function> ::= "int" <identifier> "(" ")" "{" <statement> "}"; <statement> ::= "return" <exp> ";"; <exp> ::= <int>. And then it clarifies the terminals: <identifier> ::= ? an identifier token ? and <int> ::= ? a constant token ?.

Speaker 1

14:09

Okay, so things in angle brackets, like <statement>, are non-terminals. They correspond to our AST node types.

Speaker 2

14:15

Yes, grammatical categories.

Speaker 1

14:16

And things in quotes, like "return", are terminals, the actual tokens the lexer gives us.

Speaker 2

14:20

Exactly, the literal tokens we expect to see. The BNF spells out the exact sequence: an int token, then an identifier token, then the parentheses, an opening brace, a statement, and a closing brace. Right.

Speaker 1

14:32

And the question-mark definitions are just clarifying what kind of token identifier and int refer to. So the BNF is the parser's rulebook for matching token sequences to build the AST nodes defined by the ASDL.

Speaker 2

14:43

You've got it. Perfect summary. The guide also shows how you'd extend the BNF, like adding an if statement rule: "if" "(" <exp> ")" <statement> [ "else" <statement> ]. The brackets mean the else part is optional. Neat. Okay.

Speaker 1

14:54

So we have tokens, the ASDL defining the target AST and the BNF grammar as a rule book. How does the parser actually do the parsing? What's the technique?

Speaker 2

15:04

The guide introduces a common technique called recursive descent parsing.

Speaker 1

15:08

Recursive descent. Sounds intriguing.

Speaker 2

15:10

The basic idea is simple: for each non-terminal symbol in the BNF grammar, like <program>, <function>, or <statement>, you write a corresponding parsing function.

Speaker 1

15:20

Okay, a function for each rule.

Speaker 2

15:22

Pretty much, and these functions often call each other, mirroring the structure of the grammar. That's the recursive part.

Speaker 1

15:27

Ah, okay. So how would parse_statement work, based on our simple grammar?

Speaker 2

15:32

Well, the rule is <statement> ::= "return" <exp> ";". So the parse_statement function would first look for a return token. Okay. If it finds one, it consumes it. Then it needs to parse an exp, so it would call another function, maybe parse_exp.

Speaker 1

15:45

Which would handle parsing the integer constant in our case.

Speaker 2

15:47

Right. parse_exp would return the Constant AST node. Then parse_statement looks for the final semicolon token and consumes that. And if everything worked, it bundles up the Constant node returned by parse_exp inside a new Return AST node and returns that. Got it.

Speaker 1

16:03

The guide showed some pseudocode with an expect helper function.

Speaker 2

16:06

Yeah. Expect is useful. It basically means check if the next token is x, consume it if yes, raise an error if no.

Speaker 1

16:13

And these functions consume tokens as they go. So if parse_program finishes and there are still tokens left over?

Speaker 2

16:19

That usually means there's extra stuff that doesn't fit the grammar. A syntax error. Makes sense.
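
Putting the pieces above together (one parsing function per grammar rule, an expect helper, and a leftover-token check at the end), a recursive descent parser for this tiny subset might look like the sketch below. It is not the guide's pseudocode. Tokens are plain strings here for brevity, the names `ParseError`, `parse_function`, and `parse_program` are assumptions, and the function rule is assumed to take empty parentheses.

```python
from dataclasses import dataclass

KEYWORDS = {"int", "return"}


@dataclass
class Constant:
    value: int


@dataclass
class Return:
    exp: Constant


@dataclass
class Function:
    name: str
    body: Return


@dataclass
class Program:
    function_definition: Function


class ParseError(Exception):
    pass


def expect(expected, tokens):
    """Consume the next token if it equals `expected`; otherwise raise."""
    if not tokens:
        raise ParseError(f"expected {expected!r} but reached end of input")
    actual = tokens.pop(0)
    if actual != expected:
        raise ParseError(f"expected {expected!r} but found {actual!r}")


def parse_exp(tokens):
    # Tokens are assumed to come from an ASCII-only lexer.
    if not tokens or not tokens[0].isdigit():
        found = tokens[0] if tokens else "end of input"
        raise ParseError(f"expected a constant but found {found!r}")
    return Constant(int(tokens.pop(0)))


def parse_statement(tokens):
    expect("return", tokens)
    exp = parse_exp(tokens)
    expect(";", tokens)
    return Return(exp)


def parse_function(tokens):
    expect("int", tokens)
    is_name = (
        bool(tokens)
        and tokens[0] not in KEYWORDS
        and (tokens[0][0].isalpha() or tokens[0][0] == "_")
    )
    if not is_name:
        raise ParseError("expected an identifier")
    name = tokens.pop(0)
    for punctuation in ("(", ")", "{"):
        expect(punctuation, tokens)
    body = parse_statement(tokens)
    expect("}", tokens)
    return Function(name, body)


def parse_program(tokens):
    tokens = list(tokens)  # copy: the parse functions consume from the front
    function = parse_function(tokens)
    if tokens:  # anything left over does not fit the grammar
        raise ParseError(f"unexpected token {tokens[0]!r} after function")
    return Program(function)
```

Each function mirrors one rule, and the calls between them mirror the way the rules refer to each other. Popping from the front of a list is simple but slow for large inputs; an index into the token list would be a reasonable alternative.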

Speaker 1

16:23

The guide mentioned predictive parsers and backtracking briefly too.

Speaker 2

16:26

Yeah. For more complex grammars, where a rule might have multiple options, like if versus return for a statement, the parser might need to peek ahead at the next token to decide which path to take (that's predictive), or try one path and backtrack if it fails.
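
A predictive choice can be sketched by peeking at the next token without consuming it. The example below is a simplified illustration: it assumes the statement rule is either a return statement or an if statement with an optional else, represents AST nodes as plain tuples, and only handles integer constants as expressions. None of these names come from the guide.

```python
def expect(expected, tokens):
    """Consume the next token if it equals `expected`; otherwise raise."""
    actual = tokens.pop(0) if tokens else None
    if actual != expected:
        raise SyntaxError(f"expected {expected!r} but found {actual!r}")


def parse_exp(tokens):
    if not tokens or not tokens[0].isdigit():
        raise SyntaxError("expected a constant")
    return ("Constant", int(tokens.pop(0)))


def parse_statement(tokens):
    # Predictive step: look at the next token, but do not consume it yet.
    next_token = tokens[0] if tokens else None
    if next_token == "return":
        expect("return", tokens)
        exp = parse_exp(tokens)
        expect(";", tokens)
        return ("Return", exp)
    if next_token == "if":
        expect("if", tokens)
        expect("(", tokens)
        condition = parse_exp(tokens)
        expect(")", tokens)
        then_branch = parse_statement(tokens)
        else_branch = None
        if tokens and tokens[0] == "else":  # the optional else part
            tokens.pop(0)
            else_branch = parse_statement(tokens)
        return ("If", condition, then_branch, else_branch)
    raise SyntaxError(f"expected a statement but found {next_token!r}")
```

One token of lookahead is enough here because each alternative starts with a different keyword, so the parser never has to undo work.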

Speaker 1

16:41

But for our simple start, direct recursive descent works well.

Speaker 2

16:45

And testing the parser? Same tool?

Speaker 1

16:47

Yep: ./test_compiler /path/to/your_compiler --chapter 2 --stage parse. It checks against the invalid_parse and valid tests again. Writing your own tests to check the structure of the output AST is super helpful for debugging. And the implementation tips were: write a pretty-printer for the AST, which definitely helps visualize the tree, and give good error messages.

Speaker 2

17:05

Crucial. 'Expected ";" but found "return" on line five, column ten' is way better than just 'syntax error.'
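
One way to produce messages of that kind is to record each token's character offset in the lexer and convert it to a line and column only when reporting an error. The helper names below, `line_and_column` and `describe_mismatch`, are hypothetical, and the sketch assumes offsets are 0-based while reported lines and columns are 1-based.

```python
def line_and_column(source, offset):
    """Convert a 0-based character offset into a 1-based (line, column)."""
    line = source.count("\n", 0, offset) + 1
    last_newline = source.rfind("\n", 0, offset)  # -1 when on the first line
    column = offset - last_newline
    return line, column


def describe_mismatch(source, offset, expected, found):
    """Build an error message that says what was expected and where."""
    line, column = line_and_column(source, offset)
    return (
        f'Expected "{expected}" but found "{found}" '
        f"on line {line}, column {column}"
    )
```

Keeping only offsets in the token stream keeps the lexer simple; the line and column arithmetic runs only on the error path.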

Speaker 1

17:12

Absolutely. Okay. So: source, then lexer, then tokens, then parser, then C AST. We have the tree. What's next?

Speaker 2

17:20

Now we hit code generation. This pass takes that C-language AST...

Speaker 1

17:24

The one the parser just built.

Speaker 2

17:25

Exactly, and transforms it into our target x64 assembly instructions, but again, not as text yet. We represent the assembly program as another internal data structure first.

Speaker 1

17:35

Another AST, an assembly AST? Precisely.

Speaker 2

17:39

The guide calls it that to keep things clear. It has its own ASDL definition too.

Speaker 1

17:43

Okay, what does the assembly ASDL look like?

Speaker 2

17:45

It's also quite simple for now: program = Program(function_definition); function_definition = Function(identifier name, instruction* instructions); instruction = Mov(operand src, operand dst) | Ret; operand = Imm(int) | Register.

Speaker 1

17:58

Okay, interesting parallels. A program has a function definition. A function has a name, but instead of a C statement body, it has instruction*, a list of instructions.

Speaker 2

18:07

The asterisk means a list or sequence.

Speaker 1

18:09

And the instruction types are Mov or Ret, and their operands can be Imm, an immediate constant, or a Register.

Speaker 2

18:16

That's it for now. And initially, the only register we care about is %eax, for the return value. The code generator walks the C AST, and for each node it figures out the equivalent assembly instructions and builds up this assembly AST.

Speaker 1

18:30

The guide had a table mapping C AST nodes to assembly AST constructs, like Return(exp) in C becomes Mov(exp, Register) and then Ret in assembly.

Speaker 2

18:39

Right, and Constant(int) in the C AST becomes Imm(int) in the assembly AST.

Speaker 1

18:44

So it's a translation step, building a new tree that represents the assembly code needed. Exactly.

Speaker 2

18:48

And you can see how one C statement, return, maps to two assembly instructions, mov and ret. That becomes more common as things get complex.
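
The mapping described above, where a return statement becomes a move into the return register followed by a ret and a constant becomes an immediate operand, can be sketched as a walk over the C AST. The class names follow the node names used in the conversation, but the `Asm` prefixes, the field name `value`, and the `gen_*` function names are assumptions made to keep the two trees apart in a single file.

```python
from dataclasses import dataclass
from typing import List


# --- C AST (input) ---
@dataclass
class Constant:
    value: int


@dataclass
class Return:
    exp: Constant


@dataclass
class Function:
    name: str
    body: Return


@dataclass
class Program:
    function_definition: Function


# --- assembly AST (output) ---
@dataclass
class Imm:
    value: int


@dataclass
class Register:  # only one register for now, so no fields are needed
    pass


@dataclass
class Mov:
    src: object
    dst: object


@dataclass
class Ret:
    pass


@dataclass
class AsmFunction:
    name: str
    instructions: List[object]


@dataclass
class AsmProgram:
    function_definition: AsmFunction


def gen_operand(exp):
    return Imm(exp.value)  # Constant(int) -> Imm(int)


def gen_instructions(statement):
    # One C statement expands into two assembly instructions.
    return [Mov(gen_operand(statement.exp), Register()), Ret()]


def gen_function(function):
    return AsmFunction(function.name, gen_instructions(function.body))


def gen_program(program):
    return AsmProgram(gen_function(program.function_definition))
```

The result is still an in-memory tree, not text, so later passes can inspect or rewrite instructions before anything is written to disk.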

Speaker 1

18:56

Okay, assembly AST constructed in memory. The final step of the initial four passes: code emission.

Speaker 2

19:01

Take that assembly AST we just built and write it out to the .s text file.

Speaker 1

19:04

Finally the text file.

Speaker 2

19:06

Finally, the text file. And since the assembly AST structure closely matches actual assembly syntax, this pass is usually quite straightforward. You just traverse the assembly AST and output the text for each node.

Speaker 1

19:16

Another table in the guide showed the formatting: Function(name, instructions) becomes .globl name, then name:, then the text for each instruction, each on...

Speaker 2

19:26

...a new line. Yeah. Mov(src, dst) becomes movl src, dst. Ret becomes ret. Register becomes %eax. Imm(int) becomes $int.

Speaker 1

19:35

Just translating the assembly AST nodes into their standard text representation.

Speaker 2

19:40

Pretty much a direct translation.

Speaker 1

19:41

Yeah, and remembering that macOS needs the underscore prefix on the global name, like _main.

Speaker 2

19:46

Right, that platform detail needs to be handled by the emitter.
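
As a rough sketch of the emission pass, the function below walks a small assembly AST and produces the text described above, adding the underscore prefix when the target platform is macOS. The dataclass definitions repeat the hypothetical assembly AST so the block stands alone, the `emit_*` names and the four-space indentation are arbitrary choices, and the `platform` parameter is an assumption that defaults to Python's `sys.platform`.

```python
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class Imm:
    value: int


@dataclass
class Register:
    pass


@dataclass
class Mov:
    src: object
    dst: object


@dataclass
class Ret:
    pass


@dataclass
class AsmFunction:
    name: str
    instructions: List[object]


@dataclass
class AsmProgram:
    function_definition: AsmFunction


def emit_operand(operand):
    if isinstance(operand, Imm):
        return f"${operand.value}"
    if isinstance(operand, Register):
        return "%eax"
    raise TypeError(f"unknown operand: {operand!r}")


def emit_instruction(instruction):
    if isinstance(instruction, Mov):
        src = emit_operand(instruction.src)
        dst = emit_operand(instruction.dst)
        return f"    movl {src}, {dst}"
    if isinstance(instruction, Ret):
        return "    ret"
    raise TypeError(f"unknown instruction: {instruction!r}")


def emit_program(program, platform=sys.platform):
    function = program.function_definition
    # macOS expects a leading underscore on global symbol names.
    name = "_" + function.name if platform == "darwin" else function.name
    lines = [f"    .globl {name}", f"{name}:"]
    lines += [emit_instruction(i) for i in function.instructions]
    return "\n".join(lines) + "\n"
```

Writing the returned string to a `.s` file is then a single file write, which keeps the platform-specific detail in one place.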

Speaker 1

19:49

Okay. And then to test the whole thing, lexer, parser, codegen, code emitter, we run test_compiler without the --stage flag?

Speaker 2

19:55

Exactly: ./test_compiler /path/to/your_compiler --chapter 2. That command will: one, run your compiler on the test C files to generate .s files; two, use the system's GCC or Clang to assemble and link those .s files into executables; three, run those executables; four, compare their exit codes to the exit codes produced by compiling the original C files directly with GCC.

Speaker 1

20:21

So it verifies the end-to-end behavior. Does our compiler produce assembly that results in a program doing the same thing, at least in terms of exit code, as GCC?

Speaker 2

20:30

That's the goal for this stage. It's the final check that all four passes are working together correctly for these simple programs.
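
The exit-code comparison at the heart of that end-to-end check can be sketched in a few lines. This is not how the test tool is implemented; the function names are hypothetical, and the two commands below are stand-ins that use the Python interpreter in place of real compiled executables so the sketch runs without a C toolchain.

```python
import subprocess
import sys


def exit_code(command):
    """Run a command (a list of arguments) and return its exit code."""
    return subprocess.run(command).returncode


def same_exit_code(our_command, reference_command):
    """True when both programs finish with the same exit code."""
    return exit_code(our_command) == exit_code(reference_command)


# Stand-ins for "executable built by our compiler" and "executable built by GCC".
fake_ours = [sys.executable, "-c", "import sys; sys.exit(2)"]
fake_reference = [sys.executable, "-c", "raise SystemExit(2)"]
```

With real binaries, the two commands would simply be the paths of the executable produced from our `.s` file and the one produced by the reference compiler.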

Speaker 1

20:37

Wow. Okay, so in this deep dive, we've really laid out the blueprint for a compiler's first steps. Yeah. Lexing code into tokens, parsing tokens into that crucial abstract syntax tree, generating an intermediate assembly representation from that tree, and finally emitting that assembly into a text file. It's fascinating to see how these stages transform the code step by step.

Speaker 2

20:57

Absolutely. And like we said, while it seems like a lot for just returning a number, this structured multi-pass approach is really the key. It's the foundation we need to start handling more complex C features like operators, variables, and control flow in our next deep dives.

Speaker 1

21:13

It definitely gives you a much deeper appreciation for what's happening when you just type gcc my_program.c. Okay. Thinking about these basic steps, how do they scale? How does this foundation handle the sheer complexity of, say, the Linux kernel source code? And what are the really tricky C features that will challenge this simple pipeline later on? Definitely something to chew on.

Speaker 2

21:33

Indeed, plenty more complexity ahead.

Speaker 1

21:36

Thanks for joining us for this deep dive.

Transcript source: Provided by creator in RSS feed