Why am I writing a Rust compiler in C?

John Nunley · August 25, 2024

To bootstrap Rust, no cost is too great.

Perceptive Rustaceans may have noticed my activity has gone down as of late. There are a handful of different reasons for this. I’ve been the subject of a truly apocalyptic series of life events, including the death of a relative that rattled me to my core. I’ve had more responsibilities at work, leaving me with less time and energy to contribute. Maybe I’ve also lost a little bit of the college-kid enthusiasm that brought me to open source in the first place.

There’s another reason, too. I’ve been cooking up a project that’s been taking up most of my time. It’s certainly the largest project I’ve created in the open source world, and if I complete it, it will certainly be my crowning achievement.

I am writing a Rust compiler in pure C. No C++. No flex or yacc. Not even a Makefile. Nothing but pure C.

It’s called Dozer.

Wait, Why?

To understand why I’ve followed this path of madness, you first need to understand bootstrapping and why it is important.

Let’s say that you’ve written some code in Rust. In order to run this code, you need to compile it. A compiler is a program that parses your code, validates its correctness, and then transforms it into machine code that the CPU can understand.

Dependency Dog Dependency Dog: Yes, it’s significantly more complicated than that. Except when it’s less complicated than that. Compilers are tricky to even describe.

For Rust, your main compiler is rustc. If you don’t know, this is the underlying program that cargo calls when you run cargo build. It’s fantastic software, and frankly a gem of the open source community. Its code quality is up there with the Linux kernel and the Quake III source code.

However, rustc itself is a program. So it needs a compiler to compile it from its source code to machine code. Say, what language is rustc written in?

rustc is 97.3 percent rust

Ah, rustc is a Rust program. Written in Rust, for the purpose of compiling Rust code. But, think about this for a second. If rustc is written in Rust, and rustc is needed to compile Rust code, that means you need to use rustc to compile rustc. Which is fine for us users, since we can just download rustc from the internet and use it.

But, who compiled the first rustc? There had to be a chicken before the egg, right? Where does it start?

Actually, that’s fairly simple. Every new version of rustc was compiled with the previous version of rustc. So rustc version 1.80.0 was compiled with rustc version 1.79.0. Which was, in turn, compiled with rustc version 1.78.0. And so on and so forth, all the way back to version 0.7 if the compiler. At that point, the compiler was written in OCaml. So all you needed was an OCaml compiler to get a fully functioning rustc program.

There, problem solved! We’ve figured out how to create rustc from first principles! All is well, let’s go back to business.

Just one more thing. We still need a version of the OCaml compiler for all of this to work. So what language is the OCaml compiler written in?

OCaml is 84 percent OCaml

faceplant

Okay, okay, no worries! There is a project that can successfully compile the OCaml compiler using Guile, which is one of the many variants of Scheme, which is one of many variants of Lisp. Not to mention, Guile’s interpreter is written in C.

So this brings us, as all eventually things do, to the C programming language. We just compile it using GCC, and everything works out. So we just need to compile GCC, which is written using… C++?!

Okay, that’s a little unfair. GCC was written in C until version 5, and it’s not like there’s a shortage of C compilers written in C out there. For instance, consider TinyCC, which is written in C and handles not only compiling, but assembly and linking too.

…but that still doesn’t answer our question. What was the first C compiler written in? Assembly? Then what was the first assembler written in?

The Descent Principle

This is where we introduce the Bootstrappable Builds project. To me, this is one of the most fascinating projects in the open source community. It’s basically code alchemy.

Their Linux bootstrap process starts with a 512-byte binary seed. This seed contains what’s possibly the simplest compiler you can imagine: it takes hexadecimal digits and outputs the corresponding raw bytes. As an example, here part of the “source code” that’s compiled with this compiler.

31 C0           # xor ax, ax
8E D8           # mov ds, ax
8E C0           # mov es, ax
8E D0           # mov ss, ax
BC 00 77        # mov sp, 0x7700
FC              # cld ; clear direction flag
88 16 15 7C     # mov [boot_drive], dl

Note that everything after the pound sign is a comment, and all whitespace is stripped. Frankly, I’m not even sure this can be called a programming language. Still, it is technically analyzable, dissectable source code.

From here, this compiler compiles a very simple operating system, a barebones shell, and a slightly more advanced compiler. That compiler compiles a slightly more advanced compiler. A few steps later, you have something that roughly looks like assembly code.

DEFINE cmp_ebx,edx 39D3
DEFINE je 0F84
DEFINE sub_ebx, 81EB

:loop_options
    cmp_ebx,edx                         # Check if we are done
    je %loop_options_done               # We are done
    sub_ebx, %2                         # --options

Man, it’s weird to think of assembly code as being higher-level than anything else, right?

This is enough to get them to a very basic subset of C. Then they compile a slightly more advanced C compiler written in this subset. A few steps later they can compile TinyCC. From there they can bootstrap yacc, basic coreutils, Bash, autotools, and eventually GCC and Linux.

I’m not doing this justice, it’s a fascinating process. Every step is listed here.

Anyhow, you’ve essentially gone from “a binary blob small enough to be manually analyzed” to Linux, GCC, and basically everything else. But let’s start again from TinyCC.

Right now, Rust shows up very late into this process. They use mrustc, an alternative Rust implementation written in C++ that can compile rustc version 1.56. From here, they then compile up to modern Rust code.

The main issue here is that, by the time C++ is introduced into the bootstrap chain, the bootstrap is basically over. So if you wanted to use Rust at any point before C++ is introduced, you’re out of luck.

So, for me, it would be really nice if there was a Rust compiler that could be bootstrapped from C. Specifically, a Rust compiler that can be bootstrapped from TinyCC, while assuming that there are no tools on the system yet that could be potentially useful.

That’s Dozer.

The Plan

I’ve been working on Dozer for the past two months, putting my anemic free time to work on writing in a language that I kind of hate.

Dependency Dog Dependency Dog: That’s a little unfair. C has some elegant qualities to it. Reality truly is what you make of it. It’s just that I would not let this code anywhere near production.

It’s written with no extensions, and so far both TinyCC and cproc are able to compile it with no issues. I’m using QBE as a backend. Other than that, I assume no tools exist on the system. Just a C compiler, some very basic shell implementation, and nothing else.

I won’t get into the raw experience of writing a compiler in this blogpost. But so far, I have the lexer done, as well as a sizable part of the parser. Macro/module expansion is something I’m putting off as long as possible, typechecking only supports i32, and codegen is a little bit rough. But it’s a start.

I can successfully compile this code:

fn rust_main() -> i32 {
    (2 - 1) * 6 + 3
}

So, where to from here? Here’s my plan.

  • Slowly advance Dozer until it can compile some basic libc-using samples, then libcore, then rustc.
    • For the record, I’m planning on compiling rustc’s Cranelift backend, which is written entirely in Rust. Since we’re assuming we don’t have C++ yet, we can’t compile LLVM.
  • Create a cargo equivalent that can use Dozer to compile Rust packages.
  • Find out which sources in rustc are automaticaly generated and then strip them out. By the Bootstrappable project’s rules, automatically generated code is not allowed.
  • Create a process that can be used to compile rustc and then cargo, then use our compiled versions of rustc/cargo to re-compile canonical versions of rustc/cargo.

This will definitely be the hardest project I’ve ever undertaken. Part of me doubts that I will be able to finish it. But you know what? It’s better to have tried and lost than to never have tried at all.

Stay tuned for more Dozer updates, as well as an explanation of the architecture I have planned.

Twitter, Facebook

This website's source code is hosted via Codeberg