If you’re like me, you’ve probably asked “where does my compiler come from?”
The chain starts as follows: you have a program that’s written in some programming language. If it’s a compiled language (like C++ or Rust), you’ll need to compile it in order to run it.
$ cat > program.c <<EOF
#include <stdio.h>
int main() {
puts("Hello, world!");
return 0;
}
EOF
$ cc program.c -o program
$ ./program
Hello, world!
If it’s an interpreted language (like Python or JavaScript), there’s no need to compile it. Just pass the program to the interpreter.
$ python3 -c "print('Hello, world!')"
Hello, world!
The interpreter is usually a compiled program, however. So the actual chain of events, amortized across multiple parties, can be simplified to:
$ # At your distribution...
$ cc python3.c -o python3
$ upload_to_server ./python3
$ # At your computer...
$ apt install python3
$ python3 -c "print('Hello, world!')"
Hello, world!
No matter what we do, at some point a compiler has to compile some code.
Of course, this is glossing over a lot that can happen between these two extremes of “interpreted” and “compiled” languages. Some interpreters are optimized with just-in-time (JIT) compilation, which effectively turns them into really fast compilers.
However, the compiler is itself a program that’s written in some programming language. Therefore, at the start, you need a compiler to compile that compiler. But then that compiler itself needs a compiler. Then that compiler. So on and so forth.
This is the bootstrap chain: the collection of software that is required to build your software. Usually the bootstrap chain ends at a certain trusted root, like the Debian software repositories. For most organizational contexts, this is as far as you need to go.
But any sufficiently good programmer will eventually get a little more curious. They want to pull on the bootstrap chain a little further than that. The majority of compilers are self-hosting: they’re written in their own language and can be used to compile themselves.
The short version of the story is that the previous version of a compiler is used to compile the next version of a compiler. GCC 14 is used to compile GCC 15. GCC 13 is used to compile GCC 14. Rinse and repeat, all the way back to GCC 1 being used to compile GCC 2.
I can’t really find any history on how the first versions of GCC were compiled¹, but I imagine they were compiled using the simpler C compilers of the time. If you were to follow those compilers back, you’d eventually get ones that were implemented in assembly language. Meaning, you’d need an assembler to assemble them.
Then, who assembles the assembler? In the earliest days of computing, it was common for programmers to simply not use assemblers. They would write out the intended assembly code, “assemble” it into binary code by hand, then load that binary code into the computer. David Wheeler gets credit for inventing the first assembler, the “Initial Orders” for the EDSAC in 1949, which would convert textual assembly code into binary.
For most people, that’s where the story ends. Developers toggled bytes into early computers to get assemblers, early assemblers built early compilers, then early compilers built the modern compilers and interpreters we enjoy today.
For us, it’s just the beginning.
Ken Thompson and Trusting Trust
You may notice that this entire extended “bootstrap chain” takes place over almost 80 years of computing progress. Specifically, large parts of the chain involve programs that have since been lost to time, or compilers that were always closed source. It makes you wonder: what would happen if someone in that chain was actively malicious? What if someone in the chain wanted to control every program at every step after?
Ken Thompson is a living legend. He created Unix, and along the way helped create the C programming language alongside Dennis Ritchie, as well as a million other little things like ed and UTF-8.
For his massive contribution to computer science, he won the Turing Award in 1983. In his award lecture, “Reflections on Trusting Trust,” he demonstrated a possible attack on compiler infrastructure that could be used to compromise all software, everywhere.
He first demonstrated a fairly typical attack: a compiler that intentionally mis-compiled a program with a known source input. Effectively, the compiler changes the behavior of the program it’s compiling. For example, imagine we had a program that validates some input as part of a login program.
#include <stdio.h>
#include <string.h>

const char *password = "foobar";

int main() {
    char buffer[80];
    printf("Input password: ");
    fgets(buffer, sizeof(buffer), stdin);
    buffer[strcspn(buffer, "\n")] = '\0'; /* strip the newline fgets keeps */
    if (strcmp(buffer, password) == 0) {
        puts("Password matches!");
        return 0;
    } else {
        puts("Incorrect password!");
        return 1;
    }
}
Now, you go to compile this program using the system’s C compiler, cc. But let’s say the compiler is written like this:
const char *program_to_look_for = "int main() { ... }";
const char *program_to_replace_it_with = "int main() { ... }";

const char *program_source = read_c_code_from_file();
if (strcmp(program_source, program_to_look_for) == 0) {
    // Compile our custom program instead.
    program_source = program_to_replace_it_with;
}
do_the_actual_compilation(program_source);
For most programs, the compiler will behave normally as intended. But, it’s looking for your specific login program. If it sees it, it will replace the source code it should compile with something else. So even if you pass the compiler the correct source code, it can do whatever it wants with it.
Such a compiler could, for example, insert a backdoor into your login program.
$ good-cc login.c -o login
$ ./login
Input password: secret_backdoor
Incorrect password!
$ # Now, using our backdoored compiler:
$ cc login.c -o login
$ ./login
Input password: secret_backdoor
Password matches!
You can replace the “source code matching” code above with any other kind of heuristic. For example, the compiler could analyze the RTL for some kind of security subroutine, then subtly replace it with a backdoored version.
This is somewhat disturbing, but can be easily avoided by just using a trusted compiler. Except, remember, compilers are themselves programs that need to be compiled by compilers. It’s possible for a compiler to tell when it’s compiling another compiler, then compile it with the backdoor code in it.
Imagine a backdoor programmed like this:
const char *program_source = read_program_code();
if (strcmp(program_source, c_compiler_source_code) == 0) {
    // Infect any compiler we compile with this same check.
    program_source = backdoored_c_compiler_source_code;
} else if (strcmp(program_source, login_source_code) == 0) {
    // Plant the login backdoor.
    program_source = backdoored_login_source_code;
}
do_the_actual_compilation(program_source);
Effectively, this is a self-reproducing backdoor: a “virus” of sorts that infects every subsequent compiler. All a malicious actor has to do is insert this program early in the bootstrap chain. Then, every subsequent compiler in the bootstrap chain will have the backdoor as well. These compilers will compile the backdoor into other compilers, which will then compile the backdoor into your programs.
Impact
The terrifying thing about this virus is that there is no reliable way to detect it. There’s no real way to inspect binaries to ensure the backdoor isn’t there.
You could, in theory, hexdump the programs and look for errant instructions. But the hexdumper could be backdoored as well, made to silently hide the backdoor from you. Ken Thompson goes to great lengths to point out that the backdoor can affect any program that handles code, whether it’s the assembler, the dynamic loader, or even the hardware’s microcode.
You could debug the program and check to see exactly what it’s executing. But, the debugger could be backdoored. It could be silently skipping the instructions that insert the backdoor.
You might be saying “no worries, we have reproducible builds. We can just compare the program across builds to ensure that the hash is the same.” Except this doesn’t work either. How are you checking that the programs are the same? sha256sum? GnuPG? Those can be backdoored too.
You would have to manually inspect the hard drive with an electron microscope to actually verify the exact contents. But even then, this isn’t a bulletproof defense. Modern CPUs run microcode that translates the instruction set into the underlying microarchitecture. That microcode can be compromised and made to mis-translate certain code.
This isn’t a theoretical attack, either. Viruses like this have already been discovered in the wild, such as the Win32/Induc virus. Granted, that virus only infected Delphi environments and was mostly harmless.
But now, imagine what a nation state could do with this attack. PRISM has existed for some time now. In fact, there are individual software companies (and not especially big ones, either) that could put backdoors into popular open source software, backdoors that would then replicate throughout the ecosystem.
This affects all software, closed and open source.
The Solution
The solution to the problem of untrustworthy compiled software is to have a way to create software without relying on other software. Back in the 70’s, the boot program for a computer was toggled into the front panel with a row of switches. This program would be the bare minimum needed to load the rest of the operating system from the disk.
This is still theoretically possible today. For example, it’s possible to take a floppy disk and a pencil, then make holes in the floppy disk with the pencil. The floppy disk reader can interpret the holes as binary, which effectively lets you program with a pencil.
You could use this strategy to write a compiler for a very, very basic programming language. Since you can trust the compiler you just wrote, you can trust anything that compiler outputs. Then, you use this language to write a compiler for a slightly more advanced programming language.
You keep going until you create an assembler, then keep going after that until you have a C compiler. With a C compiler you can start compiling old GNU utilities until you arrive at GCC. Once you have GCC, you can compile anything, including Linux.
This is what the stage0 project has been doing for some time. It starts with a hex interpreter of around 380 bytes of code: small enough to toggle into a computer and verify manually, by hand. Then, after a couple of more advanced hex interpreters, it brings up a macro assembler, which assembles a basic C compiler. That compiler then compiles TinyCC, which can compile a basic operating system and userland. This is enough to get GCC up, which then brings up Linux.
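The job of such a hex interpreter can be sketched in C. To be clear, the real stage0 monitor is a few hundred bytes of hand-verified machine code, not C, and the function name and exact comment syntax below are illustrative assumptions; the point is just how little a first-stage translator has to do: turn pairs of hex digits into raw bytes, ignoring whitespace and comments.

```c
#include <ctype.h>
#include <stddef.h>

/* Sketch of a hex0-style loader: hex digit pairs become bytes;
   ';' or '#' starts a comment that runs to the end of the line.
   Returns the number of bytes written to `out`. */
static size_t load_hex(const char *text, unsigned char *out)
{
    size_t n = 0;
    int high = -1; /* pending high nibble, or -1 if none */
    for (const char *p = text; *p; p++) {
        if (*p == ';' || *p == '#') {        /* skip comment to end of line */
            while (*p && *p != '\n')
                p++;
            if (!*p)
                break;
        } else if (isxdigit((unsigned char)*p)) {
            int v = isdigit((unsigned char)*p)
                        ? *p - '0'
                        : tolower((unsigned char)*p) - 'a' + 10;
            if (high < 0) {
                high = v;
            } else {
                out[n++] = (unsigned char)((high << 4) | v);
                high = -1;
            }
        }
        /* everything else (whitespace, punctuation) is ignored */
    }
    return n;
}
```

Because the translation is this simple, a human can audit both the loader and its input byte by byte, which is exactly the property the bottom of a trusted bootstrap chain needs.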
This is glossing over a lot. The full bootstrap chain is here. I also talk about bootstrappable builds a lot in this post.
Of course, this isn’t bulletproof. stage0 requires a BIOS or UEFI implementation to run, and BIOS and UEFI are themselves code; code that can be very easily compromised. If we’re going this far, who’s to say the microcode or boot ROM on the CPU isn’t compromised? In fact, how do we know the CPU’s RTL isn’t backdoored? It’s not like anyone has ever bothered to check.
Conclusion
I wanted to share a little of my computing paranoia in this post. Computers can never be 100 percent secure. Until someone manages to build a CPU in their garage out of nothing but electrical wire and duct tape, then bring up Linux on that CPU, it will remain possible for any compiler to be compromised.
¹ This Reddit post seems to indicate that the first version of GCC was compiled with some early commercial compiler. But it’s unsourced, and it’s not like I’d take the word of a Reddit post anyway. If you happen to have more context on the early history of GCC, please reach out to me.
