I wrote my first compiler! A few weeks ago, actually… I’ve been procrastinating this blog post because I haven’t been too sure what to say on the subject. It’s a very simple compiler that takes Tiny BASIC code and outputs NASM Assembler that can be compiled into MS-DOS/FreeDOS executables.
TinyBASIC is a simple microcontroller language from back in the days where 4K was a normal amount of RAM. It was popular because it could fit in 3K and still run your reasonably useful programs. It looks like this:
REM FIBONACCI SEQUENCE GENERATOR
REM RELEASED INTO THE PUBLIC DOMAIN
REM SETUP THE VARIABLES — COUNTER, CURRENT VALUE AND LAST VALUE
10 LET I = 0
20 LET X = 1
30 LET Y = 0
REM DISPLAY THE CURRENT ITERATION AND VALUE
40 PRINT “FIB: “; I, X
REM ADD THE LAST VALUE TO THE CURRENT VALUE WHILE SAVING THE CURRENT VALUE
REM AS THE NEW LAST VALUE
50 LET Z = X
60 LET X = X + Y
70 LET Y = Z
REM INCREMENTING THE ITERATOR AND ABORTING WHEN IT’S TIME
80 LET I = I + 1
90 IF I <= 10 GOTO 40
Pretty neat, eh? It’s actually pretty easy to parse when you get in the hang of things. I read the first few chapters in Jack Crenshaw’s tutorial on compilers, Let’s Build a Compiler to get a basic expression parser running and did the rest on my own. Now, that code above compiles and runs perfectly!
The most important thing I learned from this, after all my previous attempts at writing compilers, is that parser generators are time-wasters from the depths of hell. It takes me three hours to set up something that looks like it might parse the code — not compile, but parse — that throws constant errors while I got a few working statements from my hand-written compiler in one.
The resulting code is much cleaner, too, because you don’t have a parser generator putting in tons of impossible-to-read tables and includes that it shouldn’t even need (Bison WTF is wrong with you?). LALR parsers, from what I understand, are the primary output of most of these parser generators. They seem to have a reputation for being almost unreadable and unwritable by human beings, and I have yet to find an upside to them other than that they are memory-efficient compared to some of the alternative results of parser generators. At the same time, I only use a few global integers in my compiler, and the code is smaller… especially considering that the compiler includes both a scanner and parser in the same code set.
I also found that writing the compiler by hand was much more enjoyable as immediate results were faster to procure and having full control and responsibility for the code felt more challenging. Considering I’m primarily a hobbyist, challenge is one of the most important parts of anything I do.
I targeted MS-DOS/FreeDOS because it has what is quite possibly the cleanest and simplest system call table that I’ve ever encountered (in other words, it was really easy to target). I used DosBox to emulate the MS-DOS kernel on my GNU/Linux machine. By request of a friend of mine, I tested my compiler on Windows and found that the resulting executables ran natively. Now that’s backwards compatibility!
All in all the project was pretty fun and the result was a neat little toy. This new skill should come in handy when it comes time to implement scripting into one of my game projects. Lua is for babies!
The source code is available online under the GNU General Public License version 3 or later. Leave a note in the comments if it helps you figure out how compilers work! If you are really interested in writing your own compiler, I highly recommend Jack Crenshaw’s tutorial. It’s very helpful and spans over topics from expression parsing to interpreters.
- TinyBASIC-C at GitHub
- Let’s Build a Compiler
- More on TinyBASIC
- NASM/YASM (required to build the results of Tiny BASIC all the way to binary format)
- DosBox (for people who want to try out the compiler on GNU/Linux)
- Wikipedia Entry on LALR Parsers
Stay tuned: I’m doing some finishing touches on a new release of Pizazz and have been doing some behind-the-scenes work on CANINE!