How to make a programming language
Feb 11 2017
What actually happens in an interpreter when it is evaluating your code? Presumably some very impressive voodoo understandable only to doctors of computer science. The kind of people who are fluent in C++ and can blast through the Euler Project in Haskell before their morning Weetabix.
Right? Wrong. It’s actually relatively straightforward and an interpreter can be built by mere mortals such as you and I. Let’s explore some basic language interpreter concepts in Elixir, a nice friendly high level language. No pointer arithmetic or segfaults here!
The first thing to know is that all code syntax can be expressed as a tree.
Take the expression below, in which +
is called with the numbers 1
and
2
.
This can be expressed as a tree with +
as the root nodes, and the numbers as
its child nodes.
1 + 2
call: +
├── number: 1
└── number: 2
We can do the same for more complex expression.
1 + 2 == 3
call: ==
├── call: +
│ ├── number: 1
| └── number: 2
└── number: 3
Here the ==
is the root node and it has two children nodes, a call
to +
and the number 3. The call to +
has two child nodes, the number 1
and the number 2.
All source code can be expressed as a tree data structure (an abstract syntax
tree to be precise). Let’s implement this in Elixir using a struct for
each type of node we’ve seen so far, as well as a boolean node, which would be the
result of ==
.
defmodule Number do
defstruct :value
def new(value), do: %Number{value: value}
end
defmodule Add do
defstruct :left, :right
def new(left, right), do: %Add{left: left, right: right}
end
defmodule Equals do
defstruct :left, :right
def new(left, right), do: %Equals{left: left, right: right}
end
defmodule Boolean do
defstruct :value
def new(value), do: %Boolean{value: value}
end
Now we can model 1 + 1
like this:
Add.new(Number.new(1),
Number.new(1))
And we can model 1 + 2 == 4
like this:
Equals.new(Add.new(Number.new(1),
Number.new(2)),
Number.new(4))
Now we can express programs in our code, providing those programs do nothing other than add numbers and check for equality. How do we go one step further and execute this tree? First we need to think about what happens when we step through code.
One way to think about this is to model it as the tree going through a series of reduction steps. A node tries to reduce its first child, and if it can then that’s one reduction step. If not it tries to reduce its next child, and so on until it has no more children to reduce, after that the only step is to reduce itself.
This all seems a bit abstract, so let’s look at 1 + 2 == 4
again.
First ==
tried tries to reduce its first child, which is the left hand side
expression 1 + 2
.
+
can be reduced, but first it needs to try and reduce its children. 1
and
2
are numbers, which can’t be reduced, so +
reduces itself it a number by
summing the value of its two children, making 3
.
That’s one reduction step. The expression is now 3 == 4
.
For the next step ==
tries to reduce its first child, 3
, but cannot as
numbers cannot be reduced. Then it tries to reduce its second child, 4
, but
again it cannot. Now there are no more children to reduce, it reduces itself
to false
by checking if the children are equal.
# Initial expression
1 + 2 == 4
# Reduce step 1
3 == 4
# Reduce step 2
false
Or in the Elixir AST:
# Initial expression
Equals.new(Add.new(Number.new(1),
Number.new(2)),
Number.new(4))
# Reduce step 1
Equals.new(Number.new(3),
Number.new(4))
# Reduce step 2
Boolean.new(false)
To teach the Elixir AST how to reduce I’m doing to define an Elixir protocol and then implement it for each of the structs (it’s a bit like an interface in an OO language).
This protocol will have a function reduce
, which takes a node and returns
either a reduced node, or the atom :noop
, signifying that it cannot be
reduced.
defprotocol Node do
def reduce(node)
end
Numbers cannot be reduced, so they return :noop
.
defimpl Node, for: Number do
def reduce(_number) do
:noop
end
end
result = Node.reduce(Number.new(1))
assert result == :noop
And the same for Boolean.
defimpl Node, for: Boolean do
def reduce(_boolean) do
:noop
end
end
result = Node.reduce(Boolean.new(true))
assert result == :noop
Add is more complex. It tries to reduce each of its children, and then reduces
itself if they both :noop
.
defimpl Node, for: Add do
def reduce(add) do
case Node.reduce(add.left) do
{:ok, reduced} ->
Add.new(reduced, add.right)
:noop ->
case Node.reduce(add.right) do
{:ok, reduced} ->
Add.new(add.left, reduced)
:noop ->
{:ok, Number.new(add.left.value + add.right.value)}
end
end
end
end
result = Node.reduce(Add.new(Number.new(1), Number.new(1)))
assert result == {:ok, Number.new(2)}
This code’s pretty grim, so I’ll use Elixir’s monadic with
to tidy it up a
bit. This code does exactly the same thing, except the happy path is written
first, and then the other paths are in the else
block.
defimpl Node, for: Add do
def reduce(add) do
with {:left, :noop} <- {:left, Node.reduce(add.left)},
{:right, :noop} <- {:right, Node.reduce(add.right)} do
{:ok, Number.new(add.left.value + add.right.value)}
else
{:left, {:ok, reduced}} ->
{:ok, Add.new(reduced, add.right)}
{:right, {:ok, reduced}} ->
{:ok, Add.new(add.left, reduced)}
end
end
end
And then lastly there’s Equals
, which is pretty much the same as Add
.
defimpl Node, for: Equals do
def reduce(add) do
with {:left, :noop} <- {:left, Node.reduce(add.left)},
{:right, :noop} <- {:right, Node.reduce(add.right)} do
{:ok, Equals.new(add.left.value == add.right.value)}
else
{:left, {:ok, reduced}} ->
{:ok, Equals.new(reduced, add.right)}
{:right, {:ok, reduced}} ->
{:ok, Equals.new(add.left, reduced)}
end
end
end
Can you spot the difference? It’s just this line in the middle.
{:ok, Boolean.new(add.left.value == add.right.value)}
Maybe there’s an abstraction to be extracted here. For now I’ll leave it as it is- duplication is better than the wrong abstraction.
Right, so if we can reduce expressions by a single step the only thing left to do is perform this step repeatedly until the expression can be reduced no more.
defmodule Program do
def run(expression) do
case Node.reduce(expression) do
{:ok, reduced} ->
run(reduced)
:noop ->
expression
end
end
end
And if we try it out…
ast = Equals.new(Add.new(Number.new(1),
Number.new(2)),
Number.new(4))
result = Program.run(ast)
assert result == Boolean.new(false)
Voila! A new programming language is born. Not a very useful one, but it’ll grow more powerful as more types of node are added, each bringing new behaviour to the language.
I’ve been exploring these ideas over the last month and I’ve now got a tiny little language written in Elixir, including a parser that turns the source code into the AST introduced in this post. The source for this project can be found on GitHub here.
If you’re interested in learning more about language and computation I recommend reading Tom Stuart’s Understanding Computation.
Tara! :)