Data Model
PLAN Data model
PLAN’s data model is specified abstractly as:
Each value is a pin x:<i>, a law x:{n a b}, an app x:(f g), or a nat x:@.
These four constructs are what give PLAN its name, and they abstractly represent the following data structure:
Every PLAN expression is a heap that consists of a merkle-DAG of subheaps. Every subheap is a tree structure containing absolutely pure n-ary functions, closures, natural numbers, and pointers to other subheaps. We represent our heap as a merkle-DAG because its hash-based nature gives us pointers that remain stable between VM runs, allowing us to support huge heaps by transparently paging subheaps in and out from disk. This allows users to e.g. keep their whole media library “in memory”.
Diagram forthcoming
There’s a lot to unpack in that description, so let’s go through the four constructs one by one.
Pins
Subheaps: content addressed DAG nodes and contiguous memory regions of normalized values
Pins are heaps, and nodes in the merkle-DAG of heaps. Formally, a pin <i>
is just a magic box that contains another PLAN value i
that is guaranteed to have been normalized. In practice, a pin is a hint to the virtual machine that it should hash and deduplicate i, and store it in a contiguous region in memory and on disk, similar to GHC’s Compact regions. This gives us several nice properties:
- Comparing two pins for equality is a constant-time operation, similar to when e.g. a string has been interned.
- Data access inside a pin is fast, since contiguous memory regions are cache-friendly.
- On disk and over the network, pins can be addressed by their hashes, which allows them to be loaded on demand, making the heap stable between runs. In this way they are similar to memory pages that might reference each other, but since pins are always normalized, these references are acyclic.
- PLAN values are immutable, so we need to reclaim memory somehow. Garbage collection requires a full heap-traversal, which is a problem with the huge heaps that we want to support. Pins mitigate this problem because each one is stored in a contiguous region on disk, which means that pins on disk only ever reference other pins, not values inside other pins. Because of this, a full garbage collection pass only needs to traverse the pin DAG, without looking at the data inside the pins. PLAN evaluation is lazy, but each state-machine transition does a full normalization, so long-lived space leaks are impossible, and the on-disk representation will never contain unevaluated thunks.
Laws
Supercombinators: pure n-ary functions
The law {n a b}
is a supercombinator of arbitrary but fixed arity a, with the body b and the name n. “Supercombinator” really just means “an actually pure function (or a constant)”. Even in supposedly pure programming languages, functions can typically call other functions that are “in scope”, which means that they can access implicit state – a very reasonable UX affordance. But PLAN is not a UI, it is a specification, so we can’t tolerate such ambiguity. To make a law accessible to another law, it must either be inlined or passed as an argument. This is what we mean when we say that laws are actually pure functions. The only environment a law has access to is itself and its arguments.
Apps
Applications: closures or thunks
Apps are applications of functions to arguments. They are binary trees or cons cells that are often left-heavy or left-skewed, such as (((f a) b) c). Syntactically they associate to the left, so we can also write (f a b c). As the variable names suggest, the leftmost node in an app is typically a function, while the rest are arguments to that function, some of which might be other apps.
A partially applied function, such as ({"fun" 2 body} arg1)
is a closure (notice the arity is 2), while a fully applied function such as ({"fun" 2 body} arg1 arg2)
is a thunk. Apps that are closures can be deconstructed and inspected, but once an app has graduated to a thunk, it will be reduced before it can be further manipulated.
Since all functions are actually pure, a thunk-app contains the entire environment needed to execute a function. Because apps are used as environments in this way, the interpreter typically improves memory locality by recognizing their left-associativity and optimizing them to arrays, instead of storing the formal tree structure.
Nats
Natural numbers: opaque data or opcodes
Nats are natural numbers. Most of the time this is just data. We encode integers, strings and MP3 files using natural numbers. Nats that are smaller than the machine’s word size get stored directly. Bigger nats are stored as pointers and benefit from structural sharing. The nats 0-4 can also code for opcodes: instructions to the virtual machine. Like laws, these operations are pure functions, but they are built in.