Compiler Internals (Old)#
Introduction#
This chapter is an overview of the libraries involved during
compilation, information was gathered while hacking on the compiler.
It focuses only on DFMC, the Dylan Flow Machine Compiler, located in
sources/dfmc
of the opendylan repository.
But first look how to get there: let’s consider the command-line compiler, invoked with
dylan-compiler -build hello-world
There is a fairly long call chain to perform the compilation, starting with command-line parsing in sources/environment/console/command-line.dylan, which looks like this:
do-execute-command(..., <build-project-command>)
(defined in sources/environment/commands/build.dylan)
build-project
(defined in sources/environment/dfmc/projects/projects.dylan)
compile-library
parse-and-compile
(defined in sources/project-manager/projects/compilation.dylan)
parse-project-sources
(defined in sources/dfmc/management/definitions-driver.dylan)
compile-project-definitions
(defined in sources/dfmc/browser-support/glue-routines.dylan)
compile-library-from-definitions
(defined in sources/dfmc/management/world.dylan called here
under the name ``dfmc-compile-library-from-definitions`` due
to a prefixed import.).
To explain these long call chains, we need some more understanding of the different libraries of the compiler:
environment is the public API
project-manager is a bunch of hacks for
finding the project source (via the registry)
calling the compiler and linker (to create a dll/so and executable) afterwards.
dfmc/browser-support and environment/dfmc are the glue from environment to DFMC.
The big picture is pretty simple: management drives the different libraries, some are the front-end (reader, macro-expander) translating into definitions; some intermediate language (conversion, optimization, typist) which work on the flow-graph; others are back-end, including linker. There is some support needed for the actual runtime, which is sketched in modeling (parts of which are put into the dylan runtime library), namespace (which handles namespaces and defines the dylan library and its modules).
First we need to introduce some terminology and recapitulate some conventions:
the unit of compilation is a single dylan library
the metadata of a library is stored in DOOD, the dylan object-oriented database
loose (development) vs tight (production): loose mode allows runtime updates of definitions, like adding generic function into a sealed domain, subclassing sealed classes - production mode has stricter checks
batch compilation: when invoked from command line, or building a complete library
interactive compilation: IDE feature to play around, adding a single definition to a library
DFMC is well structured, but sadly some libraries use each other, which they shouldn’t (typist, conversion, optimization).
In the remainder of this guide, we will focus on a simple example,
which prints Hello
x times:
define method hello-world (x :: <integer>) => ()
do(curry(format-out, "Hello %d\n"), range(to: x))
end
dfmc-management#
The library dfmc-management drives the compilation process, prints general information about what is happening at the moment (progress, warnings) and takes care of some global settings like opening and closing source records, etc.
The main external entry point is compile-library-from-definitions
in
world.dylan. This requires that the source has already been parsed
(really? but it calls compute-library-definitions itself!).
It then calls in sequence compute-library-definitions
,
ensure-library-compiled
, ensure-library-glue-linked
,
ensure-library-stripped
and ensure-database-saved
, apart from
console output (warnings, stats).
The very first method, compute-library-definitions
, calls
ensure-library-definitions-installed
, which calls
update-compilation-record-definitions
, which mainly calls
compute-source-record-top-level-forms
. This passes the compilation
record to read-top-level-fragment
to get a fragment and then calls
top-level-convert-forms
.
The reader library defines the read-top-level-fragment
, the
definitions library the top-level-convert-forms
. Thus, a fragment
is read and converted into definitions.
The method ensure-library-compiled
computes, finishes and checks
the data models. Afterwards the code models are generated (the control
flow graph), then type inference is done and the optimizer is run.
Finally the heaps are generated. These are methods defined in
compilation-driver.dylan
, calling out to the modeling, conversion,
flow-graph, typist and optimizer libraries.
Finally the glue is emitted (in back-end-driver.dylan
) and the
database is saved, which contains metadata of each library (like type
information, code models, etc.).
Some global warnings for libraries are defined and checked for in the management library.
The unused file interface.dylan
compares module and binding
definitions, in order to judge whether the public API of a
library/module has changed between two versions. Usage of this would
allow lazy recompilation: only recompile if the API has changed of a
linked library.
dfmc-reader#
This library reads the Dylan source, tokenizes it, annotates source
locations, and builds a parse tree. The parser is in
parser.dylgram
, which uses app/parser-compiler
to generate
infix-parser.dylan
. The API used is read-top-level-fragment
and re-read-fragments
.
Error handling is on the token level, thus a mismatched end
is
noticed. Other sorts of errors are invalid token, integer too large,
character too large, ratios not being supported, end of input (while
more tokens were required).
Every <fragment>
, the base class of the abstract syntax tree, has
a compilation-record and a source-position.
So, for the above hello-world method, read-top-level-fragment
returns the following parse tree:
<body-definition-fragment>:
fragment-macro: <simple-variable-name-fragment>
fragment-name: #"method-definer"
fragment-modifiers: #()
fragment-body-fragment:
<simple-variable-name-fragment>:
fragment-name: #"hello-world"
<parens-fragment>:
fragment-left-delimiter: <lparen-fragment>
fragment-nested-fragments:
<simple-variable-name-fragment>:
fragment-name: #"x"
<colon-colon-fragment>
<simple-variable-name-fragment>:
fragment-name: #"<integer>"
fragment-right-delimiter: <rparen-fragment>
<simple-variable-name-fragment>:
fragment-name: #"do"
<parens-fragment>:
fragment-left-delimiter: <lparen-fragment>
fragment-nested-fragments:
<simple-variable-name-fragment>:
fragment-name: #"curry"
<parens-fragment>:
fragment-left-delimiter: <lparen-fragment>
fragment-nested-fragments:
<simple-variable-name-fragment>:
fragment-name: #"format-out"
<comma-fragment>
<string-fragment>:
fragment-value: "Hello %d\n"
fragment-right-delimiter: <rparen-fragment>
<comma-fragment>
<simple-variable-name-fragment>:
fragment-name: #"range"
<parens-fragment>:
fragment-left-delimiter: <lparen-fragment>
fragment-nested-fragments:
<keyword-syntax-symbol-fragment>:
fragment-value: #"to"
<simple-variable-name-fragment>:
fragment-name: #"x"
fragment-right-delimiter: <rparen-fragment>
fragment-right-delimiter: <rparen-fragment>
<semicolon-fragment>
NB: the type hierarchy for <body-definition-fragment> is: <definition-fragment>, <macro-call-fragment>, <compound-fragment>, <fragment>, <object>
dfmc-definitions#
Once the abstract syntax tree is generated (by the reader), it’s time
to convert this into definitions, which are the names in dylan. There
are several top-level definitions in dylan, namely: binding, class,
constant, (copy-down), domain, function, generic, macro, method,
module, namespace (library) and variable. Every definition has its
own class, inheriting from <top-level-form>
(defined in
common/top-level-forms.dylan). A top level form at least contains
information about its compilation record, source location, parent
form, sequence number and dependencies and referenced variables.
Additional information available are adjectives, the word defined, its
library, original library, top level methods.
As a side note, dependency tracking is also defined in
common/top-level-forms.dylan
.
The main entry point for the definition library is
top-level-convert(parent, fragment)
, defined in
top-level-convert.dylan
.
The building of definition objects relies heavily on the macro-expander, especially on procedural macros described in D-Expressions: Lisp Power, Dylan Style. Open Dylan extends the definitions with compiler, optimizer, primitive and shared-symbols, mainly used internally in the compiler.
Looking into define-method.dylan
, we can see a class
<method-definition>
. This is built by the parser, more
specifically there is a define &definition method-definer
, which
has two rules to match fragments, whereas the second rule is the error
case. The first matches any define method
syntax and calls
do-define-method
with the arguments. The method
do-define-method
defers the work to helper methods
parse-method-adjectives
and parse-method-signature
, and
instantiates a <method-definition>
object.
For our hello-world example, do-define-method
creates a single object:
<method-definition>
private-form-body: <body-fragment>
fragment-constituents: <prefix-call-fragment>
fragment-arguments:
<prefix-call-fragment>
fragment-arguments:
<simple-variable-name-fragment>
fragment-name: #"format-out"
<string-fragment>
fragment-value: "Hello %d\n"
fragment-function: <simple-variable-name-fragment>
fragment-name: #"curry"
<prefix-call-fragment>
fragment-arguments:
<keyword-syntax-symbol-fragment>
fragment-value: #"to"
<simple-variable-name-fragment>
fragment-name: #"x"
fragment-function: <simple-variable-name-fragment>
fragment-name: #"range"
fragment-function: <simple-variable-name-fragment>
fragment-name: #"do"
private-form-signature: <method-requires-signature-spec>
private-spec-argument-next-variable-specs: <next-variable-spec>
private-spec-variable-name: <simple-variable-name-fragment>
fragment-name: #"next-method"
private-spec-argument-required-variable-specs: <typed-required-variable-spec>
private-spec-type-expression: <simple-variable-name-fragment>
fragment-name: #"<integer>"
private-spec-variable-name: <simple-variable-name-fragment>
fragment-name: #"x"
private-form-signature-and-body-fragment: <sequence-fragment>
<parens-fragment>, <simple-variable-name-fragment>, <parens-fragment>, <semicolon-fragment>
private-form-variable-name-or-names: <simple-variable-name-fragment>
fragment-name: #"hello-world"
It is noteworthy that still no intra-library information is present, this is top-level Dylan code without any context. All macros are expanded.
Excursion into run-time and compile-time#
Some objects are defined in the compiler, but are injected into the Dylan world. How does this happen?
In the Dylan library you see // BOOTED:
comments here and
there. The source location of well-known basic types and functions is
dylan:dylan-user:boot-dylan-definitions
.
There is no definition of this specific method.
The method boot-definitions-form?
(in dfmc-definitions) checks
exactly for this name. The method top-level-convert-forms
behaves
differently if boot-definitions-form?
returns true, namely it
calls booted-source-sequence
(in boot-definitions.dylan) which
grabs the boot-record and returns it sorted as a vector.
But what is a boot-record after all? Well, its definition is all in boot-definitions.dylan, with the explanation “records the set of things that must be inserted into a Dylan world at the very start. Some things are core definitions, such as converters and macros, and these are booted at the definition level. The rest are expressed as source to be fed to the compiler.”
The constant *boot-record*
is filled by do-define-core-*. These
are called by dfmc-modeling. Namely, primitives (which names and
signatures are installed), macros, modules, libraries, classes.
Be aware that the actual implementation of the primitives is in the
runtime (either sources/lib/run-time/run-time.c
or the
runtime-generator generates a runtime.o containing those definitions),
but some crucial bits, like the adjectives (side-effect-free
,
dynamic-extent
, stateless
and opposited
) are in
dfmc-modeling and are used in the optimization!
The core classes are emitted from modeling with actual constructors. Be aware that the runtime layout is also recorded in run-time.h.
The dylan library and module definitions are in modeling/namespaces.dylan.
A noteworthy comment is that a compiler (comp-0, generation 0) loads the Dylan library (dylan-0), which contains the definitions (defs-0). When compiling itself (comp-1), first a fresh Dylan library (dylan-1) is built, which contains still the old booted definitions (defs-0). It emits new definitions (defs-1) and a new boot-record when dumping dfmc-definitions. Now the next generation compiler (comp-1) will use these new definitions in the next Dylan (dylan-2) library. Beware of dragons.
dfmc-macro-expander#
The deep magic happens here.
dfmc-convert#
Converts definition objects to model objects. In order to fulfill
this task, it looks up bindings to objects from other
libraries. Also converts the bodies of definitions to a flow
graph. Does some initial evaluation, for example limited(<vector>,
of: <string>)
gets converted to a <&limited-vector-type>
instance. Thus, it contains a poor-mans eval.
Also, creates init-expressions, which may be needed for the runtime. Since everything can be dynamic, each top-level form may need initializing. This happens when the library is loaded.
Also sets up a lexical environment for the definitions, and checks bindings.
Here, type variables are now recorded into the lexical environment, the type variables are passed around while the signature is checked.
After Dylan code is converted, it is in a representation which can
be passed to a back-end to generate code. Modeling objects have
corresponding compile and run time objects, and are prefixed with
an ampersand, e.g., <&object>
.
dfmc-modeling#
Contains modeling of runtime and compile time objects. Since some
calls are attempted at compile time rather than at runtime,
it provides these compile time methods with a mechanism to override
the runtime methods (define &override-function
). An example for
this is ^instance?
, compile time methods are prefixed with a ^
,
while compile and runtime class definitions are prefixed with &
,
like define &class <type>
.
Also, DOOD (a persistent object store) models and proxies for compile time definitions are available in this library, in order to load definitions of dependent libraries.
dfmc-flow-graph#
The flow graph consists of instances of the <computation>
class,
like <if>
, <loop-call>
, <assignment>
, <merge>
. The flow
graph is in a (pseudo) static single assignment (SSA) form. Every time any
algorithm alters the flow graph, it disconnects the deprecated
computation and inserts new computations. New temporaries are
introduced if a binding is assigned to a new value. Subclasses of
<computation>
model control flow, <temporary>
(as well as
<referenced-object>
) model data flow.
Computations are a doubly-linked list, with special cases for merge
nodes, loops, if, bind-exit and unwind-protect. Every computation
may have a computation-type field, which is bound to a
<type-variable>
. It also may have a temporary slot, which is its
return value. Several cases, single and multiple return values, are
supported. The temporary has a link to its generator, a list of
users and a reference to its value.
Additional (data flow) information is kept in special slots, test
in <if>
, arguments of a <call>
, etc. These are all
<referenced-object>
, or more specially <value-reference>
,
<object-reference>
, etc. <object-reference>
contains a binding
to its actual value.
<temporary>
and <environment>
classes are defined in this
library.
join-2x1
etc. are the operations on the flow graph.
dfmc-typist#
This library contains runtime type algebra as well as a type inference algorithm.
Main entry point is type-estimate
, which calls
type-estimate-in-cache
. Each library contains a type-cache, mapping
from method definitions, etc. to type-variables.
Type variables contain an actual type estimate as well as justifications (supporters and supportees), used for propagation of types.
converts types to <type-estimate>
objects
type-estimate-function-from-signature
calls type-estimate-body
if available (instead of using types of the signature), call chain is
type-estimate-call-from-site
-> type-estimate-call-stupidly-from-fn
-> function-valtype
contains hard-coded hacks for make
, element
, element-setter
(in type-estimate-call-from-site
)
typist/typist-inference.dylan:poor-mans-check-type-intersection if #f (the temp), optimizer has determined that type check is superfluous
dfmc/typist-protocol.dylan:151 - does not look sane!
define function type-estimate=?(te1 :: <type-estimate>, te2 :: <type-estimate>)
=> (e? :: <boolean>, known? :: <boolean>)
// Dylan Torah, p. 48: te1 = te2 iff te1 <= te2 & te2 <= te1
let (sub?-1, known?-1) = type-estimate-subtype?(te1, te2);
let (sub?-2, known?-2) = type-estimate-subtype?(te1, te2);
dfmc-optimization#
This library contains several optimizations: dead code removal, constant folding, common subexpression elimination, inlining, dispatch upgrading and tail call analysis.
Main entry point from management is really-run-compilation-passes
.
This loops over all lambdas in the given code fragment, converts
assigned variables to a <cell>
representation, renames temporaries
in conditionals, then runs the “optimizer”. This builds an
optimization queue, initially containing all computations. It calls
do-optimize on each element of the optimization-queue, as long as
it returns #f
. (The protocol is that if an optimization was successful,
it returns #t
, if it was not successful, #f
). For different types
of computations different optimizations are run. Default
optimizations are deletion of useless computations and constant
folding. <bind>
is skipped, for <function-call>
additionally
upgrade (analyzes the call, tries to get rid of gf dispatch) and
inlining is done. <primitive-call>
are optimized by analyze-calls
.
constant folds (constant-folding.dylan):
// The following is because we seem to have a bogus class hierarchy
// here 8(
// We mustn't propagate a constraint type above its station, since
// the constraint is typically local (true within a particular
// branch, say).
& ~instance?(c, <constrain-type>)
optimization/dispatch.dylan: gf dispatch optimization
optimization/assignment: here happens the “occurrence typing”
(type inference for instance?)…
<constrain-type>
is only for the instance? and conditionals hack