1. Introduction

OrangeC is an optimizing compiler and tool chain. It was written partly with the intention of being a platform for cross-compiling C language code to other architectures and operating systems. Some of the features that make this possible are:

- Modular compiler which defers code generation until the last minute.
- A generic compiler backend interface that allows customization of the code generation as well as various other architecture-specific compiler parameters.
- An ADL (architecture description language) was used to specify assembler code generation, and it is planned to expand it to allow specifying the compiler code generation as well.
- The tools are mostly generic in nature, and share a common set of modules for dealing with object files.
- A linker specification file tells the linker how to interpret command line switches to target a specific operating system file format or ROM image format.
- Other linker specification files are used to specify memory layouts, in a way that is generic enough to be used to create either the layout for a specific operating system or a ROM image file for embedded systems.
- The linker is generic; a post-link program (downloader) is chosen to perform the final translation to the target file format.

That said, the current releases of OrangeC target the x86 and DOS/WIN32. It is also possible to generate ROM images with this system, if one is careful about how the RTL is used.

2. Overview

2.1 Distributed Programs

With the exception of the database package (SQLITE3), parts of the run-time library associated with MSDOS and the WIN32 headers, and lately the C++ header files, the programs in this package are new work (although most of the run-time library is inherited from a previous project, CC386).
The programs in this package are as follows:

- OCC - C compiler
- OASM - assembler
- OBRC - browse compiler (used by the IDE, enabled by compiler output files)
- OCL - shell program for running the compiler, assembler, linker, etc.; the compiler also has much of this functionality built in
- OCPP - standalone preprocessor (there are two versions of this, one using the same code as in the compiler, and the other written in C++)
- OCCPR - code completion compiler (used by the IDE, generated from the main compiler sources)
- OGREP - generic grep utility
- OIMPLIB - WIN32 import librarian
- OLIB - generic librarian
- OLINK - generic linker
- OMAKE - GNU-style make utility
- ORC - WIN32 resource compiler
- DLHEX - platform-dependent image creator - hex files
- DLLE - platform-dependent image creator - 32-bit DOS binaries (LE format)
- DLMZ - platform-dependent image creator - 16-bit DOS binaries (somewhat limited in scope)
- DLPE - platform-dependent image creator - WIN32 binaries
- DLPM - platform-dependent image creator - 32-bit DOS binaries (home-grown version)

Additional directories include:

- ADL - architecture description language compiler
- CLIBS - the C runtime library source code and standard headers
- EXAMPLES - simple example code
- EXEFMT - an include directory for format-specific header files
- HELP - the source and compiled code for the IDE help files
- OBJ - the source code for the object file format handlers (used by everything except the C compiler)

2.2 Object file format

The object file format is a touched-up form of the ASCII version of IEEE-695. This is the preferred format because it is human-readable without a dump program. Some tests were performed with a binary version of the format, but it didn't really enhance the performance. The OBJ directory has generic routines to deal with this file format that are used by everything except the compiler. The compiler currently has hand-rolled code to deal with this file format.
2.3 C Compiler

The C compiler is written in four parts: the preprocessor (occ\preproc), the front end (occ\parser), the intermediate language and optimizer (occ\middle), and the code generator (occ\x86). Additional programs built from the compiler source code exist in the occ\preproc directory (for the shipping version of the standalone preprocessor) and the occ\parse86 directory (for the code completion compiler). There is a set of structures that define the code generator in unique increments, as well as define various architecture-specific constants such as integer sizes. This will eventually be targeted by the architecture description language. Also, there is a somewhat sophisticated peephole optimizer implemented to clean up various artifacts.

2.4 Preprocessor

There are two preprocessors, one in the ocpp directory and one in the occ\preproc directory. The one in the occ\preproc directory is essentially the same preprocessor as in CC386, and is the one favored by the distribution. This preprocessor was heavily vetted by a contributor.

2.5 Assembler

The assembler (oasm) is similar in style to NASM, although no attempt has been made to make it 100% compatible with NASM. The unique feature of oasm is that the code generation code was generated by the ADL compiler, with an architectural description file (oasm\x86.adl) as input.

2.6 Linker

The linker (olink) takes as input various object files, and also accepts platform-specific files that can be used in an arbitrary way to create the output. The primary output of the linker is another object file, which can be in either relative format (useful for further linking) or absolute format (useful for final image generation). The linker can also generate an ODX file for use in symbolic debugging.
It is possible to define the values of externals on the linker command line the same way that symbols are defined on the command line for the compiler or assembler; for example, if an image may be built for different addresses, this could be controlled entirely with command line switches. The linker can take a linker specification (.spc) file that tells the linker exactly how to organize the input object files to create the output image file. The file olink\app is an additional linker configuration file which associates the /T linker switch with creation of a target image file. It associates the switch parameter with various externals, a specification file, and a 'downloader' which is responsible for creating the final target image. Each downloader has an associated specification file that tells how to lay out the code for use with the downloader. The linker typically calls the downloader as a post-link step when the /T switch is used.

2.7 Librarians

olib and oimplib are very similar; both can create libraries. oimplib is specifically used for WIN32 development, related to DLL import libraries. There is nothing remarkable about these programs other than that some of the code is shared with the linker (since the linker reads libraries).

2.8 Make utilities

omake is a make utility inspired by GNU MAKE. It is very similar to GNU MAKE in spirit and functionality, so much so that the GNU documentation could be used to learn it. However, no attempt has been made to make it 100% compatible with GNU MAKE.

2.9 File search

ogrep is a generic grep utility. There is nothing really remarkable about it.

2.10 Resource compiler

orc is the WIN32 resource compiler. It is reasonably compatible with the Microsoft WIN32 resource compiler.

2.11 Downloaders

Downloaders came about following a comment in the IEEE-695 specification which talked about the possibility of a post-link tool that would take generic linker output and transform it into an operating-system-specific image.
A downloader generally has an associated specification file, which guides the linker in creating a generic output file with enough augmentation that it can be used to generate an operating-system-specific file. Once the linker generates such a file, the downloader transforms it into the target file format. Downloaders can take arbitrary command line switches that are configured through the linker configuration file, and the linker can optionally pass files specified on its own command line to the downloader (for example, .RES files are specified on the linker command line for WIN32 images, but they aren't processed by the linker; they are processed by DLPE). The built-in downloaders are as follows:

- DLHEX - platform-dependent image creator - hex files (ROM images)
- DLLE - platform-dependent image creator - 32-bit DOS binaries (LE format)
- DLMZ - platform-dependent image creator - 16-bit DOS binaries (somewhat limited in scope)
- DLPE - platform-dependent image creator - WIN32 binaries
- DLPM - platform-dependent image creator - 32-bit DOS binaries (home-grown version)

The default specification file for DLHEX is very simple; in many cases a more complex, project-specific one may need to be created, and the default can be overridden on the linker command line.

2.12 Browse compiler

obrc is the browse compiler used by the IDE. It takes as input the .cbr files generated by the compiler, and generates the .OBR file used by the IDE.

2.13 Code Completion compiler

occpr (occ\parse86) is the code completion compiler used by the IDE. It takes as input one or more C language source files, and maintains the project .ODS file.

2.14 Compiler shell

ocl is a compiler shell. It generally sequences calling the compiler, assembler, linker, etc. A contributor originally designed it as an adjunct to CC386 because at the time CC386 didn't invoke the assembler, linker, etc.
Its other strong point is that it allows fine-tuned control over which MSDOS extender is used in the image file. It has been kept up to date and became a part of the OrangeC compiler package.

2.15 Run Time Library

clibs, the runtime library, is mostly work that was generated during the development of CC386. At some point, the same contributor that vetted the preprocessor went to a lot of effort to vet the runtime library, and it should be fairly close to the standard at this point. Parts of the runtime library related to MSDOS (notably the DOS extenders) are work that was performed by other authors.

3. Detailed view of C compiler

3.1 Compiler overview

- The Preprocessor (OCC\preproc)
- The Front End (OCC\parser)
- The Intermediate Language and Optimizer (OCC\middle)
- The Code Generator (OCC\x86)

The preprocessor and parser components work together to handle scanning the source files. The parser handles the C and C++ language constructs; the preprocessor handles preprocessor statements such as #define, #if, #ifdef, #pragma, etc. Basically, the preprocessor is used as a first-level pass at the code base, which transforms the code by eliminating unused if or else blocks, replacing the macros defined by the #defines with their equivalent values, and so forth. Once that is done, the parser works on the resulting transformed program to translate it into data structures. Note that while the preprocessor could be implemented as an independent pass over the source code that results in preprocessed source code, in this compiler the lexer calls the preprocessor on pretty much a line-by-line basis to perform the transformations piecemeal. In this way we can avoid having a preprocessed version of the entire source code in memory. But it is possible to use the standalone preprocessor to transform the source file into a preprocessed intermediate file. This intermediate file could then be used as if it were the original source code, e.g.
you can directly compile it as if it were the original source. In practice this is only useful if you are designing a program with a very complex set of macros and want to debug how the macros are used. There is a known caveat that a standalone preprocessor cannot 100% match the behavior the built-in preprocessor has in terms of macro replacement. After the preprocessor and parser have been used to scan the source files, a high-level representation will be in memory, forming a set of trees that completely describe the program internally. The preprocessor and parser can produce the following files: a '.i' file holding the source code after it has been preprocessed, a '.err' file holding the list of errors, and a '.lst' file holding a listing of the source code and a representation of variables and their types. The '.i' file is the same as what the standalone preprocessor would output. This high-level representation of the program is then converted to an intermediate language, which is similar to a generic assembly language. Each line in the intermediate code has operands, possibly a destination, and possibly a branch address for branches. Once the intermediate code is created, a set of optimizations is applied. After the optimizations, register allocation is performed. For the optimizations and register allocation it is necessary to have just a little guidance about the target architecture... things like the fact that on the x86 the ECX register is the only thing that can be used as a variable shift count, EDX:EAX is used as a dividend and the divisor can't be in EDX, and so forth. All this guidance comes from the backend, which has an architecture-specific configuration file detailing the architecture. After all that, a final pass is made through the intermediate code. At this point it is possible to get a file dump of the intermediate code (an .icd file) by using the /Y compiler switch.
But sometimes it is more interesting to get a dump of the intermediate code before optimization, or somewhere during the optimization process, to try to isolate where a problem lies. At this point there is no facility for doing this without modifying compiler source code. As the final pass through the intermediate code is being done, code generation routines are also called, and the binary values for code are put in an internal database. Also, code is output into an ASM file if the /S switch was specified. After all the code for all the routines is generated, code is output into the object (.o) file if the /S switch wasn't specified. The .o file will also have debug information if the /v switch was specified. The compiler also makes a browse info file (.cbr) if the IDE requests it. After all that is done, the compiler goes on to invoke any other programs that are required by the command line switches, such as a linker or resource compiler.

3.2 Source code layout

3.2.1 Preprocessor source code layout

The preprocessor source code is in \src\occ\preproc. Some of these files are shared with the C compiler to avoid code duplication. The main header file for the preprocessor is 'pp.h'. It has the structures and constants needed by the preprocessor. 'ppmain.c' is the preprocessor main line; it is responsible for parsing the command line, opening input and output files, etc. (this file is not used by the compiler). 'osutil.c' is OS-related stuff mostly used by ppmain.c (the compiler also has its own version of osutil.c). 'memory.c' is a blocked memory allocator (the compiler also has its own version of memory.c). 'symtab.c' handles hash tables for the list of preprocessor macros (the compiler also has its own version of symtab.c). 'ppexpr.c' is a recursive descent expression parser which handles constant expressions for #if et al. 'preproc.c' is the main preprocessor; it handles all the preprocessor directives.
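The blocked allocator idea behind memory.c can be sketched roughly as follows; this is a minimal arena-style allocator written purely for illustration, and the names, block size, and alignment policy here are assumptions rather than the compiler's actual ones:

```c
#include <stdlib.h>
#include <stddef.h>

/* A minimal sketch of a blocked allocator: allocations are carved from
 * large blocks and everything is released with a single call.  The
 * block size and 8-byte alignment are illustrative choices. */
#define ARENA_BLOCK_SIZE 4096

typedef struct arena_block {
    struct arena_block *next;      /* singly linked list of blocks */
    size_t used;                   /* bytes handed out from this block */
    char data[ARENA_BLOCK_SIZE];
} ArenaBlock;

typedef struct {
    ArenaBlock *head;
} Arena;

void *arena_alloc(Arena *a, size_t n)
{
    n = (n + 7) & ~(size_t)7;      /* keep allocations 8-byte aligned */
    if (n > ARENA_BLOCK_SIZE)
        return NULL;               /* oversize requests not handled here */
    if (!a->head || a->head->used + n > ARENA_BLOCK_SIZE) {
        ArenaBlock *b = calloc(1, sizeof *b);
        if (!b)
            return NULL;
        b->next = a->head;         /* new block goes on the front */
        a->head = b;
    }
    void *p = a->head->data + a->head->used;
    a->head->used += n;
    return p;
}

void arena_free_all(Arena *a)      /* one call frees every allocation */
{
    while (a->head) {
        ArenaBlock *next = a->head->next;
        free(a->head);
        a->head = next;
    }
}
```

The appeal of this scheme for a compiler is that per-allocation bookkeeping disappears: everything allocated while compiling a translation unit can be discarded in one sweep.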
Probably the most complex thing handled by preproc.c is the macro substitution algorithms. Nesting conditional directives also gets a little tricky. This file is also used by the compiler. 'ccerr.c' is an abbreviated version of the compiler error file that holds routines for displaying errors. The main structure to be aware of in the preprocessor is the INCLUDES typedef. It holds the current state of the preprocessor, as well as the progress through each file. When the preprocessor or compiler starts, an instance of this structure is made to hold the main source file. As each #include directive is encountered, another instance of this structure is created and pushed on the front of a linked list of these structures. When the compiler is done parsing a file, the file is closed and its INCLUDES structure is popped off the front of the linked list.

3.2.2 Parser source code layout

The parser files are found in src\occ\parse. The main compiler header is 'compiler.h', but most of the parser-specific structures and definitions are in the file 'c.h'. Other header files of note are 'compiler.p', which holds function prototypes, and 'ccerr.h', which holds the #defines for the over 400 errors the compiler may generate. The compiler main line is in 'ccmain.c'. As in the preprocessor, this uses a version of 'osutil.c' to parse the command line and set up lists of input files. A local version of 'memory.c' is used for memory allocations. A local version of ccerr.c holds the text for the errors the compiler generates, as well as various routines for scanning for errors and formatting them. 'config.c' is the compiler configuration, which in this directory is just defaulted to some generic values for testing. The backend has the actual version of 'config.c' compiled into the product that describes the target processor. 'list.c' is used to generate a listing file which outlines how variables are used.
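The INCLUDES discipline described above (push a state record for each #include, pop it when the file finishes) can be sketched as follows; the field and function names here are illustrative, not the compiler's actual ones:

```c
#include <stdlib.h>

/* Sketch of the include-stack discipline: each open source file gets a
 * state record pushed on the front of a linked list, and finishing a
 * file pops its record so scanning resumes in the includer. */
typedef struct includes {
    struct includes *next;   /* the enclosing file, i.e. the #includer */
    const char *fname;
    int line;                /* current line within this file */
} Includes;

Includes *inc_push(Includes *top, const char *fname)
{
    Includes *i = calloc(1, sizeof *i);
    i->fname = fname;
    i->line = 1;
    i->next = top;           /* new file goes on the front */
    return i;
}

Includes *inc_pop(Includes *top) /* file done: resume the includer */
{
    Includes *next = top->next;
    free(top);
    return next;
}
```

Because the list is only ever pushed and popped at the front, the head of the list is always the file currently being scanned.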
The main line for the actual compile process is a routine located in ccmain.c called 'compile'. It does some setup, then continually calls a routine called 'declare' in 'declare.c' to parse the source files. Finally it does various things like cleaning up, generating output, dumping in-memory initializers, and so forth. Everything in C starts from a declaration, which is why compile() is implemented that way. declare() is thus the root of a recursive descent parser... I prefer that, although I realize that many others would rather not do it that way. There are new kinds of declarations enabled by C++, for example base class handling, virtual functions, and so forth. The file 'declcpp.c' supports a lot of the C++-specific extensions. There is also another declaration file, 'declcons.c', which is used to support instantiation of constructors and calls to constructors, creation of default constructors, and so forth. Declarations are usually associated with a symbol value and can be accessed by knowing the name of the symbol. Other than the structure for the symbol itself, which holds attributes and data for the symbol, the main structure associated with declarations is a type list. The type list can be a simple type such as 'int', a const- or volatile-modified type, a simple type that has been turned into a possibly const-modified pointer or array, and so forth. A symbol is usually the result of a declaration (although some symbols are generated internally for temporaries and so forth). All symbols have an associated type list that indicates the symbol's type, stored in the database for the symbol. Symbols themselves are stored in a hash table. There are two main hash tables: the 'global' hash table, which handles module-level symbols, and the 'local' hash table, which stores lists of all the open blocks in the function currently being processed. In the local hash table each block also has a related set of symbols.
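The two-level lookup just described (the open blocks of the current function searched innermost-first, then the global table) might look roughly like this; the shapes below are simplified stand-ins for the real HASHTABLE machinery, invented for illustration:

```c
#include <string.h>
#include <stddef.h>

/* Sketch of scoped symbol lookup: search the open blocks of the
 * current function from the innermost outward, then fall back to the
 * global table.  MAX_SYMS and the flat arrays are simplifications. */
#define MAX_SYMS 16

typedef struct blockscope {
    struct blockscope *next;        /* the enclosing block */
    const char *names[MAX_SYMS];
    int count;
} BlockScope;

const char *lookup(BlockScope *locals, const char **globals, int nglobals,
                   const char *name)
{
    for (BlockScope *b = locals; b; b = b->next)   /* innermost first */
        for (int i = 0; i < b->count; i++)
            if (strcmp(b->names[i], name) == 0)
                return b->names[i];
    for (int i = 0; i < nglobals; i++)             /* then the globals */
        if (strcmp(globals[i], name) == 0)
            return globals[i];
    return NULL;
}
```

Searching innermost-first is what gives block-local declarations their shadowing behavior.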
Hash table conflicts are resolved in this system by having each bucket hold a linked list, and searching the list after finding the bucket. Some types of symbols also include a hash table. For example, function symbols hold a hash table with all the function arguments, structure name symbols include a hash table with the structure members, and namespace symbols include a hash table of all the symbols that are part of the namespace. The hash tables for symbols are stored as part of the type list. Many smaller hash tables are actually implemented as one-bucket hash tables, which essentially devolves them into linked lists. The same interface is used whether the hash table has one bucket or multiple buckets; however, the code spends a lot of time iterating through hash tables (for example function arguments or structure elements) and sometimes presumes the number of buckets to be one. At this time the only hash tables that have multiple buckets are the global hash table and the namespace hash tables. The hash table and symbol handling routines are located in 'symtab.c'; there is also some related code in 'cpplookup.c' for handling namespaces within the context of these hash tables. Note that a namespace also has an associated symbol that is itself stored somewhere in a hash table for another namespace. Starting with declare(), but really all the way through the parser, come the routines in 'lexer.c', which convert the incoming source code into tokens, such as keywords, variable names, integer constants, and so forth. The keyword list starts out as a tree that can be searched with a binary search, but as it turns out it is faster to use a hash table because of the sheer number of keywords, so early on the compiler goes through the keyword table and puts the keywords in a hash table for future lookup. A routine getsym() is called to get the next token; it might call the preprocessor to get the next line of code and perform macro substitutions on it.
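The bucket-and-chain scheme described above can be sketched as follows; note how a one-bucket table uses exactly the same interface and degenerates into a plain linked list. The names and hash function are illustrative, not the ones in symtab.c:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the symbol-table scheme: each bucket holds a linked list
 * of entries, conflicts chain through the bucket, and a one-bucket
 * table is effectively a linked list behind the same interface. */
typedef struct entry {
    struct entry *next;
    const char *name;
} Entry;

typedef struct {
    int nbuckets;                /* 1 for most tables, larger for globals */
    Entry **buckets;
} Table;

static unsigned hash(const char *s)  /* illustrative hash function */
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

Table *table_new(int nbuckets)
{
    Table *t = malloc(sizeof *t);
    t->nbuckets = nbuckets;
    t->buckets = calloc(nbuckets, sizeof *t->buckets);
    return t;
}

void table_insert(Table *t, const char *name)
{
    int b = hash(name) % t->nbuckets;
    Entry *e = malloc(sizeof *e);
    e->name = name;
    e->next = t->buckets[b];     /* conflicts chain through the bucket */
    t->buckets[b] = e;
}

Entry *table_find(Table *t, const char *name)
{
    for (Entry *e = t->buckets[hash(name) % t->nbuckets]; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}
```

With nbuckets set to 1, table_find is just a list walk, which is why code that iterates function arguments or structure members can safely presume a single bucket.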
Note that with the advent of C++ it became necessary to sometimes create a list of tokens and append them to a symbol for processing at some point in the future... this is supported by the getsym() implementation. This happens, for example, for function bodies declared within the context of a structure declaration, and for templates. This implementation also allows backtracking as necessary... tokens are stored in a structure called 'LEXEME'. Another file used for declarations is 'template.c', which is the support for C++ templates. From declarations things branch out into variable initializers and function bodies. The initializers are handled in 'init.c', which parses the initializers and also maintains a database of values that need initialization. In C most of the initialization is static and can be directly embedded into the data region of the program, but C++ introduced the concept of calling constructors and other functions as part of the initialization process. So 'init.c' also has the ability to cache and generate module-specific startup routines that will call all this code as the executable is being started (prior to main()). 'init.c' might also be called for default constructor arguments, automatic variable initializations, or automatic structure initializations (automatic refers to local variables here). The function bodies are handed off to 'stmt.c', which parses statements such as for, while, if, etc. One particular kind of statement is a 'block' statement, which itself holds the linked list of statements that make up the contents of an enclosed block. So the statement tree is a kind of unstructured tree which may have elements at the same level (statements in the same block) or elements at a lower level (an enclosed block). Functions eventually boil down to parsing some type of expression (which can be a combination of arithmetic, function calls, structure assignments, and so forth).
Expressions can either be directly coded, or they can be part of a selection statement such as an if or while clause. Initializers also involve expressions; in C these expressions are converted to a constant value after the fact, but in C++ these expressions may be coded into a module-specific startup routine as noted above. Expressions are handled by 'expr.c' for most C language constructs, and 'exprcpp.c' for the C++ extensions. An expression tree holds each expression. Expression trees are binary trees. They may be referenced, for example, from an initializer symbol, or from some statement in the statement tree that requires an expression. Most of the time expressions are run through the file 'constopt.c', which attempts to simplify any constant sub-expressions within the expression. In some cases, if the expression does not turn into a constant after constopt.c is applied, it is a compile-time error. For example, initializers must consist of constant expressions in the C language; constexpr expressions must also be constant in C++. This compiler handles inline functions both implicitly and explicitly. An example of an implicit use is when a C++ function body appears inside a structure declaration. The file 'inline.c' manages converting stmt/expression trees into an inlined form, and does various housework involved in getting an explicit version of the inline function generated as necessary. 'types.c' is used for basic type matching; it can compare one type with another and also has a routine for turning a type list into text for use in an error message. Other core compiler files include cpplookup.c, lambda.c, and rtti.c. These are used in handling various C++ constructs... for example, cpplookup.c handles the function matching routines for matching arguments to an overloaded function instance. 'lambda.c' handles C++11 lambda expressions. 'rtti.c' handles database generation for RTTI, which is used for the C++ typeid keyword and for C++ exception handling.
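The kind of simplification constopt.c performs can be illustrated with a toy version that collapses constant subtrees of a binary expression tree; the node layout, enum names, and operator set here are invented for the example and are much smaller than the real thing:

```c
#include <stdlib.h>

/* Toy constant folder over a binary expression tree: constant
 * subtrees are collapsed to constant leaves; anything involving a
 * variable is left alone. */
typedef enum { EN_ICONST, EN_VAR, EN_ADD, EN_MUL } EType;

typedef struct expr {
    EType type;
    long val;                  /* valid when type == EN_ICONST */
    struct expr *left, *right;
} Expr;

Expr *node(EType t, long v, Expr *l, Expr *r)
{
    Expr *e = calloc(1, sizeof *e);
    e->type = t;
    e->val = v;
    e->left = l;
    e->right = r;
    return e;
}

Expr *fold(Expr *e)
{
    if (!e || e->type == EN_ICONST || e->type == EN_VAR)
        return e;
    e->left = fold(e->left);   /* fold children first */
    e->right = fold(e->right);
    if (e->left->type == EN_ICONST && e->right->type == EN_ICONST) {
        long v = e->type == EN_ADD ? e->left->val + e->right->val
                                   : e->left->val * e->right->val;
        e->type = EN_ICONST;   /* collapse the subtree to a constant */
        e->val = v;
        e->left = e->right = NULL;
    }
    return e;
}
```

A caller that requires a constant (a C initializer, say) can then simply check whether the folded tree is a single constant leaf, and issue an error if it is not.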
'beinterf.c' holds various routines for getting things like variable sizes from the backend database. 'browse.c' holds the routines for creating the browse database outputs. Unlike most other output files, this has nothing to do with assembly language and is not passed to the intermediate code or backend. 'float.c' holds floating point emulation; this is so that the compiler can be run on various platforms without getting involved in the floating point vagaries of the platform (incidentally, one of the contributors put this code into the run-time library as well). The remaining files in the parser include cppbltin.c, help.c, mangle.c, unmangle.c, and output.c. 'cppbltin.c' creates various prototypes for built-in C++ functions, notably the ones for new and delete, but also a bunch of C++ helper functions called by the compiler, for example for try/catch blocks. 'help.c' is just generic helper routines, such as examiners for type lists, type list classification, and various other things. 'output.c' is helper routines for handling output to files. 'mangle.c' and 'unmangle.c' are the routines that deal with C++ name mangling. 'unmangle.c' is primarily used in generating function names during error generation. It is also copied into the linker directory, so that problems related to C++ function names can be stated in source code format. The main structures to be aware of in the parser are the LEXEME structure, the HASHTABLE and HASHREC structures, the SYMBOL and TYPE structures, the STATEMENT and EXPRESSION structures, the BLOCKDATA structure, the INITIALIZER structure, the STRUCTSYM structure, and the FUNCTIONCALL and INITLIST structures. There are a lot of other structures for more specialized operations, but these are the backbone of the parser and are used all the time throughout. As a brief summary, types are created in the TYPE structure as part of the declaration process. Almost immediately, a SYMBOL structure is created for the declaration and the TYPE is stored in it.
The symbol is then stored in a HASHTABLE; HASHREC is basically the contents of each bucket of the hash table. Each HASHREC holds a SYMBOL and the pointer to the next HASHREC. The HASHREC nodes are arranged in a linked list which represents the conflicts for that bucket. For things that only have one bucket, the HASHREC nodes are usually ordered in the order they were created, and can be traversed as a linked list by getting the first element of the single bucket. During the declaration of the symbol, it may turn out the declaration is for a function or structure. Then each parameter or structure member is parsed as a nested declaration (through recursion), and the results end up in the HASHTABLE associated with the TYPE that is associated with the higher-level declaration's SYMBOL. If the declaration turns out to be for a function with a function body, a BLOCKDATA structure is created. BLOCKDATA structures aren't permanent; they are simply used to track the progress of compiling the current (source-level) block of code. A BLOCKDATA structure will eventually get a STATEMENT tree through the parsing process... and nodes in the STATEMENT tree will sometimes get EXPRESSION trees. Sometimes EXPRESSION trees will have a function call nested in them; then the expression node associated with the function call will get a FUNCTIONCALL structure. The FUNCTIONCALL structure holds some TYPE and SYMBOL structures that may be used in processing the function call. It will also hold a linked list of INITLIST structures, each of which has a TYPE and EXPRESSION structure that describes the argument that needs to be pushed. The top-level STATEMENT structure for the function will be placed in the SYMBOL structure for the declaration of the function, where it can be retrieved later. If the symbol being declared isn't a function, it may have an initializer. The initializer will be parsed, placed in an INITIALIZER structure, and then associated with the symbol.
It might later be processed during creation of initialized data sections, or dynamic startup routines in the case of C++. Or it might be utilized more directly when generating a function body.

3.2.3 Intermediate code generation and optimization

Intermediate code generation and optimization may be found in the src\occ\middle directory. The main headers of the intermediate code generation are 'iexpr.h' and 'iopt.h'. 'iexpr.h' handles most of the structures and definitions used by the code generation, and 'iopt.h' enhances that with declarations specific to the optimizations. But there may be fields in iexpr.h which reference optimization-specific structures. A basic component of the intermediate code is an intermediate code block, which is handled in 'iblock.c'. This is not necessarily the same as a block in the source code, as new intermediate code blocks are generated any time a branch may be taken. As a block is being generated, the compiler can do a simple local optimization to reduce the amount of code duplicated within the block. The code generation is, however, started at the function body level. A function symbol is passed to 'istmt.c'. The function symbol has a statement tree, which is then parsed by istmt.c to generate intermediate code instructions. When expression trees are encountered in the statement tree, they are evaluated by 'iexpr.c'. Note that to handle inline functions, there is a kind of expression node which is actually a placeholder for another statement tree. 'istmt.c' and 'iexpr.c' eventually boil down to generating individual lines of intermediate code, for example for expressions, branches, block start and end tags, or labels. The code generation routines are in 'iblock.c', and result in a doubly-linked list of intermediate code line structures, on a per-function basis. Eventually 'iout.c' is called to generate the intermediate code (.icd file) and call the backend to generate actual machine code.
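The block-splitting rule described above (a new intermediate code block begins wherever a branch may be taken, i.e. at every label and just after every branch) can be sketched as a small routine that counts the blocks in an instruction sequence; the opcode names are invented for the example:

```c
/* Sketch of intermediate-code block boundaries: a block starts at
 * every label (a branch target) and just after every branch.  The
 * three-opcode instruction model is an illustrative simplification. */
typedef enum { OP_LABEL, OP_BRANCH, OP_OTHER } Op;

/* Returns the number of intermediate-code blocks in a sequence. */
int count_blocks(const Op *code, int n)
{
    int blocks = 0;
    int open = 0;              /* is a block currently open? */
    for (int i = 0; i < n; i++) {
        if (code[i] == OP_LABEL || !open) {
            blocks++;          /* a label, or falling in, starts a block */
            open = 1;
        }
        if (code[i] == OP_BRANCH)
            open = 0;          /* the next instruction starts a new block */
    }
    return blocks;
}
```

So straight-line code is a single block, while each branch and each branch target introduces a boundary; the real compiler additionally records head and tail instruction pointers for each block rather than just counting them.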
It is 'iout.c' that uses the dispatch table provided by the backend to call specific backend functions related to code generation. 'ilocal.c' handles basic housekeeping such as the list of temporaries used, converts local variables into temporaries, and so forth. 'iflow.c' creates the flow and dominator trees required by the optimization algorithms. 'ilive.c' handles live variable analysis for the other optimization routines. 'issa.c' handles translations to and from the SSA representation required by some of the optimization routines. 'ioptutil.c' handles various utilities for things like bit lists, Briggs sets, and so forth. 'iconfl.c' handles routines for determining when two temporaries 'conflict'; that is, they hold unique values that are independent of each other. 'ialias.c' handles determining which pointers may be pointing to the same things, which is a required part of some optimizations. 'iloop.c' handles generating a database of where loops are located and how they relate to each other. The remaining routines all have to do with optimization. 'ipeep.c' is used to do generic peephole optimizations that can be done at the intermediate code level, such as reordering of branches. 'irewrite.c' is the core routine for applying specific attributes of the target machine language to the intermediate code, as an aid in optimization. This is further extended upon in the backend code. 'iconst.c' propagates constants as far as it can to get rid of unnecessary variables. 'iinvar.c' handles moving loop-invariant expressions out of inner loops. 'ilazy.c' is the global partial redundancy analyzer. 'ireshape.c' modifies inner loop expressions that depend on outer loop variables so that the invariant part of the expression will be evaluated outside the loop. 'istren.c' modifies loops, e.g.
to simplify things like multiplying a loop index by a constant into a sequence of adds done with each iteration of the loop. 'irc.c' is the register allocator.

The main structures in the intermediate code are QUAD, IMODE, PHIDATA, TEMPINFO, BLOCKLIST, and BLOCK. As the intermediate code is generated, BLOCK structures are created. These are not the same as source-level blocks; basically, an intermediate code block is made any time a change is made in the control flow, e.g. at branch locations and branch targets. The BLOCK structures are originally in the order they were created, but eventually end up being organized into a tree of BLOCKLIST structures. The BLOCKLIST tree is used for various optimizations.

As blocks are parsed, each instruction ends up in a QUAD structure, and the variables and constants related to the instruction are placed in IMODE structures. The IMODE structures are placed into the QUAD structures, and the QUAD structures are placed in a linked list, which will eventually end up in the function declaration's SYMBOL structure. The BLOCK structures get pointers to the first and last instructions parsed for the related block.

Finally, the TEMPINFO structure holds a lot of optimizer-related information for each temporary variable that passes through the intermediate code generator. And the PHIDATA structure is used while the code is in SSA format, an alternate representation useful in some optimizations; it basically holds all the aliases a single variable may have when a block is entered.

3.2.4 Backend

The backend for the x86 processor is in src\occ\x86. It consists of files to generate the output code and the object file. Note that all of this should eventually be replaced with modules generated by the ADL program; currently it is all hand-rolled, independently of the rest of the programs in the product.
The file 'config.c' specifies a sequence of architecture-specific definitions that guide how the rest of the compiler operates, and also handles architecture-specific command line extensions and builtin preprocessor definitions for the architecture. It also specifies the dispatch table called from the intermediate code's 'iout.c' to generate sequences of assembly language code. Beyond forming the linkage between iout.c and the backend, two of its major responsibilities are the databases related to variable size and alignment, which are used by the parser, and the databases that guide the register allocator. The main header file is 'be.h'.

One file in this list, 'rewrite.c', is actually an architecture-specific extension to the intermediate code optimizer. It rewrites the intermediate code for various reasons. One reason is to insert temporaries in places where they would be required because of architecture vagaries (for example, so that a dividend can always be moved to EDX:EAX without conflicting with any other dividend that may be moved there). Another is to insert compiler helper functions in places where inlining the given operation would be too costly (such as multiplying or dividing one long long by another on a 32-bit architecture, or converting floating point numbers to integers).

Another file in this list, 'inasm.c', is actually called more or less directly from the parser (through the backend interface table) to parse assembly language instructions and convert them to the internal format.

Other than that, the interface from the intermediate code generally calls one of the atomic functions in 'gen.c'. For example, there is a function for function prologues and another for epilogues, a function for assignments, functions for branches, functions for copying or clearing blocks, functions for handling parameters, and so forth.
Basically, many of the functions in 'gen.c' correspond to the code generation for one line of the intermediate code. 'gen.c' eventually generates one or more lines of assembly code; the interface for generating each line of code is in 'peep.c'. 'peep.c' also has routines that are called after a function is generated to do peephole optimizations, some of which can be quite complex. The assembly language code is once again placed on a doubly linked list.

After all the code for a function is generated and the peephole optimizer has run, the generated list of instructions is stepped through. The instructions are passed off one by one to 'outasm.c', which either writes them as text to an assembly language file or passes them to 'outcode.c'. 'outcode.c' is basically an assembler that dumps the instructions into a list of 'sections' holding binary data, augmented by fixups and labels.

Note that among the items in the list of assembly language instructions are markers for each line of the original source code. These came from the intermediate code, where they were generated out of the statement tree. Another thing to note is that constant variable initializations and space allocations also pass through the backend dispatch table; however, they bypass gen.c and go directly to outasm.c, and then to outcode.c if necessary.

When generating an assembly language file, the code is written on the fly. Otherwise, a separate pass made after all code is generated calls 'objieeea.c', which dumps the code and data segments to the object file. As necessary, this also calls 'dbgieee.c' to insert debug information into the output file. Once any necessary output files are created, the frontend calls into 'invoke.c', which does things like invoking the resource compiler and assembler (when files of those formats are specified on the command line); the linker is then invoked with all the files resulting from the other steps.

The main structures in the x86 backend are AMODE and OCODE.
For each function, a linked list of OCODE structures is created. Each OCODE structure holds an instruction or label; operands to an instruction are held in AMODE structures stored in the related OCODE structure. At the end of code generation, the OCODE statements are traversed, and either an assembly language file is created or the instructions are assembled into memory for later dumping to an object file.

3.3 Maintenance issues

From a maintenance standpoint it isn't necessarily a good thing for the compiler to contain what is essentially a built-in assembler, and one using an entirely different code base from the standalone assembler at that, but a couple of constraints make this hard to avoid. First, the debug information relies on knowing the specific locations of code routines and variables (in relation to the actual generated code), and the intermediate code doesn't specifically support that at this point. Second, the compiler has an inline assembler which syntax-checks the incoming code.

There are a couple of possible solutions. First, the compiler backend could be migrated to use the same code as the assembler, including the modules generated by the ADL program, and inheriting the object file routines from the src\obj directory. This is the direction I planned to go, as it would also allow the ADL program to be extended to cover the compiler as well. But perhaps a simpler approach is to augment the intermediate code with debug information, move to the C++11 style of inline assembly, and have some type of post-processor program read through the intermediate code. In that case inline assembly is specified as string constants and is not parsed by the compiler itself, which has the minor drawback that syntax errors won't be caught during the compile; still, that might be a good compromise.

3.4 Intermediate code format

The intermediate code is a kind of high-level assembly language.
It is intended to be simple enough that during code generation, each intermediate code statement can be replaced with one or more assembly language instructions of the target architecture, which may sometimes involve setting up registers as well as performing the operation. Intermediate code assignments can often be replaced by one instruction, and things like an add would also likely be replaced by a single instruction. A conditional may take multiple instructions: at least one to perform the comparison and then one to perform the branch. The other reason for the intermediate code is that it enables many optimizations in a very generic way.

3.4.1 Variable names and constants

A variable can either be a temporary variable or an in-memory variable. Temporary variables are created for intermediate results that don't have a corresponding source language variable associated with them. For example, the C language specifies that the integral arguments to '+' be converted at least to int (or to the higher of the two types if one or the other is larger than int). The conversion is specified implicitly, so it doesn't have a corresponding source-level variable, and gets a temporary register during code generation. Temporary registers are used in optimization, and eventually the register allocator tries to fit them into registers. Temporary variables that can't be allocated to a register are spilled to the stack in a minimal fashion. Prior to optimization, the compiler moves all the local variables it can into temporary registers. There are some rules governing what can and can't become a temporary; for example, if the address of a variable is taken, it has to have an address in memory and can't be a temp.

Temporary variables are in the form: T#<reg>(<type>), where # is an arbitrary temporary number, <reg> is the register the allocator assigned, and <type> is one of the types enumerated in section 3.4.1.2. In-memory variables are in the form: <mangled name>:<location>.<type>,
where <mangled name> is the variable name with an underscore in front of it (in C++ the mangled name becomes much more complex), <location> is the location as specified in section 3.4.1.1, and <type> is the variable's type. (The mangled name can be a number for the ABS location.)

There is also a third type of variable, the RV variable. It has the format: RV.<type>. When on the right side of an assignment, it refers to the return value from the preceding gosub. When on the left side of an assignment, it refers to the return value of the function currently being compiled.

Constants are in the form:
- #<value>.<type> - for a decimal constant
- #$<value>.<type> - for a hexadecimal constant
- #<mangled name> - for a variable address
#
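To make the notation above concrete, a C statement like 'x = y + 3;' might be rendered in the intermediate code listing roughly as follows. This is a hypothetical sketch with invented temporary numbers, an empty register field, and 'i' standing for an integer type tag; it is not actual .icd output:

```
T0(i) = _y:0.i          ; load the in-memory variable y into a temporary
T1(i) = T0(i) + #3.i    ; add the decimal constant 3
_x:0.i = T1(i)          ; store the result back into x
```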