1. Introduction

OrangeC is an optimizing compiler and tool chain. It was written partly with the intention of being a platform for cross-compiling C language code to other architectures and operating systems. Some of the features that make this possible are:

- Modular compiler which defers code generation until the last minute.
- A generic compiler backend interface that allows customization of the code generation as well as various other architecture-specific compiler parameters.
- An ADL (architecture description language) was used to specify assembler code generation, and it is planned to expand it to allow specifying the compiler code generation as well.
- The tools are mostly generic in nature, and share a common set of modules for dealing with object files.
- A linker specification file tells the linker how to interpret command line switches to target a specific operating system file format or ROM image format.
- Other linker specification files are used to specify memory layouts, in a way that is generic enough to be used to create either the layout for a specific operating system or a ROM image file for embedded systems.
- The linker is generic; a post-link program (downloader) is chosen to perform the final translation to the target file format.

That said, the current releases of OrangeC target the x86 and DOS/WIN32. It is also possible to generate ROM images with this system, if one is careful about how the RTL is used.

2. Overview

2.1 Distributed Programs

With the exception of the database package (SQLITE3), parts of the run-time library associated with MSDOS and the WIN32 headers, and lately the C++ header files, the programs in this package are new work (although most of the run-time library is inherited from a previous project, CC386).
The programs in this package are as follows:

- OCC - C compiler
- OASM - assembler
- OBRC - browse compiler (used by the IDE, enabled by compiler output files)
- OCL - shell program for running the compiler, assembler, linker, etc.; the compiler also has much of this functionality built in
- OCPP - standalone preprocessor (there are two versions of this, one using the same code as in the compiler, and the other written in C++)
- OCCPR - code completion compiler (used by the IDE, generated from the main compiler sources)
- OGREP - generic grep utility
- OIMPLIB - WIN32 import librarian
- OLIB - generic librarian
- OLINK - generic linker
- OMAKE - GNU-style make utility
- ORC - WIN32 resource compiler
- DLHEX - platform-dependent image creator - hex files
- DLLE - platform-dependent image creator - 32-bit DOS binaries (LE format)
- DLMZ - platform-dependent image creator - 16-bit DOS binaries (somewhat limited in scope)
- DLPE - platform-dependent image creator - WIN32 binaries
- DLPM - platform-dependent image creator - 32-bit DOS binaries (home-grown version)

Additional directories include:

- ADL - architecture description language compiler
- CLIBS - the C runtime library source code and standard headers
- EXAMPLES - simple example code
- EXEFMT - an include directory for format-specific header files
- HELP - the source and compiled code for the IDE help files
- OBJ - the source code for the object file format handlers (used by everything except the C compiler)

2.2 Object file format

The object file format is a touched-up form of the ASCII version of IEEE-695. This is the preferred format because it is human-readable without a dump program. Some tests were performed with a binary version of the format, but it didn't really enhance the performance. The OBJ directory has generic routines to deal with this file format that are used by everything except the compiler. The compiler currently has hand-rolled code to deal with this file format.
2.3 C Compiler

The C compiler is written in four parts: the preprocessor (occ\preproc), the front end (occ\parser), the intermediate language and optimizer (occ\middle), and the code generator (occ\x86). Additional programs built from the compiler source code exist in the occ\preproc directory (for the shipping version of the standalone preprocessor) and the occ\parse86 directory (for the code completion compiler). There is a set of structures that define the code generator in unique increments, as well as define various architecture-specific constants such as integer sizes. This will eventually be targeted by the architecture description language. Also, there is a somewhat sophisticated peephole optimizer implemented to clean up various artifacts.

2.4 Preprocessor

There are two preprocessors, one in the ocpp directory and one in the occ\preproc directory. The one in the occ\preproc directory is essentially the same preprocessor as in CC386, and is the one favored by the distribution. This preprocessor was heavily vetted by a contributor.

2.5 Assembler

The assembler (oasm) is similar in style to NASM, although no attempt has been made to make it 100% compatible with NASM. The unique feature of oasm is that the code generation code was generated by the ADL compiler, with an architectural description file (oasm\x86.adl) as input.

2.6 Linker

The linker (olink) takes as input various object files, and also accepts platform-specific files that can be used in an arbitrary way to create the output. The primary output of the linker is another object file, which can be in either relative format (useful for further linking) or absolute format (useful for final image generation). The linker can also generate an ODX file for use in symbolic debugging.
It is possible to define the values of externals on the linker command line the same way that symbols are defined on the command line for the compiler or assembler; for example, if an image may be built for different addresses, this could be controlled entirely with command line switches. The linker can take a linker specification (.spc) file that tells the linker exactly how to organize the input object files to create the output image file. The file olink\app is an additional linker configuration file which associates the /T linker switch with creation of a target image file. It associates the switch parameter with various externals, a specification file, and a 'downloader' which is responsible for creating the final target image. Each downloader has an associated specification file that tells how to lay out the code for use with the downloader. The linker typically calls the downloader as a post-link step when the /T switch is used.

2.7 Librarians

olib and oimplib are very similar; both can create libraries. oimplib is specifically used for WIN32 development, related to DLL import libraries. There is nothing remarkable about these programs other than that some of the code is shared with the linker (since the linker reads libraries).

2.8 Make utilities

omake is a make utility inspired by GNU MAKE. It is very similar to GNU MAKE in spirit and functionality, so much so that the GNU documentation could be used to learn it. However, no attempt has been made to make it 100% compatible with GNU MAKE.

2.9 File search

ogrep is a generic grep utility. There is nothing really remarkable about it.

2.10 Resource compiler

orc is the WIN32 resource compiler. It is reasonably compatible with the Microsoft WIN32 resource compiler.

2.11 Downloaders

Downloaders came about following a comment in the IEEE-695 specification which talked about the possibility of a post-link tool that would take generic linker output and transform it into an operating-system-specific image.
A downloader generally has an associated specification file, which guides the linker in creating a generic output file with enough augmentation that it can be used to generate an operating-system-specific file. Once the linker generates such a file, the downloader transforms it into the target file format. Downloaders can take arbitrary command line switches that are configured through the linker configuration file, and the linker can optionally pass files specified on its own command line to the downloader (for example, .RES files are specified on the linker command line for WIN32 images, but they aren't processed by the linker; they are processed by DLPE). The built-in downloaders are as follows:

- DLHEX - platform-dependent image creator - hex files (ROM images)
- DLLE - platform-dependent image creator - 32-bit DOS binaries (LE format)
- DLMZ - platform-dependent image creator - 16-bit DOS binaries (somewhat limited in scope)
- DLPE - platform-dependent image creator - WIN32 binaries
- DLPM - platform-dependent image creator - 32-bit DOS binaries (home-grown version)

The default specification file for DLHEX is very simple; in many cases a more complex, project-specific one may need to be created, and the default can be overridden on the linker command line.

2.12 Browse compiler

obrc is the browse compiler used by the IDE. It takes as input the .cbr files generated by the compiler, and generates the .OBR file used by the IDE.

2.13 Code Completion compiler

occpr (occ\parse86) is the code completion compiler used by the IDE. It takes as input one or more C language source files, and maintains the project .ODS file.

2.14 Compiler shell

ocl is a compiler shell. It generally sequences calling the compiler, assembler, linker, etc. A contributor originally designed it as an adjunct to CC386 because at the time CC386 didn't invoke the assembler, linker, etc.
Its other strong point is that it allows fine-tuned control over which MSDOS extender is used in the image file. It has been kept up to date and became a part of the OrangeC compiler package.

2.15 Run Time Library

clibs, the runtime library, is mostly work that was generated during the development of CC386. At some point, the same contributor that vetted the preprocessor went to a lot of effort to vet the runtime library, and it should be fairly close to the standard at this point. Parts of the runtime library related to MSDOS (notably the DOS extenders) are work that was performed by other authors.

3. Detailed view of C compiler

3.1 Compiler overview

- The Preprocessor (OCC\preproc)
- The Front End (OCC\parser)
- The Intermediate Language and Optimizer (OCC\middle)
- The Code Generator (OCC\x86)

The preprocessor and parser components work together to handle scanning the source files. The parser handles the C and C++ language constructs; the preprocessor handles preprocessor statements such as #define, #if, #ifdef, #pragma, etc. Basically, the preprocessor is used as a first-level pass at the code base, which transforms the code by eliminating unused if or else blocks, replacing the macros defined by the #defines with their equivalent values, and so forth. Once that is done, the parser works on the resulting transformed program to translate it into data structures. Note that while the preprocessor could be implemented as an independent pass over the source code that results in preprocessed source code, in this compiler the lexer calls the preprocessor on pretty much a line-by-line basis to perform the transformations piecemeal. In this way we can avoid having a preprocessed version of the entire source code in memory. But it is possible to use the standalone preprocessor to transform the source file into a preprocessed intermediate file. This intermediate file could then be used as if it were the original source code, e.g.
you can directly compile it as if it were the original source. In practice this is only useful if you are designing a program with a very complex set of macros and want to debug how the macros are used. There is a known caveat that a standalone preprocessor cannot 100% match the behavior the built-in preprocessor has in terms of macro replacement. After the preprocessor and parser have been used to scan the source files, a high-level representation will be in memory, forming a set of trees that completely describe the program internally. The preprocessor and parser can produce the following files: a '.i' file holding the source code after it has been preprocessed, a '.err' file holding the list of errors, and a '.lst' file holding a listing of the source code and a representation of variables and their types. The '.i' file is the same as what the standalone preprocessor would output. This high-level representation of the program is then converted to an intermediate language, which is similar to a generic assembly language. Each line in the intermediate code has operands, possibly a destination, and possibly a branch address for branches. Once the intermediate code is created, a set of optimizations is applied. After the optimizations, register allocation is performed. For the optimizations and register allocation it is necessary to have just a little guidance about the target architecture... things like the fact that on the x86 the ECX register is the only thing that can be used as a variable shift count, EDX:EAX is used as a dividend and the divisor can't be in EDX, and so forth. All this guidance comes from the backend, which has an architecture-specific configuration file detailing the architecture. After all that, a final pass is made through the intermediate code. At this point it is possible to get a file dump of the intermediate code (an .icd file) by using the /Y compiler switch.
But sometimes it is more interesting to get a dump of the intermediate code before optimization, or somewhere during the optimization process, to try to isolate where a problem lies. At this point there is no facility for doing this without modifying compiler source code. As the final pass through the intermediate code is being done, code generation routines are also called, and the binary values for code are put in an internal database. Also, code is output into an ASM file if the /S switch was specified. After all the code for all the routines is generated, code is output into the object (.o) file if the /S switch wasn't specified. The .o file will also have debug information if the /v switch was specified. The compiler also makes a browse info file (.cbr) if the IDE requests it. After all that is done, the compiler goes on to invoke any other programs that are required by the command line switches, such as a linker or resource compiler.

3.2 Source code layout

3.2.1 Preprocessor source code layout

The preprocessor source code is in \src\occ\preproc. Some of these files are shared with the C compiler to avoid code duplication. The main header file for the preprocessor is 'pp.h'. It has the structures and constants needed by the preprocessor. 'ppmain.c' is the preprocessor main line; it is responsible for parsing the command line, opening input and output files, etc. (this file is not used by the compiler). 'osutil.c' is OS-related stuff mostly used by ppmain.c (the compiler also has its own version of osutil.c). 'memory.c' is a blocked memory allocator (the compiler also has its own version of memory.c). 'symtab.c' handles hash tables for the list of preprocessor macros (the compiler also has its own version of symtab.c). 'ppexpr.c' is a recursive descent expression parser which handles constant expressions for #if et al. 'preproc.c' is the main preprocessor; it handles all the preprocessor directives.
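The blocked allocator idea behind memory.c can be sketched roughly as follows; this is a minimal arena-style allocator written purely for illustration, and the names, block size, and alignment policy here are assumptions rather than the compiler's actual ones:

```c
#include <stdlib.h>
#include <stddef.h>

/* A minimal sketch of a blocked allocator: allocations are carved from
 * large blocks and everything is released with a single call.  The
 * block size and 8-byte alignment are illustrative choices. */
#define ARENA_BLOCK_SIZE 4096

typedef struct arena_block {
    struct arena_block *next;      /* singly linked list of blocks */
    size_t used;                   /* bytes handed out from this block */
    char data[ARENA_BLOCK_SIZE];
} ArenaBlock;

typedef struct {
    ArenaBlock *head;
} Arena;

void *arena_alloc(Arena *a, size_t n)
{
    n = (n + 7) & ~(size_t)7;      /* keep allocations 8-byte aligned */
    if (n > ARENA_BLOCK_SIZE)
        return NULL;               /* oversize requests not handled here */
    if (!a->head || a->head->used + n > ARENA_BLOCK_SIZE) {
        ArenaBlock *b = calloc(1, sizeof *b);
        if (!b)
            return NULL;
        b->next = a->head;         /* new block goes on the front */
        a->head = b;
    }
    void *p = a->head->data + a->head->used;
    a->head->used += n;
    return p;
}

void arena_free_all(Arena *a)      /* one call frees every allocation */
{
    while (a->head) {
        ArenaBlock *next = a->head->next;
        free(a->head);
        a->head = next;
    }
}
```

The appeal of this scheme for a compiler is that per-allocation bookkeeping disappears: everything allocated while compiling a translation unit can be discarded in one sweep.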
Probably the most complex thing handled by preproc.c is the macro substitution algorithms. Nesting conditional directives also gets a little tricky. This file is also used by the compiler. 'ccerr.c' is an abbreviated version of the compiler error file that holds routines for displaying errors. The main structure to be aware of in the preprocessor is the INCLUDES typedef. It holds the current state of the preprocessor, as well as the progress through each file. When the preprocessor or compiler starts, an instance of this structure is made to hold the main source file. As each #include directive is encountered, another instance of this structure is created and pushed on the front of a linked list of these structures. When the compiler is done parsing a file, the file is closed and its INCLUDES structure is popped off the front of the linked list.

3.2.2 Parser source code layout

The parser files are found in src\occ\parse. The main compiler header is 'compiler.h', but most of the parser-specific structures and definitions are in the file 'c.h'. Other header files of note are 'compiler.p', which holds function prototypes, and 'ccerr.h', which holds the #defines for the over 400 errors the compiler may generate. The compiler main line is in 'ccmain.c'. As in the preprocessor, this uses a version of 'osutil.c' to parse the command line and set up lists of input files. A local version of 'memory.c' is used for memory allocations. A local version of ccerr.c holds the text for the errors the compiler generates, as well as various routines for scanning for errors and formatting them. 'config.c' is the compiler configuration, which in this directory is just defaulted to some generic values for testing. The backend has the actual version of 'config.c' compiled into the product that describes the target processor. 'list.c' is used to generate a listing file which outlines how variables are used.
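The INCLUDES discipline described above (push a state record for each #include, pop it when the file finishes) can be sketched as follows; the field and function names here are illustrative, not the compiler's actual ones:

```c
#include <stdlib.h>

/* Sketch of the include-stack discipline: each open source file gets a
 * state record pushed on the front of a linked list, and finishing a
 * file pops its record so scanning resumes in the includer. */
typedef struct includes {
    struct includes *next;   /* the enclosing file, i.e. the #includer */
    const char *fname;
    int line;                /* current line within this file */
} Includes;

Includes *inc_push(Includes *top, const char *fname)
{
    Includes *i = calloc(1, sizeof *i);
    i->fname = fname;
    i->line = 1;
    i->next = top;           /* new file goes on the front */
    return i;
}

Includes *inc_pop(Includes *top) /* file done: resume the includer */
{
    Includes *next = top->next;
    free(top);
    return next;
}
```

Because the list is only ever pushed and popped at the front, the head of the list is always the file currently being scanned.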
The main line for the actual compile process is a routine located in ccmain.c called 'compile'. It does some setup, then continually calls a routine called 'declare' in 'declare.c' to parse the source files. Finally it does various things like cleaning up, generating output, dumping in-memory initializers, and so forth. Everything in C starts from a declaration, which is why compile() is implemented that way. declare() is thus the root of a recursive descent parser... I prefer that, although I realize that many others would rather not do it that way. There are new kinds of declarations enabled by C++, for example base class handling, virtual functions, and so forth. The file 'declcpp.c' supports a lot of the C++-specific extensions. There is also another declaration file, 'declcons.c', which is used to support instantiation of constructors and calls to constructors, creation of default constructors, and so forth. Declarations are usually associated with a symbol value and can be accessed by knowing the name of the symbol. Other than the structure for the symbol itself, which holds attributes and data for the symbol, the main structure associated with declarations is a type list. The type list can be a simple type such as 'int', a const- or volatile-modified type, a simple type that has been turned into a possibly const-modified pointer or array, and so forth. A symbol is usually the result of a declaration (although some symbols are generated internally for temporaries and so forth). All symbols have an associated type list that indicates the symbol's type, stored in the database for the symbol. Symbols themselves are stored in a hash table. There are two main hash tables: the 'global' hash table, which handles module-level symbols, and the 'local' hash table, which stores lists of all the open blocks in the function currently being processed. In the local hash table each block also has a related set of symbols.
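The two-level lookup just described (the open blocks of the current function searched innermost-first, then the global table) might look roughly like this; the shapes below are simplified stand-ins for the real HASHTABLE machinery, invented for illustration:

```c
#include <string.h>
#include <stddef.h>

/* Sketch of scoped symbol lookup: search the open blocks of the
 * current function from the innermost outward, then fall back to the
 * global table.  MAX_SYMS and the flat arrays are simplifications. */
#define MAX_SYMS 16

typedef struct blockscope {
    struct blockscope *next;        /* the enclosing block */
    const char *names[MAX_SYMS];
    int count;
} BlockScope;

const char *lookup(BlockScope *locals, const char **globals, int nglobals,
                   const char *name)
{
    for (BlockScope *b = locals; b; b = b->next)   /* innermost first */
        for (int i = 0; i < b->count; i++)
            if (strcmp(b->names[i], name) == 0)
                return b->names[i];
    for (int i = 0; i < nglobals; i++)             /* then the globals */
        if (strcmp(globals[i], name) == 0)
            return globals[i];
    return NULL;
}
```

Searching innermost-first is what gives block-local declarations their shadowing behavior.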
Hash table conflicts are resolved in this system by having each bucket hold a linked list, and searching the list after finding the bucket. Some types of symbols also include a hash table. For example, function symbols hold a hash table with all the function arguments, structure name symbols include a hash table with the structure members, and namespace symbols include a hash table of all the symbols that are part of the namespace. The hash tables for symbols are stored as part of the type list. Many smaller hash tables are actually implemented as one-bucket hash tables, which essentially devolves them into linked lists. The same interface is used whether the hash table has one bucket or multiple buckets; however, the code spends a lot of time iterating through hash tables (for example function arguments or structure elements) and sometimes presumes the number of buckets to be one. At this time the only hash tables that have multiple buckets are the global hash table and the namespace hash tables. The hash table and symbol handling routines are located in 'symtab.c'; there is also some related code in 'cpplookup.c' for handling namespaces within the context of these hash tables. Note that a namespace also has an associated symbol that is itself stored somewhere in a hash table for another namespace. Starting with declare(), but really all the way through the parser, come the routines in 'lexer.c', which convert the incoming source code into tokens, such as keywords, variable names, integer constants, and so forth. The keyword list starts out as a tree that can be searched with a binary search, but as it turns out it is faster to use a hash table because of the sheer number of keywords, so early on the compiler goes through the keyword table and puts the keywords in a hash table for future lookup. A routine getsym() is called to get the next token; it might call the preprocessor to get the next line of code and perform macro substitutions on it.
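The bucket-and-chain scheme described above can be sketched as follows; note how a one-bucket table uses exactly the same interface and degenerates into a plain linked list. The names and hash function are illustrative, not the ones in symtab.c:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the symbol-table scheme: each bucket holds a linked list
 * of entries, conflicts chain through the bucket, and a one-bucket
 * table is effectively a linked list behind the same interface. */
typedef struct entry {
    struct entry *next;
    const char *name;
} Entry;

typedef struct {
    int nbuckets;                /* 1 for most tables, larger for globals */
    Entry **buckets;
} Table;

static unsigned hash(const char *s)  /* illustrative hash function */
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

Table *table_new(int nbuckets)
{
    Table *t = malloc(sizeof *t);
    t->nbuckets = nbuckets;
    t->buckets = calloc(nbuckets, sizeof *t->buckets);
    return t;
}

void table_insert(Table *t, const char *name)
{
    int b = hash(name) % t->nbuckets;
    Entry *e = malloc(sizeof *e);
    e->name = name;
    e->next = t->buckets[b];     /* conflicts chain through the bucket */
    t->buckets[b] = e;
}

Entry *table_find(Table *t, const char *name)
{
    for (Entry *e = t->buckets[hash(name) % t->nbuckets]; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}
```

With nbuckets set to 1, table_find is just a list walk, which is why code that iterates function arguments or structure members can safely presume a single bucket.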
Note that with the advent of C++ it became necessary to sometimes create a list of tokens and append them to a symbol for processing at some point in the future... this is supported by the getsym() implementation. This happens, for example, for function bodies declared within the context of a structure declaration, and for templates. This implementation also allows backtracking as necessary... tokens are stored in a structure called 'LEXEME'. Another file used for declarations is 'template.c', which is the support for C++ templates. From declarations things branch out into variable initializers and function bodies. The initializers are handled in 'init.c', which parses the initializers and also maintains a database of values that need initialization. In C most of the initialization is static and can be directly embedded into the data region of the program, but C++ introduced the concept of calling constructors and other functions as part of the initialization process. So 'init.c' also has the ability to cache and generate module-specific startup routines that will call all this code as the executable is being started (prior to main()). 'init.c' might also be called for default constructor arguments, automatic variable initializations, or automatic structure initializations (automatic refers to local variables here). The function bodies are handed off to 'stmt.c', which parses statements such as for, while, if, etc. One particular kind of statement is a 'block' statement, which itself holds the linked list of statements that make up the contents of an enclosed block. So the statement tree is a kind of unstructured tree which may have elements at the same level (statements in the same block) or elements at a lower level (an enclosed block). Functions eventually boil down to parsing some type of expression (which can be a combination of arithmetic, function calls, structure assignments, and so forth).
Expressions can either be directly coded, or they can be part of a selection statement such as an if or while clause. Initializers also involve expressions; in C these expressions are converted to a constant value after the fact, but in C++ these expressions may be coded into a module-specific startup routine as noted above. Expressions are handled by 'expr.c' for most C language constructs, and 'exprcpp.c' for the C++ extensions. An expression tree holds each expression. Expression trees are binary trees. They may be referenced, for example, from an initializer symbol, or from some statement in the statement tree that requires an expression. Most of the time expressions are run through the file 'constopt.c', which attempts to simplify any constant sub-expressions within the expression. In some cases, if the expression does not turn into a constant after constopt.c is applied, it is a compile-time error. For example, initializers must consist of constant expressions in the C language; constexpr expressions must also be constant in C++. This compiler handles inline functions both implicitly and explicitly. An example of an implicit use is when a C++ function body appears inside a structure declaration. The file 'inline.c' manages converting stmt/expression trees into an inlined form, and does various housework involved in getting an explicit version of the inline function generated as necessary. 'types.c' is used for basic type matching; it can compare one type with another and also has a routine for turning a type list into text for use in an error message. Other core compiler files include cpplookup.c, lambda.c, and rtti.c. These are used in handling various C++ constructs... for example, cpplookup.c handles the function matching routines for matching arguments to an overloaded function instance. 'lambda.c' handles C++11 lambda expressions. 'rtti.c' handles database generation for RTTI, which is used for the C++ typeid keyword and for C++ exception handling.
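The kind of simplification constopt.c performs can be illustrated with a toy version that collapses constant subtrees of a binary expression tree; the node layout, enum names, and operator set here are invented for the example and are much smaller than the real thing:

```c
#include <stdlib.h>

/* Toy constant folder over a binary expression tree: constant
 * subtrees are collapsed to constant leaves; anything involving a
 * variable is left alone. */
typedef enum { EN_ICONST, EN_VAR, EN_ADD, EN_MUL } EType;

typedef struct expr {
    EType type;
    long val;                  /* valid when type == EN_ICONST */
    struct expr *left, *right;
} Expr;

Expr *node(EType t, long v, Expr *l, Expr *r)
{
    Expr *e = calloc(1, sizeof *e);
    e->type = t;
    e->val = v;
    e->left = l;
    e->right = r;
    return e;
}

Expr *fold(Expr *e)
{
    if (!e || e->type == EN_ICONST || e->type == EN_VAR)
        return e;
    e->left = fold(e->left);   /* fold children first */
    e->right = fold(e->right);
    if (e->left->type == EN_ICONST && e->right->type == EN_ICONST) {
        long v = e->type == EN_ADD ? e->left->val + e->right->val
                                   : e->left->val * e->right->val;
        e->type = EN_ICONST;   /* collapse the subtree to a constant */
        e->val = v;
        e->left = e->right = NULL;
    }
    return e;
}
```

A caller that requires a constant (a C initializer, say) can then simply check whether the folded tree is a single constant leaf, and issue an error if it is not.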
'beinterf.c' holds various routines for getting things like variable sizes from the backend database. 'browse.c' holds the routines for creating the browse database outputs. Unlike most other output files, this has nothing to do with assembly language and is not passed to the intermediate code or backend. 'float.c' holds floating point emulation; this is so that the compiler can be run on various platforms without getting involved in the floating point vagaries of the platform (incidentally, one of the contributors put this code into the run-time library as well). The remaining files in the parser include cppbltin.c, help.c, mangle.c, unmangle.c, and output.c. 'cppbltin.c' creates various prototypes for built-in C++ functions, notably the ones for new and delete, but also a bunch of C++ helper functions called by the compiler, for example for try/catch blocks. 'help.c' is just generic helper routines, such as examiners for type lists, type list classification, and various other things. 'output.c' is helper routines for handling output to files. 'mangle.c' and 'unmangle.c' are the routines that deal with C++ name mangling. 'unmangle.c' is primarily used in generating function names during error generation. It is also copied into the linker directory, so that problems related to C++ function names can be stated in source code format. The main structures to be aware of in the parser are the LEXEME structure, the HASHTABLE and HASHREC structures, the SYMBOL and TYPE structures, the STATEMENT and EXPRESSION structures, the BLOCKDATA structure, the INITIALIZER structure, the STRUCTSYM structure, and the FUNCTIONCALL and INITLIST structures. There are a lot of other structures for more specialized operations, but these are the backbone of the parser and are used all the time throughout. As a brief summary, types are created in the TYPE structure as part of the declaration process. Almost immediately, a SYMBOL structure is created for the declaration and the TYPE is stored in it.
The symbol is then stored in a HASHTABLE; HASHREC is basically the contents of each bucket of the hash table. Each HASHREC holds a SYMBOL and the pointer to the next HASHREC. The HASHREC nodes are arranged in a linked list which represents the conflicts for that bucket. For things that only have one bucket, the HASHREC nodes are usually ordered in the order they were created, and can be traversed as a linked list by getting the first element of the single bucket. During the declaration of the symbol, it may turn out the declaration is for a function or structure. Then each parameter or structure member is parsed as a nested declaration (through recursion), and the results end up in the HASHTABLE associated with the TYPE that is associated with the higher-level declaration's SYMBOL. If the declaration turns out to be for a function with a function body, a BLOCKDATA structure is created. BLOCKDATA structures aren't permanent; they are simply used to track the progress of compiling the current (source-level) block of code. A BLOCKDATA structure will eventually get a STATEMENT tree through the parsing process... and nodes in the STATEMENT tree will sometimes get EXPRESSION trees. Sometimes EXPRESSION trees will have a function call nested in them; then the expression node associated with the function call will get a FUNCTIONCALL structure. The FUNCTIONCALL structure holds some TYPE and SYMBOL structures that may be used in processing the function call. It will also hold a linked list of INITLIST structures, each of which has a TYPE and EXPRESSION structure that describes the argument that needs to be pushed. The top-level STATEMENT structure for the function will be placed in the SYMBOL structure for the declaration of the function, where it can be retrieved later. If the symbol being declared isn't a function, it may have an initializer. The initializer will be parsed, placed in an INITIALIZER structure, and then associated with the symbol.
It might later be processed during creation of initialized data sections, or dynamic startup routines in the case of C++. Or it might be utilized more directly when generating a function body.

3.2.3 Intermediate code generation and optimization

Intermediate code generation and optimization may be found in the src\occ\middle directory. The main headers of the intermediate code generation are 'iexpr.h' and 'iopt.h'. 'iexpr.h' handles most of the structures and definitions used by the code generation, and 'iopt.h' enhances that with declarations specific to the optimizations. But there may be fields in iexpr.h which reference optimization-specific structures. A basic component of the intermediate code is an intermediate code block, which is handled in 'iblock.c'. This is not necessarily the same as a block in the source code, as new intermediate code blocks are generated any time a branch may be taken. As a block is being generated, the compiler can do a simple local optimization to reduce the amount of code duplicated within the block. The code generation is, however, started at the function body level. A function symbol is passed to 'istmt.c'. The function symbol has a statement tree, which is then parsed by istmt.c to generate intermediate code instructions. When expression trees are encountered in the statement tree, they are evaluated by 'iexpr.c'. Note that to handle inline functions, there is a kind of expression node which is actually a placeholder for another statement tree. 'istmt.c' and 'iexpr.c' eventually boil down to generating individual lines of intermediate code, for example for expressions, branches, block start and end tags, or labels. The code generation routines are in 'iblock.c', and result in a doubly-linked list of intermediate code line structures, on a per-function basis. Eventually 'iout.c' is called to generate the intermediate code (.icd file) and call the backend to generate actual machine code.
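The block-splitting rule described above (a new intermediate code block begins wherever a branch may be taken, i.e. at every label and just after every branch) can be sketched as a small routine that counts the blocks in an instruction sequence; the opcode names are invented for the example:

```c
/* Sketch of intermediate-code block boundaries: a block starts at
 * every label (a branch target) and just after every branch.  The
 * three-opcode instruction model is an illustrative simplification. */
typedef enum { OP_LABEL, OP_BRANCH, OP_OTHER } Op;

/* Returns the number of intermediate-code blocks in a sequence. */
int count_blocks(const Op *code, int n)
{
    int blocks = 0;
    int open = 0;              /* is a block currently open? */
    for (int i = 0; i < n; i++) {
        if (code[i] == OP_LABEL || !open) {
            blocks++;          /* a label, or falling in, starts a block */
            open = 1;
        }
        if (code[i] == OP_BRANCH)
            open = 0;          /* the next instruction starts a new block */
    }
    return blocks;
}
```

So straight-line code is a single block, while each branch and each branch target introduces a boundary; the real compiler additionally records head and tail instruction pointers for each block rather than just counting them.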
It is 'iout.c' that uses the dispatch table provided by the backend to call specific backend functions related to code generation. 'ilocal.c' handles basic housekeeping such as the list of temporaries used, converts local variables into temporaries, and so forth. 'iflow.c' creates the flow and dominator trees required by the optimization algorithms. 'ilive.c' handles live variable analysis for the other optimization routines. 'issa.c' handles translations to and from the SSA representation required by some of the optimization routines. 'ioptutil.c' handles various utilities for things like bit lists, Briggs sets, and so forth. 'iconfl.c' handles routines for determining when two temporaries 'conflict'; that is, they hold unique values that are independent of each other. 'ialias.c' handles determining which pointers may be pointing to the same things, which is a required part of some optimizations. 'iloop.c' handles generating a database of where loops are located and how they relate to each other. The remaining routines all have to do with optimization. 'ipeep.c' is used to do generic peephole optimizations that can be done at the intermediate code level, such as reordering of branches. 'irewrite.c' is the core routine for applying specific attributes of the target machine language to the intermediate code, as an aid in optimization. This is further extended upon in the backend code. 'iconst.c' propagates constants as far as it can to get rid of unnecessary variables. 'iinvar.c' handles moving loop-invariant expressions out of inner loops. 'ilazy.c' is the global partial redundancy analyzer. 'ireshape.c' modifies inner loop expressions that depend on outer loop variables so that the invariant part of the expression will be evaluated outside the loop. 'istren.c' modifies loops, e.g.
to simplify things like multiplying a loop index by a constant into a sequence of adds done with each iteration of the loop. 'irc.c' is the register allocator.

The main structures in the intermediate code are QUAD, IMODE, PHIDATA, TEMPINFO, BLOCKLIST, and BLOCK. As the intermediate code is generated, BLOCK structures are created. These are not the same as source-level blocks; basically, an intermediate code block is made any time a change is made in the control flow, e.g. at branch locations and branch targets. The BLOCK structures are originally in the order they were created, but eventually end up being organized into a tree of BLOCKLIST structures. The BLOCKLIST tree is used for various optimizations.

As blocks are parsed, each instruction ends up in a QUAD structure, and the variables and constants related to the instruction are placed in IMODE structures. The IMODE structures are placed into the QUAD structures, and the QUAD structures are placed in a linked list, which will eventually end up in the function declaration's SYMBOL structure. The BLOCK structures get pointers to the first and last instructions parsed for the related block.

Finally, the TEMPINFO structure holds a lot of optimizer-related information for each temporary variable that passes through the intermediate code generator. And the PHIDATA structure is used while the code is in SSA format, an alternate representation useful in some optimizations; it basically holds all the aliases a single variable may have when a block is entered.

3.2.4 Backend

The backend for the x86 processor is in src\occ\x86. It consists of files to generate the output code and the object file. Note that all of this should eventually be replaced with modules generated by the ADL program; currently it is all hand-rolled, independently of the rest of the programs in the product.
The file 'config.c' specifies a sequence of architecture-specific definitions that guide how the rest of the compiler operates, and also handles architecture-specific command line extensions and builtin preprocessor definitions for the architecture. It also specifies the dispatch table called from the intermediate code's 'iout.c' to generate sequences of assembly language code. Beyond forming the linkage between iout.c and the backend, two of its major responsibilities are the databases related to variable size and alignment, which are used by the parser, and the databases that guide the register allocator. The main header file is 'be.h'.

One file in this list, 'rewrite.c', is actually an architecture-specific extension to the intermediate code optimizer. It rewrites the intermediate code for various reasons. One reason is to insert temporaries in places where they would be required because of architecture vagaries (for example, so that a dividend can always be moved to EDX:EAX without conflicting with any other dividend that may be moved there). Another is to insert compiler helper functions in places where inlining the given operation would be too costly (such as multiplying or dividing one long long by another on a 32-bit architecture, or converting floating point numbers to integers).

Another file in this list, 'inasm.c', is actually called more or less directly from the parser (through the backend interface table) to parse assembly language instructions and convert them to the internal format.

Other than that, the interface from the intermediate code generally calls one of the atomic functions in 'gen.c'. For example, there is a function for function prologues and another for epilogues, a function for assignments, functions for branches, functions for copying or clearing blocks, functions for handling parameters, and so forth.
Basically, many of the functions in 'gen.c' correspond to the code generation for one line of the intermediate code. 'gen.c' eventually generates one or more lines of assembly code; the interface for generating each line of code is in 'peep.c'. 'peep.c' also has routines that are called after a function is generated to do peephole optimizations, some of which can be quite complex. The assembly language code is once again placed on a doubly linked list.

After all the code for a function is generated and the peephole optimizer has run, the generated list of instructions is stepped through. The instructions are passed off one by one to 'outasm.c', which either writes them as text to an assembly language file or passes them to 'outcode.c'. 'outcode.c' is basically an assembler that dumps the instructions into a list of 'sections' holding binary data, augmented by fixups and labels.

Note that among the items in the list of assembly language instructions are markers for each line of the original source code. These came from the intermediate code, where they were generated out of the statement tree. Another thing to note is that constant variable initializations and space allocations also pass through the backend dispatch table; however, they bypass gen.c and go directly to outasm.c, and then to outcode.c if necessary.

When generating an assembly language file, the code is written on the fly. Otherwise, a separate pass made after all code is generated calls 'objieeea.c', which dumps the code and data segments to the object file. As necessary, this also calls 'dbgieee.c' to insert debug information into the output file. Once any necessary output files are created, the frontend calls into 'invoke.c', which does things like invoking the resource compiler and assembler (when files of those formats are specified on the command line); the linker is then invoked with all the files resulting from the other steps.

The main structures in the x86 backend are AMODE and OCODE.
For each function, a linked list of OCODE structures is created. Each OCODE structure holds an instruction or label; operands to an instruction are held in AMODE structures stored in the related OCODE structure. At the end of code generation, the OCODE statements are traversed, and either an assembly language file is created or the instructions are assembled into memory for later dumping to an object file.

3.3 Maintenance issues

From a maintenance standpoint it isn't necessarily a good thing for the compiler to contain what is essentially a built-in assembler, and one using an entirely different code base from the standalone assembler at that, but a couple of constraints make this hard to avoid. First, the debug information relies on knowing the specific locations of code routines and variables (in relation to the actual generated code), and the intermediate code doesn't specifically support that at this point. Second, the compiler has an inline assembler which syntax-checks the incoming code.

There are a couple of possible solutions. First, the compiler backend could be migrated to use the same code as the assembler, including the modules generated by the ADL program, and inheriting the object file routines from the src\obj directory. This is the direction I planned to go, as it would also allow the ADL program to be extended to cover the compiler as well. But perhaps a simpler approach is to augment the intermediate code with debug information, move to the C++11 style of inline assembly, and have some type of post-processor program read through the intermediate code. In that case inline assembly is specified as string constants and is not parsed by the compiler itself, which has the minor drawback that syntax errors won't be caught during the compile; still, that might be a good compromise.

3.4 Intermediate code format

The intermediate code is a kind of high-level assembly language.
It is intended to be simple enough that during code generation, each intermediate code statement can be replaced with one or more assembly language instructions of the target architecture, which may sometimes involve setting up registers as well as performing the operation. Intermediate code assignments can often be replaced by one instruction, and things like an add would also likely be replaced by a single instruction. A conditional may take multiple instructions: at least one to perform the comparison and then one to perform the branch. The other reason for the intermediate code is that it enables many optimizations in a very generic way.

3.4.1 Variable names and constants

A variable can either be a temporary variable or an in-memory variable. Temporary variables are created for intermediate results that don't have a corresponding source language variable associated with them. For example, the C language specifies that the integral arguments to '+' be converted at least to int (or to the higher of the two types if one or the other is larger than int). The conversion is specified implicitly, so it doesn't have a corresponding source-level variable, and gets a temporary register during code generation. Temporary registers are used in optimization, and eventually the register allocator tries to fit them into registers. Temporary variables that can't be allocated to a register are spilled to the stack in a minimal fashion. Prior to optimization, the compiler moves all the local variables it can into temporary registers. There are some rules governing what can and can't become a temporary; for example, if the address of a variable is taken, it has to have an address in memory and can't be a temp.

Temporary variables are in the form: T#<reg>(<type>), where # is an arbitrary temporary number, <reg> is the register the allocator assigned, and <type> is one of the types enumerated in section 3.4.1.2. In-memory variables are in the form: <mangled name>:<location>.<type>,
where <mangled name> is the variable name with an underscore in front of it (in C++ the mangled name becomes much more complex), <location> is the location as specified in section 3.4.1.1, and <type> is the variable's type. (The mangled name can be a number for the ABS location.)

There is also a third type of variable, the RV variable. It has the format: RV.<type>. When on the right side of an assignment, it refers to the return value from the preceding gosub. When on the left side of an assignment, it refers to the return value of the function currently being compiled.

Constants are in the form:
- #<value>.<type> - for a decimal constant
- #$<value>.<type> - for a hexadecimal constant
- #<mangled name> - for a variable address
#
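To make the notation above concrete, a C statement like 'x = y + 3;' might be rendered in the intermediate code listing roughly as follows. This is a hypothetical sketch with invented temporary numbers, an empty register field, and 'i' standing for an integer type tag; it is not actual .icd output:

```
T0(i) = _y:0.i          ; load the in-memory variable y into a temporary
T1(i) = T0(i) + #3.i    ; add the decimal constant 3
_x:0.i = T1(i)          ; store the result back into x
```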