To create a compiler that integrates into GCC, you need to create many
files. Some of the files are integrated into the main GCC makefile, to
build the various parts of the compiler and to run the test
suite. Others are incorporated into various GCC programs such as
gcc.c. Finally you must provide the actual programs comprising your
compiler.
The files are:
-
COPYING. This is the copyright file, assuming you are going to use the
GNU General Public Licence. You probably need to use the GPL because if
you use the GCC back end your program and the back end are one program,
and the back end is GPLed.
This need not be present if the language is incorporated into the main
GCC tree, as the main gcc directory has this file.
-
COPYING.LIB. This is the copyright file for those parts of your program
that are not to be covered by the GPL, but are instead to be covered by
the LGPL (Library or Lesser GPL). This licence may be appropriate for
the library routines associated with your compiler. These are the
routines that are linked with the output of the compiler. Using
the LGPL for these programs allows programs written using your compiler
to be closed source. For example LIBC is under the LGPL.
This need not be present if the language is incorporated into the main
GCC tree, as the main gcc directory has this file.
-
ChangeLog. Record all the changes to your compiler. Use the same format
as used in treelang as it is supported by an emacs editing mode and is
part of the FSF coding standard. Normally each directory has its own
changelog. The FSF standard allows but does not require a meaningful
comment on why the changes were made, above and beyond what changes
were made. In the author's opinion it is useful to provide this
information.
-
root.texi. Macros for use in the main manual eg email addresses etc.
-
treelang.texi. The manual, written in texinfo. Your manual would have a
different file name. You need not write it in texinfo if you don't want
to, but a lot of GNU software does use texinfo.
-
Make-lang.in. This file is part of the make file which is incorporated
with the GCC make file skeleton (Makefile.in in the gcc directory) to
make Makefile, as part of the configuration process.
Makefile in turn is the main instruction to actually build
everything. The build instructions are held in the main gcc manual and
web site so they are not repeated here.
There are some comments at the top which will help you understand what
you need to do.
There are make commands to build things, remove generated files with
various degrees of thoroughness, count the lines of code (so you know
how much progress you are making), build info and html files from the
texinfo source, run the tests etc.
-
README. Just a brief informative text file saying what is in this
directory.
-
config-lang.in. This file is read by the configuration process and must
be present. You specify the name of your language, the name(s) of the
compiler(s) including preprocessors you are going to build, and whether
any files, usually generated ones, should be excluded from diffs (ie when
making diff files to send in patches). It is not known whether the
'stagestuff' setting is actually used.
-
lang-options. This file is included into gcc.c, the main gcc driver, and
tells it what options your language supports. It appears that this is
only used to display help, but this has not been confirmed.
-
lang-specs. This file is also included in gcc.c. It tells gcc.c when to
call your programs and what options to send them. The mini-language
'specs' is documented in the source of gcc.c. Do not attempt to write a
specs file from scratch - use an existing one as the base and enhance
it.
-
Your texi files. Texinfo can be used to build documentation in HTML,
info, dvi and postscript formats. It is a tagged language, is documented
in its own manual, and has its own emacs mode.
-
Your programs. The relationships between all the programs are explained
in the next section. You need to write or use the following programs:
-
lexer. This breaks the input into words and passes these to the
parser. This is lex.l in treelang, which is passed through flex, a lex
variant, to produce the C code lex.c. Note there is a school of thought that
says real men hand code their own lexers, however you may prefer to
write far less code and use flex, as was done with treelang.
-
parser. This breaks the program into recognizable constructs such as
expressions, statements etc. This is parse.y in treelang, which is
passed through bison, which is a yacc variant, to produce the C code parse.c.
-
back end interface. This interfaces to the code generation back end. In
treelang, this is tree1.c which mainly interfaces to toplev.c and
treetree.c which mainly interfaces to everything else. Many languages
mix up the back end interface with the parser, as in the C compiler for
example. It is a matter of taste which way to do it, but with treelang
it is separated out to make the back end interface cleaner and easier to
understand.
-
header files. For function prototypes and common data items. One point
to note here is that bison can generate a header file with all the
numbers it has assigned to the keywords and symbols, and you can include
the same header in your lexer. This technique is demonstrated in
treelang.
-
compiler main file. gcc comes with a program toplev.c which is a
perfectly serviceable main program for your compiler. treelang uses
toplev.c but other languages have been known to replace it with their
own main program. Again this is a matter of taste and how much code you
want to write.
The GCC compiler consists of a driver, which then executes the various
compiler phases based on the instructions in the specs files.
Typically a program's language will be identified from its suffix (eg
.tree for treelang programs).
The driver (gcc.c) will then drive (exec) in turn a preprocessor, the main
compiler, the assembler and the link editor. gcc options allow you to
override all of this. In the case of treelang programs there is no
preprocessor, and mostly these days the C preprocessor is run within the
main C compiler apparently for reasons of speed.
You will be using the standard assembler and linkage editor so these are
ignored from now on.
You have to write your own preprocessor if you want one. This is usually
totally language specific. The main point to be aware of is to ensure
that you find some way to pass file name and line number information
through to the main compiler so that it can tell the back end this
information and so the debugger can find the right source line for each
piece of code. That is all there is to say about the preprocessor except
that the preprocessor will probably not be the slowest part of the
compiler and will probably not use the most memory so don't waste too
much time tuning it until you know you need to do so.
The main compiler for treelang consists of toplev.c from the main GCC
compiler, the parser, lexer and back end interface routines, and the
back end routines themselves, of which there are many.
toplev.c does a lot of work for you and you should seriously consider
whether you want to reinvent it. It is quite possible to reuse it, as in
the case of treelang.
Writing this code is the hard part of creating a compiler using GCC. The
back end interface documentation is incomplete and the interface is
complex.
There are three main aspects to interfacing to the other gcc code.
Interfacing to toplev.c. In treelang this is handled mainly in tree1.c
and partly in treetree.c. Peruse toplev.c for details of what you need
to do.
Interfacing to the garbage collection. In treelang this is mainly in
tree1.c.
Memory allocation in the compiler should be done using the ggc_alloc and
kindred routines in ggc*.*. At the end of every function, toplev.c calls
the garbage collection several times. The garbage collection calls mark
routines which go through the memory which is still used, telling the
garbage collection not to free it. Then all the memory not used is
freed.
What this means is that you need a way to hook into this marking
process. This is done by calling ggc_add_root. This provides the address
of a callback routine which will be called during garbage collection and
which can call ggc_mark to save the storage. If storage is only
used within the parsing of a function, you do not need to provide a way
to mark it.
Note that you can also call ggc_mark_tree to mark any of the back end
internal 'tree' nodes. This routine will follow the branches of the
trees and mark all the subordinate structures. This is useful for
example when you have created a variable declaration that will be used
across multiple functions, or for a function declaration (from a
prototype) that may be used later on. See the next item for more on the
tree nodes.
Interfacing to the code generation back end. In treelang this is done in
treetree.c. A typedef called 'tree', which is defined in tree.h and
tree.def in the gcc directory and largely implemented in tree.c and
stmt.c, forms the basic interface to the compiler back end.
In general you call various tree routines to generate code, either
directly or through toplev.c. You build up data structures and
expressions in similar ways.
You can read some documentation on this which can be found via the gcc
main web page. In particular, the documentation produced by Joachim
Nadler and translated by Tim Josling can be quite useful. The C compiler
also has documentation in the main GCC manual (particularly the current
CVS version) which is useful on a lot of the details.
In time it is hoped to enhance this document to provide a more
comprehensive overview of this topic. The main gap is in explaining how
it all works together.
-
TAGS: Use the make ETAGS commands to create TAGS files which can be used in
emacs to jump to any symbol quickly.
-
GREP: grep is also a useful way to find all uses of a symbol.
-
TREE: The main files to look at are tree.h and tree.def. You will
probably want a hardcopy of these.
-
SAMPLE: look at the sample interfacing code in treetree.c. You can use
gdb to trace through the code and learn about how it all works.
-
GDB: the GCC back end works well with gdb. It traps abort() and allows
you to trace back what went wrong.
-
Error Checking: The compiler back end does some error and consistency
checking. Often the result of an error is just no code being
generated. You will then need to trace through and find out what is
going wrong. The rtl dump files can help here also.
-
rtl dump files: The main compiler documents these files which are dumps
of the rtl (intermediate code) which is manipulated during the code
generation process. This can provide useful clues about what is going
wrong. The rtl 'language' is documented in the main GCC manual.