CSCE 531
Summer 2019
Project Part I
Due Thursday, March 28, 2019 (extended 48 hours)

Process C global variable declarations. This involves both installing the declarations into the symbol table and allocating memory for the variables in the assembly language output file. Also, after all declarations have been processed, you should dump the symbol table (using st_dump() from symtab.h); to do this, run your executable with the "-d" or "--dump" option as a command line argument.

Your compiler should read C source code from stdin and write the x86 assembly language output to stdout. Your compiler executable should be called pcc3. You will not have to emit assembly code explicitly, but rather call appropriate routines in the back end (backend-x86.c and backend-x86.h). Besides altering the gram.y file, put syntax tree-building functions into a new file tree.c, with definitions for export in tree.h. Put code-generating routines into a new file encode.c, with definitions for export in encode.h. With few exceptions throughout the project, all backend routines are called from encode.c (some may be called directly from the grammar). No backend routines should be called from tree.c, hence you will not need to include backend-x86.h in tree.c.

The scores given below are for graduate students. Undergraduates get a 10% boost overall.

To receive 80% of the credit: You must be able to process the following basic type specifiers: int, char, float, and double. You may limit the syntax so that only one type specifier may be given per declaration. You must also be able to handle pointer and array type modifiers. You may limit the syntax so that array dimensions must always be given. You may assume the dimension given will always be an unsigned integer constant. Each declaration should include an identifier (id). If not, an error should be issued. A symbol table entry should be made for each id. The entry should indicate the type of the declaration. Routines for building and analyzing types are in the types module (types.h) and bucket module (bucket.h), and routines for manipulating the symbol table are in the symbol table module (symtab.h). You are required to use these modules, but you are not allowed to modify them. For more on these and the other modules, see the Resources section, below.
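As a rough illustration of the input this level covers, the declarations below use exactly one type specifier each, combined with pointer and array modifiers whose dimensions are always given. The variable names are made up for this sketch and are not taken from the official test files.

```c
/* Hypothetical 80%-level input: one type specifier per declaration,
   pointer and array modifiers, explicit unsigned array dimensions. */
int count;
char letter;
float ratio;
double total;

int *count_ptr;          /* pointer to int */
char **names;            /* pointer to pointer to char */
double samples[10];      /* array of 10 doubles */
int matrix[3][4];        /* array of 3 arrays of 4 ints */
float *ratio_ptrs[5];    /* array of 5 pointers to float */
int (*row_ptr)[8];       /* pointer to an array of 8 ints */
```

Each of these ids should end up with its own symbol table entry whose type reflects the modifiers read from the inside out, e.g. matrix is an array of 3 arrays of 4 ints.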

To receive 90% of the credit: In addition to obtaining the 80% level, you should also allow multiple type specifiers per declaration. You should handle the additional specifiers signed, unsigned, short, and long. You should add the necessary semantic checks and error messages to support multiple type specifiers (e.g., short short, unsigned double, et cetera are illegal). You should also add the function type modifier. You should add the necessary semantic checks and error messages to support function modifiers (it is illegal for a function to return a function, for example). Only "old style" functions need to be supported at this level, that is, with no parameter list between parentheses.
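The following sketch shows the sort of declarations this level adds: several specifiers in one declaration and "old style" function modifiers with empty parentheses. The names are invented for illustration. The illegal combinations appear only in comments so that the fragment itself remains valid C.

```c
/* Hypothetical 90%-level input: multiple type specifiers per declaration. */
unsigned int flags;
long int offset;
unsigned long big_count;
signed char small_signed;
short int half_word;
unsigned short int port;

/* "Old style" function declarations: nothing between the parentheses. */
int next_token();
char *lookup_name();
int (*handler)();        /* pointer to a function returning int */

/* Declarations like these should produce semantic errors instead:
 *   short short x;          (repeated specifier)
 *   unsigned double y;      (incompatible specifiers)
 *   int f()();              (function returning a function)
 */
```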

To receive 100% of the credit: In addition to obtaining the 90% level, you should also allow parameters in function declarations. You should insist that each parameter declaration includes an id (else semantic error). The possible parameter types are the same as described in the previous levels, including pointers, arrays, and functions. You should also support the void return type for a function. A parameter may be a reference parameter, e.g.,
int f(int& a); void g(int (&a)[5]);
This is the only aspect of the language that is not part of C. You can assume that any "&" appears only once in a parameter declaration, and only modifies the complete parameter type (so for example, you will never see int h(int&* a);). You can also assume that any parameter of function type has no parameter declarations of its own (you will only see "old style" function types as parameters).

The semantic errors you should check for at this level are that each parameter declaration must include an id, and that the same id should not appear more than once in the same parameter list.
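One simple way to catch a repeated id is to walk the parameters collected so far before appending a new one. The sketch below uses a stand-in struct param_node and a helper param_is_duplicate(), both hypothetical; the project's real PARAM_LIST type is declared in the provided modules and its fields may differ.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one parameter in a list; this is NOT the
   project's PARAM_LIST, whose actual fields may differ. */
struct param_node {
    const char *id;              /* parameter name, or NULL if none was given */
    struct param_node *next;     /* next parameter, or NULL at the end */
};

/* Returns 1 if `id` already names a parameter in `list`, otherwise 0. */
int param_is_duplicate(const struct param_node *list, const char *id)
{
    for (; list != NULL; list = list->next) {
        if (list->id != NULL && strcmp(list->id, id) == 0)
            return 1;
    }
    return 0;
}
```

Calling such a check just before adding each parameter lets the compiler report the error and then carry on, for instance by dropping the second occurrence.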

To receive 110% of the credit (that is, 10% extra credit): In addition to obtaining the 100% level, you should also be capable of processing initializers. You may assume that the initializing expressions will only be unsigned constants. You should support initializations of arrays, including multidimensional arrays. For multidimensional arrays "the brace-enclosed list of initializers should match the structure of the variable being initialized" (to quote Harbison & Steele, "C: A Reference Manual"). Arrays may be incompletely initialized; fill remaining slots with zeros. You do not have to support the initialization of arrays with string literals. You also may assume that pointers will only be initialized to zero. Be sure to consider semantic errors: wrong number of initializers, wrong type, etc.
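The fragment below illustrates the initializer forms described above, using made-up names: scalar initializers, an incompletely initialized array, a multidimensional array whose braces follow the array's structure, and a pointer initialized to zero. Erroneous forms are shown only in a comment.

```c
/* Hypothetical 110%-level input: initializers are unsigned constants. */
int limit = 5;
char grade = 65;

int partial[4] = {1, 2};               /* partial[2] and partial[3] become 0 */
int grid[2][3] = {{1, 2, 3}, {4}};     /* grid[1][1] and grid[1][2] become 0 */
char *cursor = 0;                      /* pointers are only initialized to zero */

/* An initializer like this should produce a semantic error instead:
 *   int pair[2] = {1, 2, 3};          (too many initializers)
 */
```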

The x86 (actually 32-bit i386) assembly code to be emitted for this assignment is generated automatically by calling functions in backend-x86.c, which I will discuss briefly in class.

At all levels you are responsible for detecting duplicate declarations. At the 100% level, you must also detect duplicate declarations in parameter lists.

Your compiler should be capable of detecting multiple semantic errors in one file. You can make arbitrary decisions about how to proceed when errors occur (for instance, with a duplicate declaration you might decide to ignore the second declaration). The important point is to do something so you can proceed (without causing a later segmentation fault during compilation).

You may allow the compiler to stop processing with the first syntax error. A syntax error is defined with respect to the distributed grammar (gram.y, see next paragraph).

Resources

The file proj_src.zip unzips to a directory proj_src containing the base files for the project. The base files include:

a header file with some general definitions defs.h,
a complete lexical scanner scan.l,
a skeleton yacc/bison parser gram.y,
a symbol table module symtab.[hc],
a type module types.[hc],
a bucket module (for handling type specifiers) bucket.[hc],
a message-generating module message.[hc],
a sample main routine main.c,
backend routines for spitting out assembly code backend-x86.[hc],
a few miscellaneous utility routines utils.c, and
a sample Makefile.

The lexical scanner will strip out comments, but it does not handle any of the C preprocessor commands (#include, #define, #ifdef, et cetera). You will not be expected to handle preprocessor commands, anyway.

Do not alter the module files (backend-x86.*, symtab.*, types.*, bucket.*, message.*, main.c). If you feel the need to alter one of these files, then there is a problem, either with your code or with ours. If you think there is a bug in the code we gave you, then you are probably wrong, but please let us know anyway. If necessary, we will issue updates in a timely fashion.

You should copy the files in proj_src into a sibling directory named proj1, where you develop your code.

The file proj1test.zip unzips to a directory proj1test that contains files for testing. These test files are C source (.c) files with names starting with the letter "T". Generally, test file names follow a regular pattern:
"T"[1-4]"L"[0-9]+"_"(err|ok)".c"
The digit after the "T" indicates the project installment. The number between the "L" and the underscore indicates which level is being tested. The text between the underscore and the period indicates whether or not the file contains errors. The "err" files are used for testing your compiler's error reporting, and the "ok" files are to test your compiler's actual translation of well-formed C code. The .s and .err files are the "officially correct" outputs of the compiler to stdout and stderr, respectively, and are used for comparisons when running the test script (see below).
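To make the naming pattern concrete, here is a small sketch that splits a test file name into its three parts. The function parse_test_name() is hypothetical and is not part of the distributed files or the grading script; it only mirrors the pattern given above.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 if `name` matches "T"[1-4]"L"[0-9]+"_"(err|ok)".c", else 0.
   The three output parameters are written only when the name matches. */
int parse_test_name(const char *name, int *installment, int *level, int *is_err)
{
    const char *p = name;
    int inst, lvl = 0, err;

    if (*p++ != 'T') return 0;
    if (*p < '1' || *p > '4') return 0;        /* project installment digit */
    inst = *p++ - '0';
    if (*p++ != 'L') return 0;
    if (!isdigit((unsigned char)*p)) return 0; /* need at least one level digit */
    while (isdigit((unsigned char)*p))
        lvl = lvl * 10 + (*p++ - '0');
    if (*p++ != '_') return 0;
    if (strcmp(p, "err.c") == 0) err = 1;      /* file contains errors */
    else if (strcmp(p, "ok.c") == 0) err = 0;  /* well-formed C code */
    else return 0;

    *installment = inst;
    *level = lvl;
    *is_err = err;
    return 1;
}
```

For example, the name T1L80_ok.c would be read as installment 1, level 80, with no errors expected.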

These are not necessarily the only test files that will be used when grading, so do your own testing too.

The proj_src and proj1test directories should be subdirectories of the same parent. Do not nest one inside the other.

The proj1test directory also contains the file "pcc3", a working executable solution to the entire project. With one exception (see below), your compiler's output (both to stdout and stderr) must match the output of the solution pcc3 on the same test file. Also in the proj1test directory is the Perl script proj1-test.pl, which will be used for grading. You may run it yourself with the "--self-test" option, but beforehand, you will need to hand-edit the script near the top to point to the common parent directory of proj1 and proj1test.

The 80% level functionality will be needed in order to do later parts of the project, so be sure you at least get that much of the assignment completed.

Grading

As with previous assignments, we will grade in a mostly automated fashion, using the Perl script proj1-test.pl. This script first attempts to compile your compiler by executing the "make" command. If the make succeeds, then it will execute your compiler on each of the test files (redirected to stdin), capturing your compiler's output to stdout and stderr separately, and comparing your output with that of the official solution. Running this script yourself is your best determination of how you will score. Generally, anything short of files matching exactly will cause points to be subtracted. However, there are three important things to consider:

As mentioned above, your program's behavior on test files ending in "_err.c" will be graded only on how it handles errors; the normal assembly output and symbol table dump will be ignored.
For test files ending in "_ok.c", your program should not generate any semantic error messages (because there aren't any semantic errors in these files). However (just for this installment), the output to stderr contains a dump of the symbol table, which lets us know whether you are installing symbols correctly. So this output will also be compared with the solution.
In this installment only, the script will make no attempt to assemble, link, or execute your compiler's assembly language output, because you are not yet compiling executable C programs. This will be attempted in future installments, however.

The directory also includes a file comments.txt which is the comments file produced when I run the script on my own solution. If your comments.txt file looks like this one, then that is a good sign.

The solution executable pcc3 reads C code from standard input, writes assembly code to standard output, and error messages to standard error. If given the "-d" or "--dump" option, it also dumps the symbol table to stderr at the end of compilation. (If given the "--dump-all" option, it will also dump the symbol table after every batch of local declarations, as well as at the end of compilation. This will be useful when testing future installments of the project.) You may do whatever you wish with this program (it may be useful to run it on tiny C programs to see what it produces).

Platform

The official platform for your compiler development is the Linux machines in our department (e.g., l-1d43-01.cse.sc.edu, l-1d43-02.cse.sc.edu, and the like). You may develop code on another platform (GNU/Linux/Unix-like is heavily preferred; I strongly recommend against using Windows), but you must make sure your program ultimately compiles and runs correctly on the official platform, because the testing script proj1-test.pl will evaluate your compiler on this platform when grading.

WARNING: Porting your code from one platform to another can be an unexpectedly time-consuming task. You should NEVER wait until the last minute to do this. If you develop on a separate platform, you should test your code frequently on the official platform to guard against unpleasant surprises. There will be no extra consideration for projects submitted late because of porting issues (e.g., wrong version of gcc, wrong include directories, etc.). You have been warned.

Suggestions for division of labor within a team

This is merely a suggestion. The first thing you should do as a team is to explore and understand the base code given to you, as well as getting a better feel for the C language and its syntax. All team members should do this, but if the team has more than one person, it may help if, say, one team member studies the grammar (gram.y) while another looks at the symbol table module (reading the comments in symtab.h), while another concentrates on the types module, etc., each reporting to the other team members.

A multiperson team should meet regularly -- at least once or twice a week, or if the class is during the summer, every day. Set up a schedule of meetings as soon as possible. Each team member should contribute substantially to the coding effort, and should also understand her or his teammates' contributions as well.

Submission

To receive full credit for the assignment, your team must submit it via CSE Dropbox (Moodle) no later than 11:59 p.m. on the due date. Late submissions will be accepted with penalties described in the syllabus up to one week late. There should be only one submission per team; each team should designate one of its members to submit on behalf of the team. Any number of resubmissions are allowed up to the final deadline, and only the last submission will be graded. This will be true for future project installments as well.

You must turn in all source files (even the ones we gave you) and a Makefile for your compiler.

To turn in this assignment, follow these steps exactly. Any deviation from these instructions will get points taken off.

make sure all your source files and Makefile are in the directory "proj1" as described above (do not include the test files); make sure that typing "make -B" in this directory successfully creates your executable,
run "make clean" and remove any other non-program-source files from the directory,
create a tarball of the directory in its parent (run "tar cvf proj1.tar proj1" from the parent directory),
gzip it ("gzip proj1.tar").
Upload the single file proj1.tar.gz to CSE Dropbox. (If the Dropbox server is down or otherwise not available, you should instead attach this file to an email sent directly to the TA before the deadline; please cc me as well.)

The test script will check for extraneous files in your directory and we will subtract points if we find them. Thus the "make clean" step is crucial. Check your directory by hand just in case. Also note that on Linux systems, file and directory names are case-sensitive; having a different case in any names above will get points taken off.

Nota bene ...

You get credit for features successfully implemented. You do not get credit for attempting to do something; you get credit for the things that you can successfully demonstrate work.

Work on and test your system incrementally and back up your system frequently, especially when the due time is approaching! Too many times in the past, a student made seemingly minor code changes to try to improve a stable system, only to find that the altered system crashed completely and was useless. They didn't back up the old system, and they didn't have time to undo the changes before the project was due. They weren't even sure they could remember what the changes were. FAIR WARNING: don't let this happen to you; you will not be given any leniency if this happens.

As always, you are expected to do your own work on this assignment, although this time, a team counts as a "single person".

Finally: you should adequately document and structure your program. Remember, you or your teammates may be called upon to explain this program orally during a subsequent quiz.

FAQs

This list will probably be updated in the coming days in response to student queries.

How do I run the executable solution? Go to your copy of the directory with the executable and test files and type
./pcc3
or, if you want the symbol table dump, type "./pcc3 -d" or "./pcc3 --dump". Then you can start typing in a C program at the keyboard (end with ctrl-D on a new, empty line), seeing the output as you go. To run the executable on one of the test programs as input, type, for example,
./pcc3 < T1L80_ok.c
How do I capture both output streams (stdout and stderr) into separate files (for example, so that I can get the symbol table dump in a file)? In the bash shell, type
./pcc3 -d < T1L80_ok.c > T1L80_ok.s 2> T1L80_ok.err
This puts the standard output (assembly code) into T1L80_ok.s and standard error (the symbol table dump) into T1L80_ok.err. If you are not already running the bash shell, you need to type "bash" beforehand and "exit" afterwards.
The grammar is huge! Do I have to deal with the whole thing? No, not hardly. For this installment, you will only deal with a small subset of the grammar. Many of the required actions are just the default actions.
I've been giving some grammar symbols some types and now I'm getting lots of type conflicts, even in parts of the grammar that I haven't touched. Why is this and what can I do about it? Remember that unless you supply an explicit action for a production, it defaults to the action $$ = $1;, which will cause a type conflict unless $$ and $1 have the same type. The easiest way to suppress the default action is to supply an explicit empty action {}.
The grammar is still pretty labyrinthine, and it is hard to tell how the C source is being parsed. Is there a way to see what the parser is doing at each step? In defs.h uncomment the compiler directive
#define YYDEBUG 1
Then run a make. This will make the parser display a complete trace during a parse. Now run
./pcc3
with no redirections. When prompted, type in a C program one token per line, and notice the results. You can ignore the lines starting with "Stack now ..." and "Entering state ..."; they are just the parser's internal bookkeeping. The parts starting with "Reducing stack by rule ... (line ...)" show exactly which production the parser is reducing by and when. Tracing the parse will be even more useful in Project II. Just remember to re-comment that line in defs.h before submitting.
Do I ever need to modify the actual grammar, i.e., alter/add productions? No, not for this installment, at least. Later, yes.
I just edited some code to fix an error, but after re-running, the error is still there, as if I hadn't done anything. WTF? Do you have all the dependencies in your Makefile that you should? If not, your edited source might not be recompiling when you run make.
Besides annotating the grammar, should we modify any of the other base code you gave us? No, with only one exception: if you include in the %union declaration types you've defined in other .h files, you should also include these .h files before including y.tab.h in scan.l.
There doesn't seem to be much to put into tree.c. What's up with that? That depends on how much code you want to include with the grammar. Most of my actions consist of a single function call, so I put these function definitions in tree.c, with prototypes in tree.h. That is stylistically better than putting long reams of code in grammar actions.
How do we issue error messages? Use the error() function in message.h. You call it like you would call printf(), except the final newline is always included. The functions warning() and bug() are called the same way. The bug() function aborts the compilation after it is called; use it to report internal bugs in your code.
How do I install a new variable name into the symbol table? More generally, what do the tags and fields of the ST_DR record mean? When you are ready to install a new variable into the symbol table, you create a new ST_DR (Symbol Table Data Record; use stdr_alloc()), fill in the fields yourself appropriately, then pass it to st_install().
Do I need to install parameters of procedures/functions into the symbol table? No, not right now. But you do need to put the parameters into a PARAM_LIST that you build up yourself, and include that when building the function type (ty_build_func()). You need to check that there are no duplicate names on one list. (In Part II, you will install parameters into the symbol table in the local scope of a procedure/function definition.)
HELP!!! It's 3:00 in the morning, I've been fiddling with the code for hours and hours now, and I'm still getting a segfault/weird behavior/NULL pointer, etc.... Take three deep breaths, save all your work, log off, and go to bed.
Okay, I did that, and now it's the next day, and I'm still having problems. Unlike Java, the C language will--happily and without warning--allow you to shoot yourself in the foot, so to speak. Chances are you trashed some memory doing something, and it didn't show up until way later in some completely unrelated part of the program. In the short term, if you haven't done it already, pepper your code with statements like
  message("I am about to call such-and-such with param value %d\n",...);
or
  msg("After %d for-loop iterations, x == %d",...);
etc. These and other printf-like output functions are defined in message.h and output to stderr, which is unbuffered, so the message is output promptly, even if stdout is redirected. At the very least, these statements will help you pin down exactly when the segmentation fault occurs. For the long term, you should be including lots of internal checks in your code -- for example,
  if (what_I_expected_here_is_not_true)
    bug("I did not expect this value for such-and-such: %d", ...);
The bug() function is also defined in message.h and is extremely useful. It halts your program with your message preceded by an input line number. For example, whenever you use a "->" operator, are you sure the pointer operand is properly initialized and non-NULL? If not, check it with a call to bug() beforehand.

After you fix the errors, be sure to remove or comment out all the calls to message(), msg(), and msgn() you added to track down the errors. You can and should keep the calls to bug() in the code permanently, however.
How do I use a debugging tool like gdb? I have no idea. In the past, I have found debugging tools generally not worth the effort to learn, so I do not know how to use them. Instead, I follow the advice of the previous FAQ.

This file was last modified Wednesday March 27, 2019 at 14:26:50 EDT.
