CSCE 531
Spring 2019
Homework 1
Due Friday February 8 at 11:59pm

Submission Guidelines

In general, all exercises that I assign throughout the course are to be submitted by CSE Dropbox (go to https://dropbox.cse.sc.edu). More detailed submission instructions are included with this handout.

You must submit this assignment by the deadline above. No late submissions will be accepted for this assignment.

Assignment

Read this entire handout before starting the assignment.

This is an individual assignment. No teams.

This exercise has three objectives: (1) to get you used to writing C code with nontrivial linked data structures; (2) to introduce you to using flex; (3) to illustrate one of the jobs of a C preprocessor.

You can access documentation for flex ("fast lex") on the web or by typing "info flex" on any of the linux boxes in our labs (or logging in remotely).

Write a flex program to handle simple #define directives in a C program. (By "simple" I mean no defining parameterized macros.) The C source input file will be either specified on the command line or read from stdin (standard input -- if no file is specified in the command line). All your normal output will go to stdout (standard output), with any error messages or warnings sent to stderr (standard error). When executed, your program should do some initializations then call yylex() only once. In other words, the flex pattern matching engine should be the main driver of your program. You have the option of making your flex program stand-alone, with all your code in one file, or you may distribute code among different C source files as well as the flex program. Your flex file name should end with ".lex".

Your program should be submitted as a gzipped tarball of a single directory named "hw1" which includes a Makefile to use for compiling your code (so that typing "make" will produce the executable). When you are ready to submit, remove all automatically generated files, go to the parent of your hw1 directory, and type
tar cvf hw1.tar hw1 gzip hw1.tar
This will produce a file named hw1.tar.gz which you then upload to dropbox.

Your executable should be called babycpp. Typing "make clean" should remove any automatically generated files, including the executable. To run babycpp, you type
./babycpp [filename]
Here, the brackets mean the argument is optional.

PLEASE NOTE: Unlike in Windows, file and command names in Linux are all case-sensitive. You must strictly adhere to the names above. Naming your executable "BABYCPP", your directory "HW1", or your submission file "HW1.tar.gz", for example, will get points taken off.

Except for flex, you must write your code in C. That is, program code with the flex program must be C code, not C++ or Java code, and the .c file that flex produces should be compiled with gcc, not g++ or any other compiler. There may be incompatibilities between flex and C++, and you all should get more practice with C anyway. Be sure to test your programs thoroughly, at the very least, on one of the linux machines in our lab.

Your submission will be tested automatically, so you must adhere strictly to the interface above. More specifically, we will run a script on one of the linux lab machines (l-1d43-17.cse.sc.edu, for example) that will automatically compile and execute your program on the various test files, capturing the output for comparison with that of the solution. If a glitch causes us to have to manually run your program, you will lose points.

Supporting files

The directory hw1 contains some files you can use to test your code, as well as some sample code you are free to crib and modify (or not) as you wish without citation (this is the only exception to the course's plagiarism policy). You can download this whole tree at once as a zip file.

You can test your solution yourself, just as we test it, by running the Perl script hw1-test.pl, found in the directory above, using the "--self-test" option. Once you download the script, you'll need to make sure you have execute permissions on the script. At the command prompt, type
chmod o+x hw1-test.pl
Then if you just type
./hw1-test.pl
with no arguments, you'll get information on how to use the self-test option.

You will also need to hand-edit the script before it works. Change the string constant on line 19 to point to the directory with the test files on your system.

Details

In a C program, a #define directive is a single line that looks like this:

#define key value

Here, key is a C-style identifier (underscores allowed anywhere), and value is some arbitrary text---usually either an identifier or a constant. For example,
#define MAX 10
This associates the key (MAX) with the value 10. Any occurrences of the key later in the program are replaced with the corresponding value verbatim. For example,
#define MAX 10 int a[MAX], i; int main() { if (i<MAX) return 0; else return MAX; }
becomes
int a[10], i; int main() { if (i<10) return 0; else return 10; }
You may assume that the only values allowed are:

integer constants (with an optional sign, which could be + or -),
C string constants (surrounded by double quotes), or
C-style identifiers.

Your flex pattern(s) must distinguish between the three possible types of value, above, and you must store them accordingly, keeping track of which type each value is (integer constant, string constant, or identifier). See the section, "Data records in the dictionary," later in this handout. (This is actually not quite how the preprocessor handles these directives. In reality, the value is always stored literally as a string, regardless of what type it looks like it should be, and key/value substitution is literal text substitution.)

Things get interesting when the value of a #define directive is an identifier. The value could then be the key of another #define directive (or even the same one). In general, when an occurrence of a key is replaced in the code by its value, and that value is another key, then the replacement happens again with the value associated with the latter key, and so on. This continues until either the replacement value is not a key or a cycle of key-value associations is entered.

Here are two examples to illustrate:

The code
#define FOO BAR #define BAR BAT #define BAT FAT #define FAT 5 int i = FOO; int main() { return BAR; }
becomes
int i = 5; int main() { return 5; }
The code
#define CIRCLE3 CIRCLE4 #define CIRCLE1 CIRCLE2 #define CIRCLE5 CIRCLE6 #define CIRCLE2 CIRCLE3 #define CIRCLE4 CIRCLE5 #define CIRCLE6 CIRCLE4 char *CIRCLE1, *CIRCLE2, *CIRCLE3, *CIRCLE4, *CIRCLE5, *CIRCLE6;
becomes
char *CIRCLE4, *CIRCLE4, *CIRCLE4, *CIRCLE4, *CIRCLE5, *CIRCLE6;

For that last one, I recommend drawing a six-vertex directed graph for CIRCLE1,...,CIRCLE6 with arrows pointing from key to value.

Your program must be able to detect and handle circular definitions as illustrated above.

Testing

The directory hw1 contains files associated with this assignment. The subdirectory "test" contains some test files for input to your flex programs. I will use these files and others to test your programs. I will add more files to this directory as time goes on.

This assignment is based on how the gcc preprocessor actually does things (with some differences), so you should check the behavior of the actual preprocessor on these test files. To see what the gcc preprocessor does on some input file test.c, run
gcc -E test.c
the result is sent to standard output and any errors/warnings are sent to standard error. You should do this with the examples above as well as the test files.

There are some discrepancies with your output and that of gcc with the -E option:

You won't generate any line number directives, that is, lines starting with `#' followed by a number.
You will remove the entire line containing the #define directive, including the newline at the end.
You will determine the type of the value and store it internally in a format appropriate to its type (i.e., integers as ints, identifiers and string constants as strings). (gcc -E does not do any of this.)

When done testing your program, type "make clean" to remove automatically generated files before you submit.

Assumptions you can make

In the list below, by "whitespace" I mean one or more tabs and/or space characters. The newline is not counted as whitespace here, because the preprocessor treats them differently from spaces and tabs.

Here is what you may assume about the input:

There won't be any other kinds of directives for you to handle other than #define directives.
Every line of code containing a #define directive will have the following syntax: optional whitespace, followed by the literal string "#define", followed by whitespace, followed by a C-style identifier (the key), followed by whitespace, followed by a single object of one of the three possible kinds described above (the value), optionally followed by whitespace, followed by the final newline character.
#define directives won't be split across multiple lines. (You could do this in C by ending each intermediate line with a backslash.)
All integer values in #define directives are within the range of a signed long int.
The literal string "#define" won't appear anywhere outside a #define directive (not even in a comment).
No identifier in the code that matches one of the keys will ever appear inside a comment or string literal (except if the string literal itself is part of a #define directive).

The upshot of the last two assumptions is that you don't need to do anything special with comments, or with string literals outside of #define directives. Just ignore them.

Assumptions you cannot make

You cannot assume any a priori limits on the number of #define directives in the input code (there could be several million). Likewise, you cannot assume any limits on the number of keys needing substitution, or on the length of identifiers (within reason). You can only assume that if you handle your data structures reasonably efficiently, you won't run out of memory.

You also should allow a key to be redefined in a subsequent directive, but issue a warning to stderr whenever this occurs.

Efficiency requirements

The last paragraph will force you to implement linked structures. For example, for the data structure containing the key-value pairs (called a dictionary), you might use either a hash table with linking (rather than open addressing) or some kind of search tree with guaranteed balancing; these include 2-3 trees and balanced binary search trees such as an AVL, red-black, or splay trees. If you use a hash table, be prepared to resize it up if the load factor becomes too great (more than about 4, say). You won't have to worry about removing any items from the dictionary, but there may be updates.

Your code should be reasonably time-efficient. This is somewhat hard to quantify, but at the very least, your code should run in time O(mn log n), where n is either the size of the input file or of the output file, whichever is larger, and m is maximum number of keys to search to replace any single identifier. (In the worst case, m = Θ(n), and so your program would run in roughly quadratic time, but in practice m is much smaller than n.)

Your program should also be space-efficient. Your memory requirements should be linear in the size of the input file -- within a reasonably modest constant factor.

Features of flex you cannot use

I am disallowing anything in flex beyond its default features described in class. In particular, do not use any flex options such as start conditions, states, REJECT features, etc. The reason is not because these options are bad. In fact, they can be very powerful -- a bit too powerful. Instead, I want you to focus only on the core features of flex. Restricting yourself to the default, core behavior will make your code more readable and portable.

Miscellaneous hints

Flex rules

The rules you use for pattern matching are somewhat flexible (sorry!). You could use a single rule to capture an entire #define directive, then determine "by hand" the kind of value by looking at its text. Alternatively, you could capture an entire #define directive in one of three separate rules, each rule matching one of the three kinds of value. Neither of these approaches isolates the key or the value as a whole match. A better alternative is to match the three components of the directive as separate matches, using global flags to keep track of your progress through the directive.

Some code you can use if you want

I included two files in the hw1 directory: my_defs.h and my_dict.c. I used this code for my solution, and some of the code is described in the next section, below. You are free to look at it, use it or not, modified or not, without attribution.

Data records in the dictionary

You will need to use variant records to store key-value pairs, since the values associated with the keys can be of different types. I'll describe variant records in lecture, but here is a sample declaration of a key-value record that may occur as part of a linked list:
typedef enum { INT_CONST, STR_CONST, ID } VAL_TYPE; typedef struct dr { struct dr *next; char *key; VAL_TYPE tag; union { int intconstval; char *strconstval; char *idval; } u; } DICT_REC, *DR;

The code above defines three named types: VAL_TYPE (an enumerated type), DICT_REC (a variant record type), and DR (a pointer to DICT_REC). The "tag" field indicates which field of the union is currently active. You dynamically allocate a new record like this:
    DR rec_ptr;     /* ... */     rec_ptr = (DR) malloc(sizeof(DICT_REC));
Then, for example, to set the record to hold the integer value 5,
    rec_ptr->tag = INT_CONST;     rec_ptr->u.intconstval = 5;
To set the record to hold an identifier stored in id_buf,
    rec_ptr->tag = ID;     rec_ptr->u.idval = id\_buf;
(But you can't store an integer value and a string in the same record at the same time.)

Handling strings in C

In C, a string is just an array of characters. There is no special string type. In flex, the text of a match is stored in the array yytext, but this array is clobbered by the next match. If you want to store the match for safe keeping, you'll need to copy it to a new location. For example,
#include <string.h> /* ... */ char *new_str; /* ... */ new_str = (char *) malloc(strlen(yytext)+1); strcpy(new_str, yytext);
You need the "+1" to make room for the null character (ASCII value 0) terminating the copied string. The integer value yyleng is also set to the length of the match, so instead of strlen(yytext) you could have just used yyleng.

To compare two strings lexicographically (which includes checking if they are equal), use strcmp().

Handling circular definitions

The time to check for circularities is when a new key-value pair is added to the dictionary, i.e., when a #define directive is processed. This is much more efficient than checking for circularities at substitution time. Starting with the new key, follow the key-value "links" forward until either the value found is not a key (in which case do nothing) or the value found is equal to the key being added. In the latter case, the new link is creating a cycle; you can then mark each node in the cycle as such, which means that it is not to be used or followed.

For example, in one of the sample pieces of code above, the line
#define CIRCLE6 CIRCLE4
creates the cycle CIRCLE6 -> CIRCLE4 -> CIRCLE5 -> CIRCLE6, at which time this entire cycle is effectively removed from future searches.

Do not remove the cycle from the dictionary, however! It is possible that one of the links on the cycle could change due to a redefinition, breaking the cycle. Run the preprocessor on the file test1.c in the test directory to see this phenomenon. When this happens, you must unmark all the nodes on the (former) cycle.

More hints (answers to FAQs)

Any code you see below you are free to copy and modify without attribution.

The hash table

Skip this if you are using something besides a hash table with linking for your dictionary.

Here is how I define my hash table:
#include <stdlib.h> /* ... */ #define INITIAL_HASH_SIZE 8 #define MAX_LOAD_FACTOR 2 #define SCALE_FACTOR 2 /* ... */ int h_size; // Current size of the hash table array int num_items; // Number of items stored in the dictionary DR *hash_tab; // The actual array (dynamically allocated) /* ... */ h_size = INITIAL_HASH_SIZE; /* ... */ hash_tab = (DR *) malloc(h_size * sizeof(DR));
Here is my home-grown, general-purpose hash function:
int hash(const char *key, int h_size) { int sum = 0; for (; *key; key++) sum = (37*sum + *key) % h_size; return sum; }
There is nothing special about 37, except that it is prime. For performance reasons, h_size should not have any factors in common with 37, which means that, because 37 is prime, it should just not be a multiple of 37.

When the load factor (the ratio of num_items to h_size) exceeds MAX_LOAD_FACTOR, I resize the hash table up by a factor of SCALE_FACTOR. That is, I allocate a new array with size h_size*SCALE_FACTOR (updating h_size accordingly), move all records from the old array into the new array, and de-allocate the old array (free(old_array)). These numbers can vary, but SCALE_FACTOR should be at least 2, and MAX_LOAD_FACTOR should not be more than, say, 5 or 6 or so.

Line numbers

You are not required to keep track of line numbers, but if you want to (for warning messages, say), you can do this easily in flex by including a rule for the newline character "\n", whose action increments a global count.

Syntax of the #define directive

To repeat, you do NOT need to handle errors in #define directives. In all the test files I'll use, all lines containing #define directives will have the syntax:
#define key value
where key is an identifier, and value is either a (possibly signed) integer constant, string constant, or identifier. Whitespace will separate the three items, and whitespace will optionally appear before and after.

Semantics of substitutions

All substitutions are made using the dictionary in its current state at the time of the substitution. So for example,
#define ID1 ID2 int i = ID1; #define ID2 ID3 int j = ID1;
becomes
int i = ID2; int j = ID3;

Standard output versus standard error

This assumes you have the #include <stdio.h> directive somewhere.

Every C program is given two output streams: standard output (stdout) and standard error (stderr). Normal output (transformed code) should be sent to stdout, while any warnings or errors should be sent to stderr. Without redirection, both streams are echoed to the terminal, so they appear interleaved with each other.

But beware! These two streams are handled differently and not synchronized. Stdout is buffered while stderr is not. That means that stderr messages go out immediately, while stdout output may be delayed in a buffer. This means that the actual order that things appear on the terminal does not necessarily reflect the order that they are generated by the program.

If you want to better synchronize things, you can flush the stdout buffer at any time by calling fflush(stdout). For example, if you do this just before sending an error message to stderr, then your normal output will appear on the terminal before the error message.

Do not send warning messages to stdout! If you do this, the warnings will be mixed in with the normal output, and so the output files won't match. Instead, send the messages to standard error using
fprintf(stderr, "Warning: ...", ...)
instead of
printf("Warning: ...", ...)
which sends the message to stdout. fprintf() is called just like printf(), but takes the stream to send the output as the first argument.

DOS newlines versus Unix/Linux newlines

In the Windows world, lines in text files end with the sequence \r\n (carriage-return, linefeed), whereas in Unix/Linux they end in \n only. Please be aware of this if you edit text files on Windows. You can hand-edit your files with emacs on Unix/Linux to remove the offending carriage returns, or you can run your file through the lexer generated by this flex program:
%% \r\n putchar('\n'); %% int main() { yylex(); return 0; }
(You compile lex.yy.c using the -lfl switch.)

The od (octal dump) command

This is a great command! Use it to inspect the contents of a file. For example,
od -c mytextfile
lists each byte of the text file in a readable way. This is especially useful to determine the presence and nature of whitespace in your file, which is often hard to do when viewing your file in some editors. Other switches give other output formats. Type
man od
for more information.

Getting started

Below are just my recommendations. Your mileage may vary.

First, run the C preprocessor (gcc -E) on the various test files and see what happens. Then look at and test the sample flex programs I made available to you (see the next section).

Implement the flex pattern matcher next, because this is probably the least familiar to you. To start with, you can have trivial actions that maybe just print the texts of the matches. This will allow you to test whether you've got your rules correct. You'll need rules for matching #define directives or their individual components, as well as a rule for finding identifiers in the rest of the code.

Next, implement the dictionary, including routines for searching for, adding, and updating an entry, as well as checking for, marking, and unmarking cycles. Please, please, PLEASE make every attempt to implement this data structure from scratch. That way, you will learn the C programming techniques that are most important for the main project to come.

Flex examples

To help you get more of a feeling for using flex, I have provided two sample flex programs, with Makefiles, used in previous classes in the subdirectories "wordcount" and "tokens". The latter was taken directly from the flex tutorial (see "info flex" above). Copy each directory to one that you own. For the tokens program, type "make", then
./tokens sample.pas
and observe the output. For the wordcount program, type "make", then
./wordcount
then type a few lines of text, ending with ctrl-D on a line by itself (ctrl-D provides an end-of-file marker for keyboard input).

When done testing, type "make clean" to remove automatically generated files before you submit.

You are free to adapt and modify either of these programs for your homework without citation.

This page was last modified Wednesday January 30, 2019 at 10:32:40 EST.

CSCE 531 Spring 2019 Homework 1 Due Friday February 8 at 11:59pm