In general, all exercises that I assign throughout the course are to be submitted by CSE Dropbox (go to https://dropbox.cse.sc.edu). More detailed submission instructions are included with this handout.
You must submit this assignment by the deadline above. No late submissions will be accepted for this assignment.
Read this entire handout before starting the assignment.
This is an individual assignment. No teams.
This exercise has three objectives: (1) to get you used to writing C code with nontrivial linked data structures; (2) to introduce you to using flex; (3) to illustrate one of the jobs of a C preprocessor.
You can access documentation for flex ("fast lex") on the web or by typing "info flex" on any of the linux boxes in our labs (or logging in remotely).
Write a flex program to handle simple #define directives in a C program. (By "simple" I mean no defining parameterized macros.) The C source input file will be either specified on the command line or read from stdin (standard input -- if no file is specified in the command line). All your normal output will go to stdout (standard output), with any error messages or warnings sent to stderr (standard error). When executed, your program should do some initializations then call yylex() only once. In other words, the flex pattern matching engine should be the main driver of your program. You have the option of making your flex program stand-alone, with all your code in one file, or you may distribute code among different C source files as well as the flex program. Your flex file name should end with ".lex".
Your program should be submitted as a gzipped tarball of
a single directory named "hw1" which includes a Makefile to use for
compiling your code (so that typing "make" will produce the
executable). When you are ready to submit, remove all automatically
generated files, go to the parent of your hw1 directory, and type
tar cvf hw1.tar hw1
gzip hw1.tar
This will produce a file named hw1.tar.gz which you then upload to
dropbox.
Your executable should be called babycpp. Typing "make clean" should
remove any automatically generated files, including the executable.
To run babycpp, you type
./babycpp [filename]
Here, the brackets mean the argument is optional.
PLEASE NOTE: Unlike in Windows, file and command names in Linux are all case-sensitive. You must strictly adhere to the names above. Naming your executable "BABYCPP", your directory "HW1", or your submission file "HW1.tar.gz", for example, will get points taken off.
Except for flex, you must write your code in C. That is, program code with the flex program must be C code, not C++ or Java code, and the .c file that flex produces should be compiled with gcc, not g++ or any other compiler. There may be incompatibilities between flex and C++, and you all should get more practice with C anyway. Be sure to test your programs thoroughly, at the very least, on one of the linux machines in our lab.
Your submission will be tested automatically, so you must adhere strictly to the interface above. More specifically, we will run a script on one of the linux lab machines (l-1d43-17.cse.sc.edu, for example) that will automatically compile and execute your program on the various test files, capturing the output for comparison with that of the solution. If a glitch causes us to have to manually run your program, you will lose points.
The directory hw1 contains some files you can use to test your code, as well as some sample code you are free to crib and modify (or not) as you wish without citation (this is the only exception to the course's plagiarism policy). You can download this whole tree at once as a zip file.
You can test your solution yourself, just as we test it, by running the
Perl script hw1-test.pl, found in the directory above, using the
"--self-test" option. Once you download
the script, you'll need to make sure you have execute permissions on
the script. At the command prompt, type
chmod o+x hw1-test.pl
Then if you just type
./hw1-test.pl
with no arguments, you'll get information on how to use the self-test
option.
You will also need to hand-edit the script before it works. Change the string constant on line 19 to point to the directory with the test files on your system.
In a C program, a #define directive is a single line that looks like this:
#define key value
Here,
#define MAX 10
This associates the key (MAX) with the value 10. Any occurrences of
the key later in the program are replaced with the corresponding value
verbatim. For example,
#define MAX 10
int a[MAX], i;
int main()
{
if (i<MAX)
return 0;
else
return MAX;
}
becomes
int a[10], i;
int main()
{
if (i<10)
return 0;
else
return 10;
}
You may assume that the only values allowed are:
Things get interesting when the value of a #define directive is an identifier. The value could then be the key of another #define directive (or even the same one). In general, when an occurrence of a key is replaced in the code by its value, and that value is another key, then the replacement happens again with the value associated with the latter key, and so on. This continues until either the replacement value is not a key or a cycle of key-value associations is entered.
Here are two examples to illustrate:
Your program must be able to detect and handle circular definitions as illustrated above.
The directory hw1 contains files associated with this assignment. The subdirectory "test" contains some test files for input to your flex programs. I will use these files and others to test your programs. I will add more files to this directory as time goes on.
This assignment is based on how the gcc preprocessor actually does
things (with some differences), so you should check the behavior of
the actual preprocessor on these test files. To see what the gcc
preprocessor does on some input file test.c, run
gcc -E test.c
the result is sent to standard output and any errors/warnings are sent
to standard error. You should do this with the examples above as well
as the test files.
There are some discrepancies with your output and that of gcc with the -E option:
When done testing your program, type "make clean" to remove automatically generated files before you submit.
In the list below, by "whitespace" I mean one or more tabs and/or space characters. The newline is not counted as whitespace here, because the preprocessor treats them differently from spaces and tabs.
Here is what you may assume about the input:
You cannot assume any
You also should allow a key to be redefined in a subsequent directive, but issue a warning to stderr whenever this occurs.
The last paragraph will force you to implement linked structures. For example, for the data structure containing the key-value pairs (called a dictionary), you might use either a hash table with linking (rather than open addressing) or some kind of search tree with guaranteed balancing; these include 2-3 trees and balanced binary search trees such as an AVL, red-black, or splay trees. If you use a hash table, be prepared to resize it up if the load factor becomes too great (more than about 4, say). You won't have to worry about removing any items from the dictionary, but there may be updates.
Your code should be reasonably time-efficient. This is somewhat hard to quantify, but at the very least, your code should run in time O(mn log n), where n is either the size of the input file or of the output file, whichever is larger, and m is maximum number of keys to search to replace any single identifier. (In the worst case, m = Θ(n), and so your program would run in roughly quadratic time, but in practice m is much smaller than n.)
Your program should also be space-efficient. Your memory requirements should be linear in the size of the input file -- within a reasonably modest constant factor.
I am disallowing anything in flex beyond its default features described in class. In particular, do not use any flex options such as start conditions, states, REJECT features, etc. The reason is not because these options are bad. In fact, they can be very powerful -- a bit too powerful. Instead, I want you to focus only on the core features of flex. Restricting yourself to the default, core behavior will make your code more readable and portable.
The rules you use for pattern matching are somewhat flexible (sorry!). You could use a single rule to capture an entire #define directive, then determine "by hand" the kind of value by looking at its text. Alternatively, you could capture an entire #define directive in one of three separate rules, each rule matching one of the three kinds of value. Neither of these approaches isolates the key or the value as a whole match. A better alternative is to match the three components of the directive as separate matches, using global flags to keep track of your progress through the directive.
I included two files in the hw1 directory: my_defs.h and my_dict.c. I used this code for my solution, and some of the code is described in the next section, below. You are free to look at it, use it or not, modified or not, without attribution.
You will need to use variant records to store key-value pairs, since
the values associated with the keys can be of different types. I'll
describe variant records in lecture, but here is a sample declaration
of a key-value record that may occur as part of a linked list:
typedef enum { INT_CONST, STR_CONST, ID } VAL_TYPE;
typedef struct dr {
struct dr *next;
char *key;
VAL_TYPE tag;
union {
int intconstval;
char *strconstval;
char *idval;
} u;
}
DICT_REC, *DR;
The code above defines three named types: VAL_TYPE (an enumerated
type), DICT_REC (a variant record type), and DR (a pointer to
DICT_REC). The "tag" field indicates which field of the union is
currently active. You dynamically allocate a new record like this:
DR rec_ptr;
/* ... */
rec_ptr = (DR) malloc(sizeof(DICT_REC));
Then, for example, to set the record to hold the integer value 5,
rec_ptr->tag = INT_CONST;
rec_ptr->u.intconstval = 5;
To set the record to hold an identifier stored in id_buf,
rec_ptr->tag = ID;
rec_ptr->u.idval = id\_buf;
(But you can't store an integer value and a string in the same record
at the same time.)
In C, a string is just an array of characters. There is no special
string type. In flex, the text of a match is stored in the array
yytext, but this array is clobbered by the next match. If you want to
store the match for safe keeping, you'll need to copy it to a new
location. For example,
#include <string.h>
/* ... */
char *new_str;
/* ... */
new_str = (char *) malloc(strlen(yytext)+1);
strcpy(new_str, yytext);
You need the "+1" to make room for the null character (ASCII value 0)
terminating the copied string. The integer value yyleng is also
set to the length of the match, so instead of strlen(yytext) you
could have just used yyleng.
To compare two strings lexicographically (which includes checking if they are equal), use strcmp().
The time to check for circularities is when a new key-value pair is added to the dictionary, i.e., when a #define directive is processed. This is much more efficient than checking for circularities at substitution time. Starting with the new key, follow the key-value "links" forward until either the value found is not a key (in which case do nothing) or the value found is equal to the key being added. In the latter case, the new link is creating a cycle; you can then mark each node in the cycle as such, which means that it is not to be used or followed.
For example, in one of the sample pieces of code above, the line
#define CIRCLE6 CIRCLE4
creates the cycle
CIRCLE6 -> CIRCLE4 -> CIRCLE5 -> CIRCLE6, at which time this entire
cycle is effectively removed from future searches.
Do not remove the cycle from the dictionary, however! It is possible that one of the links on the cycle could change due to a redefinition, breaking the cycle. Run the preprocessor on the file test1.c in the test directory to see this phenomenon. When this happens, you must unmark all the nodes on the (former) cycle.
Any code you see below you are free to copy and modify without attribution.
Skip this if you are using something besides a hash table with linking for your dictionary.
Here is how I define my hash table:
#include <stdlib.h>
/* ... */
#define INITIAL_HASH_SIZE 8
#define MAX_LOAD_FACTOR 2
#define SCALE_FACTOR 2
/* ... */
int h_size; // Current size of the hash table array
int num_items; // Number of items stored in the dictionary
DR *hash_tab; // The actual array (dynamically allocated)
/* ... */
h_size = INITIAL_HASH_SIZE;
/* ... */
hash_tab = (DR *) malloc(h_size * sizeof(DR));
Here is my home-grown, general-purpose hash function:
int hash(const char *key, int h_size)
{
int sum = 0;
for (; *key; key++)
sum = (37*sum + *key) % h_size;
return sum;
}
There is nothing special about 37, except that it is prime. For
performance reasons, h_size should not have any factors in common
with 37, which means that, because 37 is prime, it should just not be
a multiple of 37.
When the load factor (the ratio of num_items to h_size) exceeds MAX_LOAD_FACTOR, I resize the hash table up by a factor of SCALE_FACTOR. That is, I allocate a new array with size h_size*SCALE_FACTOR (updating h_size accordingly), move all records from the old array into the new array, and de-allocate the old array (free(old_array)). These numbers can vary, but SCALE_FACTOR should be at least 2, and MAX_LOAD_FACTOR should not be more than, say, 5 or 6 or so.
You are not required to keep track of line numbers, but if you want to (for warning messages, say), you can do this easily in flex by including a rule for the newline character "\n", whose action increments a global count.
To repeat, you do NOT need to handle errors in #define directives. In all the
test files I'll use, all lines containing #define directives will have
the syntax:
#define key value
where key is an identifier, and value is either a (possibly signed)
integer constant, string constant, or identifier. Whitespace will
separate the three items, and whitespace will optionally appear before
and after.
All substitutions are made using the dictionary in its current state
at the time of the substitution. So for example,
#define ID1 ID2
int i = ID1;
#define ID2 ID3
int j = ID1;
becomes
int i = ID2;
int j = ID3;
This assumes you have the #include <stdio.h> directive somewhere.
Every C program is given two output streams: standard output (stdout) and standard error (stderr). Normal output (transformed code) should be sent to stdout, while any warnings or errors should be sent to stderr. Without redirection, both streams are echoed to the terminal, so they appear interleaved with each other.
But beware! These two streams are handled differently and not synchronized. Stdout is buffered while stderr is not. That means that stderr messages go out immediately, while stdout output may be delayed in a buffer. This means that the actual order that things appear on the terminal does not necessarily reflect the order that they are generated by the program.
If you want to better synchronize things, you can flush the stdout buffer at any time by calling fflush(stdout). For example, if you do this just before sending an error message to stderr, then your normal output will appear on the terminal before the error message.
Do not send warning messages to stdout! If you do this, the warnings
will be mixed in with the normal output, and so the output files won't
match. Instead, send the messages to standard error using
fprintf(stderr, "Warning: ...", ...)
instead of
printf("Warning: ...", ...)
which sends the message to stdout. fprintf() is called just like
printf(), but takes the stream to send the output as the first argument.
In the Windows world, lines in text files end with the sequence \r\n
(carriage-return, linefeed), whereas in Unix/Linux they end in \n
only. Please be aware of this if you edit text files on Windows. You
can hand-edit your files with emacs on Unix/Linux to remove the
offending carriage returns, or you can run your file through the lexer
generated by this flex program:
%%
\r\n putchar('\n');
%%
int main()
{
yylex();
return 0;
}
(You compile lex.yy.c using the -lfl switch.)
This is a great command! Use it to inspect the contents of a file.
For example,
od -c mytextfile
lists each byte of the text file in a readable way. This is
especially useful to determine the presence and nature of whitespace
in your file, which is often hard to do when viewing your file in some
editors. Other switches give other output formats. Type
man od
for more information.
Below are just my recommendations. Your mileage may vary.
First, run the C preprocessor (gcc -E) on the various test files and see what happens. Then look at and test the sample flex programs I made available to you (see the next section).
Implement the flex pattern matcher next, because this is probably the least familiar to you. To start with, you can have trivial actions that maybe just print the texts of the matches. This will allow you to test whether you've got your rules correct. You'll need rules for matching #define directives or their individual components, as well as a rule for finding identifiers in the rest of the code.
Next, implement the dictionary, including routines for searching for, adding, and updating an entry, as well as checking for, marking, and unmarking cycles. Please, please, PLEASE make every attempt to implement this data structure from scratch. That way, you will learn the C programming techniques that are most important for the main project to come.
To help you get more of a feeling for using flex, I have provided two
sample flex programs, with Makefiles, used in previous classes in the
subdirectories "wordcount" and "tokens". The latter was taken directly from the flex tutorial
(see "info flex" above). Copy each directory to one that you own.
For the tokens program, type "make", then
./tokens sample.pas
and observe the output. For the wordcount program, type "make", then
./wordcount
then type a few lines of text, ending with ctrl-D on a line by itself
(ctrl-D provides an end-of-file marker for keyboard input).
When done testing, type "make clean" to remove automatically generated files before you submit.
You are free to adapt and modify either of these programs for your homework without citation.