CSCI 587 - Assignment 2 Text Compression 
Date: Jan 23, 1997
Date Due: Jan 30, 1997
-  The dictionary used by the Unix system command spell is in the file
        /usr/dict/words. Write a small C program that will calculate the 
        frequency of each character in that file.
        
        -  What major differences do you see from the frequencies 
                presented in class?
        
-  Why is this text not a good sample text?
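
One possible starting point for the counting program is sketched below. The function names count_chars and print_frequencies are illustrative and are not part of the assignment. The sketch counts raw bytes and prints each byte by its numeric value; whether to fold upper and lower case together or to skip the newline characters is left open here.

```c
#include <assert.h>
#include <stdio.h>

/* Tally how often each byte value occurs in the stream fp.
   counts must have room for 256 entries; it is zeroed first.
   Returns the total number of bytes read. */
unsigned long count_chars(FILE *fp, unsigned long counts[256])
{
    unsigned long total = 0;
    int c;

    assert(fp != NULL);
    for (c = 0; c < 256; c++)
        counts[c] = 0;

    while ((c = getc(fp)) != EOF) {
        counts[(unsigned char)c]++;
        total++;
    }
    return total;
}

/* Print every byte value that occurred, with its count and its
   share of the total as a percentage. */
void print_frequencies(const unsigned long counts[256], unsigned long total)
{
    int c;

    for (c = 0; c < 256; c++) {
        if (counts[c] == 0)
            continue;
        printf("%3d  %10lu  %6.2f%%\n", c, counts[c],
               100.0 * (double)counts[c] / (double)total);
    }
}
```

A main function that opens /usr/dict/words with fopen, passes the stream to count_chars, and then calls print_frequencies would complete the program.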
        
 
-  Using the Huffman code in the table below, encode the phrase "USC wins".
-  Compare the performance of compress, compact, and gzip by calculating 
        the compression ratio for each on /usr/dict/words. 
        Also record the time required, using the command "time".
        
        -  To do this create a symbolic link to the dictionary 
                from your home directory with the command
        
 ln -s /usr/dict/words words
-  Then run each of the commands on the file under time and record the elapsed time.
        
 time compact words
 Use the manual ( man 1 time ) to interpret the output.
 In the output, u = user time and s = system time; the value that follows is the total (elapsed) time.
-  Check the size of each compressed file using the command "wc".
        
-  You will need to uncompact, uncompress, and gunzip to get back to the original file.
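
The ratio itself is simple arithmetic once the two sizes are known. The sketch below assumes one common definition, compressed size divided by original size; if the definition given in class differs (for example, original divided by compressed), adjust accordingly. The function names are illustrative, and file_size relies on the POSIX stat call as an alternative to reading the byte count from "wc -c".

```c
#include <assert.h>
#include <sys/stat.h>

/* Size in bytes of the file at path, or -1 if it cannot be examined. */
long file_size(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_size;
}

/* Assumed definition: compressed size / original size (smaller is better). */
double compression_ratio(long original_bytes, long compressed_bytes)
{
    assert(original_bytes > 0);
    return (double)compressed_bytes / (double)original_bytes;
}

/* Percentage of the original size that compression removed. */
double percent_saved(long original_bytes, long compressed_bytes)
{
    return 100.0 * (1.0 - compression_ratio(original_bytes, compressed_bytes));
}
```

For instance, a 200-byte file that compresses to 50 bytes has a ratio of 0.25 under this definition, a saving of 75 percent.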
        
 
-  Find a spelling/grammar checker on a PC or Mac.
        
        -  Which word processor are you using?
        
-  What is the grammar checker's evaluation of the following sentences?
             
 An hoarse is one thee gulf curse.
 Out the window, the bird flew.
-  How does the spelling checker respond to the following?
             
 fastly, greenly, et cetera (and I mean the phrase)
 
-  Extra Credit (3 points): A digram is a two-character sequence.
        
        -  Using /usr/dict/words calculate a static model 
                for digram compression. 
        
-  For the top ten digrams, compute a Huffman code.
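
A static digram model starts from a table of pair counts. The sketch below shows one way to gather them; the name count_digrams is illustrative. It assumes that a newline ends the current word, so no digram spans two dictionary entries. That is a modeling choice, not something the assignment specifies.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Count every pair of adjacent bytes in fp.
   counts[a][b] is the number of times byte a is immediately followed by b.
   Assumption: a newline ends the current word, so pairs never cross lines.
   Returns the total number of digrams counted. */
unsigned long count_digrams(FILE *fp, unsigned long counts[256][256])
{
    unsigned long total = 0;
    int prev = EOF;   /* EOF means "no previous character on this line" */
    int c;

    assert(fp != NULL);
    memset(counts, 0, 256 * sizeof counts[0]);

    while ((c = getc(fp)) != EOF) {
        if (c == '\n') {
            prev = EOF;       /* start fresh on the next word */
            continue;
        }
        if (prev != EOF) {
            counts[prev][c]++;
            total++;
        }
        prev = c;
    }
    return total;
}
```

The table is 256 x 256 entries, so it is better declared static or allocated on the heap than placed on a small stack. Sorting the non-zero entries by count gives the top ten digrams for the Huffman step.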
        
 
  | Character | Frequency % | Huffman code | 
 | space | 18.21 | 111 | 
 | E | 10.53 | 000 | 
 | T | 7.68 | 1101 | 
 | A | 6.22 | 1011 | 
 | I | 6.14 | 1001 | 
 | O | 6.06 | 1000 | 
 | R | 5.87 | 0111 | 
 | S | 5.81 | 0110 | 
 | N | 5.73 | 0100 | 
 | H | 3.63 | 11001 | 
 | C | 3.11 | 10101 | 
 | L | 3.07 | 10100 | 
 | D | 2.97 | 01011 | 
 | M | 2.48 | 00111 | 
 | U | 2.27 | 00110 | 
 | P | 1.89 | 00100 | 
 | F | 1.68 | 110001 | 
 | G | 1.65 | 110000 | 
 | B | 1.32 | 010100 | 
 | W | 1.13 | 001011 | 
 | Y | 1.07 | 001010 | 
 | V | 0.70 | 0101010 | 
 | K | 0.31 | 01010110 | 
 | X | 0.25 | 010101110 | 
 | Q | 0.10 | 0101011110 | 
 | J | 0.06 | 01010111110 | 
 | Z | 0.06 | 01010111111 | 
Figure from Introduction to Natural Language Processing 
by Mary Dee Harris.
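
Encoding with a fixed Huffman table is a lookup followed by concatenation. The sketch below copies the codes from the table above. The function names are illustrative, and the sketch assumes that lower-case letters share the code of their upper-case form, since the table lists capitals only. It also assumes an ASCII character set.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Codes for A..Z copied from the table above; space is handled separately. */
static const char *const LETTER_CODES[26] = {
    "1011",        /* A */  "010100",      /* B */  "10101",       /* C */
    "01011",       /* D */  "000",         /* E */  "110001",      /* F */
    "110000",      /* G */  "11001",       /* H */  "1001",        /* I */
    "01010111110", /* J */  "01010110",    /* K */  "10100",       /* L */
    "00111",       /* M */  "0100",        /* N */  "1000",        /* O */
    "00100",       /* P */  "0101011110",  /* Q */  "0111",        /* R */
    "0110",        /* S */  "1101",        /* T */  "00110",       /* U */
    "0101010",     /* V */  "001011",      /* W */  "010101110",   /* X */
    "001010",      /* Y */  "01010111111"  /* Z */
};
static const char SPACE_CODE[] = "111";

/* Code string for c, or NULL if c is not in the table.
   Assumption: lower-case letters use the code of their upper-case form. */
const char *huffman_code_for(char c)
{
    int u;

    if (c == ' ')
        return SPACE_CODE;
    u = toupper((unsigned char)c);
    if (u >= 'A' && u <= 'Z')
        return LETTER_CODES[u - 'A'];
    return NULL;
}

/* Write the bit string for text into out as '0' and '1' characters.
   Returns the number of bits, or -1 if a character has no code or
   out is too small to hold the result and its terminating NUL. */
int encode_phrase(const char *text, char *out, size_t out_size)
{
    size_t used = 0;

    assert(text != NULL && out != NULL && out_size > 0);
    out[0] = '\0';
    for (; *text != '\0'; text++) {
        const char *code = huffman_code_for(*text);
        size_t len;

        if (code == NULL)
            return -1;
        len = strlen(code);
        if (used + len + 1 > out_size)
            return -1;
        memcpy(out + used, code, len + 1);   /* includes the NUL */
        used += len;
    }
    return (int)used;
}
```

As a check against the table, encode_phrase("the", buf, sizeof buf) yields 110111001000: 1101 for T, 11001 for H, and 000 for E, which is 12 bits where an 8-bit character code would use 24.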