Course Notes for CSCE 531, Fall 2003
Stephen A. Fenner

These notes are mostly written for myself, so they may sometimes be unclear.

Week 01
-------

What is a compiler?  A translator of programs from a source language to a
target language.  The target language may itself be high-level.

Pipeline picture of a compiler.
Front-end and back-end separation; issues: modularity and portability.

    front end -> back end

Project information
Homework info
Cheating info

Successive refinement of the pipeline approach.  The front end controls
everything (syntax-directed translation):

    text -> lexer -> parser -> semantic checks -> code generator -> target code
             |         |             |
           tokens   syntax       type check,
                    trees        type coercion,
            (s.t.'s)             declarations

    (the lexer, parser, and semantic checks all read and update the
    symbol table)

Start with lexical analysis:
A TOKEN is the smallest meaningful unit of a program.
Ex:  foo  x2  3220  3.14e-60  +  (  }  !=  &&  ;  ,  .
Whitespace normally only delimits tokens.

Week 02
-------

The lexer recognizes tokens in the input stream and sends each token's type
and text to the parser.

Matching tokens -- token syntax: regular expressions.
Fix a finite alphabet Sigma of symbols.
Sigma^* is the set of all finite strings of symbols from Sigma.
A _language_ over Sigma is any subset of Sigma^*.
We are currently interested in which strings match a particular token type
(language).  Here, Sigma = ASCII.
A regular expression specifies a particular set of strings (the strings
matching (or matched by) the expression).

Regular expressions:
    character constants:  a
    character sets:       [...], [^...]
    concatenation:        RS
    Kleene closure:       R*, R+, R?
    disjunction:          R|S
    empty string:         epsilon, or ""

Precedence (highest to lowest):
    []
    *, +, ?          (unary postfix)
    concatenation    (binary infix)
    |                (binary infix)
Parentheses can be used freely to force grouping.

Ex: [ab]
    [a-z]
    [A-Za-z_][A-Za-z0-9_]*       (identifiers)
    [0-9]+                       (unsigned integer constants)
    [+-]?[0-9]+                  (optionally signed integer constants)
    (aa)*|(aaa)*  versus  (aa|aaa)*

Mention escaping metachars and quoted strings.

lex (flex) compiles regular expression information into a lexical analyzer
(C code):
    source file syntax
    a sample file
    matching rules: greedy (longest match), ties broken by order in the file
    actions

Unix commands:
    flex mylexer.lex               # produces lex.yy.c
    gcc lex.yy.c -o mylexer -lfl   # executable
    ./mylexer                      # reads standard input

.lex file parts:
    preamble
    %%
    rules
    %%
    utils

preamble: verbatim C code for the start of the file, plus named regular
          expressions
rules:    regexps and actions; these go into yylex()
utils:    verbatim C code at the bottom of the file
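The notes mention "a sample file" without reproducing one; here is a minimal
sketch of a complete .lex source showing all three parts (the token names and
the num_ids counter are illustrative, not from the course):

    %{
    /* preamble: verbatim C code copied to the top of lex.yy.c */
    #include <stdio.h>
    int num_ids = 0;    /* illustrative counter, not part of the course code */
    %}
    DIGIT   [0-9]
    ID      [A-Za-z_][A-Za-z0-9_]*
    %%
    {DIGIT}+        { printf("INT(%s)\n", yytext); }
    {ID}            { num_ids++; printf("ID(%s)\n", yytext); }
    [ \t\n]+        { /* whitespace only delimits tokens: no action */ }
    .               { printf("CHAR(%s)\n", yytext); }
    %%
    /* utils: verbatim C code at the bottom */
    int main(void)
    {
        yylex();
        printf("%d identifiers seen\n", num_ids);
        return 0;
    }

Build and run exactly as above: flex mylexer.lex, then
gcc lex.yy.c -o mylexer -lfl.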
Week 03
-------

Lex notes:
    named regexps in lex
    . (dot) reg exp (often with a default action)
    metachars in [...] lose their meta-ness, except for -, ^, \, and ]
        (the last two must be escaped)
    yyleng, yytext, yyin, yyout, yyerror

DFAs:
    alphabet
    states
    directed, labeled edges (transitions)
    start state
    accepting states
    semantics
    determinism constraint: no two edges from the same state with the same
        label (equivalently: the transitions form a partial function)

examples:
    ab*
    (ab)*
    a*bbb[ab]*
    b*abb[ab]*
    odd number of b's

DFAs are equivalent to reg exps: can effectively get each from the other.

How lex handles DFAs:
The DFA for each reg exp is looking for a suffix of the input to match.
Assume every state has a path to an accepting state (otherwise prune).
Each DFA keeps track of the start of the suffix it is currently matching.

    initially:
        the whole input stream is unread
        all DFAs are active and in their start states
        no best matching string found yet

    while not eof and there are active DFAs, do
        read an input symbol s
        for each active DFA D (in order by the list of lex rules) do
            if we can't advance D by s, then
                deactivate D (D is now inactive)
            else
                advance D to the next state via s (D stays active)
                and update the start of the current match if necessary
                if the start of D's current match is after the start of the
                currently best matched string, then
                    deactivate D
                endif
            endif
            if D is still active and in an accepting state, then
                if the string currently matched by D is strictly better
                than the previous best matching string, then
                    replace the best matching string with this one
                endif
            endif
        endfor
    endwhile
    copy all characters before the start of the best match to the output
        /* yytext and yyleng give the best matched string */
    perform the action corresponding to the DFA that matched the string
    back up on the input: anything read after the matched string is unread

lex is greedy; why is the following an error?

    int a,b,c;
    c = a+++++b;

(Greedy lexing yields a ++ ++ + b, and a++ is not an l-value, so (a++)++
is illegal.)
How about c = a++---b;  How about c = a+++--b;
Correct: c = a+++ ++b;

Other regexp examples: Pascal real constants.

Flex in depth.

Regexp to NFA (with epsilon-moves).
(Legend: o = nonaccepting state, O = accepting state.)
(Invariant: exactly one accepting state.)

    emptyset:
        ->o    O          (no edges; the accepting state is unreachable)

    epsilon:
        ->O

    a (in the alphabet):
              a
        ->o -----> O

    rs:
                   epsilon
        ->[o r o] --------> [o s O]

    where [o r o] is the NFA for r with its start state on the left and its
    accepting state on the right, turned into a nonaccepting state, and
    [o s O] is the NFA for s with its start state on the left and its
    accepting state on the right, remaining accepting.

    r|s:
              epsilon             epsilon
           +---------> [o r o] ----------+
           |                             v
        ->o                              O
           |                             ^
           +---------> [o s o] ----------+
              epsilon             epsilon

    r*:
              epsilon             epsilon
           +---------> [o r o] ----------+
           |                             |
        ->O <----------------------------+
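A minimal C sketch of one way to represent these fragments (an assumption
about representation -- flex's actual internals differ; each constructor
preserves the one-accepting-state invariant):

    #include <stdlib.h>

    /* Each state has at most two out-edges; label 0 means epsilon. */
    typedef struct State {
        int label1, label2;
        struct State *out1, *out2;
    } State;

    /* An NFA fragment: its start state and unique accepting state. */
    typedef struct { State *start, *accept; } Frag;

    static State *new_state(void) { return calloc(1, sizeof(State)); }

    /* single symbol a:   ->o --a--> O */
    static Frag lit(int a)
    {
        State *o = new_state(), *acc = new_state();
        o->label1 = a; o->out1 = acc;
        return (Frag){ o, acc };
    }

    /* rs: epsilon edge from r's accepting state (now nonaccepting)
       to s's start */
    static Frag cat(Frag r, Frag s)
    {
        r.accept->out1 = s.start;          /* label 0: epsilon */
        return (Frag){ r.start, s.accept };
    }

    /* r|s: new start with epsilon edges into r and s; epsilon edges
       from their old accepting states into one new accepting state */
    static Frag alt(Frag r, Frag s)
    {
        State *o = new_state(), *acc = new_state();
        o->out1 = r.start; o->out2 = s.start;
        r.accept->out1 = acc; s.accept->out1 = acc;
        return (Frag){ o, acc };
    }

    /* r*: one new state, both start and accepting, with epsilon into r
       and epsilon back from r's old accepting state */
    static Frag star(Frag r)
    {
        State *o = new_state();
        o->out1 = r.start;
        r.accept->out1 = o;
        return (Frag){ o, o };
    }

    int main(void)
    {
        Frag f = star(cat(lit('a'), lit('b')));   /* NFA for (ab)* */
        (void)f;
        return 0;
    }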
NFA to DFA (build states as epsilon-closures):

Given NFA N = (Q, Sigma, delta, q_0, F), let S be a subset of Q.

    epsilon-cl(S):
        C := S;
        WHILE there is an epsilon edge from a state in C to a state t
              not in C DO
            C := C union {t}
        ENDWHILE;
        return C
    END

    trans(S,a):   /* S is a subset of Q, and a is an alphabet symbol */
        C := emptyset;
        FORALL q in Q DO
            IF there is an a-labeled edge to q from some state in S THEN
                C := C union {q}
            ENDIF
        ENDFOR;
        return epsilon-cl(C)
    END

    buildDFA(N):
        S_0 := epsilon-cl({q_0});
        designate S_0 as the start state;
        stateSet := {S_0};
        WHILE there is an S in stateSet and an alphabet symbol a such that
              (S,a) has not been processed yet DO
            T := trans(S,a);
            mark (S,a) as processed;
            stateSet := stateSet union {T};
            draw an a-labeled edge from S to T
        ENDWHILE;
        FORALL S in stateSet DO
            IF S contains an accepting state of N (i.e., S intersect F !=
               emptyset) THEN
                designate S as an accepting state
            ENDIF
        ENDFOR;
        return the new DFA
    END

Week 04
-------

syntax analysis
syntactic objects
production rules
(context-free) grammars
    terminals (tokens) T and nonterminals N
    start symbol

ex: E -> E+E
    E -> E*E
    E -> c
    E -> (E)

Notation:
    A,B,C,...             represent nonterminal variables
    a,b,c,...,z           represent terminal variables
    X,Y,Z                 terminal and nonterminal vars
    alpha,beta,gamma,...  strings of terminals and nonterminals
    shorthand for multiple productions
    S is the start symbol

define: parse tree (not the same as syntax trees)
    the string corresponding to a parse tree
    a string is generated by a grammar if there is a parse tree ...
    the language recognized by a grammar
parsing means finding a parse tree (essentially)
previous example
ambiguity
unambiguous arithmetic expression grammar

sample grammars over {a,b}:
    S -> aS | Sb | epsilon
    S -> aS | bS | epsilon

BNF
yacc/bison
    use
    see example
explain homework: escc yacc file
escc in detail:
    typing attrs through union, type, and token declarations
    step through grammar -- actions build the syntax tree (attrs are
        syntax trees)
    expression vs expression statement (syntax: semicolon; semantics:
        the former returns a value, the latter does not)
    encode traverses the syntax tree post-order
    postfix stack discipline
    explain assembly code issues: size of stack items, registers, loads,
        stores, converts, addressing, codes for data types
    actions in the middle of productions

Week 05
-------

how a bottom-up parser works:
    lookahead, end-of-stream symbol $
    semantic stack: holds grammar symbols, state info, attributes
    shifts and reduces
    bottom-up traversal of the parse tree
    actions can always be assumed to come at the end of a production
        (otherwise add an empty production)
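In yacc terms, the empty-production trick looks like this (rule names
illustrative).  A mid-rule action

    thing : A { f(); } B ;

is treated as if the grammar were

    $act  : /* empty */ { f(); } ;
    thing : A $act B ;

so every action becomes an end-of-production (reduce-time) action.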
gets the rightmost derivation in reverse

example: simple expression evaluator
    E -> E + T | E - T | T    etc.
    add the production E' -> E
    accept when only E' is on the stack and the lookahead is $

2003/02/10

lex and yacc interaction:
    yacc produces y.tab.c and y.tab.h
    the lex source file has #include "y.tab.h"

overview of entire project

Describe Project 1: external declarations in C
    unadorned C grammar: gram.y
    lexical analyzer: scan.l (minimal change, if any)
    modules: symtab, bucket, types, message, utils, backend
    three levels of grading

go through the grammar: declarations, type_specifiers, declarators,
    initializers
type_specifiers: bucket module
rest of the type: type module
enrolling in the symtab: st_enter_id (returns enrollment papers)
    do this in the production identifier -> IDENTIFIER (the only place
    IDENTIFIER occurs in the grammar)
installing in the symtab (need papers, type)

Week 06
-------

Project: test files T1Lx.c

actions/rules come at the end (yacc simulates actions in the middle by
    introducing a new empty production)
building the decl tree: your own data struct
immediately convert the id string to an ST_ID at the
    identifier : IDENTIFIER production (use st_enter_id())
basic type specifiers handled with bucket pointer; attribute is int
converting the decl tree to a type: traverse top down, build the type
    bottom up with the ty_build_*() functions
installing in the symtab: allocate a new ST_DR record (GDECL), insert the
    type and ST_ID, and call st_install() (the return value tells of a
    duplicate declaration)
backend routines for encoding declarations:
    b_global_decl()
    compute size using ty_query() and ty_query_array()
    b_alloc_char(), b_alloc_int(), b_alloc_float(), etc.
no assembly code is emitted for function-type declarations

Week 07
-------

top-down parsing
predictive parsing
recursive descent parser for arith expressions:
    use a function for each nonterminal
    use a parameter for the inherited attribute for E_prime() and T_prime()
    return the synthesized attribute: the value of the expression
    precondition for each: the lookahead is the first token of the
        syntactic unit
    postcondition for each: the lookahead is the first token after the
        syntactic unit (special EOS token $)

EBNF makes a more compact grammar (equivalent to BNF): a regexp for the
right side of each production.

    <expr>   ::= <term> ( ('+'|'-') <term> )*
    <term>   ::= <factor> ( ('*'|'/') <factor> )*
    <factor> ::= CONST | '(' <expr> ')'

The functions now are DFAs that match the right-hand side.  (Can add
semantic actions inside the regexps.)

top-down grammars can't have left recursion
remove immediate left recursion and immediate cycles at the same time:
    replace
        A -> Aa1 | ... | Aan | b1 | ... | bm
    with
        A  -> b1A' | ... | bmA'
        A' -> a1A' | ... | anA' | epsilon
    (can remove any production where ai is empty)
    get an equivalent grammar without immediate left recursion

roughly describe Project II
run the grammar transformation of E -> E+T | E-T | T into LL(1) form
DFA (or NFA) to regexp.
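A minimal runnable sketch of such a parser for the EBNF above, with a
character-level lookahead standing in for a real lexer (the cursor p and the
hard-coded input are assumptions for illustration):

    #include <stdio.h>
    #include <stdlib.h>
    #include <ctype.h>

    static const char *p;               /* input cursor; lookahead is *p */

    static void error(const char *msg) { fprintf(stderr, "%s\n", msg); exit(1); }

    static int expr(void);              /* forward declaration */

    /* <factor> ::= CONST | '(' <expr> ')' */
    static int factor(void)
    {
        if (isdigit((unsigned char)*p)) {
            int v = 0;
            while (isdigit((unsigned char)*p)) v = v*10 + (*p++ - '0');
            return v;
        }
        if (*p == '(') {
            p++;
            int v = expr();
            if (*p++ != ')') error("expected )");
            return v;
        }
        error("expected constant or (");
        return 0;
    }

    /* <term> ::= <factor> ( ('*'|'/') <factor> )* */
    static int term(void)
    {
        int v = factor();
        while (*p == '*' || *p == '/') {
            char op = *p++;
            int w = factor();
            v = (op == '*') ? v * w : v / w;
        }
        return v;
    }

    /* <expr> ::= <term> ( ('+'|'-') <term> )* */
    static int expr(void)
    {
        int v = term();
        while (*p == '+' || *p == '-') {
            char op = *p++;
            int w = term();
            v = (op == '+') ? v + w : v - w;
        }
        return v;
    }

    int main(void)
    {
        p = "2+3*(4-1)";
        printf("%d\n", expr());         /* prints 11 */
        return 0;
    }

Note how each function's entry and exit respect the pre/postconditions: the
lookahead (*p) is the first symbol of the unit on entry and the first symbol
after it on return.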
Week 08
-------

Project II in depth:

expressions: expr tree nodes, types

Example:
    int a;
    double x;
    ...
    x = a + 3.5;

conversions: unary, binary, assignment
    1. convert to r-val if needed
    2. unary conversions
    3. binary or assignment conversion

Example:
    int a;
    char c;
    ...
    a = c = 300;
What is the value of a?  (Definitely 44 (with gcc).)

mention constant folding!

back end routines

compound statements: st_enter_block() and st_exit_block()
    the symtab stores variables in stacks

intermediate actions are converted to empty-production reduce actions

function definitions:

Example:
    int foo(a,b)
        char a;
        float b;
    {
        a = a + b;
    }

    b_func_prologue(...), then
    offset = b_store_formal_param(...)
    store the offset (from the frame pointer) in the binding field of the
    ST_DR

Constant folding

function calls:
    When building the expression syntax tree: the function name is a
    subexpression of a FUNCALL node with a list of parameter expressions
    (each unary converted (actually parameter converted, but it's the same
    as unary)); the type is the return type of the function (the function
    name is an expression!).
    When traversing for encoding: could do the usual postorder, pushing the
    address of f on the stack first (then calling b_funcall_by_ptr() after
    loading the actual args), but there's a shortcut when the function is
    just an identifier: don't push the address; just load the args, then
    call b_funcall_by_name().

To encode a function call:

    b_alloc_arglist( number of arguments )   /* allocates space on stack */
    (evaluate first arg)
    b_load_arg( typetag )    /* moves value to where it should go
                                (reg or stack) */
    (evaluate second arg)
    b_load_arg( typetag )
    ...
    b_funcall_by_name( name, return typetag )   (or b_funcall_by_ptr())
        /* must match b_alloc_arglist() */

[An internal stack of actual-argument counters is maintained: a new
initialized counter is pushed by b_alloc_arglist(), updated by
b_load_arg(), and popped by b_funcall_by_...().]

draw the activation record on the control stack, with parameters, local
stuff, and the frame pointer

To begin a function definition:
    1. install the function in the symbol table as an FDECL (check for a
       previous declaration, etc.)
    2. st_enter_block()
    3. check the id list against the declarations
    4. b_func_prologue()
    5. for each formal parameter (in order) do:
       a. get its type
       b. offset = b_store_formal_param(typetag of the parameter)
       c. allocate an ST_DR record; insert the offset into its binding field
       d. install the parameter as a PDECL

A single internal counter of formal arguments is initialized by
b_func_prologue() and updated by b_store_formal_param().
b_store_formal_param() does automatic type conversion and placement (the
calling convention puts some parameters in registers; we want them all on
the stack).

remembering the function name

In expressions, local and parameter variables are referenced by offset with
b_push_loc_addr(offset)   /* pushes the actual address on the stack */

Back to grammar conversion for top-down parsing:

Recall the transformation to remove immediate left recursion; apply it to
the usual bottom-up grammar for arith expressions.

Removing all left recursion:

    order the nonterminals A1, ..., An
    for i := 1 to n do
        for j := 1 to i-1 do
            /* invariant: Aj -> Ak b implies j < k */
            let Aj -> g1 | ... | gm be the productions with Aj on the left
            replace each Ai -> Aj a with Ai -> g1 a | ... | gm a
        endfor
        remove immediate left recursion from Ai
        /* postcondition: Ai -> Aj... implies i < j */
    endfor

Left factoring: change

    A -> alpha beta_1 | alpha beta_2

to

    A  -> alpha A'
    A' -> beta_1 | beta_2
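Worked instance (the transformation Week 07 said to run on the usual
expression grammar):

    E -> E + T | E - T | T    becomes    E  -> T E'
                                         E' -> + T E' | - T E' | epsilon

    T -> T * F | F            becomes    T  -> F T'
                                         T' -> * F T' | epsilon

    F -> ( E ) | c            (no left recursion; unchanged)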
For each A, find the longest alpha common to two or more alternatives.
Do the above if alpha != epsilon.  Repeat until nothing more can be done.

FIRST and FOLLOW sets (helpful in the construction of a predictive parser):

    FIRST(alpha) = { terminals a : a begins some string derived from alpha }
        (if alpha =>* epsilon, then epsilon is also in FIRST(alpha))
    FOLLOW(A) = { terminals a : a can appear immediately to the right of A
                  in some sentential form }
        (define sentential form)

FIRST(X) and FOLLOW(A) are easy to compute:

FIRST(X):
    if X is a terminal, then FIRST(X) = {X}
    if X -> epsilon, add epsilon to FIRST(X)
    if X is a nonterminal and X -> Y1...Yk, then place terminal a in
        FIRST(X) if a is in FIRST(Yi) for some i and epsilon is in
        FIRST(Yj) for all j < i; if epsilon is in FIRST(Yj) for all j,
        add epsilon to FIRST(X)

FOLLOW(A):
    place $ in FOLLOW(S), where S is the start symbol
    for each production A -> alpha B beta, put all of FIRST(beta) except
        epsilon into FOLLOW(B)
    for each production A -> alpha B, or A -> alpha B beta where
        beta =>* epsilon, put all of FOLLOW(A) into FOLLOW(B)

A predictively parsable grammar is called LL(1) (Left-to-right read,
Leftmost derivation, with 1 symbol of lookahead).

G is LL(1) iff whenever A -> alpha | beta are distinct productions:
    1. FIRST(alpha) and FIRST(beta) are disjoint
    2. if beta =>* epsilon, then FOLLOW(A) and FIRST(alpha) are disjoint

Week 09
-------

Syntax-directed translation

Semantic rules "annotate" the grammar; they are attached to productions.
Rules have the form

    b := f(a1,...,an)

where a1,...,an,b are attributes of symbols in the production (side effects
and/or a void return are allowed).  b _depends_ on a1,...,an.

A dependency graph arises along with the parse tree when parsing input.
General case: topologically sort the dependency graph (this also detects
cycles) and evaluate the attributes in that order.  Too general, and
inefficient, for two important special cases:

1. Synthesized attributes only (each depends only on attributes of the
   children): can compute on the fly and store on the semantic stack
   during bottom-up parsing (yacc works this way).

2. L-attributed grammars.  [An attribute is inherited if it depends only
   on the parent and siblings.]  All attributes are either (a) synthesized
   and depend only on synthesized attributes of the children, or (b)
   inherited and depend only on attributes of left siblings or inherited
   attributes of the parent.  Can't do this directly in yacc (without
   hacking; see below); need to build a structure to remember delayed
   attribute computations (good example: declaration trees).

Example of inherited (L-) attributes:

    D   -> T L        { L.type := T.type; }
    T   -> int        { T.type := int; }
    T   -> real       { T.type := real; }
    L   -> id         { install_type(id, L.type); }
    L_1 -> L_2 , id   { install_type(id, L_1.type); L_2.type := L_1.type; }

T.type is synthesized; L.type is inherited.

This is pretty easy to do in recursive descent predictive parsing: pass
inherited attributes (known when the function is called) in as IN
parameters; return synthesized attributes as the return value or through
OUT parameters.
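A C sketch of this, continuing the recursive descent sketch from Week 07
(reusing its hypothetical cursor p and term(), so this is a fragment of that
same file); the inherited attribute arrives as the parameter inh, and the
synthesized attribute is the return value:

    /* E' -> + T E' | epsilon.  inh is the inherited attribute: the value
       of everything already parsed to the left.  The return value is the
       synthesized attribute. */
    static int eprime(int inh)
    {
        if (*p == '+') {
            p++;
            return eprime(inh + term());   /* fold left-associatively */
        }
        return inh;    /* epsilon production: synthesized := inherited */
    }

    /* E -> T E': same language as the loop-based expr() from Week 07 */
    static int expr2(void) { return eprime(term()); }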
Inherited attributes in yacc (for L-attributed definitions).  This is a bit
of a hack.  (Good graduate project: make this a legitimate, type-checked
part of yacc.)  Basic idea: pass an inherited attribute from parent to
child through $0.

Example, for the grammar above:

    decl : type list
         ;
    type : int           { $$ = INT; }
         | real          { $$ = REAL; }
         ;
    list : id            { install_type($1, $0); }
         | list ',' id   { install_type($3, $0); }
         ;

A more revealing example -- passing a break label down to a break
statement:

    start : { $$ = NULL; } stmt
          ;
    stmt : IF expr THEN { $$ = $0; } stmt ELSE { $$ = $0; } stmt
         | '{' { $$ = $0; } stmt_list '}'
         | WHILE expr DO { $$ = new_symbol(); } stmt { encode_label($4); }
         | BREAK { encode_jump($0); }
         | other
         ;
    stmt_list : /* null derive */
              | stmt_list { $$ = $0; } stmt
              ;

Week 10
-------

Creating a bottom-up (LR) parser

Converting a grammar to an LR parser is similar to building a DFA from an
NFA.  States will be sets of items, and we define closure and transition
functions.

A simple LR(0) parser

We fix an augmented grammar G' with start symbol S', which appears only in
the production S' -> S.

An _LR(0)_item_ is any expression of the form [A -> alpha . beta], where
A -> alpha beta is any production of G'.  The dot represents a possible
location in the parse tree traversal.

Our states are sets of LR(0) items that are as small as possible subject
to the following invariant: "If the parser is in state s while at a
location in the parse tree traversal compatible with some item I, then I
is a member of s."

Here is the closure operation:

    cl(s):
        C := s
        repeat
            C := C union { [A -> . gamma] : [B -> alpha . A beta] is in C
                           and A -> gamma is a production }
        until C doesn't change
        return C
    end

Idea: if we are about to parse A, then we could be starting on any
A-production.

Here is the start state:

    s_0 := cl({ [S' -> . S] })

It represents the state at the beginning of the parse, where nothing has
been read yet.  The item [S' -> . S] is the _kernel_item_ of s_0.

Here is the trans operation (s is a state and X is a grammar symbol):

    trans(s,X):
        T := { [A -> alpha X . beta] : [A -> alpha . X beta] is in s }
        return cl(T)
    end

Idea: trans(s,X) is the state resulting from starting in state s and
successfully parsing X as a prefix of the input.  The items in T are the
_kernel_items_ of trans(s,X).  Every state is the closure of its kernel
items, so only the kernel items need be stored to specify a state, and two
states are equal iff their kernel items are equal.

We build states for the parser as we built states for the DFA, i.e., all
states reachable from the start state via transitions:

    states(G):
        form the augmented grammar G'
        C := { s_0 }    // the start state
        while there is a state s in C and a grammar symbol X such that
              trans(s,X) is not in C do
            C := C union { trans(s,X) }
        return C
    end

The state and the lookahead (or nonterminal) tell the parser what to do at
any step, via the action and goto tables.

For each state s and terminal a:
    - If [S' -> S .] is in s, then set action[s,$] to "accept".
    - For any item [A -> alpha . a beta] in s, set action[s,a] to
      "shift s'", where s' = trans(s,a).
    - For any item [A -> alpha .] in s such that a is in FOLLOW(A), set
      action[s,a] to "reduce A -> alpha".  (Using FOLLOW(A) here is the
      SLR refinement of pure LR(0).)
Any unset entries in the action table are set to "error".

If an action table entry gets set to two contradictory actions, then we
register a conflict (shift/reduce or reduce/reduce).  Conflicts can be
tolerated by providing rules for resolving them.  By default, yacc
resolves a shift/reduce conflict in favor of shifting, and a reduce/reduce
conflict in favor of reducing by the production listed earliest in the
grammar.

For each state s and nonterminal A:
    - set goto[s,A] to trans(s,A)
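For example, the classic dangling-else grammar produces a shift/reduce
conflict; yacc's default resolution (shift) binds each ELSE to the nearest
IF.  A sketch (the token names are illustrative):

    %token IF EXPR ELSE OTHER
    %%
    stmt : IF EXPR stmt             /* on lookahead ELSE: reduce here...  */
         | IF EXPR stmt ELSE stmt   /* ...or shift ELSE?  yacc shifts,    */
         | OTHER                    /* attaching the ELSE to the nearest  */
         ;                          /* IF                                 */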
The parser behaves as follows.  The internal state is always the one
stored on top of the stack.  Initially, the stack contents are (s_0,S'),
and the lookahead is the first token of the input.

In state s with lookahead a:
    if action[s,a] = "accept", then
        halt with success status
    else if action[s,a] = "error", then
        give a syntax error message and call some error recovery routine
    else if action[s,a] = "shift s'", then
        push (s',a) onto the stack and advance the lookahead one token
    else if action[s,a] = "reduce A -> alpha", then
        pop |alpha| items from the stack, exposing some (s',X), and push
        (goto[s',A],A) onto the stack

"Standard example": G' is

    E' -> E
    E  -> E + T
    E  -> T
    T  -> T * F
    T  -> F
    F  -> c
    F  -> ( E )

States (kernel items listed first; s_0 is the start state):

    s_0:    E' -> . E
            E  -> . E + T
            E  -> . T
            T  -> . T * F
            T  -> . F
            F  -> . c
            F  -> . ( E )

    s_1 = trans(s_0,E):
            E' -> E .
            E  -> E . + T

    s_2 = trans(s_0,T):
            E -> T .
            T -> T . * F

    s_3 = trans(s_0,F):
            T -> F .

    s_4 = trans(s_0,c):
            F -> c .

    s_5 = trans(s_0,'('):
            F -> ( . E )
            E -> . E + T
            E -> . T
            T -> . T * F
            T -> . F
            F -> . c
            F -> . ( E )

    s_6 = trans(s_1,+):
            E -> E + . T
            T -> . T * F
            T -> . F
            F -> . c
            F -> . ( E )

    s_7 = trans(s_2,*):
            T -> T * . F
            F -> . c
            F -> . ( E )

    s_8 = trans(s_5,E):
            F -> ( E . )
            E -> E . + T

    s_2 = trans(s_5,T)
    s_3 = trans(s_5,F)
    s_4 = trans(s_5,c)
    s_5 = trans(s_5,'(')

    s_9 = trans(s_6,T):
            E -> E + T .
            T -> T . * F

    s_3 = trans(s_6,F)
    s_4 = trans(s_6,c)
    s_5 = trans(s_6,'(')

    s_10 = trans(s_7,F):
            T -> T * F .

    s_4 = trans(s_7,c)
    s_5 = trans(s_7,'(')

    s_11 = trans(s_8,')'):
            F -> ( E ) .

    s_6 = trans(s_8,+)
    s_7 = trans(s_9,*)

---------

Canonical LR(1) parsing: incorporate the lookahead information into the
state.  An LR(1) item is an LR(0) item together with a terminal (or $):
[A -> alpha . beta, a].  States are sets of LR(1) items.  Fewer conflicts
than with LR(0) parsing, but too many states.  Compromise: LALR parsing is
canonical LR(1) parsing with some states identified.  Yacc is an LALR
parser generator.

Week 11
-------

Type checking, run-time systems.

A type is a set of possible values.
A type system is a means of restricting objects to be of a certain type.
Type checking (via a type system) means preventing operations from being
performed on values of the wrong type.
    static (compile time)
    dynamic (run time)
*Most* logical errors in C or another weakly typed language are type
errors.

A language is strongly typed if all type errors can be found at compile
time (statically), i.e., if a program compiles successfully, it is
guaranteed to run without type errors.  Virtually no existing language in
practical use is strongly typed, but some are stronger than others:

    Untyped assembler (can interpret any data as any type at any time)
    Lisp (just a few types, heterogeneous lists, lots of runtime type
        errors)
    C (fairly good but circumventable type restrictions: casting, unions;
        ANSI/ISO C better than traditional/old-style C)
    C++ (same as C, but variant records can be done as class hierarchies)
    Perl, Python (few types, but careful syntax prevents crashes)
    Pascal, PL/I (more type distinctions than C, C++, e.g., ptrs vs
        arrays)
    Ada, Java (careful, finicky type checking, both static and dynamic)
    ML (very finicky, sophisticated static type system, e.g., no implicit
        coercions)
    Strongly typed
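A small illustration of C's circumventable type restrictions (the printed
values are implementation-dependent, which is exactly the point):

    #include <stdio.h>

    int main(void)
    {
        union { float f; unsigned u; } pun;
        pun.f = 3.14f;               /* store a float...                  */
        printf("%08x\n", pun.u);     /* ...reinterpret its bits; no error */

        double d = 2.5;
        int *ip = (int *)&d;         /* the cast silences the type system */
        printf("%d\n", *ip);         /* reads part of a double as an int  */
        return 0;
    }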
basic (primitive) types: int, real, char, bool, etc.

common type constructors:
    structured:
        array[I] of T   (I is the index type, T the element type)
        record { name_1 T_1; ... name_n T_n }
            (T_1,...,T_n are the member (or field) types)
        function types: function taking D returning C
            (D is the domain type, C the codomain (or "range") type)
        set of T
    more abstract type constructors (ADTs, classes, structures, functors):
        list of T (or "T list" as in ML)
        vector of T
        set of T
        bag of T
        etc.
        (include the basic operations (methods) as part of the type)
    unstructured:
        subrange of T   (T is an "ordinal" type)
        pointer to T
        enum(erated) types (values are names from an explicit list)

In Pascal, the index type of an array is (usually) a subrange type.  Type
checking this means making sure the index is within the range, so a range
error is a type error.  Can't easily enforce this at compile time, so it
is checked dynamically, if at all.  Most compilers (Java is a big
exception) don't bother with this check (by default), since it slows down
the program's execution.

Theoretical aspects
-------------------

Two abstract binary type constructors: x (Cartesian product) and ->
(function building).  Let S and T be types.  The type S x T is the type
whose values are all ordered pairs of a value of type S and a value of
type T.  The type S -> T consists of all functions (mappings) taking a
value of type S as an argument and mapping it to (returning) a value of
type T.

Example: can roughly implement array and record types in terms of x
and ->.

The "universal" type rule: variables, functions, and operators can all be
typed.  E.g.,

    a : int
    % : int x int -> int

Type conversions are just unary operators, e.g.,

    int2real : int -> real

Expanded syntax tree (0-2 tree): nodes become leaves; internal nodes are
either app (apply) or pair.  The pair node returns the ordered pair of its
children.  The app node returns the result of applying its left child (an
operator) to its right child (the argument).  For example,

          +                      app
         / \        ==>         /   \
        a   6                  +    pair
                                    /  \
                                   a    6

Types of internal nodes in the expanded tree satisfy the following two
bottom-up rules:
    1. The type of a pair node is the Cartesian product of the types of
       its two children.
    2. If the left child of an app node has a type of the form S -> T and
       the right child has type S, then the app node has type T.
       Otherwise, type error.
With sufficiently careful typing, _all_ type errors are seen as violations
of Rule 2.

Polymorphism: an operator acting on different types at different times.

1. Operator overloading, e.g.,

       + : int x int -> int
       + : real x real -> real

2. Object-oriented polymorphism (derived classes, inheritance).  An object
   of derived type may at any time be used as an object of the base type.

       ->+-------------------+
         | data of base type |
         +-------------------+
         | extra data of     |
         | derived type      |
         +-------------------+

3. Parameterized polymorphism (the most sophisticated -- ML).  Type
   expressions may contain type variables that can be instantiated in
   different ways at different times.  Use lower-case Greek letters for
   type variables.  For example, a generic list sorting operator may be
   typed as

       sort : alpha list -> alpha list

   but we must also incorporate the comparison operator:

       sort : (alpha x alpha -> bool) x alpha list -> alpha list

   So if we have data such as

       lessthan : string x string -> bool
       mylist   : string list

   then sort(lessthan, mylist) makes sense and returns a string list.  The
   type of sort is _instantiated_ so that its domain type matches the type
   of its argument.  Sometimes the argument may also be of polymorphic
   type, and it too must be instantiated to match the domain type of the
   operator.  This process is generally known as unification.
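Before turning to unification: the two bottom-up rules above are easy to
mechanize in the monomorphic case.  A minimal C sketch over a toy type
representation (all names are illustrative assumptions, not the project's
type module):

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { T_INT, T_REAL, T_PROD, T_ARROW } Tag;
    typedef struct Type { Tag tag; struct Type *left, *right; } Type;

    static Type *mk(Tag tag, Type *l, Type *r)
    {
        Type *t = malloc(sizeof *t);
        t->tag = tag; t->left = l; t->right = r;
        return t;
    }

    static int same(Type *a, Type *b)
    {
        if (a->tag != b->tag) return 0;
        if (a->tag == T_INT || a->tag == T_REAL) return 1;
        return same(a->left, b->left) && same(a->right, b->right);
    }

    /* Rule 1: a pair node's type is the product of its children's types. */
    static Type *type_pair(Type *s, Type *t) { return mk(T_PROD, s, t); }

    /* Rule 2: if the operator has type S -> T and the argument has type S,
       the app node has type T; otherwise, type error. */
    static Type *type_app(Type *op, Type *arg)
    {
        if (op->tag != T_ARROW || !same(op->left, arg)) {
            fprintf(stderr, "type error\n");
            exit(1);
        }
        return op->right;
    }

    int main(void)
    {
        /* + : int x int -> int, applied to the pair (a, 6), both ints */
        Type *i = mk(T_INT, NULL, NULL);
        Type *plus = mk(T_ARROW, mk(T_PROD, i, i), i);
        Type *result = type_app(plus, type_pair(i, i));
        printf("app node has tag %d (T_INT)\n", result->tag);
        return 0;
    }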
Unification
-----------

Given two type expressions s and t (e.g., in an application, the domain
type of the operator and the type of the argument), possibly with type
variables, find the _most_general_unifier_, that is, the least possible
instantiation of the variables that makes s and t identical (if such a
thing exists).  This process is known as unification.  The common instance
is unique if it exists, in which case we say that s and t unify.

For example,

    s = (alpha -> beta) -> gamma
    t = gamma -> (int -> alpha x alpha)

These unify.  The most general unifier is

    alpha := int
    beta  := int x int
    gamma := int -> int x int

yielding the common instance type
(int -> int x int) -> (int -> int x int).

Another example:

    s = alpha x beta -> gamma
    t = delta -> delta x real

These unify:

    delta := alpha x beta
    gamma := (alpha x beta) x real

to yield the type alpha x beta -> (alpha x beta) x real.

The most general unifier may still contain type variables.  It is most
general in the sense that any other common instance must be an instance of
the most general one.  A concrete type is one with no type variables.

There is a linear-time unification algorithm (complicated, never used).
There is a much simpler, almost-linear-time algorithm that uses the
disjoint-sets data structure, supporting union and find ops.

Informal procedure: see if the corresponding nodes in the two expressions
can be matched, i.e., both are the same op, or at least one is a variable
(in the latter case, substitute the other node for the variable).  Then
recursively try to unify the respective successors.  If we succeed, we're
done.  If we cannot identify two nodes, then we fail: the expressions do
not unify.

Type expressions are trees, or dags (trees with shared nodes), or general
graphs (with cycles).  A type whose expression contains a cycle is a
recursive type.  Typical recursive type: linked list node:

         x <---------------------+
        / \                      |
       x   x                     |
      / \ / \                    |
  data  T next ptr --------------+

which is the result of unifying beta with
({data} x T) x ({next} x beta ptr).

Universal rule for parameterized polymorphic types: in an app node, if the
domain type of the operator unifies with the type of the argument, then
the type of the app node is the codomain type, subject to the
instantiations of the unification.  Otherwise, type error.

This is ML's type checking mechanism.  ML also infers types from how
values are used.  Overloaded operators are resolved this way.

    - fun successor x = x + 1;
    val successor = fn : int -> int
    - fun twotimes x = x + x;
    error: cannot resolve overloaded operator +
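A minimal C sketch of the informal matching procedure, using binding
pointers on variables instead of the disjoint-sets structure (illustrative
assumptions throughout; note that it includes an occurs check, so unlike
the recursive-type discussion above it insists on acyclic types):

    #include <stdio.h>

    typedef enum { T_VAR, T_INT, T_REAL, T_PROD, T_ARROW } Tag;
    typedef struct Ty { Tag tag; struct Ty *l, *r, *bound; } Ty;

    static Ty *resolve(Ty *t)            /* follow variable bindings */
    {
        while (t->tag == T_VAR && t->bound) t = t->bound;
        return t;
    }

    static int occurs(Ty *v, Ty *t)      /* occurs check: reject cycles */
    {
        t = resolve(t);
        if (t == v) return 1;
        if (t->tag == T_PROD || t->tag == T_ARROW)
            return occurs(v, t->l) || occurs(v, t->r);
        return 0;
    }

    static int unify(Ty *a, Ty *b)
    {
        a = resolve(a); b = resolve(b);
        if (a == b) return 1;
        if (a->tag == T_VAR) {           /* bind the variable */
            if (occurs(a, b)) return 0;
            a->bound = b;
            return 1;
        }
        if (b->tag == T_VAR) return unify(b, a);
        if (a->tag != b->tag) return 0;  /* constructor clash: fail */
        if (a->tag == T_PROD || a->tag == T_ARROW)
            return unify(a->l, b->l) && unify(a->r, b->r);
        return 1;                        /* both int or both real */
    }

    int main(void)
    {
        /* second example above: s = alpha x beta -> gamma,
                                 t = delta -> delta x real */
        Ty alpha = {T_VAR}, beta = {T_VAR}, gamma = {T_VAR}, delta = {T_VAR};
        Ty real  = {T_REAL};
        Ty ab    = {T_PROD,  &alpha, &beta};
        Ty s     = {T_ARROW, &ab,    &gamma};
        Ty dr    = {T_PROD,  &delta, &real};
        Ty t     = {T_ARROW, &delta, &dr};
        printf("%s\n", unify(&s, &t) ? "unify" : "fail");
        /* afterward delta is bound to alpha x beta, and gamma to
           (alpha x beta) x real, matching the notes */
        return 0;
    }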
Run-time systems

Project III
    control flow
    if's, loops, switches
    break and return (and continue?)
    break destinations

Week 12
-------

Back-end issues

Code generation, control-flow analysis, transformations.

Intermediate languages: as simple and unstructured as possible while
remaining machine independent.
    triples, quadruples, 2-address code (human readable)

3-address code instructions (x, y, z are vars (of simple type); s, t, u
are vars or constants; binop is one of +,-,*,/,%; unop is -; relop is one
of <,<=,=,>=,>,<>; L is a label; any instruction may be immediately
preceded by a label):

    x := s binop t
    x := unop s
    x := s
    *s := t
    x := *s
    s[t] := u
    x := s[t]
    goto L
    if s relop t goto L
    read x         /* gets x from the input stream */
    write s        /* writes s to the output stream */
    arg s          /* makes s an actual argument value for a fcn call */
    store x        /* stores an actual argument in x inside the fcn */
    call L
    x := call L
    return
    return s

temporary values for internal nodes of the syntax tree: a stack of virtual
regs

Various kinds of optimizations, e.g., peephole.

Control-flow analysis: basic blocks, flow graph (assumes no fcn calls).
Good for detecting inner loops, inaccessible code, redundant jumps, and
opportunities to move code to reduce the number of jumps.

Code transformations: shared common subexpressions, constant folding,
algebraic transformations (e.g., x := y * 0 to x := 0), linear array
indexing optimizations.

Project IV
    pointers and arrays
    * and & operators
    pointer arithmetic
    [] operator
    function pointers
    L-values and R-values

Week 13
-------

Liveness analysis, data-flow analysis, code optimization.

Control points.
A variable is _live_ at a control point p if it can be used again beyond p
before it is reset; otherwise it is dead.
Algorithm for computing variable liveness -- within a block, and globally.

Week 14
-------

Project V
    structs, unions, enums
    struct/union/enum tags go in a tag symbol table
    member lists as part of the type
    offsets, size, alignment
    . and -> operators (think of them as unary)
    conversions

Bootstrapping

There are lots of good reasons to write a compiler for a language L in L
itself -- consistent data formats, syntax, etc.  We're writing a C
compiler in C.  ML compilers are often written in ML.

Chicken & egg problem (the egg came first, obviously): the first C
compiler cannot be written entirely in C.  Good step-by-step strategy:
write a compiler for a small, simple fragment C' of C in assembler or some
other pre-existing language, like Fortran.  Write a compiler for a much
larger subset C'' of C in C'.  Write the full C compiler in C''.  May use
fewer or more steps.  Pascal was implemented this way.

Two big advantages of writing an L compiler in L:

1. Given an inefficient L compiler c that generates poor (inefficient,
   bloated) target code, produce an efficient L compiler c' that produces
   high-quality target code:
   a) Write the good compiler c' in L.
   b) Compile c' using c -- get an inefficient compiler that produces
      good code.
   c) Use this compiler to recompile c' -- get a good compiler producing
      good code.

2. Porting an L compiler from one machine M1 (with target code T1) to
   another, brand new machine M2 (with target code T2):
   a) Write a compiler c in L that translates L into T2.
   b) Compile c on M1 using the existing compiler, getting c' (in T1,
      generating T2).
   c) Recompile c using c', getting a compiler c'' (in T2, generating T2).
   d) Copy c'' to M2.