Compilers | Notion

TODO merge in notes from CS164

TODO: distinguish grammars from parsers more cleanly in these notes
compiler types
- recursive descent
  - parses LL(k)
  - simplest to hand-code; simply recursive functions that backtrack
  - non-predictve; requires backtracking
  - most embedded parser languages are this (e.g. haskell's parsec monadic parser), unless the embedded language constructs the data structures fed into a "more separate" parsing
- predictive grammars
  - use $k$ tokens of lookahead to fully determine which production to use
  - no backtracking needed
- top-down operator precedence (pratt) parser
  - http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
  - basic idea: allow yourself to call into main parse(precedence) from anywhere, and the precedence is always checked to be non-decreasing
  - also controls associativity
  - pseudocode
    
    def parse(precedence): token = next() p = prefix_parsers[token.type] parsed = p.parse(token) while True: token = next() p = infix_parsers[token.type] if p is None or precedence < token.precedence: break parsed = p.parse(parsed, token) return parsed
parsing types
- shift-reduce parsing: a type of bottom-up parsing
grammar types (hierarchy)
- bottom-up
  - LR
    - LR(0)
    - SLR(1!)
    - LALR(1)
    - LR(1)
    - LR(n)
grammar types (details)
- general
  - first L means left-to-right scanning
  - second L/R means left-to-right/right-to-left derivation (leftmost/rightmost)
  - (k) = k-token lookahead
- LL(k) aka top-down: simplest
  - produces leftmost derivation top-down
  - LL(1): need to look ahead 1 token to determine which rule to apply (for infix, etc.)
  - animation: http://www.cs.wm.edu/~noonan/animations/llparse.html
  - ANTLR
  - LL is a subset of LR
- non-commutative LL(*): most hand-written parsers, combinators
  - memoized non-commutative LL(*): packrat parsers, scala 2.8 parser combinators
  - harder theoretical analysis vs. LR, but probably weaker
- LR(k)
  - k is the number of tokens to lookahead at the end of a rule
  - produces rightmost derivation in reverse bottom-up TODO
  - don't get used much; internal tables tend to get too big
  - handles left-recursion
- LALR(k)
  - YACC, bison
- SLR(k)
  - LR(0) + followsets TODO
grammar forms
- Backus-Naur Form (BNF): TODO
- Extended BNF (EBNF): TODO
determining ambiguity: uncomputable
- eg "dangling else" in C/C++/Java; convention is attach to nearest
  - if SLR/LR(1)/LALR LR, rely on generated parser feature of preferring shift over reduce if conflict
- LR (and thus LL) grammars are always non-ambiguous
- unsure: LR parsers can identify certain non-ambiguous grammars, but they have this gray area where they can't tell whether the grammar is ambiguous
- but at least GLR can notify you of this when it happens
parsing expression grammar (PEG)
- recursive descent (top-down) parsers: no left recursion
- packrat parsers
  - a memoizing parsers; use much more space TODO how much?
  - linear time (unlike GLR)
  - rewriting extension can handle left recursion
- no ambiguity because alternative productions are ordered
  - that's also a downside: can't tell when you have ambiguous grammar
- all LL grammars are PEGs but not vice versa
code generation
- division is slow integer operation
  - gcc (among others) uses trickery involving non-div instrs, but only works for constant divisors
  - http://research.swtch.com/2008/01/division-via-multiplication.html
superoptimizers: exhaustively search for truly optimal generated code sequence
language classes: low to high power
- regular: regexes; DFAs/NFAs
- context-free: don't need symbol table
  - python is context-free but req's tokenizer that recognizes (de-)indents
- context-sensitive: need symbol table; eg in C++, is T*x a type decl or a mult? ("most vexing parse")

Tools

parser generators
- lex/yacc: old and busted; LALR
- flex/bison: new hotness; LALR
  - generates C
  - does GLR
  - flex includes flex++ for C++; bison++ also exists
- javacc: most popular for java
- alex/happy: haskell
- ANTLR: LL(k) parser generator
edsl parsers
boost spirit 2: GLR

Grammars

terminals: loosely, can be used interchangeably with "token"
non-terminals
start symbol