-
TODO: distinguish grammars from parsers more cleanly in these notes
-
compiler types
- recursive descent
- parses LL(k)
- simplest to hand-code; simply recursive functions that backtrack
- non-predictve; requires backtracking
- most embedded parser languages are this (e.g. haskell's parsec monadic
parser), unless the embedded language constructs the data structures fed
into a "more separate" parsing
- predictive grammars
- use $k$ tokens of lookahead to fully determine which production to use
- no backtracking needed
- top-down operator precedence (pratt) parser
-
http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
-
basic idea: allow yourself to call into main parse(precedence) from
anywhere, and the precedence is always checked to be non-decreasing
-
also controls associativity
-
pseudocode
def parse(precedence):
token = next()
p = prefix_parsers[token.type]
parsed = p.parse(token)
while True:
token = next()
p = infix_parsers[token.type]
if p is None or precedence < token.precedence: break
parsed = p.parse(parsed, token)
return parsed
-
parsing types
- shift-reduce parsing: a type of bottom-up parsing
-
grammar types (hierarchy)
- bottom-up
- LR
- LR(0)
- SLR(1!)
- LALR(1)
- LR(1)
- LR(n)
-
grammar types (details)
- general
- first L means left-to-right scanning
- second L/R means left-to-right/right-to-left derivation
(leftmost/rightmost)
- (k) = k-token lookahead
- LL(k) aka top-down: simplest
- non-commutative LL(*): most hand-written parsers, combinators
- memoized non-commutative LL(*): packrat parsers, scala 2.8 parser
combinators
- harder theoretical analysis vs. LR, but probably weaker
- LR(k)
- k is the number of tokens to lookahead at the end of a rule
- produces rightmost derivation in reverse bottom-up TODO
- don't get used much; internal tables tend to get too big
- handles left-recursion
- LALR(k)
- SLR(k)
-
grammar forms
- Backus-Naur Form (BNF): TODO
- Extended BNF (EBNF): TODO
-
determining ambiguity: uncomputable
- eg "dangling else" in C/C++/Java; convention is attach to nearest
- if SLR/LR(1)/LALR LR, rely on generated parser feature of preferring
shift over reduce if conflict
- LR (and thus LL) grammars are always non-ambiguous
- unsure: LR parsers can identify certain non-ambiguous grammars, but they
have this gray area where they can't tell whether the grammar is ambiguous
- but at least GLR can notify you of this when it happens
-
parsing expression grammar (PEG)
- recursive descent (top-down) parsers: no left recursion
- packrat parsers
- a memoizing parsers; use much more space TODO how much?
- linear time (unlike GLR)
- rewriting extension can handle left recursion
- no ambiguity because alternative productions are ordered
- that's also a downside: can't tell when you have ambiguous grammar
- all LL grammars are PEGs but not vice versa
-
code generation
- division is slow integer operation
-
superoptimizers: exhaustively search for truly optimal generated code
sequence
-
language classes: low to high power
- regular: regexes; DFAs/NFAs
- context-free: don't need symbol table
- python is context-free but req's tokenizer that recognizes (de-)indents
- context-sensitive: need symbol table; eg in C++, is
T*x
a type decl or a
mult? ("most vexing parse")