Programming Language Syntax

Prev: Introduction Next: Names, Scopes, and Bindings

Prev: Introduction Next: Names, Scopes, and Bindings

2.1 Check Your Understanding

  1. What is the difference between syntax and semantics?
  2. What are the three basic operations that can be used to build complex regular expressions from simpler regular expressions?
  3. What additional operation (beyond the three of regular expressions) is provided in context-free grammars?
  4. What is Backus-Naur form? When and why was it devised?
  5. Name a language in which indentation affects program syntax.
  6. When discussing context-free languages, what is a derivation? What is a sentential form?
  7. What is the difference between a right-most derivation and a left-most derivation?
  8. What does it mean for a context-free grammar to be ambiguous?
  9. What are associativity and precedence? Why are they significant in parse trees?

2.2 Check Your Understanding

  1. List the tasks performed by the typical scanner.
  2. What are the advantages of an automatically generated scanner, in comparison to a handwritten one? Why do many commercial compilers use a handwritten scanner anyway?
  3. Explain the difference between deterministic and nondeterministic finite automata. Why do we prefer the deterministic variety for scanning?
  4. Outline the constructions used to turn a set of regular expressions into a minimal DFA.
  5. What is the “longest possible token” rule?
  6. Why must a scanner sometimes “peek” at upcoming characters?
  7. What is the difference between a keyword and an identifier?
  8. Why must a scanner save the text of tokens?
  9. How does a scanner identify lexical errors? How does it respond?
  10. What is a pragma?

2.3.1-2.3.3 Check Your Understanding

  1. What is the inherent “big-O” complexity of parsing? What is the complexity of parsers used in real compilers?
  2. Summarize the difference between LL and LR parsing. Which one of them is also called “bottom-up”? “Top-down”? Which one is also called “predictive”? “Shift-reduce”? What do “LL” and “LR” stand for?
  3. What kind of parser (top-down or bottom-up) is most common in production compilers?
  4. Why are right-most derivations sometimes called canonical?
  5. What is the significance of the “1” in LR(1)?
  6. Why might we want (or need) different grammars for different parsing algorithms?
  7. What is an epsilon production?
  8. What are recursive descent parsers? Why are they used mostly for small languages?
  9. How might a parser construct an explicit parse tree or syntax tree?
  10. Describe two common idioms in context-free grammars that cannot be parsed top-down.
  11. What is the “dangling else” problem? How is it avoided in modern languages?
  12. Discuss the similarities and differences between recursive descent and table-driven top-down parsing.
  13. What are FIRST and FOLLOW sets? What are they used for?
  14. Under what circumstances does a top-down parser predict the production A -> alpha?
  15. What sorts of “obvious” facts form the basis of FIRST set and FOLLOW set construction?
  16. Outline the algorithm used to complete the construction of FIRST and FOLLOW sets. How do we know when we are done?
  17. How do we know when a grammar is not LL(1)?

2.3.4 Check Your Understanding

  1. What is the handle of a right sentential form?
  2. Explain the significance of the characteristic finite-state machine in LR parsing.
  3. What is the significance of the dot (.) in an LR item?
  4. What distinguishes the basis from the closure of an LR state?
  5. What is a shift-reduce conflict? How is it resolved in the various kinds of LR-family parsers?
  6. Outline the steps performed by the driver of a bottom-up parser.
  7. What kind of parser is produced by yacc/bison? By ANTLR?
  8. Why are there never any epsilon productions in an LR(0) grammar?

2.6 Exercises

2.1

Write regular expressions to capture the following.

  • (a) Strings in C. These are delimited by double quotes ("), and may not contain newline characters. They may contain double-quote or backslash characters if and only if those characters are “escaped” by a preceding backslash. You may find it helpful to introduce shorthand notation to represent any character that is not a member of a small specified set.
  • (b) Comments in Pascal. These are delimited by (* and *) or by { and }. They are not permitted to nest.
  • (c) Numeric constants in C. These are octal, decimal, or hexadecimal integers, or decimal or hexadecimal floating-point values. An octal integer begins with 0, and may contain only the digits 0-7. A hexadecimal integer begins with 0x or 0X, and may contain the digits 0-9 and a/A-f/F. A decimal floating-point value has a fractional portion (beginning with a dot) or an exponent (beginning with E or e). Unlike a decimal integer, it is allowed to start with 0. A hexadecimal floating-point value has an optional fractional portion and a mandatory exponent (beginning with P or p). In either decimal or hexadecimal, there may be digits to the left of the dot, the right of the dot, or both, and the exponent itself is given in decimal, with an optional leading + or - sign. An integer may end with an optional U or u (indicating “unsigned”), and/or L or l (indicating “long”) or LL or ll (indicating “long long”). A floating-point value may end with an optional F or f (indicating “float” - single precision) or L or l (indicating “long” - double precision).
  • (d) Floating-point constants in Ada. These match the definition of real in Example 2.3, except that (1) a digit is required on both sides of the decimal point, (2) an underscore is permitted between digits, and (3) an alternative numeric base may be specified by surrounding the nonexponent part of the number with pound signs, preceded by a base in decimal (e.g. 16#6.a7#e+2). In this latter case, the letters a...f (both upper- and lowercase) are permitted as digits. Use of these letters in an inappropriate (e.g. decimal) number is an error, but need not be caught by the scanner.
  • (e) Inexact constants in Scheme. Scheme allows real numbers to be explicitly inexact (imprecise). A programmer who wants to express all constants using the same number of characters can use sharp signs (#) in place of any lower-significance digits whose values are not known. A base-10 constant without exponent consists of one or more digits followed by zero or more sharp signs. An optional decimal point can be placed at the beginning, the end, or anywhere in-between. For the purposes of this exercise, ignore sign, exponent, radix, exactness and length specifiers, and complex or rational values.
  • (f) Financial quantities in American notation. These have a leading dollar sign ($), an optional string of asterisks (*, used on checks to discourage fraud), a string of decimal digits, and an optional fractional part consisting of a decimal point (.) and two decimal digits. The string of digits to the left of the decimal point may consist of a single zero (0). Otherwise it must not start with a zero. If there are more than three digits to the left of the decimal point, groups of three (counting from the right) must be separated by commas (,). Example: $**2,345.67. Feel free to use productions to define abbreviations, so long as the language remains regular.

2.2

Show, as circles-and-arrows diagrams, the finite automata for Exercise 2.1.

2.3

Build a regular expression that captures all nonempty sequences of letters other than file, for, and from. For notational convenience, you may assume the existence of a not operator that takes a set of letters as argument and matches any other letter. Comment on the practicality of constructing a regular expression for all sequences of letters other than the keywords of a large programming language.

2.4

  • (a) Show the NFA that results from applying the construction of Figure 2.7 to the regular expression letter(letter|digit)*.
  • (b) Apply the transformation illustrated by Example 2.14 to create an equivalent DFA.
  • (c) Apply the transformation illustrated by Example 2.15 to minimize the DFA.

2.5

Starting with the regular expressions for integer and decimal in Example 2.3, construct an equivalent NFA, the set-of-subsets DFA, and the minimal equivalent DFA. Be sure to keep separate the final states for the two different kinds of token. You may find the exercise easier if you undertake it by modifying the machines in Examples 2.13 through 2.15.

2.6

Build an ad hoc scanner for the calculator language. As output, have it print a list, in order, of the input tokens. For simplicity, feel free to simply halt in the event of a lexical error.

2.7

Write a program in your favorite scripting language to remove comments from programs in the calculator language (Example 2.9).

2.8

Build a nested-case-statements finite automaton that converts all letters in its input to lower case, except within Pascal-style comments and strings. A Pascal comment is delimited by { and }, or by (* and *). Comments do not nest. A Pascal string is delimited by single quotes. A quote character can be placed in a string by doubling it. This upper-to-lower mapping can be useful if feeding a program written in standard Pascal, which ignores case, to a compiler that considers upper- and lowercase letters to be distinct.

2.9

  • (a) Describe in English the language defined by the regular expression a*(ba*ba*)*. Your description should be a high-level characterization, one that would still make sense if we were using a different regular expression for the same language.
  • (b) Write an unambiguous context-free grammar that generates the same language.
  • (c) Using your grammar from part (b), give a canonical (right-most) derivation of the string b a a b a a a b b.

2.10

Give an example of a grammar that captures right associativity for an exponentiation operator, e.g. ** in Fortran.

2.11

Prove that the following grammar is LL(1):

decl -> ID decl_tail
decl_tail -> , decl
           -> : ID ;

The final ID is meant to be a type name.

2.12

Consider the following grammar:

G -> S $$
S -> A M
M -> S | epsilon
A -> a E | b A A
E -> a B | b A | epsilon
B -> b E | a B B
  • (a) Describe in English the language that the grammar generates.
  • (b) Show a parse tree for the string a b a a.
  • (c) Is the grammar LL(1)? If so, show the parse table; if not, identify a prediction conflict.

2.13

Consider the following grammar:

stmt -> assignment
      -> subr_call
assignment -> id := expr
subr_call -> id ( arg_list )
expr -> primary expr_tail
expr_tail -> op expr
           -> epsilon
primary -> id
         -> subr_call
         -> ( expr )
op -> + | - | * | /
arg_list -> expr args_tail
args_tail -> , arg_list
           -> epsilon
  • (a) Construct a parse tree for the input string foo(a, b).
  • (b) Give a canonical (right-most) derivation of this same string.
  • (c) Prove that the grammar is not LL(1).
  • (d) Modify the grammar so that it is LL(1).

2.14

Consider the language consisting of all strings of properly balanced parentheses and brackets.

  • (a) Give LL(1) and SLR(1) grammars for this language.
  • (b) Give the corresponding LL(1) and SLR(1) parsing tables.
  • (c) For each grammar, show the parse tree for ([]([]))[](()).
  • (d) Give a trace of the actions of the parsers in constructing these trees.

2.15

Consider the following context-free grammar:

G -> G B
   -> G N
   -> epsilon
B -> ( E )
E -> E ( E )
   -> epsilon
N -> ( L ]
L -> L E
   -> L (
   -> epsilon
  • (a) Describe, in English, the language generated by this grammar. Hint: B stands for balanced; N stands for nonbalanced. Your description should be independent of the particular grammar chosen.
  • (b) Give a parse tree for the string (( ]( ).
  • (c) Give a canonical (right-most) derivation of this same string.
  • (d) What is FIRST(E) in our grammar? What is FOLLOW(E)?
  • (e) Given its use of left recursion, our grammar is clearly not LL(1). Does this language have an LL(1) grammar? Explain.

2.16

Give a grammar that captures all levels of precedence for arithmetic expressions in C, as shown in Figure 6.1. Hint: this exercise is somewhat tedious; you’ll probably want to attack it with a text editor rather than a pencil.

2.17

Extend the grammar of Figure 2.25 to include if statements and while loops, along the lines suggested by the following examples:

abs := n
if n < 0 then abs := 0 - abs fi
 
sum := 0
read count
while count > 0 do
    read n
    sum := sum + n
    count := count - 1
od
write sum

Your grammar should support the six standard comparison operations in conditions, with arbitrary expressions as operands. It should also allow an arbitrary number of statements in the body of an if or while statement.

2.18

Consider the following LL(1) grammar for a simplified subset of Lisp:

P -> E $$
E -> atom
   -> ' E
   -> ( E Es )
Es -> E Es
    -> epsilon
  • (a) What is FIRST(Es)? FOLLOW(E)? PREDICT(Es -> epsilon)?
  • (b) Give a parse tree for the string (cdr '(a b c)) $$.
  • (c) Show the left-most derivation of (cdr '(a b c)) $$.
  • (d) Show a trace, in the style of Figure 2.21, of a table-driven top-down parse of this same input.
  • (e) Now consider a recursive descent parser running on the same input. At the point where the quote token is matched, which recursive descent routines will be active, i.e. what routines will have a frame on the parser’s run-time stack?

2.19

Write top-down and bottom-up grammars for the language consisting of all well-formed regular expressions. Arrange for all operators to be left-associative. Give Kleene closure the highest precedence and alternation the lowest precedence.

2.20

Suppose that the expression grammar in Example 2.8 were to be used in conjunction with a scanner that did not remove comments from the input, but rather returned them as tokens. How would the grammar need to be modified to allow comments to appear at arbitrary places in the input?

2.21

Build a complete recursive descent parser for the calculator language. As output, have it print a trace of its matches and predictions.

2.22

Extend your solution to Exercise 2.21 to build an explicit parse tree.

2.23

Extend your solution to Exercise 2.21 to build an abstract syntax tree directly, without constructing a parse tree first.

2.24

The dangling else problem of Pascal was not shared by its predecessor Algol 60. To avoid ambiguity regarding which then is matched by an else, Algol 60 prohibited if statements immediately inside a then clause. The Pascal fragment

if C1 then if C2 then S1 else S2

had to be written as either

if C1 then begin if C2 then S1 end else S2

or

if C1 then begin if C2 then S1 else S2 end

in Algol 60. Show how to write a grammar for conditional statements that enforces this rule. Hint: you will want to distinguish in your grammar between conditional statements and nonconditional statements; some contexts will accept either, some only the latter.

2.25

Flesh out the details of an algorithm to eliminate left recursion and common prefixes in an arbitrary context-free grammar.

2.26

In some languages an assignment can appear in any context in which an expression is expected: the value of the expression is the right-hand side of the assignment, which is placed into the left-hand side as a side effect. Consider the following grammar fragment for such a language. Explain why it is not LL(1), and discuss what might be done to make it so.

expr -> id := expr
     -> term term_tail
term_tail -> + term term_tail | epsilon
term -> factor factor_tail
factor_tail -> * factor factor_tail | epsilon
factor -> ( expr ) | id

2.27

Construct the CFSM for the id_list grammar in Example 2.20 and verify that it can be parsed bottom-up with zero tokens of look-ahead.

2.28

Modify the grammar in Exercise 2.27 to allow an id_list to be empty. Is the grammar still LR(0)?

2.29

Repeat Example 2.36 using the grammar of Figure 2.15.

2.30

Consider the following grammar for a declaration list:

decl_list -> decl_list decl ; | decl ;
decl -> id : type
type -> int | real | char
     -> array const .. const of type
     -> record decl_list end

Construct the CFSM for this grammar. Use it to trace out a parse, as in Figure 2.30, for the following input program:

foo : record
         a : char;
         b : array 1 .. 2 of real;
      end;

2.31-2.37

In More Depth.