2024考纲

Introduction

Master: the phases of a compiler.
Understand: what is a compiler. All the concepts about the compiler stages
You Should be familiar with each stage of the compiler, and describe the whole process in graph or text.

Lexical

Master: Write regular expression, the transition from regular expression to NFA, then to DFA, DFA minimization and the construction of scanner
Understand: Concept of regular expression, NFA, DFA
Be familiar with the expression of R.E.. Know the difference of R.E. and C.F.G, can translate each from the other. While given a R.E, know how to draw the NFA and DFA, and how to minimize it.

C.F.G.

Understand: Context-free grammar, Derivation, Parse tree, Abstract syntax tree, ambiguous grammar. Total under all the concepts.
Be familiar with the grammar of C.F.G, know how to make a left/right most derivation. Given a set of grammar and an expression, can draw the parse tree and abstract syntax tree.

Top-Down Parsing

Master: LL(1) grammar, Construction of Recursive-descent parsing, LL(1) parsing, Computing First set and Follow set
Given the C.F.G, know the left factor and left recursion removal, and know how to compute its first/ follow set, know to judge if it is a LL(1) grammar,and know to construct the parsing table.

Bottom-Up Paring:

Master: LR(0) parsing, SLR(1) parsing
Understand: Right sentential form, Viable prefix, Handle
It must be easy of you to write out the right sentential form of a given C.F.G and expression, and figure out the variable prefix and handles. Know the difference of LR(0) and SLR(1). Knowing when will the ambiguities happen. What’s more, given the C.F.G, be able to write out its LR(0) items, draw the DFA, construct the SLR(1) parsing table, and know how to parse a expression according to it in a stack table step by step.

Semantic:

Master: Dependency graphs, Algorithms for attribute computation
Understand: Attribute grammar, Synthesized and Inherited attributes. S-attributes, L-attributes.
Know the task of semantic analysis, and the implementation method of it. Know how to compute the attribute values by given equations. Know how to draw a dependency graph, and how to label out the attributes or values to a parsing graph.

Intermediate Code Generation:

Master: Intermediate code generation for basic structures:Three-address code; TAC for control structure, TAC for expression
Given context-free grammar and Attribute grammar, can translate the source code into TAC by SDT

考试题型 1)选择题 10题共20分;2)解答题 80分

编译器组成

alt text

lexical analysis 词法分析 (scanner)

RE

alt text

Compound Regular Expressions

  • If R1 and R2 are regular expressions, R1R2 is a regular expression represents the concatenation of the languages of R1 and R2.
  • If R1 and R2 are regular expressions,R1 | R2 is a regular expression representing the union of R1 and R2.
  • If R is a regular expression, R* is a regular expression for the Kleene closure of R.
  • If R is a regular expression, R+ is a regular expression for the positive closure of R.
  • If R is a regular expression, (R) is a regular expression with the same meaning as R.

示例

  • for strings of length exactly four
    • (0|1){4} = (0|1)(0|1)(0|1)(0|1)
  • for strings that contain at most one zero:
    • 1(0|ε)1 = 1*0?1*
  • All strings not containing the substring 101
    $0^\ast\left(1^\ast000^\ast\right)^\ast1^\ast0^\ast$
  • All strings having at least two occurrences of the substring 00
    $\left(0|1\right)^\ast00\left(0|1\right)^\ast00\left(0|1\right)^\ast|\left(0|1\right)^\ast000\left(0|1\right)^\ast$
  • Comments, consisting of a string surrounded by / and /, without an intervening */, unless it is inside double-quotes (“)
    $/o ( o^\ast( a | ” | ”o/” ) | / )^\ast o^+/$
    • a stands for all characters except * , and /
    • o stands for *

Identifier

  • Identifier: strings of letters or digits, starting with a letter
  • letter = ‘A’ | … | ‘Z’ | ‘a’ | … | ‘z’ or letter = [A-Za-z]
  • digit = [0-9]
  • identifier = letter (letter | digit) *
  • Is (letter | digit) the same as (letter | digit) * ?
    • NO

Nondeterministic Finite Automata(NFA)

RE->NFA

NFA has n states and m transitions;

  • Can determine whether a string of length k matches NFA in time O(k(n+m)).

conflict resolve

  • maximal munch rule
    • 最长子串匹配
  • adding priority rule
    • 设立优先级,关键字的优先级在identifier之上

      error report

  • Trick: Add a “catch-all” rule that matches any character and reports an error.
  • 啥也没有match到意味着发生错误;
  • 增加⼀台用于报错的机器,但是优先级设置为最低

    Deterministic Finite Automata(DFA)

    NFA->DFA

Every state must have exactly one transition defined for every letter.

  • ε-moves are not allowed.

Two problems need to be solved in translation
1) Eliminate ε-transition
2) Eliminate multiple transitions from a state on a single character

Subset Construction 子集构造法

alt text

alt text

alt text

Minimizing Algorithm

alt text

Syntax Analysis 语法分析 (parsing)

alt text

• Goal: Recover the structure described by that series of tokens.
• Goal: Report errors if those tokens do not properly encode a structure.

Specification of syntax structure :

  • Context-free grammar (CFG)
  • Method of turning grammar rules into code for parsing
    • Top-down parsing method
    • Bottom-up parsing method

      Formalisms fot syntax analysis

Language of CFG

RE 的构造能力有限,不能描述所有的语言。CFG可以描述更多的语言。使⽤context- free grammer能够展示信息以及构造结构的能⼒⽐regular expression强的多。

Definition: A context-free grammar G= $(V_T,V_N,P,S)$:

alt text

alt text
alt text

RE vs CFG

CFG 中不能使用Kleene闭包,因为这样会导致无限递归。但是可以使用递归的产生式。

Derivation 推导

alt text

Sentence and Sentential Form 句子和句型

  • Sentence: A string of terminal symbols that can be derived from the start symbol.
  • Sentential form: A string of terminal and nonterminal symbols that can be derived from the start symbol.
  • 右句型 right sentential form: A sentential form that can be derived from the start symbol by a rightmost derivation.
    alt text

    parse tree and abstract syntax tree

    要将解析树转化成为抽象语法树
  • Function of parse trees
    • A parse tree is a useful representation of the structure of a string of tokens
    • Parse trees represent derivations visually

alt text

Derivations and Parse Trees

alt text

  • A parse tree has
    • Terminals at the leaves
    • Non-terminals at the interior nodes
  • A left-right traversal of the leaves is the original input
  • The parse tree shows the association of operations, the input string does not !
  • There may be multiple ways to match the input
  • Derivations (and parse trees) choose one
  • A parse tree of a string corresponds in general to many derivations of the string
  • Derivations do not uniquely represent the structure of the strings they construct, while parse trees do
  • Parse trees abstract the essential features of derivations while factoring out superficial difference in order

    Leftmost Derivation

    Leftmost derivation
  • A derivation in which the leftmost nonterminal is replaced at each step in the derivation
  • It corresponds to a preorder traversal of the parse tree

    Rightmost derivation

  • Rightmost derivation
  • A derivation in which the rightmost non-terminal is replaced at each step in the derivation
  • It corresponds to the reverse of a postorder traversal of the parse tree

    Abstract Syntax Tree

    The need of abstract syntax tree
  • A parse tree contains much more information than is absolutely necessary for a compiler to produce executable code
    alt text
    alt text
    alt text

    ambiguity 二义性

    对于grammer G,某⼀个string w拥有多于⼀个的不同的分析树(最左或者最右推导)那么就说这个grammar具有⼆义性。二义性来源于文法设计的不当。

    如何处理二义性

    处理⼆义性的办法其实很简单,就是改变语法使之正确
  • Disambiguating rule 添加规则
    • Example
      • Rule 1: 优先级 Enforces precedence of * over –
      • Rule 2: 结合性 左联系 is left-associative
      • Rule 3:Most closely nested rule 就近嵌套原则
  • Rewriting the grammar 重写语法
    alt text
    alt text
    alt text

Left Recursion

什么是左递归,什么是右递归
alt text

Elimination of Left Recursion

alt text

Left Factoring

alt text

parsing algorithm

Top-down parsing 自顶向下

Top-down parsing begins with virtually no information.

  • Begins with just the start symbol, which matches every program

An idea: treat parsing as a graph search

  • Each node is a sentential form (a string of terminals and nonterminals derivable from the start symbol).
  • There is an edge from node α to node β iff α ⇒ β.

多对多的关系,用图表示。节点是句子和句型,边是推导关系。Parsing as a Search

BFS

BFS的问题

Enormous time and memory usage:

  • Lots of wasted effort: 大量的无用工作
    • Generates a lot of sentential forms that couldn’t possibly match.
    • But in general, extremely hard to tell whether a sentential form can match – that’s the job of parsing!
  • High branching factor: 大量的分支
    • Each sentential form can expand in (potentially) many ways for each nonterminal it contains.

      Leftmost BFS

      Leftmost DFS

      summary of Leftmost BFS/DFS

      alt text

      backtracking 回溯

      BFS剪枝,时间和空间开销大。

用BFS:进行图遍历搜索,复杂,时间复杂度过高,产生大量无用分支,时间和空间的最差情况都是指数级别。现代编译器中不被使用

剪枝:由终止符号做前缀时,如果无法与输入的前缀匹配则剪枝。(时,无法剪枝,因为前缀一直是非终止符号)

用DFS:有比BFS更好的空间复杂度和时间复杂度,但是无法匹配左递归,因为会一直循环。

predictive parsing 预测性分析

  • The leftmost DFS/BFS algorithms are backtracking algorithms.
    • Guess which production to use, then back up if it doesn‘t work.
    • Try to match a prefix by sheer dumb luck.
  • There is another class of parsing algorithms called predictive algorithms.
    • Based on remaining input, predict (without backtracking) which production to use.

Two Kinds of Predictive Parsing

  • Recursive-descent parsing
  • LL(1) parsing

Idea:利用先行词(lookahead tokens),也就是上面提到过的终止符前缀

从输入串和文法的开始符号开始分析

以当前input的token来独⼀⽆⼆地决定下⼀个被使⽤的production,parsing是预测性

预测分析文法包括LL(k)文法,其中L表示从左向右扫描,L表示最左推导,k表示“需要k个先行词用于预测”

LL(1)文法是常用的,也不完全常用。

LL(1)

It’s definition depends on the definition of lookahead sets—First set and Follow set.

  • lookahead sets: 用于预测的token集合。
    first 和 follow
  • first set: 一个非终结符号的first set是它的所有产生式的first set的并集。

Definition:

  • follow set: 一个非终结符号的follow set是它的所有产生式的follow set的并集。

Definition:

判断一个文法是否是LL(1)文法:
  • 可空的非终止符(nullable nonterminal), 如果一个非终止符号的产生式中有$\epsilon$,那么这个非终止符号是可空的。

  • 计算产生式右侧所有的First(a),并验证对于产生式$A \rightarrow \alpha | \beta $,如果不存在$\beta = \epsilon$,则验证FIRST($\alpha$)和FIRST($\beta$)两两交集是否为空。如果不为空,则不是LL(1)文法。

  • 只有当一个非终结符号的first set和follow set没有交集时,这个文法才是LL(1)文法。
  • 如果一个非终止符号的产生式中有$\epsilon$,那么需要考察这个非终止符号的follow set。First(A)和Follow(A)的交集为空才是LL(1)文法。

alt text
例1:
alt text
例2:
alt text
典型的非LL(1)文法:

  • if a grammar has left factor(左因子) or left recursion(左递归), it is not LL(1).
  • 有左公因⼦或者左递归,⼀定不是LL(1)型grammar
  • 没有左公因⼦或者左递归也不⼀定是LL(1)算法

    Recursive-descent parsing 递归下降分析

    alt text

    构造递归下降分析器+LL1 parsing

    alt text
    alt text

  • 分析表

    • 一般的,根据FIRST集,分别填上能够推导出对应符号的式子;
    • 文法中可以推导出空集的非终止符的:就把follow集中的每一个元素和这个非终止符的对应的cell里填上推导出空的式子;

alt text

  • 分析栈怎么写?:

    • 分析栈要基于分析表进行填写。stack和input的最左边符号就定位了分析表中的某个cell,其中的产生式就是Action的内容。

      错误处理

      Error Recovery in Top-down Parsers
  • Error Recovery: after an error has occurred, the parser picks a likely place to resume the parsing

  • Error Repair: the parser attempts to infer a correct program from the incorrect one given

    Error Recovery in Recursive-Descent Parsers:Panic Mode

    这种模式的主要目的是尽可能多地发现和报告源代码中的错误,而不仅仅是停在第一个遇到的错误。

    • Discard input symbols until a synchronizing token is found
    • Synchronizing tokens are those that can legally follow the error
  • In complex situations, the error handler will consume a possibly large number of tokens in an attempt to find a place to resume parsing

  • Phrase-level Recovery

    Bottom-up parsing 自底向上

    掌握:LR(0)语法分析,SLR(1)语法分析
    理解:正确的句型,可行的前缀,句柄
    对于给定的C.F.G和表达式,你一定能很容易地写出正确的句型,并找出可变前缀和句柄。知道LR(0)和SLR(1)的区别。知道什么时候会出现歧义。此外,给定C.F.G,能够写出它的LR(0)项,画出DFA,构造SLR(1)解析表,并知道如何根据它在堆栈表中一步步解析表达式。

  • 右句型:句型的一种,句型的最右边是一个非终结符号。

  • 最右推导:从开始符号推导出句型的过程。
  • handle:在推导过程中,句型的一部分,跟产生式右部所匹配的部分。(在右边)
  • 从输入串开始,逐步推导出文法的开始符号。
  • 归约:推导的逆过程,将handle归约回产生式左侧的非终结符号。(此时在分析栈中,handle所处位置是顶部
  • 将右句型一分为二分为左部/右部,handle在左部的最右边。左部被称为work area,所有的handle都在work area中。
  • 用一个分析栈来存储已经分析过的部分,栈底是文法的开始符号。top-down的分析栈中是待分析的内容
  • shift-reduce parsing: 从左到右扫描输入串,shift是将输入串的一个token放入栈中,reduce是将handle归约为产生式左部的非终结符号。
  • 什么时候规约?做预测。这是后续算法解决的问题
  • 可行前缀:找到一个handle之前,在栈里出现过的所有串都可称为可行前缀。如果前缀是可行的,证明分析也是对的,所以也称为“活前缀”

    LR parsing

    自底向上的分析方法,最需要解决的问题是如何正确识别句柄。句柄是逐步产生的,
  • 状态:

    • 移进状态:$S \rightarrow \cdot bBB$
    • 待规约状态: $S \rightarrow b\cdot BB$ $S \rightarrow bB\cdot B$
    • 已经规约状态: $S \rightarrow bBB \cdot$
  • LR(0) parsing: 用于解决shift-reduce冲突的问题。
    alt text

  • Parsing Table:

    • top-down的分析表:选择产生式的规则
    • bottom-up的分析表:选择规约的产生式

alt text

  • 增广文法 Augmented Grammar

    • 将文法的开始符号G改为新的符号G’,然后添加一个新的产生式 $G’ \rightarrow G$,将新的符号替换为原来的开始符号。
    • 目的是让开始符号只出现在一个产生式的左部,这样可以保证在规约时,栈中只有一个符号,分析器只有一个接收状态。
  • 项目

    • 项目是对文法的产生式的扩展,包括一个点,表示产生式的位置,描述规约状态。
    • 等价项目应该放在同一个项目集闭包中。

alt text

LR(0) parsing的问题:
  • 无法处理shift-reduce冲突
    alt text
  • 无法处理reduce-reduce冲突
    alt text

LR(0) only accepts languages where the handle can be found with no right context.
• Our shift/reduce parser only looks to the left of the handle, not to the right.
• How do we exploit the tokens after a possible handle to determine what to do?

SLR(1) parsing

S(simple)LR(1) parsing: “ simple “ 在借助 FOLLOW SET 就可以解决冲突。

alt text
alt text
基本思想:看当前的handle对应的非终止符号(产生式的左部)后面允许出现哪些终止符号

判断是否为SLR(1)文法

alt text
满足以下条件的文法是SLR(1)文法:

  1. 存在shift-reduce冲突
  2. 冲突能够通过FOLLOW集解决
    SLR分析存在的问题
    SLR只是简单地考察下一个输入符号是否属于规约项目 $A \rightarrow \alpha$ 相关的FOLLOW(A),但是 $b \in FIRST(\alpha)$,只是规约的一个必要条件,而非充分条件。

alt text
也就是集合${a_1,a_2,…,a_m}$和$FOLLOW{B_n}$的交集不为空,下一个输入的字符在此之中,那么就会出现冲突。

SLR分析表

alt text
alt text

SLR分析栈

(正确:only state symbols at the top of the stack)
alt text

LR(1) parsing

SLR能帮我们排除一些不可能的规约,但是不能保证所有的规约都是正确的。

alt text
alt text
alt text

LALR parsing

LR1的项目集太多,LALR1是对LR1的优化,将LR1的项目集合并,减少项目集的数量。

语义分析

掌握:依赖图,属性计算算法
理解:属性语法,合成和继承属性。S-属性,L-属性。了解语义分析的任务及其实现方法。知道如何通过给定的方程式计算属性值。知道如何绘制依赖图,以及如何将属性或值标注到解析图中。

AST vs. Parse Tree

AST 是解析树的压缩形式

  • 运算符出现在内部节点,而不是叶子节点。
  • 列表是“扁平化的”。
  • 省略了语法细节

    • 例如括号、逗号、分号
  • AST 是编译器后期阶段的更好结构

    • 抽象语法树表示实际源代码标记序列的抽象
    • 尽管如此,它们包含转换所需的所有信息。比解析树更有效
    • 解析器将执行解析树所表示的所有步骤,但通常只会构建抽象语法树

alt text

SDD (Syntax-Directed Definition) 语法制导定义

  • 语法制导定义是一种用于描述语法制导翻译的形式化方法。
  • 语法制导定义是对CFG的推广,
    • 每个文法符号都和一个语义属性集合相关联
    • 每个产生式都与一组语义规则相关联,这些规则用于计算产生式各个文法符号的属性值。
  • 语法制导定义的属性可以是综合属性或继承属性。
    • 综合属性是由产生式的右侧的属性计算得到的属性,分析树中N节点上的非终结符A的综合属性只能通过N的子节点的属性或者N节点本身的属性值来定义。
      • 终结符可以有综合属性,是由词法器提供的词法值。
    • 继承属性是由产生式的左侧的属性计算得到的属性,分析树中N节点上的非终结符A的由N的父节点或者N的兄弟节点的属性或者N本身的属性值计算得到的属性。
      • 终结符没有继承属性。

属性文法

SDD的求值顺序

S-SDD

  1. S-attribute grammar is the attributed grammar, of which.
    A. each attribute must be synthesized.
    仅仅使用综合属性的SSD称为S-SDD。

如果一个SDD是S-SDD,那么它的属性计算顺序是自底向上的。

L-SDD

L-SDD当且仅当它的每个属性要么是综合属性,要么是满足以下条件的继承属性:

  • 假设存在一个产生式$A \rightarrow X_1X_2…X_n$, 右边的符号$X_1X_2…X_n$的继承属性仅依赖于
    • A的继承属性
    • $Xi$左边的符号$X_1X_2…X{i-1}$的属性
    • $X_i$的本身的属性,但是$X_i$的属性不能在依赖图中形成环路。

SDT (Syntax-Directed Translation) 语法制导翻译

  • 可以看作是对语法制导定义的实现。
  • 显式地指明了语义规则的计算顺序。

    parse tree

    alt text

    Dependency Graph

    alt text

    控制流SDT

中间代码生成

掌握:基本结构的中级代码生成:三地址代码;控制结构的TAC,表达式的TAC
给定上下文无关语法和属性语法,能够通过SDT将源代码翻译成TAC

什么是 IR 生成?

• Intermediate Representation (IR) Generation
• 编译器前端的最后阶段。
• 目标:将程序转换为编译器后端所需的格式。
• 生成的代码无需优化;这由后续过程处理。
• 生成的代码不需要汇编;这也可以由后续过程处理。

  • 中间代码的必要性
    抽象语法树与目标代码不同,特别是在控制流结构的表示上
    中间代码以顺序形式表示语法树,更接近目标代码
  • 中间代码的流行形式:
    • 三地址代码

      三地址代码

      The most basic instruction of three address code has the general form $x=y op z$
      Which represents the evaluation of expressions
  • x,y,z are names, constants or compiler-generated temporary names
  • op stands for any arithmetic or logical operator, such as + ,’and’
  • “Three-address code” comes from this form of instruction, in general each of x,y and z represents an address in memory

    logical expression

    alt text

    Control Flow Statements

  • 三地址代码的基本结构是三地址指令,控制结构的三地址代码是基于基本结构的。

alt text

Translation of Boolean expressions in the context of control statements

  • Translation Method
    • The value of Boolean expression is not saved in a temporary but represented by a position reached in a program

alt text

  • 代码生成中使用的函数
    • newlabel() 每次调用时都会返回一个新标签、
  • Attributes
    • E true (E false) 是当 E 为 true(false)时 控制流向的标签
    • S.next 是附在S的代码后执行的第一个三地址指令上的标签
    • S.begin 是附在S的代码生成的第一个指令上的标签

alt text

alt text
alt text
alt text
alt text

triple implementation

quadruple implementation

作业