Regular Expressions and Finite-State Automata

languages in computer science

An English sentence can be regarded as a string of words (단어들의 나열),

and an English word can be regarded as a string of letters (문자들의 나열).

•

Not every string of letters is a legitimate (올바른) word, and not every string of words is a grammatical sentence.

•

Legitimate word? ⇒ dictionary. Grammar? ⇒ grammar book.

Computer languages are similar to English in that

•

Certain strings of characters are legitimate words of the language and

•

Certain strings of words can be put together according to certain rules to form syntactically correct programs

A compiler for a computer language analyzes the stream of characters in a program

•

First to recognize individual word and sentence units (this part is called a lexical (어휘, 사전적인) scanner),

•

Then to analyze the syntax, or grammar, of the sentences (this part is called a syntactic analyzer),

•

and finally to translate the sentences into machine code (this part is called a code generator).

Example: a simple calculator language

Words

•

Valid: Numbers (123), operators (+, -), and parentheses ((, )).

•

Invalid: 12a3, ++.

Sentences

•

Valid: 3 + 4, (2 * 5) - 1.

•

Invalid: 3 +, 2 * (5 -

Calculator (Compiler) steps

•

Lexical scanning: break input into tokens.

– Input: 3 + 4 → Tokens: 3, +, 4.

•

Syntactic analysis: check grammar.

– Confirms: [number] [operator] [number].

•

Code Generator: Translate to action.

– Output: 7.

Formal language

A formal language over an alphabet is any set of strings of characters of the alphabet.

(알파벳상의 formal language, 형식언어. 형식언어: 특정한 법칙에 따라 구성된 문자열들의 집합)

Q. For the alphabet Σ = {a, b}, we saw that an language L1 can be defined as follows.

L1 over Σ: the set of all strings that begin with the character a and have length at most three

characters. = {a, aa, ab, aaa, aab, aba, abb}

b. A palindrome (회문) is a string that looks the same if the order of its characters is reversed.

For instance, aba and baab are palindromes (or level, noon, radar, ..).

Define a language L2 over Σ to be the set of all palindromes obtained using the characters of Σ.

Write ten elements of L2.

Set of all Strings

Note that Σn is essentially the Cartesian product of n copies of Σ.

•

Σ+ is the set of all strings over Σ except for λ and is called the positive closure of Σ. (or Kleene plus)

•

Σ∗ = Σ1 ∪ Σ2 ∪ Σ3 ∪…

The language Σ∗ is called the Kleene closure of Σ (or Kleene star)

•

Σ∗ = Σ0 ∪ Σ1 ∪ Σ2 ∪…

Null string is also a string!

Null string λ

•

String: alphabet의 원소(즉, character)들을 유한 번 나열한 것

•

Null string: 0번 나열한 것 (길이가 0인 String)  Null string도 string!

Empty Set ∅

•

Set: a collection of elements

•

Empty set: 원소의 개수가 0인 Set

Σ𝟎𝟎

•

길이가 0인 String의 집합

•

Σ0 = λ ≠ ∅

Combining Languages

Regular Expression

One of the most useful ways to define a language is by means of a regular expression

(정규식을 이용해서 language를 정의한다)

As an example, one regular expression over Σ = {a, b, c} is

a | (b | c)* | (ab)*

If the alphabet happens to include symbols such as ( | ) ∗, special provisions have to be made to avoid

ambiguity.

An escape character, usually a backslash, is added before the potentially ambiguous symbol.

•

E.g, a left parenthesis would be written as \( and the backslash itself would be written as \\.

To eliminate parentheses, an order of precedence for the operations used to define regular

expressions has been introduced: the highest is ∗, concatenation is next, and | is the lowest.

It is also customary to eliminate the outer set of parentheses in a regular expression, because doing so

does not produce ambiguity.

Thus, (a((bc)*)) = a(bc)* and (a | (bc)) = a | bc

Shorthands for regular expressions

A number of shorthand notations have been developed to facilitate working with regular expressions

in text processing.

•

[beginning character – ending character] is commonly used to represent the regular expression

that consists of a single character in the range from the beginning to the ending character.

It is called a character class.

Finite State Automata

Finite-state automata

A finite-state automaton is an idealized machine that processes input step by step.

•

Each piece of input to a finite-state automaton leads to a change in the state of the automaton,

which in turn affects how subsequent input is processed.

Imagine, for example, the act of dialing a telephone number. Dialing 010 puts the telephone system

in a state of readiness to receive the remaining 8 digits of a mobile phone number.

•

On the other hand, dialing 02 sets it up to expect 7 or 8 additional digits for a landline in Seoul.

Similarly, dialing 031 transitions the system to expect a Gyeonggi Province landline number,

requiring another 7 or 8 digits.

Ex) A simple Vending Machine

A simple vending machine dispenses bottles of juice that cost $1 each. The machine accepts quarters and half-dollars only and does not give change. As soon as the amount deposited equals or exceeds $1 the machine releases a bottle of juice. The next coin deposited starts the process over again.

•

a set of states, together with an indication about

•

which is the initial state and which are the accepting states (when something special happens),

•

a list of all input elements,

•

and specification for a next-state function that defines which state is produced by each input in each state.

The operation of a finite-state automaton is commonly described by a diagram called a (state-)

transition diagram

•

states are represented by circles

•

accepting states by double circles

•

There is one arrow that points to the initial state

•

Other arrows that are labeled with input symbols and

point from each state to other states to indicate

the action of the next-state function.

Specifically, an arrow from state s to state t labeled m means that N(s, m) = t.

The next-state table for an automaton shows the values of the next-state function N for all possible states s and input symbols i.

In the annotated next-state table, the initial state is indicated by an arrow and the accepting states are marked by double circles.

The language accepted by an automaton

Now suppose a string of input symbols is fed into a finite-state automation in sequence.

•

After each successive input symbol has changed the state, the automation ends up in a certain state

– Either an accepting state or a nonaccepting state

⇒ The action of a finite-state automation separates into two subsets

Those strings that send the automation to an accepting state are said to be accepted by the automation.