I got this piece of code here, it is a program written in an imperative programming language structured in Haskell, so the question is \"how can I implement a lexer analyser and
I'll not write the whole thing for you, but I'll get you started with each bit. The three stages we'll go through are:
We could make a separate lexing stage between 2 and 3, but Parsec is happy to do both levels. Lexing is where you split the input into tokens - bits of the input that make sense - the equivalent of words in human language, which are also known as lexemes. Skipping a separate lexing phase means we need to be a little more explicit about whitespace etc.
First you need to define the grammar. Best done with paper and pencil, but I'll get you started:
program ::= line {[newline line]}
line ::= num dot whitespace statement
statement ::= declaration | write | ifstatement | goto | assignment | stop
declaration = "Int" whitespace varname equals int
varname = letter {[alphanum]}
-- more things here, including the more interesting ifstatement:
ifstatement = "if" whitespace expression whitespace equals expression whitespace statement
-- lower level stuff:
dot = "."
int = digit {[digit]}
digit = "0" | "1" | ... | "9"
whitespace = " " okspace | "\t" okspace
okspace = "" | whitespace
Have a think about how that matches your sample program, and think about how you'll finish it off:
1. Int n=5
2. write n
3. Int fac=1
4. if 0 n goto 8 -- unusual
5. fac := fac * n
6. n := n+1 -- complex
7. goto 4
8. write fac
9. stop
The if statement in line 4 is unusual because there's no =
or ==
in it. Perhaps that's to simplify the grammar and it can only accept single variables or integers, with a space between. Perhaps that's a typo and you mean to have an equals sign and arbitrary expressions. Find out which and rewrite the ifstatement
part of the grammar.
The assignment in line 6 is complex, because here you have to parse arbitrary arithmetic expressions. As far as I recall there are loads of examples of that knocking about, so I'll gladly skip that for now. Make it another question if you get stuck with that bit, but hopefully you'll have built up your parsing skills with the rest of it first.
An Abstract Syntax Tree represents the combination of tokens that make up your input. In Haskell we can define our own to suit the context, which will make life much simpler
I'm actually compiling this answer (a good way of checking for typos etc), which is why I need some declarations at the top of the code:
module ParseLang where
import Text.Parsec hiding (Line)
import Text.Parsec.String
import Control.Applicative hiding ((<|>), many)
We'll just make a Program
a list of Line
s, but enforce with the parser that there has to be at least one.
type Program = [Line]
For a Line
, it needs a number, and a statement, but the dot is just syntax we don't need to store. We can store the line number as an Int
, so although that allows negative numbers in the type declaration, again the parser won't accept negative.
data Line = Line {aNum::Int, aStatement :: Statement}
deriving Show
Multiple options are easy to define:
data Statement = Declaration VarName Int
| Write VarName
| IfStatement Expression Expression Statement
| Goto LineNumber
| Assignment VarName Expression
| Stop
deriving Show
Notice the absence of all the syntax cruft/connectives/equal signs leaving just the bits that change.
I'm stopping there - you can finish:
data Expression = Expression -- I've left this one for you
deriving Show
type VarName = String -- could use newtype for type safety for these to
type LineNumber = Int
The lower level syntax doesn't need representing in the AST because we'll use strings for it.
This bit is now nice and easy. Let's start at the bottom of the syntax tree and work up.
num :: Parser Int
num = read <$> many digit
We've used <$>
, which is a synonym for fmap
that we got by importing Control.Applicative
. Here it alters the values returned by the parser using the pure function on the left, in this case, read
. Have a look at this other answer for an introduction to fmap
if you're not used to it.
Let's make a parser that parses a string literal then some whitespace:
whitespace = space >> spaces -- a space then optional spaces
lit :: String -> Parser String
lit xs = string xs <* whitespace
Now that <*
is interesting. It looks like <*>
, which in effect combines two parsers, and is used in conjunction with <$>
which in effect maps a pure function onto the result. *>
and <*
combine two parsers, but ignore the output of one of them, so string "goto" <* whitespace
parses a "goto"
and some whitespace, but throws away the whitespace.
Now we're ready to parse a goto statement:
goto :: Parser Statement
goto = Goto <$> (lit "goto" *> num)
Now let's have a go at varName
varName :: Parser VarName
varName = (:) <$> letter <*> many (alphaNum <|> oneOf "'_")
A couple of things are going on there.
1. <|>
is alternative choice - one or the other, so (alphaNum <|> oneOf "'_")
accepts an alphanumeric character or one of that pair of innocent characters '
and _
you might want to include in a variable name.
2. f <$> parser1 <*> parser2
is a really nice way of combining parsers. It runs parser1, then parser2, then maps the function f
over the results f that they produced. It works with many parsers:
--ifstatement = "if" whitespace expression whitespace equals whitespace expression whitespace statement
ifStatement :: Parser Statement
ifstatement = IfStatement
<$> (lit "if" >> expression)
<*> (lit "=" >> expression) -- I put = in, but see below
<*> (whitespace >> statement) -- I'd be happier with a then in here
If you only allow a VarName
or Int
instead of a general Expression
, then you don't need the equals sign.
Here's how you put it together:
statement :: Parser Statement
statement = goto
<|> stop
<|> declaration
<|> write
<|> ifStatement
<|> assignment
--program ::= line {[newline line]}
program :: Parser Program
program = many1 line
--line ::= num dot whitespace statement
line :: Parser Line
line = Line <$> (num <* char '.') <*> (statement <* char '\n')
But I'll leave you with error messages every time you try to use a parser you've not finished yet, so the whole thing will compile OK and the bits you've defined should work.
stop = error "You've not yet defined stop"
declaration = error "You've not yet defined declaration"
write = error "You've not yet defined write"
ifStatement = error "You've not yet defined ifStatement"
assignment = error "You've not yet defined assignment"
expression = error "You've not yet defined expression"