Wednesday, 26 October 2011

Left-recursion in Parsec

Lately I've been using the Parsec library for Haskell to write a parser and interpreter for a university assignment. Right-recursive grammars are trivial to parse with combinatorial parsers; tail recursion and backtracking make this simple. However, implementing a left-recursive grammar will often result in an infinite loop, as is the case in Parsec when using basic parsers.

Parsec does support left-recursion however. Unsatisfied with the lack of good tutorials when I googled for advice, I decided to write this. Hopefully it helps someone. If I can make this better or easier to understand, please let me know!

Left recursive parsing can be achieved in Parsec using chainl1.

chainl1 :: Stream s m t => ParsecT s u m a -> ParsecT s u m (a -> a -> a) -> ParsecT s u m a

As an example of how to use chainl1, I'll demonstrate its use in parsing basic integer addition and subtraction expressions.

First we'll need an abstract syntax tree to represent an integer expression. This can represent a single integer constant, or an addition / subtraction operation which involves two integer expressions.

data IntExp = Const Int
            | Add IntExp IntExp
            | Sub IntExp IntExp 

If addition and subtraction were to be right-associative, we'd parse the left operand as a constant, and attempt to parse the right operand as another integer expression. Upon failing, we'd backtrack and instead attempt to parse an integer constant. Reversing this approach to make the expressions left-associative would cause infinite recursion; we'd attempt to parse the left operand as an integer expression, which attempts to parse the left operand as an integer expression, which tries to... you get the point.

Instead we use chainl1 with two parsers; one to parse an integer constant, and another which parses a symbol and determines if the expression is an addition or subtraction.

parseIntExp :: Parser IntExp
parseIntExp =
  chainl1 parseConstant parseOperation

parseOperation :: Parser (IntExp -> IntExp -> IntExp)
parseOperation =
  do spaces
     symbol <- char '+' <|> char '-'
     case symbol of
       '+' -> return Add
       '-' -> return Sub

parseConstant :: Parser IntExp
parseConstant =
  do xs <- many1 digit
     return $ Const (read xs :: Int)

Here, parseOperation returns either the Add or Sub tag of IntExp. Using GHCi, you can confirm the type of Add as:

Add :: IntExp -> IntExp -> IntExp

So, we have a parser which will parse a constant and a parser which will parse a symbol and determine what type of operation an expression is. In parseIntExp, chainl1 is the glue which brings these together. This is what allows left-associative parsing without infinitely recursing.

A complete code sample is available here. The abstract syntax tree has been created as an instance of the Show typeclass to print in a more readable format, which shows that the grammar is indeed left-associative.

ghci>  run parseIntExp "2 + 3 - 4"

1 comment:

  1. Thanks. I found your article when searching about Parsec and left-recursion. I hadn't seen the `chainl1` before.

    For your particular example I thought the buildParserExpression might suit better.

    . term :: Parser Exp
    . term = buildExpressionParser table parseConstant
    . where
    . table = [[op "+" Add AssocLeft, op "-" Sub AssocLeft]]
    . op s f assoc = Infix (do { _ <- string s <* spaces; return f }) assoc
    . parseConstant :: Parser Exp
    . parseConstant = do
    . xs <- many1 digit <* spaces
    . return $ Const (read xs :: Int)

    However, in this case it seems pretty critical to only skip spaces after a token and not before.