In this post, I show a concise and compositional implementation of a regular expression matcher in continuation-passing style. The program is written in the Racket language.
First, we introduce a small pattern language in s-expression that is used to descirbe regular languages. The BNF grammar of this pattern language is shown below.
( ‹pattern› ++ ‹pattern› )
( ‹pattern› || ‹pattern› )
( ‹pattern› * )
( ‹pattern› + )
( ‹pattern› ? )
The first three cases are atomic: (1) ∅ denotes empty set, i.e., there is not string in this language; (2) () denotes empty string, i.e., the only string in this language is a string of length 0; or (3) the pattern can be a Racket symbol of length 1, which matches a literal character.
The rest of cases are composition of basic expressions. ++ is the concatenation operator, and || is the alternation operator. * is the Kleene star, denoting a pattern can be repeated 0 or more times. Similarly, + is the Kleene plus, denoting a pattern can be repeated 1 or more times. The question mark ? here matches the pattern either 0 or 1 time.
For example, (a ++ b), (a ++ (b || c)), and ((a +) *) are all well-formed regular expressions.
According to the grammar of the pattern language, the literal character is represented by Racket symbol value. Due to this reason, we first make two helper functions that are used to compare symbols and strings.
(define (symbol-length s) (string-length (symbol->string s))) (define (first-char-eq? sym str) (string=? (symbol->string sym) (substring str 0 1)))
Then we may define the accept function. It matches the regular expression reg. The first three cases are relatively straightforward: (1) if a ∅ is matched, then #f is immediately returned; (2) if a () is matched, we invoke the continuation k with the string str; (3) if the pattern is a single-character symbol, we check whether the string is length 0. If so, #f is immediately returned; otherwise, we compare the first character of the string and symbol-represented pattern, and invoke the continuation k with the rest string.
To handle the concatenation case, accept is invoked by matching r1 with a new continuation. The new continuation takes argument s*, which represents the rest string after r1 is matched, and tries to match it against with pattern r2.
To handle the alternation case, we directly use boolean operator or on matching r1 and r2. Either one of them succeeds will lead to succeed of the alternation pattern. We use similar way to handle the question mark – we may match the pattern once, or discard it by immediately invoking the continuation k.
The Kleene star/plus are interesting ones: we use another accept-* function to handle star; and use accept-* in the continuation of plus.
;; accept : pattern × string × (string → bool) → bool (define (accept reg str k) (match reg ['∅ #f] ['() (k str)] [(? symbol?) #:when (eq? (symbol-length reg) 1) (if (zero? (string-length str)) #f (and (first-char-eq? reg str) (k (substring str 1))))] [`(,r1 ++ ,r2) (accept r1 str (λ (s*) (accept r2 s* k)))] [`(,r1 || ,r2) (or (accept r1 str k) (accept r2 str k))] [`(,r *) (accept-* r str k)] [`(,r +) (accept r str (λ (s*) (and (not (string=? str s*)) (accept-* reg s* k))))] [`(,r ?) (or (accept r str k) (k str))]))
The function accept-* handles Kleene Star. Because of the recursive natural of Klenee star, we need to define accept-* as a separate function. Again, it uses or to deal with the alternation semantics: either the pattern is matched 0 times (base case), then the continuation k is invoked; or we can match it once, with a new continuation, where we recursively (inductively) invoke accept-* on the same pattern. The predicate (not (string=? str s*)) is used to ensure the match make progress.
;; accept-* : pattern × string × (string → bool) → bool (define (accept-* reg str k) (or (k str) (accept reg str (λ (s*) (and (not (string=? str s*)) (accept-* reg s* k))))))
Finally, we can define a top-level match function, with an initial continuation which expects the string s* to be length 0.
(define (regmatch? reg str) (accept reg str (λ (s*) (zero? (string-length s*)))))
We can use a few test cases to check the correctness of the regualr expression matcher.
(check-true (regmatch? '(a ++ (b ++ c)) "abc")) (check-true (regmatch? '() "")) (check-true (regmatch? '((a *) +) "aaa")) (check-true (regmatch? '(a ?) "a")) (check-true (regmatch? '(a ?) "")) (check-true (regmatch? '((a ++ (b *)) ++ ((c ?) || (d +))) "abbddd")) (check-true (regmatch? '(a ++ (b || c)) "ab")) (check-true (regmatch? '((a +) +) "aaa")) (check-true (regmatch? '((a +) +) "aaaaaaaaaaaaa")) (check-true (regmatch? '((a *) *) "aaaaa")) (check-true (regmatch? '((a || b) *) "aaaabbbbaaa")) (check-false (regmatch? '∅ "abc")) (check-false (regmatch? '() "a")) (check-false (regmatch? '((a *) +) "aa!")) (check-false (regmatch? '((a +) *) "aa!")) (check-false (regmatch? '(a ++ (b || c)) "aab")) (check-false (regmatch? '(a ++ (b || ((c *) ?))) "acccccd")) (check-false (regmatch? '((a +) +) "aaaaaaaaaaaa!"))
Danvy and Nielsen (2001) showed not only a higher-order matcher in CPS, but also an abstract machine that matches regular expressions, derived by defunctionalization. This post follows the same essence.
Olivier Danvy and Lasse R. Nielsen. Defunctionalization at Work. In Proc. Proceedings of the 3rd ACM SIGPLAN International Conference on Principles and Practice of Declarative Programming, 2001. http://doi.acm.org/10.1145/773184.773202