Regular Expression Engine in 14 Lines

My previous post talks the basic operations of regualr expressions. This article shows a very small yet complete implementation of regular expression engine in Racket. It supports primitve operations on regualr expressions: union operator |, Kleene star operator *, option operator ?, and at least once operator +. And it only takes 14 lines of code!

The syntax is S-Expression inspired. We use postfix style that follows the idioms when writing regular expressions in other mainstream languages.

<pat> ::= ∅                 ;; Empty set
        | ""                ;; Empty string
        | <char>            ;; A single character
        | (<pat> *)         ;; Kleene star
        | (<pat> +)         ;; Repeat one or more times
        | (<pat> ?)         ;; Optional
        | (<pat> + <pat>)   ;; Concatenation
        | (<pat> or <pat>)  ;; Or

The matcher is written in continuation-passing style, where the continuation is a function of type string -> bool. The argument string of the continuation is the rest string we need to match. For example, in the case of concatenation, we first match r1 by invoking accept with a continuation, and this continuation will match r2 with the rest of the string s*. To match a Kleene star, we define an auxiliary function accetp-star, which accpets a star pattern when either the string repeats it zero time, or any times. Note that in the continuation of the recursive call of accept, we need to check the matching is making some progress by asserting (not (string=? str s*)).

(define (accept reg str k)
  (match reg
    ['∅ #f]
    ["" (k str)]
    [(? string?) (if (zero? (string-length str)) #f
                     (and (string=? reg (substring str 0 1)) (k (substring str 1))))]
    [`(,r1 + ,r2) (accept r1 str (λ (s*) (accept r2 s* k)))]
    [`(,r1 or ,r2) (or (accept r1 str k) (accept r2 str k))]
    [`(,r *) (accept-star r str k)]
    [`(,r +) (accept r str (λ (s*) (and (not (eq? str s*))(accept-star reg s* k))))]
    [`(,r ?) (or (accept r str k) (k str))]))
(define (accept-star reg str k)
  (or (k str) (accept reg str (λ (s*) (and (not (string=? str s*)) (accept-star reg s* k))))))
(define (regmatch reg str) (accept reg str (λ (s*) (zero? (string-length s*)))))

Let’s do some tests for it!

(check-true (regmatch "" ""))
(check-true (regmatch '(("a" *) +) "aaa"))
(check-true (regmatch '("a" ?) "a"))
(check-true (regmatch '("a" ?) ""))
(check-true (regmatch '(("a" + ("b"*)) + (("c"?) or ("d"+))) "abbddd"))
(check-true (regmatch '("a" + ("b" or "c")) "ab"))
(check-true (regmatch '(("a" +) +) "aaa"))
(check-true (regmatch'(("a" +) +) "aaaaaaaaaaaaa"))
(check-true (regmatch '(("a" *) *) "aaaaa"))

(check-false (regmatch '∅ "abc"))
(check-false (regmatch "" "a"))
(check-false (regmatch '(("a" *) +) "aa!"))
(check-false (regmatch '(("a" +) *) "aa!"))
(check-false (regmatch '("a" + ("b" or "c")) "aab"))
(check-false (regmatch '("a" + ("b" or (("c" *) ?))) "acccccd"))
(check-false (regmatch'(("a" +) +) "aaaaaaaaaaaa!"))

Reference: Olivier Danvy’s paper Defunctionalization at Work contains nice implementations of regular expression matcher in both higher-order form and first-order form. The matcher presented here is essentially the higher-order one that metioned in the paper, but rewritten in Racket.