Regular Expression Engine in 14 Lines

My previous post introduces the basic operations of regualr expression. This article shows a very small yet complete implementation of regular expression engine in Racket. It supports primitve operations on regualr expressions: union operator |, Kleene star operator *, option operator ?, and at least once operator +. And it only takes 14 lines of code!

The syntax is S-Expression inspired. We use postfix style to follow the idioms when writing regular expressions in other mainstream languages.

<pat> ::= <string>
        | (<pat> *)
        | (<pat> +)
        | (<pat> ?)
        | (<pat> or <pat>)

The code is an actually pretty straightforward implementation to handle each case and operation. For example, the Kleene star operator _*_ means that the pattern may repeat zero or any times, so we just use an or expression to handle these two cases. There is also a auxiliary function substring=?.

; reg-match : (listof pat) string integer -> boolean
(define (reg-match pats str pos)
  (if (null? pats) (= pos (string-length str))
      (match (car pats)
        [(? string?) (let ([len (string-length (car pats))])
                       (and (>= (string-length str) (+ pos len))
                            (substring=? (car pats) str pos (+ pos len))
                            (reg-match (cdr pats) str (+ pos len))))]
        [`(,p1 or ,p2) (or (reg-match `(,p1 ,@(cdr pats)) str pos)
                           (reg-match `(,p2 ,@(cdr pats)) str pos))]
        [`(,p *) (or (reg-match (cdr pats) str pos)
                     (reg-match `(,p ,@pats) str pos))]
        [`(,p +) (reg-match `(,p (,p *) ,@(cdr pats)) str pos)]
        [`(,p ?) (reg-match `((,p or "") ,@(cdr pats)) str pos)]
        [else (error "bad pattern")])))

; auxiliary function
(define (substring=? pat str start end)
  (for/and ([i (in-range start end)]
            [j (in-naturals)])
    (char=? (string-ref pat j) (string-ref str i))))

Let’s do some tests for it!

(reg-match '("a" ("b"*) (("c"?) or ("d"+))) "abbc" 0)  ;#t
(reg-match '("a" ("b" or "c")) "ab" 0)                 ;#t
(reg-match '("a" ("b" or (("c"*)?))) "accccc" 0)       ;#t
(reg-match '("a" ("b" or (("c"*)?))) "acccccd" 0)      ;#f
(reg-match '("accc" ("c" ?) ("c"+) "d") "acccccd" 0)   ;#t