loreg

retoor <retoor@molodetz.nl>

A high-performance regular expression interpreter implemented from scratch in plain C. The engine uses Thompson's NFA construction algorithm for efficient pattern matching.

CI

The project includes Gitea Actions CI that runs on every push and pull request:

  • Build verification (release and debug)
  • Full test suite (569 tests)
  • Valgrind memory leak detection
  • Code coverage generation

Features

  • Core regex syntax: literals, metacharacters, quantifiers, character classes, groups, alternation, anchors (see Supported Syntax)
  • NFA-based matching engine with Thompson construction
  • Capturing groups with match position tracking
  • Interactive REPL for testing patterns
  • Zero external dependencies
  • Comprehensive test suite with 569 tests
  • Memory-safe implementation verified with Valgrind

Building

make            # optimized release build
make debug      # debug build with symbols
make test       # run all tests
make coverage   # generate coverage report
make profile    # generate profiling report
make valgrind   # run under valgrind

Usage

Command Line

./loreg "pattern" "text"           # search for pattern in text
./loreg -m "pattern" "text"        # full match mode
./loreg -i                         # start REPL
./loreg                            # start REPL (default)

REPL Commands

:p <pattern>  compile and set pattern
:m <text>     match text (anchored)
:s <text>     search for pattern in text
<text>        search (default)
:h            help
:q            quit

C API

#include <stdio.h>
#include "loreg.h"

int main(void) {
    loreg_error_t err;

    /* Compile the pattern; on failure, err identifies the problem. */
    loreg_regex_t *re = loreg_compile("\\d{3}-\\d{4}", &err);
    if (!re) {
        fprintf(stderr, "error: %s\n", loreg_error_string(err));
        return 1;
    }

    /* Search for the pattern anywhere in the text. */
    loreg_match_t result;
    if (loreg_search(re, "call 555-1234 now", &result)) {
        printf("match at [%zu-%zu]\n", result.match_start, result.match_end);
    }

    loreg_free(re);
    return 0;
}

Supported Syntax

Pattern       Description
.             any character except newline
*             zero or more
+             one or more
?             zero or one
|             alternation
()            grouping and capture
[]            character class
[^]           negated character class
[a-z]         character range
^             start anchor
$             end anchor
{n}           exactly n
{n,}          n or more
{n,m}         n to m
\d            digit [0-9]
\w            word [a-zA-Z0-9_]
\s            whitespace
\D            non-digit
\W            non-word
\S            non-whitespace
*? +? ??      non-greedy quantifiers
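
The shorthand classes in the table can be read as simple per-character predicates. The sketch below spells out \d and \w exactly as listed; the character set for \s is an assumption (the six whitespace characters that C's isspace accepts in the default locale), since the table does not enumerate it. The function names are hypothetical and are not part of the loreg API.

```c
#include <stdbool.h>

/* \d : ASCII digits only, as listed in the table. */
bool class_digit(unsigned char c) {
    return c >= '0' && c <= '9';
}

/* \w : [a-zA-Z0-9_], as listed in the table. */
bool class_word(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           class_digit(c) || c == '_';
}

/* \s : ASSUMPTION, the table only says "whitespace".
 * This uses space, tab, newline, carriage return, form feed, vertical tab. */
bool class_space(unsigned char c) {
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\r' || c == '\f' || c == '\v';
}
```

The negated forms \D, \W, and \S are then the logical negation of the corresponding predicate.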

Architecture

src/
├── lexer.c     tokenizer for regex patterns
├── parser.c    recursive descent parser producing AST
├── ast.c       abstract syntax tree node types
├── nfa.c       Thompson NFA construction
├── matcher.c   NFA simulation with epsilon closure
├── loreg.c     public API
├── repl.c      interactive REPL
└── main.c      CLI entry point

include/
├── loreg.h     public header
├── lexer.h     lexer interface
├── parser.h    parser interface
├── ast.h       AST types
├── nfa.h       NFA types
├── matcher.h   matcher interface
└── repl.h      REPL interface

tests/
├── test_lexer.c       lexer unit tests (10 tests)
├── test_parser.c      parser unit tests (20 tests)
├── test_nfa.c         NFA construction tests (14 tests)
├── test_matcher.c     matching tests (27 tests)
├── test_all.c         comprehensive tests (9 tests)
└── test_integration.c integration tests (489 tests)

Test Suite

The test suite contains 569 tests covering:

Category      Description
Lexer         Tokenization of patterns
Parser        AST construction and error handling
NFA           State machine construction
Matcher       Pattern matching correctness
Integration   Real-world regex patterns

Integration tests cover:

  • Literal matching and concatenation
  • Dot metacharacter and wildcards
  • Start/end anchors
  • All quantifiers (*, +, ?, {n,m})
  • Alternation and grouping
  • Character classes and ranges
  • Negated character classes
  • Escape sequences
  • Email, IP, URL, phone patterns
  • Greedy vs non-greedy matching
  • Nested groups and complex nesting
  • Edge cases and boundary conditions
  • Pathological/stress patterns

Run tests with Valgrind verification:

make test           # run all 569 tests
make valgrind       # verify zero memory leaks

Algorithm

The implementation uses Thompson's construction to convert regex patterns to NFAs:

  1. Lexer: Tokenizes the pattern into a stream of tokens
  2. Parser: Builds an AST using recursive descent parsing
  3. NFA Construction: Converts the AST to an NFA using Thompson's algorithm
  4. Matching: Simulates the NFA with epsilon closure, so matching time is linear in the text length for a fixed pattern
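
As an illustration of step 1, here is a minimal tokenizer sketch. The token names and the tokenize function are hypothetical and do not come from loreg's lexer.h. To stay short, the sketch covers only single-character operators and backslash escapes; it leaves out character classes ([...]) and counted quantifiers ({n,m}), which a real lexer would need to handle.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds; loreg's real lexer may use different ones. */
typedef enum {
    TOK_CHAR, TOK_CLASS, TOK_DOT, TOK_STAR, TOK_PLUS, TOK_QUESTION,
    TOK_PIPE, TOK_LPAREN, TOK_RPAREN, TOK_CARET, TOK_DOLLAR, TOK_END
} tok_kind;

typedef struct {
    tok_kind kind;
    char value;  /* literal for TOK_CHAR, class letter (d, w, s, ...) for TOK_CLASS */
} token;

/* Writes tokens for pattern into out[0..cap) and terminates with TOK_END.
 * Returns the token count including TOK_END, or 0 if out is too small
 * or the pattern ends in a dangling backslash. */
size_t tokenize(const char *pattern, token *out, size_t cap) {
    size_t n = 0;
    for (const char *p = pattern; *p; p++) {
        if (n + 1 >= cap) return 0;          /* keep room for TOK_END */
        token t = { TOK_CHAR, *p };          /* default: literal character */
        switch (*p) {
        case '.': t.kind = TOK_DOT;      break;
        case '*': t.kind = TOK_STAR;     break;
        case '+': t.kind = TOK_PLUS;     break;
        case '?': t.kind = TOK_QUESTION; break;
        case '|': t.kind = TOK_PIPE;     break;
        case '(': t.kind = TOK_LPAREN;   break;
        case ')': t.kind = TOK_RPAREN;   break;
        case '^': t.kind = TOK_CARET;    break;
        case '$': t.kind = TOK_DOLLAR;   break;
        case '\\':
            if (!p[1]) return 0;             /* dangling backslash */
            p++;
            t.value = *p;
            /* Shorthand classes become TOK_CLASS; any other escaped
             * character stays a literal TOK_CHAR. */
            if (strchr("dDwWsS", *p)) t.kind = TOK_CLASS;
            break;
        default:
            break;
        }
        out[n++] = t;
    }
    if (n >= cap) return 0;
    out[n].kind = TOK_END;
    out[n].value = '\0';
    return n + 1;
}
```

For the pattern a\d*|b this yields six tokens: TOK_CHAR 'a', TOK_CLASS 'd', TOK_STAR, TOK_PIPE, TOK_CHAR 'b', TOK_END. A parser would then consume this stream to build the AST.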

Time complexity: O(n*m) where n is pattern length and m is text length.
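
The O(n*m) bound follows from how an NFA is simulated: the matcher keeps a set of active states, and for each input character it visits every state in that set at most once. The sketch below shows the idea on a hand-built NFA for a(b|c)*d. It is not loreg's matcher.c; the state layout, the names, and the fixed MAX_STATES limit are assumptions made for the example, and it only answers whether the whole text matches.

```c
#include <stdbool.h>
#include <string.h>

enum { ST_CHAR, ST_SPLIT, ST_MATCH };

typedef struct {
    int kind;
    char c;     /* character to consume (ST_CHAR only) */
    int out;    /* next state */
    int out1;   /* second epsilon branch (ST_SPLIT only) */
} nfa_state;

#define MAX_STATES 64

typedef struct {
    int ids[MAX_STATES];    /* active non-epsilon states */
    int n;
    bool seen[MAX_STATES];  /* guards against adding a state twice */
} state_set;

/* Adds state id and everything reachable from it through epsilon
 * (ST_SPLIT) edges: the epsilon closure. */
static void add_state(const nfa_state *nfa, state_set *set, int id) {
    if (set->seen[id]) return;
    set->seen[id] = true;
    if (nfa[id].kind == ST_SPLIT) {
        add_state(nfa, set, nfa[id].out);
        add_state(nfa, set, nfa[id].out1);
        return;
    }
    set->ids[set->n++] = id;
}

/* Returns true if the NFA accepts the entire text. Each character costs
 * at most one visit per state, so the total work is O(states * length). */
bool nfa_full_match(const nfa_state *nfa, int start, const char *text) {
    state_set cur, next;
    memset(&cur, 0, sizeof cur);
    add_state(nfa, &cur, start);
    for (const char *p = text; *p; p++) {
        memset(&next, 0, sizeof next);
        for (int i = 0; i < cur.n; i++) {
            const nfa_state *s = &nfa[cur.ids[i]];
            if (s->kind == ST_CHAR && s->c == *p)
                add_state(nfa, &next, s->out);
        }
        cur = next;
    }
    for (int i = 0; i < cur.n; i++)
        if (nfa[cur.ids[i]].kind == ST_MATCH) return true;
    return false;
}

/* Hand-built NFA for a(b|c)*d; state 0 is the start state. */
const nfa_state EXAMPLE_NFA[] = {
    { ST_CHAR,  'a', 1, -1 },  /* 0: a                      */
    { ST_SPLIT,  0,  2,  5 },  /* 1: loop body or exit      */
    { ST_SPLIT,  0,  3,  4 },  /* 2: b or c                 */
    { ST_CHAR,  'b', 1, -1 },  /* 3: b, back to loop        */
    { ST_CHAR,  'c', 1, -1 },  /* 4: c, back to loop        */
    { ST_CHAR,  'd', 6, -1 },  /* 5: d                      */
    { ST_MATCH,  0, -1, -1 },  /* 6: accept                 */
};
```

With this table, nfa_full_match(EXAMPLE_NFA, 0, "abcbd") returns true and nfa_full_match(EXAMPLE_NFA, 0, "ab") returns false. Because the active set never holds more states than the NFA has, the matcher does not backtrack, which is what keeps pathological patterns from taking exponential time.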

License

MIT