lorex

retoor retoor@molodetz.nl

A high-performance regular expression interpreter implemented from scratch in plain C. The engine uses Thompson's NFA construction algorithm with extensive optimizations for efficient pattern matching.

CI

The project includes Gitea Actions CI that runs on every push and pull request:

Build verification (release and debug)
Full test suite (545 tests)
Valgrind memory leak detection
Code coverage generation

Features

Full regex syntax support: literals, metacharacters, quantifiers, character classes, groups, alternation, anchors
NFA-based matching engine with Thompson construction
Capturing groups with match position tracking
Interactive REPL for testing patterns
Zero external dependencies
Comprehensive test suite with 545 tests
Memory-safe implementation verified with Valgrind

Performance

The engine includes multiple optimization techniques:

Optimization	Description
Literal prefix extraction	Uses `strstr`/`memchr` to skip non-matching positions
First character filtering	Bitmap-based filtering of potential match positions
Alternation dispatch table	256-byte lookup for fast alternation branch selection
End anchor backward search	Searches backward from suffix for `$` anchored patterns
Character class bitmaps	O(1) lookup tables for `\d`, `\w`, `\s` classes
Match context reuse	Pre-allocated buffers reduce per-match allocations
Cache-optimized structures	Field ordering minimizes padding waste

Benchmark results against POSIX regex (81 test patterns):

Category	Performance
Character classes	LOREX 1.24x faster
Groups	LOREX 1.12x faster
Real-world patterns	LOREX 1.05x faster
Nested groups	LOREX 2.7x faster
Complex email patterns	LOREX 1.8x faster

Building

make            # optimized release build
make debug      # debug build with symbols
make test       # run all tests
make benchmark  # run performance benchmark
make coverage   # generate gcov coverage report
make lcov       # generate html coverage report (requires lcov)
make valgrind   # run under valgrind

Dependencies

Build requirements:

GCC with C11 support
GNU Make

Optional:

valgrind (memory leak detection)
lcov (html coverage reports): apt install lcov

Usage

Command Line

./lorex "pattern" "text"           # search for pattern in text
./lorex -m "pattern" "text"        # full match mode
./lorex -i                         # start REPL
./lorex                            # start REPL (default)

REPL Commands

:p <pattern>  compile and set pattern
:m <text>     match text (anchored)
:s <text>     search for pattern in text
<text>        search (default)
:h            help
:q            quit

C API

#include "lorex.h"

lorex_error_t err;
lorex_regex_t *re = lorex_compile("\\d{3}-\\d{4}", &err);
if (!re) {
    fprintf(stderr, "error: %s\n", lorex_error_string(err));
    return 1;
}

lorex_match_t result;
if (lorex_search(re, "call 555-1234 now", &result)) {
    printf("match at [%zu-%zu]\n", result.match_start, result.match_end);
}

lorex_free(re);

Supported Syntax

Pattern	Description
`.`	any character except newline
`*`	zero or more
`+`	one or more
`?`	zero or one
`\|`	alternation
`()`	grouping and capture
`[]`	character class
`[^]`	negated character class
`[a-z]`	character range
`^`	start anchor
`$`	end anchor
`{n}`	exactly n
`{n,}`	n or more
`{n,m}`	n to m
`\d`	digit [0-9]
`\w`	word [a-zA-Z0-9_]
`\s`	whitespace
`\D`	non-digit
`\W`	non-word
`\S`	non-whitespace
`*?` `+?` `??`	non-greedy quantifiers

Architecture

src/
├── lexer.c     tokenizer for regex patterns
├── parser.c    recursive descent parser producing AST
├── ast.c       abstract syntax tree node types
├── nfa.c       Thompson NFA construction with optimizations
├── matcher.c   NFA simulation with epsilon closure
├── lorex.c     public API
├── repl.c      interactive REPL
└── main.c      CLI entry point

include/
├── lorex.h     public header
├── lexer.h     lexer interface
├── parser.h    parser interface
├── ast.h       AST types
├── nfa.h       NFA types and optimization metadata
├── matcher.h   matcher interface
└── repl.h      REPL interface

tests/
├── test_lexer.c       lexer unit tests
├── test_parser.c      parser unit tests
├── test_nfa.c         NFA construction tests
├── test_matcher.c     matching tests
├── test_all.c         comprehensive tests
├── test_integration.c integration tests (545 tests)
└── benchmark.c        performance benchmark vs POSIX regex

Test Suite

The test suite contains 545 tests covering:

Category	Description
Lexer	Tokenization of patterns
Parser	AST construction and error handling
NFA	State machine construction
Matcher	Pattern matching correctness
Integration	Real-world regex patterns

Integration tests cover:

Literal matching and concatenation
Dot metacharacter and wildcards
Start/end anchors
All quantifiers (*, +, ?, {n,m})
Alternation and grouping
Character classes and ranges
Negated character classes
Escape sequences
Email, IP, URL, phone patterns
Greedy vs non-greedy matching
Nested groups and complex nesting
Edge cases and boundary conditions
Pathological/stress patterns

Run tests with Valgrind verification:

make test           # run all 545 tests
make valgrind       # verify zero memory leaks

Algorithm

The implementation uses Thompson's construction to convert regex patterns to NFAs:

Lexer: Tokenizes the pattern into a stream of tokens
Parser: Builds an AST using recursive descent parsing
NFA Construction: Converts AST to NFA using Thompson's algorithm
Optimization: Extracts literal prefixes, suffixes, and first-char sets
Matching: Simulates NFA with epsilon closure for linear-time matching

Time complexity: O(n*m) where n is pattern length and m is text length.

License

MIT