All source listed below is under MIT license if no LICENSE file stating different is available.
lorex
retoor retoor@molodetz.nl
A high-performance regular expression interpreter implemented from scratch in plain C. The engine uses Thompson's NFA construction algorithm with extensive optimizations for efficient pattern matching.
CI
The project includes Gitea Actions CI that runs on every push and pull request:
- Build verification (release and debug)
- Full test suite (545 tests)
- Valgrind memory leak detection
- Code coverage generation
Features
- Full regex syntax support: literals, metacharacters, quantifiers, character classes, groups, alternation, anchors
- NFA-based matching engine with Thompson construction
- Capturing groups with match position tracking
- Interactive REPL for testing patterns
- Zero external dependencies
- Comprehensive test suite with 545 tests
- Memory-safe implementation verified with Valgrind
Performance
The engine includes multiple optimization techniques:
| Optimization | Description |
|---|---|
| Literal prefix extraction | Uses strstr/memchr to skip non-matching positions |
| First character filtering | Bitmap-based filtering of potential match positions |
| Alternation dispatch table | 256-byte lookup for fast alternation branch selection |
| End anchor backward search | Searches backward from suffix for $ anchored patterns |
| Character class bitmaps | O(1) lookup tables for \d, \w, \s classes |
| Match context reuse | Pre-allocated buffers reduce per-match allocations |
| Cache-optimized structures | Field ordering minimizes padding waste |
Benchmark results against POSIX regex (81 test patterns):
| Category | Performance |
|---|---|
| Character classes | LOREX 1.24x faster |
| Groups | LOREX 1.12x faster |
| Real-world patterns | LOREX 1.05x faster |
| Nested groups | LOREX 2.7x faster |
| Complex email patterns | LOREX 1.8x faster |
Building
make # optimized release build
make debug # debug build with symbols
make test # run all tests
make benchmark # run performance benchmark
make coverage # generate gcov coverage report
make lcov # generate html coverage report (requires lcov)
make valgrind # run under valgrind
Dependencies
Build requirements:
- GCC with C11 support
- GNU Make
Optional:
- valgrind (memory leak detection)
- lcov (html coverage reports):
apt install lcov
Usage
Command Line
./lorex "pattern" "text" # search for pattern in text
./lorex -m "pattern" "text" # full match mode
./lorex -i # start REPL
./lorex # start REPL (default)
REPL Commands
:p <pattern> compile and set pattern
:m <text> match text (anchored)
:s <text> search for pattern in text
<text> search (default)
:h help
:q quit
C API
#include "lorex.h"
lorex_error_t err;
lorex_regex_t *re = lorex_compile("\\d{3}-\\d{4}", &err);
if (!re) {
fprintf(stderr, "error: %s\n", lorex_error_string(err));
return 1;
}
lorex_match_t result;
if (lorex_search(re, "call 555-1234 now", &result)) {
printf("match at [%zu-%zu]\n", result.match_start, result.match_end);
}
lorex_free(re);
Supported Syntax
| Pattern | Description |
|---|---|
. |
any character except newline |
* |
zero or more |
+ |
one or more |
? |
zero or one |
| |
alternation |
() |
grouping and capture |
[] |
character class |
[^] |
negated character class |
[a-z] |
character range |
^ |
start anchor |
$ |
end anchor |
{n} |
exactly n |
{n,} |
n or more |
{n,m} |
n to m |
\d |
digit [0-9] |
\w |
word [a-zA-Z0-9_] |
\s |
whitespace |
\D |
non-digit |
\W |
non-word |
\S |
non-whitespace |
*? +? ?? |
non-greedy quantifiers |
Architecture
src/
├── lexer.c tokenizer for regex patterns
├── parser.c recursive descent parser producing AST
├── ast.c abstract syntax tree node types
├── nfa.c Thompson NFA construction with optimizations
├── matcher.c NFA simulation with epsilon closure
├── lorex.c public API
├── repl.c interactive REPL
└── main.c CLI entry point
include/
├── lorex.h public header
├── lexer.h lexer interface
├── parser.h parser interface
├── ast.h AST types
├── nfa.h NFA types and optimization metadata
├── matcher.h matcher interface
└── repl.h REPL interface
tests/
├── test_lexer.c lexer unit tests
├── test_parser.c parser unit tests
├── test_nfa.c NFA construction tests
├── test_matcher.c matching tests
├── test_all.c comprehensive tests
├── test_integration.c integration tests (545 tests)
└── benchmark.c performance benchmark vs POSIX regex
Test Suite
The test suite contains 545 tests covering:
| Category | Description |
|---|---|
| Lexer | Tokenization of patterns |
| Parser | AST construction and error handling |
| NFA | State machine construction |
| Matcher | Pattern matching correctness |
| Integration | Real-world regex patterns |
Integration tests cover:
- Literal matching and concatenation
- Dot metacharacter and wildcards
- Start/end anchors
- All quantifiers (*, +, ?, {n,m})
- Alternation and grouping
- Character classes and ranges
- Negated character classes
- Escape sequences
- Email, IP, URL, phone patterns
- Greedy vs non-greedy matching
- Nested groups and complex nesting
- Edge cases and boundary conditions
- Pathological/stress patterns
Run tests with Valgrind verification:
make test # run all 545 tests
make valgrind # verify zero memory leaks
Algorithm
The implementation uses Thompson's construction to convert regex patterns to NFAs:
- Lexer: Tokenizes the pattern into a stream of tokens
- Parser: Builds an AST using recursive descent parsing
- NFA Construction: Converts AST to NFA using Thompson's algorithm
- Optimization: Extracts literal prefixes, suffixes, and first-char sets
- Matching: Simulates NFA with epsilon closure for linear-time matching
Time complexity: O(n*m) where n is pattern length and m is text length.
License
MIT
| .gitea/workflows | |
| include | |
| src | |
| tests | |
| .gitignore | |
| CHANGELOG.md | |
| Makefile | |
| README.md |