2026-01-04 01:58:43 +01:00
# lorex
2026-01-04 00:04:48 +01:00
retoor < retoor @ molodetz . nl >
2026-01-04 01:58:43 +01:00
A high-performance regular expression interpreter implemented from scratch in plain C. The engine uses Thompson's NFA construction algorithm with extensive optimizations for efficient pattern matching.
2026-01-04 00:04:48 +01:00
## CI
The project includes Gitea Actions CI that runs on every push and pull request:
- Build verification (release and debug)
2026-01-04 01:58:43 +01:00
- Full test suite (545 tests)
2026-01-04 00:04:48 +01:00
- Valgrind memory leak detection
- Code coverage generation
## Features
- Full regex syntax support: literals, metacharacters, quantifiers, character classes, groups, alternation, anchors
- NFA-based matching engine with Thompson construction
- Capturing groups with match position tracking
- Interactive REPL for testing patterns
- Zero external dependencies
2026-01-04 01:58:43 +01:00
- Comprehensive test suite with 545 tests
2026-01-04 00:04:48 +01:00
- Memory-safe implementation verified with Valgrind
2026-01-04 01:58:43 +01:00
## Performance
The engine includes multiple optimization techniques:
| Optimization | Description |
|--------------|-------------|
| Literal prefix extraction | Uses `strstr` /`memchr` to skip non-matching positions |
| First character filtering | Bitmap-based filtering of potential match positions |
| Alternation dispatch table | 256-byte lookup for fast alternation branch selection |
| End anchor backward search | Searches backward from suffix for `$` anchored patterns |
| Character class bitmaps | O(1) lookup tables for `\d` , `\w` , `\s` classes |
| Match context reuse | Pre-allocated buffers reduce per-match allocations |
| Cache-optimized structures | Field ordering minimizes padding waste |
Benchmark results against POSIX regex (81 test patterns):
| Category | Performance |
|----------|-------------|
| Character classes | LOREX 1.24x faster |
| Groups | LOREX 1.12x faster |
| Real-world patterns | LOREX 1.05x faster |
| Nested groups | LOREX 2.7x faster |
| Complex email patterns | LOREX 1.8x faster |
2026-01-04 00:04:48 +01:00
## Building
```sh
make # optimized release build
make debug # debug build with symbols
make test # run all tests
2026-01-04 01:58:43 +01:00
make benchmark # run performance benchmark
make coverage # generate gcov coverage report
make lcov # generate html coverage report (requires lcov)
2026-01-04 00:04:48 +01:00
make valgrind # run under valgrind
```
2026-01-04 01:58:43 +01:00
### Dependencies
Build requirements:
- GCC with C11 support
- GNU Make
Optional:
- valgrind (memory leak detection)
- lcov (html coverage reports): `apt install lcov`
2026-01-04 00:04:48 +01:00
## Usage
### Command Line
```sh
2026-01-04 01:58:43 +01:00
./lorex "pattern" "text" # search for pattern in text
./lorex -m "pattern" "text" # full match mode
./lorex -i # start REPL
./lorex # start REPL (default)
2026-01-04 00:04:48 +01:00
```
### REPL Commands
```
:p < pattern > compile and set pattern
:m < text > match text (anchored)
:s < text > search for pattern in text
< text > search (default)
:h help
:q quit
```
### C API
```c
2026-01-04 01:58:43 +01:00
#include "lorex.h"
2026-01-04 00:04:48 +01:00
2026-01-04 01:58:43 +01:00
lorex_error_t err;
lorex_regex_t *re = lorex_compile("\\d{3}-\\d{4}", &err);
2026-01-04 00:04:48 +01:00
if (!re) {
2026-01-04 01:58:43 +01:00
fprintf(stderr, "error: %s\n", lorex_error_string(err));
2026-01-04 00:04:48 +01:00
return 1;
}
2026-01-04 01:58:43 +01:00
lorex_match_t result;
if (lorex_search(re, "call 555-1234 now", & result)) {
2026-01-04 00:04:48 +01:00
printf("match at [%zu-%zu]\n", result.match_start, result.match_end);
}
2026-01-04 01:58:43 +01:00
lorex_free(re);
2026-01-04 00:04:48 +01:00
```
## Supported Syntax
| Pattern | Description |
|---------|-------------|
| `.` | any character except newline |
| `*` | zero or more |
| `+` | one or more |
| `?` | zero or one |
| `\|` | alternation |
| `()` | grouping and capture |
| `[]` | character class |
| `[^]` | negated character class |
| `[a-z]` | character range |
| `^` | start anchor |
| `$` | end anchor |
| `{n}` | exactly n |
| `{n,}` | n or more |
| `{n,m}` | n to m |
| `\d` | digit [0-9] |
| `\w` | word [a-zA-Z0-9_] |
| `\s` | whitespace |
| `\D` | non-digit |
| `\W` | non-word |
| `\S` | non-whitespace |
| `*?` `+?` `??` | non-greedy quantifiers |
## Architecture
```
src/
├── lexer.c tokenizer for regex patterns
├── parser.c recursive descent parser producing AST
├── ast.c abstract syntax tree node types
2026-01-04 01:58:43 +01:00
├── nfa.c Thompson NFA construction with optimizations
2026-01-04 00:04:48 +01:00
├── matcher.c NFA simulation with epsilon closure
2026-01-04 01:58:43 +01:00
├── lorex.c public API
2026-01-04 00:04:48 +01:00
├── repl.c interactive REPL
└── main.c CLI entry point
include/
2026-01-04 01:58:43 +01:00
├── lorex.h public header
2026-01-04 00:04:48 +01:00
├── lexer.h lexer interface
├── parser.h parser interface
├── ast.h AST types
2026-01-04 01:58:43 +01:00
├── nfa.h NFA types and optimization metadata
2026-01-04 00:04:48 +01:00
├── matcher.h matcher interface
└── repl.h REPL interface
tests/
2026-01-04 01:58:43 +01:00
├── test_lexer.c lexer unit tests
├── test_parser.c parser unit tests
├── test_nfa.c NFA construction tests
├── test_matcher.c matching tests
├── test_all.c comprehensive tests
├── test_integration.c integration tests (545 tests)
└── benchmark.c performance benchmark vs POSIX regex
2026-01-04 00:04:48 +01:00
```
## Test Suite
2026-01-04 01:58:43 +01:00
The test suite contains 545 tests covering:
2026-01-04 00:04:48 +01:00
| Category | Description |
|----------|-------------|
| Lexer | Tokenization of patterns |
| Parser | AST construction and error handling |
| NFA | State machine construction |
| Matcher | Pattern matching correctness |
| Integration | Real-world regex patterns |
Integration tests cover:
- Literal matching and concatenation
- Dot metacharacter and wildcards
- Start/end anchors
- All quantifiers (*, +, ?, {n,m})
- Alternation and grouping
- Character classes and ranges
- Negated character classes
- Escape sequences
- Email, IP, URL, phone patterns
- Greedy vs non-greedy matching
- Nested groups and complex nesting
- Edge cases and boundary conditions
- Pathological/stress patterns
Run tests with Valgrind verification:
```sh
2026-01-04 01:58:43 +01:00
make test # run all 545 tests
2026-01-04 00:04:48 +01:00
make valgrind # verify zero memory leaks
```
## Algorithm
The implementation uses Thompson's construction to convert regex patterns to NFAs:
1. **Lexer** : Tokenizes the pattern into a stream of tokens
2. **Parser** : Builds an AST using recursive descent parsing
3. **NFA Construction** : Converts AST to NFA using Thompson's algorithm
2026-01-04 01:58:43 +01:00
4. **Optimization** : Extracts literal prefixes, suffixes, and first-char sets
5. **Matching** : Simulates NFA with epsilon closure for linear-time matching
2026-01-04 00:04:48 +01:00
Time complexity: O(n*m) where n is pattern length and m is text length.
## License
MIT