|
# lorex
|
|
|
|
retoor <retoor@molodetz.nl>
|
|
|
|
A high-performance regular expression interpreter implemented from scratch in plain C. The engine uses Thompson's NFA construction algorithm with extensive optimizations for efficient pattern matching.
|
|
|
|
## CI
|
|
|
|
The project includes Gitea Actions CI that runs on every push and pull request:
|
|
- Build verification (release and debug)
|
|
- Full test suite (545 tests)
|
|
- Valgrind memory leak detection
|
|
- Code coverage generation
|
|
|
|
## Features
|
|
|
|
- Full regex syntax support: literals, metacharacters, quantifiers, character classes, groups, alternation, anchors
|
|
- NFA-based matching engine with Thompson construction
|
|
- Capturing groups with match position tracking
|
|
- Interactive REPL for testing patterns
|
|
- Zero external dependencies
|
|
- Comprehensive test suite with 545 tests
|
|
- Memory-safe implementation verified with Valgrind
|
|
|
|
## Performance
|
|
|
|
The engine includes multiple optimization techniques:
|
|
|
|
| Optimization | Description |
|
|
|--------------|-------------|
|
|
| Literal prefix extraction | Uses `strstr`/`memchr` to skip non-matching positions |
|
|
| First character filtering | Bitmap-based filtering of potential match positions |
|
|
| Alternation dispatch table | 256-byte lookup for fast alternation branch selection |
|
|
| End anchor backward search | Searches backward from suffix for `$` anchored patterns |
|
|
| Character class bitmaps | O(1) lookup tables for `\d`, `\w`, `\s` classes |
|
|
| Match context reuse | Pre-allocated buffers reduce per-match allocations |
|
|
| Cache-optimized structures | Field ordering minimizes padding waste |
|
|
|
|
Benchmark results against POSIX regex (81 test patterns):
|
|
|
|
| Category | Performance |
|
|
|----------|-------------|
|
|
| Character classes | LOREX 1.24x faster |
|
|
| Groups | LOREX 1.12x faster |
|
|
| Real-world patterns | LOREX 1.05x faster |
|
|
| Nested groups | LOREX 2.7x faster |
|
|
| Complex email patterns | LOREX 1.8x faster |
|
|
|
|
## Building
|
|
|
|
```sh
|
|
make # optimized release build
|
|
make debug # debug build with symbols
|
|
make test # run all tests
|
|
make benchmark # run performance benchmark
|
|
make coverage # generate gcov coverage report
|
|
make lcov # generate html coverage report (requires lcov)
|
|
make valgrind # run under valgrind
|
|
```
|
|
|
|
### Dependencies
|
|
|
|
Build requirements:
|
|
- GCC with C11 support
|
|
- GNU Make
|
|
|
|
Optional:
|
|
- valgrind (memory leak detection)
|
|
- lcov (html coverage reports): `apt install lcov`
|
|
|
|
## Usage
|
|
|
|
### Command Line
|
|
|
|
```sh
|
|
./lorex "pattern" "text" # search for pattern in text
|
|
./lorex -m "pattern" "text" # full match mode
|
|
./lorex -i # start REPL
|
|
./lorex # start REPL (default)
|
|
```
|
|
|
|
### REPL Commands
|
|
|
|
```
|
|
:p <pattern> compile and set pattern
|
|
:m <text> match text (anchored)
|
|
:s <text> search for pattern in text
|
|
<text> search (default)
|
|
:h help
|
|
:q quit
|
|
```
|
|
|
|
### C API
|
|
|
|
```c
|
|
#include "lorex.h"
|
|
|
|
lorex_error_t err;
|
|
lorex_regex_t *re = lorex_compile("\\d{3}-\\d{4}", &err);
|
|
if (!re) {
|
|
fprintf(stderr, "error: %s\n", lorex_error_string(err));
|
|
return 1;
|
|
}
|
|
|
|
lorex_match_t result;
|
|
if (lorex_search(re, "call 555-1234 now", &result)) {
|
|
printf("match at [%zu-%zu]\n", result.match_start, result.match_end);
|
|
}
|
|
|
|
lorex_free(re);
|
|
```
|
|
|
|
## Supported Syntax
|
|
|
|
| Pattern | Description |
|
|
|---------|-------------|
|
|
| `.` | any character except newline |
|
|
| `*` | zero or more |
|
|
| `+` | one or more |
|
|
| `?` | zero or one |
|
|
| `\|` | alternation |
|
|
| `()` | grouping and capture |
|
|
| `[]` | character class |
|
|
| `[^]` | negated character class |
|
|
| `[a-z]` | character range |
|
|
| `^` | start anchor |
|
|
| `$` | end anchor |
|
|
| `{n}` | exactly n |
|
|
| `{n,}` | n or more |
|
|
| `{n,m}` | n to m |
|
|
| `\d` | digit [0-9] |
|
|
| `\w` | word [a-zA-Z0-9_] |
|
|
| `\s` | whitespace |
|
|
| `\D` | non-digit |
|
|
| `\W` | non-word |
|
|
| `\S` | non-whitespace |
|
|
| `*?` `+?` `??` | non-greedy quantifiers |
|
|
|
|
## Architecture
|
|
|
|
```
|
|
src/
|
|
├── lexer.c tokenizer for regex patterns
|
|
├── parser.c recursive descent parser producing AST
|
|
├── ast.c abstract syntax tree node types
|
|
├── nfa.c Thompson NFA construction with optimizations
|
|
├── matcher.c NFA simulation with epsilon closure
|
|
├── lorex.c public API
|
|
├── repl.c interactive REPL
|
|
└── main.c CLI entry point
|
|
|
|
include/
|
|
├── lorex.h public header
|
|
├── lexer.h lexer interface
|
|
├── parser.h parser interface
|
|
├── ast.h AST types
|
|
├── nfa.h NFA types and optimization metadata
|
|
├── matcher.h matcher interface
|
|
└── repl.h REPL interface
|
|
|
|
tests/
|
|
├── test_lexer.c lexer unit tests
|
|
├── test_parser.c parser unit tests
|
|
├── test_nfa.c NFA construction tests
|
|
├── test_matcher.c matching tests
|
|
├── test_all.c comprehensive tests
|
|
├── test_integration.c integration tests (545 tests)
|
|
└── benchmark.c performance benchmark vs POSIX regex
|
|
```
|
|
|
|
## Test Suite
|
|
|
|
The test suite contains 545 tests covering:
|
|
|
|
| Category | Description |
|
|
|----------|-------------|
|
|
| Lexer | Tokenization of patterns |
|
|
| Parser | AST construction and error handling |
|
|
| NFA | State machine construction |
|
|
| Matcher | Pattern matching correctness |
|
|
| Integration | Real-world regex patterns |
|
|
|
|
Integration tests cover:
|
|
- Literal matching and concatenation
|
|
- Dot metacharacter and wildcards
|
|
- Start/end anchors
|
|
- All quantifiers (*, +, ?, {n,m})
|
|
- Alternation and grouping
|
|
- Character classes and ranges
|
|
- Negated character classes
|
|
- Escape sequences
|
|
- Email, IP, URL, phone patterns
|
|
- Greedy vs non-greedy matching
|
|
- Nested groups and complex nesting
|
|
- Edge cases and boundary conditions
|
|
- Pathological/stress patterns
|
|
|
|
Run tests with Valgrind verification:
|
|
```sh
|
|
make test # run all 545 tests
|
|
make valgrind # verify zero memory leaks
|
|
```
|
|
|
|
## Algorithm
|
|
|
|
The implementation uses Thompson's construction to convert regex patterns to NFAs:
|
|
|
|
1. **Lexer**: Tokenizes the pattern into a stream of tokens
|
|
2. **Parser**: Builds an AST using recursive descent parsing
|
|
3. **NFA Construction**: Converts AST to NFA using Thompson's algorithm
|
|
4. **Optimization**: Extracts literal prefixes, suffixes, and first-char sets
|
|
5. **Matching**: Simulates NFA with epsilon closure for linear-time matching
|
|
|
|
Time complexity: O(n*m) where n is pattern length and m is text length.
|
|
|
|
## License
|
|
|
|
MIT
|