lorex/README.md at master

 # lorex
 retoor <retoor@molodetz.nl>
 A high-performance regular expression interpreter implemented from scratch in plain C. The engine uses Thompson's NFA construction algorithm with extensive optimizations for efficient pattern matching.
 ## CI
 The project includes Gitea Actions CI that runs on every push and pull request:
 - Build verification (release and debug)
 - Full test suite (545 tests)
 - Valgrind memory leak detection
 - Code coverage generation
 ## Features
 - Full regex syntax support: literals, metacharacters, quantifiers, character classes, groups, alternation, anchors
 - NFA-based matching engine with Thompson construction
 - Capturing groups with match position tracking
 - Interactive REPL for testing patterns
 - Zero external dependencies
 - Comprehensive test suite with 545 tests
 - Memory-safe implementation verified with Valgrind
 ## Performance
 The engine includes multiple optimization techniques:
 | Optimization | Description |
 |--------------|-------------|
 | Literal prefix extraction | Uses `strstr`/`memchr` to skip non-matching positions |
 | First character filtering | Bitmap-based filtering of potential match positions |
 | Alternation dispatch table | 256-byte lookup for fast alternation branch selection |
 | End anchor backward search | Searches backward from suffix for `$` anchored patterns |
 | Character class bitmaps | O(1) lookup tables for `\d`, `\w`, `\s` classes |
 | Match context reuse | Pre-allocated buffers reduce per-match allocations |
 | Cache-optimized structures | Field ordering minimizes padding waste |
 Benchmark results against POSIX regex (81 test patterns):
 | Category | Performance |
 |----------|-------------|
 | Character classes | LOREX 1.24x faster |
 | Groups | LOREX 1.12x faster |
 | Real-world patterns | LOREX 1.05x faster |
 | Nested groups | LOREX 2.7x faster |
 | Complex email patterns | LOREX 1.8x faster |
 ## Building
 ```sh
 make            # optimized release build
 make debug      # debug build with symbols
 make test       # run all tests
 make benchmark  # run performance benchmark
 make coverage   # generate gcov coverage report
 make lcov       # generate html coverage report (requires lcov)
 make valgrind   # run under valgrind
 ```
 ### Dependencies
 Build requirements:
 - GCC with C11 support
 - GNU Make
 Optional:
 - valgrind (memory leak detection)
 - lcov (html coverage reports): `apt install lcov`
 ## Usage
 ### Command Line
 ```sh
 ./lorex "pattern" "text"           # search for pattern in text
 ./lorex -m "pattern" "text"        # full match mode
 ./lorex -i                         # start REPL
 ./lorex                            # start REPL (default)
 ```
 ### REPL Commands
 ```
 :p <pattern>  compile and set pattern
 :m <text>     match text (anchored)
 :s <text>     search for pattern in text
 <text>        search (default)
 :h            help
 :q            quit
 ```
 ### C API
 ```c
 #include "lorex.h"
 lorex_error_t err;
 lorex_regex_t *re = lorex_compile("\\d{3}-\\d{4}", &err);
 if (!re) {
     fprintf(stderr, "error: %s\n", lorex_error_string(err));
     return 1;
 }
 lorex_match_t result;
 if (lorex_search(re, "call 555-1234 now", &result)) {
     printf("match at [%zu-%zu]\n", result.match_start, result.match_end);
 }
 lorex_free(re);
 ```
 ## Supported Syntax
 | Pattern | Description |
 |---------|-------------|
 | `.` | any character except newline |
 | `*` | zero or more |
 | `+` | one or more |
 | `?` | zero or one |
 | `\|` | alternation |
 | `()` | grouping and capture |
 | `[]` | character class |
 | `[^]` | negated character class |
 | `[a-z]` | character range |
 | `^` | start anchor |
 | `$` | end anchor |
 | `{n}` | exactly n |
 | `{n,}` | n or more |
 | `{n,m}` | n to m |
 | `\d` | digit [0-9] |
 | `\w` | word [a-zA-Z0-9_] |
 | `\s` | whitespace |
 | `\D` | non-digit |
 | `\W` | non-word |
 | `\S` | non-whitespace |
 | `*?` `+?` `??` | non-greedy quantifiers |
 ## Architecture
 ```
 src/
 ├── lexer.c     tokenizer for regex patterns
 ├── parser.c    recursive descent parser producing AST
 ├── ast.c       abstract syntax tree node types
 ├── nfa.c       Thompson NFA construction with optimizations
 ├── matcher.c   NFA simulation with epsilon closure
 ├── lorex.c     public API
 ├── repl.c      interactive REPL
 └── main.c      CLI entry point
 include/
 ├── lorex.h     public header
 ├── lexer.h     lexer interface
 ├── parser.h    parser interface
 ├── ast.h       AST types
 ├── nfa.h       NFA types and optimization metadata
 ├── matcher.h   matcher interface
 └── repl.h      REPL interface
 tests/
 ├── test_lexer.c       lexer unit tests
 ├── test_parser.c      parser unit tests
 ├── test_nfa.c         NFA construction tests
 ├── test_matcher.c     matching tests
 ├── test_all.c         comprehensive tests
 ├── test_integration.c integration tests (545 tests)
 └── benchmark.c        performance benchmark vs POSIX regex
 ```
 ## Test Suite
 The test suite contains 545 tests covering:
 | Category | Description |
 |----------|-------------|
 | Lexer | Tokenization of patterns |
 | Parser | AST construction and error handling |
 | NFA | State machine construction |
 | Matcher | Pattern matching correctness |
 | Integration | Real-world regex patterns |
 Integration tests cover:
 - Literal matching and concatenation
 - Dot metacharacter and wildcards
 - Start/end anchors
 - All quantifiers (*, +, ?, {n,m})
 - Alternation and grouping
 - Character classes and ranges
 - Negated character classes
 - Escape sequences
 - Email, IP, URL, phone patterns
 - Greedy vs non-greedy matching
 - Nested groups and complex nesting
 - Edge cases and boundary conditions
 - Pathological/stress patterns
 Run tests with Valgrind verification:
 ```sh
 make test           # run all 545 tests
 make valgrind       # verify zero memory leaks
 ```
 ## Algorithm
 The implementation uses Thompson's construction to convert regex patterns to NFAs:
 . **Lexer**: Tokenizes the pattern into a stream of tokens
 . **Parser**: Builds an AST using recursive descent parsing
 . **NFA Construction**: Converts AST to NFA using Thompson's algorithm
 . **Optimization**: Extracts literal prefixes, suffixes, and first-char sets
 . **Matching**: Simulates NFA with epsilon closure for linear-time matching
 Time complexity: O(n*m) where n is pattern length and m is text length.
 ## License
 MIT