# loreg

retoor

A high-performance regular expression interpreter implemented from scratch in plain C. The engine uses Thompson's NFA construction algorithm for efficient pattern matching.

## CI

The project includes Gitea Actions CI that runs on every push and pull request:

- Build verification (release and debug)
- Full test suite (569 tests)
- Valgrind memory leak detection
- Code coverage generation

## Features

- Full regex syntax support: literals, metacharacters, quantifiers, character classes, groups, alternation, anchors
- NFA-based matching engine with Thompson construction
- Capturing groups with match position tracking
- Interactive REPL for testing patterns
- Zero external dependencies
- Comprehensive test suite with 569 tests
- Memory-safe implementation verified with Valgrind

## Building

```sh
make            # optimized release build
make debug      # debug build with symbols
make test       # run all tests
make coverage   # generate coverage report
make profile    # generate profiling report
make valgrind   # run under valgrind
```

## Usage

### Command Line

```sh
./loreg "pattern" "text"       # search for pattern in text
./loreg -m "pattern" "text"    # full match mode
./loreg -i                     # start REPL
./loreg                        # start REPL (default)
```

### REPL Commands

```
:p    compile and set pattern
:m    match text (anchored)
:s    search for pattern in text
      search (default)
:h    help
:q    quit
```

### C API

```c
#include <stdio.h>
#include "loreg.h"

int main(void)
{
    loreg_error_t err;
    /* \d{3}-\d{4}: three digits, a hyphen, four digits */
    loreg_regex_t *re = loreg_compile("\\d{3}-\\d{4}", &err);
    if (!re) {
        fprintf(stderr, "error: %s\n", loreg_error_string(err));
        return 1;
    }

    loreg_match_t result;
    if (loreg_search(re, "call 555-1234 now", &result)) {
        printf("match at [%zu-%zu]\n", result.match_start, result.match_end);
    }

    loreg_free(re);
    return 0;
}
```

## Supported Syntax

| Pattern | Description |
|---------|-------------|
| `.` | any character except newline |
| `*` | zero or more |
| `+` | one or more |
| `?` | zero or one |
| `\|` | alternation |
| `()` | grouping and capture |
| `[]` | character class |
| `[^]` | negated character class |
| `[a-z]` | character range |
| `^` | start anchor |
| `$` | end anchor |
| `{n}` | exactly n |
| `{n,}` | n or more |
| `{n,m}` | n to m |
| `\d` | digit [0-9] |
| `\w` | word [a-zA-Z0-9_] |
| `\s` | whitespace |
| `\D` | non-digit |
| `\W` | non-word |
| `\S` | non-whitespace |
| `*?` `+?` `??` | non-greedy quantifiers |

## Architecture

```
src/
├── lexer.c       tokenizer for regex patterns
├── parser.c      recursive descent parser producing AST
├── ast.c         abstract syntax tree node types
├── nfa.c         Thompson NFA construction
├── matcher.c     NFA simulation with epsilon closure
├── loreg.c       public API
├── repl.c        interactive REPL
└── main.c        CLI entry point

include/
├── loreg.h       public header
├── lexer.h       lexer interface
├── parser.h      parser interface
├── ast.h         AST types
├── nfa.h         NFA types
├── matcher.h     matcher interface
└── repl.h        REPL interface

tests/
├── test_lexer.c         lexer unit tests (10 tests)
├── test_parser.c        parser unit tests (20 tests)
├── test_nfa.c           NFA construction tests (14 tests)
├── test_matcher.c       matching tests (27 tests)
├── test_all.c           comprehensive tests (9 tests)
└── test_integration.c   integration tests (489 tests)
```

## Test Suite

The test suite contains 569 tests covering:

| Category | Description |
|----------|-------------|
| Lexer | Tokenization of patterns |
| Parser | AST construction and error handling |
| NFA | State machine construction |
| Matcher | Pattern matching correctness |
| Integration | Real-world regex patterns |

Integration tests cover:

- Literal matching and concatenation
- Dot metacharacter and wildcards
- Start/end anchors
- All quantifiers (`*`, `+`, `?`, `{n,m}`)
- Alternation and grouping
- Character classes and ranges
- Negated character classes
- Escape sequences
- Email, IP, URL, phone patterns
- Greedy vs non-greedy matching
- Nested groups and complex nesting
- Edge cases and boundary conditions
- Pathological/stress patterns

Run tests with Valgrind verification:

```sh
make test       # run all 569 tests
make valgrind   # verify zero memory leaks
```

## Algorithm

The implementation uses Thompson's construction to convert regex patterns to NFAs:

1. **Lexer**: Tokenizes the pattern into a stream of tokens
2. **Parser**: Builds an AST using recursive descent parsing
3. **NFA Construction**: Converts the AST to an NFA using Thompson's algorithm
4. **Matching**: Simulates the NFA with epsilon closure, so matching time is linear in the text length for a fixed pattern

Time complexity: O(n*m), where n is the pattern length and m is the text length.

## License

MIT
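
## Appendix: NFA simulation sketch

The matching step described under Algorithm can be illustrated with a minimal, self-contained sketch. This is not loreg's actual code: the names `state_t`, `add_closure`, `nfa_full_match`, and `example_nfa` are hypothetical, the NFA for `a(b|c)*d` is built by hand rather than by a parser, and captures and non-greedy quantifiers are left out. It only shows how tracking a set of active states, expanded through epsilon edges, keeps the work per input character bounded by the number of states.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_STATES 64

typedef enum { ST_CHAR, ST_SPLIT, ST_MATCH } state_kind_t;

typedef struct {
    state_kind_t kind;
    char c;    /* ST_CHAR: character to consume */
    int out;   /* ST_CHAR, ST_SPLIT: next state */
    int out1;  /* ST_SPLIT: second epsilon branch */
} state_t;

/* Add state s, and every state reachable from it through epsilon
 * (ST_SPLIT) edges, to the set. The visited check stops epsilon cycles. */
static void add_closure(const state_t *nfa, int s, bool *set)
{
    if (set[s])
        return;
    set[s] = true;
    if (nfa[s].kind == ST_SPLIT) {
        add_closure(nfa, nfa[s].out, set);
        add_closure(nfa, nfa[s].out1, set);
    }
}

/* Anchored match: true if the whole text is accepted by the NFA. */
bool nfa_full_match(const state_t *nfa, int n_states, int start,
                    const char *text)
{
    bool cur[MAX_STATES] = { false };
    bool next[MAX_STATES];

    assert(n_states > 0 && n_states <= MAX_STATES);
    add_closure(nfa, start, cur);

    for (const char *p = text; *p; p++) {
        memset(next, 0, sizeof next);
        /* Advance every active state that can consume *p. */
        for (int s = 0; s < n_states; s++)
            if (cur[s] && nfa[s].kind == ST_CHAR && nfa[s].c == *p)
                add_closure(nfa, nfa[s].out, next);
        memcpy(cur, next, sizeof cur);
    }

    for (int s = 0; s < n_states; s++)
        if (cur[s] && nfa[s].kind == ST_MATCH)
            return true;
    return false;
}

/* Hand-built NFA for a(b|c)*d; state 0 is the start state. */
static const state_t example_nfa[] = {
    { ST_CHAR,  'a', 1, -1 },  /* 0: consume 'a' */
    { ST_SPLIT, 0,   2,  5 },  /* 1: loop entry: alternation or 'd' */
    { ST_SPLIT, 0,   3,  4 },  /* 2: choose 'b' or 'c' */
    { ST_CHAR,  'b', 1, -1 },  /* 3: consume 'b', back to loop */
    { ST_CHAR,  'c', 1, -1 },  /* 4: consume 'c', back to loop */
    { ST_CHAR,  'd', 6, -1 },  /* 5: consume 'd' */
    { ST_MATCH, 0,  -1, -1 },  /* 6: accept */
};
```

Calling `nfa_full_match(example_nfa, 7, 0, "abcbd")` accepts the input, while `"add"` is rejected because no active state remains after the first `d` is consumed. Each character visits at most every state once, which is where the O(n*m) bound comes from.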