# loreg

retoor <retoor@molodetz.nl>

A high-performance regular expression interpreter implemented from scratch in plain C. The engine compiles each pattern to an NFA with Thompson's construction and simulates that NFA directly, which keeps matching time linear in the length of the input text.

## CI

The project includes Gitea Actions CI that runs on every push and pull request:
- Build verification (release and debug)
- Full test suite (569 tests)
- Valgrind memory leak detection
- Code coverage generation

## Features

- Full regex syntax support: literals, metacharacters, quantifiers, character classes, groups, alternation, anchors
- NFA-based matching engine with Thompson construction
- Capturing groups with match position tracking
- Interactive REPL for testing patterns
- Zero external dependencies
- Comprehensive test suite with 569 tests
- Memory-safe implementation verified with Valgrind

## Building

```sh
make            # optimized release build
make debug      # debug build with symbols
make test       # run all tests
make coverage   # generate coverage report
make profile    # generate profiling report
make valgrind   # run under valgrind
```

## Usage

### Command Line

```sh
./loreg "pattern" "text"           # search for pattern in text
./loreg -m "pattern" "text"        # full match mode
./loreg -i                         # start REPL
./loreg                            # start REPL (default)
```

### REPL Commands

```
:p <pattern>  compile and set pattern
:m <text>     match text (anchored)
:s <text>     search for pattern in text
<text>        search (default)
:h            help
:q            quit
```

### C API

```c
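#include <stdio.h>

#include "loreg.h"

int main(void) {
    /* Compile once; on failure, err describes what went wrong. */
    loreg_error_t err;
    loreg_regex_t *re = loreg_compile("\\d{3}-\\d{4}", &err);
    if (!re) {
        fprintf(stderr, "error: %s\n", loreg_error_string(err));
        return 1;
    }

    /* Search anywhere in the text and print the match offsets. */
    loreg_match_t result;
    if (loreg_search(re, "call 555-1234 now", &result)) {
        printf("match at [%zu-%zu]\n", result.match_start, result.match_end);
    }

    /* Release the compiled pattern. */
    loreg_free(re);
    return 0;
}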
```

## Supported Syntax

| Pattern | Description |
|---------|-------------|
| `.` | any character except newline |
| `*` | zero or more |
| `+` | one or more |
| `?` | zero or one |
| `\|` | alternation |
| `()` | grouping and capture |
| `[]` | character class |
| `[^]` | negated character class |
| `[a-z]` | character range |
| `^` | start anchor |
| `$` | end anchor |
| `{n}` | exactly n |
| `{n,}` | n or more |
| `{n,m}` | n to m |
| `\d` | digit [0-9] |
| `\w` | word [a-zA-Z0-9_] |
| `\s` | whitespace |
| `\D` | non-digit |
| `\W` | non-word |
| `\S` | non-whitespace |
| `*?` `+?` `??` | non-greedy quantifiers |

## Architecture

```
src/
├── lexer.c     tokenizer for regex patterns
├── parser.c    recursive descent parser producing AST
├── ast.c       abstract syntax tree node types
├── nfa.c       Thompson NFA construction
├── matcher.c   NFA simulation with epsilon closure
├── loreg.c     public API
├── repl.c      interactive REPL
└── main.c      CLI entry point

include/
├── loreg.h     public header
├── lexer.h     lexer interface
├── parser.h    parser interface
├── ast.h       AST types
├── nfa.h       NFA types
├── matcher.h   matcher interface
└── repl.h      REPL interface

tests/
├── test_lexer.c       lexer unit tests (10 tests)
├── test_parser.c      parser unit tests (20 tests)
├── test_nfa.c         NFA construction tests (14 tests)
├── test_matcher.c     matching tests (27 tests)
├── test_all.c         comprehensive tests (9 tests)
└── test_integration.c integration tests (489 tests)
```

## Test Suite

The test suite contains 569 tests covering:

| Category | Description |
|----------|-------------|
| Lexer | Tokenization of patterns |
| Parser | AST construction and error handling |
| NFA | State machine construction |
| Matcher | Pattern matching correctness |
| Integration | Real-world regex patterns |

Integration tests cover:
- Literal matching and concatenation
- Dot metacharacter and wildcards
- Start/end anchors
- All quantifiers (`*`, `+`, `?`, `{n,m}`)
- Alternation and grouping
- Character classes and ranges
- Negated character classes
- Escape sequences
- Email, IP, URL, phone patterns
- Greedy vs non-greedy matching
- Nested groups and complex nesting
- Edge cases and boundary conditions
- Pathological/stress patterns

Run the tests, then verify them under Valgrind:

```sh
make test           # run all 569 tests
make valgrind       # verify zero memory leaks
```

## Algorithm

A pattern is compiled to an NFA with Thompson's construction and then simulated against the input, in four stages:

1. **Lexer**: Tokenizes the pattern into a stream of tokens
2. **Parser**: Builds an AST from the tokens using recursive descent parsing
3. **NFA Construction**: Converts the AST to an NFA using Thompson's algorithm
4. **Matching**: Simulates the NFA over a set of active states, using epsilon closure, for linear-time matching

Time complexity: O(n*m), where n is the pattern length and m is the text length. For a fixed pattern, matching time therefore grows linearly with the text.

## License

MIT