diff --git a/regex.md b/regex.md
new file mode 100644
index 0000000..8c6154b
--- /dev/null
+++ b/regex.md
@@ -0,0 +1,99 @@
+# Regex
+
+## Obsession
+
+If you've had a good look around my repositories, you've probably seen that I have a special thing for regex interpreters.
+I love writing them. Writing one from scratch is the most underestimated skill there is.
+
+Yes, you can follow some basic tutorial on the internet and learn how to do it the way everyone does.
+But the real game? It's writing something you can't find anywhere else.
+
+And I've done that. Several times.
+
+Compiled ones, bytecode ones, even one that used regex itself as bytecode. That one was very special.
+Nice interpreters, fast interpreters, winning, losing... But the end product is not the interpreter.
+It's your own brain.
+
+## Why Do It?
+
+Thinking and problem solving is actually one of the best things there is.
+And by problem solving, I do **not** mean solving it using Google or some book.
+Pure thinking. With a good understanding of the language basics, you can write a regex interpreter.
+It takes a serious amount of time (do not underestimate it).
+
+But you don't need more than being hardcore at the basics (yes, that's a thing).
+The beautiful thing is, once you get into it, you can keep going without having to Google or read a book.
+It's all in your head.
+
+## The Trap of Research
+
+The most fun is when you haven't researched regex or interpreters beforehand.
+It makes you **extra creative** and lets your brain think freely.
+
+Solutions from others can be inspiring... but they can also *pollute* your thought process.
+You can get stuck in someone else's way of thinking and end up building the same thing they did.
+
+For me, the target is not to create a regex engine that beats everyone else's.
+That depends on many factors. In certain scenarios, I've even beaten the original glibc regex.
+Cool? Sure. But not the point.
+
+The goal is: **write something decent and unique**.
+Own design. No influence from others. That's it.
+
+## Questions Worth Asking
+
+Do you know what an AST is?
+Will you use one? Or will you just interpret the regex directly?
+
+The easiest way must be the fastest, right?
+Actually... no.
+
+I've benchmarked interpreters a lot, and performance really depends on the regexes themselves.
+There's no one-size-fits-all solution.
+
+An advanced bytecode-compiled one with JIT pays for compilation up front, so on the first pass it will always be slower than a dumb interpreter that just walks character by character.
+But after parsing several lines? That JIT version takes the lead.
+
+## Performance Myths
+
+Validating strings is actually such a small task for a computer.
+When it comes to performance, for most users, **it doesn't matter** which parser you pick.
+
+That's probably why everyone just uses the one bundled with their favorite programming language.
+
+But I had a parser that could parse an entire book.
+We can't say that for everyone. Looking at you, glibc regex interpreter.
+That one dies at around 10 MB of content, if I remember correctly. Something like that.
+So yeah, even things like that can be a target.
+
+## Wild Ideas
+
+What else could be fun?
+A parser that validates while reading from a file descriptor.
+
+By doing that, you can parse files of unlimited size, or even live network streams.
+James Bond stuff. Real-time regex over TCP. Tapping into streaming data.
+
+And now we're getting close to my next hobby: **protocol design**.
+But that's a story for another time.
+
+---
+
+I don't even expect people to read this far.
+
+---
+
+## Code You Should Never Read
+
+At least, not until you're ready.
+
+I'm talking about a basic regex interpreter in C, written in around 30 lines.
+I read it in a book called Beautiful Code, in a chapter by Brian Kernighan. The code itself was written by Rob Pike.
+
+I'm not posting the source because it would probably destroy your
+creativity. It's easy to find if you want to.
+
+Once you've seen it, you can't unsee it.
+
+What he built? That's the level I aim for.
+