Regex

Obsession

If you've had a good look around my repositories, you've probably seen that I have a special thing for regex interpreters.
I love writing them. Writing one from scratch is the most underestimated skill there is.

Yes, you can follow some basic tutorial on the internet and learn how to do it the way everyone does.
But the real game? It's writing something you can't find anywhere else.

And I've done that. Several times.

Compiled ones, bytecode ones, even one that used regex itself as its bytecode. That one was very special.
Nice interpreters, fast interpreters, winning, losing... But the end product is not the interpreter.
It's your own brain.

Why Do It?

Thinking and problem solving are among the best things there are.
And by problem solving, I don't mean solving it with Google or some book.
Pure thinking. With a good understanding of language basics, you're able to write a regex interpreter.
It takes a serious amount of time. Do not underestimate it.

But you don't need more than being hardcore at the basics (yes, that's a thing).
The beautiful thing is, once you get into it, you can keep going without having to Google or read a book.
It's all in your head.

The Trap of Research

The most fun is when you haven't researched regex or interpreters beforehand.
It makes you extra creative and lets your brain think freely.

Solutions from others can be inspiring... but they can also pollute your thought process.
You can get stuck in someone else's way of thinking and end up building the same thing they did.

For me, the target is not to create a regex engine that beats everyone else's.
That depends on many factors. In certain scenarios, I've even beaten the original glibc regex.
Cool? Sure. But not the point.

The goal is: write something decent and unique.
Own design. No influence from others. That's it.

Questions Worth Asking

Do you know what an AST is?
Will you use one? Or will you just interpret the regex directly?
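
For anyone who has never met one, here is a minimal sketch of the AST route in Python. It is one possible shape and not the design of any of my interpreters. The supported syntax (literals, "." and "*") and the names Char, Star, Seq, parse, ends and full_match are all assumptions made up for this illustration.

```python
from dataclasses import dataclass
from typing import Iterator, Union


@dataclass
class Char:
    ch: str          # a literal character, or "." for any character


@dataclass
class Star:
    node: "Node"     # zero or more repetitions of node


@dataclass
class Seq:
    parts: list      # nodes matched one after another


Node = Union[Char, Star, Seq]


def parse(pattern: str) -> Seq:
    """Turn the pattern into a tree before any matching happens."""
    parts = []
    for ch in pattern:
        if ch == "*":
            if not parts:
                raise ValueError("'*' has nothing to repeat")
            parts[-1] = Star(parts[-1])   # wrap the previous node
        else:
            parts.append(Char(ch))
    return Seq(parts)


def ends(node: Node, text: str, pos: int) -> Iterator[int]:
    """Yield every position where node can stop if it starts at pos."""
    if isinstance(node, Char):
        if pos < len(text) and node.ch in (".", text[pos]):
            yield pos + 1
    elif isinstance(node, Star):
        yield pos                         # zero repetitions
        for mid in ends(node.node, text, pos):
            if mid > pos:                 # guard against empty loops
                yield from ends(node, text, mid)
    else:                                 # Seq
        if not node.parts:
            yield pos
            return
        head, rest = node.parts[0], Seq(node.parts[1:])
        for mid in ends(head, text, pos):
            yield from ends(rest, text, mid)


def full_match(pattern: str, text: str) -> bool:
    """True if the whole text matches the whole pattern."""
    return any(end == len(text) for end in ends(parse(pattern), text, 0))
```

Calling full_match("ab*c", "abbbc") builds the tree once and then walks it. Interpreting the pattern directly would skip parse entirely and read the pattern string while matching. Which one wins is exactly the kind of thing you have to measure.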

The easiest way must be the fastest, right?
Actually... no.

I've benchmarked interpreters a lot, and performance really depends on the regexes themselves.
There's no one-size-fits-all solution.

An advanced bytecode-compiled engine with a JIT pays its compilation cost up front, so on the first pass it will be slower than a dumb interpreter that just walks character by character.
But after parsing several lines? That JIT version takes the lead.
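
The trade-off can be put into a tiny piece of arithmetic. This is a sketch under a deliberately simple cost model that I am assuming here: the compiled engine costs a fixed setup plus a small amount per line, and the plain interpreter costs a larger amount per line with no setup. The function name and parameters are hypothetical, and the numbers you feed it have to come from your own benchmarks.

```python
import math


def break_even_lines(setup_cost: float, fast_per_line: float, slow_per_line: float) -> int:
    """Smallest number of lines at which the compiled engine is strictly ahead.

    Assumed cost model:
        compiled    = setup_cost + n * fast_per_line
        interpreter = n * slow_per_line
    """
    if fast_per_line >= slow_per_line:
        raise ValueError("the compiled engine never catches up")
    # setup + n * fast < n * slow  <=>  n > setup / (slow - fast)
    return math.floor(setup_cost / (slow_per_line - fast_per_line)) + 1
```

With a setup cost of 100 units, 1 unit per line compiled and 5 units per line interpreted, the compiled engine only pulls ahead from line 26 onward. Anything shorter than that and the dumb interpreter wins.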

Performance Myths

Validating strings is actually such a small task for a computer.
When it comes to performance, for most users, it doesn't matter which parser you pick.

That's probably why everyone just uses the one bundled with their favorite programming language.

But I had a parser that could parse an entire book.
We can't say that for everyone. Looking at you, glibc regex interpreter.
That one dies at around 10MB of content, if I remember correctly. Something like that.
So yeah, even things like that can be a target.

Wild Ideas

What else could be fun?
Using a parser that validates while walking a file descriptor.

By doing that, you can parse files of unlimited size, or even live network streams.
James Bond stuff. Real-time regex over TCP. Tapping into streaming data.
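
To make the idea concrete, here is a minimal sketch in Python of matching while reading from a file descriptor. It keeps only a small set of pattern positions in memory, so a match can span chunk boundaries and the input never has to fit in RAM. The pattern syntax (literal bytes plus "." for any byte) and the name stream_search are assumptions for this example, not a description of a real engine.

```python
import os


def stream_search(fd: int, pattern: bytes, chunk_size: int = 4096) -> bool:
    """Return True as soon as pattern occurs anywhere in the data read from fd.

    Assumed pattern syntax: literal bytes, plus b"." for any single byte.
    """
    if not pattern:
        return True
    active = set()                        # pattern indexes matched so far
    while True:
        chunk = os.read(fd, chunk_size)
        if not chunk:                     # end of file, or the peer closed
            return False
        for byte in chunk:
            nxt = set()
            for i in active | {0}:        # a new attempt may start at every byte
                if pattern[i] in (0x2E, byte):    # 0x2E is "."
                    if i + 1 == len(pattern):
                        return True
                    nxt.add(i + 1)
            active = nxt
```

The same function works on anything os.read accepts: a regular file, a pipe, or the descriptor returned by fileno() on a connected socket. Memory use is bounded by the pattern length, not by the size of the stream.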

And now we're getting close to my next hobby: protocol design.
But that's a story for another time.


I don't even expect people to read this far.


Code You Should Never Read

At least, not until you're ready.

I'm talking about a basic regex interpreter in C, written in around 30 lines. I read it in a book called Beautiful Code, in a chapter by Brian Kernighan. The code itself was written by Rob Pike.

I'm not posting the source because it would probably destroy your creativity. It's easy to find if you want to.

Once you've seen it, you can't unsee it.

What he built? That's the level I aim for.