Nice! In the "why" part you speculatively mention better speed. Do you have any concrete benchmark results?
If fast, the tool would be very useful for my work, where I run an anonymization sed script with hundreds of transformations on millions of lines; it takes hours to run.
But even if it is not fast, this is a fun project.
> no pattern/hold space overflow checks, currently both limited to 8192 bytes as per the POSIX spec requirement. Going over that limit will most likely cause a segfault.
> The C code is very rough around the edges (by that I mean dirty and unsafe, for instance allocating everything on the stack without checking any overflow), I'm still working on it, but contributions (issues/comments/pull requests) are also welcomed :)
Thus, so far it's just for fun, not suitable for profit yet.
To clarify my statement: POSIX requires at least 8192 bytes for the pattern and hold spaces, and I chose to allow at most 8192 bytes here. This indeed means it won't work on really long lines, or for storing entire files in the hold space, for instance.
No concrete benchmarks, no. I just timed a few scripts I had on hand and generated a bunch of others, and I saw that the compiled ones performed slightly faster than GNU and toybox sed, so nothing serious.
Unfortunately, it's also pretty hard to find big POSIX sed scripts in the wild, so my speed observations are centered around my own scripts. I would definitely be interested in learning more about sed scripts taking hours to run though, if you have something that I could check out that would be awesome!
Speaking of speed, I think there's also some margin for improvement in this project, like avoiding compiling the same regex multiple times (when it appears in different places in the script), and some places that could probably benefit from using hash tables instead of static arrays (address ranges, for instance). More work could also be done on the translation side regarding backrefs, which are parsed on the C side for now.
...an anonymization sed script with hundreds of transformations on millions of lines that takes hours to run.
Wow! Who decided this would be written in sed? Was the decision made in this century?
I actually wrote a small translation utility in sed for a client in 1999. Anderson claimed the utility couldn't be written, and my boss didn't want to support it. So, write it in sed! Then the Androids had to translate it into C++ by hand.
If you specify many replacement rules in sed, it will run in O(num_replacements * num_lines). You should be able to use a state machine to do it in O(num_lines), so I googled a bit and found this amazing gem: https://unix.stackexchange.com/a/137932
tl;dr: it uses lex to create a custom C program that does "just what you need", compiles it, and runs it on your input. The whole thing is wrapped in a bash function. Bonus: it's POSIX compliant.
There's already an awk-to-C compiler [1], even if it seems to be unmaintained. Python has a couple of them, more or less constrained, including Nuitka [2].