Nice! In the "why" part you speculatively mention better speed. Do you have any concrete benchmark results?
If fast, the tool would be very useful for my work, where I run an anonymization sed script with hundreds of transformations on millions of lines; it takes hours to run.
But even if it is not fast, this is a fun project.
> no pattern/hold space overflow checks, currently both limited to 8192 bytes as per the POSIX spec requirement. Going over that limit will most likely cause a segfault.
> The C code is very rough around the edges (by that I mean dirty and unsafe, for instance allocating everything on the stack without checking any overflow), I'm still working on it, but contributions (issues/comments/pull requests) are also welcomed :)
Thus, so far it's just for fun, not suitable for profit yet.
To clarify my statement: POSIX requires at least 8192 bytes for the pattern and hold spaces, and I chose to allow at most 8192 bytes here. This indeed means it won't work on really long lines, or for storing entire files in the hold space, for instance.
No concrete benchmarks, no. I just timed a few scripts I had on hand and generated a bunch of others, and I saw that the compiled ones performed slightly faster than GNU and toybox sed, so nothing serious.
Unfortunately, it's also pretty hard to find big POSIX sed scripts in the wild, so my speed observations are centered around my own scripts. I would definitely be interested in learning more about sed scripts taking hours to run though, if you have something that I could check out that would be awesome!
Speaking of speed, I think there's also some margin for improvement in this project, like avoiding compiling the same regex multiple times (when it appears in different places in the script), and some places that could probably benefit from using hash tables instead of static arrays (address ranges, for instance). More work could also be done on the translation side regarding backrefs, which are parsed on the C side for now.
...an anonymization sed script with hundreds of transformations on millions of lines that takes hours to run.
Wow! Who decided this would be written in sed? Was the decision made in this century?
I actually wrote a small translation utility in sed for a client in 1999. Anderson claimed the utility couldn't be written, and my boss didn't want to support it. So, write it in sed! Then the Androids had to translate it into C++ by hand.
If you specify many replacement rules in sed, it will run in O(num_replacements * num_lines). You should be able to use a state machine to do it in O(num_lines), so I googled a bit and found this amazing gem: https://unix.stackexchange.com/a/137932
tl;dr: it uses lex to create a custom C program that does "just what you need", compiles it, and runs it on your input. The whole thing is wrapped in a bash function. Bonus: it's POSIX compliant.
There's already an awk-to-C compiler [1], even if it seems to be unmaintained. Python has a couple of them, more or less constrained, including Nuitka [2].