
JSON is not a friendly format to the Unix shell — it’s hierarchical, and cannot be reasonably split on any character

Yes, shell is definitely too weak to parse JSON!

(One reason I started https://oils.pub is that I saw bash completion scripts try to parse bash in bash, which is an even worse idea than trying to parse JSON in bash)

I'd argue that Awk is ALSO too weak to parse JSON

The following code assumes that it will be fed valid JSON. It has some basic validation as a function of the parsing and will most likely throw an error if it encounters something strange, but there are no guarantees beyond that.

Yeah I don't like that! If you don't reject invalid input, you're not really parsing
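
e.g. (my own toy example, not the article's code) a sed-style extractor will happily pull a value out of malformed JSON, because it never checks the structure around the match:

    $ printf '{"date": "2025-06-28"  "oops"' | sed -n 's/.*"date": "\([^"]*\)".*/\1/p'
    2025-06-28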

---

OSH and YSH both have JSON built-in, and they have the hierarchical/recursive data structures you need for the common Python/JS-like API:

    osh-0.33$ var d = { date: $(date --iso-8601) }

    osh-0.33$ json write (d) | tee tmp.txt
    {
      "date": "2025-06-28"
    }
Parse, then pretty print the data structure you got:

    $ cat tmp.txt | json read (&x)

    osh-0.33$ = x
    (Dict)  {date: '2025-06-28'}
Create a JSON syntax error on purpose:

    osh-0.33$ sed 's/"/bad/"' tmp.txt | json read (&x)
    sed: -e expression #1, char 9: unknown option to `s'
      sed 's/"/bad/"' tmp.txt | json read (&x)
                                     ^~~~
    [ interactive ]:20: json read: Unexpected EOF while parsing JSON (line 1, offset 0-0: '')
(now I see the error message could be better)

Another example from wezm yesterday: https://mastodon.decentralised.social/@wezm/1147586026608361...

YSH has JSON natively, but for anyone interested, it would be fun to test out the language by writing a JSON parser in YSH

It's fundamentally more powerful than shell and awk because it has garbage-collected data structures - https://www.oilshell.org/blog/2024/09/gc.html
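
For example, nesting dicts and lists works the same way as the session above (a quick sketch; interactive output omitted):

    osh-0.33$ var d = {user: {name: 'alice', langs: ['sh', 'awk']}}

    osh-0.33$ json write (d)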

Also, OSH is now FASTER than bash, in both computation and I/O. This is despite garbage collection, and despite being written in typed Python! I hope to publish a post about these recent improvements



> Yes, shell is definitely too weak to parse JSON!

Parsing is trivial, rejecting invalid input is trivial; the problem is representing the parsed content in a meaningful way.
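
To make that concrete (a sketch of mine, not from the article): one common workaround in bash is to flatten the tree into an associative array keyed by path strings, which throws away ordering and type information:

    declare -A doc=(
      [user.name]="alice"
      [user.langs.0]="sh"
      [user.langs.1]="awk"
    )
    echo "${doc[user.name]}"    # alice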

> bash completion scripts try to parse bash in bash

You're talking about ble.sh, right? I investigated it as well.

I think they made some choices that eventually led to the parser being too complex, largely due to the problem of representing what was parsed.

> Also, OSH is now FASTER than bash, in both computation and I/O.

According to my tests, this is true. Congratulations!


> I think they made some choices that eventually led to the parser being too complex, largely due to the problem of representing what was parsed.

No, the complexity of the parser can be attributed to the incremental parsing. ble.sh implements an incremental parser where one can update only the necessary parts of the previous syntax tree when a part of the command line is modified. I'd probably use the same data structure (but better abstracted using classes) even if I could implement the parser in C or in higher-level languages.


That makes sense, thanks for clarifying it!


I was referring to the bash-completion project, the default on Debian/Ubuntu - https://github.com/scop/bash-completion/

But yes, ble.sh also has a shell parser in shell, although it uses a state machine style that's more principled than bash regex / sed crap.

---

Also, distro build systems, like Alpine Linux's and others, tend to parse shell in shell (or with sed).

They often need package metadata without executing package builds, so they do that by trying to parse shell.
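
A hedged example of what that looks like (APKBUILD files are plain shell scripts and pkgver is a real field, but this exact sed invocation is mine):

    # pull the version out of an Alpine APKBUILD without sourcing it
    $ sed -n 's/^pkgver=//p' APKBUILD

It falls over as soon as the assignment uses quoting or variable expansion, which is exactly why this kind of ad hoc parsing is fragile.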

In YSH, you will be able to do that with reflection, basically like Lisp/Python/Ruby, rather than ad hoc parsing.

---

I'm glad to hear you can see the effect of the optimizations! That took a long time :-)

Some more benchmarks here, which I'll write about: https://oils.pub/release/0.33.0/benchmarks.wwz/osh-runtime/


> uses a state machine style

That's the way to go. I don't even consider other shallow and ad-hoc approaches as actually parsing it.

I've been working on a state-machine-based parser of my own. It's hard; I'm targeting very barebones interpreters such as posh and dash. Here's what it looks like:

https://gist.github.com/alganet/23df53c567b8a0bf959ecbc7b689...

(Not a fully working example, but it gives an idea of what pure POSIX shell parsing looks like; ignore the aliases, they won't be in the final version.)
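
As a generic illustration of the style (a minimal sketch of mine, not taken from the gist): a character-at-a-time loop driven by a case on the current state, using only POSIX parameter expansion:

    # usage: echo 'ls -l | wc' | sh tokens.sh
    # prints: WORD ls / WORD -l / OP | / WORD wc
    state=blank tok=
    while IFS= read -r line; do
      rest=$line
      while [ -n "$rest" ]; do
        c=${rest%"${rest#?}"}   # first character of $rest
        rest=${rest#?}          # drop it
        case $state:$c in
          blank:' ') ;;                                           # skip blanks
          blank:'|') printf 'OP |\n' ;;
          blank:*)   tok=$c; state=word ;;                        # start a word
          word:' ')  printf 'WORD %s\n' "$tok"; state=blank ;;
          word:'|')  printf 'WORD %s\n' "$tok"; printf 'OP |\n'; state=blank ;;
          word:*)    tok=$tok$c ;;                                # extend the word
        esac
      done
      [ "$state" = word ] && { printf 'WORD %s\n' "$tok"; state=blank; }
    done

Real shells also need states for quoting, here-docs and so on, which is where the complexity (and the incremental-update machinery mentioned above) comes from.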

> I'm glad to hear you can see the effect of the optimizations! That took a long time :-)

Yep, I've been testing osh since 0.9! Still a long way to go to catch up with ksh93 though; it's the fastest of all shells (faster even than dash) by a wide margin.

By beating bash, you have also beaten zsh (it's one of the slowest shells around).


Yes thanks for testing it. I hope we can get to "awk/Python speed" eventually, but I'm happy to have exceeded "bash and zsh speed"!

And I did notice that zsh can be incredibly slow -- just the parser itself is many times slower than other shells

A few years ago a zsh dev came on Zulip and wished us luck, probably because they know the zsh codebase has a bunch of technical debt

i.e. these codebases are so old that the maintainers only have so much knowledge/time to improve things ... Every once in a while I think you have to start from scratch :)


You may well already be aware, but just in case you aren't: your bin-true benchmark mostly measures dynamic loader overhead, not fork-exec (e.g., I got 5.2X faster using a musl-gcc statically linked true vs. glibc dynamic coreutils). What you want to measure is kind of a distro/cultural thing (static linking is common on Alpine Linux and the BSDs, less so on most other Linux distros), but the effect is good to know about.
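
If anyone wants to reproduce the comparison, something like this works (a rough sketch; it assumes musl-gcc is installed, and the numbers will obviously vary by machine):

    # a trivial stand-in for /bin/true, statically linked against musl
    $ printf 'int main(void){return 0;}\n' > true.c
    $ musl-gcc -static -O2 -o true-static true.c
    $ time for i in $(seq 2000); do /bin/true; done       # dynamic glibc coreutils
    $ time for i in $(seq 2000); do ./true-static; done   # static musl binary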


Yup, I added an osh-static column there, because I know dynamic linking slows things down. (With the latest release, we have a documented build script to make osh-static, which I tested with GNU libc and musl: https://oils.pub/release/latest/doc/help-mirror.html)

Although I think the CALLING process (the shell) being dynamically linked affects the speed too, not just the CALLED process (/bin/true)

I'd like to read an analysis of why that is! And some deeper measurements


The calling process being dynamically linked might impact fork() a lot to copy the various page table setups and then a tiny bit more in exec*() to tear them down. Not sure something like a shell has vfork() available as an option, but I saw major speed-ups for Python launching using vfork vs. fork. Of course, a typical Python instance has many more .so's linked in than osh probably has.

One could probably set up a simple linear regression to get a good estimate of added cost-per-loaded .so on various OS-CPU combos, but I am unaware of a write-up of such. It'd be a good assignment for an OS class, though.
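
A rough sketch of that experiment (mine, untested as a whole; it assumes gcc and hyperfine, and needs --no-as-needed so the otherwise-unused libraries actually get loaded):

    # build a no-op binary linked against k dummy shared libraries,
    # then time it; the slope of time vs. k estimates the cost per .so
    printf 'int main(void){return 0;}\n' > main.c
    for k in 0 8 16 32 64; do
      libs=''
      i=1
      while [ "$i" -le "$k" ]; do
        printf 'int f%d(void){return %d;}\n' "$i" "$i" > "dummy$i.c"
        gcc -shared -fPIC -o "libdummy$i.so" "dummy$i.c"
        libs="$libs -ldummy$i"
        i=$((i + 1))
      done
      gcc -o "noop-$k" main.c -L. -Wl,--no-as-needed $libs
      LD_LIBRARY_PATH=. hyperfine --shell=none "./noop-$k"
    done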


I don't really buy that shell / awk is "too weak" to deal with JSON; the ecosystem of tools is just fairly immature, as most of the shell's common tools predate JSON by at least a decade. `jq` is a pretty reasonable addition to the standard set of tools included in environments by default.

IMO the real problem is that JSON doesn't work very well here because its core abstraction is objects. It's a pain to deal with in pretty much every statically typed, non-object-oriented language unless you parse it into native, predefined data structures (think annotated Go structs, Rust, etc.).


I'd say that awk really is too weak. Awk has a grand total of 2 data types: strings, and associative arrays mapping strings to strings. There is no support for arbitrarily nested data structures. You can simulate them with arrays if you really want to, or you could shell out to jq, but it's definitely swimming upstream.
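
To make that concrete (a small sketch, not from the article): awk's "multi-dimensional" subscripts are just strings joined with SUBSEP, so nesting has to be faked with composite keys:

    $ awk 'BEGIN {
        doc["user", "name"]     = "alice"
        doc["user", "langs", 1] = "sh"
        doc["user", "langs", 2] = "awk"
        print doc["user", "langs", 2]
      }'
    awk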

Most languages aren't quite that bad. Even if they can't handle JSON very ergonomically, almost every language has at least some concept of nesting objects inside other objects.

What about shell? Just like awk, bash and zsh have a limited number of data types (the same two as awk plus non-associative arrays). So arguably it has the same problem. On the other hand, as you say, in shell it's perfectly idiomatic to use external tools, and jq is one such tool, available on an increasing number of systems. So you may as well store JSON data in your string variables and use jq to access it as needed. Probably won't be any slower than the calls to sed or awk or cut that fill out most shell scripts.
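
i.e. something like this (a sketch; the JSON and the path are made up):

    resp='{"user": {"name": "alice", "langs": ["sh", "awk"]}}'
    name=$(printf '%s' "$resp" | jq -r '.user.name')
    echo "$name"    # alice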

Now, personally, I've gotten into the habit of writing shell scripts with minimal use of external tools. If you stick to shell builtins, your script will run much faster. And both bash and zsh have a pretty decent suite of string manipulation tools, including some regex support, so you often don't actually need sed or awk or cut. However, this also rules out jq, and neither shell has any remotely comparable builtin.
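
For example, parameter expansion and [[ =~ ]] cover a lot of what sed/cut get used for (a bash sketch; zsh's equivalents are spelled a bit differently):

    path='/usr/local/bin/osh'
    echo "${path##*/}"    # basename: osh
    echo "${path%/*}"     # dirname:  /usr/local/bin
    ver='osh-0.33'
    if [[ $ver =~ ([0-9]+)\.([0-9]+) ]]; then
      echo "major=${BASH_REMATCH[1]} minor=${BASH_REMATCH[2]}"
    fi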

But you might reasonably object that if I care about speed, I would be better off using a real programming language!


The same author had already made the more thorough jawk. They explicitly said they wanted a cut-down version. It's not illegal to want a cut-down version of something.



