I've been noticing more Swiss-Army-knife-like CLI tools in the last few years. It would be cool if some of them supported Avro/Parquet/ORC formats. This one is notable because it's written in D by a megacorp.
Big fan of visidata https://visidata.org/ - one of those tools that does a lot but makes the useful things easy with little complexity. For example: open a file, select a column, and Shift-F gives a histogram; Enter on one of those rows filters the data down to just them.
It can load a variety of data formats, plus anything pandas can (so Parquet works).
Edit - you can do a lot with it, quickly creating columns with Python expressions, plots, joining tables, etc., but it's also great for things like "pipe in kubectl output - get counts of each job status, wait, why are 10 failed?, filter to those...", which works easily with any fixed-width output.
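For those who think in pandas, that histogram-then-filter flow maps roughly onto value_counts plus a boolean mask. A quick sketch (the file name and the STATUS column are assumed, just for illustration):

    import pandas as pd

    # Fixed-width output (e.g. saved kubectl output); names here are placeholders.
    jobs = pd.read_fwf("jobs.txt")

    # Roughly what Shift-F shows: counts per value in the selected column.
    print(jobs["STATUS"].value_counts())

    # "Enter on one of those rows" is essentially filtering to that value.
    failed = jobs[jobs["STATUS"] == "Failed"]
    print(failed)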
Thanks for this. Looks perfect for my needs. I've got to analyse a hideous generated CSV file with thousands of columns. I was planning to load it into SQLite3, but it has too many columns for import. I was also unpleasantly surprised to find that the column names were too long for PostgreSQL too. Python + pandas can handle it, but a tool for quick interactive exploration is just what I need. Hopefully visidata will provide!
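(For reference, the pandas route I'd fall back on looks roughly like this; the file name and column picks are placeholders:)

    import pandas as pd

    # Read only the header row to see what the thousands of columns actually are.
    cols = pd.read_csv("dump.csv", nrows=0).columns
    print(len(cols), list(cols[:20]))

    # Then pull in just the handful of columns that matter for the question at hand.
    df = pd.read_csv("dump.csv", usecols=["id", "status", "created_at"])
    print(df["status"].value_counts())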
If you still hit size issues, you might be able to run things through xsv first (list columns, select just a few columns and pipe out, get stats on the contents of columns, samples, etc.). It's one of my other go-to tools: https://github.com/BurntSushi/xsv
> gron transforms JSON into discrete assignments to make it easier to grep for what you want and see the absolute 'path' to it. It eases the exploration of APIs that return large blobs of JSON but have terrible documentation.
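In Python terms, the idea is something like this toy flattener (not gron itself, just an illustration of the output shape):

    import json

    def flatten(value, path="json"):
        # Emit one greppable assignment per leaf value, gron-style.
        if isinstance(value, dict):
            for k, v in value.items():
                yield from flatten(v, f"{path}.{k}")
        elif isinstance(value, list):
            for i, v in enumerate(value):
                yield from flatten(v, f"{path}[{i}]")
        else:
            yield f"{path} = {json.dumps(value)};"

    doc = {"items": [{"id": 1, "tags": ["a", "b"]}]}
    for line in flatten(doc):
        print(line)
    # json.items[0].id = 1;
    # json.items[0].tags[0] = "a";
    # json.items[0].tags[1] = "b";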
* vd (visidata)
* xsv (often piping into vd)
* jq (though I often forget the syntax =P)
* gron (nice for finding the json-path to use in the jq script)
* xmlstarlet (think jq for xml)
* awk (or mawk when I need even more speed)
* GNU parallel (or just xargs for simple stuff)
Miller (mlr) is now my go-to tool for CSV/TSV manipulation.
I've used it from time to time over the past year and it's been a joy to use.
I must admit I haven't tried any other tool for the purpose, so I don't know how it compares.
I know that everything looks like a nail if your only tool is a hammer, and it's fun to nail together square wheels out of plywood, but there are actually other tools out there with built-in, compiled, optimized, documented, tested, well-maintained, fully compliant JSON parsers.
These tools probably do a good job at processing CSV/TSV/DSV (I haven't tried them). However, I would love it if we could just stop using delimiter-separated value files altogether.
Why? Because the file format is an underspecified, unsafe mess. When you get this kind of file you have to specify its schema manually when reading it, because the file doesn't contain one. Also, because of that underspecification, there are many unsafe implementations that produce broken files that can't be read without manual fixing in a text editor. Let's just start using safe, well-specified file formats like Avro, Parquet or ORC.
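To make the schema point concrete, here's a minimal pandas sketch (pyarrow assumed installed, column names made up): a Parquet file carries its column names and types, so the reader never has to guess them.

    import pandas as pd

    df = pd.DataFrame({
        "id": [1, 2],
        "price": [9.99, 5.50],
        "ts": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    })
    df.to_parquet("example.parquet")   # schema (names + types) is stored in the file

    back = pd.read_parquet("example.parquet")
    print(back.dtypes)   # int64, float64, datetime64[ns] -- preserved, unlike a CSV round trip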
As a data scientist, I have had lots of issues because the data I got for a project was a CSV/TSV/DSV file. I recently spat out a rant on this topic, so if you want more details, check out https://haveagooddata.net/posts/why-you-dont-want-to-use-csv...
We can't stop, because it's a de facto standard format for exchange with spreadsheet programs. So long as that's ubiquitous, we might as well write tools to make processing them easier.
Also, I'm not sure why you called CSV unsafe. It's certainly the case that it's severely under-specified, but I don't think there's anything unsafe about it.
One example of it being unsafe that happened to me: I got a CSV file written by a program with a broken CSV writer that didn't quote string fields containing a newline (in my case, only the first half of one: a carriage return). I then read the file with a broken CSV reader that assumed the carriage return meant a new record and filled both halves of the broken line with N/As instead of throwing an error. So the data in the sink didn't match the data in the source. That's a loss of data integrity, which I would call unsafe. It doesn't happen with a file format that serializes your data safely.
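You can see the failure mode with Python's csv module in a few lines (a minimal sketch; the toy data stands in for the real file):

    import csv, io

    # One field contains an embedded carriage return -- the kind of value that broke my pipeline.
    rows = [["id", "comment"], ["1", "first half\rsecond half"], ["2", "plain"]]

    buf = io.StringIO()
    csv.writer(buf).writerows(rows)   # a compliant writer quotes the field with the CR
    text = buf.getvalue()

    # The broken reader: treat every line break as a new record and split on commas.
    naive = [line.split(",") for line in text.splitlines()]
    print(len(naive))   # 4 "records" -- the CR row has been torn in two

    # A compliant reader keeps the quoted field intact and the data round-trips.
    good = list(csv.reader(io.StringIO(text)))
    print(len(good))    # 3 records, including the header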
Due to the format being underspecified, many people roll their own unsafe CSV writer or CSV reader, thus every CSV file (where you don't completely control the source) is potentially broken.
Edit: Browsing your Github account I found that you implemented a CSV parser in Rust. I didn't know that when I wrote the above comment, so I was definitely not trying to imply that your particular CSV parser is unsafe.
What makes you think that if people manage to misimplement CSV parsers and generators they are not going to misimplement other formats? At least with CSV it’s always easy to implement some sort of heuristic that splits the rows correctly.
The only times when I had to deal with the issues you describe I was supplied with the data from a literally dying company. They just didn’t give a damn. Changing the file formats wouldn’t change anything - they would still find a way to mess it up.
Ah I see, yeah - where I come from, "unsafe" has a bit more weight to it. I'd call what you describe "silently incorrect." Which is also quite bad, to be fair!
> These tools probably do a good job at processing CSV/TSV/DSV (I haven't tried them). However, I would love it if we could just stop using delimiter-separated value files altogether.
I hear ya. I have no doubt that "we" could, as in IT professionals. Maybe even the "surrounding" science fields that provide data could make an effort. But you're out of luck when it comes to almost any other field that serves you data, in my experience.
If you can't even tell an org the details of the CSV you want (what's the "C"? UTF-8? Quoting?), it has no chance of providing you with something more complex.
It's partly our fault: The tools we provide them with suck. Excel's CSV handling is atrocious. Salesforce and similar tools seem to spit out barely consistent data dumps.
Sometimes I feel like 80% of the industry is dealing with sanitizing input.
Some useful CLI data-wrangling tools --
https://github.com/BurntSushi/xsv
https://github.com/dinedal/textql
https://github.com/n3mo/data-science
https://stedolan.github.io/jq/
https://gitlab.redox-os.org/redox-os/parallel
https://github.com/willghatch/racket-rash
Would you have any others you recommend?