I've been noticing more Swiss-Army-knife-like CLI tools in the last few years. It would be cool if some of them supported Avro/Parquet/ORC formats. This one is notable because it's written in D by a megacorp.
Big fan of visidata https://visidata.org/ - one of those tools that does a lot but makes the useful things easy with little complexity. For example: open a file, select a column, and Shift-F gives a histogram; Enter on one of those rows filters the data down to just them.
It can load a variety of data formats, plus anything pandas can (so Parquet works).
Edit - you can do a lot with it, quickly creating columns with Python expressions, plots, joining tables, etc., but it's also great for things like "pipe in kubectl output - get counts of each job status, wait, why are 10 failed?, filter to those...", which works easily with any fixed-width output.
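For those who think in pandas, that histogram-then-filter flow maps roughly onto value_counts plus a boolean mask. A quick sketch (the file name and the STATUS column are assumed, just for illustration):

    import pandas as pd

    # Fixed-width output (e.g. saved kubectl output); names here are placeholders.
    jobs = pd.read_fwf("jobs.txt")

    # Roughly what Shift-F shows: counts per value in the selected column.
    print(jobs["STATUS"].value_counts())

    # "Enter on one of those rows" is essentially filtering to that value.
    failed = jobs[jobs["STATUS"] == "Failed"]
    print(failed)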
Thanks for this. Looks perfect for my needs. I've got to analyse a hideous generated CSV file with thousands of columns. I was planning to load it into SQLite3, but it has too many columns for import. I was also unpleasantly surprised to find that the column names were too long for PostgreSQL too. Python + pandas can handle it, but a tool for quick interactive exploration is just what I need. Hopefully visidata will provide!
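(For reference, the pandas route I'd fall back on looks roughly like this; the file name and column picks are placeholders:)

    import pandas as pd

    # Read only the header row to see what the thousands of columns actually are.
    cols = pd.read_csv("dump.csv", nrows=0).columns
    print(len(cols), list(cols[:20]))

    # Then pull in just the handful of columns that matter for the question at hand.
    df = pd.read_csv("dump.csv", usecols=["id", "status", "created_at"])
    print(df["status"].value_counts())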
If you still hit size issues, you might be able to run things through xsv first (list columns, select just a few columns and pipe out, get stats on the contents of columns, samples, etc.). It's one of my other go-to tools: https://github.com/BurntSushi/xsv
> gron transforms JSON into discrete assignments to make it easier to grep for what you want and see the absolute 'path' to it. It eases the exploration of APIs that return large blobs of JSON but have terrible documentation.
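In Python terms, the idea is something like this toy flattener (not gron itself, just an illustration of the output shape):

    import json

    def flatten(value, path="json"):
        # Emit one greppable assignment per leaf value, gron-style.
        if isinstance(value, dict):
            for k, v in value.items():
                yield from flatten(v, f"{path}.{k}")
        elif isinstance(value, list):
            for i, v in enumerate(value):
                yield from flatten(v, f"{path}[{i}]")
        else:
            yield f"{path} = {json.dumps(value)};"

    doc = {"items": [{"id": 1, "tags": ["a", "b"]}]}
    for line in flatten(doc):
        print(line)
    # json.items[0].id = 1;
    # json.items[0].tags[0] = "a";
    # json.items[0].tags[1] = "b";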
* vd (visidata)
* xsv (often piping into vd)
* jq (though I often forget the syntax =P)
* gron (nice for finding the json-path to use in the jq script)
* xmlstarlet (think jq for xml)
* awk (or mawk when I need even more speed)
* GNU parallel (or just xargs for simple stuff)
Miller (mlr) is now my go-to tool for CSV/TSV manipulation.
I've used it from time to time over the past year and it's been a joy to use.
I must admit I haven't tried any other tool for the purpose, so I don't know how it compares.
I know that everything looks like a nail if your only tool is a hammer, and it's fun to nail together square wheels out of plywood, but there are actually other tools out there with built-in, compiled, optimized, documented, tested, well-maintained, fully compliant JSON parsers.
These tools probably do a good job at processing CSV/TSV/DSV (I haven't tried them). However, I would love it if we could just stop using delimiter-separated value files altogether.
Why? Because the file format is an underspecified, unsafe mess. When you get this kind of file you have to specify its schema manually when reading it, because the file doesn't contain one. Also, because of that underspecification, there are many unsafe implementations that produce broken files that can't be read without manual fixing in a text editor. Let's just start using safe, well-specified file formats like Avro, Parquet or ORC.
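To make the schema point concrete, here's a minimal pandas sketch (pyarrow assumed installed, column names made up): a Parquet file carries its column names and types, so the reader never has to guess them.

    import pandas as pd

    df = pd.DataFrame({
        "id": [1, 2],
        "price": [9.99, 5.50],
        "ts": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    })
    df.to_parquet("example.parquet")   # schema (names + types) is stored in the file

    back = pd.read_parquet("example.parquet")
    print(back.dtypes)   # int64, float64, datetime64[ns] -- preserved, unlike a CSV round trip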
As a data scientist, I have had lots of issues because the data I got for a project was a CSV/TSV/DSV file. I recently spat out a rant on this topic, so if you want more details, check out https://haveagooddata.net/posts/why-you-dont-want-to-use-csv...
We can't stop, because it's a de facto standard format for exchange with spreadsheet programs. So long as that's ubiquitous, we might as well write tools to make processing them easier.
Also, I'm not sure why you called CSV unsafe. It's certainly the case that it's severely under-specified, but I don't think there's anything unsafe about it.
One example of it being unsafe that happened to me: I got a CSV file written by a program with a broken CSV writer that didn't quote string fields containing a newline (in my case, only the first half of one: a carriage return). I then read the file with a broken CSV reader that assumed the carriage return meant a new record and filled both halves of the broken line with N/As instead of throwing an error. So the data in the sink didn't match the data in the source. That's a loss of data integrity, which I would call unsafe. It doesn't happen with a file format that serializes your data safely.
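You can see the failure mode with Python's csv module in a few lines (a minimal sketch; the toy data stands in for the real file):

    import csv, io

    # One field contains an embedded carriage return -- the kind of value that broke my pipeline.
    rows = [["id", "comment"], ["1", "first half\rsecond half"], ["2", "plain"]]

    buf = io.StringIO()
    csv.writer(buf).writerows(rows)   # a compliant writer quotes the field with the CR
    text = buf.getvalue()

    # The broken reader: treat every line break as a new record and split on commas.
    naive = [line.split(",") for line in text.splitlines()]
    print(len(naive))   # 4 "records" -- the CR row has been torn in two

    # A compliant reader keeps the quoted field intact and the data round-trips.
    good = list(csv.reader(io.StringIO(text)))
    print(len(good))    # 3 records, including the header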
Due to the format being underspecified, many people roll their own unsafe CSV writer or CSV reader, thus every CSV file (where you don't completely control the source) is potentially broken.
Edit: Browsing your Github account I found that you implemented a CSV parser in Rust. I didn't know that when I wrote the above comment, so I was definitely not trying to imply that your particular CSV parser is unsafe.
What makes you think that if people manage to misimplement CSV parsers and generators they are not going to misimplement other formats? At least with CSV it’s always easy to implement some sort of heuristic that splits the rows correctly.
The only times when I had to deal with the issues you describe I was supplied with the data from a literally dying company. They just didn’t give a damn. Changing the file formats wouldn’t change anything - they would still find a way to mess it up.
Ah I see, yeah - where I come from, "unsafe" has a bit more weight to it. I'd call what you describe "silently incorrect." Which is also quite bad, to be fair!
> These tools probably do a good job at processing CSV/TSV/DSV (I haven't tried them). However, I would love it if we could just stop using delimiter-separated value files altogether.
I hear ya. I have no doubt that "we" could, as in IT professionals. Maybe even the "surrounding" science fields that provide data could make an effort. But you're out of luck when it comes to almost any other field that serves you data, in my experience.
If you can't even tell an org the details of the CSV you want (what's the "C"? UTF-8? Quoting?), it has no chance of providing you with something more complex.
It's partly our fault: The tools we provide them with suck. Excel's CSV handling is atrocious. Salesforce and similar tools seem to spit out barely consistent data dumps.
Sometimes I feel like 80% of the industry is dealing with sanitizing input.
Some useful CLI data-wrangling tools --
https://github.com/BurntSushi/xsv
https://github.com/dinedal/textql
https://github.com/n3mo/data-science
https://stedolan.github.io/jq/
https://gitlab.redox-os.org/redox-os/parallel
https://github.com/willghatch/racket-rash
Would you have any others you recommend?