Standard awk warning: it's tempting to try to use awk on csv files. You'll even get good results on simple csv files that leave you encouraged to go further. Don't.

Csv is not standardized and the quoting rules are weird (and not standardized).

If you can live with a certain amount of loss of fidelity in your output, you can get away with using awk. If you want a coarse prototype, use awk.

If you need robust, production-grade handling of csv files, use (or write) something else.

Csv files are a little bit like dates: superficially simple, with lots of corner cases. Largely for the same reason: lack of standardization.

That said, awk is awesome. It's small enough to fit in your brain, unlike Perl (maybe yours is larger than mine?). It's also pretty universally available, with few massive incompatibilities between versions, unlike shell (provided you avoid the gawk-specific features). I love it.



I have used awk a lot on csv files. I basically agree with your comments; for me the use case is preliminary exploration, not really much more. There is also the issue of one-liners not really lending themselves to good coding practice generally, but it sure is great for diving into a file quickly.

What I came here to say is that when I have a csv file I'm looking at with awk, the first thing I do is a `wc -l`, then get NF (number of fields) for the first row of the csv (here e.g. 80), followed by `awk -F, 'NF==80' | wc -l`. If the numbers don't match, then I know it's not parsing properly.

The most common issue of course is commas in quoted strings. I have a small script that removes these, so I can still use awk easily. Anything more complex, like newlines in quoted strings, and maybe it's time to question whether awk is still worth it.
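Something in the spirit of that script might look like this (a rough Python sketch, not the actual script; it's deliberately lossy, squashing commas and newlines inside quoted fields into spaces so that plain `awk -F,` sees one field per column and one record per line):

    # strip_commas.py - hypothetical helper: squash commas/newlines inside
    # quoted fields so the output is safe for naive comma splitting
    import csv, sys
    out = csv.writer(sys.stdout, lineterminator="\n")
    for row in csv.reader(sys.stdin):
        out.writerow([f.replace(",", " ").replace("\n", " ") for f in row])

Used as `python3 strip_commas.py < raw.csv | awk -F, '{print $3}'`, with the caveat that fields containing double quotes will still come out quoted.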


You can parse it with awk though.

https://github.com/Nomarian/Awk-Batteries/blob/master/Units/...

gawk -f $AWK/Units/csv.awk -e '{print NF}'


    $ ruby -rcsv -ne 'puts $_.parse_csv[1]' education.csv
    Enrolment in primary, secondary and tertiary education levels
    Total, all countries or areas
AWK is a nice little language. Perl is all of that, plus shell, sed, and grep. I am not proficient in Perl, but I know some Ruby:

    $ ruby -rEnglish -ne 'puts $_ if $NR <= 5' education.csv 
    $ ruby -rEnglish -aF, -ne 'puts $F[1] if $NR <= 5' education.csv 
    $ ruby -rEnglish -ne 'END { puts $NR }' education.csv
    $ ruby -rEnglish -ne 'puts $_ if $NR % 500 == 0' education.csv
The English module (in both Ruby and Perl) provides AWK-style variable names:

    $ perl -MEnglish -ne 'print if $NR < 5' education.csv
https://ruby-doc.org/stdlib-2.3.0/libdoc/English/rdoc/Englis...


Ah Perl, the Swiss Army Chainsaw.

My first big kid job was taking over ownership of a Perl-based Oracle-backed data warehouse. Mostly it was SQL queries wrapped up with a Perl script executed by cron that output excel workbooks or csvs and emailed or dumped to a file server.

Most of the pivots and reporting tables were actually generated in Perl because it was just nicer to work with than Excel.

It was wonderful, I learned so much, mostly how to love Perl and CPAN.

We merged our telco billing system with our new parent company with some Perl, cron, a couple SQL queries and an FTP server.

I have said it for years and I will continue to repeat it: if you could snap your fingers and delete all the Perl code in the world, your lights would turn off.


There is https://tools.ietf.org/html/rfc4180

Most CSV files do not follow this standard of course. But you could normalize all CSV files to RFC4180 (or any other consistent format) as the first step of your processing pipeline.
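A rough sketch of such a normalization step in Python (the csv module emits RFC 4180-style quoting with CRLF line endings by default; the Sniffer's guess at the incoming dialect can be wrong, so treat this as a starting point rather than a robust converter):

    import csv, sys
    # Guess the producer's dialect (delimiter, quoting) from a sample,
    # then re-emit everything as RFC 4180-style CSV on stdout.
    data = sys.stdin.read()
    dialect = csv.Sniffer().sniff(data[:4096])
    out = csv.writer(sys.stdout, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
    for row in csv.reader(data.splitlines(True), dialect):
        out.writerow(row)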


The issue with encountering CSV in the wild is that everybody who appreciates standards and interoperability ditched it a long time ago. If you are consuming CSV files in the wild, you can be sure that whoever is supplying them to you is using horrible tools to create them and will be unwilling or unable to address issues you find in them.


> The issue with encountering CSV in the wild is that everybody who appreciates standards and interoperability ditched it a long time ago.

I worked on a team that used CSV somewhat extensively. For the data we generated, it was RFC compliant. It's pretty trivial to generate RFC-compliant CSVs; most languages have a library for it — ours was in the standard library.

We also had a ("terrible", as we joked) idea to create a subset of CSV that would contain typing information in a required header row. (We never did it, and it is a bad idea.)

> If you are consuming CSV files in the wild, you can be sure that whoever is supplying them to you is using horrible tools to create them and will be unwilling or unable to address issues you find in them.

…but this is absolutely true. We also consumed CSVs from external sources and contractors, and this was an absolute drain on our productivity. I've also worked with engineers of this caliber, and changing CSV wouldn't change the terrible output. I've seen folks approach email and HTTP with a cavalier "oh, it's a trivial text format, I don't need a library!" attitude, and inevitably get it wrong. Pointing out the flaws in their implementation, and that a library would fulfill their use case just fine, is just met with more hacks (not fixes) to try to further munge the output into shape. It is decidedly not software engineering. I've seen this even with JSON.

But yeah, even with RFC standard CSV, you shouldn't be parsing it with awk. It is the wrong tool.


I generally went for turning them into either tab-separated files or used the ASCII codes for record separator and its brethren depending on the job. I never wanted to touch CSV again after parsing it once.


But if you’re writing code to normalize before sending to awk, why not just process in the normalization program instead of using awk’s bizarre syntax?


Depends on what you're using to normalize I suppose. Maybe that's Awk too! Maybe Awk is easier for exploration when the data is already clean but you have to write some parsing layer in Java or something and that's not conducive to one-liner exploration.


> Csv is not standardized and the quoting rules are weird (and not standardized)

There is almost always some normalisation ("scrubbing") needed to prep a CSV. But CSV is viable and awk can rip through massive amounts of data. It is a brilliant and powerful tool.

> It's small enough to fit in your brain, unlike Perl

Perl is also brilliant and powerful. Setting aside the bigotry of people who dislike sigils and using braces for scope, many people who fail to learn Perl well have not tried to use Perl-OOP objects as primitives. Once you do this, Perl's versatility and speed are hard to beat.


I suffered with Perl until I stopped assuming I could learn it by reading code, and read the Camel book. At about page 85 it says something approximating: "Everything in Perl happens in a context (scalar or list) and until you realize that, you will be miserable." It still didn't come easy: I've realized we all have different minds. My mind doesn't find Perl natural, but loves Ruby. YMMV.


Tangent, but another interesting problem with date parsing besides lack of standardization is that it's ridiculously context-sensitive.

For example, if I schedule an event for Oct 1-Jan 1 right now, the most likely (least surprising) parse would be Oct 1, 2021 through January 1, 2022. Which is surprising because parsing Oct 1 depends on the current date, because the parsing of Jan 1 depends on the parse of Oct 1, and because both of these depend on the fact that you're scheduling (in all likelihood) a future event, rather than describing a past one. So whatever date parser you use (no matter how good it is at handling different notations) will return garbage if it doesn't take all these into account.


Heh, interesting, I actually assumed it would be oct 2020 to Jan 2021, and that you were describing an event already in progress.


Sure :-) even then, you have even more fun... imagine if the next item was Sep 1-Dec 1 and the list was known to be sorted! The parser would need to be stateful to be able to disambiguate!!

Even worse (well, maybe similar, but more surprising): imagine if on Feb 28, you parse "Feb 29-Mar 1"... both of those dates could land in an entirely different year than "Feb 28-Mar 1" would, depending on whether the current year is a leap year...

I dare say I have not yet seen a single parser in my life that handles such issues. In fact I don't think I've seen a parser that can parse a date interval, or that can even do "parse this string assuming it is after that date", or anything like that.

And all of these problems are before we even consider time zones, leap seconds, daylight savings, syntactic ambiguities, etc... not just how they affect individual dates, but also ordered dates (/intervals) like above...


I would say that you are mixing up "parsing" and "calendaring (or something of the sort)". As far as I understand parsing is syntactic analysis, i.e. going from a linear structure to a more complex structure (usually a tree); it should not add to the tree anything that was not in the linear structure. It shouldn't consider a semantic context (such as the current date) to produce an ast.


By parsing I mean it in the usual sense for a date... strptime, Date.Parse, etc... i.e. turning a string to a date (or multiple dates). You can call it something else if you'd like.


> I dare say I have not yet seen a single parser in my life that handles such issues. In fact I don't think I've seen a parser that can parse a date interval, or that can even do "parse this string assuming it is after that date", or anything like that.

You can google "nlp date extraction" maybe? I've used libraries in the past that do this.

Note: they're far from being perfect. I ended up not using any, as each had their weird corner cases.

Here's an old one: http://natty.joestelmach.com/try.jsp#

I tried with: "first week of december to end of january"

And it gave me: Tue Dec 01 15:31:30 UTC 2020 Fri Jan 31 15:31:30 UTC 2020

Edit: this one seems more polished: http://nlp.stanford.edu:8080/sutime/process


Cool, thanks for the tip!


Parsing anything that was originally developed by humans writing it on paper (or clay tablet, or whatever) is a nightmare. Natural means chaotic.

If your CSV file contains any field entered by humans, AWK isn't going to be powerful enough to parse it at scale. Someone somewhere is going to have the name 'Mbat"a, Sho,dlo' in some bizarre-ass romanization (and this assumes you're not accepting Unicode, which is a whole other can of worms that AWK is not prepared to deal with) that breaks your parser.
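To make the point concrete, here's a tiny Python illustration (the record is made up): naive comma splitting — which is effectively what `awk -F,` does — mangles such a name, while a real CSV parser handles it:

    import csv, io
    line = '1,"Mbat""a, Sho,dlo",42\r\n'        # one record, three fields
    print(line.split(","))                       # 5 pieces, quotes left in
    print(next(csv.reader(io.StringIO(line))))   # ['1', 'Mbat"a, Sho,dlo', '42']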


I'm saving that as a test case name but making some small adjustments.

'Mbaät\"a, Sho,dló'

"a" followed by "ä" because some suggest encoding umlauts as double characters. When you decode that does it go first or second?

Answer: use Unicode.

Throw an escape character in there, before the quote character just to make it interesting.

This is a good time for everyone to review: https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...


Yeah, here's what a robust solution for parsing csv with awk looks like: https://stackoverflow.com/questions/45420535/whats-the-most-...


I was looking for it so thanks a lot!


I would love to have a command-line tool that reads CSV and has a ton of features to cover different quirks and errors, which can output cleaner formats that I can pipe into other command-line tools.

csvkit [0] might be that tool; I discovered it after my last painful encounter with CSV files and haven't used it in anger yet. Among other things, it translates CSV to JSON, so you can compose it with jq.

[0] https://csvkit.readthedocs.io/en/latest/index.html


I love these TSV utilities: https://github.com/eBay/tsv-utils Granted they're for "T"sv files, not "C"sv, but there's a handy `csv2tsv` utility included.

There's also `xsv`, from the author of `ripgrep`: https://github.com/BurntSushi/xsv



Seconding the recommendation for xsv; I've used it extensively and it works great.


Use miller and never look back.

https://miller.readthedocs.io/en/latest/10min.html

It's so much faster than csvkit.


At my last employer, I built a filter program, creatively called CSVTools[0], to do something like this. One piece of the project parses CSVs and replaces the commas/newlines (in an escaping- and multiline-aware manner, of course) with ASCII record/unit separator characters[1] (0x1E and 0x1F); the other piece converts that format back into well-formed CSV files. I usually used this with GNU awk, and reconfigured RS[2] and FS[3] appropriately. Or you can just set the input separators (IRS/IFS) and produce plaintext output from AWK.

[0]: https://bitbucket.org/rbr/csvtools

[1]: https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text

[2]: https://www.gnu.org/software/gawk/manual/html_node/awk-split...

[3]: https://www.gnu.org/software/gawk/manual/html_node/Field-Sep...
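For anyone who doesn't want to pull in the linked tool, the forward direction is small enough to sketch in a few lines of Python (a rough approximation of the idea, not CSVTools itself): parse the CSV properly once, then emit fields joined by the ASCII unit separator and records ended by the record separator.

    import csv, sys
    US, RS = "\x1f", "\x1e"            # ASCII unit / record separators
    for row in csv.reader(sys.stdin):  # csv.reader resolves quoting and embedded newlines
        sys.stdout.write(US.join(row) + RS + "\n")

On the awk side you then set FS to "\x1f" and RS to "\x1e\n" (multi-character RS is a gawk extension), and every field and record boundary is a single, unambiguous byte.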


Good idea! Looks similar to something I wrote called csvquote https://github.com/dbro/csvquote , which enables awk and other command line text tools to work with CSV data that contains embedded commas and newlines.


csvtool is also nice.[0][1] csvkit is very flexible and can certainly be used in anger, but is a bit finicky; you almost always want to use the -I (--no-inference) option. Additionally, I wrote a tiny Perl script for quick awk-like oneliners.[2]

[0] https://github.com/Chris00/ocaml-csv [1] https://colin.maudry.com/csvtool-manual-page/ [2] https://github.com/gpvos/csved/blob/master/csved


"q" is the tool you're looking for http://harelba.github.io/q/ . Impossible to Google for, indispensable for CSV manipulation


> As of version 2.0.9, there's no need for any external dependency. Python itself (3.7), and any needed libraries are self-contained inside the installation, isolated from the rest of your system.

Oh, sh*t. I will look for something else.


Is your concern that you don't want to use a library that depends on python?


Nope. I don't want to use the tool which depends on an unsupported copy of Python.


I parse a lot of CSV files and few are well-formed.

People are careless when writing CSV files because they think it is simple: just put a comma between columns, right?


I just fixed a bug at work, parsing a rather opinionated csv file of products for a web shop. It had mostly good quoting, headers for columns - and uses semicolon for field separation (so, not technically csv, but..).

Funny thing was, a lot of the product names contained an ampersand (no problem there). But one product had an html entity encoded ampersand (&amp;). I have no idea how that semicolon escaped, eh, escaping - but that one line suddenly had most of the columns off by one...

I can see how the entity got into the db (probably errant cutnpaste) - but I wonder at the csv writer that gleefully copied the extra separator to the csv export...


Everything you just described seems like reasons to not use CSV files in general, rather than not use AWK on CSV.


I'm disappointed we didn't use ASCII-separated values.

https://news.ycombinator.com/item?id=7474600


And apparently an even better solution is just putting the size in the format itself. No need to parse when you can just seek.

https://blog.kowalczyk.info/article/fc9203f7c72a4532b1ae51d0...


Did it multiple times and it works great, but companies you send raw data to often get huffy. So, back it goes into CSV or some poorly defined fixed width text file.


I don't see what issue they solve. You still need escaping, because eventually you will want to put an ascii-separated value into your spreadsheet.


CSV is the default interchange in many fields (e.g. a lot of machine learning or academia). You don't get to choose what formats your colleagues send to you.


Segue to something that I've been fighting with for years: Do you have a suggestion for working with CSVs in the CLI?

Something that will show a table, with a single pixel border between cells, that allows searching for a value, copying a cell?


It depends what I'm doing, but I typically use one of the following (starting with the most frequent):

1. IPython with Pandas if I expect to do any in-depth exploration/manipulation of the data.

2. A short Python script like `python -c 'import csv, sys; r = csv.reader(sys.stdin); ...' <data.csv` if it needs to run on someone else's machine.

3. https://github.com/BurntSushi/xsv if I want to quickly munge some data or extract a field.

4. Gnumeric if I want a GUI.

5. https://github.com/andmarti1424/sc-im if I want a TUI. This is closest to what you were asking for.


Just a heads-up, if you're interested. Only SC-IM was fit for the non-database-format files I need to work with. But it does not have a search function! So I'm writing an app in Python for browsing spreadsheets: https://github.com/dotancohen/osheet


  > 5. https://github.com/andmarti1424/sc-im if I want
  > a TUI. This is closest to what you were asking for.
Thank you! Yes, this seems to be almost perfect. I cannot believe that there is no "arbitrary string search" feature, but I can grep in another terminal window at least.

Thank you.


Sqlite3 can import and export CSV via the CLI application. Both Ruby and Python provide CSV parsers in their standard library.


Thank you. I've used sqlite, but being able to query the data is not the same as being able to browse that data.

For instance, consider a project that I'm currently working on. Google Docs sheet exported to CSV. The file documents dozens of servers used as load balancers, websites, and database servers. The file has gone from MS Excel to Google Docs, and been maintained by four admins over 12 years. There are no columns per se, rather "sections of rows", each of which isn't even consistent in itself. One section is double-columned, with the top row listing public IP address, login name, etc. and the second row listing LAN IP address, login password, etc.

sqlite is useless on these files and frankly I've seen these things more often than well-formatted spreadsheets at every company I've ever worked with. Nobody but software developers treat spreadsheet columns as database fields.


I know, there is no easy fix for that.


Just a heads-up, if you're interested. I'm writing an app in Python for browsing spreadsheets:

https://github.com/dotancohen/osheet


Doesn't it convert everything to strings including numbers? Or can you specify types?


You have to provide a schema anyway, so you can treat specific columns as integers.
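A minimal Python sketch of that approach (table and file names are made up; the point is just that a typed schema plus explicit conversion keeps numeric columns numeric):

    import csv, sqlite3
    con = sqlite3.connect("data.db")
    con.execute("CREATE TABLE hosts (name TEXT, port INTEGER)")   # explicit types
    with open("hosts.csv", newline="") as f:
        rows = ((name, int(port)) for name, port in csv.reader(f))
        con.executemany("INSERT INTO hosts VALUES (?, ?)", rows)
    con.commit()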


sqlite won't fail on bad types, but you can do maths (e.g. sum, etc.) on numeric columns.


I pretty much exclusively rely on:

  csvlook -I | less -S
  or
  csvformat -T | less -S
(-I to avoid mixed data-types being displayed weird, but I don't consistently use that)

from csvkit - https://csvkit.readthedocs.io/en/1.0.2/scripts/csvlook.html

Tons of useful CSV tools in there - csvcut, csvjoin, csvgrep.

  csvformat -T 
Makes most CSV much more manageable for a shell pipeline, as tabs are easy to deal with and unlikely to be in the actual data


Just a reply to myself, for anybody else looking for an application to browse spreadsheets on the CLI.

There are a few CLI applications for working with data stored in a CSV file, so long as the data is formed as a database: each row a data entry, each column a database field. The single CLI app that runs on Linux for browsing a spreadsheet that is not formed as a database, SC-IM, does not support searching for arbitrary text!

Therefore I am writing osheet: A Linux CLI app for browsing a spreadsheet. It currently supports only XLS files, because that is what I need. It currently crashes a lot. It currently has no features other than searching. It won't even scroll yet to display all the rows and columns. I just banged it out last night when I should have been sleeping. But I'm working on it, you are invited to test it, report bugs, and pull requests are of course welcome. It's written in Python and Curses.

https://github.com/dotancohen/osheet


If I'm deep enough into a csv file that I'm worried about everything I noted above, I usually just load the data into Postgres and trust that their devs got it right. Pretty presentation is free at that point :-)


Postgres or even sqlite are great if the CSV is formatted like a database, with each column a distinct field. Alas, most spreadsheets created by non-software-devs are not so strict.

I need to browse the cells in a something like curses.


There is https://github.com/BurntSushi/xsv but it doesn't seem to print with borders between cells.


That's good for data formatted like a database. I need something to just browse the cells, so I'm writing an app in Python for browsing spreadsheets:

https://github.com/dotancohen/osheet


The best description of dates and CSV files I've heard by far. Bravo.

"Csv files are a little bit like like dates: superficially simple, with lots of corner cases. Largely for the same reason: lack of standardization."


I just attempted to run some government data through Python using the csv module. It worked like a charm, until it told me that a field had exceeded the maximum allowable length. A close look showed that a field began but did not end with a quotation mark. Using the csv.QUOTE_NONE flag resolved that, but did not remedy such quirks as short or long records.

CSV files, depending on who generates them, are a bit like dates if the status of a year as leap or not depended on whether the date of Easter were 0 mod 4.
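For reference, a rough sketch of those two knobs in Python's csv module (file name made up): raising the per-field size cap, and turning off quote handling entirely so an unbalanced quote can't swallow the rest of the file.

    import csv
    csv.field_size_limit(10**8)            # lift the ~128 KB per-field cap
    with open("gov_data.csv", newline="") as f:
        # QUOTE_NONE treats quote characters as ordinary data
        rows = list(csv.reader(f, quoting=csv.QUOTE_NONE))
    print(len(rows), "records")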


My gripe with Python's csv.reader is that it has no skip_lines parameter. Ill-formatted lines at the top/bottom of a CSV file are quite common.
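The usual workaround (a sketch; the 3 is arbitrary) leans on the fact that csv.reader accepts any iterable of lines, so you can slice the file object before handing it over:

    import csv, itertools
    with open("data.csv", newline="") as f:
        # skip a known number of junk lines at the top;
        # trailing junk still needs its own handling
        for row in csv.reader(itertools.islice(f, 3, None)):
            print(row)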


What format of data file is best for working with Awk? I assume some kind of tab-delimited file? With a reliable ingestion layer you can convert all your CSV-ish files to something ergonomic for Awk scripting.

JSON data files seem to be a similar deal. Sometimes they are actually properly formed JSON arrays and sometimes they are individual line-delimited objects.

Data files are a mess, is there a command line tool that is well suited to taking in many inconsistent formats and outputting something ergonomic?


CSV standards are whatever the application implemented. We'd pull a "CSV" out of one system, then have to massage it before uploading to system A and then make other changes so it would be recognized by system B.

I'm torn. On one hand, it's easy. It lets people work with the data, albeit using error-prone tools like NotePad++ and Excel.

On the other hand, it's just flat out lazy to use.


It depends what you are doing. In my experience, when you write a script to do something with a CSV, it's been created by a specific piece of software and you only feed that csv to the script. I doubt anyone is writing anything generic in awk for parsing a multitude of different csvs.


CSV to TSV via a robust program, use AWK, and convert back if needed. Then you only have to remember the \t \n \r escapes, not the whole quoting shenanigans.
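The forward half of that pipeline is short enough to sketch in Python (an illustration of the escaping scheme described above, not any particular tool): backslashes, tabs, and newlines inside fields get backslash-escaped, so the output is strictly one record per line and safe for `awk -F'\t'`.

    import csv, sys
    def esc(field):   # backslash-escape \, tab, LF, CR
        return (field.replace("\\", "\\\\").replace("\t", "\\t")
                     .replace("\n", "\\n").replace("\r", "\\r"))
    for row in csv.reader(sys.stdin):
        print("\t".join(esc(f) for f in row))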


Even better, just use TSV. Tabs are a better separator for real-world data; you can strip tabs from the data unless you need to store verbatim strings.


q is a great tool for CSV manipulation http://harelba.github.io/q/ . Awk is not



