
Stop this horrible genre now! As discussed in the recent video programming post (https://news.ycombinator.com/item?id=13259686), this helps neither the beginner, because it is unnecessarily overwhelming, nor the expert, who has already read https://tools.ietf.org/html/rfc4180 and is aware of these problems with bad CSV parsers/emitters. In the intermediate case, it could be useful for someone maintaining a CSV parser or emitter who is not aware of a few of these points and wants 100% compatibility with all CSV parsers/emitters in existence, but the list style is grammatically confusing and offers no counterexamples or suggestions for improvement.

With that said, if you are interested in using CSV for encoding and decoding in your own environment, I suggest a new file standard called μTSV.

1. All μTSV files are UTF-8.

2. All values are delimited by "\t", and all lines must end with "\n" (including the last one).

3. If you want to use a "\t" or "\n" character in your value, tough luck---use JSON.
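For concreteness, here is a minimal sketch of those three rules in Python (the function names are invented; this is illustrative, not a reference implementation):

    # Minimal sketch of the three μTSV rules; dump_mutsv and load_mutsv
    # are hypothetical names, not an existing library.
    def dump_mutsv(rows):
        # Encode rows (lists of str) as μTSV bytes.
        out = []
        for row in rows:
            for value in row:
                if "\t" in value or "\n" in value:
                    raise ValueError("tab/newline in value: tough luck, use JSON")
            out.append("\t".join(row) + "\n")  # rule 2: every line ends with \n
        return "".join(out).encode("utf-8")    # rule 1: always UTF-8

    def load_mutsv(data):
        # Decode μTSV bytes back into rows.
        text = data.decode("utf-8")
        if text and not text.endswith("\n"):
            raise ValueError("missing trailing newline")
        return [line.split("\t") for line in text.split("\n")[:-1]]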



I mostly just thought it was fun to write. Put my pain down on a page, get a little bit of catharsis. I actually had written out some "counterexamples or suggestions" but decided to keep it terse and with the general style of the other similar posts.

I was not aware of the scorn for the style, I just knew I enjoy reading them.

As opposed to your μTSV, may I suggest DSV? The Art of Unix Programming makes a strong case for how it's superior to CSV. [1]

1. http://www.catb.org/esr/writings/taoup/html/ch05s02.html#id2...


Heh, by coincidence, I had recently written two posts about DSV, and had mentioned the same TAOUP DSV link you gave above:

Processing DSV data (Delimiter-Separated Values) with Python:

https://jugad2.blogspot.in/2016/11/processing-dsv-data-delim...

The example program in the post lets you specify the delimiter for the DSV data it processes with either "-c delimiter_char" or "-n delimiter_ASCII_code".
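The gist of that option handling is something like the following (a hedged sketch, not the actual code from the post):

    # Sketch of the two delimiter options described above; not the
    # actual program from the linked post.
    import sys

    def get_delimiter(argv):
        if len(argv) >= 2 and argv[0] == "-c":
            return argv[1]            # e.g. -c '|'
        if len(argv) >= 2 and argv[0] == "-n":
            return chr(int(argv[1]))  # e.g. -n 9 for a tab
        raise SystemExit("usage: dsv.py (-c char | -n ascii_code) < data")

    delim = get_delimiter(sys.argv[1:])
    for line in sys.stdin:
        print(line.rstrip("\n").split(delim))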

and

[xtopdf] Publish DSV data (Delimiter-Separated Values) to PDF:

https://jugad2.blogspot.in/2016/12/xtopdf-publish-dsv-data-d...


You're both wrong. EDN is the best.

https://github.com/edn-format/edn


>You're both wrong.

Bah. How am I wrong when I haven't even claimed DSV to be the best? Just quoting that section in ESR's book does not imply I endorse or agree with it. Logic ...


Plus, just providing that EDN link is hardly proof that it is the best ... And further, "best" in many cases is subjective. If you want to claim EDN is the best, lay out an objective proof.


I get that the parent response about EDN isn't particularly enticing, but you really should give it a look.


I do intend to. Thank you for the comment though. I starred EDN on GitHub just now, before replying to groovy2shoes, i.e., I took an initial brief look at it. Saw a commit by richhickey (Clojure creator). Since I'm into data / file formats, I will definitely check out EDN.


Sorry, my intent (besides bringing up EDN) was to be playful. Tone is often difficult to communicate in writing. I would have stuck in a `:p` if I hadn't been on mobile, but it slipped my mind. My apologies; I did not mean to insult or to otherwise be combative or critical.

I don't have objective proof that it's the best. It's a mix of personal preference and empirical "evidence" based on past success with both EDN and its granddaddy s-expressions for data exchange and serialization.

As for why I personally like EDN so much:

1. It offers the malleability, flexibility, and extensibility of XML, while being

2. even more concise than JSON,

3. straightforward to parse, and

4. precisely defined in a rigorous and language-independent specification.

Traits 1-3 it shares with the traditional s-expressions of the Lisp family, but in contrast to s-expressions it's specifically designed as a data exchange format rather than as a lexical/syntactic format for programming languages. The reason this matters is that the traditional Lisp reader is itself extensible, and extensions can't be guaranteed to carry over from dialect to dialect or even from implementation to implementation. Many Lisp readers go so far as to allow the execution of arbitrary code at read time, which is desirable for source code as it enables metaprogramming, but not so desirable when you're parsing data from an arbitrary source, due to security concerns.

While EDN does have its roots in Clojure's reader, EDN is not exactly Clojure's external representation. Rather, it's a subset of Clojure's surface syntax in much the same way that JSON is a subset of JavaScript's. Like JSON, EDN works great outside of its "native language" (in fact, I've never used EDN from Clojure itself; I've only used it from Lua, Scheme, Python, C, C++, C#, Java, and JavaScript (not necessarily in that order)).
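To illustrate that last point, reading EDN from Python can look like this (using the third-party edn_format library that comes up later in this thread; the exact Python types returned depend on the library version):

    # Reading a small EDN document from Python with the third-party
    # edn_format library (https://github.com/swaroopch/edn_format).
    import edn_format

    source = '{:name "example" :ids [1 2 3] :nested {:ok? true}}'
    data = edn_format.loads(source)
    print(data)  # an immutable map keyed by EDN keywords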


>Sorry, my intent (beside bringing up EDN) was to be playful. Tone is often difficult to communicate in writing.

I understand, and agree it can be difficult. Thanks for clearing that up, and no worries at all. Interesting info, BTW, about EDN. Will reply again in more detail later today with any comments/questions.


I'll need some time to check out EDN, so will message you with any comments later, since this thread might expire. Thanks again for letting me know about it.


I'm also interested to hear your perspective.


Cool. Will try to comment before the thread expires, then (not sure what the period is), else will post the points here as a new thread under my id (vram22).


Which Python library did you use for EDN (if any)? I had quickly googled for that, and saw a few, including one by Swaroop C H. Got any preferences? And same question for C.


Replying to myself as a way to quickly locate this lib in context:

https://github.com/swaroopch/edn_format


Sorry, I just saw this question of yours!

I was an early adopter of EDN, and wound up rolling my own parser in Python. EDN is simple enough that you can realistically roll a parser for it, complete with tests, in one or two working days. I find this to be a huge advantage—EDN doesn't really have any corner cases, so implementation is very straightforward.
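To give a feel for how few corner cases there are, here is a toy reader for a tiny EDN subset (integers, naively unescaped strings, keywords kept as plain strings, vectors, maps); a real parser also needs chars, sets, floats, nil/true/false, comments, and tagged elements:

    # Toy reader for a tiny EDN subset; commas count as whitespace.
    import re

    TOKEN = re.compile(r'[\s,]*("(?:\\.|[^"\\])*"|[\[\]{}]|[^\s,\[\]{}"]+)')

    def parse(tokens):
        tok = tokens.pop(0)
        if tok == '[':
            vec = []
            while tokens[0] != ']':
                vec.append(parse(tokens))
            tokens.pop(0)
            return vec
        if tok == '{':
            items = []
            while tokens[0] != '}':
                items.append(parse(tokens))
            tokens.pop(0)
            return dict(zip(items[::2], items[1::2]))
        if tok.startswith('"'):
            return tok[1:-1]   # naive: leaves escape sequences alone
        if tok.startswith(':'):
            return tok         # keywords kept as plain strings here
        return int(tok)        # everything else assumed to be an integer

    print(parse(TOKEN.findall('{:a 1 :b [2 3 "four"]}')))
    # {':a': 1, ':b': [2, 3, 'four']}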


No problem, and thanks for the answer.


I tried out https://github.com/swaroopch/edn_format some (in Python). Will read up more on EDN, try things a bit more, and then reply here with any comments or impressions.


Many of these problems could be avoided if we used the ASCII character codes for the unit separator (31) and record separator (30). This would avoid any problems with tabs or line feeds, as well as better match the intended semantics of ASCII/UTF-8.
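For example, a quick Python sketch (US is 0x1F, RS is 0x1E):

    # With the ASCII unit (0x1F) and record (0x1E) separators, tabs,
    # commas, and newlines in values need no escaping at all.
    US, RS = "\x1f", "\x1e"

    def encode(rows):
        return RS.join(US.join(row) for row in rows) + RS

    def decode(blob):
        return [rec.split(US) for rec in blob.split(RS) if rec]

    rows = [["a", "b,with,commas"], ["line\nbreak", "tab\there"]]
    assert decode(encode(rows)) == rows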


That's suggested often, and the obvious rebuttal is that CSV is intended to be easily edited and relatively human-friendly. The unit/record characters are not on most (all?) keyboards, so editing the file in a text editor becomes pretty much impossible.


As a counterpoint, nobody seems to be editing CSV files by hand any more. It appears at this point to be an interchange format between different spreadsheets, row-oriented databases, etc.


I promise they're still edited by hand all the time, usually as a way of supplying data to a program in cases where a database is just too much. CSV is still a very common base for data-driven unit tests.


If you're going to replace CSV, then why not use the ASCII record and field delimiter characters and avoid using characters like tab and newline, which are in-band and actually useful?

And if you're using tabs as field delimiters then why do you need quotation marks in the mix?


In practice, I made heavy use of this format at my last company. I was dealing with lots and lots of Google Sheets where I knew there wouldn't be any tabs or returns in the columns, and my code was all ad-hoc stuff that would never see production, so I'd just read in files, line by line, split on "\t", and bam, instant parser.
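In Python the whole "parser" really is about three lines, assuming (as above) that no field can ever contain a tab or a newline:

    # The ad-hoc approach described above; only safe when no field can
    # ever contain a tab or a newline.
    def read_sheet(path):
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n").split("\t") for line in f]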


Yeah, the genre was good for non-standardized things like names and (to a lesser extent) time, but file formats have specs and reference implementations. Refer to those.


Good file formats have specifications and reference implementations.

By this standard, CSV is a terrible format.


ObNitPick: "Standard" is a bit confusing in this context. I suggest using "measure", "criterion", or something similar.


PSV (https://github.com/jgis/psv-spec) allows "\t" and "\n" in values.


Using the "μ" in a header (e.g. Content-Type) is at least historically compliant as it exists in ISO-8859-1...but doing so needlessly would be insane.


Rather like the space in "Program Files", the "μ" would serve the function of making sure that the caller has at least some working UTF-8.


AFAIK using the UTF-8 encoding of "μ" would not be standard compliant.


A much easier solution is to develop a standard for a meta-field that describes the CSV features used. Example:

    #!CSV: escape='\\', quote='"', quotealg='c-like', separator='\t', eof='\n\r', comment='#', encoding='utf-8'
    foo     bar     baz

Such a standard (let's call it mCSV) can be incorporated into existing solutions. Moreover, existing files can be converted between a proprietary flavor of CSV and metainized CSV.
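A hedged sketch of consuming that meta-line in Python (the "#!CSV:" header and its keys are this proposal's invention, not an existing standard):

    # Reads the hypothetical mCSV meta-line, then hands the remaining
    # text to Python's stdlib csv module.
    import csv, io

    DEFAULTS = {"separator": ",", "quote": '"'}

    def read_mcsv(text):
        lines = text.splitlines(keepends=True)
        opts = dict(DEFAULTS)
        if lines and lines[0].startswith("#!CSV:"):
            for part in lines[0][len("#!CSV:"):].split(","):
                key, _, val = part.strip().partition("=")
                # turn escapes like '\t' in the meta-line into real chars
                opts[key] = val.strip("'").encode().decode("unicode_escape")
            lines = lines[1:]
        return list(csv.reader(io.StringIO("".join(lines)),
                               delimiter=opts["separator"],
                               quotechar=opts["quote"]))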


Self-describing data is generally a bad idea. If you can't get programs to agree on the original document format, what makes you think you'll be able to get them to agree on this meta format?

EDIT: The right solution is the simple one. Follow RFC4180 exactly. Reject everything that doesn't match it.
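With Python's stdlib, that stance is roughly the following (strict=True makes the reader raise csv.Error on bad quoting instead of guessing, though it won't catch every deviation from the RFC):

    # Approximating "parse RFC 4180, reject everything else" with the
    # stdlib csv module; strict=True raises csv.Error on bad quoting.
    import csv, io

    def parse_strict(text):
        return list(csv.reader(io.StringIO(text), strict=True, doublequote=True))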


CSV is a format for information interchange. The general rule for interchange formats is: be strict in what you emit, be liberal in what you accept.

So, yes, follow RFC4180 strictly in your own code, but be able to read CSV files produced by others too. A single line of metainformation can greatly reduce confusion and is easy to throw out when it is unnecessary.


This rule sounds nice, but you end up with a bunch of malformed documents out there, because the people emitting them never get feedback about what they are doing wrong.

I think I've said this in another thread, but I think rejecting malformed documents even though you could "figure out what they meant" is the equivalent of not telling people when they have food on their face.

I think the right pattern for something like this is to complain loudly when you are given a malformed document, then say, "If you really want me to parse this malformed document, use --force and I'll try to make some guesses, but really you should just ask for a well-formed document."
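Sketched as a command-line tool (with strict CSV parsing standing in for "well-formed"), the pattern looks something like:

    # "Complain loudly, offer --force": strict parse first, lenient
    # best-effort parse only on explicit request.
    import argparse, csv, io, sys

    ap = argparse.ArgumentParser()
    ap.add_argument("--force", action="store_true",
                    help="try to guess at malformed input")
    args = ap.parse_args()
    text = sys.stdin.read()
    try:
        rows = list(csv.reader(io.StringIO(text), strict=True))
    except csv.Error as err:
        if not args.force:
            sys.exit("malformed CSV (%s); ask for a well-formed document, "
                     "or rerun with --force and I'll make some guesses" % err)
        rows = list(csv.reader(io.StringIO(text)))  # lenient second pass
    print(rows)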


Yep, we have a bunch of incompatible formats and tons of legacy documents and systems, but I'm 100% sure that I'm not the person to blame for that.


Well, I'm saying the attitude of "be liberal in what you accept" is to blame for the fact that malformed documents are being emitted in the first place. A huge number of people will just hack away at something until it works rather than implementing their emitters to spec, so accepting malformed documents is training people to emit malformed documents.

That is unrelated to well-formed documents in incompatible formats, though.



