
Stop this horrible genre now! As discussed in the recent video programming post (https://news.ycombinator.com/item?id=13259686), this helps neither the beginner, because it is unnecessarily overwhelming, nor the expert, who has already read https://tools.ietf.org/html/rfc4180 and is aware of these problems with bad CSV parsers/emitters. In the intermediate case, it could be useful for someone maintaining a CSV parser or emitter who is not aware of a few of these points and wants 100% compatibility with all CSV parsers/emitters in existence, but the list style is grammatically confusing and offers no counterexamples or suggestions for improvement.

With that said, if you are interested in using CSV for encoding and decoding in your own environment, I suggest a new file standard called μTSV.

1. All μTSV files are UTF-8.

2. All values are delimited by "\t", and all lines must end with "\n" (including the last one).

3. If you want to use a "\t" or "\n" character in your value, tough luck---use JSON.
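For concreteness, here is a minimal sketch of those three rules in Python (the function names are invented; this is illustrative, not a reference implementation):

    # Minimal sketch of the three μTSV rules; dump_mutsv and load_mutsv
    # are hypothetical names, not an existing library.
    def dump_mutsv(rows):
        # Encode rows (lists of str) as μTSV bytes.
        out = []
        for row in rows:
            for value in row:
                if "\t" in value or "\n" in value:
                    raise ValueError("tab/newline in value: tough luck, use JSON")
            out.append("\t".join(row) + "\n")  # rule 2: every line ends with \n
        return "".join(out).encode("utf-8")    # rule 1: always UTF-8

    def load_mutsv(data):
        # Decode μTSV bytes back into rows.
        text = data.decode("utf-8")
        if text and not text.endswith("\n"):
            raise ValueError("missing trailing newline")
        return [line.split("\t") for line in text.split("\n")[:-1]]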



I mostly just thought it was fun to write. Put my pain down on a page, get a little bit of catharsis. I actually had written out some "counterexamples or suggestions" but decided to keep it terse and with the general style of the other similar posts.

I was not aware of the scorn for the style, I just knew I enjoy reading them.

As opposed to your μTSV, may I suggest DSV? The Art of Unix Programming makes a strong case for how it's superior to CSV. [1]

1. http://www.catb.org/esr/writings/taoup/html/ch05s02.html#id2...


Heh, by coincidence, I had recently written two posts about DSV, and had mentioned the same TAOUP DSV link you gave above:

Processing DSV data (Delimiter-Separated Values) with Python:

https://jugad2.blogspot.in/2016/11/processing-dsv-data-delim...

The example program in the post lets you specify the delimiter for the DSV data it processes with either "-c delimiter_char" or "-n delimiter_ASCII_code".
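The gist of that option handling is something like the following (a hedged sketch, not the actual code from the post):

    # Sketch of the two delimiter options described above; not the
    # actual program from the linked post.
    import sys

    def get_delimiter(argv):
        if len(argv) >= 2 and argv[0] == "-c":
            return argv[1]            # e.g. -c '|'
        if len(argv) >= 2 and argv[0] == "-n":
            return chr(int(argv[1]))  # e.g. -n 9 for a tab
        raise SystemExit("usage: dsv.py (-c char | -n ascii_code) < data")

    delim = get_delimiter(sys.argv[1:])
    for line in sys.stdin:
        print(line.rstrip("\n").split(delim))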

and

[xtopdf] Publish DSV data (Delimiter-Separated Values) to PDF:

https://jugad2.blogspot.in/2016/12/xtopdf-publish-dsv-data-d...


You're both wrong. EDN is the best.

https://github.com/edn-format/edn


>You're both wrong.

Bah. How am I wrong when I haven't even claimed DSV to be the best? Just quoting that section in ESR's book does not imply I endorse or agree with it. Logic ...


Plus, just providing that EDN link is hardly proof that it is the best ... And further, "best" in many cases is subjective. If you want to claim EDN is the best, lay out an objective proof.


I get that the parent response about EDN isn't particularly enticing, but you really should give it a look.


I do intend to. Thank you for the comment though. I starred EDN on GitHub just now, before replying to groovy2shoes, i.e., I took an initial brief look at it. Saw a commit by richhickey (Clojure creator). Since I'm into data / file formats, I will definitely check out EDN.


Sorry, my intent (besides bringing up EDN) was to be playful. Tone is often difficult to communicate in writing. I would have stuck in a `:p` if I hadn't been on mobile, but it slipped my mind. My apologies; I did not mean to insult or to otherwise be combative or critical.

I don't have objective proof that it's the best. It's a mix of personal preference and empirical "evidence" based on past success with both EDN and its granddaddy s-expressions for data exchange and serialization.

As for why I personally like EDN so much:

1. It offers the malleability, flexibility, and extensibility of XML, while being

2. even more concise than JSON,

3. straightforward to parse, and

4. precisely defined in a rigorous and language-independent specification.

Traits 1-3 it shares with the traditional s-expressions of the Lisp family, but in contrast to s-expressions it's specifically designed as a data exchange format rather than as a lexical/syntactic format for programming languages. The reason this matters is that the traditional Lisp reader is itself extensible, and extensions can't be guaranteed to carry over from dialect to dialect or even from implementation to implementation. Many Lisp readers go so far as to allow the execution of arbitrary code at read time, which is desirable for source code as it enables metaprogramming, but not so desirable when you're parsing data from an arbitrary source, due to security concerns.

While EDN does have its roots in Clojure's reader, EDN is not exactly Clojure's external representation. Rather, it's a subset of Clojure's surface syntax in much the same way that JSON is a subset of JavaScript's. Like JSON, EDN works great outside of its "native language" (in fact, I've never used EDN from Clojure itself; I've only used it from Lua, Scheme, Python, C, C++, C#, Java, and JavaScript (not necessarily in that order)).
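To illustrate that last point, reading EDN from Python can look like this (using the third-party edn_format library that comes up later in this thread; the exact Python types returned depend on the library version):

    # Reading a small EDN document from Python with the third-party
    # edn_format library (https://github.com/swaroopch/edn_format).
    import edn_format

    source = '{:name "example" :ids [1 2 3] :nested {:ok? true}}'
    data = edn_format.loads(source)
    print(data)  # an immutable map keyed by EDN keywords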


>Sorry, my intent (beside bringing up EDN) was to be playful. Tone is often difficult to communicate in writing.

I understand, and agree it can be difficult. Thanks for clearing that up, and no worries at all. Interesting info, BTW, about EDN. Will reply again in more detail later today with any comments/questions.


I'll need some time to check out EDN, so will message you with any comments later, since this thread might expire. Thanks again for letting me know about it.


I'm also interested to hear your perspective.


Cool. Will try to comment before the thread expires, then (not sure what the period is), else will post the points here as a new thread under my id (vram22).


Which Python library did you use for EDN (if any)? I had quickly googled for that, and saw a few, including one by Swaroop C H. Got any preferences? And same question for C.


Replying to myself as a way to quickly locate this lib in context:

https://github.com/swaroopch/edn_format


Sorry, I just saw this question of yours!

I was an early adopter of EDN, and wound up rolling my own parser in Python. EDN is simple enough that you can realistically roll a parser for it, complete with tests, in one or two working days. I find this to be a huge advantage—EDN doesn't really have any corner cases, so implementation is very straightforward.
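To give a feel for how few corner cases there are, here is a toy reader for a tiny EDN subset (integers, naively unescaped strings, keywords kept as plain strings, vectors, maps); a real parser also needs chars, sets, floats, nil/true/false, comments, and tagged elements:

    # Toy reader for a tiny EDN subset; commas count as whitespace.
    import re

    TOKEN = re.compile(r'[\s,]*("(?:\\.|[^"\\])*"|[\[\]{}]|[^\s,\[\]{}"]+)')

    def parse(tokens):
        tok = tokens.pop(0)
        if tok == '[':
            vec = []
            while tokens[0] != ']':
                vec.append(parse(tokens))
            tokens.pop(0)
            return vec
        if tok == '{':
            items = []
            while tokens[0] != '}':
                items.append(parse(tokens))
            tokens.pop(0)
            return dict(zip(items[::2], items[1::2]))
        if tok.startswith('"'):
            return tok[1:-1]   # naive: leaves escape sequences alone
        if tok.startswith(':'):
            return tok         # keywords kept as plain strings here
        return int(tok)        # everything else assumed to be an integer

    print(parse(TOKEN.findall('{:a 1 :b [2 3 "four"]}')))
    # {':a': 1, ':b': [2, 3, 'four']}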


No problem, and thanks for the answer.


I tried out https://github.com/swaroopch/edn_format some (in Python). Will read up more on EDN, try things a bit more, and then reply here with any comments or impressions.


Many of these problems could be avoided if we used the ASCII character codes for the unit separator (31) and record separator (30). This would avoid any problems with tabs or line feeds, as well as better match the intended semantics of ASCII/UTF-8.
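For example, a quick Python sketch (US is 0x1F, RS is 0x1E):

    # With the ASCII unit (0x1F) and record (0x1E) separators, tabs,
    # commas, and newlines in values need no escaping at all.
    US, RS = "\x1f", "\x1e"

    def encode(rows):
        return RS.join(US.join(row) for row in rows) + RS

    def decode(blob):
        return [rec.split(US) for rec in blob.split(RS) if rec]

    rows = [["a", "b,with,commas"], ["line\nbreak", "tab\there"]]
    assert decode(encode(rows)) == rows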


That's suggested often, and the obvious rebuttal is that CSV is intended to be easily edited and relatively human-friendly. The unit/record characters are not on most (all?) keyboards, so editing the file in a text editor becomes pretty much impossible.


As a counterpoint, nobody seems to be editing CSV files by hand any more. It appears at this point to be an interchange format between different spreadsheets, row-oriented databases, etc.


I promise they're still edited by hand all the time, usually as a way of supplying data to a program in cases where a database is just too much. CSV is still a very common base for data-driven unit tests.


If you're going to replace CSV, then why not use the ASCII record and field delimiter characters and avoid using characters like tab and newline, which are in-band and actually useful?

And if you're using tabs as field delimiters then why do you need quotation marks in the mix?


In practice, I made heavy use of this format at my last company. I was dealing with lots and lots of Google Sheets where I knew there wouldn't be any tabs or returns in the columns, and my code was all ad-hoc stuff that would never see production, so I'd just read in files, line by line, split on "\t", and bam, instant parser.
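In Python the whole "parser" really is about three lines, assuming (as above) that no field can ever contain a tab or a newline:

    # The ad-hoc approach described above; only safe when no field can
    # ever contain a tab or a newline.
    def read_sheet(path):
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n").split("\t") for line in f]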


Yeah, the genre was good for non-standardized things like names and (to a lesser extent) time, but file formats have specs and reference implementations. Refer to those.


Good file formats have specifications and reference implementations.

By this standard, CSV is a terrible format.


ObNitPick: "Standard" is a bit confusing in this context. I suggest using "measure", "criterion", or something similar.


PSV (https://github.com/jgis/psv-spec) allows "\t" and "\n" in values.


Using the "μ" in a header (e.g. Content-Type) is at least historically compliant as it exists in ISO-8859-1...but doing so needlessly would be insane.


Rather like the space in "Program Files", the "μ" would serve the function of making sure that the caller has at least some working UTF-8.


AFAIK using the UTF-8 encoding of "μ" would not be standard compliant.


A much easier solution is to develop a standard for a meta-field that describes the CSV features used. Example:

    #!CSV: escape='\\', quote='"', quotealg='c-like', separator='\t', eof='\n\r', comment='#', encoding='utf-8'
    foo     bar     baz

Such a standard (let's call it mCSV) can be incorporated into existing solutions. Moreover, existing files can be converted between a proprietary flavor of CSV and metainized CSV.
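A hedged sketch of consuming that meta-line in Python (the "#!CSV:" header and its keys are this proposal's invention, not an existing standard):

    # Reads the hypothetical mCSV meta-line, then hands the remaining
    # text to Python's stdlib csv module.
    import csv, io

    DEFAULTS = {"separator": ",", "quote": '"'}

    def read_mcsv(text):
        lines = text.splitlines(keepends=True)
        opts = dict(DEFAULTS)
        if lines and lines[0].startswith("#!CSV:"):
            for part in lines[0][len("#!CSV:"):].split(","):
                key, _, val = part.strip().partition("=")
                # turn escapes like '\t' in the meta-line into real chars
                opts[key] = val.strip("'").encode().decode("unicode_escape")
            lines = lines[1:]
        return list(csv.reader(io.StringIO("".join(lines)),
                               delimiter=opts["separator"],
                               quotechar=opts["quote"]))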


Self-describing data is generally a bad idea. If you can't get programs to agree on the original document format, what makes you think you'll be able to get them to agree on this meta format?

EDIT: The right solution is the simple one. Follow RFC4180 exactly. Reject everything that doesn't match it.
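With Python's stdlib, that stance is roughly the following (strict=True makes the reader raise csv.Error on bad quoting instead of guessing, though it won't catch every deviation from the RFC):

    # Approximating "parse RFC 4180, reject everything else" with the
    # stdlib csv module; strict=True raises csv.Error on bad quoting.
    import csv, io

    def parse_strict(text):
        return list(csv.reader(io.StringIO(text), strict=True, doublequote=True))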


CSV is a format for information interchange. The general rule for interchange formats is: be strict in what you emit, be liberal in what you accept.

So, yes, follow RFC4180 strictly in your own code, but be able to read CSV files produced by others too. A single line of metainformation can greatly reduce confusion and is easy to throw out when it is unnecessary.


This rule sounds nice, but you end up with a bunch of malformed documents out there, because the people emitting them never get feedback about what they are doing wrong.

I think I've said this in another thread, but I think rejecting malformed documents even though you could "figure out what they meant" is the equivalent of not telling people when they have food on their face.

I think the right pattern for something like this is to complain loudly when you are given a malformed document, then say, "If you really want me to parse this malformed document, use --force and I'll try to make some guesses, but really you should just ask for a well-formed document."
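Sketched as a command-line tool (with strict CSV parsing standing in for "well-formed"), the pattern looks something like:

    # "Complain loudly, offer --force": strict parse first, lenient
    # best-effort parse only on explicit request.
    import argparse, csv, io, sys

    ap = argparse.ArgumentParser()
    ap.add_argument("--force", action="store_true",
                    help="try to guess at malformed input")
    args = ap.parse_args()
    text = sys.stdin.read()
    try:
        rows = list(csv.reader(io.StringIO(text), strict=True))
    except csv.Error as err:
        if not args.force:
            sys.exit("malformed CSV (%s); ask for a well-formed document, "
                     "or rerun with --force and I'll make some guesses" % err)
        rows = list(csv.reader(io.StringIO(text)))  # lenient second pass
    print(rows)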


Yep, we have a bunch of incompatible formats and tons of legacy documents and systems, but I'm 100% sure that I'm not the person to blame for that.


Well, I'm saying the attitude of "be liberal in what you accept" is to blame for the fact that malformed documents are being emitted in the first place. A huge number of people will just hack away at something until it works rather than implementing their emitters to spec, so accepting malformed documents is training people to emit malformed documents.

That is unrelated to well-formed documents in incompatible formats, though.



