
I don't get it. In a single millisecond a processor does on the order of millions of operations, and if it's being run that tightly the code should be close to the CPU, perhaps in L2 or L3 cache -- so what on earth requires date parsing 62x faster than that? 320 ns * 62 is 19 microseconds for the version that you said was too slow for you, which is 0.019 milliseconds. What even generates formatted date data that fast, and why can't you wait 0.019 ms to parse it back out? Genuine question.

EDIT: I got downvoted, but it's just a question. If it's logs, is it old logs, or streaming from somewhere? Rather than speculate, I'd like to hear the actual answer from thomas-st for their use case. 62x is just under 1.8 orders of magnitude, so I wouldn't think it matters in comparison with typical CPU and RAM speeds, disk, network, or other bottlenecks...



Good CSV parsers reach 200MB/s. A formatted datetime is under 40 bytes, so assuming your input is nothing but dates, that's ~5 million dates per second, while 19µs/date is ~50,000 dates per second. The date parsing is two orders of magnitude behind the data source.
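
A quick back-of-the-envelope check of those figures (using the assumed numbers above, not measurements):

    fn main() {
        let source_bytes_per_sec = 200_000_000.0; // "good CSV parser" at 200 MB/s
        let bytes_per_datetime = 40.0;            // formatted datetime, upper bound
        let parse_time_sec = 19e-6;               // 19 µs per datetime

        let source_rate = source_bytes_per_sec / bytes_per_datetime; // ~5 million dates/s
        let parse_rate = 1.0 / parse_time_sec;                       // ~52,600 dates/s
        // Prints a gap of ~95x, i.e. roughly two orders of magnitude.
        println!("source {:.0}/s, parser {:.0}/s, gap {:.0}x",
                 source_rate, parse_rate, source_rate / parse_rate);
    }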

Even allowing for other data, e.g. a timestamp followed by some other fields, at 19µs/datetime you can easily end up with that bottlenecking your entire pipeline if the datasource spews (which is common in contexts like HFT, aggregated logs and the like).


> at 19µs/datetime you can easily end up with that bottlenecking your entire pipeline if the datasource spews (which is common in contexts like HFT, aggregated logs and the like)

+1

This is why a little ELT goes a long way.

>Good CSV parsers reach 200MB/s

By good (and open source) we're talking about libcsv, rust-csv, and rust quick-csv [1]. If you're doing your own custom parsing, you can write your own numeric parsers that drop support for nan, inf, -inf, etc. and scientific notation, which will claw back a lot of the time. If you also know the exact width of the date field, you can shave plenty of time off parsing datetimes. But at that point, maybe write the data to disk as protobuf or msgpack or avro, or whatever.

[1] https://bitbucket.org/ewanhiggs/csv-game
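
To illustrate the fixed-width idea, here is a minimal sketch (it assumes a "YYYY-MM-DDTHH:MM:SS" layout with valid ASCII digits, and is not the linked article's implementation):

    struct DateTime { year: u32, month: u32, day: u32, hour: u32, minute: u32, second: u32 }

    fn two_digits(b: &[u8]) -> u32 {
        (b[0] - b'0') as u32 * 10 + (b[1] - b'0') as u32
    }

    // Field positions are hard-coded because the width is known in advance.
    fn parse_fixed(s: &[u8]) -> Option<DateTime> {
        if s.len() < 19 { return None; }
        Some(DateTime {
            year: two_digits(&s[0..]) * 100 + two_digits(&s[2..]),
            month: two_digits(&s[5..]),
            day: two_digits(&s[8..]),
            hour: two_digits(&s[11..]),
            minute: two_digits(&s[14..]),
            second: two_digits(&s[17..]),
        })
    }

No range checks, no time zones, no fractional seconds, which is exactly why it can be so much cheaper than a general-purpose parser.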


> If you're doing your own custom parsing you can write your own numeric parsers to remove support for parsing nan, inf, -inf, etc and drop scientific notation which will claw back a lot of the time.

The 200MB/s, at least for rust-csv, is for "raw" parsing (handling the CSV format itself), not field parsing and conversions, so those would be additional costs.

> If you also know the exact width of the date field then you can also shave plenty of time parsing datetimes.

Yes, if you can have fixed-size fields and remove things like escaping and quoting, things get much faster.
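
For reference, the "raw" path with rust-csv looks roughly like this (a sketch; the "data.csv" path and the absence of headers are assumptions). Reading into a reused ByteRecord pays for the CSV-level work (delimiters, quoting) but leaves every field as raw bytes, so numeric or datetime conversion is an extra cost on top:

    use csv::{ByteRecord, ReaderBuilder};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let mut rdr = ReaderBuilder::new().has_headers(false).from_path("data.csv")?;
        let mut record = ByteRecord::new(); // reused to avoid a per-row allocation
        let mut rows = 0u64;
        while rdr.read_byte_record(&mut record)? {
            rows += 1;
            // record.get(0) is an Option<&[u8]>; parsing those bytes into numbers
            // or datetimes is where the additional time goes.
        }
        println!("{} rows", rows);
        Ok(())
    }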


It depends what exactly you mean by CSV parsing, but I've done record scanning on CSVs at >1GB/s on a 3GHz CPU.


It is a lot faster to split lines on ',' than to handle quoting with embedded commas, line breaks and so on.
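
A toy illustration of the trade-off (not a benchmark): the naive split is cheap precisely because it ignores quoting.

    // Fine only if fields can never contain commas, quotes or embedded newlines.
    fn split_naive(line: &str) -> Vec<&str> {
        line.split(',').collect()
    }

    fn main() {
        assert_eq!(split_naive("2021-03-04T05:06:07,42,ok"),
                   vec!["2021-03-04T05:06:07", "42", "ok"]);
        // A quoted field with an embedded comma gets split in two;
        // a real CSV parser has to track quote state to handle this.
        assert_eq!(split_naive(r#"2021-03-04T05:06:07,"a,b",ok"#).len(), 4);
    }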


Interesting. I'd wait to hear their answer - if it's a CSV bottlenecking at 2 MB/sec instead of 200 MB/sec, that would be interesting (that's the two orders of magnitude you state). But with all that said, they reported a 62x improvement, so 1.8 orders of magnitude.


Parsing logs, or basically any file, at 2MB/s (rounding the 1.8 orders of magnitude up to 2) is pretty poor. Hence why thomas-st wrote his own parser.


Let me ask a converse question: Why should a program be slow when it can be fast?


Not to defend OP, just answering this question: code clarity, readability, simplicity, correctness, clearer error handling...

Depending on how much faster "fast" is, these things can (dare I say, should) trump efficiency.

Essentially this is about optimisation vs premature optimisation, a horse that's been dead for decades now.



That's amazing, and incredibly surprising. It's one of the operations I would have thought was essentially free and completely dwarfed by other bottlenecks and latencies. I mean, it's just a few characters, right? I'm really surprised by the result you report in your link - thanks for writing it up!


Any data set with a lot of records and few fields, one of which is a timestamp, is going to spend some of its time formatting/parsing that timestamp.


Parsing logs? Geo location?



