
I do a lot of log crunching with PyPy, and datetime.strftime is a really big part of the cost. I wish someone would contribute faster datetime routines to CPython's standard library and to PyPy.

Other parts that are always hot include split() and string concatenation. Java compilers can substitute StringBuffers when they see naive string concatenation, but in Python there's no easy way to build a string in a complex loop and you end up putting string fragments into a list and then finally join()ing them. Madness!



One trick that helped when I was doing log crunching (where time parsing was a good 10%) was to cache the parse.

All log lines began with a date+time like "2015-12-10 14:42:54.432", and there are maybe 100 lines per second. You can therefore take just the first 19 characters, parse those to a unix time in seconds, then parse the milliseconds separately and add them. All you need is one cache entry (since logs are mostly in order), and then you can do a plain string comparison (i.e. no hashmap lookup) to check the cache - instantly 100x fewer time parsing calls.

The best way to speed up a function is to not call it!
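
In code, the single-entry cache is just a string compare plus a slice. A minimal sketch, assuming the "2015-12-10 14:42:54.432" line format above and local-time timestamps (parse_log_time and its globals are illustrative names):

    import time

    _last_prefix = None
    _last_epoch = 0.0

    def parse_log_time(line):
        # single-entry cache: logs are mostly in order, so the 19-char
        # prefix rarely changes between consecutive lines
        global _last_prefix, _last_epoch
        prefix = line[:19]                # "2015-12-10 14:42:54"
        if prefix != _last_prefix:        # plain string compare, no hashmap
            _last_epoch = time.mktime(
                time.strptime(prefix, "%Y-%m-%d %H:%M:%S"))
            _last_prefix = prefix
        return _last_epoch + int(line[20:23]) / 1000.0   # ".432" -> +0.432s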


That's called memoization, and it works really well in quite a number of places.

There are some compilers that can do it automatically with some hints, but I think those are mostly experimental rather than production-ready.
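
In Python you don't need compiler support for the simple case: functools.lru_cache memoizes a function with a single decorator. A sketch (the cache size and helper name are illustrative):

    import time
    from functools import lru_cache

    @lru_cache(maxsize=1024)
    def parse_prefix(prefix):
        # repeated prefixes hit the cache and skip strptime entirely
        return time.mktime(time.strptime(prefix, "%Y-%m-%d %H:%M:%S"))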


Well, it's half memoization, and half knowing that you can split the work into smaller parts, which allows you to take advantage of the memoization.

Just adding memoization to date and time parsing gets you very little when there's little duplication in the inputs, and without breaking the data apart it could very likely have yielded worse performance.


Here are some microbenchmarks:

    In [63]: timeit dateutil.parser.parse('2013-05-11T21:23:58.970460+07:00')
    10000 loops, best of 3: 89.5 µs per loop

    In [64]: timeit arrow.get('2013-05-11T21:23:58.970460+07:00')
    10000 loops, best of 3: 62.1 µs per loop

    In [65]: timeit numpy.datetime64('2013-05-11T21:23:58.970460+07:00')
    1000000 loops, best of 3: 714 ns per loop

    In [66]: timeit iso8601.parse_date('2013-05-11T21:23:58.970460+07:00')
    10000 loops, best of 3: 23.9 µs per loop

> Other parts that are always hot include split() and string concatenation. Java compilers can substitute StringBuffers when they see naive string concatenation, but in Python there's no easy way to build a string in a complex loop and you end up putting string fragments into a list and then finally join()ing them. Madness!

The Python solution you describe is the same as in Java. If you have `String a = b + c + d;` then the compiler may optimize this using a StringBuffer, as you say[1]. In Python it's also pretty cheap to do `a = b + c + d` to concatenate strings (or `''.join([b, c, d])`; run a little microbenchmark to see which works best). But if the concatenation is in a "complex loop", as you put it, Java will certainly not do this. You have to build the string with a StringBuilder and then call toString(), which is essentially the same process, just spelled `builder.toString()` instead of `''.join(builder)`.
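
For the record, the two Python idioms side by side (a sketch; make_fragment is a hypothetical stand-in for whatever produces the pieces):

    def make_fragment(i):
        return 'line %d\n' % i

    # accumulate-then-join, as described above
    parts = []
    for i in range(1000):
        parts.append(make_fragment(i))
    result = ''.join(parts)

    # naive concatenation; fine on CPython thanks to the in-place
    # optimisation, but can be quadratic on implementations without it
    result = ''
    for i in range(1000):
        result += make_fragment(i)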

Unless of course you have some interesting insights into the JVM internals about string concatenation optimizations.

[1] http://docs.oracle.com/javase/specs/jls/se8/html/jls-15.html...


Of course you have a different machine, but the OP was getting 2.5 µs per parse in .NET versus your 89.5 µs in Python. I wouldn't have expected such a difference. No wonder it's a hot path.


Well, that's dateutil (installed from pip), not datetime (stdlib). As part of log ingestion I would, of course, convert to UTC and drop the timezone distinctions, since Python slows down a lot when it has to worry about timezones. Working in one unit with no DST issues is much nicer/quicker.
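
The normalisation step might look something like this (a sketch using the iso8601 package benchmarked above; Python 3):

    from datetime import timezone
    import iso8601

    dt = iso8601.parse_date('2013-05-11T21:23:58.970460+07:00')
    dt_utc = dt.astimezone(timezone.utc).replace(tzinfo=None)  # naive UTC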

Anyway, if you're installing packages from pip, you may as well install iso8601 and get the best performance - possibly beating .NET (who knows? as you said, I have a different machine than the OP).


The numpy version seems to be about 30 times faster than the iso8601 version - note that its result is in nanoseconds, not microseconds like the others.


Yeah, but the OP is using PyPy, and I don't know whether numpy fully works on PyPy. I think I read that it does, but I haven't tried it.


dateutil spends most of its time inferring the format; it's not really designed as a performance component, it's designed as an "if it looks like a date, we'll give you a datetime" style component.
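
If you do know the format, telling the parser skips the inference entirely; a sketch (offset handling omitted, assuming timestamps have already been normalised):

    from datetime import datetime

    # fixed format, no guessing - measure on your own machine, but this
    # is typically far cheaper than dateutil's inference
    dt = datetime.strptime('2013-05-11T21:23:58.970460',
                           '%Y-%m-%dT%H:%M:%S.%f')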


Is there a particular reason that all of the loops are 10k except numpy which is 1M?


The %timeit magic runs snippets for a variable number of iterations based on how long the snippet takes.


Because it can do so many more iterations in a similar amount of time, it just does them.


timeit does a calibration run to decide how many loops to do; since the numpy version was much faster, it ran it more times.
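
You can watch the calibration happen via timeit's API (Python 3.6+; the snippet is illustrative):

    import timeit

    t = timeit.Timer("numpy.datetime64('2013-05-11T21:23:58.970460')",
                     setup="import numpy")
    number, elapsed = t.autorange()   # grows the loop count until the
    print(number, elapsed / number)   # total run takes at least 0.2s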


The Java story is slightly more complex. javac emits code that creates a StringBuilder and calls append() on it multiple times, and the JIT then spots this construction and optimises it.

This is, as you can guess, somewhat fragile, especially when some of the parts are constants. So JEP 280 changed javac to emit an invokedynamic instruction carrying information about the constant and dynamic parts of the string, so the optimisation strategy can be chosen at run time and can change over time without requiring everyone to recompile their Java code.


However, one should mention that this optimisation is HotSpot-specific; other Java implementations will behave differently.


For your point about Python string concatenation, I believe that's what io.StringIO is for (https://docs.python.org/3/library/io.html#io.StringIO).
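
A sketch of the StringIO version of the loop:

    import io

    buf = io.StringIO()
    for i in range(1000):
        buf.write('line %d\n' % i)   # appends without rebuilding the string
    result = buf.getvalue()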


If you are using CPython 2.4 or later, use the clearest code. In earlier versions join() on a list was faster, but the in-place add operation on strings was then optimized, turning the advice that join() is always faster mostly into mythlore.

https://bugs.python.org/issue980695

PyPy and other implementations may do better with the join idiom, though.

Of course, someone whose code spends a lot of time joining strings can measure which is best for their situation, but += is fine for most things.
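
The measurement itself is a couple of lines with timeit (the sizes here are illustrative):

    import timeit

    setup = "parts = ['x' * 10] * 1000"
    print(timeit.timeit("s = ''\nfor p in parts: s += p", setup, number=1000))
    print(timeit.timeit("s = ''.join(parts)", setup, number=1000))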


http://pypy.org/performance.html says "String concatenation is expensive" and suggests using join() in loops.

It's an old problem: https://bitbucket.org/pypy/pypy/issues/1925/very-slow-string...

So whilst PyPy is otherwise much faster than CPython, missing this kind of optimisation is why CPython can actually be faster for parsing my logs.

I know about this because I've been bitten by it :)



