
A point of clarification here - numpy's reshape operation stays fast as long as the array is a numpy array.
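For context, a minimal sketch of why that is: when the input is already a contiguous ndarray, reshape just rewrites shape/stride metadata and returns a view, so its cost doesn't scale with the number of elements (the variable names below are only illustrative):

    import numpy as np

    a = np.arange(6)          # already a numpy array
    b = a.reshape(2, 3)       # for a contiguous array, only shape/stride metadata changes

    print(b.base is a)        # True: b is a view over a's buffer, no data was copied
    a[0] = 99
    print(b[0, 0])            # 99: the change shows through the view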

Which is exactly what the parent comment was all about - the author figured that numpy was significantly faster because it was accessing / working with the data in a different fashion.

So, in order to test that theory, he converted the numpy.array into a normal Python array before running any of the timed operations with zip vs. numpy.reshape, etc.

This is a more realistic playing field if you're considering data that was created outside of the numpy environment. At some point, if you're going to work with numpy.reshape, it will need to be type converted / "imported" into numpy data types.

For the purposes of this test, it's much more "fair" to include both the time numpy spent on splitting the array and that conversion time. The reshape step in numpy ran in essentially O(1) time when given its native data types, which indicates that some behind-the-scenes work had already been done to allow for that speed. The parent example is much more realistic in capturing the cost of that behind-the-scenes work, by forcing each method to start from the exact same data objects.
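To make that concrete, here is a rough sketch of the kind of "fair" timing being described, starting both approaches from the same plain Python list so numpy's measurement includes the conversion step (the pairing-into-2-tuples task and the names here are assumptions, not the article's exact code):

    import time
    import numpy as np

    n = 3_000_000
    data = [float(i) for i in range(n)]        # data born outside numpy

    # numpy timed "fairly": include the list -> ndarray conversion, not just reshape
    t0 = time.perf_counter()
    arr = np.array(data)                       # O(n) conversion/copy
    pairs_np = arr.reshape(-1, 2)              # near-constant-time metadata change
    t_numpy = time.perf_counter() - t0

    # pure-Python pairing with zip, same starting point
    t0 = time.perf_counter()
    pairs_py = list(zip(data[0::2], data[1::2]))
    t_zip = time.perf_counter() - t0

    print(f"numpy incl. conversion: {t_numpy:.3f}s  zip: {t_zip:.3f}s")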



My reply was in response to the statement "numpy is two orders of magnitude faster here; it's evidently using a highly optimized internal codepath for random sequence generation", which is false: the speedup isn't from a highly optimized internal codepath for random sequence generation, it's because that code produced a numpy array (or didn't have to do a type conversion). But I agree that when using numpy in a timing comparison, it would be fair to start with a numpy array, or to show the time involved in creating the array.


Thanks for your comments. I hope it didn't sound like I was negatively comparing numpy's array/sequence operations to anything. I know very little about numpy, and I assume that "real" numpy solutions don't look anything like what's being discussed here. I only included those measurements since the article's author did.

To clarify my points a bit, the optimizations I alluded to (in "highly optimized internal codepath") were meant to include things like using a generator, i.e. at no point is there an actual array of input random numbers. The fact that in numpy the 300-element "array" and the 3,000,000-element "array" had identical timings suggests exactly that; I disagree that it's an issue of internal representation, unless the concept of a numpy array subsumes the concept of a generator, in which case I think we're all saying the same thing.

That kind of optimization is only possible in this case because by the definition of randomness nobody could know what the values were until they were enumerated, so it's 100% transparent to use a generator. That's not how real-world data works, hence my forced-native-array measurement and pudquick's reply.
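To illustrate the generator-style laziness being described (this is plain Python, not a claim about numpy's internals), here is a sketch where no array of random numbers ever exists, so "producing" 300 or 3,000,000 pairs takes the same negligible time until something actually consumes them:

    import random
    from itertools import islice

    def random_pairs():
        """Lazily yield (x, y) pairs; no array of random values is ever materialized."""
        while True:
            yield (random.random(), random.random())

    small = islice(random_pairs(), 300)        # "created" instantly
    large = islice(random_pairs(), 3_000_000)  # also instant: nothing evaluated yet

    # the real work only happens when the pairs are consumed
    total = sum(x + y for x, y in large)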



