nfa_backward's comments

http://www.eclipse.org/openj9/

"Shared classes and Ahead-of-Time (AOT) technologies typically provide a 20-40% reduction in start-up time while improving the overall ramp-up time of applications. This capability is crucial for short-running Java applications or for horizontal scalability solutions that rely on the frequent provisioning and deprovisioning of JVM instances to manage workloads."


Facebook Presto - See my comment here: https://news.ycombinator.com/item?id=13626109


Facebook Presto, an MPP SQL engine written in Java.

https://github.com/prestodb/presto
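
For a taste of what using it looks like from the Java side, here is a rough sketch of running a query through Presto's JDBC driver (the coordinator address, catalog, schema, user name, and table are placeholders I made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.Properties;

    public class PrestoQuerySketch {
        public static void main(String[] args) throws Exception {
            // Placeholder coordinator address, catalog, and schema.
            String url = "jdbc:presto://localhost:8080/hive/default";
            Properties props = new Properties();
            props.setProperty("user", "demo"); // Presto requires a user name
            try (Connection conn = DriverManager.getConnection(url, props);
                 Statement stmt = conn.createStatement();
                 // Placeholder table name.
                 ResultSet rs = stmt.executeQuery("SELECT count(*) FROM lineitem")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
            }
        }
    }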

I have learned a lot from reading the source code and watching it develop. It is written in modern Java 8, and the authors are obviously experts in the language, the JVM, and the ecosystem. Since it is an MPP SQL engine, performance is very important, and the authors have struck a good balance between performance and clean abstractions.

I have also learned a lot about how to evolve a product. Large features are added iteratively. In my own code I often found myself going from Feature 1.0 -> Feature 2.0. Following Presto PRs, I have seen how large features go from Feature 1.0 -> Feature 1.1 -> Feature 1.2 -> ... -> Feature 2.0 very quickly. This is much more difficult than it sounds: how can I implement 10% of a feature, still have it provide benefits, and still be able to ship it? I have seen how this technique allows code to make it into production quickly, where it is validated and hardened. In some ways it reminds me of this: https://storify.com/jrauser/on-the-big-rewrite-and-bezos-as-.... You shouldn't be asking for a rewrite. Know where you want to go and carefully plan small steps from here to there.


This looks really interesting, and it's not another SQL-on-Hadoop solution. The benchmarks look impressive, but all of the queries were aggregations over a single table; I did not see any joins. I wonder how mature the optimizer is.


I think it's because the docs say "If there isn't enough memory, you can't run a JOIN," whereas SQL-on-Hadoop solutions still work without enough RAM by spilling to disk. I don't think a comparison with JOINs would be fair in this case.



Does Kudu colocate data from different tables with equal keys? If not, is this or a similar feature on the road map?


It doesn't yet. It's on our nebulous "we'd like to do this some time" roadmap, but we're currently concentrating on more basic work around stability and time-series features.

Of course this is a huge optimization for data warehousing applications, where two co-partitioned tables can be joined without any network data transfer, and in some cases could even use a merge join instead of hash-based strategies. But it's the usual time/scope/quality trinity, and we'd rather not compromise the third element.
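
To illustrate why co-partitioning is such a win: each node can join its local partitions with a straight merge over sorted runs, with no shuffle and no hash table. Here is a toy Java sketch of that merge step (the key/value record shape is invented for illustration, and keys are assumed unique on each side):

    import java.util.List;

    public class MergeJoinSketch {
        // Both inputs are assumed sorted by key and already on the same node,
        // which is exactly what co-partitioning buys you.
        // Each int[] is {key, value}; duplicate keys are skipped for brevity.
        static void mergeJoin(List<int[]> left, List<int[]> right) {
            int i = 0, j = 0;
            while (i < left.size() && j < right.size()) {
                int l = left.get(i)[0];
                int r = right.get(j)[0];
                if (l < r) {
                    i++;
                } else if (l > r) {
                    j++;
                } else {
                    // Matching keys: emit the joined row (here, just print it).
                    System.out.println(l + " -> " + left.get(i)[1] + ", " + right.get(j)[1]);
                    i++;
                    j++;
                }
            }
        }
    }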


Glad to hear this is at least being considered. The optimizations for data warehousing you mentioned are my use case. I understand that it is a very active project with a lot on the roadmap. It's a very cool project, and I follow you guys on http://gerrit.cloudera.org/#/q/status:open


Also worth noting that it's an open source project, so if you're interested in contributing in this area, we'd love to have you on board.


This looks interesting and is something I will definitely watch, but at this point I think I will still stick with http://h2o.ai/ (another JVM-based open source ML project that integrates well with 'Hadoop'). I have been really impressed with the quality of the product and even more so with the quality of the people behind it.


Does Kudu colocate data sets with identical keys? If so, are there plans to have Impala take advantage of this?


In my experience and the experience of others ( https://www.eecs.berkeley.edu/~keo/publications/nsdi15-final... ), current big data solutions are more often CPU bound than IO bound. I think we will see more and more big data architecture moving to C++. For example: http://www.scylladb.com/


Kudu is being positioned as filling the gap between HDFS and HBase. After reading the overview I see this more as bringing features from HDFS+Parquet+HBase. Does that sound reasonable?

Super excited about this and even more so since it is open source. Thank you!


Yep, that's correct. HDFS+Parquet is more accurate but doesn't fit quite as well on slides and short descriptions.

The idea is to get the analytic scan performance of Parquet while still allowing for in-place updates and row-by-row access like HBase.

HDFS (with Parquet or other formats) will still be better for unstructured or fully immutable datasets. HBase will still be better when your top priority is ingest rate, random access, and semi-structured data. Kudu should be good when you've got tabular data as described above.
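
From the client side, the "in-place updates and row-by-row access" part looks roughly like the sketch below using the Kudu Java client (the master address, table, and column names are placeholders, and package/method names have shifted a bit between Kudu versions, so treat this as an approximation):

    import org.apache.kudu.client.*;

    public class KuduSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder master address and table name.
            KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
            try {
                KuduTable table = client.openTable("metrics");
                KuduSession session = client.newSession();

                // Row-by-row write, HBase-style.
                Insert insert = table.newInsert();
                insert.getRow().addLong("id", 42L);
                insert.getRow().addString("host", "web-1");
                session.apply(insert);
                session.flush();

                // Column-projected scan, Parquet-style: read only the columns you need.
                KuduScanner scanner = client.newScannerBuilder(table)
                        .setProjectedColumnNames(java.util.Arrays.asList("host"))
                        .build();
                while (scanner.hasMoreRows()) {
                    for (RowResult row : scanner.nextRows()) {
                        System.out.println(row.getString("host"));
                    }
                }
            } finally {
                client.shutdown();
            }
        }
    }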


Impala has an in-memory columnar format on its roadmap for 2016. Is that format being designed with Kudu in mind?

Edit: I understand that the formats, while both columnar, serve different purposes. I am more curious about the overlap, if any, between the two.


Yep, I've been taking part in those design discussions. We hope to have Kudu tablet servers support generating this in-memory format in shared memory as the result of scans, so the Impala server (a client from Kudu's perspective) can operate directly on the data. We're expecting a 20-30% speed boost from this for some queries, though we haven't done any at-scale tests of the prototype.
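
Independent of the actual format being designed, the reason an in-memory columnar layout speeds up scans is easy to show with a toy Java comparison (the class and field names here are made up):

    public class ColumnarVsRowSketch {
        // Row layout: each record is an object, so scanning one field drags
        // every other field (and object headers) through the cache with it.
        static class RowRecord {
            long id;
            double price;
            String host;
        }

        static double sumPricesRowLayout(RowRecord[] rows) {
            double total = 0;
            for (RowRecord r : rows) {
                total += r.price;
            }
            return total;
        }

        // Columnar layout: one primitive array per column. Scanning "price"
        // touches only that array, which is dense, sequential, and cache friendly.
        static class ColumnarBatch {
            long[] ids;
            double[] prices;
            String[] hosts;
        }

        static double sumPricesColumnarLayout(ColumnarBatch batch) {
            double total = 0;
            for (double p : batch.prices) {
                total += p;
            }
            return total;
        }
    }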

