"Shared classes and Ahead-of-Time (AOT) technologies typically provide a 20-40% reduction in start-up time while improving the overall ramp-up time of applications. This capability is crucial for short-running Java applications or for horizontal scalability solutions that rely on the frequent provisioning and deprovisioning of JVM instances to manage workloads."
I have learned a lot from reading the source code and watching it develop. It is written in modern Java 8, and the authors are obviously experts in the language, the JVM, and the ecosystem. Since it is an MPP SQL engine, performance is very important, and the authors have struck a good balance between performance and clean abstractions. I have also learned a lot about how to evolve a product: large features are added iteratively. In my own code I often found myself going from Feature 1.0 -> Feature 2.0. Following Presto PRs, I have seen how, for large features, they go from Feature 1.0 -> Feature 1.1 -> Feature 1.2 -> ... Feature 2.0 very quickly. This is much more difficult than it sounds: how can I implement 10% of a feature, still have it provide benefits, and still be able to ship it? I have seen how this technique allows code to make it into production quickly, where it is validated and hardened. In some ways it reminds me of this: https://storify.com/jrauser/on-the-big-rewrite-and-bezos-as-.... You shouldn't be asking for a rewrite. Know where you want to go and carefully plan small steps from here to there.
Looks really interesting, and not just another SQL-on-Hadoop solution. The benchmarks look impressive, but all of the queries were aggregations over a single table; I did not see any joins. I wonder how mature the optimizer is.
I think it's because the docs say, "If there isn't enough memory, you can't run a JOIN," while SQL-on-Hadoop solutions also work without enough RAM by spilling to disk. I don't think a comparison with JOINs would be fair in this case.
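The memory limit being discussed comes from how an in-memory hash join works: the whole build side must fit in RAM, whereas spilling engines partition oversized inputs to disk instead. A minimal sketch (class and method names are hypothetical, for illustration only):

```java
import java.util.*;

// Minimal in-memory hash join sketch. The entire build side is loaded
// into a hash table, which is why an engine without disk spilling
// refuses the JOIN when that table would not fit in memory.
public class HashJoinSketch {

    // Inputs are (key, value) rows; the result is (key, buildValue, probeValue).
    public static List<String[]> hashJoin(List<String[]> build, List<String[]> probe) {
        // Build phase: materialize the (smaller) build input in a hash table.
        Map<String, List<String>> table = new HashMap<>();
        for (String[] row : build) {
            table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        // Probe phase: stream the other input, emitting one output row per match.
        List<String[]> out = new ArrayList<>();
        for (String[] row : probe) {
            for (String v : table.getOrDefault(row[0], Collections.emptyList())) {
                out.add(new String[]{row[0], v, row[1]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(
                new String[]{"1", "alice"}, new String[]{"2", "bob"});
        List<String[]> orders = Arrays.asList(
                new String[]{"1", "book"}, new String[]{"1", "pen"}, new String[]{"3", "mug"});
        for (String[] r : hashJoin(users, orders)) {
            System.out.println(String.join(",", r));
        }
    }
}
```

A spilling engine would instead hash-partition both inputs to disk when the table outgrows its budget and then join the partitions one at a time, trading speed for the ability to run regardless of RAM.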
It doesn't yet. It's on our nebulous "we'd like to do this some time" roadmap, but we're currently concentrating on more basic stuff around stability and time series features.
Of course this is a huge optimization for data warehousing applications, where two co-partitioned tables can be joined without any network data transfer, and in some cases a merge join could even be used instead of hash-based strategies. But it's the usual time/scope/quality trinity, and we'd rather not compromise the third element.
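The merge join mentioned above works when both inputs are already sorted on the join key (as co-partitioned, co-sorted tables can be): two cursors advance in lockstep, so no hash table is built and no shuffle is needed. A minimal sketch on plain sorted key arrays (names are hypothetical):

```java
import java.util.*;

// Merge join over two inputs pre-sorted on the join key.
public class MergeJoinSketch {

    // Returns one (leftKey, rightKey) pair per matching combination,
    // handling duplicate keys on both sides.
    public static List<int[]> mergeJoin(int[] left, int[] right) {
        List<int[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j]) {
                i++;                       // advance the smaller side
            } else if (left[i] > right[j]) {
                j++;
            } else {
                // Equal keys: find the run of duplicates on each side
                // and emit the cross product for this key.
                int key = left[i], iEnd = i, jEnd = j;
                while (iEnd < left.length && left[iEnd] == key) iEnd++;
                while (jEnd < right.length && right[jEnd] == key) jEnd++;
                for (int a = i; a < iEnd; a++)
                    for (int b = j; b < jEnd; b++)
                        out.add(new int[]{key, key});
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }
}
```

Each input is read once in order, so the join runs in linear time with constant memory, which is exactly why it beats hash strategies when the sort order comes for free from the storage layout.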
Glad to hear this is at least being considered. The data warehousing optimizations you mentioned are my use case. I understand that it is a very active project with a lot on the roadmap. It's a very cool project, and I follow you guys on http://gerrit.cloudera.org/#/q/status:open
This looks interesting and something I will definitely watch, but at this point I think I will still stick with http://h2o.ai/ (another JVM-based ML open source project that integrates well with 'Hadoop'). I have been really impressed with the quality of the product, and even more so with the quality of the people behind it.
Kudu is being positioned as filling the gap between HDFS and HBase. After reading the overview I see this more as bringing features from HDFS+Parquet+HBase. Does that sound reasonable?
Super excited about this and even more so since it is open source. Thank you!
Yep, that's correct. HDFS+Parquet is more accurate but doesn't fit quite as well on slides and short descriptions.
The idea is to get the analytic scan performance of Parquet while still allowing for in-place updates and row-by-row access like HBase.
HDFS (with Parquet or other formats) will still be better for unstructured or fully immutable datasets. HBase will still be better when your top priorities are ingest rate, random access, and semi-structured data. Kudu should be good when you've got tabular data as described above.
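The scan-performance trade-off behind this positioning can be shown in miniature: a columnar layout (Parquet-style) keeps each column contiguous, so an analytic scan reads only the bytes it needs, while a row-wise layout (HBase-style) keeps each record together, which favors point lookups and in-place updates. A toy illustration (nothing here is Kudu's actual storage code):

```java
// Toy comparison of row-wise vs column-wise layouts for an analytic scan.
public class LayoutSketch {

    // Row-wise: each record's fields are stored together. Summing one
    // column still walks past every other field of every row.
    public static long sumRowWise(long[][] rows, int col) {
        long sum = 0;
        for (long[] row : rows) {
            sum += row[col];
        }
        return sum;
    }

    // Column-wise: one contiguous array per column. The scan touches
    // only the target column, which is also friendlier to compression
    // and vectorized execution.
    public static long sumColumnar(long[] column) {
        long sum = 0;
        for (long v : column) {
            sum += v;
        }
        return sum;
    }
}
```

Kudu's stated goal above is to keep the columnar scan path while still offering row-by-row access and updates, which pure Parquet-on-HDFS does not.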
Yep, I've been taking part in those design discussions. We hope to have Kudu tablet servers support generating this in-memory format in shared memory as the result of scans, so that the Impala server (a client from Kudu's perspective) can operate on the data directly. We're expecting a 20-30% speed boost from this for some queries, though we haven't done any at-scale tests of the prototype.
"Shared classes and Ahead-of-Time (AOT) technologies typically provide a 20-40% reduction in start-up time while improving the overall ramp-up time of applications. This capability is crucial for short-running Java applications or for horizontal scalability solutions that rely on the frequent provisioning and deprovisioning of JVM instances to manage workloads."