carsonfarmer's comments | Hacker News

Pretty cool to see this here. My co-author and I haven't actually officially "released" this yet, so it's quite neat to see where it's organically showing up! Feedback appreciated, y'all!


Thanks, that's all due to my co-author. Here's what we used: https://news.ycombinator.com/item?id=41589728 (link is to a post from @msy)


You're right, we should delve deeper into a comparison with prolly trees. We actually have a lot of experience with prolly trees, and we've found in practice that you need to do a lot of the things folks like Dolt have done to make them work nicely. With G-trees, by contrast, the basic implementation turns out to be quite nice (and extremely easy to reason about).

One of the biggest benefits of G-trees, in my mind, is their ease of implementation. Additionally, we did a lot of work to explore their statistical properties, an analysis that doesn't yet exist for prolly trees (though in hindsight we've effectively done it, so we should probably write it up formally).
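For a sense of how little machinery is involved, here's a minimal sketch in Go of the core ingredient. This isn't code from the paper; the SHA-256 choice and names are illustrative assumptions. Each key gets a geometrically distributed rank by hashing it and counting leading zero bits:

    package main

    import (
        "crypto/sha256"
        "encoding/binary"
        "fmt"
        "math/bits"
    )

    // rank draws a geometrically distributed rank for a key by hashing
    // it and counting leading zero bits: P(rank >= k) = 2^-k. Roughly,
    // higher-rank items sit nearer the root and equal-rank neighbors
    // group into a node, so the shape is a pure function of the keys.
    func rank(key string) int {
        sum := sha256.Sum256([]byte(key))
        return bits.LeadingZeros64(binary.BigEndian.Uint64(sum[:8]))
    }

    func main() {
        for _, k := range []string{"apple", "banana", "cherry"} {
            fmt.Println(k, "->", rank(k))
        }
    }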


Another thing that's worth investigating:

As the name implies, the sizes of nodes in Prolly trees and geometric search trees are geometrically distributed. My question is: is that really the right distribution to use? The larger a node gets, the larger the probability that it gets mutated. This means that in a content-addressed storage system, there will be more large objects than small ones. My gut feeling is that the distribution should be uniform, with the spread between min/max sizes bounded by a small constant factor (2x? 4x?).

Some time ago I experimented with this: I implemented a content-defined chunking algorithm that chunks inputs at locations where the value of a rolling hash is maximal, as opposed to finding offsets at which the first/last n bits are zeros/ones. My observation was that this led to a 2-3% reduction in storage space usage. The source code can be found here:

https://github.com/buildbarn/go-cdc
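To make the idea concrete, here's a rough sketch in Go (not the actual go-cdc code; the window sizes, toy FNV-style position hash, and naive quadratic scan are all illustrative assumptions): declare a boundary wherever the hash at a position is the strict maximum over a surrounding window.

    package main

    import "fmt"

    // hashAt derives a value for position i from the win bytes ending
    // there, so a boundary decision depends only on nearby content.
    func hashAt(data []byte, i, win int) uint64 {
        h := uint64(14695981039346656037) // FNV-1a offset basis
        lo := i - win + 1
        if lo < 0 {
            lo = 0
        }
        for _, b := range data[lo : i+1] {
            h = (h ^ uint64(b)) * 1099511628211 // FNV-1a prime
        }
        return h
    }

    // chunks cuts after position i when hashAt(i) strictly exceeds the
    // hash at every other position within +/- horizon (naive scan).
    func chunks(data []byte, win, horizon, minSize int) []int {
        var cuts []int
        last := 0
        for i := range data {
            if i+1-last < minSize {
                continue
            }
            hi := hashAt(data, i, win)
            isMax := true
            for j := i - horizon; j <= i+horizon && isMax; j++ {
                if j == i || j < 0 || j >= len(data) {
                    continue
                }
                if hashAt(data, j, win) >= hi {
                    isMax = false
                }
            }
            if isMax {
                cuts = append(cuts, i+1)
                last = i + 1
            }
        }
        if len(cuts) == 0 || cuts[len(cuts)-1] != len(data) {
            cuts = append(cuts, len(data))
        }
        return cuts
    }

    func main() {
        data := []byte("the quick brown fox jumps over the lazy dog, over and over")
        fmt.Println(chunks(data, 8, 16, 4))
    }

Because the decision at each position only looks at nearby bytes, an edit more than a window away can't move a boundary, which is what gives the deduplication behavior.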

Would it be possible to model trees around this approach as well? If so, would it lead to better deduplication rates than Prolly/geometric search trees?


Yes, this exactly. Another really simple way to do this is to use alternating leading and trailing zero counts of the hash in your nested G-trees. Simple, and pretty effective.
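Something like the following hedged sketch (names and the 64-bit hash are illustrative): even nesting depths consume leading zero bits, odd depths consume trailing zero bits, so the nested trees draw ranks from different ends of the same hash.

    package main

    import (
        "fmt"
        "math/bits"
    )

    // rank returns a geometrically distributed rank for an item's hash:
    // leading zero bits at even nesting depths, trailing zero bits at
    // odd ones, so successive nested trees use different ends of it.
    func rank(hash uint64, depth int) int {
        if depth%2 == 0 {
            return bits.LeadingZeros64(hash)
        }
        return bits.TrailingZeros64(hash)
    }

    func main() {
        h := uint64(0x00f0_0000_0000_9a00)
        fmt.Println(rank(h, 0), rank(h, 1)) // 8 9
    }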


Hmmm... if you need to go deeper (because 1/4 of all hashes have zero leading zeros and zero trailing zeros), you can generalize this by converting the hash into its run-length encoding. That yields a sequence of rank functions, where finding values with the same rank under every rank function is equivalent to finding hash collisions. Very nice.
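A little sketch of that generalization (illustrative only, not from any codebase mentioned here): the hash's k-th bit-run length plays the role of the rank at nesting depth k, and identical run-length sequences pin the hash down up to bitwise complement.

    package main

    import "fmt"

    // runLengths returns the lengths of consecutive runs of equal bits
    // in a 64-bit hash, from the most significant bit down.
    func runLengths(hash uint64) []int {
        var runs []int
        prev := hash >> 63
        n := 1
        for i := 62; i >= 0; i-- {
            bit := (hash >> uint(i)) & 1
            if bit == prev {
                n++
            } else {
                runs = append(runs, n)
                prev, n = bit, 1
            }
        }
        return append(runs, n)
    }

    func main() {
        fmt.Println(runLengths(0xF0F0_0000_0000_000F)) // [4 4 4 48 4]
    }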


Whoa, I totally hadn't taken the thought experiment this far. This is fantastic! I'd like to explore this further; interested in a quick research chat/sync some time? My email is linked from the paper.


Whoa, cool. I'm one of the authors of the geometric search tree paper, and we totally hadn't seen that paper, but we'll for sure dig in! Thanks for mentioning it.


Ok, this is quite cool. Also love the name.


Super fun example of combining AI and a DB. If you want to get back exactly what you put in, use a DB! If you want something extrapolated from what you put in, AI is pretty great! Also, Fireproof is :fire:


That's awesome, dig right in! And don't hesitate to reach out on our Slack: slack.textile.io. We're under active development, so if there's something missing, or a feature that's lacking documentation (there are a few, but we're working on it!), let us know.


Just a bunch of useful tools and ideas our team has picked up along the way while being remote-only for the past few years.


Super excited to see this out in the wild! I've been following this project for a while now; huge fan of the vision, huge fan of the UX. Congrats!


Thanks, Carson!

