
Could you elaborate on what you mean by the last sentence?


Originally, Ceph divided big objects into 4 MB chunks, sending each chunk to an OSD server, which replicated it to 2 more servers. 4 MB was chosen because it takes several drive rotations to transfer, so the seek + rotational delay didn't affect throughput very much.
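
To put rough numbers on that trade-off, a quick back-of-the-envelope sketch (the ~12 ms seek + rotation figure and ~100 MB/s transfer rate are assumed ballpark values, not measured ones):

    # Rough model of HDD write efficiency: fraction of time spent
    # transferring data vs. waiting on seek + rotation. All numbers
    # are illustrative assumptions for a spinning drive, not Ceph values.
    SEEK_PLUS_ROTATION_S = 0.012  # ~8 ms seek + ~4 ms rotational delay (assumed)
    TRANSFER_MB_PER_S = 100.0     # sequential transfer rate (assumed)

    def efficiency(write_size_mb):
        transfer_s = write_size_mb / TRANSFER_MB_PER_S
        return transfer_s / (SEEK_PLUS_ROTATION_S + transfer_s)

    for size_mb in (0.5, 1.0, 4.0, 16.0):
        print("%5.1f MB write -> %3.0f%% of peak throughput"
              % (size_mb, 100 * efficiency(size_mb)))

With those assumptions a 4 MB write keeps the drive around three-quarters busy with actual data, while sub-1 MB writes spend most of their time seeking.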

Now the first OSD splits it into k data chunks plus m parity chunks, so the disk write size isn't 4 MB, it's 4 MB / k, while the efficient write size has gone up 2x? 4x? since the original 4 MB decision, as drive transfer rates have increased.

You can change this, but the tuning is still based on the size of the block to be coded, not on the size of the chunks written to disk (and you might have multiple pools with very different values of k).
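
For illustration, the per-OSD write sizes you end up with for a few example k+m profiles (just examples, not any particular pool):

    # Per-OSD write size when a 4 MB object is erasure coded with
    # k data chunks + m parity chunks: each chunk is 4 MB / k.
    OBJECT_MB = 4.0  # original 4 MB chunk size

    for k, m in ((2, 1), (4, 2), (8, 3), (10, 4)):  # example EC profiles
        chunk_mb = OBJECT_MB / k
        print("k=%2d m=%d: each of the %2d OSDs writes %.2f MB"
              % (k, m, k + m, chunk_mb))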


I'm still not sure which exact Ceph concept you are referring to. There is the "minimum allocation size" [1], but that is currently 4 KB (not MB).

There is also striping [2], RAID-10-like functionality that splits a large file into independent segments that can be written in parallel. Perhaps you are referring to RGW's default stripe size of 4 MB [3]?

If yes, I can understand your point about one 4 MB RADOS object being erasure-coded into e.g. 6 = 4+2 chunks (4 data + 2 parity), resulting in < 1 MB writes that are not efficient on HDDs.

But would you not simply raise `rgw_obj_stripe_size` to address that, according to the k you choose? E.g. 24 MB? You mention it can be changed, but I don't understand the "the tuning is still based on the size of the block to be coded" part; (why) is that a problem?
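
Here is the arithmetic I have in mind (just a sketch; the assumption that each per-OSD data chunk should land back at roughly 4 MB is mine):

    # Back out a stripe size that keeps each erasure-coded data chunk
    # at roughly the 4 MB the HDDs were originally tuned for.
    # Assumptions: 4 MB target per-chunk write; the k values are examples.
    TARGET_CHUNK_MB = 4.0

    for k in (2, 4, 6, 8):
        stripe_mb = TARGET_CHUNK_MB * k
        print("k=%d: rgw_obj_stripe_size ~ %d MB (%d bytes)"
              % (k, stripe_mb, stripe_mb * 1024 * 1024))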

Also, how else would you do it when designing EC writes?

Thanks!

[1]: https://docs.ceph.com/en/squid/rados/configuration/bluestore...

[2]: https://docs.ceph.com/en/squid/architecture/#data-striping

[3]: https://docs.ceph.com/en/squid/radosgw/config-ref/#confval-r...



