I see. From what I know about Cassandra, this is a much more expensive write than doing it as a new row.
To do this he has to be using dynamic columns, and those are stored as one serialized blob per row. So the more data you have in the row, the more expensive the deserialization/reserialization is with each column you add. For very large series this could be an issue.
But it sounds like this is tolerable for his app because the writes are distributed over time in a predictable fashion.
I am a little surprised though at the author's claim that fetching a single big row results in "huge IO efficiency" over a range of small rows. I'd expect a small amount of overhead, but isn't it more or less the same amount of data being retrieved? What am I missing?
EDIT: I see the author mentioned that it reduces disk seeks because it's all serialized together already. Sort of like you're defragging the series data on every write. I guess that makes sense.
Personally I would probably look at using SSDs and keep the schema more "sane" and have more scalable writes, but that's just me.
1.) You do not have to use dynamic columns for this. Unfortunately, I've found in my own experience that as Cassandra has matured over the last year, a lot of terminology has fallen in and out of fashion, and it's hard to recognize what is actually current. Dynamic columns in CQL3 have nothing to do with the behavior OP is talking about, and dynamic columns are a sort-of deprecated feature in Cassandra 1.2. In CQL3, OP's use pattern is actually hidden if you didn't know any better.
In short, there is no deserialization/reserialization. OP's writes are append-only. I have a similar use pattern to OP, and I haven't seen any performance issues with 100,000s of columns (on SSDs).
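For reference, here is a minimal sketch of what this pattern looks like in CQL3 (the table and column names are made up for illustration, not OP's actual schema). The "wide row" is just a compound primary key, and each write appends a new cell to the partition rather than rewriting it:

    -- One partition per series, one cell per data point.
    CREATE TABLE metrics (
        series_id  text,
        event_time timestamp,
        value      double,
        PRIMARY KEY (series_id, event_time)
    );

    -- Appends a new column to the 'sensor-42' partition; nothing
    -- existing is deserialized or rewritten on the write path.
    INSERT INTO metrics (series_id, event_time, value)
    VALUES ('sensor-42', '2013-04-01 12:00:00', 3.14);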
2.) The "huge IO efficiency" is similar to what you would see in any columnar data store. Wikipedia has a good walkthrough of it (http://en.wikipedia.org/wiki/Column-oriented_DBMS). The short story is now there is fewer meta data between his values.
--
In any case, it works out because Cassandra is far better suited to this type of use pattern than Mongo is. We migrated from MongoDB (on SSDs) to Cassandra for similar reasons. The perf killer on Mongo in this scenario is the write lock.