This looks like a very nice implementation of a graph database. However, a 6-machine cluster barely qualifies as "distributed" for the purposes of a graph database. You will experience almost no sub-linearity at this scale no matter how poorly the algorithms distribute. I am not intimately familiar with the transactional implementation, but the graph processing algorithms described in Titan are not significantly distributable.
For graph queries, linearity can definitely be achieved on parallel systems with tens of thousands of compute nodes using state-of-the-art algorithms. However, Titan does not use any of those algorithms and will be very lucky to exhibit linearity at a couple dozen nodes. Not to take away from the implementation, but people looking for a scalable graph database will not be satisfied with Titan.
As an aside, the best algorithms and data structures for doing massively parallel graph operations are not published but obviously exist. The fastest system in the Graph500 benchmark uses a half million cores on a trillion-edge graph. That is a gap of several orders of magnitude between what the best open-source systems can do and what systems developed by closed research organizations can do, as published in public benchmarks.
(Disclosure: I invented one of the aforementioned families of massively parallel graph processing algorithms in 2009, and it was not the first. The published literature has not even caught up with the first generation of such algorithms. A significant portion of advanced computer science research is no longer published.)
I think you are looking at a very different use case here. The systems that I think you are referring to analyze a static graph representation. The Graph500 benchmark in particular loads one big static, unlabeled, undirected, property-free graph and then runs extensive (BFS) analysis algorithms on it. The fact that the graph is not changing allows significant investment into building locality-optimizing data structures (which is essentially what space decomposition is all about).
Titan, on the other hand, is a transactional database system designed to handle large, multi-relational (labeled) graphs with heterogeneous properties. A Titan graph is constantly evolving (as in the posted benchmark).
For graphs (unlike geo-spatial domains), applying space decomposition techniques first requires a metric space embedding, which is a non-trivial and computationally expensive process. For changing graphs, this embedding will change as well, making it very difficult to use in practice. The best approaches I know of for achieving locality therefore use adaptive graph partitioning techniques instead. However, for the types of OLTP workloads that Titan is optimized for, this would be overkill in the sense that the time spent on partitioning will likely exceed the time saved at runtime. At very large scale, it is most important for OLTP systems to focus on access path optimization based on the ACTUAL query load experienced by the system and not some perceived sense of locality based on connectedness. I published a paper a while ago suggesting one approach to do so:
http://www.knowledgefrominformation.com/2010/08/01/cosi-clou...
The Graph500 benchmark explicitly prohibits this optimization ("The first kernel constructs an undirected graph in a format usable by all subsequent kernels. No subsequent modifications are permitted to benefit specific kernels").
I was using Graph500 as a decently documented public example, not the only example. There are other problems based on real-world data in the trillion-edge range that serve as "hello world" models for testing massively parallel graph algorithms. Directed and undirected, cyclic and acyclic, with properties and property-free. Semantic databases and entity analytics are popular test cases.
In the specific case of Graph500, the graph is significantly cyclic, which creates coordination issues if you simply denormalize the data (e.g. replicating edges around a graph cut). Being able to do a massively parallel BFS from any vertex in the graph and produce the globally correct result without replicating edges means that you cannot know how to optimize the organization ahead of time. This was an intentional part of the benchmark design. Graph500 does not lend itself to optimizing for a particular set of adaptive graph cuts in any conventional sense; the algorithms used need to be general over the 64 randomized runs, and the benchmark was designed to favor non-replicated edges when using massively parallel systems (the coordination costs of edge replicas will kill your performance). However, the massively parallel systems are obviously partitioning the graph in some sense.
In the specific case of the work I did a couple years ago, the systems can ingest tens of millions of new edges per second concurrent with parallel queries (not serializable multi-statement transactions, obviously). The ingest structure can be identical to the structure against which ad hoc queries are run without any kind of query optimization. The fact that ingest rates that high are sustained effectively precludes dynamically reorganizing the data to satisfy particular queries more optimally. In truth, it could be made more optimal for batch-y type workloads (maybe 2-3x faster versus the dynamic version?) but the point was to be able to throw massive amounts of hardware at arbitrary graph data models rather than optimizing it for a specific query.
BTW, metric space embedding is non-trivial algorithmically but can also be computationally inexpensive. The MacBook Air I am using now can do tens of millions of those embedding operations per second on a single core for moderately complex spaces and data models. Maybe an order of magnitude or two slower if dealing with complex, non-Euclidean geometries. However, I also spent a couple years of computer science R&D developing the algorithms to make that fast. :-) I have been working in this particular area for a bit over half a decade now, so my perspective takes some things for granted, I think. There isn't just one problem you have to solve; there are actually several if you are starting from scratch.
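To make the "cheap per operation" point concrete, here is a minimal sketch of one classic, publicly known embedding: a Z-order (Morton) key that interleaves the bits of two coordinates into a single locality-preserving value. This is only a textbook stand-in for illustration (nothing like the unpublished algorithms being discussed), but it shows why a single embedding operation can be a handful of bit instructions and therefore run tens of millions of times per second on one core.

    // Minimal illustration: Z-order (Morton) embedding of 2D integer
    // coordinates into one 64-bit key. Each call is a few bit operations.
    // This is a standard textbook technique, not the proprietary
    // algorithms discussed above.
    public final class MortonEmbedding {

        // Spread the low 32 bits of x so there is a zero bit between
        // each original bit (standard bit-interleaving trick).
        private static long spreadBits(long x) {
            x &= 0xFFFFFFFFL;
            x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
            x = (x | (x << 8))  & 0x00FF00FF00FF00FFL;
            x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FL;
            x = (x | (x << 2))  & 0x3333333333333333L;
            x = (x | (x << 1))  & 0x5555555555555555L;
            return x;
        }

        // Interleave the bits of x and y: nearby (x, y) points tend to get
        // nearby keys, giving a crude locality-preserving 1D embedding.
        public static long encode(int x, int y) {
            return spreadBits(x) | (spreadBits(y) << 1);
        }

        public static void main(String[] args) {
            System.out.println(Long.toBinaryString(encode(3, 5)));
        }
    }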
Like I said, I didn't want to take anything away from Titan and true OLTP-oriented systems have their own complex problems, not the least of which is that they don't scale too far beyond a couple hundred compute nodes for the current state-of-the-art. Not my specialty. I work in a world of more basic consistency guarantees.
Yes. The R&D was almost entirely funded by multiple private research organizations over several years at relatively high cost (and not always the ones you would think). As a practical matter, many algorithm patents have not been enforceable for years even though they are well-supported in international law. Consequently, almost everything related to massively parallel graph technologies is closely held as a trade secret. Academia is pretty far behind the state-of-the-art in this area.
I've done work in other algorithm areas where the situation is similar. It often costs millions of dollars to develop major computer science advances but the best way to recover that investment is to leverage it without publishing it. Reverse engineering a thoroughly obfuscated technology takes a lot longer than when someone publishes the blueprints.
I will add that a lot of these algorithms are not at all obvious until you understand how they work. While knowing something is possible helps, it still requires a fair amount of theoretical computer science cleverness.
I understand you may be under NDA, as well as the general pressure to not reveal knowledge that is the basis of your career, but would you care to point out any open literature you think is worthwhile to study?
For massively distributed and parallel algorithms, a core part of the computer science is based around space decomposition data structures. If your algorithm is not based on a space decomposition structure, it won't linearly parallelize across a vast number of cores. When I say some algorithm won't distribute or parallelize, this is what I am usually (but not always) looking at.
Space decomposition structures take a metric space and decompose it using an invariant space partitioning function. The partition points are fixed regardless of data distribution; the only decision process is when to apply the partitioning function.
The two well-known examples in the literature are quad-trees and distributed hash tables. These are both very simple, narrow examples of space decomposition structures in a universe of possible metric spaces and possible partitioning functions. Many parallel data problems can't be made to fit into the simple space decomposition structures computer scientists are familiar with.
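As a concrete (if toy) illustration of the "fixed partition points, data only decides when to split" property, here is a minimal quad-tree sketch. The class names and the capacity threshold are made up for illustration; the point is that the split location is always the geometric midpoint of the cell, an invariant of the space rather than of the data.

    import java.util.ArrayList;
    import java.util.List;

    // Toy sketch of an invariant space partitioning function: a quad-tree whose
    // split point is always the geometric midpoint of the cell (fixed by the
    // space, not the data). The only data-dependent decision is WHEN to split,
    // here driven by a simple capacity threshold.
    public final class QuadTree {
        static final int CAPACITY = 4;

        final double minX, minY, maxX, maxY;        // bounds of this cell
        final List<double[]> points = new ArrayList<>();
        QuadTree[] children;                        // null until the cell splits

        QuadTree(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }

        void insert(double x, double y) {
            if (children != null) {                 // already split: route downward
                child(x, y).insert(x, y);
                return;
            }
            points.add(new double[]{x, y});
            if (points.size() > CAPACITY) split();  // the data only decides *when*
        }

        private void split() {
            double midX = (minX + maxX) / 2;        // invariant partition point:
            double midY = (minY + maxY) / 2;        // always the cell midpoint
            children = new QuadTree[]{
                new QuadTree(minX, minY, midX, midY), new QuadTree(midX, minY, maxX, midY),
                new QuadTree(minX, midY, midX, maxY), new QuadTree(midX, midY, maxX, maxY)
            };
            for (double[] p : points) child(p[0], p[1]).insert(p[0], p[1]);
            points.clear();
        }

        private QuadTree child(double x, double y) {
            double midX = (minX + maxX) / 2, midY = (minY + maxY) / 2;
            return children[(x < midX ? 0 : 1) + (y < midY ? 0 : 2)];
        }

        public static void main(String[] args) {
            QuadTree root = new QuadTree(0, 0, 1, 1);
            for (int i = 0; i < 100; i++) root.insert(Math.random(), Math.random());
            System.out.println("root has split: " + (root.children != null));
        }
    }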
When massively parallel machinery is the foundational assumption, every computational structure must be represented in terms of space decomposition primitives. Right now, most computer scientists treat space decomposition structures as dumb data buckets and are largely oblivious to their expressiveness. Most published parallel algorithm computer science is based on computational structures that will only sort of parallelize on sufficiently small systems.
I'd like to see a few other details that aren't mentioned:
- What does the following distribution end up looking like? Does it have a similar fraction of 'celebrity' users with huge follower counts? Or more technically, does the Russian roulette against the recommendation sampling end up producing a network similar to a scale-free graph grown via preferential attachment (Barabási–Albert model)? It looks like your mean fanout is about 20, which is smaller than what Twitter has published, but I'd be more interested in knowing how many 10k+ follower users are in the graph (a rough sampling sketch follows this list).
- What's the write amplification like? ~1.6 billion edges stored daily (one per tweet per follower) seems like it could burn a lot of capacity quickly, though most of it will grow cold quickly and could be pushed to archive. Making a rough guess from your disk write monitoring line graph, it looks like you'd be putting down about 16GB a day? It'd be interesting to see a comparison between this run and one done where streams are built indirectly via follower links alone.
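On the first question, one quick way to eyeball the "celebrity" tail is to simulate a generic Barabási–Albert style preferential-attachment graph at the same mean fanout and count the high-degree users. The sketch below is a plain textbook generator, not the benchmark's actual recommendation/roulette sampling; the fanout of 20 and the 10k threshold are just taken from the questions above.

    import java.util.Random;

    // Rough sketch of preferential attachment (Barabási–Albert style): each new
    // user follows m existing users chosen proportionally to their current
    // follower counts, then we count users above a follower threshold.
    public final class FollowGraphSketch {
        public static void main(String[] args) {
            int n = 1_000_000;   // users to simulate
            int m = 20;          // follows per new user (roughly the mean fanout in the post)
            Random rnd = new Random(42);

            int[] followers = new int[n];
            // Every follow edge appends the id of the followed user here; sampling a
            // uniform index from this array IS preferential attachment, because a
            // user appears in proportion to its current follower count.
            int[] targets = new int[n * m + m * m];
            int edges = 0;

            // Seed: a small clique among the first m users so sampling has something to hit.
            for (int i = 0; i < m; i++)
                for (int j = 0; j < m; j++)
                    if (i != j) { followers[j]++; targets[edges++] = j; }

            // Each subsequent user follows m users picked preferentially
            // (duplicate picks are simply allowed in this rough sketch).
            for (int u = m; u < n; u++)
                for (int k = 0; k < m; k++) {
                    int v = targets[rnd.nextInt(edges)];
                    followers[v]++;
                    targets[edges++] = v;
                }

            long celebs = 0;
            for (int f : followers) if (f >= 10_000) celebs++;
            System.out.println("users with >= 10k followers: " + celebs);
        }
    }

(Aside, on the write-amplification guess: ~16GB/day at ~1.6 billion edges/day works out to roughly 10 bytes per stored edge, which seems plausible for a compact layout, though the real per-edge overhead depends on the storage format.)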
- the data we used was crawled by Kwak et al. in 2009. We wanted to use a real social network dataset for the experiment and that was the largest/most useful one we could find. Other than de-duplication we did not make any modifications to the dataset, so the statistics reported in their paper still hold:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.153....
- You mean what is the overhead induced by pre-computing the stream edges rather than collecting the relevant streams at query time? You are right that this requires a significant amount of storage space; however, as you also pointed out, it will get cold quickly and be sitting on disk only (i.e. not taking up space in the valuable cache). The reason this is very efficient is the time-based vertex-centric index we build for the stream edges. This allows us to quickly pull out the most recent tweets for any user. If we had to compute those at query time, we would have to traverse to each person followed, get their 10 most recent tweets, and then merge those in memory. That would be significantly more expensive, and since stream reading is probably the most frequent activity on Twitter, pre-computing it saves a lot of time at the expense of inexpensive disk storage.
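For readers unfamiliar with the tradeoff, here is a rough plain-Java sketch of the two read paths being compared. It deliberately uses in-memory maps rather than Titan's actual API, and all class and method names are made up for illustration: fan-out on write into a per-user, time-sorted stream (roughly what the time-based vertex-centric index buys you) versus fan-out on read by traversing to every followed user and merging their recent tweets.

    import java.util.*;

    // Sketch of the tradeoff described above, in plain Java rather than Titan's API.
    public final class StreamSketch {
        record Tweet(long userId, long timeMillis, String text) {}

        final Map<Long, List<Long>> followers = new HashMap<>();          // followee -> followers
        final Map<Long, List<Long>> follows = new HashMap<>();            // follower -> followees
        final Map<Long, TreeMap<Long, Tweet>> precomputedStream = new HashMap<>(); // time -> tweet
        final Map<Long, List<Tweet>> tweetsByAuthor = new HashMap<>();

        void follow(long follower, long followee) {
            followers.computeIfAbsent(followee, k -> new ArrayList<>()).add(follower);
            follows.computeIfAbsent(follower, k -> new ArrayList<>()).add(followee);
        }

        // Write path: O(#followers) extra writes per tweet (the storage cost the
        // question asks about), but each follower's stream stays sorted by time.
        void tweet(Tweet t) {
            tweetsByAuthor.computeIfAbsent(t.userId(), k -> new ArrayList<>()).add(t);
            for (long f : followers.getOrDefault(t.userId(), List.of()))
                precomputedStream.computeIfAbsent(f, k -> new TreeMap<>()).put(t.timeMillis(), t);
        }

        // Cheap read: take the most recent k entries of the precomputed stream.
        List<Tweet> readPrecomputed(long user, int k) {
            List<Tweet> out = new ArrayList<>();
            var stream = precomputedStream.getOrDefault(user, new TreeMap<>());
            for (Tweet t : stream.descendingMap().values()) {
                out.add(t);
                if (out.size() == k) break;
            }
            return out;
        }

        // Expensive read: traverse to each followed user, gather, sort, merge.
        List<Tweet> readByTraversal(long user, int k) {
            List<Tweet> all = new ArrayList<>();
            for (long followee : follows.getOrDefault(user, List.of()))
                all.addAll(tweetsByAuthor.getOrDefault(followee, List.of()));
            all.sort((a, b) -> Long.compare(b.timeMillis(), a.timeMillis()));
            return all.subList(0, Math.min(k, all.size()));
        }
    }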
Right now Titan traversals run locally and communicate with the back-end Cassandra or HBase datastore via Thrift, but they are working on moving the traversal code inside the Cassandra/HBase JVM so you can run Gremlin traversals inside the distributed cluster rather than pulling the elements over the wire.
Cassandra is turning out to be one killer data backend. It's really exciting to see what's being built on top of it.
Hadoop, Solr/Lucene, and now Blueprints/Graph DB operations are all available on the same Cassandra cluster, in addition to the stuff Cassandra does quote-unquote natively. Add Zookeeper for the few times you need an honest-to-goodness transaction and it's just crazy how good the tech has gotten on the backend in the last 10 years. :-)
Absolutely, without NoSQL solutions like Cassandra, Titan would not be possible.
Regarding Zookeeper: We actually built a locking system into Titan that uses quorum reads/writes with time-outs and cleanup to ensure consistency for certain edge/property types as defined inside Titan. This gives you consistency guarantees out of the box without having to introduce another component (like Zookeeper) into your deployment. For infrequent lock usage (which I strongly encourage ;-), this should be sufficient. For frequent locking, something like Zookeeper is far superior.
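For readers curious what "quorum reads/writes with time-outs and cleanup" roughly means in practice, here is a heavily simplified sketch of the general timestamped-claim pattern. This is NOT Titan's actual implementation: the in-memory map below stands in for quorum reads/writes against replicas, and the names and the 5-second expiry are assumptions made up for illustration.

    import java.util.*;
    import java.util.concurrent.*;

    // Very rough sketch of locking via timestamped claims with expiry.
    // A single in-memory map stands in for quorum reads/writes to replicas.
    public final class ClaimLockSketch {
        record Claim(String owner, long writtenAtMillis) {}

        static final long EXPIRE_MILLIS = 5_000;                 // assumed timeout
        final ConcurrentMap<String, List<Claim>> claims = new ConcurrentHashMap<>();

        // 1. Write a timestamped claim for the key (a quorum write in a real store).
        // 2. Read all claims back (a quorum read) and drop expired ones (cleanup).
        // 3. The lock is held only if our claim is the oldest surviving one.
        boolean tryLock(String key, String owner) {
            long now = System.currentTimeMillis();
            List<Claim> list = claims.computeIfAbsent(key, k -> new CopyOnWriteArrayList<>());
            list.add(new Claim(owner, now));

            list.removeIf(c -> now - c.writtenAtMillis() > EXPIRE_MILLIS);   // cleanup
            Claim oldest = list.stream()
                    .min((a, b) -> Long.compare(a.writtenAtMillis(), b.writtenAtMillis()))
                    .orElseThrow();
            boolean won = oldest.owner().equals(owner);
            if (!won) list.removeIf(c -> c.owner().equals(owner));           // back off
            return won;
        }

        void unlock(String key, String owner) {
            List<Claim> list = claims.get(key);
            if (list != null) list.removeIf(c -> c.owner().equals(owner));
        }
    }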
Titan is distributed and has much higher write performance (5000 transactions/sec). In the words of Matthias Broecheler, the creator of Titan, you would use Titan to implement a social network (https://twitter.com/MBroecheler/statuses/213350753031569409).
Titan has pluggable storage options: Cassandra, HBase, Berkeley DB, and some have been working on adapters for DynamoDB and App Engine Datastore. However, it's not clear how well App Engine and DynamoDB will work with the upcoming Titan changes which will move the traversal engine into the cluster's JVM.