
I have never used Samza but have built similar pipelines using Kafka, Storm, Hadoop, etc. In my experience you almost always have to write your transformation logic twice, once for batch and once for real time, and with that setup Jay's approach looks exactly like the Lambda Architecture, with your stream processing framework doing both the real-time and the batch computation.

Using a stream processing framework like Storm may be fine when you are running exactly the same code for both real time and batch, but it breaks down in more complex cases where the code is not the same. Say we need to calculate the top K trending items over the last 30 minutes, one day, and one week. We also know that a simple count will always make socks and underwear trend for an e-commerce shop, and Justin Bieber and Lady Gaga for Twitter (http://goo.gl/1SColQ). So we use a count-min sketch for real time and a slightly more complex ML algorithm for batch using Hadoop, and merge the results in the end. IMO, training and running complex ML is not currently feasible on the streaming frameworks we have today, so we can't use them for both real time and batch.
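For readers unfamiliar with the real-time half of that setup, here is a minimal count-min sketch in Python. The hashing scheme, table sizes, and top-K helper are illustrative assumptions, not code from any of the pipelines discussed here:

```python
import heapq
import hashlib

class CountMinSketch:
    """Approximate frequency counter: may overestimate, never underestimates."""
    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item):
        # One independent-ish hash per row, derived by salting with the row index.
        for i in range(self.depth):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._hashes(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # The minimum across rows is the least-collided (tightest) estimate.
        return min(self.table[row][col]
                   for row, col in enumerate(self._hashes(item)))

def top_k(sketch, candidates, k):
    """Rank candidate items by their estimated counts."""
    return heapq.nlargest(k, candidates, key=sketch.estimate)
```

The appeal for streaming is that the sketch uses fixed memory regardless of how many distinct items flow through, at the cost of approximate counts; the batch job can then recompute exact (or smarter) rankings later and the results get merged.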

edited for typos.



I am the author of the post.

I think in cases where you are running totally different computations in different systems, the Lambda Architecture may make a lot of sense.

However one assumption you may be making is that the stream processing system must be limited to non-blocking, in-memory computations like sketches. A common pattern for people using Samza is actually to accumulate a large window of data and then rank using a complex brute force algorithm that may take 5 mins or so to produce results.

One of the points I was hoping to make is that many of the limitations people think stream processing systems must have (e.g. can never block, can't process large windows of data, can't manage lots of state) have nothing to do with the stream processing model and are just weaknesses of the frameworks they have used.
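The accumulate-then-rank pattern described above can be sketched in a few lines. This is a toy, self-contained illustration of a stateful, blocking window computation, not Samza's actual API (the window size and ranking logic are assumptions):

```python
from collections import defaultdict

class WindowedRanker:
    """Toy stream task: accumulate counts over a window of messages,
    then emit a full ranking when the window closes. The ranking step
    blocks and touches all accumulated state, illustrating that stream
    processing need not be limited to per-message, in-memory tricks."""
    def __init__(self, window_size):
        self.window_size = window_size
        self.counts = defaultdict(int)
        self.seen = 0

    def process(self, item):
        """Called once per input message; returns a ranking when the
        window closes, None otherwise."""
        self.counts[item] += 1
        self.seen += 1
        if self.seen == self.window_size:
            # Blocking "brute force" step over the whole window's state.
            ranking = sorted(self.counts, key=self.counts.get, reverse=True)
            self.counts.clear()
            self.seen = 0
            return ranking
        return None
```

In a real system the window would typically close on time rather than message count, and the state would live in a fault-tolerant store rather than a Python dict, but the shape of the computation is the same.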


>A common pattern for people using Samza is actually to accumulate a large window of data and then rank using a complex brute force algorithm that may take 5 mins or so to produce results.

I think the argument eventually comes down to what people mean by "batch" and "stream". Some people might describe the aforementioned Samza use case as (micro-)batch as opposed to stream processing.

All in all, I found your post insightful. If a system with fewer moving parts can handle the same data processing requirements, that will be appealing to the majority of users.


Yes, totally. In my definition the difference is that a stream processing system lets you define the frequency with which output is produced, rather than forcing it to be "at the end of the data". This doesn't preclude blocking operations.


> training and running complex ML is not currently feasible on Streaming Frameworks we have today to use them for both realtime and batch.

Have you had a look at Samoa? It is a streaming machine learning library for Storm and S4.

http://yahoo.github.io/samoa/


How different is it from MLlib (https://spark.apache.org/mllib/)?

I understand that MLlib is strictly a batch-processing library.


I don't know much about MLlib specifically, but I was expecting to come here and see more comments about Spark, since it does both batch and stream processing relatively well, which lets you reuse a lot of code between the two pipelines. The primary original motivation for the Lambda Architecture seems to have been to "beat the CAP theorem" by combining distributed systems with different characteristics, so using one system for both would defeat that point; but like the author, I don't think "beating the CAP theorem" this way produces results that warrant the work.



