
I have never used Samza but have built similar pipelines using Kafka, Storm, Hadoop, etc. In my experience you almost always have to write your transformation logic twice, once for batch and once for real time, and with that setup Jay's approach looks exactly like the Lambda Architecture, with your stream processing framework doing both the real-time and the batch computation.

Using a stream processing framework like Storm may be fine when you are running exactly the same code for both real time and batch, but it breaks down in more complex cases where the code is not the same. Say we need to calculate the top K trending items over the last 30 minutes, one day, and one week. We also know that a simple count will always make socks and underwear trend for an e-commerce shop, and Justin Bieber and Lady Gaga for Twitter (http://goo.gl/1SColQ). So we use a count-min sketch for real time and a slightly more complex ML algorithm for batch using Hadoop, and merge the results in the end. IMO, training and running complex ML is not currently feasible on the streaming frameworks we have today, so we can't use them for both real time and batch.
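For readers unfamiliar with the real-time half of that setup, here is a minimal count-min sketch in Python. The hashing scheme, table sizes, and top-K helper are illustrative assumptions, not code from any of the pipelines discussed here:

```python
import heapq
import hashlib

class CountMinSketch:
    """Approximate frequency counter: may overestimate, never underestimates."""
    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item):
        # One independent-ish hash per row, derived by salting with the row index.
        for i in range(self.depth):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._hashes(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # The minimum across rows is the least-collided (tightest) estimate.
        return min(self.table[row][col]
                   for row, col in enumerate(self._hashes(item)))

def top_k(sketch, candidates, k):
    """Rank candidate items by their estimated counts."""
    return heapq.nlargest(k, candidates, key=sketch.estimate)
```

The appeal for streaming is that the sketch uses fixed memory regardless of how many distinct items flow through, at the cost of approximate counts; the batch job can then recompute exact (or smarter) rankings later and the results get merged.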

edited for typos.



I am the author of the post.

I think in cases where you are running totally different computations in different systems, the Lambda Architecture may make a lot of sense.

However one assumption you may be making is that the stream processing system must be limited to non-blocking, in-memory computations like sketches. A common pattern for people using Samza is actually to accumulate a large window of data and then rank using a complex brute force algorithm that may take 5 mins or so to produce results.

One of the points I was hoping to make is that many of the limitations people think stream processing systems must have (e.g. can never block, can't process large windows of data, can't manage lots of state) have nothing to do with the stream processing model and are just weaknesses of the frameworks they have used.
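The accumulate-then-rank pattern described above can be sketched in a few lines. This is a toy, self-contained illustration of a stateful, blocking window computation, not Samza's actual API (the window size and ranking logic are assumptions):

```python
from collections import defaultdict

class WindowedRanker:
    """Toy stream task: accumulate counts over a window of messages,
    then emit a full ranking when the window closes. The ranking step
    blocks and touches all accumulated state, illustrating that stream
    processing need not be limited to per-message, in-memory tricks."""
    def __init__(self, window_size):
        self.window_size = window_size
        self.counts = defaultdict(int)
        self.seen = 0

    def process(self, item):
        """Called once per input message; returns a ranking when the
        window closes, None otherwise."""
        self.counts[item] += 1
        self.seen += 1
        if self.seen == self.window_size:
            # Blocking "brute force" step over the whole window's state.
            ranking = sorted(self.counts, key=self.counts.get, reverse=True)
            self.counts.clear()
            self.seen = 0
            return ranking
        return None
```

In a real system the window would typically close on time rather than message count, and the state would live in a fault-tolerant store rather than a Python dict, but the shape of the computation is the same.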


>A common pattern for people using Samza is actually to accumulate a large window of data and then rank using a complex brute force algorithm that may take 5 mins or so to produce results.

I think the argument eventually comes down to what people mean by "batch" and "stream". Some people might describe the aforementioned Samza use case as (micro-)batch as opposed to stream processing.

All in all, I found your post insightful. If a system with fewer moving parts can handle the same data processing requirements, that will be appealing to the majority of users.


Yes, totally. In my definition the difference is that a stream processing system lets you define the frequency with which output is produced, rather than forcing it to be "at the end of the data". This doesn't preclude blocking operations.


> training and running complex ML is not currently feasible on Streaming Frameworks we have today to use them for both realtime and batch.

Have you had a look at Samoa? It is a streaming machine learning library for Storm and S4.

http://yahoo.github.io/samoa/


How different is it from MLlib (https://spark.apache.org/mllib/)?

I understand that MLlib is strictly a batch-processing library.


I don't know much about MLlib specifically, but I was expecting to come here and see more comments about Spark, since it does both batch and stream processing relatively well, which lets you reuse a lot of code between the two pipelines. The primary original motivation for the Lambda Architecture seems to have been to "beat the CAP theorem" by combining distributed systems with different characteristics, so using one system for both would defeat that point; but like the author, I don't think "beating the CAP theorem" this way produces results that warrant the work.



