
Extremely well-designed framework. It can cover more than 90% of use cases in my opinion. I'm currently working on a project written in Scala that requires a lot of scraping, and I feel really guilty that I'm not using Scrapy :(


There is an idea to make Scrapy support spiders written in other languages[1]. It was featured in Google Summer of Code 2015[2].

[1] https://github.com/scrapy/scrapy/issues/1125

[2] http://gsoc2015.scrapinghub.com/ideas/#other-languages


I think you can still combine the two. For example, Scrapy can sit behind a service/server to which you'd send a request (with the same args as if you were running it as a script, plus a callback URL), and after the items get collected, Scrapy can call your callback URL, sending all items in JSON format to your Scala app. Or, if you want to avoid memory issues for sure, you can send each item to the Scala app as it gets collected. Basically, the idea is to wrap Scrapy spiders with web service features - then you can use them in combination with any other technology. Or you can use Scrapy Cloud to run your spiders at http://scrapinghub.com/.
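A minimal sketch of that wrapper idea: a handler that takes the spider args plus a callback URL, runs the crawl, and POSTs the collected items back as JSON. The crawl itself is stubbed out here (`run_spider`), since wiring in `scrapy.crawler.CrawlerProcess` depends on the project; all names and URLs are illustrative.

```python
import json
import urllib.request


def run_spider(spider_name, start_url):
    """Placeholder for the real crawl (e.g. via scrapy.crawler.CrawlerProcess)."""
    return [{"spider": spider_name, "url": start_url, "title": "example"}]


def handle_request(spider_name, start_url, callback_url):
    """Run a crawl, then deliver all collected items to the callback URL."""
    items = run_spider(spider_name, start_url)
    body = json.dumps(items).encode("utf-8")
    req = urllib.request.Request(
        callback_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)  # POST the JSON payload to the Scala app
    return items
```

The Scala side then only needs to expose one HTTP endpoint that accepts a JSON array of items.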


There is ScrapyRT: http://blog.scrapinghub.com/2015/01/22/introducing-scrapyrt-...

In the project I work on, we have the usual periodic crawls and use ScrapyRT to let the frontend trigger realtime scrapes of specific items - all of this using the same spider code.
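For reference, ScrapyRT exposes an HTTP API for this: you hit its `/crawl.json` endpoint with the spider name and the URL to scrape, and get the items back in the response. A sketch of building such a request (the host, port, and spider name are assumptions; the comment above triggers scrapes via AMQP rather than plain HTTP):

```python
from urllib.parse import urlencode

# Query parameters for a realtime scrape of a single page.
params = urlencode({
    "spider_name": "products",           # a spider registered in the project
    "url": "http://example.com/item/1",  # the page to scrape right now
})
endpoint = "http://localhost:9080/crawl.json?" + params
# urllib.request.urlopen(endpoint) would return JSON containing the items.
```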

Edit: Worth noting that we trigger the realtime scrapes via AMQP.


> Or if you want to avoid memory issues for sure, you can send each item to the Scala app as it gets collected.

I've done something similar: you can just add an Item Pipeline (that's Scrapy's term for a component that processes each scraped item) which posts the data off somewhere else. You can also store the full item logs in S3 so you can occasionally go back and check nothing has been missed. Works an absolute treat.
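A hedged sketch of such a pipeline, forwarding each item to an external service as JSON. The `ForwardingPipeline` name and the endpoint URL are illustrative, not from the thread; stdlib `urllib` is used so the snippet is self-contained.

```python
import json
import urllib.request


class ForwardingPipeline:
    """Scrapy item pipeline that POSTs every collected item to a callback URL."""

    def __init__(self, callback_url="http://localhost:9000/items"):
        self.callback_url = callback_url

    def process_item(self, item, spider):
        body = json.dumps(dict(item)).encode("utf-8")
        req = urllib.request.Request(
            self.callback_url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(req)  # fire-and-forget POST of this item
        return item  # hand the item on to any later pipelines
```

You'd enable it in the project's settings via the standard `ITEM_PIPELINES` dict (e.g. `ITEM_PIPELINES = {"myproject.pipelines.ForwardingPipeline": 300}`, with a hypothetical module path).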


I might have a look into that option actually. I'm planning on building a pseudo-framework on top of Akka, so it might make sense to simply communicate with a Scrapy app and handle the results.


Since you're already on the JVM, you might like my project https://github.com/machinepublishers/jbrowserdriver



