
Extremely well-designed framework. It can cover more than 90% of use cases in my opinion. I'm currently working on a project written in Scala that requires a lot of scraping, and I feel really guilty that I'm not using Scrapy :(


There is an idea to make Scrapy support spiders written in other languages[1]. It was featured in Google Summer of Code 2015[2].

[1] https://github.com/scrapy/scrapy/issues/1125

[2] http://gsoc2015.scrapinghub.com/ideas/#other-languages


I think you can still combine the two. For example, Scrapy can sit behind a service/server to which you'd send a request (with the same args as if you were running it as a script, plus a callback URL), and after the items get collected, Scrapy can call your callback URL, sending all items in JSON format to your Scala app. Or, if you want to avoid memory issues for sure, you can send each item to the Scala app as it gets collected. Basically, the idea is to wrap Scrapy spiders with web service features - then you can use them in combination with any other technology. Or you can use Scrapy Cloud to run your spiders at http://scrapinghub.com/.
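A minimal sketch of that wrapper idea: a handler that takes the spider args plus a callback URL, runs the crawl, and POSTs the collected items back as JSON. The crawl itself is stubbed out here (`run_spider`), since wiring in `scrapy.crawler.CrawlerProcess` depends on the project; all names and URLs are illustrative.

```python
import json
import urllib.request


def run_spider(spider_name, start_url):
    """Placeholder for the real crawl (e.g. via scrapy.crawler.CrawlerProcess)."""
    return [{"spider": spider_name, "url": start_url, "title": "example"}]


def handle_request(spider_name, start_url, callback_url):
    """Run a crawl, then deliver all collected items to the callback URL."""
    items = run_spider(spider_name, start_url)
    body = json.dumps(items).encode("utf-8")
    req = urllib.request.Request(
        callback_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)  # POST the JSON payload to the Scala app
    return items
```

The Scala side then only needs to expose one HTTP endpoint that accepts a JSON array of items.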


There is ScrapyRT: http://blog.scrapinghub.com/2015/01/22/introducing-scrapyrt-...

In the project I work on, we have the usual periodic crawls and use ScrapyRT to let the frontend trigger realtime scrapes of specific items - all of this using the same spider code.
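For reference, ScrapyRT exposes an HTTP API for this: you hit its `/crawl.json` endpoint with the spider name and the URL to scrape, and get the items back in the response. A sketch of building such a request (the host, port, and spider name are assumptions; the comment above triggers scrapes via AMQP rather than plain HTTP):

```python
from urllib.parse import urlencode

# Query parameters for a realtime scrape of a single page.
params = urlencode({
    "spider_name": "products",           # a spider registered in the project
    "url": "http://example.com/item/1",  # the page to scrape right now
})
endpoint = "http://localhost:9080/crawl.json?" + params
# urllib.request.urlopen(endpoint) would return JSON containing the items.
```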

Edit: Worth noting that we trigger the realtime scrapes via AMQP.


> Or if you want to avoid memory issues for sure, you can send each item to the Scala app as it gets collected.

I've done something similar: you can just add an Item Pipeline (that's Scrapy's term for a component that processes each scraped item) which posts the data off somewhere else. You can also store the full item logs in S3 so you can occasionally go back and check nothing has been missed. Works an absolute treat.
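A hedged sketch of such a pipeline, forwarding each item to an external service as JSON. The `ForwardingPipeline` name and the endpoint URL are illustrative, not from the thread; stdlib `urllib` is used so the snippet is self-contained.

```python
import json
import urllib.request


class ForwardingPipeline:
    """Scrapy item pipeline that POSTs every collected item to a callback URL."""

    def __init__(self, callback_url="http://localhost:9000/items"):
        self.callback_url = callback_url

    def process_item(self, item, spider):
        body = json.dumps(dict(item)).encode("utf-8")
        req = urllib.request.Request(
            self.callback_url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(req)  # fire-and-forget POST of this item
        return item  # hand the item on to any later pipelines
```

You'd enable it in the project's settings via the standard `ITEM_PIPELINES` dict (e.g. `ITEM_PIPELINES = {"myproject.pipelines.ForwardingPipeline": 300}`, with a hypothetical module path).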


I might have a look into that option actually. I'm planning on building a pseudo-framework on top of Akka, so it might make sense to simply communicate with a Scrapy app and handle the results.


Since you're already on the JVM, you might like my project https://github.com/machinepublishers/jbrowserdriver



