Introducing the Priceonomics Analysis Engine and API

ecesena · on Dec 18, 2014

Some ideas/feature requests:

1. extract twitter/fb/ecc. contacts

2. (optionally) follow about/contact pages to improve contact information

3. improve Facebook likes if the website declares a FB page/app (as an example, check out theneeds.com, we have 14k likes, but priceonomics says 203)

rohin · on Dec 18, 2014

Thanks for the feedback! Right now the Social analyzer pulls how many people shared that one specific page on FB (or liked it or comment it), rather than how many followers a Page has or Users an FB App has. We could add that per your suggestion though.

We currently use it to see how popular various articles are on the web since FB likes is a good proxy for overall traffic. FB like are public whereas page views typically aren't.

ecesena · on Dec 18, 2014

I ment more about extracting that the page's twitter account is @..., and the Facebook page is fb.com/...

jobposter1234 · on Dec 18, 2014

Love what you guys are doing! In a similar space in terms of structuring the web (although I'm not doing general purpose tools like you are).

Can you comment on why I'd use your API instead of writing my own stuff? I'm very comfortable with crawlers and data extractors, don't mind running my own infrastructure. I've played with Kimono and Import.IO quite a bit, but they get in my way and actually slow down my work.

Also, do you have any advice on gathering info from social media? I've found some places are fairly liberal (facebook), some are locked down tightly. I'd love to know what the professionals do in terms of infrastructure and scraping policy too...

bwood · on Dec 19, 2014

The main reason we think there needs to be a platform for crawling and structuring data is that it's a huge hassle to reinvent the wheel every time. If you have experience writing crawlers and have been able to afford the time to learn to write them well, you've basically already reinvented the wheel. But now you also have to maintain the wheel. We think it would be better if people started working together on the problem of structuring the web. Besides running infrastructure (which has it's own challenges), there is really no reason why everyone needs to build their own extractor for every website. People generally want to extract the same things, and just a handful of quality implementations should be enough to satisfy almost everyone.

One of the keys to actually having an extractor repository is providing incentive for people to build and maintain the extractors, which is actually where we're headed once we allow developers to start building their own applications on the Analysis Engine.

If you're good at extracting structured data from HTML, you should continue writing your own extractors and even consider selling access to them. On the other hand, if you can't be bothered learning the art of extracting data, why not pay someone to use their commercial grade extractor?

So, to summarize, you should use our API so that you can better leverage your time and contribute to the creation of even better structuring tools.

We really like the stuff that Kimono and Import.IO are working on. They are significantly lowering the barriers to getting started at extracting/structuring data, which is great for everyone. Of course there are limitations to what their tools can do, but that's what you'll see every time someone attempts to simplify a complicated process. We aim to be the glue that connects people with data acquisition, structuring, and analysis tools.

Crawling social media is pretty tough because social platforms tend to be very reserved about access to their data (probably due to the enormous amount of personal data they have). It's a disturbing trend for sites to require an account before they even let you past the landing page. It's basically not publicly available data anymore. That's actually not something we have much experience doing, since we haven't had any customers ask for it yet.

mbesto · on Dec 19, 2014

How do you get around sites blocking the IP of your engine?

zeeshanm · on Dec 18, 2014

One thing you may want to do is parse html to get relevant data into a structured format so end-user doesn't have to write a post-parser.

bwood · on Dec 19, 2014

Thanks zeeshanm! We actually have a lot of parsers that do just that, we just haven't yet made them available on the engine. We're also looking at ways to let people create their own and make it available to other users, since we think that's the best way to get to a fully structured web. Are there any sites in particular that you think we should focus on first?

zeeshanm · on Dec 19, 2014

I was thinking more along in the lines of creating a dynamic parser. It's an interesting and a challenging problem and I have thought about it for some time. There may need to be some human intervention involved but by design it should be dynamic.