Self-host analytics for better privacy and accuracy (filippo.io)
193 points by FiloSottile on May 9, 2016 | 89 comments


Piwik is a great project, but it tends not to work well for sites with millions of events per day. Your MySQL tables start to burst at the seams pretty quickly.

For big sites, you'll want that event data in GiBs of plain raw logs that you can bulk load into tools like BigQuery or Redshift for analysis.

For the past few years, my team has built and delivered a SaaS web content analytics platform called Parse.ly. We instrument pageview events (like Google Analytics) automatically, and we also instrument time-on-page using heartbeat events. We collect 50 billion monthly page events for over 600 top-traffic sites, and display it all in a real-time dashboard.

To emphasize that our customers own the data they send us, we recently launched a Raw Data Pipeline product:

http://parse.ly/data-pipeline

Basically, we host a secure S3 bucket and Kinesis stream for customers, and deliver their raw (enriched) event data there. From there, they typically load it into their own BigQuery or Redshift instance, or they analyze pieces of it directly with Python/R/Excel/etc.
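
For a flavor of the "analyze pieces of it directly with Python" option, here's a minimal sketch that pulls a day of events from the S3 bucket into pandas. The bucket name, key layout, gzipped JSON-lines format, and event_type field are illustrative assumptions, not our exact schema:

    # Sketch: load a day of raw enriched events from S3 into a DataFrame.
    # Bucket/prefix/format are placeholders, not Parse.ly's actual layout.
    import gzip
    import json

    import boto3
    import pandas as pd

    s3 = boto3.client("s3")
    bucket = "parsely-raw-data-example"   # hypothetical bucket name
    prefix = "events/2016/05/09/"         # hypothetical key layout

    rows = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            for line in gzip.decompress(body).splitlines():
                rows.append(json.loads(line.decode("utf-8")))

    df = pd.DataFrame(rows)
    print(df.groupby("event_type").size())  # e.g. pageviews vs. heartbeats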

Our customers tell us this strikes the right balance among data ownership, query flexibility, and hassle-free infrastructure.


I've read the Parse.ly tech blog a few times; it was great, so I immediately recognized your company. As someone who's also working in publishing and gradually moving to a more data-driven approach - thanks! :)


Is there any difference in data ownership between GA and Parse.ly?


Data ownership in GA is a "gray area" that becomes less gray if you pay $150K/yr for "GA Premium".

Google has mixed incentives in running its free analytics service. It gets web-wide analytics data, it uses that data to help it sell more AdWords to customers, and it integrates GA with other services, like its display advertising products (DFP, etc.).

From a practical standpoint, you don't "own" analytics data when a) you can't easily access it in raw form and b) the SaaS provider "leaks" your data to dilute its value to you. We address (a) and (b) directly through our products and public data privacy stance. See this blog post for our public view on analytics data privacy:

http://blog.parsely.com/post/3394/analytics-privacy-without-...


While I find your stance on privacy very refreshing for an analytics company, hiding your pricing info behind a sales rep is a huge turnoff for me. If you feel your pricing is reasonable for the service that you provide, I really don't see why you can't just display it proudly on your site.

If SpaceX can afford to not hide their pricing behind a sales rep, so can you: http://www.spacex.com/about/capabilities


Whether to display pricing on the website is something we debated in the past, and continue to debate. (Your comment may wake up the debate for me.)

Pricing for analytics services (in the marketplace) is all over the map. Google picked $150K/year as the price for GA Premium because that's the low end of a contract with Adobe Analytics, the market leader. We're typically cheaper than existing Adobe/GA contracts. Non-competitive "event analytics" companies like Mixpanel and Heap have variable per-event pricing that would break the bank for the customers we serve. We have a bit of an aversion to per-event pricing because it feels like "punishing customers for success".

Meanwhile, per-seat pricing, though attractive on the surface and popular in the SaaS segment, raises several concerns in our space. First, we want customers to feel free to hand out access to our platform: part of our value proposition is democratizing access to analytics data, so we don't want the "stingy seat quotas" typical of tools like Salesforce. Second, for an analytics tool, seats are a bit easy to "hack" as a pricing model -- though our dashboards can be customized per user, a single shared account can access all the data. Finally, our costs don't scale with seats, but with site traffic/users instead.

For these reasons, and more, we've settled on "tiered pricing". Roughly speaking, our service is offered in three tiers. Each tier supports a larger class of site (more monthly uniques), which also bestows more features (e.g. more data retention in higher tiers). To work within the budget constraints of some companies, we will discount tiers while removing cost-affecting features, e.g. maybe you are in the highest class of site, but we disable API access and limit data retention. Because this is a tad more complex than a pricing page could express easily, and also because we think the value of the product comes through best in a guided demo, we made the decision to hide pricing and instead responsively provide demos on-demand.

So, the tl;dr is, pricing, and the display of it, is definitely something we think about, and we have (IMO valid) reasons for not displaying pricing right now, but you make a fair point: if Musk can price his rockets publicly, maybe we can figure something out, too :)


I avoid all services with prices hidden behind a sales rep when possible. I always feel that I'm not an astute negotiator but the sales rep will be, so he'll see me as a mug and take me for all he can get. If I do buy, even if I'm happy, I'm always left wondering if I'm paying 2x as much as other customers.


I'm sure Parse.ly's analytics system told them that their potential customers were getting sticker shock; hence the sales guys need to explain the value proposition to them properly before revealing the figure ;)

Sometimes I feel Google should just blanket replace their disclosure statements across the board with this classic video: https://www.youtube.com/watch?v=8fvTxv46ano. Less beating around the bush.


Liked the comparison with SpaceX. Never heard this argument before :-)


Another very simple alternative is goaccess [0], which is purely server-side. It gathers quite a lot of information by parsing the server log. It doesn't do all the JavaScript stuff to track every single click, but it gives you stats on how many users visit your site, which parts, when, which sites or domains they're coming from, and how much bandwidth they use. It also shows which status codes are coming from which paths, and a lot more. It supports various outputs, such as HTML reports or an interactive htop-like terminal application. It's also being actively developed. I use it and find it very useful.

0: https://www.goaccess.io/


GoAccess is really one of the greatest tools of its kind. I use it alongside GA on all my servers.


Piwik has the problem that it writes directly to MySQL as the activity happens. If your database is down, you lose data. If you have a spike of traffic above what your DB can handle in writes, you lose data.

Snowplow doesn't have this problem.


You can query your writes in Redis, so it won't be lost if your database goes down.

https://piwik.org/faq/how-to/faq_19738/
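
Roughly, the queued-tracking pattern looks like this (a Python sketch of the idea only - the actual plugin is PHP, and all names here are made up):

    # Tracking requests go to a Redis list; a separate worker drains the
    # list into MySQL, so a DB outage just lets events pile up in Redis.
    import json

    import redis

    r = redis.StrictRedis(host="localhost", port=6379)

    def track(request_params):
        """Hot path: enqueue only, never touch the database."""
        r.rpush("tracking_queue", json.dumps(request_params))

    def worker(insert_into_mysql):
        """Runs separately; if MySQL is down, events wait in Redis."""
        while True:
            _, raw = r.blpop("tracking_queue")   # blocks until an event arrives
            try:
                insert_into_mysql(json.loads(raw))
            except Exception:
                r.lpush("tracking_queue", raw)   # put it back, retry later
                raise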


Note to anybody reading this but not following the link: parent poster means "queue your writes in Redis" and not "query."


I'm sorry, I am not a native speaker and it was already late.


Cos your Redis will never go down? Or need to be upgraded? Patched?


How does snowplow solve this problem? By writing to disk?


Snowplow is implemented as a unidirectional data pipeline:

    tracking -> collection -> enrichment -> storage
Between each step there is typically some kind of persistent queue (mostly S3/Kinesis), and data won't be lost if a downstream component is not operational. Examples:

* If your event collector is unavailable, raw events will be cached in the tracker in localStorage, SQLite or similar

* If your Redshift database is read-only for maintenance, enriched events will be held back until Redshift is writeable again
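
To make the "held back" behaviour concrete, here's a rough Python sketch of the consumer side - only move past a batch once the warehouse accepts it. (Snowplow's real components are Scala and use the Kinesis Client Library; everything here is illustrative.)

    # Read enriched events from a Kinesis shard; advance the iterator only
    # after Redshift accepts the write, so maintenance windows lose nothing.
    import time

    import boto3

    kinesis = boto3.client("kinesis")

    def consume(stream, shard_id, load_into_redshift):
        it = kinesis.get_shard_iterator(
            StreamName=stream, ShardId=shard_id,
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]
        while True:
            resp = kinesis.get_records(ShardIterator=it, Limit=500)
            if resp["Records"]:
                while True:
                    try:
                        load_into_redshift(resp["Records"])
                        break              # advance only after success
                    except Exception:
                        time.sleep(30)     # warehouse unavailable: hold back
            it = resp["NextShardIterator"]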


There is of course the possibility of a collector being down, but the aim is to keep those components super simple and rely on really durable storage (Kinesis, S3) managed by someone else.


They also have the problem that their password hashing method is still md5 [0].

[0] - https://developer.piwik.org/api-reference/Piwik/Auth


Piwik started in 2007. It was inexcusable to use md5 then. The problem is that it was ever md5, not that it is still md5.

That really doesn't inspire confidence that they know what they're doing. md5 is a canary in the coal mine for me, and I definitely won't use Piwik now.


To be fair, they aren't an encryption service or something. They INSERT INTO visitor_count. So far as "inspiring confidence" needs to go for such a thing, I wouldn't write them off just yet.


It looks like it's scheduled to be fixed in Piwik 3.0.0: https://github.com/piwik/piwik/issues/5728


I love piwik but this is no excuse, sorry :) Privacy also means data protection!


"Self-hosting analytics for better privacy and accuracy."

Then why use Google Fonts, which is listed by Disconnect.me (used by Firefox) as a tracking domain? Isn't that paradoxical?


Oh, hey, good point. Let me fix that.

EDIT: Done. Not exactly straightforward to download Google Fonts but there are great helpers around. Got rid of CDNjs as well, since CloudFlare has HTTP/2 now. No 3rd parties left except GA, which will go in a day or two.


If you just want to have nicer webfonts without the 'phoning home to Google' issue try Brick webfonts at http://brick.im/.

If the goal is to get rid of all third-party dependencies you can still use the Brick repository on GitHub to download the better looking fonts and self-serve them, just as you are now with the Google served ones.

The only 'gotcha' I found with using Brick is that NoScript users won't see the fonts and won't see the Brick URL in the menu for whitelisting. They will need to inspect the page and learn that they must manually add brick.im. Not exactly user-friendly but then again NoScript breaks lots of things and users are used to it.


> Not exactly straightforward to download Google Fonts

Hah, wonder why that is.


It's pretty easy to download them. They're on GitHub:

https://github.com/google/fonts


I don't think* embedding Google Fonts allows Google much (any?) data collection beyond, presumably, that the font was requested, though it's definitely an unnecessary dependency.

Given Google Fonts allows you to download all of the fonts, with license info provided, in a variety of formats, there's almost no good reason not to embed them directly in your own site.

Right now, Sandstorm apps will work with Google Fonts, but the Sandstorm team intends to sandbox the client side better in the near-ish future, so I've already been nudging Sandstorm apps not to use such things when I see them.

Note that while the author's blog uses Google Fonts, the Piwik Sandstorm package does not. (I checked.)

*I don't know.


Google Fonts does see the page you are on, in the `Referer` header. According to mitmproxy:

    host:              fonts.googleapis.com
    Connection:        keep-alive
    Proxy-Connection:  keep-alive
    Accept:            text/css,*/*;q=0.1
    User-Agent:        Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13E238 Safari/601.1
    Accept-Language:   en-us
    Referer:           https://blog.filippo.io/self-host-analytics/
    Accept-Encoding:   gzip, deflate
(I agree with the rest of your comment; the best thing to do is host the fonts yourself.)


And it does appear Google just blankets this into its general API terms of use: they can "use submitted data" in accordance with their general privacy policies. So yeah, I guess they can use it as part of their tracking. :/


I share the concern about privacy, but there is a benefit to using Google Fonts: maybe it increases the chance of the font being already cached on the client?


Small side story: I had a 'fun' night where I 'just' wanted to make a self-hosted WordPress perfect regarding privacy, too. This was not easy! First I had to install a plugin so a certain theme does not use Google Fonts, then I needed to disable the default avatar service using another plugin and customization, next I needed to convert YouTube and Twitter embeds into images with links, and so on. It took me several hours.

At that time I thought I was the only one with these requirements ... it's also a bit strange for an open source project to have such defaults, IMO.


He probably didn't know that, given it's not the most intuitive thing to notice.


Google Analytics has the Measurement Protocol, which allows for posting data server-side.

All you need to do is proxy the data collection and then send it to them, taking advantage of all the scalability and features they have.

https://developers.google.com/analytics/devguides/collection...
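
A minimal sketch of a server-side pageview hit (the property id, client id, and path are placeholders):

    # Post a pageview to GA via the Measurement Protocol (v1 wire format).
    import uuid

    import requests

    requests.post(
        "https://www.google-analytics.com/collect",
        data={
            "v": 1,                    # protocol version
            "tid": "UA-XXXXX-Y",       # your GA property id (placeholder)
            "cid": str(uuid.uuid4()),  # anonymous client id
            "t": "pageview",           # hit type
            "dp": "/some/page",        # document path
        },
        timeout=5,
    )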


The biggest problems with Piwik are that it is not scalable and the cost ($) of the servers to store analytics data. I've seen many cases where Piwik couldn't support a big volume of data.


Piwik is scalable up to at least 1 billion actions per month. It is not cheap, though: you need powerful database servers with a lot of RAM and fast SSD disks. It can be costly, but Piwik scales!


Why isn't it scalable?


MySQL


It can scale pretty far before it becomes an issue. Years ago I ran it on our main DB for sites that got over 1 million visits a month with no noticeable overhead. If I had it on its own server with its own DB it could have handled far more traffic.


I see. What is the upper bound of "big volume of data" possible under MySQL + Piwik?


That's a difficult thing to answer. However, the more important problem is loss of data whenever your DB isn't available due to downtime, upgrades, etc. It depends how important data loss is for your use case. I'm a data completist, but I'm in therapy for it ;)

It's definitely worth playing with, and trivially easy to spin up. Other self-hosted options aren't anything like as simple to get up and running.


Well, but the point was MySQL + Piwik "does not scale" and that it's "expensive" besides, which doesn't comport with my experience and sounds like received wisdom.


I think if your database is down you've typically got bigger problems than your analytics.


I've worked on Piwik servers tracking & processing reports for up to 800 million pageviews per month on hundreds of medium and larger websites.


Put Rabbit or Kafka in front of MySQL then?


They have a Redis-based solution to sit in front of MySQL (seems to be a "Pro" feature though):

https://piwik.org/faq/how-to/faq_19738/


It's not a Pro feature as in you have to pay for it. It's developed by the devs at piwik.pro, but you can download it for free.


There are many, many self-hosted analytics tools, from your own big data pipelines to tools like the aforementioned Piwik, as well as Open Web Analytics. I like Snowplow (http://snowplowanalytics.com/), but it's currently hard-coded to AWS.

Hosting your own analytics data can be great, but there are lots of ways to get better accuracy and control over your data without having to host everything. Still, if you can, it's great.


Also, one thing to add here: the client libraries are a lot of the work, and you can use the Snowplow JS/iOS/Python/etc. trackers no matter what server-side setup you use. I like to think of Snowplow as pushing the open-source analytics standard, with an ecosystem of server-side products hopefully growing around that, led by their own product.

We're doing something as dumb as using logs from cloud storage and parsing those into a DB with a ~100-line Python script. S3/GCS deal with the collector uptime, and as long as you aren't time-sensitive, it's a great solution.
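
The gist of such a script is something like this (an illustrative sketch, not our actual code - the combined log format and table layout are assumptions):

    # Parse combined-format access logs into SQLite for ad-hoc queries.
    import re
    import sqlite3

    LINE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
        r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
    )

    db = sqlite3.connect("analytics.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS hits (ip, ts, method, path, status, referer, ua)"
    )

    with open("access.log") as f:
        for line in f:
            m = LINE.match(line)
            if m:
                db.execute(
                    "INSERT INTO hits VALUES (?, ?, ?, ?, ?, ?, ?)",
                    m.group("ip", "ts", "method", "path", "status", "referer", "ua"),
                )
    db.commit()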

The biggest issue with self-hosted analytics is visualizing/sharing the results with non-tech parts of the team. Piwik has some advantages here because it's closer to Google Analytics or Mixpanel than to a DB of event rows...


Snowplow is only loosely hard-coded to AWS. I'm using it, and breaking it free is only a few hundred lines of code.

For example, rerouting Snowplow's Kinesis collector into Kafka is 114 lines, and that includes logging, metrics, etc. - I basically just had to extend the AbstractSink object in their Scala collector. Reading from Kafka is another couple of hundred lines, similarly writing to files.


Thanks for sharing, yummyfajitas - expect official Kafka support for Snowplow a little later this year [1] [2]; it's been long awaited! (Snowplow co-founder)

[1] https://github.com/snowplow/snowplow/milestones/Kafka%20%231 [2] https://github.com/snowplow/snowplow/milestones/Kafka%20%232


Nice. If you build some sort of native maxmind or other geotargeting into the scala collector, that would also be cool.

(Not that it was difficult to roll my own - so far snowplow is perfect for my needs - but obviously I'd rather use an official one.)


We do all enrichments like MaxMind, weather, arbitrary JavaScript etc downstream of collection, in our enrichment phase - the list of configurable enrichments is here: https://github.com/snowplow/snowplow/wiki/Configurable-enric...


I hope you're going to push your ~200 lines back to the project! Would love to see it free of the AWS dependency.


One potential problem: if you, say, test out self-hosted alongside GA, and get different numbers, people are going to question the value of the self-hosted thing.


They should question both, but maybe GA more. GA is an approximation and absolutely not completely correct.


Try selling that to a business guy or client or someone. Possible, but not easy, especially if there's ever a large discrepancy.


Haha, oh I know. Tell a client, "Yeah, this is the data, it says this, BUT this is also how it's wrong" and they always ignore the last part. It's data, that stuff can't be wrong!


I've done my own analytics since before there were hosted services.

Still use an old-school, proprietary tool called Sawmill[1]. One very nice aspect (out of many) about it is that it handles hundreds of other log formats out of the box, so it can report sensibly on switches, email, firewall logs... just about anything with very little effort.

No connection, other than being a long-time happy customer.

[1] http://sawmill.net/


I haven't used Sawmill in ages, but it wasn't bad, once upon a time. There's also GoAccess, which is a snazzy ncurses logfile analyzer:

https://www.goaccess.io/


Piwik also handles dozens of log formats, check it out: http://piwik.org/log-analytics/

GitHub project: https://github.com/piwik/piwik-log-analytics


Shameless plug: We're also working on an on-premise custom analytics platform that can be deployed to either Heroku or AWS with CloudFormation. https://github.com/rakam-io/rakam


To mitigate security loss (Piwik is complex), run Piwik and serve its gif on another machine.


Can you elaborate a bit more?


I'd assume the prior poster is suggesting that running a third party tool on your primary web server(s) would increase the surface area for attacks by some (possibly not small) amount. e.g. if piwik is compromised, an attacker would have some sort of user access on your web server(s), which is generally a bad thing. I suppose some people also use the same database user for all web applications, which would potentially be disastrous.

There are mitigations one can implement without going to that length (run piwik under a different user than any of your other web applications, using suexec, use a different database and user, etc.). At the least, putting Piwik in a container or VM makes sense, if any data on your web server(s) is critical or sensitive.

I suspect for very large deployments this would go without saying. But, for users with only one web server, it might seem reasonable to drop it into the same virtual host and run it all as the same user (and it's probably safe enough to do so for many users, as long as they stay on top of updates). But, any web application you run adds surface area for attackers. Might as well isolate them as well as your skills and resources allow.


See also some of the official Piwik security tips: https://piwik.org/docs/how-to-secure-piwik/


If you're serving the pages yourself (not, as this site does, through Cloudflare's caches), what does this tell you that isn't in the server logs?


Unless you are tracking events, outlinks, time spent on page, or something similar, nothing as far as data goes. But obviously server logs lack statistics and aggregation. Piwik actually supports using the server logs as its data source instead of the JavaScript tag.


Deduplicated reach (Unique Visitors). IP addresses change and are shared, so they aren't suitable for this.
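
E.g., counting uniques by a first-party visitor-id cookie versus by IP (a toy sketch; both columns in the events table are assumptions):

    # Cookie-based uniques vs. IP-based "uniques" from a hypothetical table.
    import sqlite3

    db = sqlite3.connect("analytics.db")
    by_cookie = db.execute("SELECT COUNT(DISTINCT visitor_id) FROM hits").fetchone()[0]
    by_ip = db.execute("SELECT COUNT(DISTINCT ip) FROM hits").fetchone()[0]
    print(f"by cookie: {by_cookie}; by IP (over/under-counts): {by_ip}")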


For those of us who prefer not to be tracked by self-hosted Sandstorm apps, two new uBlock rules (they're very rough, but I am no uBlock wizard):

||sandcats.io/embed.js$script

view.gif?page=$image

Note that you would have to change the filters if the scripts and tracking pixel are renamed, of course, but this should catch the majority of the push-button installs.


If you visit a web site the operators of that site will know that someone at your IP address visited the site, which pages you viewed, how long you lingered, which buttons you clicked, etc. If you want to prevent that your only option is to browse via things like proxies or Tor.


Why do you object to self-hosted analytics? I understand blocking centralized trackers (I do so myself), but self-hosted doesn't seem problematic in the same way GA being present on half the pages on the Internet is.

It also strikes me as an unwinnable battle for all but the largest sites.


Because OP is against all kinds of tracking? And because he can...


I can't claim to speak for OP, but am also against most tracking. I would also tend to think that being against first party tracking would be an unwinnable battle. It also leaks less data than third-party tracking, since the third party can see your activity across multiple sites whereas first party can only see your activity on that site unless it's aggregated through a backend service (another poster mentioned the ability to upload server logs to GA). No matter what, they can see what you load from their site.

Getting to first-party hosting of more intrusive analytics (scroll location, etc.), I think that rather than disallowing certain scripts/URLs from running, you have to get back to behavior-based blocking. Doing that in an environment where you allow any JS to execute seems tough, since something that can update the page based on scroll location can "talk" to another part that reports back to the server.

If you don't like intrusive first party analytics, just stop all JS.


Off topic a bit, but with GA being blocked: all these blockers work by blocking URLs and domains at the time they are loaded, but not requests made after the requested resource has loaded. Couldn't you just proxy the URL to GA so it's not blocked?
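
Something like this, I mean (a hypothetical Flask relay, untested):

    # First-party endpoint that forwards hits to GA, so the browser never
    # talks to google-analytics.com directly.
    import requests
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/a/collect", methods=["GET", "POST"])
    def relay():
        requests.post(
            "https://www.google-analytics.com/collect",
            data=request.values.to_dict(),
            headers={"User-Agent": request.headers.get("User-Agent", "")},
            timeout=5,
        )
        return ("", 204)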


By "requests made after load of the requested resource", you mean JS XHR requests for example? I'd guess they are also filtered by adblockers, aren't they?


Not quite true though... ad blockers, by default, include EasyList, which has the following block rule:

/piwik.$domain=~piwik.org,script

This will block your self-hosted piwik.js file, unless you perform some redirection trickery.


It's very easy to rename (or symlink) the piwik.js script to something else that is not blocked by any filter list. And unless you're a major site nobody will bother creating a special rule just for your self-hosted Piwik.


The default Piwik Sandstorm install has "piwik" nowhere in the URL.


Yeah, unfortunately it shows up as https://ls4an735rucvfa6ps6bb.filippo.sandcats.io/embed.js - it's getting to the point where I need to start performing actual content inspection.


I run Piwik on my server log files. This is my right as a website owner / system admin. :)


How come?


I would love to have Sandstorm on my RasPi but they don't support ARM.

It sounds like a great home intranet.


Just notes on this for the curious:

1. Sandstorm doesn't support ARM currently because Sandstorm apps run native Linux binaries, and every app would have to be compiled for each architecture.

2. I honestly think you'd be running pretty crippled trying to do Sandstorm on a RasPi. It's a bit smaller scale than Sandstorm seems targeted for. Each open Sandstorm grain commonly uses 100 MB of RAM or more (on top of the RAM used by Linux and the Sandstorm server itself, of course), so with just a couple of Sandstorm grains running simultaneously, you can max out a RasPi pretty quickly.


>Sandstorm doesn't support ARM currently because Sandstorm apps run native Linux binaries, and every app would have to be compiled for each architecture.

That's true of any Linux distro providing binary packages. They all support ARM anyway; it is trivially simple to compile packages. Even small projects like OpenBSD compile tens of thousands of packages for a dozen arches.


Yes but distros accomplish that by being highly opinionated on the build process you use to build packages whereas Sandstorm tries to be unopinionated on this point.

Sandstorm will support ARM someday but it's going to require a large investment in tooling in order to be painless for developers.


>Yes but distros accomplish that by being highly opinionated on the build process you use to build packages whereas Sandstorm tries to be unopinionated on this point.

How's that exactly? I looked it over and can't see any difference at all.



