Self-host analytics for better privacy and accuracy (filippo.io)
193 points by FiloSottile on May 9, 2016 | 89 comments


Piwik is a great project, but it tends not to work well for sites with millions of events per day. Your MySQL tables start to burst at the seams pretty quickly.

For big sites, you'll want that event data in GiBs of plain raw logs that you can bulk load into tools like BigQuery or Redshift for analysis.

For the past few years, my team has built and delivered a SaaS web content analytics platform called Parse.ly. We instrument pageview events (like Google Analytics) automatically, and we also instrument time-on-page using heartbeat events. We collect 50 billion monthly page events for over 600 top-traffic sites, and display it all in a real-time dashboard.

To emphasize that our customers own the data they send us, we recently launched a Raw Data Pipeline product:

http://parse.ly/data-pipeline

Basically, we host a secure S3 bucket and Kinesis stream for customers, and deliver their raw (enriched) event data there. From there, they typically load it into their own BigQuery or Redshift instance, or they analyze pieces of it directly with Python/R/Excel/etc.
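
For a flavor of the "analyze pieces of it directly with Python" option, here's a minimal sketch that pulls a day of events from the S3 bucket into pandas. The bucket name, key layout, gzipped JSON-lines format, and event_type field are illustrative assumptions, not our exact schema:

    # Sketch: load a day of raw enriched events from S3 into a DataFrame.
    # Bucket/prefix/format are placeholders, not Parse.ly's actual layout.
    import gzip
    import json

    import boto3
    import pandas as pd

    s3 = boto3.client("s3")
    bucket = "parsely-raw-data-example"   # hypothetical bucket name
    prefix = "events/2016/05/09/"         # hypothetical key layout

    rows = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            for line in gzip.decompress(body).splitlines():
                rows.append(json.loads(line.decode("utf-8")))

    df = pd.DataFrame(rows)
    print(df.groupby("event_type").size())  # e.g. pageviews vs. heartbeats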

Our customers tell us this strikes the right balance among data ownership, query flexibility, and hassle-free infrastructure.


I've read the Parse.ly tech blog a few times; it was great, so I immediately recognized your company. As someone who's also working in publishing and gradually moving to a more data-driven approach - thanks! :)


Is there any difference in data ownership between GA and Parse.ly?


Data ownership in GA is a "gray area" that becomes less gray if you pay $150K/yr for "GA Premium".

Google has mixed incentives in running its free analytics service. It gets web-wide analytics data, it uses that data to help it sell more AdWords to customers, and it integrates GA with other services, like its display advertising products (DFP, etc.).

From a practical standpoint, you don't "own" analytics data when a) you can't easily access it in raw form and b) the SaaS provider "leaks" your data to dilute its value to you. We address (a) and (b) directly through our products and public data privacy stance. See this blog post for our public view on analytics data privacy:

http://blog.parsely.com/post/3394/analytics-privacy-without-...


While I find your stance on privacy very refreshing for an analytics company, hiding your pricing info behind a sales rep is a huge turnoff for me. If you feel your pricing is reasonable for the service that you provide, I really don't see why you can't just display it proudly on your site.

If SpaceX can afford to not hide their pricing behind a sales rep, so can you: http://www.spacex.com/about/capabilities


Whether to display pricing on the website is something we debated in the past, and continue to debate. (Your comment may wake up the debate for me.)

Pricing for analytics services (in the marketplace) is all over the map. Google picked $150K/year as the price for GA Premium because that's the low end of a contract with Adobe Analytics, the market leader. We're typically cheaper than existing Adobe/GA contracts. Non-competitive "event analytics" companies like Mixpanel and Heap have variable per-event pricing that would break the bank for the customers we serve. We have a bit of an aversion to per-event pricing because it feels like "punishing customers for success".

Meanwhile, per-seat pricing, though attractive on the surface and popular in the SaaS segment, raises several concerns in our space. First, we want customers to feel free to hand out access to our platform: part of our value proposition is democratizing access to analytics data, so we don't want the "stingy seat quotas" typical of tools like Salesforce. Second, for an analytics tool, seats are a bit easy to "hack" as a pricing model -- though our dashboards can be customized per user, a single shared account can access all the data. Finally, our costs don't scale with seats, but with site traffic/users instead.

For these reasons, and more, we've settled on "tiered pricing". Roughly speaking, our service is offered in three tiers. Each tier supports a larger class of site (more monthly uniques), which also bestows more features (e.g. more data retention in higher tiers). To work within the budget constraints of some companies, we will discount tiers while removing cost-affecting features, e.g. maybe you are in the highest class of site, but we disable API access and limit data retention. Because this is a tad more complex than a pricing page could express easily, and also because we think the value of the product comes through best in a guided demo, we made the decision to hide pricing and instead responsively provide demos on-demand.

So, the tl;dr is, pricing, and the display of it, is definitely something we think about, and we have (IMO valid) reasons for not displaying pricing right now, but you make a fair point: if Musk can price his rockets publicly, maybe we can figure something out, too :)


I avoid all services with prices hidden behind a sales rep when possible. I always feel that I'm not an astute negotiator but the sales rep will be, so he'll see me as a mug and take me for all he can get. If I do buy, even if I'm happy, I'm always left wondering if I'm paying 2x as much as other customers.


I'm sure Parse.ly's analytics system told them that their potential customers were getting sticker shock; hence the sales guys need to explain the value proposition to them properly before revealing the figure ;)

Sometimes I feel Google should just blanket replace their disclosure statements across the board with this classic video: https://www.youtube.com/watch?v=8fvTxv46ano. Less beating around the bush.


Liked the comparison with SpaceX. Never heard this argument before :-)


Another very simple alternative is goaccess [0], which is purely server-side. It gathers quite a lot of information by parsing the server log. It doesn't do all the JavaScript stuff to track every single click, but it gives you stats on how many users visit your site, which parts, when, which sites or domains they're coming from, and how much bandwidth they use. It also shows which status codes are coming from which paths, and a lot more. It supports various outputs, such as HTML reports or an interactive htop-like terminal application. It's also being actively developed. I use it and find it very useful.

0: https://www.goaccess.io/


GoAccess is really one of the greatest tools of its kind. I use it alongside GA on all my servers.


Piwik has the problem that it writes directly to MySQL as the activity happens. If your database is down, you lose data. If you have a spike of traffic above what your DB can handle in writes, you lose data.

Snowplow doesn't have this problem.


You can query your writes in Redis, so it won't be lost if your database goes down.

https://piwik.org/faq/how-to/faq_19738/
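
Roughly, the queued-tracking pattern looks like this (a Python sketch of the idea only - the actual plugin is PHP, and all names here are made up):

    # Tracking requests go to a Redis list; a separate worker drains the
    # list into MySQL, so a DB outage just lets events pile up in Redis.
    import json

    import redis

    r = redis.StrictRedis(host="localhost", port=6379)

    def track(request_params):
        """Hot path: enqueue only, never touch the database."""
        r.rpush("tracking_queue", json.dumps(request_params))

    def worker(insert_into_mysql):
        """Runs separately; if MySQL is down, events wait in Redis."""
        while True:
            _, raw = r.blpop("tracking_queue")   # blocks until an event arrives
            try:
                insert_into_mysql(json.loads(raw))
            except Exception:
                r.lpush("tracking_queue", raw)   # put it back, retry later
                raise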


Note to anybody reading this but not following the link: parent poster means "queue your writes in Redis" and not "query."


I'm sorry, I am not a native speaker and it was already late.


Cos your Redis will never go down? Or need to be upgraded? Patched?


How does snowplow solve this problem? By writing to disk?


Snowplow is implemented as a unidirectional data pipeline:

    tracking -> collection -> enrichment -> storage
Between each step there is typically some kind of persistent queue (mostly S3/Kinesis), and data won't be lost if a downstream component is not operational. Examples:

* If your event collector is unavailable, raw events will be cached in the tracker in localStorage, SQLite or similar

* If your Redshift database is read-only for maintenance, enriched events will be held back until Redshift is writeable again
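
To make the "held back" behaviour concrete, here's a rough Python sketch of the consumer side - only move past a batch once the warehouse accepts it. (Snowplow's real components are Scala and use the Kinesis Client Library; everything here is illustrative.)

    # Read enriched events from a Kinesis shard; advance the iterator only
    # after Redshift accepts the write, so maintenance windows lose nothing.
    import time

    import boto3

    kinesis = boto3.client("kinesis")

    def consume(stream, shard_id, load_into_redshift):
        it = kinesis.get_shard_iterator(
            StreamName=stream, ShardId=shard_id,
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]
        while True:
            resp = kinesis.get_records(ShardIterator=it, Limit=500)
            if resp["Records"]:
                while True:
                    try:
                        load_into_redshift(resp["Records"])
                        break              # advance only after success
                    except Exception:
                        time.sleep(30)     # warehouse unavailable: hold back
            it = resp["NextShardIterator"]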


There is of course the possibility of a collector being down, but the aim is to keep those components super simple and rely on really durable storage (Kinesis, S3) managed by someone else.


They also have the problem that their password hashing method is still md5 [0].

[0] - https://developer.piwik.org/api-reference/Piwik/Auth


Piwik started in 2007. It was inexcusable to use md5 then. The problem is that it was ever md5, not that it is still md5.

That really doesn't inspire confidence that they know what they're doing. md5 is a canary in the coal mine for me, and I definitely won't use Piwik now.


To be fair, they aren't an encryption service or something. They INSERT INTO visitor_count. So far as "inspiring confidence" needs to go for such a thing, I wouldn't write them off just yet.


It looks like it's scheduled to be fixed in Piwik 3.0.0: https://github.com/piwik/piwik/issues/5728


I love piwik but this is no excuse, sorry :) Privacy also means data protection!


"Self-hosting analytics for better privacy and accuracy."

Then why use Google Fonts, which is listed by Disconnect.me (used by Firefox) as a tracking domain? Isn't that paradoxical?


Oh, hey, good point. Let me fix that.

EDIT: Done. Not exactly straightforward to download Google Fonts but there are great helpers around. Got rid of CDNjs as well, since CloudFlare has HTTP/2 now. No 3rd parties left except GA, which will go in a day or two.


If you just want to have nicer webfonts without the 'phoning home to Google' issue try Brick webfonts at http://brick.im/.

If the goal is to get rid of all third-party dependencies you can still use the Brick repository on GitHub to download the better looking fonts and self-serve them, just as you are now with the Google served ones.

The only 'gotcha' I found with using Brick is that NoScript users won't see the fonts and won't see the Brick URL in the menu for whitelisting. They will need to inspect the page and learn that they must manually add brick.im. Not exactly user-friendly but then again NoScript breaks lots of things and users are used to it.


> Not exactly straightforward to download Google Fonts

Hah, wonder why that is.


It's pretty easy to download them. They're on GitHub:

https://github.com/google/fonts


I don't think* embedding Google Fonts allows Google much (any?) data collection beyond, presumably, that the font was requested, though it's definitely an unnecessary dependency.

Given Google Fonts allows you to download all of the fonts, with license info provided, in a variety of formats, there's almost no good reason not to embed them directly in your own site.

Right now, Sandstorm apps will work with Google Fonts, but the Sandstorm team intends to sandbox the client side better in the near-ish future, so I've already been nudging Sandstorm apps not to use such things when I see them.

Note that while the author's blog uses Google Fonts, the Piwik Sandstorm package does not. (I checked.)

*I don't know.


Google Fonts does see the page you are on, in the `Referer` header. According to mitmproxy:

    host:              fonts.googleapis.com
    Connection:        keep-alive
    Proxy-Connection:  keep-alive
    Accept:            text/css,*/*;q=0.1
    User-Agent:        Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13E238 Safari/601.1
    Accept-Language:   en-us
    Referer:           https://blog.filippo.io/self-host-analytics/
    Accept-Encoding:   gzip, deflate
(I agree with the rest of your comment; the best thing to do is host the fonts yourself.)


And it does appear Google just blankets this into its general API terms of use: they can "use submitted data" in accordance with their general privacy policies. So yeah, I guess they can use it as part of their tracking. :/


I share the concern about privacy, but there is a benefit to using Google Fonts: maybe it increases the chance of the font being already cached on the client?


Small side story: I had a 'fun' night where I 'just' wanted to make a self-hosted WordPress perfect regarding privacy, too. This was not easy! First I had to install a plugin so a certain theme does not use Google Fonts, then I needed to disable the default avatar service using another plugin and customization, next I needed to convert YouTube and Twitter embeds into images with links, and so on. It took me several hours.

At that time I thought I was the only one with these requirements ... it's also a bit strange for an open source project to have such defaults, IMO.


He probably didn't know that, given it's not the most intuitive thing to notice.


Google Analytics has the Measurement Protocol, which allows for posting data server-side.

All you need to do is proxy the data collection and then send it to them, taking advantage of all the scalability and features they have.

https://developers.google.com/analytics/devguides/collection...
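
A minimal sketch of a server-side pageview hit (the property id, client id, and path are placeholders):

    # Post a pageview to GA via the Measurement Protocol (v1 wire format).
    import uuid

    import requests

    requests.post(
        "https://www.google-analytics.com/collect",
        data={
            "v": 1,                    # protocol version
            "tid": "UA-XXXXX-Y",       # your GA property id (placeholder)
            "cid": str(uuid.uuid4()),  # anonymous client id
            "t": "pageview",           # hit type
            "dp": "/some/page",        # document path
        },
        timeout=5,
    )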


The biggest problems with Piwik are that it is not scalable and the cost ($) of the servers to store analytics data. I've seen many cases where Piwik couldn't support a big volume of data.


Piwik is scalable up to at least 1 billion actions per month. It is not cheap, though: you need powerful database servers with a lot of RAM and fast SSD disks. It can be costly, but Piwik scales!


Why isn't it scalable?


MySQL


It can scale pretty far before it becomes an issue. Years ago I ran it on our main DB for sites that got over 1 million visits a month with no noticeable overhead. If I had it on its own server with its own DB it could have handled far more traffic.


I see. What is the upper bound of "big volume of data" possible under MySQL + Piwik?


That's a difficult thing to answer. However, the more important problem is loss of data whenever your DB isn't available due to downtime, upgrades, etc. It depends how important data loss is for your use case. I'm a data completist, but I'm in therapy for it ;)

It's definitely worth playing with, and trivially easy to spin up. Other self-hosted options aren't anything like as simple to get up and running.


Well, but the point was MySQL + Piwik "does not scale" and that it's "expensive" besides, which doesn't comport with my experience and sounds like received wisdom.


I think if your database is down you've typically got bigger problems than your analytics.


I've worked on Piwik servers tracking & processing reports for up to 800 million pageviews per month on hundreds of medium and larger websites.


Put Rabbit or Kafka in front of MySQL then?


They have a Redis-based solution to sit in front of MySQL (seems to be a "Pro" feature though):

https://piwik.org/faq/how-to/faq_19738/


It's not a Pro feature as in you have to pay for it. It's developed by the devs at piwik.pro, but you can download it for free.


There are many, many self-hosted analytics tools, from your own big data pipelines to tools like the aforementioned Piwik, as well as Open Web Analytics. I like Snowplow (http://snowplowanalytics.com/), but it's currently hard-coded to AWS.

Hosting your own analytics data can be great, but there are lots of ways to get better accuracy and control over your data without having to host everything. Still, if you can, it's great.


Also, one thing to add here: the client libraries are a lot of the work, and you can use the Snowplow JS/iOS/Python/etc. trackers no matter what server-side setup you use. I like to think of Snowplow as pushing the open-source analytics standard, with an ecosystem of server-side products hopefully growing around that, led by their own product.

We're doing something as dumb as using logs from cloud storage and parsing those into a DB with a ~100-line Python script. S3/GCS deal with the collector uptime, and as long as you aren't time-sensitive, it's a great solution.
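
The gist of such a script is something like this (an illustrative sketch, not our actual code - the combined log format and table layout are assumptions):

    # Parse combined-format access logs into SQLite for ad-hoc queries.
    import re
    import sqlite3

    LINE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
        r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
    )

    db = sqlite3.connect("analytics.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS hits (ip, ts, method, path, status, referer, ua)"
    )

    with open("access.log") as f:
        for line in f:
            m = LINE.match(line)
            if m:
                db.execute(
                    "INSERT INTO hits VALUES (?, ?, ?, ?, ?, ?, ?)",
                    m.group("ip", "ts", "method", "path", "status", "referer", "ua"),
                )
    db.commit()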

The biggest issue with self-hosted analytics is visualizing/sharing the results with non-tech parts of the team. Piwik has some advantages here because it's closer to Google Analytics or Mixpanel than to a DB of event rows...


Snowplow is only loosely hard-coded to AWS. I'm using it, and breaking it free is only a few hundred lines of code.

For example, rerouting Snowplow's Kinesis collector into Kafka is 114 lines, and that includes logging, metrics, etc. - I basically just had to extend the AbstractSink object in their Scala collector. Reading from Kafka is another couple of hundred lines, similarly writing to files.


Thanks for sharing, yummyfajitas - expect official Kafka support for Snowplow a little later this year [1] [2]; it's been long awaited! (Snowplow co-founder)

[1] https://github.com/snowplow/snowplow/milestones/Kafka%20%231 [2] https://github.com/snowplow/snowplow/milestones/Kafka%20%232


Nice. If you build some sort of native maxmind or other geotargeting into the scala collector, that would also be cool.

(Not that it was difficult to roll my own - so far snowplow is perfect for my needs - but obviously I'd rather use an official one.)


We do all enrichments like MaxMind, weather, arbitrary JavaScript etc downstream of collection, in our enrichment phase - the list of configurable enrichments is here: https://github.com/snowplow/snowplow/wiki/Configurable-enric...


I hope you're going to push your ~200 lines back to the project! Would love to see it free of the AWS dependency.


One potential problem: if you, say, test out self-hosted alongside GA, and get different numbers, people are going to question the value of the self-hosted thing.


They should question both, but maybe GA more. GA is an approximation and absolutely not completely correct.


Try selling that to a business guy or client or someone. Possible, but not easy, especially if there's ever a large discrepancy.


Haha, oh I know. Tell a client, "Yeah, this is the data, it says this, BUT this is also how it's wrong" and they always ignore the last part. It's data, that stuff can't be wrong!


I've done my own analytics since before there were hosted services.

Still use an old-school, proprietary tool called Sawmill[1]. One very nice aspect (out of many) about it is that it handles hundreds of other log formats out of the box, so it can report sensibly on switches, email, firewall logs... just about anything with very little effort.

No connection, other than being a long-time happy customer.

[1] http://sawmill.net/


I haven't used Sawmill in ages, but it wasn't bad, once upon a time. There's also GoAccess, which is a snazzy ncurses logfile analyzer:

https://www.goaccess.io/


Piwik also handles dozens of log formats, check it out: http://piwik.org/log-analytics/

GitHub project: https://github.com/piwik/piwik-log-analytics


Shameless plug: We're also working on an on-premise custom analytics platform that can be deployed to either Heroku or AWS with CloudFormation. https://github.com/rakam-io/rakam


To mitigate security loss (Piwik is complex), run Piwik and serve its gif on another machine.


Can you elaborate a bit more?


I'd assume the prior poster is suggesting that running a third party tool on your primary web server(s) would increase the surface area for attacks by some (possibly not small) amount. e.g. if piwik is compromised, an attacker would have some sort of user access on your web server(s), which is generally a bad thing. I suppose some people also use the same database user for all web applications, which would potentially be disastrous.

There are mitigations one can implement without going to that length (run piwik under a different user than any of your other web applications, using suexec, use a different database and user, etc.). At the least, putting Piwik in a container or VM makes sense, if any data on your web server(s) is critical or sensitive.

I suspect for very large deployments this would go without saying. But, for users with only one web server, it might seem reasonable to drop it into the same virtual host and run it all as the same user (and it's probably safe enough to do so for many users, as long as they stay on top of updates). But, any web application you run adds surface area for attackers. Might as well isolate them as well as your skills and resources allow.


See also some of the official Piwik security tips: https://piwik.org/docs/how-to-secure-piwik/


If you're serving the pages yourself (not, as this site does, through Cloudflare's caches), what does this tell you that isn't in the server logs?


Unless you are tracking events, outlinks, time spent on page, or something similar, nothing as far as data goes. But obviously server logs lack statistics and aggregation. Piwik actually supports using the server logs as its data source instead of the JavaScript tag.


Deduplicated reach (Unique Visitors). IP addresses change and are shared, so they aren't suitable for this.
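
E.g., counting uniques by a first-party visitor-id cookie versus by IP (a toy sketch; both columns in the events table are assumptions):

    # Cookie-based uniques vs. IP-based "uniques" from a hypothetical table.
    import sqlite3

    db = sqlite3.connect("analytics.db")
    by_cookie = db.execute("SELECT COUNT(DISTINCT visitor_id) FROM hits").fetchone()[0]
    by_ip = db.execute("SELECT COUNT(DISTINCT ip) FROM hits").fetchone()[0]
    print(f"by cookie: {by_cookie}; by IP (over/under-counts): {by_ip}")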


For those of us who prefer not to be tracked by self-hosted Sandstorm apps, two new uBlock rules (they're very rough, but I am no uBlock wizard):

||sandcats.io/embed.js$script

view.gif?page=$image

Note that you would have to change the filters if the scripts and tracking pixel are renamed, of course, but this should catch the majority of the push-button installs.


If you visit a web site the operators of that site will know that someone at your IP address visited the site, which pages you viewed, how long you lingered, which buttons you clicked, etc. If you want to prevent that your only option is to browse via things like proxies or Tor.


Why do you object to self-hosted analytics? I understand blocking centralized trackers (I do so myself), but self-hosted doesn't seem problematic in the same way GA being present on half the pages on the Internet is.

It also strikes me as an unwinnable battle for all but the largest sites.


Because OP is against all kinds of tracking? And because he can...


I can't claim to speak for OP, but am also against most tracking. I would also tend to think that being against first party tracking would be an unwinnable battle. It also leaks less data than third-party tracking, since the third party can see your activity across multiple sites whereas first party can only see your activity on that site unless it's aggregated through a backend service (another poster mentioned the ability to upload server logs to GA). No matter what, they can see what you load from their site.

Getting to first-party hosting of more intrusive analytics (scroll location, etc.), I think that rather than disallowing certain scripts/URLs from running, you have to get back to behavior-based blocking. Doing that in an environment where you allow any JS to execute seems tough, since something that can update the page based on scroll location can "talk" to another part that reports back to the server.

If you don't like intrusive first party analytics, just stop all JS.


Off topic a bit, but with GA being blocked: all these blockers work by blocking URLs and domains at the time they are loaded, but not requests made after the requested resource has loaded. Couldn't you just proxy the URL to GA so it's not blocked?
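
Something like this, I mean (a hypothetical Flask relay, untested):

    # First-party endpoint that forwards hits to GA, so the browser never
    # talks to google-analytics.com directly.
    import requests
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/a/collect", methods=["GET", "POST"])
    def relay():
        requests.post(
            "https://www.google-analytics.com/collect",
            data=request.values.to_dict(),
            headers={"User-Agent": request.headers.get("User-Agent", "")},
            timeout=5,
        )
        return ("", 204)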


By "requests made after load of the requested resource", you mean JS XHR requests for example? I'd guess they are also filtered by adblockers, aren't they?


Not quite true though... ad blockers, by default, include EasyList, which has the following block rule:

/piwik.$domain=~piwik.org,script

This will block your self-hosted piwik.js file, unless you perform some redirection trickery.


It's very easy to rename (or symlink) the piwik.js script to something else that is not blocked by any filter list. And unless you're a major site nobody will bother creating a special rule just for your self-hosted Piwik.


The default Piwik Sandstorm install has "piwik" nowhere in the URL.


Yeah, unfortunately it shows up as https://ls4an735rucvfa6ps6bb.filippo.sandcats.io/embed.js - it's getting to the point where I need to start performing actual content inspection.


I run Piwik on my server log files. This is my right as a website owner / system admin. :)


How come?


I would love to have Sandstorm on my RasPi but they don't support ARM.

It sounds like a great home intranet.


Just notes on this for the curious:

1. Sandstorm doesn't support ARM currently because Sandstorm apps run native Linux binaries, and every app would have to be compiled for each architecture.

2. I honestly think you'd be running pretty crippled trying to do Sandstorm on a RasPi. It's a bit smaller scale than Sandstorm seems targeted for. Each open Sandstorm grain commonly uses 100 MB of RAM or more (on top of the RAM used by Linux and the Sandstorm server itself, of course), so with just a couple of Sandstorm grains running simultaneously, you can max out a RasPi pretty quickly.


>Sandstorm doesn't support ARM currently because Sandstorm apps run native Linux binaries, and every app would have to be compiled for each architecture.

That's true of any Linux distro providing binary packages. They all support ARM anyway; it is trivially simple to compile packages. Even small projects like OpenBSD compile tens of thousands of packages for a dozen arches.


Yes but distros accomplish that by being highly opinionated on the build process you use to build packages whereas Sandstorm tries to be unopinionated on this point.

Sandstorm will support ARM someday but it's going to require a large investment in tooling in order to be painless for developers.


>Yes but distros accomplish that by being highly opinionated on the build process you use to build packages whereas Sandstorm tries to be unopinionated on this point.

How's that exactly? I looked it over and can't see any difference at all.



