Piwik is a great project, but it tends not to work well for handling sites with millions of events per day. Your MySQL table starts to bust at the seams pretty quickly.
For big sites, you'll want that event data in GiBs of plain raw logs that you can bulk load into tools like BigQuery or Redshift for analysis.
My team has built/delivered a SaaS web content analytics platform for the past few years called Parse.ly. We instrument pageview events (like Google Analytics) automatically, and we also instrument time-on-page using heartbeat events. We collect 50 billion monthly page events for over 600 top-traffic sites, and display it all in a real-time dashboard.
To emphasize that our customers own the data they send us, we recently launched a Raw Data Pipeline product:
http://parse.ly/data-pipeline
Basically, we host a secure S3 bucket and Kinesis stream for customers, and deliver their raw (enriched) event data there. From there, they typically load it into their own BigQuery or Redshift instance, or they analyze pieces of it directly with Python/R/Excel/etc.
Our customers tell us this strikes the right balance among data ownership, query flexibility, and hassle-free infrastructure.
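To give a flavor of the "analyze it directly with Python" step, here's a minimal sketch of what that can look like on the customer side. The bucket name, prefix, and event fields below are made up for illustration, not necessarily how the pipeline lays out objects:

```python
# Minimal sketch: pull one day's worth of raw (enriched) events from an S3
# bucket and poke at them with pandas. Bucket, prefix, and field names are
# hypothetical; adjust to however your pipeline actually delivers objects.
import gzip
import json

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "example-raw-events"      # hypothetical bucket
PREFIX = "events/2016/03/07/"      # hypothetical date-partitioned prefix

rows = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        if obj["Key"].endswith(".gz"):
            body = gzip.decompress(body)
        rows.extend(json.loads(line) for line in body.decode("utf-8").splitlines() if line)

df = pd.DataFrame(rows)
# "url" and "session_id" are illustrative field names, not a real schema
print(df.groupby("url")["session_id"].nunique().sort_values(ascending=False).head(10))
```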
I've read the Parse.ly tech blog a few times; it was great, so I immediately recognized your company. As someone who's also working in publishing and gradually moving to a more data-driven approach - thanks! :)
Data ownership in GA is a "gray area" that becomes less gray if you pay $150K/yr for "GA Premium".
Google has mixed incentives in running its free analytics service: it gets web-wide analytics data, uses that data to help it sell more AdWords, and integrates GA with other services, like its display advertising products (DFP, etc.)
From a practical standpoint, you don't "own" analytics data when a) you can't easily access it in raw form and b) the SaaS provider "leaks" your data to dilute its value to you. We address (a) and (b) directly through our products and public data privacy stance. See this blog post for our public view on analytics data privacy:
While I find your stance on privacy very refreshing for an analytics company, hiding your pricing info behind a sales rep is a huge turnoff for me. If you feel your pricing is reasonable for the service that you provide, I really don't see why you can't just display it proudly on your site.
Whether to display pricing on the website is something we debated in the past, and continue to debate. (Your comment may wake up the debate for me.)
Pricing for analytics services (in the marketplace) is all over the map. Google picked $150K/year as the price for GA Premium because that's the low end of a contract with Adobe Analytics, the market leader. We're typically cheaper than existing Adobe/GA contracts. "Event analytics" companies like Mixpanel and Heap, which don't compete with us directly, have variable per-event pricing that would break the bank for the customers we serve. We have a bit of an aversion to per-event pricing because it feels like "punishing customers for success".
Meanwhile, per seat pricing, though attractive on the surface and popular in the SaaS segment, has several concerns in our space. First, we want customers to feel free to hand out access to our platform: part of our value proposition is democratizing access to analytics data. So we don't want "stingy seat quotas" typical with tools like Salesforce. Second, for an analytics tool, seats are a bit easier to "hack" for a pricing model -- though our dashboards can be customized per user, a single shared account can access all the data. Meanwhile, our costs don't scale with seats, but with site traffic/users instead.
For these reasons, and more, we've settled on "tiered pricing". Roughly speaking, our service is offered in three tiers. Each tier supports a larger class of site (more monthly uniques), which also bestows more features (e.g. more data retention in higher tiers). To work within the budget constraints of some companies, we will discount tiers while removing cost-affecting features, e.g. maybe you are in the highest class of site, but we disable API access and limit data retention. Because this is a tad more complex than a pricing page could express easily, and also because we think the value of the product comes through best in a guided demo, we made the decision to hide pricing and instead responsively provide demos on-demand.
So, the tl;dr is, pricing, and the display of it, is definitely something we think about, and we have (IMO valid) reasons for not displaying pricing right now, but you make a fair point: if Musk can price his rockets publicly, maybe we can figure something out, too :)
I avoid all services with prices hidden behind a sales rep when possible. I always feel that I'm not an astute negotiator but the sales rep will be, ergo he will see me as a mug and take me for all he can get. If I do buy, even if I'm happy, I'm always left wondering if I'm paying 2x as much as other customers.
I'm sure Parse.ly's analytics system told them that their potential customers were getting sticker shock; hence the sales guys need to explain the value proposition to them properly before revealing the figure ;)
Sometimes I feel Google should just blanket-replace their disclosure statements across the board with this classic video: https://www.youtube.com/watch?v=8fvTxv46ano.
Less beating around the bush.
Another very simple alternative is goaccess [0] which is purely server-side. It gathers quite a lot of information by parsing the server log. It doesn't do all the JavaScript stuff to track every single click, but it gives you stats on how many users visit your site, which parts, when, which sites or which domains they're coming from and also how much bandwidth they use. It also shows which status codes are coming from which paths and a lot more. It supports various output formats such as HTML or an interactive htop-like terminal application. It's also being actively developed. I use it and find it very useful.
Piwik has the problem that it writes directly to MySQL as the activity happens. If your database is down, you lose data. If you have a spike of traffic above what your DB can handle in writes, you lose data.
Snowplow is implemented as a unidirectional data pipeline:
tracking -> collection -> enrichment -> storage
Between each step there is typically some kind of persistent queue (mostly S3/Kinesis), and data won't be lost if a downstream component is not operational. Examples:
* If your event collector is unavailable, raw events will be cached in the tracker in localStorage, SQLite or similar
* If your Redshift database is read-only for maintenance, enriched events will be held back until Redshift is writeable again
There is of course the opportunity for a collector to be down, but the aim is to keep those components super simple and rely on really durable storage (Kinesis, S3) managed by someone else.
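To make the "durable queue between each step" point concrete, here's a toy Python sketch of the collector idea. It is not Snowplow's actual collector (the real ones are Scala/JVM apps); it just illustrates that the collector's only job is appending raw events to durable storage, so downstream outages don't drop data:

```python
# Toy sketch of "collector = dumb durable append", not Snowplow's real
# implementation. The collector only gets the raw event onto durable storage
# (Kinesis here); enrichment and warehouse loading consume the stream later,
# so their downtime doesn't lose events while the stream retains them.
import json
import uuid
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis")
STREAM = "raw-events"  # hypothetical stream name


def collect(event: dict) -> None:
    record = {
        "event_id": str(uuid.uuid4()),
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "payload": event,
    }
    kinesis.put_record(
        StreamName=STREAM,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["event_id"],
    )


if __name__ == "__main__":
    collect({"type": "pageview", "url": "/pricing", "user": "abc123"})
```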
Piwik started in 2007. It was inexcusable to use md5 then. The problem is that it was ever md5, not that it is still md5.
That really doesn't inspire confidence that they know what they're doing. md5 is a canary in the coal mine for me, and I definitely won't use Piwik now.
To be fair, they aren't an encryption service or something. They INSERT INTO visitor_count. So far as "inspiring confidence" needs to go for such a thing, I wouldn't write them off just yet.
EDIT: Done. Not exactly straightforward to download Google Fonts but there are great helpers around. Got rid of CDNjs as well, since CloudFlare has HTTP/2 now. No 3rd parties left except GA, which will go in a day or two.
If you just want to have nicer webfonts without the 'phoning home to Google' issue try Brick webfonts at http://brick.im/.
If the goal is to get rid of all third-party dependencies you can still use the Brick repository on GitHub to download the better looking fonts and self-serve them, just as you are now with the Google served ones.
The only 'gotcha' I found with using Brick is that NoScript users won't see the fonts and won't see the Brick URL in the menu for whitelisting. They will need to inspect the page and learn that they must manually add brick.im. Not exactly user-friendly but then again NoScript breaks lots of things and users are used to it.
I don't think* embedding Google Fonts allows Google much (any?) data collection beyond, presumably, that the font was requested, though it's definitely an unnecessary dependency.
Given Google Fonts allows you to download all of the fonts, with license info provided, in a variety of formats, there's almost no good reason not to embed them directly in your own site.
Right now, Sandstorm apps will work with Google Fonts, but the Sandstorm team intends to sandbox the client side better in the near-ish future, so I've already been nudging Sandstorm apps not to use such things when I see them.
Note that while the author's blog uses Google Fonts, the Piwik Sandstorm package does not. (I checked.)
And it does appear Google just blankets this into its general API terms of use, that they can "use submitted data" in accordance with their general privacy policies. So yeah, I guess they can use it as part of their tracking. :/
I share the concern about privacy, but there is a benefit to using Google Fonts: maybe it increases the chance of the font being already cached on the client?
Small side story: I had a 'fun' night where I 'just' wanted to make a self-hosted WordPress also perfect regarding privacy. This was not easy! First I had to install a plugin that doesn't use Google Fonts for a certain theme, then I needed to disable the default avatar service using another plugin plus customization, next I needed to convert YouTube and Twitter embeds into images with links, and so on. It took me several hours.
At the time I thought I was the only one with these requirements ... also a bit strange for an open source project to have such defaults, IMO.
The biggest problem with Piwik is that it doesn't scale, plus the cost ($) of the servers needed to store analytics data. I've seen many cases where Piwik couldn't support a big volume of data.
Piwik is scalable up to at least 1 billion actions per month. But it is not cheap: you need powerful database servers with a lot of RAM and fast SSD disks. It can be costly, but Piwik scales!
It can scale pretty far before it becomes an issue. Years ago I ran it on our main DB for sites that got over 1 million visits a month with no noticeable overhead. If I had it on its own server with its own DB it could have handled far more traffic.
That's a difficult thing to answer. However, the more important problem is loss of data whenever your DB isn't available due to downtime, upgrades, etc. It depends how important data loss is for your use case. I'm a data completist, but I'm in therapy for it ;)
It's definitely worth playing with, and trivially easy to spin up. Other self-hosted options aren't anything like as simple to get up and running.
Well, but the point was MySQL + Piwik "does not scale" and that it's "expensive" besides, which doesn't comport with my experience and sounds like received wisdom.
There are many, many self-hosting analytics tools, from your own big data pipes to tools like the aforementioned piwik, as well as Open Web Analytics. I like Snowplow (http://snowplowanalytics.com/), but it's currently hard-coded to AWS.
Hosting your own analytics data can be great, but there are lots of ways to get better accuracy and control over your data without having to host everything. Still, if you can, it's great.
Also, one thing to add here is that the client libraries are a lot of the work, and you can use the Snowplow JS/iOS/Python/etc. trackers no matter what server-side setup you use. I like to think of Snowplow as pushing the open-source analytics standard, and then hopefully an ecosystem of server-side products grows around that, led by their own product.
We're doing something as dumb as using logs from cloud storage and parsing those with a ~100 line python script into a DB. S3/GCS deal with the collector uptime and as long as you aren't time-sensitive it is a great solution.
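Roughly, a toy version of that kind of script looks like this (not mine verbatim; the regex below parses common/combined access-log lines, and S3/GCS access logs would need their own patterns):

```python
# Toy version of that kind of script: walk a directory of downloaded access
# logs, parse combined-log-format lines, and load them into SQLite. Not the
# actual ~100-line script; S3/GCS access logs have their own formats, so the
# regex would need adjusting.
import glob
import re
import sqlite3

LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\S+)'
)

db = sqlite3.connect("hits.db")
db.execute("""CREATE TABLE IF NOT EXISTS hits
              (ip TEXT, ts TEXT, method TEXT, path TEXT, status INTEGER, bytes INTEGER)""")

for logfile in glob.glob("logs/*.log"):
    with open(logfile, encoding="utf-8", errors="replace") as f:
        rows = [m.groups() for m in (LINE.match(line) for line in f) if m]
    db.executemany("INSERT INTO hits VALUES (?,?,?,?,?,?)", rows)

db.commit()
print(db.execute("SELECT path, COUNT(*) FROM hits GROUP BY path ORDER BY 2 DESC LIMIT 10").fetchall())
```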
The biggest issue with self-hosted analytics is visualizing/sharing the results with non-tech parts of the team. Piwik has some advantages here because it's closer to Google Analytics or Mixpanel than to a DB of event rows...
Snowplow is only loosely hard-coded to AWS. I'm using it, and breaking it free is only a few hundred lines of code.
For example, rerouting Snowplow's Kinesis collector into Kafka is 114 lines, and that includes logging, metrics, etc - I basically just had to extend the AbstractSink object in their scala collector. Reading from Kafka is another couple of hundred lines, similarly writing to files.
Thanks for sharing, yummyfajitas - expect official Kafka support for Snowplow a little later this year [1] [2]; it's been long awaited! (Snowplow co-founder)
One potential problem: if you, say, test out self-hosted alongside GA, and get different numbers, people are going to question the value of the self-hosted thing.
Haha, oh I know. Tell a client, "Yeah this is the data, it says this, BUT this is also how it's wrong" and they always ignore the last part. It's data, that stuff can't be wrong!
I've done my own analytics since before there were hosted services.
Still use an old-school, proprietary tool called Sawmill[1]. One very nice aspect (out of many) about it is that it handles hundreds of other log formats out of the box, so it can report sensibly on switches, email, firewall logs... just about anything with very little effort.
No connection, other than being a long-time happy customer.
Shameless plug: We're also working on an on-premise custom analytics platform that can be deployed to either Heroku or AWS with CloudFormation. https://github.com/rakam-io/rakam
I'd assume the prior poster is suggesting that running a third party tool on your primary web server(s) would increase the surface area for attacks by some (possibly not small) amount. e.g. if piwik is compromised, an attacker would have some sort of user access on your web server(s), which is generally a bad thing. I suppose some people also use the same database user for all web applications, which would potentially be disastrous.
There are mitigations one can implement without going to that length (run piwik under a different user than any of your other web applications, using suexec, use a different database and user, etc.). At the least, putting Piwik in a container or VM makes sense, if any data on your web server(s) is critical or sensitive.
I suspect for very large deployments this would go without saying. But, for users with only one web server, it might seem reasonable to drop it into the same virtual host and run it all as the same user (and it's probably safe enough to do so for many users, as long as they stay on top of updates). But, any web application you run adds surface area for attackers. Might as well isolate them as well as your skills and resources allow.
Unless you are tracking events, outlinks, time spent on page, or something similar, you lose nothing as far as data goes. But obviously raw server logs lack statistics and aggregation. Piwik actually supports using server logs as its data source instead of the JavaScript tag.
For those of us that prefer not to be tracked by self-hosted Sandstorm apps, 2 new ublock rules (they're very rough, but I am no ublock wizard):
||sandcats.io/embed.js$script
view.gif?page=$image
Note that you would have to change the filters if the scripts and tracking pixel are renamed, of course, but this should catch the majority of the push-button installs.
If you visit a web site the operators of that site will know that someone at your IP address visited the site, which pages you viewed, how long you lingered, which buttons you clicked, etc. If you want to prevent that your only option is to browse via things like proxies or Tor.
Why do you object to self-hosted analytics? I understand blocking centralized trackers (I do so myself), but self-hosted doesn't seem problematic in the same way GA being present on half the pages on the Internet is.
It also strikes me as an unwinnable battle for all but the largest sites.
I can't claim to speak for OP, but am also against most tracking. I would also tend to think that being against first party tracking would be an unwinnable battle. It also leaks less data than third-party tracking, since the third party can see your activity across multiple sites whereas first party can only see your activity on that site unless it's aggregated through a backend service (another poster mentioned the ability to upload server logs to GA). No matter what, they can see what you load from their site.
For first-party hosting of more intrusive analytics (scroll location, etc.), I think that rather than disallowing certain scripts/URLs from running, you have to get back to behavior-based blocking. Doing that in an environment where you allow any JS to execute seems tough, since something sandboxed that can update the page based on scroll location can "talk" to another part that reports back to the server.
If you don't like intrusive first party analytics, just stop all JS.
Off topic a bit, but regarding GA being blocked: all these blockers work by blocking URLs and domains at the time they are loaded, but not requests made after the requested resource has loaded. Couldn't you just proxy the URL to GA so it's not blocked?
By "requests made after load of the requested resource", you mean JS XHR requests for example? I'd guess they are also filtered by adblockers, aren't they?
It's very easy to rename (or symlink) the piwik.js script to something else that is not blocked by any filter list. And unless you're a major site nobody will bother creating a special rule just for your self-hosted Piwik.
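For what it's worth, the proxy idea a couple of comments up is simple enough to sketch. This is a hypothetical first-party relay (made-up /metrics path; Flask and requests used purely for illustration), not something GA officially supports or encourages:

```python
# Rough sketch of "proxy GA through your own domain": the page sends hits to a
# first-party path, and the server relays them to Google Analytics' Measurement
# Protocol endpoint, so URL-based blocklists never see google-analytics.com.
# The /metrics path is hypothetical; forwarding the client IP via "uip" (a
# documented Measurement Protocol parameter) keeps geo data roughly sane.
import requests
from flask import Flask, Response, request

app = Flask(__name__)
GA_COLLECT = "https://www.google-analytics.com/collect"


@app.route("/metrics", methods=["GET", "POST"])
def relay():
    params = request.values.to_dict()      # hit parameters sent by the page
    params["uip"] = request.remote_addr    # IP override so hits aren't attributed to the proxy
    requests.post(GA_COLLECT, data=params, timeout=2)
    return Response(status=204)            # empty response, like a tracking pixel


if __name__ == "__main__":
    app.run(port=8080)
```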
1. Sandstorm doesn't support ARM currently because Sandstorm apps run native Linux binaries, and every app would have to be compiled for each architecture.
2. I honestly think you'd be running pretty crippled trying to do Sandstorm on a RasPi. It's a bit smaller scale than Sandstorm seems targeted for. Each open Sandstorm grain commonly uses 100 MB of RAM or more (on top of the RAM used by Linux and the Sandstorm server itself, of course), so with just a couple of Sandstorm grains running simultaneously, you can max out a RasPi pretty quickly.
>Sandstorm doesn't support ARM currently because Sandstorm apps run native Linux binaries, and every app would have to be compiled for each architecture.
That's true of any Linux distro providing binary packages. They all support ARM anyway; it is trivially simple to compile packages. Even small projects like OpenBSD compile tens of thousands of packages for a dozen arches.
Yes but distros accomplish that by being highly opinionated on the build process you use to build packages whereas Sandstorm tries to be unopinionated on this point.
Sandstorm will support ARM someday but it's going to require a large investment in tooling in order to be painless for developers.
>Yes but distros accomplish that by being highly opinionated on the build process you use to build packages whereas Sandstorm tries to be unopinionated on this point.
How's that exactly? I looked it over and can't see any difference at all.