Yes. Our schema is fully denormalized, which is particularly important for performance on funnels. (See heapanalytics.com/features/funnels)
In particular, to compute where a user drops off in a funnel, I need to scan one array from left to right, and I don't need to do any joins. This shards very well, since all of a user's data lives on one shard, and most of the queries are aggregations, which are simple to reassemble from subqueries.
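A minimal Python sketch of that left-to-right scan, assuming each user's events are stored as one time-ordered array of event names. The function names (`funnel_depth`, `shard_funnel`, `merge_shards`) and the data shape are hypothetical, chosen only for illustration, and this is not the actual production implementation:

```python
from typing import Iterable, List, Sequence


def funnel_depth(events: Sequence[str], steps: Sequence[str]) -> int:
    """Return how many funnel steps the user completed, in order.

    Assumes `events` is already sorted by time (hypothetical layout).
    """
    depth = 0
    for event in events:  # single left-to-right pass, no joins
        if depth == len(steps):
            break  # the whole funnel is already complete
        if event == steps[depth]:
            depth += 1
    return depth


def shard_funnel(users: Iterable[Sequence[str]], steps: Sequence[str]) -> List[int]:
    """Per-shard aggregate: counts[i] is the number of users who reached step i + 1."""
    counts = [0] * len(steps)
    for events in users:
        for i in range(funnel_depth(events, steps)):
            counts[i] += 1
    return counts


def merge_shards(per_shard: Iterable[List[int]]) -> List[int]:
    """Reassemble the global funnel by summing the per-shard counts."""
    return [sum(column) for column in zip(*per_shard)]
```

Because each user's array lives on exactly one shard, every shard can run `shard_funnel` independently, and the coordinator only has to add up small count vectors.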
PostgreSQL has gradually and intelligently been incorporating NoSQL features, e.g. hstore, hstore2, json, jsonb. These come with powerful features of their own, such as indexing (for example, GIN indexes on hstore and jsonb columns).
Query performance is critical. Insert perf is a comparatively minor concern.
Even so, there are a few factors to consider:
- Half-second dedupes are for users with 100k+ events, who make up <<1% of all users.
- We batch events for ~5s before adding them to the cluster, so we aren't deduping for every event -- only once per ~5s of events per user.
If this becomes an issue, we can remove the deduping from normal operation and only call it when we're backfilling / updating events. Even so, we still need this function to exist, and the 100x performance improvement is very helpful.
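To make the batching point concrete, here is a rough Python sketch of deduping once per user per batch rather than once per event. Everything in it is an assumption for illustration: the `(event_id, timestamp)` event shape, the in-memory `store`, and the names `dedupe` and `flush_batch` are hypothetical, and the real dedupe function discussed here is not written in Python.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

# Hypothetical event shape: (event_id, timestamp).
Event = Tuple[Hashable, float]


def dedupe(events: List[Event]) -> List[Event]:
    """Drop repeated event ids, keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for event in events:
        if event[0] not in seen:
            seen.add(event[0])
            unique.append(event)
    return unique


def flush_batch(store: Dict[str, List[Event]],
                batch: Iterable[Tuple[str, Event]]) -> int:
    """Apply one buffered batch (e.g. ~5s of events) to the store.

    Events are grouped by user first, so dedupe runs once per user in the
    batch, not once per incoming event. Returns the number of dedupe calls.
    """
    by_user = defaultdict(list)
    for user_id, event in batch:
        by_user[user_id].append(event)
    for user_id, new_events in by_user.items():
        store[user_id] = dedupe(store.get(user_id, []) + new_events)
    return len(by_user)
```

With four incoming events spread over two users, `flush_batch` makes two dedupe calls instead of four, which is why the occasional expensive dedupe on a very large user matters less than it first appears.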