Hacker News | gane5h's comments

The culture around cryptocurrencies can be a little insular – so we're sharing our onboarding guide that teaches you how to buy cryptocurrencies step by step. I think it's broadly applicable regardless of what you think about the viability of crypto.


I've used this in the past, in high-performance math code.

If you have data (vectors, matrices, etc.) that doesn't fit neatly into a SIMD block size, you'll have to zero out the leftover lanes after the calculation. At that point, it's cheaper to generate a zero in a register than to load one from memory (cheaper as in the number of CPU instructions).
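At a higher level, the idea looks like this (a pure-Python sketch of the lane-zeroing step; real code would generate the zero with an intrinsic like `_mm_setzero_ps` rather than load a zero constant from memory):

```python
LANES = 4  # lanes in a hypothetical SIMD register

def zero_tail_lanes(block, valid):
    """Zero the lanes past the `valid` count, as you would after computing on
    a padded/over-read SIMD block. At the instruction level, that zero is
    produced in-register (the xorps/vpxor idiom) instead of loaded via memory."""
    return [x if i < valid else 0.0 for i, x in enumerate(block)]

# The last two lanes hold junk read past the end of the real data.
print(zero_tail_lanes([1.0, 2.0, 99.0, 99.0], 2))  # [1.0, 2.0, 0.0, 0.0]
```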


Congrats to the Compose folks! IIRC, IBM acquired Cloudant a little while ago.


And Softlayer.

As a friend of mine who works there says, "IBM wanted to have a great cloud business, so they bought one".


And Blue Box (https://www.blueboxcloud.com/) a month and a half ago. Looks like they're investing in or acquiring cloud platform software companies like gangbusters at the moment.


We store our event stream data in Elasticsearch. Two features that made it appealing:

  * the ingest-side can be scaled up by adding more shards
  * the query-side can be scaled up by adding more replicas
To compute rollup analytics, we make heavy use of Elasticsearch's aggregation framework to compute daily/weekly/monthly/quarterly active users.
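A sketch of what such a rollup body might look like (the index and field names here – an "events" index with `timestamp` and `user_id` – are illustrative assumptions, not from the original comment):

```python
import json

# Hypothetical daily-active-users rollup: a date_histogram bucketing by day,
# with a cardinality sub-aggregation approximating distinct users per bucket.
dau_query = {
    "size": 0,  # skip the matching documents; we only want the buckets
    "aggs": {
        "per_day": {
            "date_histogram": {"field": "timestamp", "interval": "day"},
            "aggs": {
                # approximate distinct-user count within each day bucket
                "active_users": {"cardinality": {"field": "user_id"}}
            },
        }
    },
}

# POST this body to /events/_search; change "day" to "week", "month" or
# "quarter" for the other rollups.
print(json.dumps(dau_query))
```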

From my understanding, Postgres has many of these features, but the distributed features of ES are killer!


We're using Elasticsearch for events, too. The aggregation operators are surprisingly fast.

That said, one major downside to ES is that it's not schemaless. You can try to use the dynamic mapping system, but it will most likely just bite you eventually, since ES is strict about coercing data types. If your data isn't completely consistent, it will actually refuse to index it. Any changes to your schema also require reindexing. (For some reason ES can't reindex in place, despite storing all the original data in the "_source" field.)

If your data isn't perfectly consistent, one way to work around the mapping problem is to append a type name to every field. So instead of indexing {"user_id": "3"}, you index {"user_id.string": "3"}. This means that if you get some input data where the user_id is an int, it doesn't conflict, because it will be stored in "user_id.int". You have to handle the inconsistency on the query end, but it's possibly better than micromanaging the index.
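A minimal sketch of that suffixing idea (the suffixes below follow Python's type names rather than ES's, so treat them as illustrative):

```python
def suffix_types(doc):
    """Rewrite field names to carry the value's type, so documents whose
    types drift ({"user_id": "3"} vs {"user_id": 3}) land in separate
    fields instead of triggering a mapping conflict."""
    return {"%s.%s" % (key, type(value).__name__): value
            for key, value in doc.items()}

print(suffix_types({"user_id": "3"}))  # {'user_id.str': '3'}
print(suffix_types({"user_id": 3}))    # {'user_id.int': 3}
```

The two variants now index cleanly side by side; queries just have to check both fields.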


The closest to a hosted solution I've come across is Azure's DataLake announced at the recent Build conference: https://azure.microsoft.com/blog/2015/04/29/introducing-azur...

We're working on something adjacent at Silota. We pull in your CRM, behavioral analytics and support data and provide an easy-to-comprehend view layer on top. Our users are not analysts or data scientists, but account managers (more popularly known as Customer Success Managers.)

(and hence the appeal of hosted solutions.)


Really cool write-up – thanks! First time I’m hearing about CitusDB. They appear to be building a columnar, distributed database while preserving the Postgres frontend (similar to Redshift, Aster, Greenplum, etc.)

It’s all in the details. I’m planning to investigate the following during my next weekend hack. Hope somebody can answer some pre-sales questions for me:

  - how complete is the postgres functionality (e.g.: lateral joins)
  - can you set a sharding key to control the shard distribution
  - does the database do multiple passes for queries with subselects
  - usually one increases the replication factor (limited by budget) to improve query times, with the limitation that it slows down loading. does the DB stage intermediate writes to batch them, or does the user need to do this? this works really well for append-only, timestamped event data.
  - do you have a job manager or scheduler, needed when you have multiple views that need to be updated without melting your infrastructure
  - how easy is it to operate? does the database expose operational metrics so that you can see the load on each shard to potentially detect unbalanced shards?
  - tips on hardware configuration (big advantage of redshift here is that you don’t have to run your own warehouse.) maybe partner with MongoHQ?
It’d be nice to see some sample query plans visualized graphically.


Going on a tangent here: this benchmark highlights the difficulty of sorting in general. Sorts are necessary for computing exact percentiles (such as the median). In practical applications, an approximate algorithm such as t-digest should suffice: you can return results in seconds, as opposed to running "chest thumping" benchmarks to prove a point. :)

I wrote a post on this: http://www.silota.com/site-search-blog/approximate-median-co...


Perhaps I misunderstand your comment, but you actually don't need to sort to compute a median (see the O(n) median-of-medians algorithm [1]).

[1] http://en.wikipedia.org/wiki/Median_of_medians
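For reference, a small sketch of that selection approach (groups of five, recursive pivot choice), written for clarity rather than speed:

```python
def select(items, k):
    """Return the k-th smallest element (0-indexed) of `items` in O(n)
    worst case, using the median-of-medians pivot choice."""
    items = list(items)
    if len(items) == 1:
        return items[0]
    # Split into groups of five and take each group's median.
    groups = [sorted(items[i:i + 5]) for i in range(0, len(items), 5)]
    medians = [g[len(g) // 2] for g in groups]
    pivot = select(medians, len(medians) // 2)
    lows = [x for x in items if x < pivot]
    pivots = [x for x in items if x == pivot]
    highs = [x for x in items if x > pivot]
    if k < len(lows):
        return select(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return select(highs, k - len(lows) - len(pivots))

def median(items):
    """Lower median, without sorting the whole input."""
    return select(items, (len(items) - 1) // 2)

print(median([7, 1, 5, 3, 9]))  # 5
```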


I learnt this the hard way. Initially, I designed the API with bigints for ids, then found that some older versions of PHP didn't support bigints. I had to switch to using strings.
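The workaround boils down to serializing ids as JSON strings; a minimal sketch (field names illustrative):

```python
import json

# 2**53 + 1: the first integer that double-precision floats cannot represent,
# which is where JSON parsers without real bigint support start corrupting ids.
big_id = 9007199254740993

payload = json.dumps({"id": str(big_id)})  # ship the id as a string
decoded = json.loads(payload)

assert int(decoded["id"]) == big_id  # round-trips losslessly on any client
```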



My biggest difficulty with the documentation is the lack of examples and a logical layout. That's why at our company (we do hosted ES), we decided to invest in a new documentation portal à la Stripe: http://www.silota.com/docs/api/

Hope you find that useful. Any feedback is appreciated!


Is Silota based on Lucene (or ES or Solr)? You don't seem to mention it. And if it is not, there isn't enough information to make an informed choice. I don't see any information on tokenization, nor does your own site seem to have a search engine to check (dogfood?).


Never mind. Seems I missed the (hosted ES) point. Frankly, I would make that (or at least a Lucene mention) a selling point as opposed to a buried one. That way, people who actually know what to look for in a search engine know what's under the covers.


Good feedback. Still iterating on the messaging – at this stage, we are still figuring out how to describe the product.

When you begin incorporating ES into your application, roughly you’d be thinking about:

  1. Figuring out the structure of your data and translating that into ES’s mapping
  2. Learning the query syntax (the ES docs assume you already know Lucene, which is not usually the case)
  3. Setting up an ingest workflow and keeping your indexed data in sync
  4. Securing your cluster if you want to hit ES directly from the browser/API client
  5. Maintaining your cluster

Silota attempts to solve 3, 4, and 5. Improving documentation helps with 2.

There’s an e-commerce search example here: http://www.silota.com/docs/api/ecommerce-product-search-exam...


Say I'm trying to sell water.

Features: Liquid at room temperature. One oxygen atom, two hydrogen atoms. Has high specific heat capacity.

Benefits: Helps with the balance of bodily fluids necessary for digestion, absorption, transportation of nutrients, creation of saliva, maintenance of body temperature.

