Saw this last week. The examples are limited but easy to reproduce. I saw a discussion on Reddit where it seems they plan to get even faster using techniques like prefetching. https://news.ycombinator.com/item?id=22803504
I came across QuestDB in the past, but never tried it myself. At my company, we use kx and onetick. Could you please elaborate on why you are also comparing with Postgres, since it's not really a time-series database, nor does it claim to be part of the "high performance" club?
because they seem to only support that, from their page:
"As of now, SIMD operations are available for non-keyed aggregation queries, such as select sum(value) from table."
not even sure if they support where clauses on that, sums of functions of a column, or even other things like stddev of the column.
their storage format though looks good and simple (similar to kdb actually), but they really should have an 8-bit char instead of the 16-bit one since that would be far more used.
their partitioning scheme is only on time, so less advanced than other systems.
single designated timestamp column (so no bitemporal), but do support asof joins which is nice.
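For anyone unfamiliar with the term: an asof join pairs each row with the most recent row from another table at or before its timestamp. A minimal sketch in Java (illustrative only, not QuestDB or kdb code), assuming both sides are sorted by time:

```java
import java.util.Arrays;

// Sketch of an asof join: for each trade timestamp, find the index of the
// most recent quote at or before it. Both arrays are assumed sorted ascending.
public class AsofJoin {
    // Returns, for each element of tradeTimes, the index of the last quote
    // with quoteTimes[i] <= tradeTime, or -1 if no such quote exists.
    static int[] asof(long[] tradeTimes, long[] quoteTimes) {
        int[] result = new int[tradeTimes.length];
        int q = 0;
        for (int t = 0; t < tradeTimes.length; t++) {
            while (q < quoteTimes.length && quoteTimes[q] <= tradeTimes[t]) {
                q++;
            }
            result[t] = q - 1; // last quote not after this trade, or -1
        }
        return result;
    }

    public static void main(String[] args) {
        long[] trades = {105, 210, 300};
        long[] quotes = {100, 200, 250, 400};
        System.out.println(Arrays.toString(asof(trades, quotes)));
        // -> [0, 1, 2]
    }
}
```

Because both inputs are sorted, the whole join is a single linear merge rather than a per-row binary search, which is why it's cheap in time-series stores.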
they totally screwed up on dates and times. dates only go to millis and timestamps only to micros. huge mistake.
long256 which is nice, but strangely no long128s (which wind up being nice when you have base-10 fixed-point numbers normalized to a large number of decimals).
i didn't see any fixed-width string/byte columns. Does have 32-bit symbols (i assume similar to kdb?) that might cover some of those use cases.
some good and some bad in there. never going to compete with kdb or onetick on performance (and nobody competes with kdb on query language/extensibility), but could find a niche based on price, a simpler and more human interface, and being more easily adapted to querying.
- nanos are important for keeping ordering. while you may return results in insert order, it is nice to have them so that any operations done on the data outside the db can retain (or recreate) that ordering. in financial systems nanos have become a sort of de facto standard for this. for example, all the messaging timestamps at places i've worked are nano for anything written in the last 5-10 years.
Also, when you are trying to do calculations on high-frequency data (tick, iot), it ruins your ability to take meaningful deltas (eg, arrival rates) since you get a lot of 0s and 1s for the time deltas. It's difficult to take weighted averages with weights of 0.
the issues with nanos (if you really need a wider range) are easier to solve than having to force everything down to micros and create workarounds for that. (eg, kdb uses multiple date, time, and timestamp types, and it doesn't use the unix epoch since that isn't very useful for tick, sensor, or any high-frequency data i've seen).
better than a double that some systems still use.
-kdb's secret sauce that people don't seem to understand is its query language, which more naturally fits the time-series domain. It isn't so much a database as an array language with a database component. (eg, try to write an efficient sql query that calculates 5 minute bars on tick data).
I actually like Java too - I've written or worked on a couple of trading systems written in core Java. Just get good devs who understand what it means to write zero-GC code, abuse off-heap memory, and understand what HotSpot can intrinsify. If you can stay in the SIMD code for all the heavy-lifting loops (filters, aggregates, etc), I don't think Java will be an impediment.
I think you have parts going in the correct direction, and you seem to have good experience judging by the bios. Nothing really un-fixable (or un-addable) in what I saw glancing at your docs. I did bookmark you to see how the db goes. Will probably check it out soon.
I'm a QuestDB dev. With regards to nano timestamps, we don't use a nanosecond timestamp because it's not possible to be accurate to that resolution with current hardware. However, on a single host the nanosecond clocks are precise and monotonic, so they would be useful to maintain order. I think they do make sense and we will have to look into providing timestamps at that resolution.
"its not possible to be accurate to that resolution with current hardware"
Are you referring to the clock precision of consumer grade hardware here?
In my experience the vast majority of financial time series data is reported in nanoseconds. The data providers, vendors, exchanges and data brokers absolutely have hardware capable of measuring timestamps in nanoseconds.
The accuracy doesn't have to be to 1ns of resolution to warrant measuring in nanos - even to the nearest 100ns is a useful and meaningful improvement beyond micros.
The reason why you cannot ever compete with kdb+/q is that the database and the language run in one address space. Your benchmark gets around this problem by using the built-in sum() function, but kdb+/q can execute arbitrary code and never suffer a performance penalty. Unless you plan on integrating a high-performance programming language into your DB, it simply will not be possible to ever meaningfully compete on the effective total time of complex queries.
I, of course, am not disparaging your work, the performance numbers are very impressive!
I'm a QuestDB dev. Data in QuestDB is also stored in a single address space, and SQL queries are compiled into objects that run in that address space. However, it is just SQL, not a bespoke language. A future possibility would be to allow queries in Java or Scala.
Implementing the ability for QuestDB to dynamically load jars would be really cool. And if you exposed an interface to directly communicate with the Db, you could get rid of the SQL parsing overhead as well. This would also allow QuestDB to function as an app engine of sorts, just like kdb+/q. I see real value in that for latency sensitive financial applications.
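One possible shape for that idea, sketched with an entirely made-up `Query` extension point (neither the interface nor any jar path here is real QuestDB API): load a user-supplied jar at runtime and invoke its code directly, bypassing SQL parsing.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

// Hypothetical plugin loader: pull a user-supplied class out of a jar at
// runtime and run it against in-process data, skipping SQL entirely.
public class PluginLoader {
    // Made-up extension point a user's jar would implement.
    public interface Query {
        double run(double[] column);
    }

    // Load `className` from `jar` and instantiate it as a Query.
    static Query load(Path jar, String className) throws Exception {
        URLClassLoader loader = new URLClassLoader(
                new URL[]{jar.toUri().toURL()},
                PluginLoader.class.getClassLoader());
        return (Query) loader.loadClass(className)
                .getDeclaredConstructor()
                .newInstance();
    }
}
```

The interesting design question is the one the comment raises: once user code implements an interface like this, the database becomes an app engine, and query latency is just a method call.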
You are right, PostgreSQL is not necessarily optimal for time-series workloads. In this case, the benchmark is a simple task, not related to time series: it consists of loading 1 billion values into a table and summing them together. It doesn't get simpler than that.
The reason we are showcasing this instead of other, more complex queries is that it is a simple, easily reproducible benchmark. It provides a point of reference for performance figures.
I don't think it's fair to say "A is faster than B" like in the above comments based on the order they appear in a list that mixes GPU-cluster and laptop results. The author of the benchmark does nothing wrong deontologically, but the results table seems ordered by time, and some people jump to quick conclusions or use it as a way to rank performance when it's not appropriate.
Greenplum seems like a way to throw more hardware at a problem. Queries too slow with postgres? Shard your data across machines and distribute queries to speed things up by running in parallel. It's using scale as a means to compensate for low efficiency.
On the opposite side of the spectrum you have other open-source projects like questDB that focus fully on core performance: constantly optimising to get as much as possible from a single processor core. You can't scale out (at least yet), but given how fast it is on a single core, it will be pretty powerful if they choose to go this route.
I am a big supporter of part time too. I can't overstate the importance of work-life balance. Work less, work better, and don't forget yourself or your family.
However reading the description, I can't tell whether the potential hire fits this category.
Are they looking to have 2 jobs in parallel? What would their side thing be, is it different enough that they wouldn't be working on both concurrently?
If they want to be away from writing software 2 days a week to pursue a hobby (e.g. restoring a boat, painting, or whatever) then it makes a lot of sense. They will be happy, grateful, and the culture will be great.
If they are seeking part time in order to work for 2 startups or setup their own side business, then I'm not so sure.
It is the latter. I appreciate their honesty though, which makes the call that much harder. Their side projects are bitcoin related, far more than just a hobby.
Consider that allowing this perk may enable you to hire someone you'd never be able to get otherwise.
I -- and I'm not the only founder I know who has had this experience -- had to compromise technically on early hires because they were what we could get.