It scales so well that we are deploying it across 3 datacenters, each with four 384-core machines with 3 terabytes of RAM. I can't speak highly enough about SQLite and the team behind it.
I don't understand why you'd use SQLite for a deployment this heavy over PostgreSQL or MySQL. Yes, SQLite is fantastic, but why would you choose to do it this way?
Because SQLite is often the right tool for the job, more so than Postgres and MySQL. It scales insanely well and requires minimal configuration and near zero administration. I try to use SQLite for as long as is feasible in my projects purely because it "just works".
Don't let the "Lite" fool you. Depending on your needs, you can scale to tens of thousands of users using just SQLite. Or not! It is all about knowing your system and properly evaluating your options. I choose to use SQLite because it often fits the use case of small to medium projects the best.
Honest question: where do Postgres and MySQL fall down for such deployments? What is the administration required? Backups? Users and permissions? Indices?
FWIW I use sqlite for a personal wiki, i.e. a website with exactly one user :)
I think the idea is that if your service does more, it necessarily needs more work to maintain. From a simple point of view, there are just more moving parts to go wrong.
I think it is a tradeoff. If you want more data resilience on disk, SQLite will require all sorts of hacks to control how (and, more difficult, when) it dumps to disk.
On the other hand, if you need every ounce of performance and don't care so much about data resiliency on disk, the other SQL databases are a hassle to configure so that you don't overlook any of the many data-safety safeguards they ship with enabled by default, any one of which will kill your performance when you least expect it.
Conversely, why would you choose one of those alternatives?
(I don't actually think there aren't any reasons to do so, mind you, but I can well imagine that if you really KISS, well - even at impressive sizes sqlite makes sense).
It sure is. After recently porting a large application from SQLite to Postgres, much of my time was spent casting types and ensuring code was updated to input the correct types into Postgres. All in all, the changes were for the better, as many bugs were discovered and resolved in the process.
Well, the historic one is, "because you need multiple concurrent data writers."
But, that's what this branch is supposed to be for. So now the reason is: "because you need multiple concurrent data writers from multiple processes." And it might also be "you need multiple concurrent data writers and care about your data enough to not risk loss in the event of a system crash or power failure" since it uses PRAGMA synchronous=OFF.
You are using this particular branch in production, deployed as you mention, or sqlite in general? Some more info about this use case would be very interesting if you can share!
What sort of volume is this handling? Is the dataset in memory, or if not, what size and type of I/O is backing this? It seems like a ton of CPU, whereas in my experience typically I/O is the primary bottleneck for database loads.
The link says that it requires turning synchronous off, which means you won't be waiting for real I/O on transactions, since no fsync calls are emitted. Add that to a huge amount of RAM for cache, and it's very reasonable to be CPU bound... So long as you don't mind data loss or corruption on power loss or kernel panic...
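To make that concrete, this is roughly the knob being discussed, as a minimal sketch using Python's built-in sqlite3 (the file name is made up, and this is stock SQLite, not the branch in question):

    import sqlite3

    con = sqlite3.connect("example.db")
    con.execute("PRAGMA journal_mode=WAL")
    # OFF skips fsync entirely: fast, but power loss or a kernel panic can
    # corrupt the file. FULL fsyncs at every commit; NORMAL (in WAL mode)
    # only syncs at checkpoints, trading a little durability for speed.
    con.execute("PRAGMA synchronous=OFF")
    print(con.execute("PRAGMA synchronous").fetchone())  # (0,) means OFF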
Doesn't a BBU RAID controller largely address the power loss concern? And I always thought BBU RAID controllers were just absolute common sense for any serious database (until everyone went cloud and suddenly basics like dual network cards, dual PSUs and RAID controllers didn't fit Amazon's desire to sell complexity).
Right.. but you leave fsync on in this case, no? That might be incompatible with the locking required by this server-process edition (I guess, no clue). But more generally speaking, fsync=on with a proper RAID controller gives a massive performance boost (orders of magnitude) while being relatively resilient to power loss.
It may be a bit out-of-date, but BerkeleyDB provides a SQLite front-end that supports page-level locking -- and thus has better TPS when multiple threads are writing. It's basically SQLite with the btree ripped out and replaced with BerkeleyDB. More information here [0].
If this requires turning off pragma synchronous, doesn't that disqualify it for anything but read only tasks? Power loss etc would corrupt the database...
Seriously, as impressive as SQLite is for what it is, people shouldn't be given the impression that it doesn't come with some massive and often surprising caveats when viewed as a general purpose database.
Looks like I misremembered this (re 64-bit integers), and it was one of the other slightly odd limitations SQLite has around rowids and foreign keys that caused me to abandon the last system I was designing with it...
What concerned me most when using SQLite was the ease with which it would happily allow me to write broken queries and not raise a fuss, e.g. referencing non-GROUP BYed terms in an aggregation.
Can you provide example(s) of the broken query with non-GROUP BY columns in an aggregation? (Or a reference link if this is well known - I've searched briefly but can't find anything.)
TBH I like the ability to do this. Sometimes you truly don't care which value you get, you just want one. Sometimes you have application-level constraints which mean the one it picks is guaranteed to be correct, but you can avoid the cost of enforcing that (at insertion time and at lookup time). It's a handy tool when it's appropriate.
But hell yes, tools should warn you about this loudly, as it's not generally correct.
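A quick example of what we're talking about (toy table, made up for illustration): this runs without complaint in SQLite, while Postgres rejects it outright.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE payments (customer TEXT, amount INTEGER);
        INSERT INTO payments VALUES ('alice', 10), ('alice', 5), ('bob', 7);
    """)
    # 'customer' is neither aggregated nor in a GROUP BY; SQLite silently
    # picks one value for it. Postgres (and MySQL in strict mode) would
    # reject this query instead.
    print(con.execute("SELECT customer, SUM(amount) FROM payments").fetchall())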
We have been using this page-locking technology for quite a while now and it works incredibly well. Having both read and write concurrency is super powerful and scales fantastically.
First off, it is not a single instance, but many SQLite databases, many of which hold 20+ TiB of data.
For the last 10+ years I have worked at a backup company and we have gone through some iterations of storage backends, including an in-house system that fell on its face. So far the only thing that has been able to keep up with our demands has been SQLite, although there have been some hurdles.
Because of how we want isolation between users, having any sort of shared database system really is not an option. And in the early days our users liked the idea of simply copying a user's dataset and sending it someplace else.
In any case we tested (I should say a guy named Brain on our team tested) many different systems and ended up picking SQLite many years ago. It has stuck and served us very, very well. It was not until we wanted to access the data in different ways that we ran into concurrency issues and made a few weird decisions about how to access the data to avoid some of those problems.
These databases store the volume data for each volume we back up. And because it is a database, we can track the blocks in a way that allows us to present a consistent view of each backup in time. It has been good enough to export volumes as iSCSI LUNs (via a custom iSCSI server I wrote to interface with the backend), and in turn to virtualize people's backed-up systems on the fly without moving data from one format to another.
Now that I think about it, ~7PB is only half of it; that is just the data managed on our own hardware. We have many other people storing just as much, if not more, on their own systems.
Looking back at my original comment, it may seem like I implied that we have a single 7PB database, but no, it is 1000s of smaller ones, between 8 and 30 TiB in most cases.
Have we had data loss? Yes, but it was mostly due to improperly using SQLite. Except for one occasion where we were sure it was SQLite itself, which I am sure Mr. Hipp will refute :p, but I forget the details and a fix was applied quickly.
SQLite is nice, and I would recommend it to anybody for just about any use, even uses you might not have considered for an embedded database.
Yes, the block data is directly inserted into the database.
This has worked fine, but I don't think it is optimal. I think we might get some performance increases if we just used SQLite to keep block offsets into a flat file. But then there are a bunch of other problems to solve.
Data is compressed, so we would have to handle free-hole allocation for variable-sized data, or simply waste free space. Not that SQLite does any better with those things; they are just things we would have to solve in general.
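Roughly what I mean by the offset-table idea, as a sketch (column names are made up, not our actual schema):

    import sqlite3

    con = sqlite3.connect("blockmap.db")
    con.executescript("""
        -- Block payloads would live in a big flat file; SQLite only tracks
        -- where each compressed block landed and how large it is.
        CREATE TABLE IF NOT EXISTS blocks (
            block_hash  BLOB PRIMARY KEY,
            file_id     INTEGER NOT NULL,
            byte_offset INTEGER NOT NULL,  -- where the block starts in the flat file
            byte_length INTEGER NOT NULL   -- compressed size on disk
        );
    """)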
Transactions. It's relatively easy to atomically edit random data in a file on disk, if you're willing to overwrite the entire file with your new version. But as soon as you need to edit files that are too big for that approach, the problem becomes exceptionally complicated.
Indexing any columns you want in one line of code is also very nice.
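For instance (table and column names made up, just to show the one-liner):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (ts INTEGER, kind TEXT)")  # made-up table
    con.execute("CREATE INDEX events_by_ts ON events(ts)")      # the one line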
Comparable questions are why use a library when you can just write the code yourself, or why use a high level language when you can write it in assembly. SQLite gives you a lot of power, having already solved a lot of the problems and with a lot more testing, that you then don't have to. The price you pay for that is using their API, which, in this case, is SQL. Using an ORM is a separate choice. You may choose differently than others on these decisions, but using SQLite is certainly a reasonable decision that many people find compelling.
It's also useful to point out that SQLite can have many concurrent reader processes. It's only writes that are limited to one at a time.
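A minimal sketch of what that looks like in practice (file name made up): any number of processes can open the same database read-only, and in WAL mode readers don't block the writer either.

    import sqlite3

    # Each reader process can do this independently; only one write
    # transaction can be active at a time in stock SQLite.
    ro = sqlite3.connect("file:app.db?mode=ro", uri=True)
    print(ro.execute("SELECT count(*) FROM sqlite_master").fetchone())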
SQLite is a data structure. The process of finding things on disk and atomically updating them is not as simple as it sounds and is about 90% of the work of a database.
To add to the arguments already made: You don't have to figure out serialization. In particular, writing out structs in C and C++ can break even between compiler versions if padding changes.
That's fine when you don't need complex queries or your data is small. As complexity increases you will eventually end up with a buggy slow relational database.
The collision rate (percentage of attempted transactions that failed due to a page-level locking conflict) in all tests was between 1 and 2%. Failed transactions are not included in the TPS counts below.
I read this as: retry logic required. That's a bit onerous IMO. Wrapping a single writer in a mutex gives near rw=1 performance in the cases I tested.
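For reference, the kind of retry wrapper I mean (a rough sketch, not production code; the function and its parameters are made up):

    import random
    import sqlite3
    import time

    def run_with_retry(con, txn, attempts=5):
        # Re-run the whole transaction if it aborts on a conflict or on
        # SQLITE_BUSY; the caller's `txn` must be safe to repeat.
        for i in range(attempts):
            try:
                with con:            # commits on success, rolls back on exception
                    return txn(con)
            except sqlite3.OperationalError:
                time.sleep(random.uniform(0, 0.01 * (i + 1)))  # back off
        raise RuntimeError("transaction kept colliding, giving up")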
(a) Does this mean that there will be an sqlite version that can be deployed as a process accessible over network in servers?
(b) Does this mean that sqlite will now have the capability to do concurrent reads/writes, but that it is up to users to implement a process that can take advantage of this and create something like (a)?
> (a) Does this mean that there will be an sqlite version that can be deployed as a process accessible over network in servers?
No.
> (b) Does this mean that sqlite will now have the capability to do concurrent reads/writes, but that it is up to users to implement a process that can take advantage of this and create something like (a)?
Yes... this allows a multi-threaded application for example to have multiple readers/writers without blocking each other.
Comdb2 is a database system that uses SQLite as a query planner. It also allows Lua stored procedures. It has a different backend from SQLite with different capabilities.