> Prior to this, getting up and running from a cold-start might’ve required installing or even compiling several OSS packages, carefully noting path locations, standing up a specialized database… Enough work that a data generalist might not have bothered, or their IT department might not have supported it.
I've been able to "CREATE EXTENSION postgis;" for more than a decade. There have been spatial extensions for PG, MySQL, Oracle, MS SQL Server, and SQLite for a long time. DuckDB doesn't make any material difference in how easy it is to install.
That requires data to already be in Postgres, otherwise you have to ETL data into it first.
DuckDB on the other hand works with data as-is (Parquet, TSV, SQLite, Postgres... whether on disk, S3, etc.) without requiring an ETL step (though if the data isn't already in a columnar format, things are gonna be slow... but it will still work).
I work with Parquet data directly with no ETL step. I can literally drop into Jupyter or a Python REPL and duckdb.query("from '*.parquet'")
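Spelled out a bit (paths made up), that's the whole workflow:

    import duckdb

    # Query the Parquet files in place -- no load/COPY step, no server to stand up.
    duckdb.query("SELECT count(*) FROM 'data/*.parquet'").show()

    # Or pull the result straight into pandas for whatever comes next:
    df = duckdb.query("FROM 'data/*.parquet' LIMIT 1000").df()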
Correct me if I'm wrong, but I don't think that's possible with PostGIS (even pg_parquet requires copying? [1]).
Yeah, if you want to work with GeoParquet, and you want to keep your data in that format. I can see how that's easier, to use your example. That's not what a lot of geospatial data is in, though. You might have shapefiles, geopackages, geojsons, who knows? There is a lot of software, from QGIS to ESRI, for working with different formats to solve different problems. I don't think GeoParquet, even though it might be the fastest geospatial vector data format right now, is that common, and the article did not claim that either. So for an average user trying to answer some GIS question, some ETL is pretty much a given. And given that, installing PostGIS and installing DuckDB both require some ETL, and learning some query and analytics language. DuckDB might be an improvement, but it's certainly not as much of a leap as the quote is making it out to be.
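For concreteness, the load step on the DuckDB side is something like this (spatial extension plus a made-up parcels.shp; still ETL, just a short one):

    import duckdb

    con = duckdb.connect("gis.duckdb")
    con.sql("INSTALL spatial")
    con.sql("LOAD spatial")

    # ST_Read goes through GDAL, so shapefiles, GeoPackages, GeoJSON, etc.
    # all load the same way.
    con.sql("CREATE TABLE parcels AS SELECT * FROM ST_Read('parcels.shp')")
    con.sql("SELECT count(*) FROM parcels").show()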
I'll give DuckDB a shot and compare it to PostGIS. Eventually, I think the issue for me would be that I use all the vector outputs via apps connecting to the same database.
Yeah, just an example of a QoL issue with DuckDB: even though it can glob files in other cases, the way it passes parameters to GDAL means that globs are taken literally instead of expanded. So I can't query a directory with thirty million GeoJSON files. This is not a problem in geopandas, because IPython, being a full interactive development environment, lets me produce the glob any way I choose.
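i.e. something along these lines, with made-up paths -- I expand the glob myself and feed the list in:

    import glob
    import pandas as pd
    import geopandas as gpd

    # Build the file list in Python instead of hoping the engine expands the glob.
    paths = glob.glob("tiles/**/*.geojson", recursive=True)

    # Read and concatenate (for the real thirty million files you'd batch this,
    # but the point is that I control the listing).
    gdf = pd.concat((gpd.read_file(p) for p in paths), ignore_index=True)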
I think this is a fundamental problem with the SQL pattern. You can try to make things just work, but when they fail, then what?
I think this is just because it hasn't been implemented in spatial yet. DuckDB is currently going through a pretty big refactor of the way we glob/scan/union multiple files with all the recent focus on data lake formats, but my plan is to get to it in spatial after the next release, when that part of the code has stabilized a bit.
SQL is a DSL and yes, all Domain Specific Languages will only enable what the engine parsing the DSL supports.
But all SQL databases I'm aware of let you write custom extensions, which are exactly that: they extend the base functionality of the database with new paradigms, e.g. PostGIS enabling geospatial in Postgres, or the extensions that enable fuzzy matching/searching.
And as SQL is pretty much a Turing-complete DSL, there is very little you can't do with it, even if the syntax might not agree with everyone.
The original article feels a tremendous amount like another piece of DuckDB marketing, from the breathless admiration to baseless claims like the title.