
I'm a big fan of DuckDB and I do geospatial analysis, mostly around partitioning geographies (into Uber H3 hexagons), calculating Haversine distances, calculating areas of geometries, figuring out which geometry a point falls in, etc. Many of these features have existed in some form or another in geopandas or PostGIS, so DuckDB's spatial extension brings nothing new there.

But what DuckDB as an engine does is let me work directly on Parquet/GeoParquet files at scale (vectorized and parallelized) on my local desktop. It beats geopandas in that respect; it's a quality-of-life improvement, to say the least.
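
If it helps anyone picture it, a typical session looks roughly like this (the file and column names are made up; depending on your DuckDB/spatial version the GeoParquet geometry column may come through as GEOMETRY directly or need ST_GeomFromWKB first):

    import duckdb

    con = duckdb.connect()
    con.install_extension("spatial")
    con.load_extension("spatial")

    # 'parcels.geoparquet' and its columns are hypothetical stand-ins.
    df = con.sql("""
        SELECT
            parcel_id,
            ST_Area(geom)                              AS area_crs_units,  -- area in CRS units
            ST_Contains(geom, ST_Point(-73.98, 40.75)) AS contains_point   -- point-in-polygon test
        FROM read_parquet('parcels.geoparquet')
    """).df()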

DuckDB also has an extension architecture that admits more exotic geospatial features, like Hilbert curves and Uber H3 support.

https://duckdb.org/docs/stable/extensions/spatial/functions....

https://duckdb.org/community_extensions/extensions/h3.html
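
As a rough sketch of the H3 side (table and column names are hypothetical; h3_latlng_to_cell is the cell-indexing function as documented on the extension page above):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL h3 FROM community")
    con.execute("LOAD h3")

    # 'trips.parquet' with pickup_lat / pickup_lon columns is made up for illustration.
    con.sql("""
        SELECT
            h3_latlng_to_cell(pickup_lat, pickup_lon, 8) AS h3_cell,  -- resolution-8 hexagon
            count(*) AS trips
        FROM read_parquet('trips.parquet')
        GROUP BY h3_cell
        ORDER BY trips DESC
        LIMIT 10
    """).show()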



I totally agree with this. DuckDB was a huge QoL improvement for me just for working with random datasets. I found it much easier to explore datasets with DuckDB than with pandas, Postgres, or Databricks.

The spatial features were just barely out when I was last doing a lot of heavy geospatial work, but even then they were very nice.

As an aside, I had a junior who would just load datasets into PowerBI to explore them for the first time, and that was actually a shockingly useful workflow.

pandas is very nice and was my bread and butter for a long time, but I frequently ran into memory issues and problems at scale that I never hit with Polars or DuckDB. I'm not sure whether this still holds today, since I know there have been updates, but it was certainly a problem then. geopandas ran into the same issues.

Just using GDAL and other libraries out of the box is frankly not a great experience. If you have a QGIS workflow (another wonderful tool), it's frustrating to have to drop into Jupyter notebooks to do translations, but that seemed to be the best option.

In general, it just feels like geospatial analysis is about 10 years behind regular data analysis. Shapefiles are common because of Esri's dominance, but they're frankly not a great format. PostGIS is great, geopandas is great, but there's a lot more to the data ecosystem than just Postgres and pandas. PowerBI barely had geospatial support a couple of years ago. I think PowerBI Shapemaps exclusively used TopoJSON?

All of this is to say, DuckDB geospatial is very cool and helpful.


> As an aside, I had a junior who would just load datasets into PowerBI to explore them for the first time, and that was actually a shockingly useful workflow.

What was shockingly useful in PowerBI compared to DuckDB?


Graphics. Good BI tools are very effective exploratory data analysis tools.


Why do you use haversine rather than geodesic distances or reprojection?

I've been doing the reprojection thing, projecting coordinates to a "local" CRS, on previous projects, mainly because that's what geopandas recommends and is built around. But I'm reaching a stage where I'd like to calculate distances for objects all over the globe, and I'm genuinely interested to learn what a good choice is here.
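
For context, the workflow I mean looks roughly like this (the layer names and the UTM zone are made up; to_crs and sjoin_nearest are the geopandas pieces I'm referring to):

    import geopandas as gpd

    sites = gpd.read_file("sites.gpkg")     # lon/lat, EPSG:4326
    depots = gpd.read_file("depots.gpkg")

    # Reproject both layers to a local metric CRS (here a UTM zone covering the
    # study area) so planar distances come out in metres -- only trustworthy
    # while everything stays inside that zone.
    sites_utm = sites.to_crs(epsg=32633)
    depots_utm = depots.to_crs(epsg=32633)

    nearest = gpd.sjoin_nearest(sites_utm, depots_utm, distance_col="dist_m")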


Just an app dev, not a geospatial expert, but reprojection always seemed like something a library should handle under the hood unless one has specific needs. I'm used to the ergonomics / moron-proofing of something like PostGIS's `ST_Distance(geography point1, geography point2)`, which gives you the right answer in meters. You can easily switch to spherical or Cartesian distances if you need the calculations to go faster. `ST_Area(geography geog)` gives you the size of your shape in square meters wherever it is on the planet.


What the "right answer" is will vary widely with the application and the analyst; that's one reason there are so many coordinate reference systems. If you keep everything in 3857, you'll get answers in ~meters, but whether that's "right" depends on where and how large the distance or geometry is and on your precision threshold. So, really, everyone's needs are necessarily "specific."


Look into Vincenty [1] or Karney (for a more robust solution) [2]. Vincenty should be good enough for most use cases.

[1] https://en.wikipedia.org/wiki/Vincenty's_formulae

[2] https://github.com/pbrod/karney
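
If you'd rather not implement either formula yourself, pyproj's Geod exposes PROJ's geodesic routines (which use Karney's algorithm); a minimal sketch with made-up coordinates:

    from pyproj import Geod

    geod = Geod(ellps="WGS84")

    # pyproj takes lon/lat order; the third return value is the distance in metres.
    _, _, dist_m = geod.inv(-0.1278, 51.5074, -74.0060, 40.7128)  # London -> New York
    print(dist_m)  # roughly 5.57e6 m on the WGS84 ellipsoid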


Reprojection is accurate locally but inaccurate at scale.

Geodesics are the most accurate (Vincenty, etc.) but computationally heavy.

Haversine is a nice middle ground.
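
For reference, haversine is only a few lines; it assumes a spherical Earth, so expect errors up to roughly 0.5% versus an ellipsoidal geodesic:

    from math import radians, sin, cos, asin, sqrt

    def haversine_m(lat1, lon1, lat2, lon2, r=6_371_008.8):
        """Great-circle distance in metres on a sphere of mean Earth radius r."""
        phi1, phi2 = radians(lat1), radians(lat2)
        a = (sin(radians(lat2 - lat1) / 2) ** 2
             + cos(phi1) * cos(phi2) * sin(radians(lon2 - lon1) / 2) ** 2)
        return 2 * r * asin(sqrt(a))

    print(haversine_m(51.5074, -0.1278, 40.7128, -74.0060))  # London -> NYC, ~5.57e6 m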


But is it actually inaccurate at scale?

I get that drawing a projection is inaccurate at scale, but if I do all my calculations as meters north/south and east/west of 0,0 on the equator, won't all my distance calculations be correct?

Like, 5,000,000 m east and 0 m north is 5,000 km from the origin. I cannot see how that could ever become inaccurate as I move further away.

Where is the inaccuracy introduced? When I reproject back to coordinates on a globe?


Btw, you can plug your comment into ChatGPT and it'll give you a reasonable answer. The short answer is: distortions.
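
Concretely: treating "metres east/west of (0, 0)" as a flat plane is essentially a plate carrée projection, and its east-west scale is off by a factor of 1/cos(latitude). A quick sketch (spherical Earth, made-up points):

    from math import radians, sin, cos, asin, sqrt

    R = 6_371_008.8  # mean Earth radius, metres

    def great_circle_m(lat1, lon1, lat2, lon2):
        # haversine: actual ground distance on a sphere
        a = (sin(radians(lat2 - lat1) / 2) ** 2
             + cos(radians(lat1)) * cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) ** 2)
        return 2 * R * asin(sqrt(a))

    def flat_plane_m(lat1, lon1, lat2, lon2):
        # "metres north/south and east/west of (0, 0)": x = R * lon, y = R * lat
        dx = R * radians(lon2 - lon1)
        dy = R * radians(lat2 - lat1)
        return sqrt(dx * dx + dy * dy)

    # Two points at 60 deg N, 10 deg of longitude apart:
    print(great_circle_m(60, 0, 60, 10))  # ~555 km of actual ground distance
    print(flat_plane_m(60, 0, 60, 10))    # ~1,112 km in the plane: 1/cos(60 deg) = 2x too long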


Looking just at the Hilbert reference, I'm wondering why there is no function that returns, for a given level of precision, the set of segments along the curve corresponding to a sub-rectangle of the space. Is this functionality packaged up elsewhere?



