Agreed, I find this to be a super productive environment, because you get all of vscode's IDE features plus the niceties of Jupyter and IPython.
I wrote a small vscode extension that builds upon this to automatically infer code blocks via indentation, so that you don't have to select them manually: [0]
I develop Lonboard [0], a Python library for plotting large geospatial data. If you have small data (~max 30,000 coordinates), leaflet-based Python libraries like folium and ipyleaflet can be fine, but because Lonboard uses deck.gl for GPU-accelerated rendering, it's 30-50x faster than leaflet for large datasets [1].
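Basic usage looks roughly like this (the path is just a placeholder; `viz` picks a deck.gl layer type based on the geometry):
> import geopandas as gpd
> from lonboard import viz
> # any GeoDataFrame works; this path is only a placeholder
> gdf = gpd.read_file("data/my_points.fgb")
> # renders an interactive GPU-accelerated map in the notebook
> viz(gdf)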
It can read from HTTP URLs, but you'd need to manage signing the URLs yourself. On the writing side, it currently writes to an ArrayBuffer, which you could then upload to a server or save on the user's machine.
Arrow JS is just ArrayBuffers underneath. You do want to amortize some operations to avoid unnecessary conversions. E.g. Arrow JS stores strings as UTF-8, but native JS strings are UTF-16, I believe.
Arrow is especially powerful across the WASM <--> JS boundary! In fact, I wrote a library to interpret Arrow from Wasm memory into JS without any copies [0]. (Motivating blog post [1])
Yeah, we built it to essentially stream columnar record batches from server GPUs to browser GPUs with minimal touching of any of the array buffers. It was very happy-path for that kind of fast bulk columnar processing, and we donated it to the community to grow to use cases beyond that. So it sounds like the client code may have been doing more than that.
For high-performance code, I'd have expected overhead in the percents, not in multiples. And I'm not surprised to hear slowdowns for anything straying beyond that happy path -- cool to see folks have expanded further! More recently, we've been having good experiences here with Perspective <-arrow-> Loaders, enough so that we haven't had to dig deeper. Our current code targets < 24 FPS, as genAI data analytics is more about bigger volumes than velocity, so I'm unsure beyond that. It's hard to imagine going much faster, though, given it's bulk typed arrays without copying, especially in real code.
Sorry, this is not true _at all_ for geospatial data.
A quick benchmark [0] shows that saving to GeoPackage, FlatGeobuf, or GeoParquet is roughly 10x faster than saving to CSV. Additionally, the CSV is much larger than any of the other formats.
And here's my quick benchmark, dataset from my full-time job:
> import geopandas as gpd
> import pandas as pd
> from shapely.geometry import Point
> d = pd.read_csv('data/tracks/2024_01_01.csv')
> d.shape
(3690166, 4)
> list(d)
['user_id', 'timestamp', 'lat', 'lon']
> %%timeit -n 1
> d.to_csv('/tmp/test.csv')
14.9 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> d2 = gpd.GeoDataFrame(d.drop(['lon', 'lat'], axis=1), geometry=gpd.GeoSeries([Point(*i) for i in d[['lon', 'lat']].values]), crs=4326)
> d2.shape, list(d2)
((3690166, 3), ['user_id', 'timestamp', 'geometry'])
> %%timeit -n 1
> d2.to_file('/tmp/test.gpkg')
4min 32s ± 7.5 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit -n 1
> d.to_csv('/tmp/test.csv.gz')
37.4 s ± 291 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> ls -lah /tmp/test*
-rw-rw-r-- 1 culebron culebron 228M Mar 26 21:10 /tmp/test.csv
-rw-rw-r-- 1 culebron culebron 63M Mar 26 22:03 /tmp/test.csv.gz
-rw-r--r-- 1 culebron culebron 423M Mar 26 21:58 /tmp/test.gpkg
CSV saved in 15s, GPKG in 272s. 18x slowdown.
I guess your dataset is country borders, isn't it? Something that 1) has few records and makes a small r-tree, and 2) contains linestrings/polygons that can be densified, similar to the Google Polyline algorithm.
But a lot of geospatial data is just sets of points. For instance: housing for an entire country (a couple of million points). An address database (IIRC 20M+ points). Or GPS logs of multiple users, received from a logging database, ordered by time, not assembled into tracks -- several million points per day.
For such datasets, use CSV; don't abuse indexed formats. (Unless you store the data for a long time and actually use the index for spatial searches, multiple times.)
Your issue is that you're using the default (old) binding to GDAL, based on Fiona [0].
You need to use pyogrio [1], its vectorized counterpart, instead. Make sure you use `engine="pyogrio"` when calling `to_file` [2]. Fiona does a loop in Python, while pyogrio is fully compiled, so pyogrio is usually about 10-15x faster than Fiona. Soon, in pyogrio version 0.8, it will be another ~2-4x faster than it is now [3].
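With your `d2` from above, that's roughly this (recent geopandas versions also expose a global switch, if I remember correctly):
> import geopandas as gpd
> # per call:
> d2.to_file('/tmp/test.gpkg', engine='pyogrio')
> # or set it once for all read_file/to_file calls, on recent geopandas versions:
> gpd.options.io_engine = 'pyogrio'
> d2.to_file('/tmp/test.fgb', driver='FlatGeobuf')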
CSV is still faster than the geo formats even with pyogrio. From what I saw, it writes most of the file quickly, then spends a long time, I think, building the spatial index.
> %%timeit -n 1
> d.to_csv('/tmp/test.csv')
10.8 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit -n 1
> d2.to_file('/tmp/test.gpkg', engine='pyogrio')
1min 15s ± 5.96 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit -n 1
> d.to_csv('/tmp/test.csv.gz')
35.3 s ± 1.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit -n 1
> d2.to_file('/tmp/test.fgb', driver='FlatGeobuf', engine='pyogrio')
19.9 s ± 512 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> ls -lah /tmp/test*
-rw-rw-r-- 1 culebron culebron 228M Mar 27 11:02 /tmp/test.csv
-rw-rw-r-- 1 culebron culebron 63M Mar 27 11:27 /tmp/test.csv.gz
-rw-rw-r-- 1 culebron culebron 545M Mar 27 11:52 /tmp/test.fgb
-rw-r--r-- 1 culebron culebron 423M Mar 27 11:14 /tmp/test.gpkg
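If it's really the index, writing with the index disabled should show it. Something like this (assuming GDAL's SPATIAL_INDEX layer creation option for GPKG/FlatGeobuf, and that pyogrio forwards extra kwargs as creation options -- I haven't verified):
> # hypothesis check: skip spatial index creation and compare timings
> d2.to_file('/tmp/test_noindex.gpkg', engine='pyogrio', SPATIAL_INDEX='NO')
> d2.to_file('/tmp/test_noindex.fgb', driver='FlatGeobuf', engine='pyogrio', SPATIAL_INDEX='NO')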
That's why I'm working on the GeoParquet spec [0]! It gives you both compression-by-default and super fast reads and writes! So it's usually as small as gzipped CSV, if not smaller, while being faster to read and write than GeoPackage.
Try using `GeoDataFrame.to_parquet` and `geopandas.read_parquet`.
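With the `d2` from your benchmark, that's just:
> import geopandas as gpd
> # GeoParquet: columnar and compressed by default (snappy), no spatial index to build
> d2.to_parquet('/tmp/test.parquet')
> # zstd usually gives a smaller file for a bit more CPU
> d2.to_parquet('/tmp/test_zstd.parquet', compression='zstd')
> back = gpd.read_parquet('/tmp/test.parquet')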
Another, seemingly very similar, project released in the last few days: https://github.com/raulcd/datanomy