Vector data in Arraylake
You can use Arraylake to ingest, analyze, and write vector geometries in addition to raster data. This notebook demonstrates general patterns for working with vector data in Arraylake; it is divided into two main sections:
- Ingesting GeoParquet data to Arraylake
- Storing vector data cubes in Arraylake, after joining vector and raster data
Connect to Arraylake
import arraylake as al

# Connect to Arraylake and open the repo used throughout this notebook
client = al.Client()
repo = client.get_repo("earthmover/vector_datacube")
Query Overture Maps
We will use vector data from Overture Maps. These datasets are quite large; rather than read an entire dataset, we use DuckDB to ingest only the data we're interested in. Overture Maps has great resources demonstrating how to query their data with DuckDB.
Here is a DuckDB query that will pull all Overture places data categorized as 'wind_energy' within a spatial bounding box covering Germany.
ger_wind_query = """
SELECT
    id,
    names.primary,
    categories.primary as category,
    ST_AsText(ST_GeomFromWKB(geometry)) as geometry
FROM read_parquet("s3://overturemaps-us-west-2/release/2024-07-22.0/theme=places/type=place/*", hive_partitioning=1)
WHERE
    categories.primary IN ('wind_energy')
    AND bbox.xmin > 5.8663153
    AND bbox.xmax < 15.0419319
    AND bbox.ymin > 47.2701114
    AND bbox.ymax < 55.099161
"""
Using the DuckDB Python API, create an in-memory connection that can run spatial queries against data in S3 object storage.
import duckdb
import geopandas as gpd
# create duckdb connection
conn = duckdb.connect()
# load extensions
conn.load_extension("spatial")
conn.load_extension("httpfs")
# specify aws region
conn.execute("SET s3_region='us-west-2'");
Ingesting GeoParquet data to Arraylake
The function below executes the DuckDB query and converts the resulting pandas.DataFrame to a geopandas.GeoDataFrame.
def overture_query(query: str) -> gpd.GeoDataFrame:
    import geopandas as gpd
    import shapely

    # Run the query and pull the result into a pandas DataFrame
    df = conn.execute(query).fetchdf()
    # Parse WKT strings into shapely geometries and set the CRS
    gdf = gpd.GeoDataFrame(df)
    gdf["geometry"] = gdf["geometry"].apply(shapely.wkt.loads)
    gdf = gdf.set_geometry("geometry").set_crs("EPSG:4326")
    return gdf
ger_wind_gdf = overture_query(ger_wind_query)
print(ger_wind_gdf.head())
                                 id                               primary     category                  geometry
0  08f1fa355139106d038022d0afb88e55  Chauffages et Sanitaires Grethen Jos  wind_energy  POINT (6.30365 49.51271)
1  08f1fa3c96b2eb5c03a6d6e8ad7d8c68                              Duvivier  wind_energy  POINT (5.87755 49.59550)
2  08f1fa3cd1bb02a4033f6b0ef0900eb1                         LuxEnergie SA  wind_energy  POINT (6.16488 49.62893)
3  08f1fa3ceb98203003b3d967139ab3f9                 Lënster Energie Sà rl  wind_energy  POINT (6.25885 49.69611)
4  08f1fa3195404a16035b92b7b79fa7eb                               Bio Man  wind_energy  POINT (6.36202 49.66138)
We use Xvec to convert the GeoDataFrame to an xarray.Dataset with a spatially-aware GeometryIndex.
import xarray as xr

ger_wind_ds = xr.Dataset(
    {
        "primary": ("geometry", ger_wind_gdf["primary"]),
        "id": ("geometry", ger_wind_gdf["id"]),
    },
    coords={"geometry": ger_wind_gdf["geometry"]},
).xvec.set_geom_indexes("geometry", crs=ger_wind_gdf.crs)
print(ger_wind_ds)
<xarray.Dataset> Size: 624B
Dimensions:   (geometry: 26)
Coordinates:
  * geometry  (geometry) object 208B POINT (6.303651 49.512707) ... POINT (13...
Data variables:
    primary   (geometry) object 208B 'Chauffages et Sanitaires Grethen Jos' ....
    id        (geometry) object 208B '08f1fa355139106d038022d0afb88e55' ... '...
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
For more details and discussion of vector data cubes, check out the Xvec documentation and our recent blog post.
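The GeometryIndex makes the geometry coordinate spatially queryable. As a quick, minimal sketch (the bounding box below is an arbitrary example, and the exact signature of the query accessor should be confirmed against the Xvec documentation), you could subset the cube to points falling inside an area of interest:
import shapely

# Hypothetical area of interest (minx, miny, maxx, maxy) around Luxembourg
aoi = shapely.box(5.7, 49.4, 6.6, 50.2)

# Keep only the geometries that intersect the area of interest,
# using the spatial query exposed by the GeometryIndex
subset = ger_wind_ds.xvec.query("geometry", aoi, predicate="intersects")
print(subset.sizes)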
Write vector data to Arraylake
Now we have a 1-D vector data cube that resides in memory. To persist it, we need to encode the geometries in a format that Zarr can store. In memory, geometries are represented as Shapely objects, which aren't compatible with array-based file formats such as Zarr. To write the data, we encode the geometries according to the Climate and Forecast (CF) metadata conventions' specification for geometries, which is compatible with both the Zarr and NetCDF file formats. Xvec handles this conversion using functionality from the cf_xarray package. Once encoded, the data can be written to Arraylake as usual.
# Encode shapely geometries as CF
encoded_wind_ds = ger_wind_ds.xvec.encode_cf()
# Write zarr data to Arraylake Repo
encoded_wind_ds.to_zarr(repo.store, group="germany_wind", mode="w")
# Commit changes
repo.commit("write germany wind gdf to zarr")
66b4f8a19a4d239ce895e0d5
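To read the data back later, open the Zarr group from the repo and decode the CF geometry encoding into shapely geometries again. A minimal sketch of that round trip, assuming Xvec's decode_cf (the counterpart of the encode_cf call above) and the same group name:
# Re-open the encoded group from the Arraylake repo
roundtrip_ds = xr.open_zarr(repo.store, group="germany_wind")

# Decode the CF geometry variables back into shapely geometries with a GeometryIndex
decoded_wind_ds = roundtrip_ds.xvec.decode_cf()
print(decoded_wind_ds)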
Vector data cubes in Arraylake
Section 1 demonstrated reading GeoParquet data as a GeoPandas GeoDataFrame, converting the dataframe to an Xarray Dataset with spatially-aware indexes, and writing that object to Arraylake. Now, we will merge raster data with the vector data cube created in section 1 to build a multi-dimensional vector data cube, sampling ERA5 atmospheric reanalysis data at each of the geometries. This section builds on the vector data cube workflow described in this blog post.
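As a preview of that workflow, here is a minimal sketch of the sampling step. It assumes an ERA5 xarray.Dataset named era5_ds with latitude and longitude coordinates (the variable name and coordinate names are assumptions, not outputs from this notebook) and uses Xvec's extract_points to sample the raster at each wind-energy geometry:
# era5_ds: an ERA5 reanalysis Dataset with "latitude"/"longitude" coordinates (assumed)
# Sample the raster at each point geometry, producing a cube
# dimensioned by the raster's other dims (e.g. time) plus geometry
era5_at_points = era5_ds.xvec.extract_points(
    ger_wind_ds.geometry,
    x_coords="longitude",
    y_coords="latitude",
)
print(era5_at_points)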