Vector data in Arraylake

You can use Arraylake to ingest, analyze, and write vector geometries in addition to raster data. This notebook demonstrates general patterns for working with vector data in Arraylake; it is divided into two main sections:

  1. Ingesting GeoParquet data to Arraylake
  2. Storing vector data cubes in Arraylake, after joining vector and raster data
    • Vector data cubes with Shapely Point geometries
    • Vector data cubes with Shapely Polygon geometries

Connect to Arraylake

import arraylake as al

client = al.Client()
repo = client.get_repo("earthmover/vector_datacube")

Query Overture Maps

We will use vector data from Overture Maps. These datasets are quite large; rather than read an entire dataset, we use DuckDB to ingest only the data we're interested in. Overture Maps has great resources demonstrating how to query their data with DuckDB.

Here is a DuckDB query that will pull all Overture places data categorized as 'wind_energy' within a spatial bounding box covering Germany.

ger_wind_query = """
SELECT
id,
names.primary,
categories.primary as category,
ST_AsText(ST_GeomFromWKB(geometry)) as geometry
FROM read_parquet("s3://overturemaps-us-west-2/release/2024-07-22.0/theme=places/type=place/*", hive_partitioning=1)
WHERE
categories.primary IN ('wind_energy')
AND bbox.xmin > 5.8663153
AND bbox.xmax < 15.0419319
AND bbox.ymin > 47.2701114
AND bbox.ymax < 55.099161
"""

Using the DuckDB Python API, create an in-memory connection for running spatial queries against S3 object storage.

import duckdb
import geopandas as gpd

# create duckdb connection
conn = duckdb.connect()

# load extensions
conn.load_extension("spatial")
conn.load_extension("httpfs")

# specify aws region
conn.execute("SET s3_region='us-west-2'");

Ingesting GeoParquet data to Arraylake

The function below executes the DuckDB query and converts the resulting pandas DataFrame to a GeoPandas GeoDataFrame.

def overture_query(query: str) -> gpd.GeoDataFrame:
    import shapely

    # execute the query and fetch the result as a pandas DataFrame
    df = conn.execute(query).fetchdf()
    # parse the WKT strings into Shapely geometries and set the CRS
    gdf = gpd.GeoDataFrame(df)
    gdf["geometry"] = gdf["geometry"].apply(shapely.wkt.loads)
    gdf = gdf.set_geometry("geometry").set_crs("EPSG:4326")
    return gdf


ger_wind_gdf = overture_query(ger_wind_query)
print(ger_wind_gdf.head())
Output
                                 id                               primary  \
0  08f1fa355139106d038022d0afb88e55  Chauffages et Sanitaires Grethen Jos
1  08f1fa3c96b2eb5c03a6d6e8ad7d8c68                              Duvivier
2  08f1fa3cd1bb02a4033f6b0ef0900eb1                         LuxEnergie SA
3  08f1fa3ceb98203003b3d967139ab3f9                  Lënster Energie Sàrl
4  08f1fa3195404a16035b92b7b79fa7eb                               Bio Man

      category                  geometry
0  wind_energy  POINT (6.30365 49.51271)
1  wind_energy  POINT (5.87755 49.59550)
2  wind_energy  POINT (6.16488 49.62893)
3  wind_energy  POINT (6.25885 49.69611)
4  wind_energy  POINT (6.36202 49.66138)

We use Xvec to convert the GeoDataFrame to an Xarray Dataset with a spatially-aware GeometryIndex.

import xarray as xr
import xvec

ger_wind_ds = xr.Dataset(
    {
        "primary": ("geometry", ger_wind_gdf["primary"]),
        "id": ("geometry", ger_wind_gdf["id"]),
    },
    coords={"geometry": ger_wind_gdf["geometry"]},
).xvec.set_geom_indexes("geometry", crs=ger_wind_gdf.crs)
print(ger_wind_ds)
Output
<xarray.Dataset> Size: 624B
Dimensions:   (geometry: 26)
Coordinates:
  * geometry  (geometry) object 208B POINT (6.303651 49.512707) ... POINT (13...
Data variables:
    primary   (geometry) object 208B 'Chauffages et Sanitaires Grethen Jos' ....
    id        (geometry) object 208B '08f1fa355139106d038022d0afb88e55' ... '...
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)

For more details and discussion of vector data cubes, check out the Xvec documentation and our recent blog post.

Write vector data to Arraylake

Now we have a 1-D vector data cube that resides in memory. To save it to disk, we need to encode the geometries in a format compatible with Zarr. In memory, geometries are represented as Shapely objects, but these aren't compatible with array-based file formats such as Zarr. To write the data, we encode the geometries according to the Climate and Forecast (CF) Metadata Conventions specification for geometries, which is compatible with both the Zarr and NetCDF file formats. This conversion is handled by Xvec using functionality from the cf_xarray package. Once encoded, the data can be written to Arraylake as usual.

# Encode shapely geometries as CF
encoded_wind_ds = ger_wind_ds.xvec.encode_cf()
# Write zarr data to Arraylake Repo
encoded_wind_ds.to_zarr(repo.store, group="germany_wind", mode="w")
# Commit changes
repo.commit("write germany wind gdf to zarr")
Output
66b4f8a19a4d239ce895e0d5

Vector data cubes in Arraylake

Section 1 demonstrated reading GeoParquet data as a GeoPandas GeoDataFrame, converting the dataframe to an Xarray Dataset with spatially-aware indexes, and writing that object to Arraylake. Now we will merge raster data with the vector data cube created in Section 1 to build a multi-dimensional vector data cube, sampling ERA5 atmospheric reanalysis data at a set of vector geometries. This section builds on the vector data cube workflow described in this blog post.

A vector data cube of Point geometries

We'll first join raster and vector data with a set of Shapely Point geometries. We will use the 1-dimensional vector data cube created in Section 1 that contains spatial information about wind power plants in Germany.

Read raster data from Google Cloud Storage

We will use ‘analysis-ready’ ERA5 data from WeatherBench2, which makes ERA5 accessible from Google Cloud Storage.

uri = "gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-6h-0p25deg-chunk-1.zarr-v2"
era5_ds_sub = (
    # Open the dataset
    xr.open_zarr(uri, chunks={"time": 48}, consolidated=True)
    # Select the near-surface level
    .isel(level=0, drop=True)
    # subset in time
    .sel(time=slice("2019-01", "2020-01"))[
        ["2m_temperature", "u_component_of_wind", "10m_u_component_of_wind"]
    ]
)
# convert ERA5 longitude to match Overture geometry coordinates
era5_ds_sub = era5_ds_sub.assign_coords(
    longitude=(((era5_ds_sub.longitude + 180) % 360) - 180)
)
era5_ds_sub = era5_ds_sub.sortby("longitude")
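
The modulo expression wraps ERA5's [0°, 360°) longitude convention onto the [-180°, 180°) range used by the Overture geometries. As a quick sanity check of the arithmetic in plain Python:

# 0 -> 0, 270 -> -90, 350 -> -10
for lon in (0, 90, 270, 350):
    print(lon, ((lon + 180) % 360) - 180)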

Read the vector data from Arraylake

germany_wind_vec = repo.to_xarray("germany_wind")
print(germany_wind_vec)
Output
<xarray.Dataset> Size: 1kB
Dimensions:             (geometry: 26, node: 26)
Coordinates:
    lat                 (geometry) float64 208B ...
    lon                 (geometry) float64 208B ...
    spatial_ref         int64 8B ...
Dimensions without coordinates: geometry, node
Data variables:
    geometry_container  float64 8B ...
    id                  (geometry) object 208B ...
    primary             (geometry) object 208B ...
    x                   (node) float64 208B ...
    y                   (node) float64 208B ...

germany_wind_vec contains the CF-encoded geometries we wrote earlier. To work with it in memory, decode the geometries back to Shapely objects using the xvec.decode_cf() method.

germany_wind_vec = germany_wind_vec.xvec.decode_cf()

Note that we now have an object identical to the vector cube we first created in memory, except that this one was read from object storage:

xr.testing.assert_identical(germany_wind_vec, ger_wind_ds)

The data cube can also be converted back to a GeoDataFrame.

germany_wind_era5_gdf = germany_wind_vec.xvec.to_geodataframe()
print(germany_wind_era5_gdf.head())
Output
                   geometry                                id  \
0  POINT (6.30365 49.51271)  08f1fa355139106d038022d0afb88e55
1  POINT (5.87755 49.59550)  08f1fa3c96b2eb5c03a6d6e8ad7d8c68
2  POINT (6.16488 49.62893)  08f1fa3cd1bb02a4033f6b0ef0900eb1
3  POINT (6.25885 49.69611)  08f1fa3ceb98203003b3d967139ab3f9
4  POINT (6.36202 49.66138)  08f1fa3195404a16035b92b7b79fa7eb

                                primary
0  Chauffages et Sanitaires Grethen Jos
1                              Duvivier
2                         LuxEnergie SA
3                  Lënster Energie Sàrl
4                               Bio Man

Sample the raster data cube at the set of vector geometries using xvec.extract_points(). We then attach the attribute variables from the vector cube to the result, casting them from object dtype to fixed-width strings so that they can later be serialized to Zarr.

germany_wind_vector_cube = era5_ds_sub.xvec.extract_points(
    germany_wind_vec.geometry, x_coords="longitude", y_coords="latitude"
)
# add attr data from vector dataframe to cube
germany_wind_vector_cube.update(
    {
        "primary": ("geometry", germany_wind_vec["primary"].data),
        "id": ("geometry", germany_wind_vec["id"].data),
    }
)
# cast object-dtype attributes to fixed-width strings so they serialize to Zarr
germany_wind_vector_cube["primary"] = germany_wind_vector_cube["primary"].astype(str)
germany_wind_vector_cube["id"] = germany_wind_vector_cube["id"].astype(str)

print(germany_wind_vector_cube)
Output
<xarray.Dataset> Size: 515kB
Dimensions:                  (time: 1584, geometry: 26)
Coordinates:
  * time                     (time) datetime64[ns] 13kB 2019-01-01 ... 2020-0...
  * geometry                 (geometry) object 208B POINT (6.303651 49.512707...
Data variables:
    2m_temperature           (time, geometry) float32 165kB dask.array<chunksize=(36, 26), meta=np.ndarray>
    u_component_of_wind      (time, geometry) float32 165kB dask.array<chunksize=(36, 26), meta=np.ndarray>
    10m_u_component_of_wind  (time, geometry) float32 165kB dask.array<chunksize=(36, 26), meta=np.ndarray>
    primary                  (geometry) <U48 5kB 'Chauffages et Sanitaires Gr...
    id                       (geometry) <U32 3kB '08f1fa355139106d038022d0afb...
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)

Load it into memory:

germany_wind_vector_cube = germany_wind_vector_cube.load()
print(germany_wind_vector_cube)
Output
<xarray.Dataset> Size: 515kB
Dimensions:                  (time: 1584, geometry: 26)
Coordinates:
  * time                     (time) datetime64[ns] 13kB 2019-01-01 ... 2020-0...
  * geometry                 (geometry) object 208B POINT (6.303651 49.512707...
Data variables:
    2m_temperature           (time, geometry) float32 165kB 279.2 ... 281.2
    u_component_of_wind      (time, geometry) float32 165kB 36.67 37.22 ... nan
    10m_u_component_of_wind  (time, geometry) float32 165kB 2.009 ... 4.278
    primary                  (geometry) <U48 5kB 'Chauffages et Sanitaires Gr...
    id                       (geometry) <U32 3kB '08f1fa355139106d038022d0afb...
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)

Cool. Zarr requires uniform chunk sizes, so we rechunk the dataset before writing. After that, encode the geometries as CF geometries and write.

(
    # Chunk data
    germany_wind_vector_cube.chunk({"time": 48, "geometry": 26})
    # Encode geometries
    .xvec.encode_cf()
    # Write to Zarr
    .to_zarr(repo.store, group="germany_wind/era5", mode="w")
)
# Commit changes
repo.commit("write germany wind era5 vector cube")
Output
66b4f9ad68f6c437dc12f096

A vector data cube of Polygon geometries

The above example demonstrated extracting point data from a raster data cube. We can also create a vector data cube from GeoParquet data containing Polygons.

The query below pulls all areas categorized as cropland in the state of Utah from the Overture Maps land cover dataset. Read more about this dataset in the base theme documentation.

ut_ag_query = """

SELECT
id,
ST_AsText(ST_GeomFromWKB(geometry)) as geometry
FROM read_parquet("s3://overturemaps-us-west-2/release/2024-07-22.0/theme=base/type=land_cover/*", hive_partitioning=1)
WHERE
subtype='crop'
AND bbox.xmin > -114.052962
AND bbox.xmax < -109.041058
AND bbox.ymin > 36.997966
AND bbox.ymax < 42.003014
"""
ut_ag_gdf = overture_query(ut_ag_query)
print(ut_ag_gdf.head())
Output
                                 id  \
0  08b29952e8d0afff0005db1cb08159dd
1  08b2991b8a2c8fff0005d1e316fc51a3
2  08b29950cc68bfff0005d1f0bdeb1d03
3  08b29950ce284fff0005d374ed58e2f1
4  08b2995764870fff0005d14ab791df67

                                            geometry
0  POLYGON ((-113.99291 37.30672, -113.99238 37.3...
1  POLYGON ((-113.94247 37.87148, -113.94151 37.8...
2  POLYGON ((-113.61381 37.04188, -113.61361 37.0...
3  POLYGON ((-113.59954 37.04872, -113.59914 37.0...
4  POLYGON ((-113.60088 37.08047, -113.60018 37.0...

Again we'll sample the ERA5 data cube, but to handle the Polygon geometries of the ut_ag_gdf dataset we'll use Xvec's zonal_stats() method, which performs a reduction over the raster cells that fall within each geometry.

utah_ag_era5 = era5_ds_sub.xvec.zonal_stats(
    ut_ag_gdf.geometry, x_coords="longitude", y_coords="latitude"
)
Output
/srv/conda/envs/notebook/lib/python3.12/site-packages/xarray/util/deprecation_helpers.py:140: PerformanceWarning: Reshaping is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array.reshape(shape)
To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array.reshape(shape)
Explicitly passing ``limit`` to ``reshape`` will also silence this warning
    >>> array.reshape(shape, limit='128 MiB')
  return func(*args, **kwargs)

utah_ag_era5 = utah_ag_era5.update({"id": ("geometry", ut_ag_gdf["id"])})
utah_ag_era5["id"] = utah_ag_era5["id"].astype(str)

print(utah_ag_era5)
Output
<xarray.Dataset> Size: 99MB
Dimensions:                  (time: 1584, geometry: 5154)
Coordinates:
  * time                     (time) datetime64[ns] 13kB 2019-01-01 ... 2020-0...
  * geometry                 (geometry) object 41kB POLYGON ((-113.9929142 37...
Data variables:
    2m_temperature           (geometry, time) float32 33MB dask.array<chunksize=(5154, 36), meta=np.ndarray>
    u_component_of_wind      (geometry, time) float32 33MB dask.array<chunksize=(5154, 36), meta=np.ndarray>
    10m_u_component_of_wind  (geometry, time) float32 33MB dask.array<chunksize=(5154, 36), meta=np.ndarray>
    id                       (geometry) <U32 660kB '08b29952e8d0afff0005db1cb...
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
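
By default, zonal_stats aggregates with the mean over each geometry. Per the Xvec documentation, other reductions can be requested via the stats argument; a minimal sketch (not run here, assuming the current Xvec API):

# e.g. reduce with the maximum instead of the default mean
utah_ag_era5_max = era5_ds_sub.xvec.zonal_stats(
    ut_ag_gdf.geometry, x_coords="longitude", y_coords="latitude", stats="max"
)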

Rechunk, encode geometries, and write to Zarr, as before:

(
    # Chunk
    utah_ag_era5.chunk({"time": 48})
    # Encode geometries
    .xvec.encode_cf()
    # Write to zarr
    .to_zarr(repo.store, group="utah_crop_cover", mode="w")
)
# Commit changes
repo.commit("write utah crop era5 vector cube")
Output
66b4fbc368f6c437dc12f277

Summary

Increased support for vector data cubes in the Xarray ecosystem means that we can now extend Arraylake's functionality to a new type of data. This notebook demonstrates a few key capabilities:

  • You can ingest GeoParquet vector data from cloud object storage, convert it to a vector data cube, and write it to Arraylake as a Zarr data cube.
  • You can read vector data cubes from Arraylake back into fully functional, in-memory data cubes.
  • You can assemble multi-dimensional data cubes by joining raster and vector data. This is possible for vector data containing both Point and Polygon geometries.