Vector data in Arraylake
You can use Arraylake to ingest, analyze, and write vector geometries in addition to raster data. This notebook demonstrates general patterns for working with vector data in Arraylake; it is divided into two main sections:
- Ingesting GeoParquet data to Arraylake
- Storing vector data cubes in Arraylake, after joining vector and raster data
Connect to Arraylake
import arraylake as al

# Connect to Arraylake and open the repo used throughout this notebook
client = al.Client()
repo = client.get_repo("earthmover/vector_datacube")
Query Overture Maps
We will use vector data from Overture Maps. These datasets are quite large; rather than read an entire dataset, we use DuckDB to ingest only the data we're interested in. Overture Maps has great resources demonstrating how to query their data with DuckDB.
Here is a DuckDB query that will pull all Overture places data categorized as 'wind_energy' within a spatial bounding box covering Germany.
ger_wind_query = """
SELECT
    id,
    names.primary,
    categories.primary as category,
    ST_AsText(ST_GeomFromWKB(geometry)) as geometry
FROM read_parquet("s3://overturemaps-us-west-2/release/2024-07-22.0/theme=places/type=place/*", hive_partitioning=1)
WHERE
    categories.primary IN ('wind_energy')
    AND bbox.xmin > 5.8663153
    AND bbox.xmax < 15.0419319
    AND bbox.ymin > 47.2701114
    AND bbox.ymax < 55.099161
"""
Using the DuckDB Python API, create an in-memory connection that can run spatial queries against data in S3 object storage.
import duckdb
import geopandas as gpd
# create duckdb connection
conn = duckdb.connect()
# load extensions
conn.load_extension("spatial")
conn.load_extension("httpfs")
# specify aws region
conn.execute("SET s3_region='us-west-2'");
Ingesting GeoParquet data to Arraylake
The function below executes the DuckDB query and converts the resulting pandas.DataFrame to a geopandas.GeoDataFrame.
def overture_query(query: str) -> gpd.GeoDataFrame:
    import geopandas as gpd
    import shapely

    # Run the query and pull the result into a pandas DataFrame
    df = conn.execute(query).fetchdf()
    # Parse WKT strings into shapely geometries and set the CRS
    gdf = gpd.GeoDataFrame(df)
    gdf["geometry"] = gdf["geometry"].apply(shapely.wkt.loads)
    gdf = gdf.set_geometry("geometry").set_crs("EPSG:4326")
    return gdf
ger_wind_gdf = overture_query(ger_wind_query)
print(ger_wind_gdf.head())
                                 id                               primary     category                  geometry
0  08f1fa355139106d038022d0afb88e55  Chauffages et Sanitaires Grethen Jos  wind_energy  POINT (6.30365 49.51271)
1  08f1fa3c96b2eb5c03a6d6e8ad7d8c68                              Duvivier  wind_energy  POINT (5.87755 49.59550)
2  08f1fa3cd1bb02a4033f6b0ef0900eb1                         LuxEnergie SA  wind_energy  POINT (6.16488 49.62893)
3  08f1fa3ceb98203003b3d967139ab3f9                 Lënster Energie Sà rl  wind_energy  POINT (6.25885 49.69611)
4  08f1fa3195404a16035b92b7b79fa7eb                               Bio Man  wind_energy  POINT (6.36202 49.66138)
We use Xvec to convert the GeoDataFrame to an xarray.Dataset with a spatially-aware GeometryIndex.
import xarray as xr

ger_wind_ds = xr.Dataset(
    {
        "primary": ("geometry", ger_wind_gdf["primary"]),
        "id": ("geometry", ger_wind_gdf["id"]),
    },
    coords={"geometry": ger_wind_gdf["geometry"]},
).xvec.set_geom_indexes("geometry", crs=ger_wind_gdf.crs)
print(ger_wind_ds)
<xarray.Dataset> Size: 624B
Dimensions:   (geometry: 26)
Coordinates:
  * geometry  (geometry) object 208B POINT (6.303651 49.512707) ... POINT (13...
Data variables:
    primary   (geometry) object 208B 'Chauffages et Sanitaires Grethen Jos' ....
    id        (geometry) object 208B '08f1fa355139106d038022d0afb88e55' ... '...
Indexes:
    geometry  GeometryIndex (crs=EPSG:4326)
For more details and discussion of vector data cubes, check out the Xvec documentation and our recent blog post.
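The GeometryIndex makes the geometry coordinate spatially queryable. As a quick, minimal sketch (the bounding box below is an arbitrary example, and the exact signature of the query accessor should be confirmed against the Xvec documentation), you could subset the cube to points falling inside an area of interest:
import shapely

# Hypothetical area of interest (minx, miny, maxx, maxy) around Luxembourg
aoi = shapely.box(5.7, 49.4, 6.6, 50.2)

# Keep only the geometries that intersect the area of interest,
# using the spatial query exposed by the GeometryIndex
subset = ger_wind_ds.xvec.query("geometry", aoi, predicate="intersects")
print(subset.sizes)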
Write vector data to Arraylake
Now we have a 1-D vector data cube that resides in memory. To persist it, we need to encode the geometries in a format that Zarr can store. In memory, geometries are represented as Shapely objects, which aren't compatible with array-based file formats such as Zarr. To write the data, we encode the geometries according to the Climate and Forecast (CF) metadata conventions' specification for geometries, which is compatible with both the Zarr and NetCDF file formats. Xvec handles this conversion using functionality from the cf_xarray package. Once encoded, the data can be written to Arraylake as usual.
# Encode shapely geometries as CF
encoded_wind_ds = ger_wind_ds.xvec.encode_cf()
# Write zarr data to Arraylake Repo
encoded_wind_ds.to_zarr(repo.store, group="germany_wind", mode="w")
# Commit changes
repo.commit("write germany wind gdf to zarr")
66b4f8a19a4d239ce895e0d5
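To read the data back later, open the Zarr group from the repo and decode the CF geometry encoding into shapely geometries again. A minimal sketch of that round trip, assuming Xvec's decode_cf (the counterpart of the encode_cf call above) and the same group name:
# Re-open the encoded group from the Arraylake repo
roundtrip_ds = xr.open_zarr(repo.store, group="germany_wind")

# Decode the CF geometry variables back into shapely geometries with a GeometryIndex
decoded_wind_ds = roundtrip_ds.xvec.decode_cf()
print(decoded_wind_ds)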
Vector data cubes in Arraylake
Section 1 demonstrated reading GeoParquet data as a GeoPandas GeoDataFrame, converting the dataframe to an Xarray Dataset with spatially-aware indexes, and writing that object to Arraylake. Now, we will merge raster data with the vector data cube created in section 1 to build a multi-dimensional vector data cube, sampling ERA5 atmospheric reanalysis data at each of the geometries. This section builds on the vector data cube workflow described in this blog post.
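As a preview of that workflow, here is a minimal sketch of the sampling step. It assumes an ERA5 xarray.Dataset named era5_ds with latitude and longitude coordinates (the variable name and coordinate names are assumptions, not outputs from this notebook) and uses Xvec's extract_points to sample the raster at each wind-energy geometry:
# era5_ds: an ERA5 reanalysis Dataset with "latitude"/"longitude" coordinates (assumed)
# Sample the raster at each point geometry, producing a cube
# dimensioned by the raster's other dims (e.g. time) plus geometry
era5_at_points = era5_ds.xvec.extract_points(
    ger_wind_ds.geometry,
    x_coords="longitude",
    y_coords="latitude",
)
print(era5_at_points)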