Searching and filtering
Introduction
Arraylake allows you to search through a repository using the metadata stored in the Zarr attrs property. A recommended workflow is to:
- Use the Repo.tree visualization to iterate on your filter query, and then
- Use Repo.filter_metadata to obtain a list of Zarr groups that can be read into an Xarray dataset.
Both methods express queries in JMESPath (pronounced "james path"), a rich query language for JSON. Arraylake supports only the filtering functionality of JMESPath, not aggregation or projection. For more, see the JMESPath tutorial.
Setup
import arraylake as al
import numpy as np
client = al.Client()
An example climate model dataset
This example dataset is loosely inspired by the CMIP datasets, though we use longer names to keep the examples readable.
The hierarchy looks like project/model_name/experiment_name/data_stream/spatial_grid/variable_name. For each model name, we will assign variables and appropriate CF attributes.
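The six-level layout can be made concrete with a small helper that splits an array path into named parts. This is only an illustrative sketch: ArrayPath and parse_array_path are hypothetical names, not part of Arraylake or Zarr.

```python
from typing import NamedTuple


class ArrayPath(NamedTuple):
    """Hypothetical container for the six path levels used in this example."""

    project: str
    model_name: str
    experiment_name: str
    data_stream: str
    spatial_grid: str
    variable_name: str


def parse_array_path(path: str) -> ArrayPath:
    """Split a slash-separated array path into its six named components."""
    parts = path.strip("/").split("/")
    if len(parts) != len(ArrayPath._fields):
        raise ValueError(
            f"expected {len(ArrayPath._fields)} path components, got {len(parts)}"
        )
    return ArrayPath(*parts)


example = parse_array_path("CMIP/model1/historical/atm_daily/native/tas")
print(example.model_name, example.data_stream, example.variable_name)
```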
Arraylake does not assume or impose any particular convention for the attributes set on Zarr arrays or groups. Instead, it lets you use JMESPath, a query language for JSON, to write anything from simple to complex queries on the attributes.
We begin by using the zarr library to create a nested tree of arrays with attributes that we will later be able to search over.
You'll need to change the organization name 👇 from earthmover to your-org-name.
cmip_repo = client.get_or_create_repo("earthmover/search-cmip-like")
root = cmip_repo.root_group

# Variables produced by each model component
varnames = {
    "atm": ["pr", "co2", "tas"],
    "land": ["rootd", "tasmin", "tasmax"],
}

# CF-style attributes for each variable
attrs = {
    "pr": {"standard_name": "precipitation_flux"},
    "co2": {"standard_name": "mole_fraction_of_carbon_dioxide_in_air"},
    "tas": {"standard_name": "air_temperature", "cell_methods": "time:mean"},
    "tasmin": {"standard_name": "air_temperature", "cell_methods": "time:min"},
    "tasmax": {"standard_name": "air_temperature", "cell_methods": "time:max"},
    "rootd": {"standard_name": "root_depth", "units": "m"},
}

for mip in ["CMIP", "ScenarioMIP"]:
    for model in ["model1", "model2"]:
        for experiment_id in ["historical"]:
            for stream in ["atm_daily", "land_daily", "land_monthly"]:
                # ScenarioMIP only provides the daily atmosphere stream
                if mip == "ScenarioMIP" and stream != "atm_daily":
                    continue
                for grid_id in ["native", "latlon"]:
                    # model2 is only available on its native grid
                    if grid_id == "latlon" and model == "model2":
                        continue
                    frequency = "mon" if "mon" in stream else "day"
                    component, _ = stream.split("_")
                    path = f"{mip}/{model}/{experiment_id}/{stream}/{grid_id}"
                    group = root.create_group(path, overwrite=True)
                    for variable in varnames[component]:
                        path = f"{mip}/{model}/{experiment_id}/{stream}/{grid_id}/{variable}"
                        array = root.create_dataset(
                            path,
                            shape=(4, 64, 128),
                            overwrite=True,
                            fill_value=np.nan,
                            dtype=np.float64,
                        )
                        if variable in attrs:
                            array.attrs.update(attrs[variable])
                        array.attrs.update(
                            {
                                "frequency": frequency,
                                "grid": grid_id,
                                "experiment_id": experiment_id,
                                "_ARRAY_DIMENSIONS": (
                                    ["time", "nlat", "nlon"]
                                    if grid_id == "native"
                                    else ["time", "latitude", "longitude"]
                                ),
                            }
                        )

cmip_repo.commit("demo dataset commit")
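Before writing anything to a repository, it can help to check which groups the nested loops will produce. The following dry run is a sketch that repeats the same skip rules without touching Arraylake or Zarr; planned_group_paths is a hypothetical helper introduced only for this check.

```python
from itertools import product


def planned_group_paths():
    """List the group paths the nested loops above would create (no I/O)."""
    paths = []
    for mip, model, stream, grid_id in product(
        ["CMIP", "ScenarioMIP"],
        ["model1", "model2"],
        ["atm_daily", "land_daily", "land_monthly"],
        ["native", "latlon"],
    ):
        # Same skip rules as the dataset-building loop
        if mip == "ScenarioMIP" and stream != "atm_daily":
            continue
        if grid_id == "latlon" and model == "model2":
            continue
        paths.append(f"{mip}/{model}/historical/{stream}/{grid_id}")
    return paths


group_paths = planned_group_paths()
n_groups = len(group_paths)
n_arrays = n_groups * 3  # each component ("atm" or "land") has three variables

print(n_groups)  # 12
print(n_arrays)  # 36
```

The counts give a quick sanity check against the tree printed by the repository: 12 leaf groups, each holding three arrays.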
Here is the full tree
cmip_repo.tree()
/
├── 📁 CMIP
│   ├── 📁 model1
│   │   └── 📁 historical
│   │       ├── 📁 land_daily
│   │       │   ├── 📁 latlon
│   │       │   │   ├── 🇦 rootd (4, 64, 128) float64
│   │       │   │   ├── 🇦 tasmax (4, 64, 128) float64
│   │       │   │   └── 🇦 tasmin (4, 64, 128) float64
│   │       │   └── 📁 native
│   │       │       ├── 🇦 tasmin (4, 64, 128) float64
│   │       │       ├── 🇦 rootd (4, 64, 128) float64
│   │       │       └── 🇦 tasmax (4, 64, 128) float64
│   │       ├── 📁 atm_daily
│   │       │   ├── 📁 latlon
│   │       │   │   ├── 🇦 tas (4, 64, 128) float64
│   │       │   │   ├── 🇦 co2 (4, 64, 128) float64
│   │       │   │   └── 🇦 pr (4, 64, 128) float64
│   │       │   └── 📁 native
│   │       │       ├── 🇦 co2 (4, 64, 128) float64
│   │       │       ├── 🇦 tas (4, 64, 128) float64
│   │       │       └── 🇦 pr (4, 64, 128) float64
│   │       └── 📁 land_monthly
│   │           ├── 📁 latlon
│   │           │   ├── 🇦 rootd (4, 64, 128) float64
│   │           │   ├── 🇦 tasmin (4, 64, 128) float64
│   │           │   └── 🇦 tasmax (4, 64, 128) float64
│   │           └── 📁 native
│   │               ├── 🇦 tasmax (4, 64, 128) float64
│   │               ├── 🇦 tasmin (4, 64, 128) float64
│   │               └── 🇦 rootd (4, 64, 128) float64
│   └── 📁 model2
│       └── 📁 historical
│           ├── 📁 land_daily
│           │   └── 📁 native
│           │       ├── 🇦 tasmax (4, 64, 128) float64
│           │       ├── 🇦 tasmin (4, 64, 128) float64
│           │       └── 🇦 rootd (4, 64, 128) float64