Searching and filtering
Introduction
Arraylake allows you to search through a repository using metadata present in the Zarr attrs property. A recommended workflow is to:
- Use the Repo.tree visualization to iterate on your filter query, and then
- Use Repo.filter_metadata to obtain a list of paths to the matching Zarr nodes, which can be read into an Xarray dataset.
Both methods use JMESPath (pronounced "james path"), a rich query language for JSON, to express queries. Arraylake only supports the filtering functionality, not aggregation or projection. For more, see the JMESPath tutorial.
Setup
import arraylake as al
import numpy as np
import xarray as xr
import zarr
client = al.Client()
An example climate model dataset
This example dataset is loosely inspired by the CMIP datasets, though we use longer names to keep the examples readable.
The hierarchy looks like: project/model_name/experiment_name/data_stream/spatial_grid/variable_name. For each model name, we will assign variables and appropriate CF attributes.
Arraylake does not assume or impose any particular convention for the attributes set on Zarr arrays or Zarr groups. Instead, it lets you use JMESPath to express anything from simple to complex queries on the attributes.
We begin by using the zarr library to create a nested tree of arrays with attributes that we will later be able to search over.
You'll need to change the organization name 👇 from earthmover to your-org-name.
cmip_repo = client.get_or_create_repo("earthmover/search-cmip-like")
root = cmip_repo.root_group
varnames = {
    "atm": ["pr", "co2", "tas"],
    "land": ["rootd", "tasmin", "tasmax"],
}
attrs = {
    "pr": {"standard_name": "precipitation_flux"},
    "co2": {"standard_name": "mole_fraction_of_carbon_dioxide_in_air"},
    "tas": {"standard_name": "air_temperature", "cell_methods": "time:mean"},
    "tasmin": {"standard_name": "air_temperature", "cell_methods": "time:min"},
    "tasmax": {"standard_name": "air_temperature", "cell_methods": "time:max"},
    "rootd": {"standard_name": "root_depth", "units": "m"},
}
for mip in ["CMIP", "ScenarioMIP"]:
    for model in ["model1", "model2"]:
        for experiment_id in ["historical"]:
            for stream in ["atm_daily", "land_daily", "land_monthly"]:
                # ScenarioMIP only gets the daily atmosphere stream
                if mip == "ScenarioMIP" and stream != "atm_daily":
                    continue
                for grid_id in ["native", "latlon"]:
                    # model2 only gets the native grid
                    if grid_id == "latlon" and model == "model2":
                        continue
                    frequency = "mon" if "mon" in stream else "day"
                    component, _ = stream.split("_")
                    path = f"{mip}/{model}/{experiment_id}/{stream}/{grid_id}"
                    group = root.create_group(path, overwrite=True)
                    for variable in varnames[component]:
                        path = f"{mip}/{model}/{experiment_id}/{stream}/{grid_id}/{variable}"
                        array = root.create_dataset(
                            path,
                            shape=(4, 64, 128),
                            overwrite=True,
                            fill_value=np.nan,
                            dtype=np.float64,
                        )
                        # CF attributes specific to this variable
                        if variable in attrs:
                            array.attrs.update(attrs[variable])
                        # Attributes shared by every array; these are what we search on later
                        array.attrs.update(
                            {
                                "frequency": frequency,
                                "grid": grid_id,
                                "experiment_id": experiment_id,
                                "_ARRAY_DIMENSIONS": ["time", "nlat", "nlon"]
                                if grid_id == "native"
                                else ["time", "latitude", "longitude"],
                            }
                        )
cmip_repo.commit("demo dataset commit")
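The nested loops above skip some combinations, so it can help to dry-run the same logic without touching the repository. The sketch below is illustrative only: expected_paths is a hypothetical helper (not part of Arraylake or Zarr) that reproduces the two skip rules with itertools.product and lists the array paths the loops should create.

```python
from itertools import product

# Same component-to-variable mapping as in the dataset definition above.
varnames = {
    "atm": ["pr", "co2", "tas"],
    "land": ["rootd", "tasmin", "tasmax"],
}


def expected_paths():
    """Hypothetical helper: list the array paths the creation loops produce."""
    paths = []
    for mip, model, stream, grid_id in product(
        ["CMIP", "ScenarioMIP"],
        ["model1", "model2"],
        ["atm_daily", "land_daily", "land_monthly"],
        ["native", "latlon"],
    ):
        # Skip rule 1: ScenarioMIP only has the daily atmosphere stream.
        if mip == "ScenarioMIP" and stream != "atm_daily":
            continue
        # Skip rule 2: model2 only has the native grid.
        if grid_id == "latlon" and model == "model2":
            continue
        component = stream.split("_")[0]
        for variable in varnames[component]:
            paths.append(f"{mip}/{model}/historical/{stream}/{grid_id}/{variable}")
    return paths


paths = expected_paths()
print(len(paths))  # 36 arrays in total
```

Comparing this list against the tree printed next is a quick way to confirm that nothing was skipped by accident.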
Here is the full tree
cmip_repo.tree()
/
├── 📁 CMIP
│ ├── 📁 model1
│ │ └── 📁 historical
│ │ ├── 📁 land_daily
│ │ │ ├── 📁 latlon
│ │ │ │ ├── 🇦 rootd (4, 64, 128) float64
│ │ │ │ ├── 🇦 tasmax (4, 64, 128) float64
│ │ │ │ └── 🇦 tasmin (4, 64, 128) float64
│ │ │ └── 📁 native
│ │ │ ├── 🇦 tasmin (4, 64, 128) float64
│ │ │ ├── 🇦 rootd (4, 64, 128) float64
│ │ │ └── 🇦 tasmax (4, 64, 128) float64
│ │ ├── 📁 atm_daily
│ │ │ ├── 📁 latlon
│ │ │ │ ├── 🇦 tas (4, 64, 128) float64
│ │ │ │ ├── 🇦 co2 (4, 64, 128) float64
│ │ │ │ └── 🇦 pr (4, 64, 128) float64
│ │ │ └── 📁 native
│ │ │ ├── 🇦 co2 (4, 64, 128) float64
│ │ │ ├── 🇦 tas (4, 64, 128) float64
│ │ │ └── 🇦 pr (4, 64, 128) float64
│ │ └── 📁 land_monthly
│ │ ├── 📁 latlon
│ │ │ ├── 🇦 rootd (4, 64, 128) float64
│ │ │ ├── 🇦 tasmin (4, 64, 128) float64
│ │ │ └── 🇦 tasmax (4, 64, 128) float64
│ │ └── 📁 native
│ │ ├── 🇦 tasmax (4, 64, 128) float64
│ │ ├── 🇦 tasmin (4, 64, 128) float64
│ │ └── 🇦 rootd (4, 64, 128) float64
│ └── 📁 model2
│ └── 📁 historical
│ ├── 📁 land_daily
│ │ └── 📁 native
│ │ ├── 🇦 tasmax (4, 64, 128) float64
│ │ ├── 🇦 tasmin (4, 64, 128) float64
│ │ └── 🇦 rootd (4, 64, 128) float64
│ ├── 📁 atm_daily
│ │ └── 📁 native
│ │ ├── 🇦 co2 (4, 64, 128) float64
│ │ ├── 🇦 tas (4, 64, 128) float64
│ │ └── 🇦 pr (4, 64, 128) float64
│ └── 📁 land_monthly
│ └── 📁 native
│ ├── 🇦 tasmax (4, 64, 128) float64
│ ├── 🇦 rootd (4, 64, 128) float64
│ └── 🇦 tasmin (4, 64, 128) float64
└── 📁 ScenarioMIP
├── 📁 model1
│ └── 📁 historical
│ └── 📁 atm_daily
│ ├── 📁 native
│ │ ├── 🇦 co2 (4, 64, 128) float64
│ │ ├── 🇦 tas (4, 64, 128) float64
│ │ └── 🇦 pr (4, 64, 128) float64
│ └── 📁 latlon
│ ├── 🇦 pr (4, 64, 128) float64
│ ├── 🇦 tas (4, 64, 128) float64
│ └── 🇦 co2 (4, 64, 128) float64
└── 📁 model2
└── 📁 historical
└── 📁 atm_daily
└── 📁 native
├── 🇦 co2 (4, 64, 128) float64
├── 🇦 tas (4, 64, 128) float64
└── 🇦 pr (4, 64, 128) float64
- There is no order to the nodes in the tree.
- The tree is a lot nicer to navigate with ipytree installed.
We see a mix of variables from the land and atmosphere component models.
Matching the value of a single attribute
Now let's filter to only include variables with the CF attribute standard_name of "air_temperature".
cmip_repo.tree(filter="standard_name == 'air_temperature'")
/
├── 📁 CMIP
│ ├── 📁 model1
│ │ └── 📁 historical
│ │ ├── 📁 atm_daily
│ │ │ ├── 📁 native
│ │ │ │ └── 🇦 tas (4, 64, 128) float64
│ │ │ └── 📁 latlon
│ │ │ └── 🇦 tas (4, 64, 128) float64
│ │ ├── 📁 land_daily
│ │ │ ├── 📁 latlon
│ │ │ │ ├── 🇦 tasmax (4, 64, 128) float64
│ │ │ │ └── 🇦 tasmin (4, 64, 128) float64
│ │ │ └── 📁 native
│ │ │ ├── 🇦 tasmin (4, 64, 128) float64
│ │ │ └── 🇦 tasmax (4, 64, 128) float64
│ │ └── 📁 land_monthly
│ │ ├── 📁 latlon
│ │ │ ├── 🇦 tasmin (4, 64, 128) float64
│ │ │ └── 🇦 tasmax (4, 64, 128) float64
│ │ └── 📁 native
│ │ ├── 🇦 tasmax (4, 64, 128) float64
│ │ └── 🇦 tasmin (4, 64, 128) float64
│ └── 📁 model2
│ └── 📁 historical
│ ├── 📁 land_daily
│ │ └── 📁 native
│ │ ├── 🇦 tasmax (4, 64, 128) float64
│ │ └── 🇦 tasmin (4, 64, 128) float64
│ ├── 📁 land_monthly
│ │ └── 📁 native
│ │ ├── 🇦 tasmax (4, 64, 128) float64
│ │ └── 🇦 tasmin (4, 64, 128) float64
│ └── 📁 atm_daily
│ └── 📁 native
│ └── 🇦 tas (4, 64, 128) float64
└── 📁 ScenarioMIP
├── 📁 model1
│ └── 📁 historical
│ └── 📁 atm_daily
│ ├── 📁 native
│ │ └── 🇦 tas (4, 64, 128) float64
│ └── 📁 latlon
│ └── 🇦 tas (4, 64, 128) float64
└── 📁 model2
└── 📁 historical
└── 📁 atm_daily
└── 📁 native
└── 🇦 tas (4, 64, 128) float64
Very nice! We get back only the arrays whose standard_name is air_temperature (tas, tasmin, and tasmax), together with their parent groups. Let's further filter to select only the daily frequency output on the native grid. The CMIP (CMOR) convention is to specify the frequency by setting the frequency attribute to 'day'.
It is safer to use backticks (`), that is, to specify literal values, for comparisons. While single quotes conveniently work for equality comparisons to strings, they will not work for comparisons with other data types. So repo.tree(filter="standard_name == 'root_depth'") will work but is not recommended.
cmip_repo.tree(
filter="standard_name == `air_temperature` && frequency == `day` && grid==`native`"
)
/
├── 📁 CMIP
│ ├── 📁 model1
│ │ └── 📁 historical
│ │ ├── 📁 atm_daily
│ │ │ └── 📁 native
│ │ │ └── 🇦 tas (4, 64, 128) float64
│ │ └── 📁 land_daily
│ │ └── 📁 native
│ │ ├── 🇦 tasmin (4, 64, 128) float64
│ │ └── 🇦 tasmax (4, 64, 128) float64
│ └── 📁 model2
│ └── 📁 historical
│ ├── 📁 land_daily
│ │ └── 📁 native
│ │ ├── 🇦 tasmax (4, 64, 128) float64
│ │ └── 🇦 tasmin (4, 64, 128) float64
│ └── 📁 atm_daily
│ └── 📁 native
│ └── 🇦 tas (4, 64, 128) float64
└── 📁 ScenarioMIP
├── 📁 model1
│ └── 📁 historical
│ └── 📁 atm_daily
│ └── 📁 native
│ └── 🇦 tas (4, 64, 128) float64
└── 📁 model2
└── 📁 historical
└── 📁 atm_daily
└── 📁 native
└── 🇦 tas (4, 64, 128) float64
From tree to Xarray
Now let's read that into an Xarray object. Use repo.filter_metadata to view the tree as a list of paths.
The results are unsorted!
results = cmip_repo.filter_metadata(
filter="standard_name == `air_temperature` && frequency == 'day' && grid=='native' && experiment_id=='historical'"
)
results
['CMIP/model1/historical/land_daily/native/tasmax',
'CMIP/model2/historical/land_daily/native/tasmin',
'ScenarioMIP/model2/historical/atm_daily/native/tas',
'CMIP/model1/historical/land_daily/native/tasmin',
'CMIP/model2/historical/atm_daily/native/tas',
'CMIP/model1/historical/atm_daily/native/tas',
'CMIP/model2/historical/land_daily/native/tasmax',
'ScenarioMIP/model1/historical/atm_daily/native/tas']
One way to work with these results is to read each path into its own DataArray:
datasets = {}
for full_path in results:
    group, array = full_path.rsplit("/", maxsplit=1)
    da = cmip_repo.to_xarray(group=group)[array]
    datasets[group] = da
datasets
{'CMIP/model1/historical/land_daily/native': <xarray.DataArray 'tasmin' (time: 4, nlat: 64, nlon: 128)>
[32768 values with dtype=float64]
Dimensions without coordinates: time, nlat, nlon
Attributes:
cell_methods: time:min
experiment_id: historical
frequency: day
grid: native
standard_name: air_temperature,
'CMIP/model2/historical/land_daily/native': <xarray.DataArray 'tasmax' (time: 4, nlat: 64, nlon: 128)>
[32768 values with dtype=float64]
Dimensions without coordinates: time, nlat, nlon
Attributes:
cell_methods: time:max
experiment_id: historical
frequency: day
grid: native
standard_name: air_temperature,
'ScenarioMIP/model2/historical/atm_daily/native': <xarray.DataArray 'tas' (time: 4, nlat: 64, nlon: 128)>
[32768 values with dtype=float64]
Dimensions without coordinates: time, nlat, nlon
Attributes:
cell_methods: time:mean
experiment_id: historical
frequency: day
grid: native
standard_name: air_temperature,
'CMIP/model2/historical/atm_daily/native': <xarray.DataArray 'tas' (time: 4, nlat: 64, nlon: 128)>
[32768 values with dtype=float64]
Dimensions without coordinates: time, nlat, nlon
Attributes:
cell_methods: time:mean
experiment_id: historical
frequency: day
grid: native
standard_name: air_temperature,
'CMIP/model1/historical/atm_daily/native': <xarray.DataArray 'tas' (time: 4, nlat: 64, nlon: 128)>
[32768 values with dtype=float64]
Dimensions without coordinates: time, nlat, nlon
Attributes:
cell_methods: time:mean
experiment_id: historical
frequency: day
grid: native
standard_name: air_temperature,
'ScenarioMIP/model1/historical/atm_daily/native': <xarray.DataArray 'tas' (time: 4, nlat: 64, nlon: 128)>
[32768 values with dtype=float64]
Dimensions without coordinates: time, nlat, nlon
Attributes:
cell_methods: time:mean
experiment_id: historical
frequency: day
grid: native
standard_name: air_temperature}
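The resulting dictionary of DataArrays can be manipulated into a more useful form using xarray's combining functions, or the excellent, but experimental, xarray-datatree library. Note that the dictionary above is keyed by group path, so when several matching arrays share a group (tasmin and tasmax under land_daily/native here), only the last one read is kept; key the dictionary by full_path instead to keep all eight arrays.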
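Because several matching arrays can share a parent group, it can help to group the returned paths by parent before opening anything. The following is a minimal sketch in plain Python; group_by_parent is a hypothetical helper, and the results list is copied from the filter_metadata output shown earlier.

```python
from collections import defaultdict

# Paths as returned by filter_metadata earlier in this guide (unsorted).
results = [
    "CMIP/model1/historical/land_daily/native/tasmax",
    "CMIP/model2/historical/land_daily/native/tasmin",
    "ScenarioMIP/model2/historical/atm_daily/native/tas",
    "CMIP/model1/historical/land_daily/native/tasmin",
    "CMIP/model2/historical/atm_daily/native/tas",
    "CMIP/model1/historical/atm_daily/native/tas",
    "CMIP/model2/historical/land_daily/native/tasmax",
    "ScenarioMIP/model1/historical/atm_daily/native/tas",
]


def group_by_parent(paths):
    """Hypothetical helper: map each parent group to its matching array names."""
    grouped = defaultdict(list)
    # Sorting first makes the output deterministic, since results are unsorted.
    for full_path in sorted(paths):
        group, array = full_path.rsplit("/", maxsplit=1)
        grouped[group].append(array)
    return dict(grouped)


arrays_by_group = group_by_parent(results)
for group, arrays in arrays_by_group.items():
    print(group, arrays)
```

With this mapping, each group needs to be opened only once, and indexing the opened Dataset with the list of array names should select all matching variables for that group.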
An example STAC-like dataset
We now introduce a second dataset, inspired by STAC (SpatioTemporal Asset Catalogs).
stac_repo = client.get_or_create_repo("earthmover/search-stac-like")
stac_repo
<arraylake.repo.Repo 'earthmover/search-stac-like'>
for itime in range(1, 6):
    group = stac_repo.root_group.create_group(f"staclike/sensor/time{itime}", overwrite=True)
    group.attrs.update(
        {"created:at:time": f"2022-05-{itime:02d}", "timestamp_number": itime}
    )
    for band in ["nir", "blue"]:
        array = group.create_dataset(
            band,
            shape=(1, 64, 128),
            overwrite=True,
            fill_value=np.nan,
            dtype=np.float64,
        )
        array.attrs.update({"_ARRAY_DIMENSIONS": ["time", "y", "x"]})
stac_repo.commit("added STAC-like dataset")
stac_repo.tree()
/
└── 📁 staclike
└── 📁 sensor
├── 📁 time4
│ ├── 🇦 blue (1, 64, 128) float64
│ └── 🇦 nir (1, 64, 128) float64
├── 📁 time5
│ ├── 🇦 blue (1, 64, 128) float64
│ └── 🇦 nir (1, 64, 128) float64
├── 📁 time1
│ ├── 🇦 blue (1, 64, 128) float64
│ └── 🇦 nir (1, 64, 128) float64
├── 📁 time2
│ ├── 🇦 blue (1, 64, 128) float64
│ └── 🇦 nir (1, 64, 128) float64
└── 📁 time3
├── 🇦 blue (1, 64, 128) float64
└── 🇦 nir (1, 64, 128) float64
Comparing values
Comparisons against literal values that are not strings are allowed. As earlier, use backticks (`) to specify literal values for comparisons.
For example, compare dates:
stac_repo.tree(filter='"created:at:time" <= `2022-05-03`')
/
└── 📁 staclike
└── 📁 sensor
├── 📁 time3
├── 📁 time2
└── 📁 time1
JMESPath does not treat dates specially. The example above compares strings, meaning the "created:at:time" entry in the attribute dictionary must contain a string.
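Because the comparison is between strings, it only agrees with chronological order when the dates are written in a zero-padded, ISO 8601 style (YYYY-MM-DD). Below is a minimal plain-Python sketch of that behavior, assuming the filter orders strings lexicographically the way Python does; the variable names are illustrative.

```python
from datetime import date

# Zero-padded ISO dates, like the "created:at:time" values written above.
stamps = [date(2022, 5, day).isoformat() for day in range(1, 6)]
cutoff = "2022-05-03"

# Lexicographic comparison agrees with chronological order for this format.
matches = [stamp for stamp in stamps if stamp <= cutoff]
print(matches)  # ['2022-05-01', '2022-05-02', '2022-05-03']

# Without zero padding the string order no longer follows the calendar:
# May 10 sorts before May 3 because "1" < "3".
unpadded_is_misordered = "2022-5-10" <= "2022-5-3"
print(unpadded_is_misordered)  # True
```

If your attributes store dates in another format, consider also storing a sortable string or a numeric timestamp to filter on.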
Compare to integers (again, specify literals using backticks, `):
stac_repo.tree(filter='"timestamp_number" < `3`')
/
└── 📁 staclike
└── 📁 sensor
├── 📁 time2
└── 📁 time1
Handling queries with no results
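What happens if there are no results for a query?
The comparison of two missing keys is truthy! In JMESPath, bare words and double-quoted names are identifiers (attribute keys), not string literals, so the filter below compares the value of foo with the value of bar. A missing key evaluates to null, and null == null is true, so the following filter string will match all entries if neither 'foo' nor 'bar' is an attribute that exists.
filter='foo == "bar"'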
stac_repo.tree(filter="foo == bar")
/
└── 📁 staclike
└── 📁 sensor
├── 📁 time3
│ ├── 🇦 nir (1, 64, 128) float64
│ └── 🇦 blue (1, 64, 128) float64
├── 📁 time2
│ ├── 🇦 nir (1, 64, 128) float64
│ └── 🇦 blue (1, 64, 128) float64
├── 📁 time4
│ ├── 🇦 nir (1, 64, 128) float64
│ └── 🇦 blue (1, 64, 128) float64
├── 📁 time5
│ ├── 🇦 nir (1, 64, 128) float64
│ └── 🇦 blue (1, 64, 128) float64
└── 📁 time1
├── 🇦 blue (1, 64, 128) float64
└── 🇦 nir (1, 64, 128) float64
Ouch! That's just the whole repository!
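Use the contains function to check for the presence of the key before asserting equality. Write the key name and the expected value as string literals (single quotes or backticks), because double quotes denote identifiers:
stac_repo.tree(filter="contains(keys(@), 'foo') && foo == 'bar'")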
/
No results, since none of our arrays have the attribute key "foo". This is much more sensible. For more, see the JMESPath docs on functions.
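The pitfall and its fix can be mimicked in plain Python, where looking up a missing dictionary key with .get returns None, much as a missing attribute evaluates to null in a filter. This is only an analogy, not how Arraylake evaluates filters; nodes, unguarded, and guarded are hypothetical names.

```python
# Attribute dictionaries resembling two nodes of the STAC-like repo.
nodes = {
    "staclike/sensor/time1": {"created:at:time": "2022-05-01", "timestamp_number": 1},
    "staclike/sensor/time1/nir": {"_ARRAY_DIMENSIONS": ["time", "y", "x"]},
}


def unguarded(attrs):
    # Compares two key lookups; both yield None when the keys are missing,
    # and None == None is True.
    return attrs.get("foo") == attrs.get("bar")


def guarded(attrs):
    # Checks that the key exists before comparing its value to a string.
    return "foo" in attrs and attrs["foo"] == "bar"


unguarded_hits = [path for path, attrs in nodes.items() if unguarded(attrs)]
guarded_hits = [path for path, attrs in nodes.items() if guarded(attrs)]
print(unguarded_hits)  # every node matches
print(guarded_hits)    # no node matches
```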
Advanced Examples
Inexact matches
Use the contains function to search for a particular substring in a value. Here we search for all air_temperature variables containing 'max' in the cell_methods attribute.
cmip_repo.tree(
filter="contains(keys(@), `cell_methods`) && contains(cell_methods, 'max')"
)
/
└── 📁 CMIP
├── 📁 model2
│ └── 📁 historical
│ ├── 📁 land_monthly
│ │ └── 📁 native
│ │ └── 🇦 tasmax (4, 64, 128) float64
│ └── 📁 land_daily
│ └── 📁 native
│ └── 🇦 tasmax (4, 64, 128) float64
└── 📁 model1
└── 📁 historical
├── 📁 land_daily
│ ├── 📁 latlon
│ │ └── 🇦 tasmax (4, 64, 128) float64
│ └── 📁 native
│ └── 🇦 tasmax (4, 64, 128) float64
└── 📁 land_monthly
├── 📁 latlon
│ └── 🇦 tasmax (4, 64, 128) float64
└── 📁 native
└── 🇦 tasmax (4, 64, 128) float64
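To see why only tasmax comes back, the same two-step check (the key exists, then the value contains the substring) can be applied to the per-variable attributes defined for the CMIP-like dataset. This is a plain-Python analogy rather than a JMESPath evaluation; has_max_cell_method is a hypothetical helper.

```python
# Per-variable attributes, copied from the CMIP-like dataset definition.
attrs = {
    "pr": {"standard_name": "precipitation_flux"},
    "co2": {"standard_name": "mole_fraction_of_carbon_dioxide_in_air"},
    "tas": {"standard_name": "air_temperature", "cell_methods": "time:mean"},
    "tasmin": {"standard_name": "air_temperature", "cell_methods": "time:min"},
    "tasmax": {"standard_name": "air_temperature", "cell_methods": "time:max"},
    "rootd": {"standard_name": "root_depth", "units": "m"},
}


def has_max_cell_method(variable_attrs):
    # The presence check comes first: pr, co2 and rootd have no cell_methods,
    # so indexing them directly would raise a KeyError in Python.
    return "cell_methods" in variable_attrs and "max" in variable_attrs["cell_methods"]


matching = sorted(name for name, a in attrs.items() if has_max_cell_method(a))
print(matching)  # ['tasmax']
```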
Filtering arrays by dimension names
While Arraylake does not enforce a particular metadata convention, we can take advantage of conventions in Zarr. For example, dimension names are stored under the special key _ARRAY_DIMENSIONS.
This is a more complicated way of getting back the variables on the latlon grid; that is, we could just use filter="grid == 'latlon'".
cmip_repo.tree(filter="_ARRAY_DIMENSIONS == ['time', 'latitude', 'longitude']")
/
├── 📁 CMIP
│ └── 📁 model1
│ └── 📁 historical
│ ├── 📁 atm_daily
│ │ └── 📁 latlon
│ │ ├── 🇦 co2 (4, 64, 128) float64
│ │ ├── 🇦 tas (4, 64, 128) float64
│ │ └── 🇦 pr (4, 64, 128) float64
│ ├── 📁 land_daily
│ │ └── 📁 latlon
│ │ ├── 🇦 tasmax (4, 64, 128) float64
│ │ ├── 🇦 rootd (4, 64, 128) float64
│ │ └── 🇦 tasmin (4, 64, 128) float64
│ └── 📁 land_monthly
│ └── 📁 latlon
│ ├── 🇦 tasmin (4, 64, 128) float64
│ ├── 🇦 rootd (4, 64, 128) float64
│ └── 🇦 tasmax (4, 64, 128) float64
└── 📁 ScenarioMIP
└── 📁 model1
└── 📁 historical
└── 📁 atm_daily
└── 📁 latlon
├── 🇦 pr (4, 64, 128) float64
├── 🇦 co2 (4, 64, 128) float64
└── 🇦 tas (4, 64, 128) float64
Search for entries in the _ARRAY_DIMENSIONS list by position:
stac_repo.tree(filter="_ARRAY_DIMENSIONS[0] == 'time'") # just all the data
/
└── 📁 staclike
└── 📁 sensor
├── 📁 time2
│ ├── 🇦 nir (1, 64, 128) float64
│ └── 🇦 blue (1, 64, 128) float64
├── 📁 time3
│ ├── 🇦 nir (1, 64, 128) float64
│ └── 🇦 blue (1, 64, 128) float64
├── 📁 time4
│ ├── 🇦 nir (1, 64, 128) float64
│ └── 🇦 blue (1, 64, 128) float64
├── 📁 time5
│ ├── 🇦 nir (1, 64, 128) float64
│ └── 🇦 blue (1, 64, 128) float64
└── 📁 time1
├── 🇦 blue (1, 64, 128) float64
└── 🇦 nir (1, 64, 128) float64
We can check if an array's dimensions contain a specific dimension name:
cmip_repo.tree(filter="contains(_ARRAY_DIMENSIONS, 'time')")
/
├── 📁 ScenarioMIP
│ ├── 📁 model1
│ │ └── 📁 historical
│ │ └── 📁 atm_daily
│ │ ├── 📁 latlon
│ │ │ ├── 🇦 tas (4, 64, 128) float64
│ │ │ ├── 🇦 pr (4, 64, 128) float64
│ │ │ └── 🇦 co2 (4, 64, 128) float64
│ │ └── 📁 native
│ │ ├── 🇦 tas (4, 64, 128) float64
│ │ ├── 🇦 co2 (4, 64, 128) float64
│ │ └── 🇦 pr (4, 64, 128) float64
│ └── 📁 model2
│ └── 📁 historical
│ └── 📁 atm_daily
│ └── 📁 native
│ ├── 🇦 co2 (4, 64, 128) float64
│ ├── 🇦 pr (4, 64, 128) float64
│ └── 🇦 tas (4, 64, 128) float64
└── 📁 CMIP
├── 📁 model1
│ └── 📁 historical
│ ├── 📁 atm_daily
│ │ ├── 📁 latlon
│ │ │ ├── 🇦 tas (4, 64, 128) float64
│ │ │ ├── 🇦 co2 (4, 64, 128) float64
│ │ │ └── 🇦 pr (4, 64, 128) float64
│ │ └── 📁 native
│ │ ├── 🇦 pr (4, 64, 128) float64
│ │ ├── 🇦 tas (4, 64, 128) float64
│ │ └── 🇦 co2 (4, 64, 128) float64
│ ├── 📁 land_daily
│ │ ├── 📁 latlon
│ │ │ ├── 🇦 rootd (4, 64, 128) float64
│ │ │ ├── 🇦 tasmax (4, 64, 128) float64
│ │ │ └── 🇦 tasmin (4, 64, 128) float64
│ │ └── 📁 native
│ │ ├── 🇦 tasmin (4, 64, 128) float64
│ │ ├── 🇦 rootd (4, 64, 128) float64
│ │ └── 🇦 tasmax (4, 64, 128) float64
│ └── 📁 land_monthly
│ ├── 📁 latlon
│ │ ├── 🇦 rootd (4, 64, 128) float64
│ │ ├── 🇦 tasmin (4, 64, 128) float64
│ │ └── 🇦 tasmax (4, 64, 128) float64
│ └── 📁 native
│ ├── 🇦 tasmax (4, 64, 128) float64
│ ├── 🇦 rootd (4, 64, 128) float64
│ └── 🇦 tasmin (4, 64, 128) float64
└── 📁 model2
└── 📁 historical
├── 📁 atm_daily
│ └── 📁 native
│ ├── 🇦 co2 (4, 64, 128) float64
│ ├── 🇦 tas (4, 64, 128) float64
│ └── 🇦 pr (4, 64, 128) float64
├── 📁 land_daily
│ └── 📁 native
│ ├── 🇦 tasmax (4, 64, 128) float64
│ ├── 🇦 tasmin (4, 64, 128) float64
│ └── 🇦 rootd (4, 64, 128) float64
└── 📁 land_monthly
└── 📁 native
├── 🇦 tasmax (4, 64, 128) float64
├── 🇦 rootd (4, 64, 128) float64
└── 🇦 tasmin (4, 64, 128) float64
Recommendations
- Handle special characters, for example :, by quoting the key with double quotes. For example, repo.filter_metadata("'created:at:time' <= `2022-05-03`") uses single quotes, which turn the key into a string literal, so it will not return any results but will not raise an error.
- NaNs are strings, so NaN comparisons should use raw strings with single quotes: "someKey == 'NaN'". The following will not match NaN values:
  "someNaN == NaN"
  "someNaN == `NaN`"
- The comparison of two missing keys is truthy! The following filter string will match all entries if 'foo' is not an attribute that exists: filter='foo == "bar"'
Not supported at the moment
- Filtering by group name or array name is not supported at the moment.
- It is also not possible to limit results to only arrays or only groups at the moment.
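Until name-based filtering is available, one possible workaround is to filter the list of paths on the client side after the metadata query. The sketch below uses the standard-library fnmatch module; filter_paths is a hypothetical helper, and the sample paths are taken from the CMIP-like example.

```python
from fnmatch import fnmatchcase

# A few paths like those returned by filter_metadata for the CMIP-like repo.
paths = [
    "CMIP/model1/historical/land_daily/native/tasmax",
    "CMIP/model2/historical/atm_daily/native/tas",
    "CMIP/model1/historical/atm_daily/native/tas",
    "ScenarioMIP/model1/historical/atm_daily/native/tas",
]


def filter_paths(candidates, pattern):
    """Hypothetical helper: keep paths matching a shell-style pattern.

    Note that '*' in fnmatch patterns also matches '/' characters.
    """
    return sorted(p for p in candidates if fnmatchcase(p, pattern))


# Arrays named exactly "tas" under any model1 group.
model1_tas = filter_paths(paths, "*/model1/*/tas")
print(model1_tas)

# Everything under the top-level CMIP group.
cmip_only = filter_paths(paths, "CMIP/*")
print(cmip_only)
```

This only inspects the path strings, so it cannot tell arrays from groups; it is a convenience on top of the metadata filter, not a replacement for it.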