Virtual Datasets
Arraylake's native data model is based on Zarr Version 3. However, Arraylake can ingest a wide range of other array formats, including:
- Zarr Version 2
- HDF5
- NetCDF3
- NetCDF4
- GRIB
- TIFF / GeoTIFF / COGs (Cloud-Optimized GeoTIFF)
Importantly, these files do not have to be copied in order to be used in Arraylake. If you have existing data in these formats already in the cloud, Arraylake can build an index on top of it. These are called virtual datasets. They look and feel like regular Arraylake groups and arrays, but the chunk data live in the original file, rather than in your chunkstore. Arraylake stores references to these files in the metastore.
When using virtual datasets, Arraylake has no way to know if the files being referenced have been moved, changed, or deleted. Creating virtual datasets and then changing the referenced files can result in unexpected errors.
While the examples below showcase datasets stored on Amazon S3, it is possible to use virtual datasets to reference data in Google Cloud Storage. To do so, set s3.endpoint_url:
from arraylake import Client, config
client = Client()
# set s3 endpoint for GCS + anonymous access
config.set({"s3.endpoint_url": "https://storage.googleapis.com"})
repo = client.get_or_create_repo("ORG_NAME/REPO_NAME")
NetCDF
In this example, we will create virtual datasets from NetCDF files. The example file chosen is from Unidata's Example NetCDF Files Archive.
The specific file we will use is described as follows:
From the Community Climate System Model (CCSM), one time step of precipitation flux, air temperature, and eastward wind.
We have copied the file to S3 at the following URI: s3://earthmover-sample-data/netcdf/sresa1b_ncar_ccsm3-example.nc
First we connect to a repo
from arraylake import Client
client = Client()
repo = client.get_or_create_repo("earthmover/virtual-files")
repo
<arraylake.repo.Repo 'earthmover/virtual-files'>
Adding a file is as simple as one line of code. The second argument to add_virtual_netcdf tells Arraylake where to put the file within the repo hierarchy.
s3_url = "s3://earthmover-sample-data/netcdf/sresa1b_ncar_ccsm3-example.nc"
repo.add_virtual_netcdf(s3_url, "netcdf/sresa1b_ncar_ccsm3_example")
The signature of all "virtual" methods mirrors cp source destination: pass the remote URI first, followed by the path in the repo where you want to ingest the data.
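When ingesting many files, one natural pattern is to derive each destination path from its source URI. The helper below is purely illustrative (virtual_path is not part of the Arraylake API); it just mirrors the naming convention used in this example.

```python
from pathlib import PurePosixPath

def virtual_path(source_uri: str, prefix: str) -> str:
    """Build a destination path inside the repo from a source URI.

    Hypothetical helper: takes the file's stem, replaces hyphens with
    underscores to match the destination naming used in this example,
    and prepends the chosen prefix.
    """
    stem = PurePosixPath(source_uri).stem.replace("-", "_")
    return f"{prefix}/{stem}"

print(virtual_path(
    "s3://earthmover-sample-data/netcdf/sresa1b_ncar_ccsm3-example.nc",
    "netcdf",
))
# netcdf/sresa1b_ncar_ccsm3_example
```

You could then loop over a list of URIs, calling add_virtual_netcdf(uri, virtual_path(uri, "netcdf")) for each.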
We can see that several arrays have been added to our dataset from this one file.
print(repo.root_group.tree())
/
└── netcdf
└── sresa1b_ncar_ccsm3_example
├── time (1,) >f8
├── msk_rgn (128, 256) >i4
├── time_bnds (1, 2) >f8
├── pr (1, 128, 256) >f4
├── lon_bnds (256, 2) >f8
├── plev (17,) >f8
├── lon (256,) >f4
├── lat (128,) >f4
├── area (128, 256) >f4
├── ua (1, 17, 128, 256) >f4
├── lat_bnds (128, 2) >f8
└── tas (1, 128, 256) >f4
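The codes next to each array are NumPy dtype strings: > marks big-endian byte order (typical for NetCDF files), the letter gives the kind (f for float, i for integer), and the digit is the item size in bytes. A quick way to decode them:

```python
import numpy as np

# Decode the array-protocol dtype strings shown in the tree listing.
for code in [">f8", ">f4", ">i4"]:
    dt = np.dtype(code)
    print(code, "->", dt.name, f"({dt.itemsize} bytes)")
# >f8 -> float64 (8 bytes)
# >f4 -> float32 (4 bytes)
# >i4 -> int32 (4 bytes)
```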
We can check the data to make sure it looks good. Let's open it with Xarray.
import xarray as xr
ds = xr.open_dataset(repo.store, group="netcdf/sresa1b_ncar_ccsm3_example", zarr_version=3, engine="zarr")
print(ds)
<xarray.Dataset>
Dimensions: (lat: 128, lon: 256, bnds: 2, plev: 17, time: 1)
Coordinates:
* lat (lat) float32 -88.93 -87.54 -86.14 -84.74 ... 86.14 87.54 88.93
* lon (lon) float32 0.0 1.406 2.812 4.219 ... 354.4 355.8 357.2 358.6
* plev (plev) float64 1e+05 9.25e+04 8.5e+04 7e+04 ... 3e+03 2e+03 1e+03
* time (time) object 2000-05-16 12:00:00
Dimensions without coordinates: bnds
Data variables:
area (lat, lon) float32 ...
lat_bnds (lat, bnds) float64 ...
lon_bnds (lon, bnds) float64 ...
msk_rgn (lat, lon) int32 ...
pr (time, lat, lon) float32 ...
tas (time, lat, lon) float32 ...
time_bnds (time, bnds) object ...
ua (time, plev, lat, lon) float32 ...
Attributes: (12/18)
CVS_Id: $Id$
Conventions: CF-1.0
acknowledgment: Any use of CCSM data should acknowledge the contrib...
cmd_ln: bds -x 256 -y 128 -m 23 -o /data/zender/data/dst_T85.nc
comment: This simulation was initiated from year 2000 of \n C...
contact: ccsm@ucar.edu
... ...
project_id: IPCC Fourth Assessment
realization: 1
references: Collins, W.D., et al., 2005:\n The Community Climate...
source: CCSM3.0, version beta19 (2004): \natmosphere: CAM3.0...
table_id: Table A1
title: model output prepared for IPCC AR4
ds.pr.plot()
<matplotlib.collections.QuadMesh at 0x28d5b1300>
The data look good! All of the metadata made it through, and Xarray was able to use them to make a very informative plot.
We're ready to commit.
repo.commit("Added virtual netCDF file")
650e10826db135fa2d117447
Incrementally Overwrite Virtual Datasets
Now that our dataset is in Arraylake, we can incrementally overwrite it with new data. The new writes will be stored in our chunkstore; the original file will not be modified.
For example, imagine we discover that the precipitation field needs to be rescaled by a factor of 2:
pr_array = repo.root_group.netcdf.sresa1b_ncar_ccsm3_example.pr
pr_array[:] = 2 * pr_array[:]
We can see that only a single chunk of data was modified:
repo.status()
🧊 Using repo earthmover-demos/virtual-files
📟 Session aac5ae7638a6469f92b09a6a3c12f965 started at 2023-09-22T22:09:07.513322
🌿 On branch main
paths modified in session
- 📝 data/root/netcdf/sresa1b_ncar_ccsm3_example/pr/c0/0/0
This new chunk has been stored in the Arraylake chunkstore.
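To see why a whole-array write touched exactly one chunk key, note that pr has shape (1, 128, 256); assuming it is stored as a single chunk of the same shape (consistent with the lone key ending in pr/c0/0/0 above), any write falls in chunk (0, 0, 0). The touched_chunks helper below is an illustrative sketch of this chunk-grid arithmetic, not Arraylake code:

```python
from itertools import product
from math import ceil

def touched_chunks(start, stop, chunk_shape):
    """Chunk-grid coordinates overlapped by a write covering the
    half-open region [start, stop) along each dimension."""
    ranges = [
        range(s // c, ceil(e / c))
        for s, e, c in zip(start, stop, chunk_shape)
    ]
    return list(product(*ranges))

# A whole-array write to pr, assuming one chunk spans the full array:
print(touched_chunks((0, 0, 0), (1, 128, 256), (1, 128, 256)))
# [(0, 0, 0)]

# By contrast, a 10x10 write to an array chunked in 4x4 blocks
# would touch a 3x3 block of chunks (9 keys).
print(len(touched_chunks((0, 0), (10, 10), (4, 4))))
# 9
```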
repo.commit("Rescaled precipitation")
650e10876db135fa2d117448
Other Formats
Zarr V2
If you have existing Zarr V2 data stored in object storage, you can ingest it virtually into Arraylake without making a copy of any of the chunk data.
As an example, we use a dataset from the AWS CMIP6 Open Data collection.
zarr_url = "s3://cmip6-pds/CMIP6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/hfls/gn/v20200225"
Let's first open it directly with xarray and see what's inside.
print(xr.open_dataset(zarr_url, engine="zarr"))
<xarray.Dataset>
Dimensions: (time: 1800, lat: 192, lon: 288, bnds: 2)
Coordinates:
* lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
lat_bnds (lat, bnds) float64 ...
* lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
lon_bnds (lon, bnds) float64 ...
* time (time) object 0001-01-16 12:00:00 ... 0150-12-16 12:00:00
time_bnds (time, bnds) object ...
Dimensions without coordinates: bnds
Data variables:
hfls (time, lat, lon) float32 ...
Attributes: (12/53)
Conventions: CF-1.7 CMIP-6.2
activity_id: CMIP
branch_method: Hybrid-restart from year 0701-01-01 of piControl
branch_time: 0.0
branch_time_in_child: 0.0
branch_time_in_parent: 182500.0
... ...
title: TaiESM1 output prepared for CMIP6
tracking_id: hdl:21.14100/813dbc9a-249f-4cde-a56c-fea0a42a5eb5
variable_id: hfls
variant_label: r1i1p1f1
netcdf_tracking_ids: hdl:21.14100/813dbc9a-249f-4cde-a56c-fea0a42a5eb5
version_id: v20200225
Now let's ingest it into Arraylake.
repo.add_virtual_zarr(zarr_url, "zarr/cmip6_example")
ds = xr.open_dataset(repo.store, group="zarr/cmip6_example", engine="zarr", zarr_version=3)
print(ds)
<xarray.Dataset>
Dimensions: (time: 1800, lat: 192, lon: 288, bnds: 2)
Coordinates:
* lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
lat_bnds (lat, bnds) float64 ...
* lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
lon_bnds (lon, bnds) float64 ...
* time (time) object 0001-01-16 12:00:00 ... 0150-12-16 12:00:00
time_bnds (time, bnds) object ...
Dimensions without coordinates: bnds
Data variables:
hfls (time, lat, lon) float32 ...
Attributes: (12/51)
Conventions: CF-1.7 CMIP-6.2
activity_id: CMIP
branch_method: Hybrid-restart from year 0701-01-01 of piControl
branch_time: 0.0
branch_time_in_child: 0.0
branch_time_in_parent: 182500.0
... ...
table_id: Amon
table_info: Creation Date:(24 July 2019) MD5:0bb394a356ef9...
title: TaiESM1 output prepared for CMIP6
tracking_id: hdl:21.14100/813dbc9a-249f-4cde-a56c-fea0a42a5eb5
variable_id: hfls
variant_label: r1i1p1f1
ds.hfls[-1].plot()
<matplotlib.collections.QuadMesh at 0x28f414dc0>
Data looks good! Let's commit.
repo.commit("Added virtual CMIP6 Zarr")
650e10926db135fa2d117449
GRIB
To demonstrate GRIB ingest, we'll use a GRIB file from the Global Ensemble Forecast System (GEFS):
grib_uri = "s3://earthmover-sample-data/grib/cape_sfc_2000010100_c00.grib2"
repo.add_virtual_grib(grib_uri, "gefs-cape/")
print(repo.to_xarray("gefs-cape/"))
<xarray.Dataset>
Dimensions: (step: 24, latitude: 361, longitude: 720)
Coordinates:
* latitude (latitude) float64 90.0 89.5 89.0 88.5 ... -89.0 -89.5 -90.0
* longitude (longitude) float64 0.0 0.5 1.0 1.5 ... 358.0 358.5 359.0 359.5
number int64 ...
* step (step) timedelta64[ns] 10 days 06:00:00 ... 16 days 00:00:00
surface float64 ...
time datetime64[ns] ...
valid_time (step) datetime64[ns] ...
Data variables:
cape (step, latitude, longitude) float64 ...
Attributes:
GRIB_centre: kwbc
GRIB_centreDescription: US National Weather Service - NCEP
GRIB_edition: 2
GRIB_subCentre: 2
institution: US National Weather Service - NCEP
We see a step dimension of size 24. Arraylake chooses to concatenate GRIB messages when it appears sensible to do so. However, GRIB files come in many varieties and can be hard to parse. Let us know if add_virtual_grib cannot ingest the GRIB file you want.
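As a sanity check on the concatenation, the valid_time coordinate above is just time + step. Assuming (from the file name) an initialization time of 2000-01-01 00:00 and the 6-hourly lead times shown, the 24 steps can be reconstructed with NumPy:

```python
import numpy as np

# Assumed initialization time, inferred from the "2000010100" in the
# file name -- not read from the GRIB file itself.
time = np.datetime64("2000-01-01T00:00")

# 6-hourly lead times from 10 days 06:00 (246 h) through 16 days
# 00:00 (384 h), matching the step coordinate shown above.
step = np.arange(
    np.timedelta64(246, "h"),
    np.timedelta64(385, "h"),  # exclusive upper bound
    np.timedelta64(6, "h"),
)
valid_time = time + step

print(len(step))        # 24
print(valid_time[0])    # 2000-01-11T06:00
print(valid_time[-1])   # 2000-01-17T00:00
```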
repo.commit("add gefs GRIB")
654daf73d38077081f2f3500