Virtual Datasets

Arraylake's native data model is based on Zarr Version 3. However, Arraylake can ingest a wide range of other array formats, including

  • Zarr Version 2
  • HDF5
  • NetCDF3
  • NetCDF4
  • GRIB
  • TIFF / GeoTIFF / COGs (Cloud-Optimized GeoTIFF)

Importantly, these files do not have to be copied in order to be used in Arraylake. If you have existing data in these formats already in the cloud, Arraylake can build an index on top of it. These are called virtual datasets. They look and feel like regular Arraylake groups and arrays, but the chunk data live in the original file, rather than in your chunkstore. Arraylake stores references to these files in the metastore.

caution

When using virtual datasets, Arraylake has no way to know if the files being referenced have been moved, changed, or deleted. Creating virtual datasets and then changing the referenced files can result in unexpected errors.

Working with Google Cloud Storage

While the examples below showcase datasets stored on Amazon S3, you can also use virtual datasets to reference data stored in Google Cloud Storage. To do so, set s3.endpoint_url:

from arraylake import Client, config

client = Client()
# set s3 endpoint for GCS + anonymous access
config.set({"s3.endpoint_url": "https://storage.googleapis.com"})
repo = client.get_or_create_repo("ORG_NAME/REPO_NAME")
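
With the endpoint configured, the virtual ingest methods shown below work the same way against GCS buckets. As a rough sketch (the bucket and object names here are hypothetical):

gcs_url = "s3://my-gcs-bucket/data/example.nc"  # a GCS object, addressed through the S3-compatible endpoint
repo.add_virtual_netcdf(gcs_url, "netcdf/example")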

NetCDF

In this example, we will create virtual datasets from NetCDF files. The example file chosen is from Unidata's Example NetCDF Files Archive.

The specific file we will use is described as follows:

From the Community Climate System Model (CCSM), one time step of precipitation flux, air temperature, and eastward wind.

We have copied the file to S3 at the following URI: s3://earthmover-sample-data/netcdf/sresa1b_ncar_ccsm3-example.nc

First, we connect to a repo:

from arraylake import Client

client = Client()
repo = client.get_or_create_repo("earthmover/virtual-files")
repo
Output
<arraylake.repo.Repo 'earthmover/virtual-files'>

Adding a file is as simple as one line of code. The second argument to add_virtual_netcdf tells Arraylake where to place the resulting group within the repo hierarchy.

s3_url = "s3://earthmover-sample-data/netcdf/sresa1b_ncar_ccsm3-example.nc"
repo.add_virtual_netcdf(s3_url, "netcdf/sresa1b_ncar_ccsm3_example")
tip

The signature for all "virtual" methods mirrors cp source destination, so pass the remote URI first followed by the path in the repo where you want to ingest data.
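
For example, both of the calls below follow the same source-then-destination pattern (these paths are just placeholders):

repo.add_virtual_netcdf("s3://my-bucket/some-file.nc", "path/in/repo")
repo.add_virtual_zarr("s3://my-bucket/some-zarr-store", "another/path/in/repo")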

We can see that several arrays have been added to our dataset from this one file.

print(repo.root_group.tree())
Output
/
└── netcdf
    └── sresa1b_ncar_ccsm3_example
        ├── time (1,) >f8
        ├── msk_rgn (128, 256) >i4
        ├── time_bnds (1, 2) >f8
        ├── pr (1, 128, 256) >f4
        ├── lon_bnds (256, 2) >f8
        ├── plev (17,) >f8
        ├── lon (256,) >f4
        ├── lat (128,) >f4
        ├── area (128, 256) >f4
        ├── ua (1, 17, 128, 256) >f4
        ├── lat_bnds (128, 2) >f8
        └── tas (1, 128, 256) >f4

We can check the data to make sure it looks good. Let's open it with Xarray.

import xarray as xr
ds = xr.open_dataset(repo.store, group="netcdf/sresa1b_ncar_ccsm3_example", zarr_version=3, engine="zarr")
print(ds)
Output
<xarray.Dataset>
Dimensions: (lat: 128, lon: 256, bnds: 2, plev: 17, time: 1)
Coordinates:
* lat (lat) float32 -88.93 -87.54 -86.14 -84.74 ... 86.14 87.54 88.93
* lon (lon) float32 0.0 1.406 2.812 4.219 ... 354.4 355.8 357.2 358.6
* plev (plev) float64 1e+05 9.25e+04 8.5e+04 7e+04 ... 3e+03 2e+03 1e+03
* time (time) object 2000-05-16 12:00:00
Dimensions without coordinates: bnds
Data variables:
area (lat, lon) float32 ...
lat_bnds (lat, bnds) float64 ...
lon_bnds (lon, bnds) float64 ...
msk_rgn (lat, lon) int32 ...
pr (time, lat, lon) float32 ...
tas (time, lat, lon) float32 ...
time_bnds (time, bnds) object ...
ua (time, plev, lat, lon) float32 ...
Attributes: (12/18)
CVS_Id: $Id$
Conventions: CF-1.0
acknowledgment: Any use of CCSM data should acknowledge the contrib...
cmd_ln: bds -x 256 -y 128 -m 23 -o /data/zender/data/dst_T85.nc
comment: This simulation was initiated from year 2000 of \n C...
contact: ccsm@ucar.edu
... ...
project_id: IPCC Fourth Assessment
realization: 1
references: Collins, W.D., et al., 2005:\n The Community Climate...
source: CCSM3.0, version beta19 (2004): \natmosphere: CAM3.0...
table_id: Table A1
title: model output prepared for IPCC AR4
ds.pr.plot()
Output
<matplotlib.collections.QuadMesh at 0x28d5b1300>

[plot: map of the pr precipitation field]

The data look good! All of the metadata made it through, and Xarray was able to use them to make a very informative plot.

We're ready to commit.

repo.commit("Added virtual netCDF file")
Output
650e10826db135fa2d117447

Incrementally Overwrite Virtual Datasets

Now that our dataset is in Arraylake, we can incrementally overwrite it with new data. The new writes will be stored in our chunkstore; the original file will not be modified.

For example, imagine we discover that the precipitation field needs to be rescaled by a factor of 2:

pr_array = repo.root_group.netcdf.sresa1b_ncar_ccsm3_example.pr
pr_array[:] = 2 * pr_array[:]
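
As a quick sanity check, we can read the values back through the same array handle; they should now be doubled relative to the original file:

# the maximum precipitation flux should be twice what the source file reports
print(pr_array[:].max())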

We can see that only a single chunk of data was modified:

repo.status()

🧊 Using repo earthmover/virtual-files
📟 Session aac5ae7638a6469f92b09a6a3c12f965 started at 2023-09-22T22:09:07.513322
🌿 On branch main

paths modified in session

  • 📝 data/root/netcdf/sresa1b_ncar_ccsm3_example/pr/c0/0/0

This new chunk has been stored in the Arraylake chunkstore.

repo.commit("Rescaled precipitation")
Output
650e10876db135fa2d117448

Other Formats

Zarr V2

If you have existing Zarr V2 data stored in object storage, you can ingest it virtually into Arraylake without making a copy of any of the chunk data.

As an example, we use a dataset from the AWS CMIP6 open data collection.

zarr_url = "s3://cmip6-pds/CMIP6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/hfls/gn/v20200225"

Let's first open it directly with xarray and see what's inside.

print(xr.open_dataset(zarr_url, engine="zarr"))
Output
<xarray.Dataset>
Dimensions: (time: 1800, lat: 192, lon: 288, bnds: 2)
Coordinates:
* lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
lat_bnds (lat, bnds) float64 ...
* lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
lon_bnds (lon, bnds) float64 ...
* time (time) object 0001-01-16 12:00:00 ... 0150-12-16 12:00:00
time_bnds (time, bnds) object ...
Dimensions without coordinates: bnds
Data variables:
hfls (time, lat, lon) float32 ...
Attributes: (12/53)
Conventions: CF-1.7 CMIP-6.2
activity_id: CMIP
branch_method: Hybrid-restart from year 0701-01-01 of piControl
branch_time: 0.0
branch_time_in_child: 0.0
branch_time_in_parent: 182500.0
... ...
title: TaiESM1 output prepared for CMIP6
tracking_id: hdl:21.14100/813dbc9a-249f-4cde-a56c-fea0a42a5eb5
variable_id: hfls
variant_label: r1i1p1f1
netcdf_tracking_ids: hdl:21.14100/813dbc9a-249f-4cde-a56c-fea0a42a5eb5
version_id: v20200225

Now let's ingest it into Arraylake.

repo.add_virtual_zarr(zarr_url, "zarr/cmip6_example")
ds = xr.open_dataset(repo.store, group="zarr/cmip6_example", engine="zarr", zarr_version=3)
print(ds)
Output
<xarray.Dataset>
Dimensions: (time: 1800, lat: 192, lon: 288, bnds: 2)
Coordinates:
* lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
lat_bnds (lat, bnds) float64 ...
* lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
lon_bnds (lon, bnds) float64 ...
* time (time) object 0001-01-16 12:00:00 ... 0150-12-16 12:00:00
time_bnds (time, bnds) object ...
Dimensions without coordinates: bnds
Data variables:
hfls (time, lat, lon) float32 ...
Attributes: (12/51)
Conventions: CF-1.7 CMIP-6.2
activity_id: CMIP
branch_method: Hybrid-restart from year 0701-01-01 of piControl
branch_time: 0.0
branch_time_in_child: 0.0
branch_time_in_parent: 182500.0
... ...
table_id: Amon
table_info: Creation Date:(24 July 2019) MD5:0bb394a356ef9...
title: TaiESM1 output prepared for CMIP6
tracking_id: hdl:21.14100/813dbc9a-249f-4cde-a56c-fea0a42a5eb5
variable_id: hfls
variant_label: r1i1p1f1
ds.hfls[-1].plot()
Output
<matplotlib.collections.QuadMesh at 0x28f414dc0>

[plot: map of hfls at the final time step]

Data looks good! Let's commit.

repo.commit("Added virtual CMIP6 Zarr")
Output
650e10926db135fa2d117449

GRIB

To demonstrate GRIB ingest, we'll use a GRIB file from the Global Ensemble Forecast System (GEFS).

grib_uri = "s3://earthmover-sample-data/grib/cape_sfc_2000010100_c00.grib2"
repo.add_virtual_grib(grib_uri, "gefs-cape/")
print(repo.to_xarray("gefs-cape/"))
Output
<xarray.Dataset>
Dimensions: (step: 24, latitude: 361, longitude: 720)
Coordinates:
* latitude (latitude) float64 90.0 89.5 89.0 88.5 ... -89.0 -89.5 -90.0
* longitude (longitude) float64 0.0 0.5 1.0 1.5 ... 358.0 358.5 359.0 359.5
number int64 ...
* step (step) timedelta64[ns] 10 days 06:00:00 ... 16 days 00:00:00
surface float64 ...
time datetime64[ns] ...
valid_time (step) datetime64[ns] ...
Data variables:
cape (step, latitude, longitude) float64 ...
Attributes:
GRIB_centre: kwbc
GRIB_centreDescription: US National Weather Service - NCEP
GRIB_edition: 2
GRIB_subCentre: 2
institution: US National Weather Service - NCEP

We see a step dimension of size 24. Arraylake chooses to concatenate GRIB messages when it appears sensible to do so. However, GRIB files come in many varieties and can be hard to parse. Let us know if add_virtual_grib cannot ingest the GRIB file you want.
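
To confirm how the messages were combined, we can inspect the concatenated coordinates directly (a quick check using the same to_xarray call as above):

ds_gefs = repo.to_xarray("gefs-cape/")
# the 24 GRIB messages were stacked along the forecast `step` dimension
print(ds_gefs.step.size)
print(ds_gefs.valid_time.values[:3])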

repo.commit("add gefs GRIB")
Output
654daf73d38077081f2f3500