Arraylake and Icechunk

Arraylake helps you manage your Icechunk data Repositories (repos).

On top of Icechunk, Arraylake provides data management and collaboration functionality, helping you:

  1. manage cloud data in Icechunk repos,
  2. explore your data intuitively in our web app,
  3. control access to your data, and
  4. distribute your data over various industry-standard protocols.

Because Arraylake builds on the Icechunk storage engine, detailed documentation is available from the Icechunk project. This page gives a flavor of how you, as an Arraylake user, can accomplish common tasks.

Open an Icechunk Repository

import warnings

warnings.simplefilter("ignore", UserWarning)
import arraylake as al

client = al.Client()
repo = client.get_or_create_repo("earthmover/icechunk-demo")
repo
Output
<icechunk.repository.Repository at 0x11e8e1d30>

The call returns an Icechunk Repository. In the background, Arraylake has handled all the complexity of setting up an appropriate icechunk.Storage instance with the necessary credentials.

This repo has a single commit initializing it.

list(repo.ancestry(branch="main"))
Output
[SnapshotInfo(id="8877AVAMNTE56ZKQQW20", parent_id=None, written_at=datetime.datetime(2025,2,20,23,42,19,269491, tzinfo=datetime.timezone.utc), message="Repository...")]
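
Each SnapshotInfo above carries an id and a parent_id, so a branch's history can be read as a chain of snapshots linked through their parents. The following is only a conceptual sketch of that idea, not Icechunk's implementation: the Snapshot class, the ancestry helper, and the ids S0 to S2 are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical, simplified stand-in for a snapshot record."""
    id: str
    parent_id: Optional[str]
    message: str


def ancestry(snapshots: Dict[str, Snapshot], tip_id: str) -> Iterator[Snapshot]:
    """Yield snapshots from the given tip back to the root by following parent_id."""
    current: Optional[str] = tip_id
    while current is not None:
        snap = snapshots[current]
        yield snap
        current = snap.parent_id


# Made-up ids and messages, for illustration only.
snapshots = {
    "S0": Snapshot("S0", None, "initial commit"),
    "S1": Snapshot("S1", "S0", "first data commit"),
    "S2": Snapshot("S2", "S1", "second data commit"),
}

history = [s.id for s in ancestry(snapshots, "S2")]
print(history)
```

A freshly initialized repo corresponds to the single-entry case: a chain whose only snapshot has parent_id set to None.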

With Zarr

Tip: More extensive documentation is available in the Icechunk documentation on Zarr and from the Zarr project.

Create a writable session to begin adding data to your repo. Then extract a Zarr store from Icechunk using session.store; this IcechunkStore can be passed to any existing API that accepts a Zarr V3 store.

session = repo.writable_session("main")
import zarr

group = zarr.open_group(session.store, mode="w")
group
Output
<Group <icechunk.store.IcechunkStore object at 0x11f331ca0>>
group.attrs["description"] = "This is a root group, created with the Zarr API"

Let's create an array.

array = group.create_array("chunks", shape=(10,), dtype=int, dimension_names=("x",))

Write some data.

import numpy as np

array[:] = np.arange(10)

Use session.status() to view your modifications.

session.status()
Output
Groups created:
/

Arrays created:
/chunks

Chunks updated:
/chunks:
[0]

This looks right. It's time to commit!

session.commit("Initialized my icechunk dataset!")
Output
'PJ1NZXR28TH7S01Y3TPG'
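
The "Chunks updated" listing in the status output above names chunks by their position in the array's chunk grid. As a rough sketch of how those indices arise, assuming a regular chunk grid, the hypothetical helper chunk_indices below (not part of Zarr or Icechunk) enumerates the grid for a given array shape and chunk shape. The 10-element array here is stored as one 10-element chunk, which is why only [0] is reported.

```python
import itertools
import math


def chunk_indices(shape, chunk_shape):
    """List every chunk-grid index for an array of `shape` split into `chunk_shape` chunks."""
    # Number of chunks along each dimension, rounding up for a partial last chunk.
    counts = [math.ceil(s / c) for s, c in zip(shape, chunk_shape)]
    return [list(idx) for idx in itertools.product(*(range(n) for n in counts))]


# One 10-element chunk covers the whole 10-element array.
single = chunk_indices((10,), (10,))
print(single)

# The same array split into chunks of 4 needs three chunks (the last one partial).
split = chunk_indices((10,), (4,))
print(split)
```

With a smaller chunk shape, a full write would touch every index in the grid, and a partial write only those chunks that overlap the written region.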

With Xarray

Tip: More extensive documentation is available in the Icechunk documentation on using Xarray.

Icechunk provides a Zarr store that fits in with Xarray's open_dataset, open_zarr, and to_zarr APIs.

import xarray as xr

xr.set_options(display_style="text")

session = repo.readonly_session("main")
xr.open_zarr(session.store, consolidated=False)
<xarray.Dataset> Size: 80B
Dimensions:  (x: 10)
Dimensions without coordinates: x
Data variables:
    chunks   (x) int64 80B dask.array<chunksize=(10,), meta=np.ndarray>
Attributes:
    description:  This is a root group, created with the Zarr API

Let us add a new variable to this store using Xarray and the "append" mode (mode="a").

Start by creating the Dataset, then open a writable session and write the Dataset to its store.

ds = xr.Dataset()
ds["skunks"] = ("x", np.arange(10, 20))
ds
<xarray.Dataset> Size: 80B
Dimensions:  (x: 10)
Dimensions without coordinates: x
Data variables:
    skunks   (x) int64 80B 10 11 12 13 14 15 16 17 18 19
session = repo.writable_session("main")
ds.to_zarr(session.store, mode="a", consolidated=False)
Output
<xarray.backends.zarr.ZarrStore at 0x308e75120>
session.status()
Output
Arrays created:
/skunks

Group definitions updated:
/

Chunks updated:
/skunks:
[0]
session.commit("added skunks!")
Output
'E6CYPC5W6DKE3EFMVD70'
xr.open_zarr(session.store, consolidated=False)
<xarray.Dataset> Size: 160B
Dimensions:  (x: 10)
Dimensions without coordinates: x
Data variables:
    chunks   (x) int64 80B dask.array<chunksize=(10,), meta=np.ndarray>
    skunks   (x) int64 80B dask.array<chunksize=(10,), meta=np.ndarray>

Virtual Datasets

Virtual datasets let you use the Icechunk storage engine (and thereby Arraylake) to keep track of array data stored in other file formats such as netCDF, GRIB, and GeoTIFF. Icechunk stores only the "chunk references" to the binary data in these files. Use the VirtualiZarr package to "scan" these files and generate the necessary chunk references.

Tip: More extensive documentation is available in the Icechunk documentation on Virtual Datasets and from the VirtualiZarr project.
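
To make the idea of a "chunk reference" concrete, here is a minimal conceptual sketch that assumes a reference is simply a file location plus a byte offset and length. The ChunkRef class, the read_chunk helper, and the fake binary file are hypothetical illustrations, not the data structures Icechunk or VirtualiZarr actually use.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical stand-in for a chunk reference: where a chunk's bytes live."""
    path: str
    offset: int
    length: int


def read_chunk(ref: ChunkRef) -> bytes:
    """Fetch one chunk's bytes from the referenced file without copying the file."""
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


# Build a fake "archival file": a 16-byte header followed by two 8-byte chunks.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"H" * 16 + b"AAAAAAAA" + b"BBBBBBBB")
    path = tmp.name

# The "manifest" maps chunk-grid indices to references into the existing file.
manifest = {
    (0,): ChunkRef(path, offset=16, length=8),
    (1,): ChunkRef(path, offset=24, length=8),
}

first_chunk = read_chunk(manifest[(0,)])
second_chunk = read_chunk(manifest[(1,)])
print(first_chunk, second_chunk)

os.remove(path)  # clean up the temporary file
```

The point of the sketch is that only the small manifest needs to be stored; the original file stays where it is and is read in byte ranges on demand.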

We illustrate using a single netCDF file.

import virtualizarr as vz
s3_url = "s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/201001/oisst-avhrr-v02r01.20100101.nc"
vds = vz.open_virtual_dataset(s3_url, indexes={})
vds
<xarray.Dataset> Size: 33MB
Dimensions:  (time: 1, zlev: 1, lat: 720, lon: 1440)
Coordinates:
    lat      (lat) float32 3kB ManifestArray<shape=(720,), dtype=float32, chu...
    lon      (lon) float32 6kB ManifestArray<shape=(1440,), dtype=float32, ch...
    time     (time) float32 4B ManifestArray<shape=(1,), dtype=float32, chunk...
    zlev     (zlev) float32 4B ManifestArray<shape=(1,), dtype=float32, chunk...
Data variables:
    anom     (time, zlev, lat, lon) float64 8MB ManifestArray<shape=(1, 1, 72...
    err      (time, zlev, lat, lon) float64 8MB ManifestArray<shape=(1, 1, 72...
    ice      (time, zlev, lat, lon) float64 8MB ManifestArray<shape=(1, 1, 72...
    sst      (time, zlev, lat, lon) float64 8MB ManifestArray<shape=(1, 1, 72...
Attributes: (12/38)
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
    Description:                Reynolds, et al.(2007) Daily High-resolution ...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    id:                         oisst-avhrr-v02r01.20100101.nc
    naming_authority:           gov.noaa.ncei
    summary:                    NOAAs 1/4-degree Daily Optimum Interpolation ...
    ...                         ...
    metadata_link:              https://doi.org/10.25921/RE9P-PT57
    ncei_template_version:      NCEI_NetCDF_Grid_Template_v2.0
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    sensor:                     Thermometer, AVHRR
    Conventions:                CF-1.6, ACDD-1.3
    references:                 Reynolds, et al.(2007) Daily High-Resolution-...
session = repo.writable_session("main")

vds.virtualize.to_icechunk(session.store, group="virtual/netcdf")
session.status()
Output
Groups created:
/virtual
/virtual/netcdf

Arrays created:
/virtual/netcdf/anom
/virtual/netcdf/err
/virtual/netcdf/ice
/virtual/netcdf/lat
/virtual/netcdf/lon
/virtual/netcdf/sst
/virtual/netcdf/time
/virtual/netcdf/zlev

Chunks updated:
/virtual/netcdf/anom:
[0, 0, 0, 0]
/virtual/netcdf/err:
[0, 0, 0, 0]
/virtual/netcdf/ice:
[0, 0, 0, 0]
/virtual/netcdf/lat:
[0]
/virtual/netcdf/lon:
[0]
/virtual/netcdf/sst:
[0, 0, 0, 0]
/virtual/netcdf/time:
[0]
/virtual/netcdf/zlev:
[0]
session.commit("added virtual netCDF!")
Output
'45FB2RJVVC6YZXCRZJ50'

Clean up

client.delete_repo("earthmover/icechunk-demo", imsure=True, imreallysure=True)