Arraylake and Icechunk

Arraylake helps you manage your Icechunk data Repositories (repos).

On top of Icechunk, Arraylake provides data management and collaboration functionality, helping you:

  1. manage cloud data in Icechunk repos,
  2. explore your data intuitively in our web app,
  3. control access to your data, and
  4. distribute your data over various industry-standard protocols.

Since Arraylake builds on top of the Icechunk storage engine, detailed docs are available at the Icechunk project. This page will provide a flavor of how you, as an Arraylake user, can accomplish many common tasks.

Open an Icechunk Repository

import warnings

warnings.simplefilter("ignore", UserWarning)
import arraylake as al

client = al.Client()
repo = client.get_or_create_repo("earthmover/icechunk-demo")
repo
Output
<icechunk.repository.Repository at 0x10d9c4d70>

We see that we get back an Icechunk Repository. In the background, Arraylake has handled all the complexity of setting up an appropriate icechunk.Storage instance with the necessary credentials.

This repo has a single commit initializing it.

print(list(repo.ancestry(branch="main")))
Output
[SnapshotInfo(id="1CECHNKREP0F1RSTCMT0", parent_id=None, written_at=datetime.datetime(2025,9,18,18,31,55,468460, tzinfo=datetime.timezone.utc), message="Repository...")]

With Zarr

tip

More extensive documentation is available at the Icechunk documentation on Zarr and the Zarr project.

Create a writable session to begin adding data to your repo. Next, extract a Zarr store from the session using session.store; this IcechunkStore can be passed to any existing API that accepts a Zarr V3 store.

session = repo.writable_session("main")
import zarr

root = zarr.open_group(session.store, mode="a")
group = root.create_group("my-group-name")
group.attrs["description"] = "This is a group, created with the Zarr API"

Let's create an array

group_array = group.create_array("chunks", shape=(10,), dtype=int, dimension_names=("z",))

Write some data

import numpy as np

group_array[:] = np.arange(10)

Use session.status to view your modifications

session.status()
Output
Groups created:
/
/my-group-name

Arrays created:
/my-group-name/chunks

Chunks updated:
/my-group-name/chunks:
[0]

This looks right. It's time to commit!

session.commit("Initialized my icechunk dataset!")
Output
'QPJGYNDXGV3ND5PB6NP0'

This showed writing a group in the Zarr hierarchy. You can also create and write directly to the root group. For complete documentation on how to do this, please see the Zarr Groups Documentation.

To write new data, you will need to create a new writable session.

session = repo.writable_session("main")
root_group = zarr.open_group(session.store, mode="a")
root_group
Output
<Group <icechunk.store.IcechunkStore object at 0x10da4c2d0>>
root_group.attrs["description"] = "This is a root group, created with the Zarr API"
root_array = root_group.create_array("root_chunks", shape=(15,), dtype=int, dimension_names=("x",))

root_array[:] = np.arange(15)
session.status()
Output
Arrays created:
/root_chunks

Group definitions updated:
/

Chunks updated:
/root_chunks:
[0]
session.commit("Wrote to the root group")
Output
'VCA0BEH3E0PZGPAJ44YG'
# Using the Zarr API, you can view the hierarchy we have created
root_group.tree()
/
├── my-group-name
│   └── chunks (10,) int64
└── root_chunks (15,) int64

With Xarray

tip

More extensive documentation is available at the Icechunk documentation on using Xarray.

Icechunk provides a Zarr store that fits in with Xarray's open_dataset, open_zarr, and to_zarr APIs.

import xarray as xr

xr.set_options(display_style="text")

session = repo.readonly_session("main")
xr.open_zarr(session.store, consolidated=False)
<xarray.Dataset> Size: 120B
Dimensions:      (x: 15)
Dimensions without coordinates: x
Data variables:
root_chunks  (x) int64 120B ...
Attributes:
description:  This is a root group, created with the Zarr API

Let's add a new variable to this store using Xarray's append mode (mode='a').

Start by creating the Dataset we want to write.

ds = xr.Dataset()
ds["striped-skunk"] = ("x", np.arange(10, 20))
ds
<xarray.Dataset> Size: 80B
Dimensions:        (x: 10)
Dimensions without coordinates: x
Data variables:
striped-skunk  (x) int64 80B 10 11 12 13 14 15 16 17 18 19

To write the data, create a writable session and call the to_zarr method. There are two important things to note here:

  1. mode='a': if we used mode='w', it would overwrite the chunks we wrote earlier.
  2. group="skunks": placing the Dataset in a subgroup makes it significantly easier to reorganize data later.
tip

If your data has a hierarchical relationship, you may want to consider using Xarray's built-in DataTree to manage it.

session = repo.writable_session("main")
ds.to_zarr(session.store, mode="a", group="skunks", consolidated=False)
Output
<xarray.backends.zarr.ZarrStore at 0x10806aa20>
session.status()
Output
Groups created:
/skunks

Arrays created:
/skunks/striped-skunk

Chunks updated:
/skunks/striped-skunk:
[0]
session.commit("added skunks!")
Output
'XN1JFNFNCH4JEB3SE3KG'
# get the dataset back
xr.open_zarr(repo.readonly_session("main").store, consolidated=False, group="skunks")
<xarray.Dataset> Size: 80B
Dimensions:        (x: 10)
Dimensions without coordinates: x
Data variables:
striped-skunk  (x) int64 80B ...

You can also write data to the root group by not specifying a group parameter:

session = repo.writable_session("main")
xr.open_zarr(session.store, consolidated=False)

ds_root = xr.Dataset()
ds_root["xarray-root-data"] = ("y", np.arange(10, 15))

# Write to the root group by omitting the group parameter
ds_root.to_zarr(session.store, mode="a", consolidated=False)
session.status()
Output
Arrays created:
/xarray-root-data

Group definitions updated:
/

Chunks updated:
/xarray-root-data:
[0]
session.commit("Added root data with Xarray")
Output
'3XJDX5VBCGBTN2ZY1ZM0'

Finally, here is the Zarr tree we have constructed.

zarr.open(repo.readonly_session("main").store).tree()
/
├── my-group-name
│   └── chunks (10,) int64
├── root_chunks (15,) int64
├── skunks
│   └── striped-skunk (10,) int64
└── xarray-root-data (5,) int64

Custom Configuration

tip

More extensive documentation is available at the Icechunk documentation on Configuration.

Custom configuration can be passed to Icechunk using the config kwarg of Client.get_repo and Client.create_repo.

For example, the code below sets custom concurrency and storage class settings:

import arraylake as al
import icechunk

config = icechunk.RepositoryConfig.default()
config.storage = icechunk.StorageSettings(
    concurrency=icechunk.StorageConcurrencySettings(
        max_concurrent_requests_for_object=10,
        ideal_concurrent_request_size=1_000_000,
    ),
    storage_class="STANDARD",
    metadata_storage_class="STANDARD_IA",
    chunks_storage_class="STANDARD_IA",
)

client = al.Client()
repo = client.get_or_create_repo(
    "earthmover/icechunk-demo",
    config=config,
)
repo.config
Output
RepositoryConfig(inline_chunk_threshold_bytes=None, get_partial_values_concurrency=None, compression=None, caching=None, storage=StorageSettings(concurrency=StorageConcurrencySettings(max_concurrent_requests_for_object=10, ideal_concurrent_request_size=1000000), retries=None, unsafe_use_conditional_create=None, unsafe_use_conditional_update=None, unsafe_use_metadata=None, storage_class="STANDARD", metadata_storage_class="STANDARD_IA", chunks_storage_class="STANDARD_IA"), manifest=None)

Remember that custom configuration can be persisted using Repository.save_config.

repo.save_config()

Virtual Datasets

Virtual datasets let you track array data in other file formats, such as netCDF, GRIB, and GeoTIFF, using the Icechunk storage engine (and thereby Arraylake). Icechunk stores only "chunk references" pointing at the binary data in those files. Use the VirtualiZarr package to "scan" these files and generate the necessary chunk references.

tip

More extensive documentation is available at the Icechunk documentation on Virtual Datasets and the VirtualiZarr project.

We illustrate using a single netCDF file.

import virtualizarr as vz
s3_url = "s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/201001/oisst-avhrr-v02r01.20100101.nc"
vds = vz.open_virtual_dataset(s3_url, indexes={})
vds
<xarray.Dataset> Size: 33MB
Dimensions:  (time: 1, zlev: 1, lat: 720, lon: 1440)
Coordinates:
lat      (lat) float32 3kB ManifestArray<shape=(720,), dtype=float32, chu...
lon      (lon) float32 6kB ManifestArray<shape=(1440,), dtype=float32, ch...
time     (time) float32 4B ManifestArray<shape=(1,), dtype=float32, chunk...
zlev     (zlev) float32 4B ManifestArray<shape=(1,), dtype=float32, chunk...
Data variables:
anom     (time, zlev, lat, lon) float64 8MB ManifestArray<shape=(1, 1, 72...
err      (time, zlev, lat, lon) float64 8MB ManifestArray<shape=(1, 1, 72...
ice      (time, zlev, lat, lon) float64 8MB ManifestArray<shape=(1, 1, 72...
sst      (time, zlev, lat, lon) float64 8MB ManifestArray<shape=(1, 1, 72...
Attributes: (12/38)
title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
Description:                Reynolds, et al.(2007) Daily High-resolution ...
source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
id:                         oisst-avhrr-v02r01.20100101.nc
naming_authority:           gov.noaa.ncei
summary:                    NOAAs 1/4-degree Daily Optimum Interpolation ...
...                         ...
metadata_link:              https://doi.org/10.25921/RE9P-PT57
ncei_template_version:      NCEI_NetCDF_Grid_Template_v2.0
comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
sensor:                     Thermometer, AVHRR
Conventions:                CF-1.6, ACDD-1.3
references:                 Reynolds, et al.(2007) Daily High-Resolution-...
session = repo.writable_session("main")

vds.virtualize.to_icechunk(session.store, group="virtual/netcdf")
session.status()
Output
Groups created:
/virtual
/virtual/netcdf

Arrays created:
/virtual/netcdf/anom
/virtual/netcdf/err
/virtual/netcdf/ice
/virtual/netcdf/lat
/virtual/netcdf/lon
/virtual/netcdf/sst
/virtual/netcdf/time
/virtual/netcdf/zlev

Chunks updated:
/virtual/netcdf/anom:
[0, 0, 0, 0]
/virtual/netcdf/err:
[0, 0, 0, 0]
/virtual/netcdf/ice:
[0, 0, 0, 0]
/virtual/netcdf/lat:
[0]
/virtual/netcdf/lon:
[0]
/virtual/netcdf/sst:
[0, 0, 0, 0]
/virtual/netcdf/time:
[0]
/virtual/netcdf/zlev:
[0]
session.commit("added virtual netCDF!")
Output
'1MXWAKQNV4K8HXJP6JMG'

Clean up

client.delete_repo("earthmover/icechunk-demo", imsure=True, imreallysure=True)