
🧊 Icechunk Migration Guide

Icechunk is Earthmover's new open-source storage format and library. In February 2025, we are beginning to migrate customers to Icechunk. This will bring massive benefits to everyone. It will also involve some code changes, as well as migration of existing datasets. We will work with you to minimize disruption so that everyone can realize the benefits of Icechunk as soon as possible!

What is Icechunk? How does it benefit me?

Icechunk is an evolution of how the Earthmover platform stores data. With the release of Icechunk, there are now two flavors of Repos in the Earthmover platform:

  • Arraylake V1 Repos - the original flavor of Repo, available since the very beginning.
  • Icechunk Repos - the new flavor of Repo, available since February 2025.

In designing Icechunk, we incorporated many of the lessons learned from the first iteration of Arraylake Repos. The result is a new type of Repo that is faster, more flexible, and more robust than V1 Repos. Even better, Icechunk is 100% open source. You can read more about Icechunk, how it works, and why we built it in the launch blog post.

A key technical difference is that V1 Repos require frequent communication with the Arraylake backend for all data and metadata operations, which adds some latency. Icechunk, in contrast, only needs to talk to the object store.

Icechunk Repos will eventually support all of the features of Arraylake repos and more! Here's a rundown of features.

Feature | Arraylake V1 Repo | Icechunk Repo
Xarray compatibility
ACID transactions
Tags & branches
Conflict resolution
Data expiration & garbage collection
Virtual files
VirtualiZarr compatibility
Speed | 🚂 | 🚀
Browse in Arraylake Catalog
Search in Arraylake Catalog | 🚧 (coming soon)
Integration with Query Service | 🚧 (coming soon)
Open Source
Zarr Python Compatibility | zarr<3 | zarr>=3

Icechunk is part of Arraylake, but it's also a standalone open source library. Icechunk has its own website at https://icechunk.io, which you should go check out!

How do I start using Icechunk?

Icechunk requires you to upgrade to Zarr Python 3. This new major release of Zarr has many enhancements which benefit Icechunk.

To install such an environment, run

pip install "arraylake[icechunk]>=0.15"

or see Setup and Installation for more details.

Warning

In environments with zarr>=3, existing V1 Arraylake repos will be available only in read-only mode.

At this point, you can create Icechunk repos. To create a new repo from Python:

import arraylake as al
client = al.Client()
repo = client.create_repo("my-org/my-icechunk-repo")

New Repos are Icechunk repos by default if icechunk and zarr>=3 are present in the environment.

How does my code have to change with Icechunk?

First, Zarr Python 3 has some API changes compared to the 2.x series. Please consult the Zarr 3 migration guide when upgrading to zarr>=3.

Furthermore, Icechunk Repos are not API compatible with V1 Repos. The Icechunk API was redesigned to incorporate many lessons learned from V1 Repos.

These changes bring the following improvements compared to V1 repos:

  • There is less implicit state on a Repo, due to the introduction of a separate Session object. This also makes it possible to open multiple Sessions on the same Repo.
  • Read-only vs writable sessions are more obvious, giving you more control with less risk.
  • It is easier to understand what you are discarding when you discard changes.
  • Much richer functionality around conflict detection and resolution.

We recommend reviewing the Icechunk documentation to get a feel for the API.

When migrating code, here are some of the small changes you have to make.

Opening an array or group with Zarr

The big difference is that Icechunk requires you to explicitly create a Session when interacting with data, using either repo.writable_session() or repo.readonly_session().

# V1 Repo
import zarr
repo = client.get_repo("my-org/v1-repo")
array = zarr.open(repo.store, path="path/to/array")
# ... make changes
repo.commit("wrote changes to V1 repo")

# Icechunk Repo
repo = client.get_repo("my-org/icechunk-repo")
# first create a session
session = repo.writable_session(branch="main")
array = zarr.open(session.store, path="path/to/array")
# ... make changes
session.commit("wrote changes to Icechunk repo")
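
Because the Session now carries the uncommitted state, the open / write / commit steps above can be wrapped in a small context manager. The following is a minimal sketch, not part of Arraylake or Icechunk: `writable_store` is a hypothetical helper, and it assumes the session object exposes `store`, `commit(message)`, and `discard_changes()` as described in the Icechunk documentation.

```python
from contextlib import contextmanager


@contextmanager
def writable_store(repo, message, branch="main"):
    """Yield a writable store; commit on success, discard on error.

    Hypothetical helper. Assumes `repo.writable_session(branch=...)` returns a
    session with `store`, `commit(message)`, and `discard_changes()`.
    """
    # Each call creates its own Session, so state is never shared implicitly.
    session = repo.writable_session(branch=branch)
    try:
        yield session.store
    except Exception:
        # Drop the uncommitted changes held by this session, then re-raise.
        session.discard_changes()
        raise
    else:
        # Only reached when the body of the `with` block finished cleanly.
        session.commit(message)


# Usage (assumes `client` and `zarr` as in the snippets above):
# repo = client.get_repo("my-org/icechunk-repo")
# with writable_store(repo, "wrote changes to Icechunk repo") as store:
#     array = zarr.open(store, path="path/to/array")
#     # ... make changes
```

The commit happens only if the block exits without an exception, so a failed write never produces a partial commit.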

Opening an array or group with Xarray

Similarly, you need to use a Session when opening data with Xarray:

# V1 Repo
import xarray as xr
repo = client.get_repo("my-org/v1-repo")
ds = repo.to_xarray("path/to/group")

# Icechunk Repo
repo = client.get_repo("my-org/icechunk-repo")
# first create a session
session = repo.readonly_session(branch="main")
ds = xr.open_dataset(session.store, group="path/to/group", engine="zarr", consolidated=False)

Distributed writes with Dask / Xarray + Dask

You can use Icechunk in conjunction with Xarray and Dask to perform large-scale distributed writes from a multi-node cluster. However, because of how Icechunk works, it's not possible to use the existing dask.array.to_zarr or xarray.Dataset.to_zarr functions with either the Dask multiprocessing or distributed schedulers.

Instead, Icechunk provides its own specialized functions for distributed writes with Dask and Xarray, such as icechunk.xarray.to_icechunk. To learn how to use these functions, consult the Icechunk Docs.
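
As a rough illustration of the shape such a write takes, here is a minimal sketch. `write_dataset` is a hypothetical helper, and the `to_icechunk(ds, session)` call form is an assumption based on the Icechunk documentation, so check the docs for your installed version. The `writer` parameter exists only so the sketch can be tried with a stand-in function.

```python
def write_dataset(repo, ds, message, branch="main", writer=None):
    """Write an Xarray dataset to an Icechunk repo and commit the result.

    Hypothetical helper. `writer` defaults to `icechunk.xarray.to_icechunk`,
    which is assumed to be callable as `to_icechunk(ds, session)`.
    """
    if writer is None:
        # Imported lazily so the helper can be defined without Icechunk installed.
        from icechunk.xarray import to_icechunk as writer

    # Writes go through a writable Session rather than through the Repo itself.
    session = repo.writable_session(branch=branch)
    # Used here in place of ds.to_zarr(...) for Dask-backed datasets.
    writer(ds, session)
    # Nothing is visible to other readers until the session is committed.
    return session.commit(message)


# Usage (assumes `client` as in the snippets above and a Dask-backed `ds`):
# repo = client.get_repo("my-org/icechunk-repo")
# write_dataset(repo, ds, "distributed write with to_icechunk")
```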

Committing changes and resolving conflicts

Icechunk has a much more sophisticated mechanism for committing changes and resolving conflicts between concurrent writers. Where Arraylake would simply fail, Icechunk offers powerful and more flexible conflict resolution functions.
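
One common pattern is to retry a commit after rebasing the session onto the latest state of the branch. The sketch below is illustrative only: `commit_with_rebase` is a hypothetical helper, and it assumes that `session.commit` raises a conflict exception (`icechunk.ConflictError` in the Icechunk library) and that `session.rebase(solver)` accepts a solver such as `icechunk.ConflictDetector()`. Consult the Icechunk documentation for the exact names in your version.

```python
def commit_with_rebase(session, message, solver, conflict_error, max_tries=3):
    """Commit, rebasing and retrying when another writer committed first.

    Hypothetical helper. `conflict_error` is the exception class raised by a
    conflicting commit and `solver` is passed unchanged to `session.rebase`.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return session.commit(message)
        except conflict_error:
            if attempt == max_tries:
                # Out of retries: surface the conflict to the caller.
                raise
            # Replay our changes on top of the new branch tip. If the solver
            # cannot reconcile them, rebase itself is expected to raise.
            session.rebase(solver)


# Usage (names from the Icechunk library, assumed as described above):
# import icechunk
# commit_with_rebase(
#     session,
#     "wrote changes to Icechunk repo",
#     solver=icechunk.ConflictDetector(),
#     conflict_error=icechunk.ConflictError,
# )
```

Passing the exception class and solver in as arguments keeps the retry logic independent of any particular conflict policy.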

How do I migrate my existing V1 Repos to Icechunk Repos?

You have two options here:

  1. Manually copy your data from a V1 repo to an Icechunk repo. This is suitable for small repos.
  2. Request support from the Earthmover team. We have an internal process we can run which is scalable to large repos. The migration process is non-destructive, so you can start to experiment with Icechunk while still using your V1 Repos in production. Please reach out via Slack to coordinate your migration!

What is the timeline for this migration?

We are committed to supporting all customers through this transition with minimal disruption. Below is a tentative timeline for the migration.

  • February 2025 - Icechunk is available in Arraylake
  • March 2025 - Icechunk becomes the default for new repos
  • April 2025 - V1 repos can no longer be created
  • June 2025 - V1 repos move to read-only mode
  • September 2025 - V1 repos no longer supported