
Version Control with Icechunk

Arraylake helps you collaborate around your cloud data holdings by providing a rich and intuitive interface that understands the underlying array data model (Zarr), and the underlying version control model (Icechunk).

This notebook illustrates the version control model using the Arraylake Python client and the Arraylake web app.

Tip: See the Icechunk documentation on version control for a more detailed tutorial on this topic.

import arraylake as al

client = al.Client()

repo = client.create_repo("earthmover-public/vcs-demo")

By default, every Icechunk repo has a main branch.

repo.list_branches()
Output
{'main'}

View the commit history of this branch.

list(repo.ancestry(branch="main"))
Output
[SnapshotInfo(id="8GNCC3M3S696Y842H2N0", parent_id=None, written_at=datetime.datetime(2025,2,21,18,3,6,875702, tzinfo=datetime.timezone.utc), message="Repository...")]

There is a single commit that initialized this repository.
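
Each entry returned by ancestry is a SnapshotInfo carrying the fields visible in the output above (id, parent_id, written_at, message). As a minimal sketch, a small helper can turn that list into a compact log. The name format_history is hypothetical and not part of Arraylake or Icechunk; it assumes only that the object passed in exposes ancestry(branch=...) as used above.

```python
def format_history(repo, branch="main"):
    """Return one text line per snapshot on `branch`, newest first.

    Illustrative helper, not a library function. It relies only on
    `repo.ancestry(branch=...)` and the SnapshotInfo fields shown above.
    """
    lines = []
    for snap in repo.ancestry(branch=branch):
        # The initial commit has no parent, so show a placeholder instead of None.
        parent = snap.parent_id or "-"
        when = snap.written_at.strftime("%Y-%m-%d %H:%M:%S")
        lines.append(f"{snap.id}  parent={parent}  {when}  {snap.message}")
    return lines
```

Calling print("\n".join(format_history(repo))) at this point should print a single line, for the commit that initialized the repository.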

In the app

View this repository in the Arraylake web app.


Commit history

Let's make a second commit adding some attributes. We will use the Zarr API for this.

import zarr

session = repo.writable_session("main")
root = zarr.group(session.store)
root.attrs["foo"] = "bar"
session.commit(message="Add foo attribute to root group")
Output
'BFA63WSKK2NNBSFP2MEG'

Now we have two commits.

list(repo.ancestry(branch="main"))
Output
[SnapshotInfo(id="BFA63WSKK2NNBSFP2MEG", parent_id="8GNCC3M3S696Y842H2N0", written_at=datetime.datetime(2025,2,21,18,3,7,475453, tzinfo=datetime.timezone.utc), message="Add foo at..."),
SnapshotInfo(id="8GNCC3M3S696Y842H2N0", parent_id=None, written_at=datetime.datetime(2025,2,21,18,3,6,875702, tzinfo=datetime.timezone.utc), message="Repository...")]
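
The output above shows how the history is linked: the parent_id of the newest snapshot is the id of the one before it, and the initial commit has no parent. A minimal sketch of that check follows. The name check_parent_chain is hypothetical, and it assumes a newest-first list of objects with id and parent_id attributes, such as list(repo.ancestry(branch="main")).

```python
def check_parent_chain(snapshots):
    """Return True if every snapshot's parent_id is the id of the next entry.

    `snapshots` is expected newest first. Illustrative helper only;
    not part of Arraylake or Icechunk.
    """
    snapshots = list(snapshots)
    # Pair each snapshot with the entry that follows it in the list.
    for child, parent in zip(snapshots, snapshots[1:]):
        if child.parent_id != parent.id:
            return False
    # The oldest entry (the initial commit) should have no parent.
    return not snapshots or snapshots[-1].parent_id is None
```

For the two-commit history printed above, check_parent_chain(list(repo.ancestry(branch="main"))) should return True.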

View the history in the Arraylake web app.


Branches

Let us create two new branches off main.

Use lookup_branch to get the snapshot ID at the tip of main. Alternatively, you can copy the ID from the web app.


main_snapshot = repo.lookup_branch("main")
repo.create_branch("stage", main_snapshot)
repo.create_branch("prod", main_snapshot)

Use list_branches to view the branches for the repo.

repo.list_branches()
Output
{'main', 'prod', 'stage'}
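
When a notebook like this is re-run, a second create_branch call for a name that already exists would likely fail. A minimal sketch of a guard built only from the two calls used above (list_branches and create_branch) is shown here. The name ensure_branch is hypothetical and not part of Arraylake or Icechunk.

```python
def ensure_branch(repo, name, snapshot_id):
    """Create branch `name` at `snapshot_id` unless it already exists.

    Returns True if the branch was created and False if it was already
    present. Illustrative helper only.
    """
    if name in repo.list_branches():
        # Leave the existing branch untouched; it may have moved on since.
        return False
    repo.create_branch(name, snapshot_id)
    return True
```

A call such as ensure_branch(repo, "stage", repo.lookup_branch("main")) can then be repeated safely. Note that this sketch does not check where an existing branch points.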

Let us update an attribute on both branches.

from datetime import datetime
session = repo.writable_session("stage")
root = zarr.group(session.store)
root.attrs["branch"] = "stage"
root.attrs["updated_on"] = str(datetime.now())
session.commit("updated stage")
Output
'C51RZMWW10KAKGKB7340'

session = repo.writable_session("prod")
root = zarr.group(session.store)
root.attrs["branch"] = "prod"
root.attrs["updated_on"] = str(datetime.now())
session.commit("updated prod")
Output
'596VG3G6XKC7QYBR8720'
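
After these two commits, stage and prod each have their own tip but still share the history they inherited from main. As a rough sketch, the newest snapshot common to two branches can be found by walking both histories with ancestry. The name common_ancestor is hypothetical; it assumes ancestry yields snapshots newest first, as in the outputs above.

```python
def common_ancestor(repo, branch_a, branch_b):
    """Return the id of the newest snapshot shared by both branches, or None.

    Illustrative helper only; not part of Arraylake or Icechunk.
    """
    # Collect every snapshot id reachable from the first branch.
    ids_a = {snap.id for snap in repo.ancestry(branch=branch_a)}
    # Walk the second branch newest first; the first hit is the fork point.
    for snap in repo.ancestry(branch=branch_b):
        if snap.id in ids_a:
            return snap.id
    return None
```

For the repository above, common_ancestor(repo, "stage", "prod") should return the snapshot that both branches were created from, which is the current tip of main.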

View the branches in the branch picker in the app.


Selecting a branch shows the repository as it is on that branch. Use these links or the branch picker to navigate there.

  1. stage
  2. prod

Tags

We add tags with create_tag. Here we will use lookup_branch to add the tag to the "tip" of each branch.

repo.create_tag("v1", snapshot_id=repo.lookup_branch("prod"))
repo.create_tag("latest-stage", snapshot_id=repo.lookup_branch("stage"))
repo.list_tags()
Output
{'latest-stage', 'v1'}

repo.lookup_tag("v1")
Output
'596VG3G6XKC7QYBR8720'

This is the same snapshot ID as the tip of the prod branch.

repo.lookup_branch("prod")
Output
'596VG3G6XKC7QYBR8720'
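
The comparison above can be wrapped in a small helper that reports which tags currently point at the tip of a branch. This is a minimal sketch using only list_tags, lookup_tag and lookup_branch as shown in this notebook. The name tags_at_branch_tip is hypothetical and not part of Arraylake or Icechunk.

```python
def tags_at_branch_tip(repo, branch):
    """Return the sorted names of all tags that point at the tip of `branch`.

    Illustrative helper only.
    """
    tip = repo.lookup_branch(branch)
    # A tag matches when it resolves to the same snapshot ID as the branch tip.
    return sorted(tag for tag in repo.list_tags() if repo.lookup_tag(tag) == tip)
```

For the repository above, tags_at_branch_tip(repo, "prod") should include v1. Once a new commit is made on prod, the branch tip moves while the tag keeps pointing at the tagged snapshot, so the same call would then return an empty list.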