Manage Zarr Data

Once you have created a repo, you are ready to manage Zarr data. Our goal on this page is to learn how to:

  • Check out a specific snapshot of our repository
  • Create Zarr Groups and Arrays
  • Write Zarr attributes (metadata)
  • Write Array data
  • Commit changes
  • Roll back to previous versions

Tip: If you're completely new to Zarr, you might want to first check out the Zarr Tutorial before diving in.

We start by importing arraylake.

from arraylake import Client

Connect to Repo

The name of our repo will be earthmover/ocean. Here earthmover is the name of our org; you should replace it with the name of your own org.

client = Client()
repo = client.create_repo("earthmover/ocean")
repo
Output
<arraylake.repo.Repo 'earthmover/ocean'>

We can inspect the status of our repo as follows:

repo.status()

🧊 Using repo earthmover/ocean
📟 Session 5ea5544b8a7d4809a0bbf00496602e12 started at 2024-02-26T15:19:56.512000
🌿 On branch main

No changes in current session

Open the Root Group

The repo object exposes a valid Zarr V3 Store. We can access the root group as follows:

repo.root_group
Output
<zarr.hierarchy.Group '/'>

We'll explore Arraylake's version control system by adding some metadata to this group.

When your store is created, by default there are no metadata (Zarr attributes) associated with the root group.

dict(repo.root_group.attrs)
Output
{}

Let's create some now.

repo.root_group.attrs["title"] = "Ocean Data Repository 🐳"
dict(repo.root_group.attrs)
Output
{'title': 'Ocean Data Repository 🐳'}

Zarr attributes are completely arbitrary. 🤷‍♀️ You can use whatever key / value pairs make sense for you.
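
One practical caveat: Zarr stores attributes as JSON, so values generally need to be JSON-serializable. The sketch below shows one way to check a dictionary before assigning it. check_attrs is a hypothetical helper written for this example; it is not part of Zarr or Arraylake.

```python
import json


def check_attrs(attrs):
    """Return the keys whose values cannot be encoded as JSON.

    Hypothetical helper for illustration; not part of Zarr or Arraylake.
    """
    bad_keys = []
    for key, value in attrs.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            bad_keys.append(key)
    return bad_keys


candidate = {
    "title": "Ocean Data Repository 🐳",
    "depth_levels": [0, 10, 50],
    "created": {1, 2, 3},  # a Python set has no JSON representation
}
print(check_attrs(candidate))  # ['created']
```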

Arraylake knows we have made changes. We can see the changes as follows:

repo.status()

🧊 Using repo earthmover/ocean
📟 Session 5ea5544b8a7d4809a0bbf00496602e12 started at 2024-02-26T15:19:56.512000
🌿 On branch main

paths modified in session

  • 📝 meta/root.group.json

For now, our changes are in an uncommitted state.

Commit Changes

The changes we made are visible to us. But they can't be seen by anyone else using the repo...until we commit! A commit creates a snapshot of the repository state that can be seen by everyone.

We create a commit like this:

cid = repo.commit("My first commit 🥹")
cid
Output
65dcad36fb58969e1179ef90
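
The visibility rule can be pictured with a toy model: each session works on a private copy of the latest snapshot, and a commit publishes that copy as a new snapshot. ToyRepo and ToySession are invented for this illustration. They are not Arraylake classes and say nothing about how Arraylake is implemented.

```python
class ToyRepo:
    """Toy model of snapshot-based visibility; NOT Arraylake's implementation."""

    def __init__(self):
        self.snapshots = [{}]  # committed states, oldest first

    def session(self):
        return ToySession(self)


class ToySession:
    def __init__(self, repo):
        self.repo = repo
        # Start from a private copy of the latest committed state.
        self.attrs = dict(repo.snapshots[-1])

    def commit(self):
        # Publish a copy of the session state as a new snapshot.
        self.repo.snapshots.append(dict(self.attrs))
        return len(self.repo.snapshots) - 1  # toy commit ID


toy_repo = ToyRepo()
writer = toy_repo.session()
writer.attrs["title"] = "Ocean Data Repository"

reader_before = toy_repo.session()
print("title" in reader_before.attrs)  # False: the change is not committed yet

writer.commit()
reader_after = toy_repo.session()
print("title" in reader_after.attrs)  # True: the commit published the change
```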

We can see the commit log for our repo like this:

repo.commit_log
  • Commit ID: 65dcad36fb58969e1179ef90
    Author: Ryan Abernathey <ryan@earthmover.io>
    Date: 2024-02-26T15:24:38.616000

    My first commit 🥹

Now let's make another commit to update the metadata with a description.

repo.root_group.attrs["description"] = "Data about the ocean."
repo.commit("Added description field")
Output
65dcad3afb58969e1179ef91

Check out an Earlier Commit

If we want to roll back the state of our repo, we can check out an earlier commit. We get the commit ID by looking at the commit log.

repo.commit_log
  • Commit ID: 65dcad3afb58969e1179ef91
    Author: Ryan Abernathey <ryan@earthmover.io>
    Date: 2024-02-26T15:24:42.007000

    Added description field

  • Commit ID: 65dcad36fb58969e1179ef90
    Author: Ryan Abernathey <ryan@earthmover.io>
    Date: 2024-02-26T15:24:38.616000

    My first commit 🥹

repo.checkout(cid)
Output
/Users/rabernat/mambaforge/envs/arraylake-local/lib/python3.11/site-packages/arraylake/repo.py:377: UserWarning: You are not on a branch tip, so you can't commit changes.
warnings.warn("You are not on a branch tip, so you can't commit changes.")
Output
65dcad36fb58969e1179ef90

The warning tells us that we can't commit changes while we are in this state.

If we look at the attributes, we can see the first change (title) but not the second (description):

dict(repo.root_group.attrs)
Output
{'title': 'Ocean Data Repository 🐳'}

After checking out the latest commit, we can see all the metadata again.

repo.checkout("main")
dict(repo.root_group.attrs)
Output
{'description': 'Data about the ocean.', 'title': 'Ocean Data Repository 🐳'}
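
If you often inspect old commits, the check-out-then-return pattern above could be wrapped in a context manager. This is only a sketch. at_ref is a hypothetical helper, and it assumes repo.checkout accepts either a commit ID or a branch name, as in the two calls shown on this page.

```python
from contextlib import contextmanager


@contextmanager
def at_ref(repo, ref, restore="main"):
    """Temporarily check out `ref`, then check out `restore` again.

    Hypothetical helper; assumes `repo.checkout` accepts a commit ID
    or a branch name, as in the calls shown on this page.
    """
    repo.checkout(ref)
    try:
        yield repo
    finally:
        # Always return to the branch, even if the body raised an error.
        repo.checkout(restore)
```

With a real repo object, `with at_ref(repo, cid) as old: print(dict(old.root_group.attrs))` would read the attributes at the first commit and then return to main.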

Create Arrays and Groups

We can create whatever Zarr groups and arrays we want in our repo. In this example, we create a sub-group and create an array within it.

import numpy as np
atlantic_group = repo.root_group.create_group("atlantic")
atlantic_group.attrs["title"] = "Atlantic Ocean"
temperature_array = atlantic_group.create(
    "temperature",
    shape=100, chunks=10, dtype="f4", fill_value=np.nan
)
temperature_array.attrs["name"] = "Atlantic Ocean Temperature"
temperature_array.attrs["units"] = "degrees Celsius"

repo.tree().__rich__()
Output
/
└── 📁 atlantic
    └── 🇦 temperature (100,) float32

We didn't write any data to our array yet, so accessing elements just returns the fill value.

temperature_array[0]
Output
nan

Let's assign some data to part of the array.

temperature_array[:50] = 10.0

Now let's see what changes we have made.

repo.status()

🧊 Using repo earthmover/ocean
📟 Session 7bebc5626d7648adb54ebe35963d84a3 started at 2024-02-26T15:24:49.985000
🌿 On branch main

paths modified in session

  • 📝 data/root/atlantic/temperature/c1
  • 📝 data/root/atlantic/temperature
  • 📝 meta/root/atlantic.group.json
  • 📝 data/root/atlantic/temperature/c0
  • 📝 data/root/atlantic/temperature/c2
  • 📝 meta/root/atlantic/temperature.array.json
  • 📝 data/root/atlantic/temperature/c4
  • 📝 data/root/atlantic/temperature/c3
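
The chunk paths in this listing line up with the slice we wrote. As a rough sketch, and assuming the cN suffix is the chunk index along the array's single dimension, the touched chunks can be computed with integer division. touched_chunks is a hypothetical helper, not part of Zarr or Arraylake.

```python
def touched_chunks(start, stop, chunk_len):
    """Indices of the 1-D chunks overlapped by the half-open range [start, stop)."""
    if stop <= start:
        return []
    first = start // chunk_len
    last = (stop - 1) // chunk_len
    return list(range(first, last + 1))


# temperature_array[:50] with chunks of length 10
print([f"c{i}" for i in touched_chunks(0, 50, 10)])
# ['c0', 'c1', 'c2', 'c3', 'c4']
```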

We're ready to commit our changes!

repo.commit("Created a group and array")
Output
65dcad55fb58969e1179ef98

Examine Array Info

The .info property on arrays shows a lot of useful details.

temperature_array.info
Name: /atlantic/temperature
Type: zarr.core.Array
Data type: float32
Shape: (100,)
Chunk shape: (10,)
Order: C
Read-only: False
Compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type: arraylake.repo.ArraylakeStore
No. bytes: 400
No. bytes stored: 56
Storage ratio: 7.1
Chunks initialized: 5/10
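
Several of these rows follow from the others. The short sketch below reproduces the arithmetic, using numbers copied from the output above. The stored byte count is taken as given because it depends on the compressor.

```python
shape = (100,)
chunk_len = 10
itemsize = 4        # bytes per float32 ("f4") element
bytes_stored = 56   # copied from the "No. bytes stored" row above
written = 50        # elements written by temperature_array[:50] = 10.0

n_bytes = shape[0] * itemsize                  # uncompressed size
n_chunks = -(-shape[0] // chunk_len)           # ceiling division
chunks_initialized = -(-written // chunk_len)  # chunks that hold written data

print(n_bytes)                               # 400
print(round(n_bytes / bytes_stored, 1))      # 7.1
print(f"{chunks_initialized}/{n_chunks}")    # 5/10
```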

Extend an Array

Zarr supports resizing and appending to arrays. Let's resize that temperature array to be twice as long.

The appending process consists of two steps:

  1. resize the array
  2. write new data to the newly initialized region

Thanks to Arraylake's transaction system, both these steps can be done as part of a single ACID transaction. Other readers of the data will never see the array in an intermediate state (e.g. resized but not yet populated with data), as they would with file storage or cloud object storage.

Step 1 - Resize the array:

repo.root_group.atlantic.temperature.resize(200)
repo.root_group.atlantic.temperature
Output
<zarr.core.Array '/atlantic/temperature' (200,) float32>

Step 2 - write new data to the new region of the array:

repo.root_group.atlantic.temperature[100:] = 99

Let's look at the repo status to see what was changed:

repo.status()

🧊 Using repo earthmover/ocean
📟 Session 187e5955926042719616fe51cd50a0f1 started at 2024-02-26T15:25:10.017000
🌿 On branch main

paths modified in session

  • 📝 data/root/atlantic/temperature/c15
  • 📝 data/root/atlantic/temperature/c11
  • 📝 data/root/atlantic/temperature/c18
  • 📝 data/root/atlantic/temperature/c19
  • 📝 data/root/atlantic/temperature/c17
  • 📝 data/root/atlantic/temperature/c12
  • 📝 meta/root/atlantic/temperature.array.json
  • 📝 data/root/atlantic/temperature/c10
  • 📝 data/root/atlantic/temperature/c13
  • 📝 data/root/atlantic/temperature/c16
  • 📝 data/root/atlantic/temperature/c14

As we can see, we modified a metadata document (because we resized the array) and wrote new chunks.

Now that our change is complete, let's commit.

repo.commit("extended temperature array")
Output
65dcad5efb58969e1179ef99
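
The resize, write, and commit steps above could be bundled into one function so they are always performed together. This is a sketch rather than an Arraylake API. append_1d is a hypothetical helper that uses only the calls shown on this page (the array's shape, resize, slice assignment, and repo.commit).

```python
def append_1d(repo, array, values, message):
    """Grow a 1-D array by len(values), fill the new region, and commit.

    Hypothetical helper for illustration; it only uses calls shown on
    this page (`shape`, `resize`, slice assignment, `repo.commit`).
    """
    old_len = array.shape[0]
    array.resize(old_len + len(values))  # step 1: resize the array
    array[old_len:] = values             # step 2: fill the new region
    # Other readers see neither step until this commit is made.
    return repo.commit(message)
```

With the objects from this tutorial, a call might look like `append_1d(repo, repo.root_group.atlantic.temperature, [99.0] * 100, "extended temperature array")`.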

Move an Array or Group

The interface for moving arrays and groups is the zarr.Group.move method. In standard Zarr, this operation is costly because every chunk in the moved hierarchy is copied from the source to the destination. For example, imagine moving a group on S3 that contains two arrays with a million chunks each. This operation would require reading and writing two million chunks from and to S3, which could incur significant time and transfer costs.

Arraylake makes this operation very efficient: no chunk or metadata is copied. As a consequence, the operation is fast, cheap, and easily reversible through Arraylake's full versioning system.

Let's try it:

atlantic = repo.root_group.atlantic
atlantic.move("temperature", "temperature_array")

repo.root_group.move("atlantic", "atlantic_group")

This operation will be very quick, regardless of the data volume being moved.
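
This page does not describe how Arraylake achieves this. One general way a rename can avoid copying is to treat names as pointers into the hierarchy, so that moving a subtree only re-points a single entry. The toy below illustrates that idea. Node is invented for this example and is not Arraylake's data model.

```python
class Node:
    """Toy hierarchy node; an illustration only, not Arraylake's data model."""

    def __init__(self, payload=None):
        self.payload = payload  # stands in for chunk data
        self.children = {}      # name -> Node

    def move(self, old_name, new_name):
        # Re-point one dictionary entry; the subtree itself is untouched.
        self.children[new_name] = self.children.pop(old_name)


root = Node()
atlantic = Node()
chunks = bytearray(1_000_000)  # pretend this is a large amount of chunk data
atlantic.children["temperature"] = Node(payload=chunks)
root.children["atlantic"] = atlantic

atlantic.move("temperature", "temperature_array")
root.move("atlantic", "atlantic_group")

moved = root.children["atlantic_group"].children["temperature_array"]
print(moved.payload is chunks)  # True: same object, nothing was copied
```

In this toy, the cost of a move does not depend on the size of the payload, which mirrors the behavior described above.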

Thanks to Arraylake's version control system, nobody outside our session can see the change yet. To make the move visible to other users of the repository, we need to commit it. First, let's review the status:

repo.status()

🧊 Using repo earthmover/ocean
📟 Session cbf6c26eba3347859de4c60242467836 started at 2024-02-26T15:25:19.119000
🌿 On branch main

paths modified in session

  • 📝 meta/root/atlantic_group/temperature_array.array.json
  • ❌ meta/root/atlantic/temperature.array.json
  • 📝 meta/root/atlantic_group.group.json
  • 📝 data/root/atlantic_group/temperature_array
  • ❌ data/root/atlantic/temperature
  • ❌ meta/root/atlantic.group.json

repo.commit("Moved group and array")
Output
65dcad63fb58969e1179efa2

zarr.Group.move is a tool for organizing your repository hierarchy. As in many other systems, behavior is undefined for concurrent moves within the same session. We recommend you review your changes using tree() or other listing operations after moving. Like other operations in Arraylake, in the case of an undesirable change, it's trivial to discard your session and start afresh.

Delete an Array or Group

If we want to remove objects from our repo's Zarr hierarchy, we use the del statement.

del repo.root_group.atlantic_group["temperature_array"]

We can see that the array has been deleted.

list(repo.root_group.atlantic_group)
Output
[]

We can apply the same operation to a group:

del repo.root_group["atlantic_group"]
list(repo.root_group)
Output
[]

Note: You have to use square-bracket syntax for del operations to work. Writing del repo.root_group.atlantic_group would not work.
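
The reason is a general Python rule: del obj.name is routed to __delattr__, while del obj["name"] is routed to __delitem__. The toy class below sketches the situation, assuming the real group class resembles it in the relevant respect (attribute-style reads are resolved dynamically, and deletion is implemented for keys only). ToyGroup is not the zarr.Group implementation.

```python
class ToyGroup:
    """Toy stand-in for a group; not the zarr.Group implementation."""

    def __init__(self, **children):
        self._children = dict(children)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails: enables group.child
        children = self.__dict__.get("_children", {})
        try:
            return children[name]
        except KeyError:
            raise AttributeError(name) from None

    def __delitem__(self, name):
        # Called by `del group["child"]`
        del self._children[name]

    def __iter__(self):
        return iter(self._children)


group = ToyGroup(atlantic="some array")
print(group.atlantic)  # attribute-style read works

try:
    del group.atlantic  # routed to __delattr__, not __delitem__
except AttributeError as err:
    print("del group.atlantic failed:", type(err).__name__)

del group["atlantic"]  # routed to __delitem__
print(list(group))     # []
```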

Finally, we are going to clean up by deleting the entire repo we created for this tutorial.

client.delete_repo("earthmover/ocean", imsure=True, imreallysure=True)