Manage Zarr Data
Once you have created a repo, you are ready to manage Zarr data. Our goal on this page is to learn how to:
- Check out a specific snapshot of our repository
- Create Zarr Groups and Arrays
- Write Zarr attributes (metadata)
- Write Array data
- Commit changes
- Roll back to previous versions
If you're completely new to Zarr, you might want to first check out the Zarr Tutorial before diving in.
We start by importing arraylake.
from arraylake import Client
Connect to Repo
The name of our repo will be earthmover/ocean. Here earthmover represents the name of our org; you should replace this with the name of your own org.
client = Client()
repo = client.create_repo("earthmover/ocean")
repo
<arraylake.repo.Repo 'earthmover/ocean'>
We can inspect the status of our repo as follows
repo.status()
🧊 Using repo earthmover/ocean
📟 Session 5ea5544b8a7d4809a0bbf00496602e12 started at 2024-02-26T15:19:56.512000
🌿 On branch main
No changes in current session
Open the Root Group
The repo object exposes a valid Zarr V3 Store. We can access the root group as follows:
repo.root_group
<zarr.hierarchy.Group '/'>
We'll explore Arraylake's version control system by adding some metadata to this group.
By default, a newly created store has no metadata (Zarr attributes) associated with the root group.
dict(repo.root_group.attrs)
{}
Let's create some now.
repo.root_group.attrs["title"] = "Ocean Data Repository 🐳"
dict(repo.root_group.attrs)
{'title': 'Ocean Data Repository 🐳'}
Zarr attributes are completely arbitrary. 🤷‍♀️ You can use whatever key / value pairs make sense for you.
Arraylake knows we have made changes. We can see the changes as follows:
repo.status()
🧊 Using repo earthmover/ocean
📟 Session 5ea5544b8a7d4809a0bbf00496602e12 started at 2024-02-26T15:19:56.512000
🌿 On branch main
paths modified in session
- 📝 meta/root.group.json
For now, our changes are in an uncommitted state.
Commit Changes
The changes we made are visible to us. But they can't be seen by anyone else using the repo...until we commit! A commit creates a snapshot of the repository state that can be seen by everyone.
We create a commit like this:
cid = repo.commit("My first commit 🥹")
cid
65dcad36fb58969e1179ef90
We can see the commit log for our repo like this:
repo.commit_log
Commit ID 65dcad36fb58969e1179ef90 Author Ryan Abernathey <ryan@earthmover.io> Date 2024-02-26T15:24:38.616000 My first commit 🥹
Now let's make another commit to update the metadata with a description.
repo.root_group.attrs["description"] = "Data about the ocean."
repo.commit("Added description field")
65dcad3afb58969e1179ef91
Check out an Earlier Commit
If we want to roll back the state of our repo, we can check out an earlier commit. We get the commit ID by looking at the commit log.
repo.commit_log
Commit ID 65dcad3afb58969e1179ef91 Author Ryan Abernathey <ryan@earthmover.io> Date 2024-02-26T15:24:42.007000 Added description field
Commit ID 65dcad36fb58969e1179ef90 Author Ryan Abernathey <ryan@earthmover.io> Date 2024-02-26T15:24:38.616000 My first commit 🥹
repo.checkout(cid)
/Users/rabernat/mambaforge/envs/arraylake-local/lib/python3.11/site-packages/arraylake/repo.py:377: UserWarning: You are not on a branch tip, so you can't commit changes.
warnings.warn("You are not on a branch tip, so you can't commit changes.")
65dcad36fb58969e1179ef90
The warning tells us that we can't commit any changes while in this state.
If we look at the attributes, we can see the first change (title) but not the second (description):
dict(repo.root_group.attrs)
{'title': 'Ocean Data Repository 🐳'}
After checking out the main branch again, which points at the latest commit, we can see all the metadata.
repo.checkout("main")
dict(repo.root_group.attrs)
{'description': 'Data about the ocean.', 'title': 'Ocean Data Repository 🐳'}
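To make the commit and checkout behavior above concrete, here is a toy in-memory model in plain Python. It is only a sketch of the idea (each commit stores a snapshot, and checkout restores one); the ToyRepo class and its commit-N ID scheme are hypothetical and say nothing about how Arraylake is implemented.

```python
import copy


class ToyRepo:
    """Toy in-memory model of commit and checkout; not Arraylake's implementation."""

    def __init__(self):
        self.attrs = {}        # working state of the current session
        self._snapshots = {}   # commit ID -> frozen copy of attrs
        self._log = []         # (commit ID, message) pairs, oldest first

    def commit(self, message):
        cid = f"commit-{len(self._log) + 1}"   # hypothetical ID scheme
        self._snapshots[cid] = copy.deepcopy(self.attrs)
        self._log.append((cid, message))
        return cid

    def checkout(self, cid):
        # Replace the working state with a copy of the stored snapshot.
        self.attrs = copy.deepcopy(self._snapshots[cid])
        return cid


toy = ToyRepo()
toy.attrs["title"] = "Ocean Data Repository"
first = toy.commit("My first commit")
toy.attrs["description"] = "Data about the ocean."
second = toy.commit("Added description field")

toy.checkout(first)
print(toy.attrs)    # only the title is present
toy.checkout(second)
print(toy.attrs)    # title and description are both present
```

Because every snapshot is stored separately, checking out an earlier commit never destroys later ones, which is why we could return to the latest state afterwards.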
Create Arrays and Groups
We can create whatever Zarr groups and arrays we want in our repo. In this example, we create a sub-group and create an array within it.
import numpy as np
atlantic_group = repo.root_group.create_group("atlantic")
atlantic_group.attrs["title"] = "Atlantic Ocean"
temperature_array = atlantic_group.create(
"temperature",
shape=100, chunks=10, dtype="f4", fill_value=np.nan
)
temperature_array.attrs["name"] = "Atlantic Ocean Temperature"
temperature_array.attrs["units"] = "degrees Celsius"
repo.tree().__rich__()
/
└── 📁 atlantic
└── 🇦 temperature (100,) float32
We didn't write any data to our array yet, so accessing elements just returns the fill value.
temperature_array[0]
nan
Let's assign some data to part of the array.
temperature_array[:50] = 10.0
Now let's see what changes we have made.
repo.status()
🧊 Using repo earthmover/ocean
📟 Session 7bebc5626d7648adb54ebe35963d84a3 started at 2024-02-26T15:24:49.985000
🌿 On branch main
paths modified in session
- 📝 data/root/atlantic/temperature/c1
- 📝 data/root/atlantic/temperature
- 📝 meta/root/atlantic.group.json
- 📝 data/root/atlantic/temperature/c0
- 📝 data/root/atlantic/temperature/c2
- 📝 meta/root/atlantic/temperature.array.json
- 📝 data/root/atlantic/temperature/c4
- 📝 data/root/atlantic/temperature/c3
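The data chunk paths in this listing follow from the array's chunking. As a rough sketch, the plain-Python helper below computes which 1-D chunks a slice write overlaps, using the c<index> naming seen in the status output. The function name chunk_keys_for_slice is hypothetical and is not part of Zarr or Arraylake.

```python
def chunk_keys_for_slice(start, stop, chunk_len):
    """Return the keys of the 1-D chunks overlapped by the half-open range [start, stop)."""
    if stop <= start:
        return []
    first = start // chunk_len          # chunk holding the first written element
    last = (stop - 1) // chunk_len      # chunk holding the last written element
    return [f"c{i}" for i in range(first, last + 1)]


# temperature_array[:50] = 10.0 on an array with chunks of length 10
touched = chunk_keys_for_slice(0, 50, chunk_len=10)
print(touched)   # ['c0', 'c1', 'c2', 'c3', 'c4']
```

This matches the five data chunk paths listed above, which the status output shows unordered.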
We're ready to commit our changes!
repo.commit("Created a group and array")
65dcad55fb58969e1179ef98
Examine Array Info
The .info property on arrays shows a lot of useful details.
temperature_array.info
Name | /atlantic/temperature |
---|---|
Type | zarr.core.Array |
Data type | float32 |
Shape | (100,) |
Chunk shape | (10,) |
Order | C |
Read-only | False |
Compressor | Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0) |
Store type | arraylake.repo.ArraylakeStore |
No. bytes | 400 |
No. bytes stored | 56 |
Storage ratio | 7.1 |
Chunks initialized | 5/10 |
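Several of these numbers can be checked by hand. The sketch below reproduces them from the array's shape, chunk length, and dtype, assuming that the storage ratio is the uncompressed size divided by the stored size (which is consistent with the values shown). The stored size of 56 bytes is copied from the output above rather than computed.

```python
import math

shape, chunk_len, itemsize = 100, 10, 4   # float32 uses 4 bytes per element
written_elements = 50                      # temperature_array[:50] = 10.0
bytes_stored = 56                          # copied from the .info output

n_bytes = shape * itemsize                                      # uncompressed size
n_chunks = math.ceil(shape / chunk_len)                         # chunks in the full array
chunks_initialized = math.ceil(written_elements / chunk_len)    # chunks that hold data
storage_ratio = n_bytes / bytes_stored                          # assumed definition

print(n_bytes, f"{chunks_initialized}/{n_chunks}", round(storage_ratio, 1))
# 400 5/10 7.1
```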
Extend an Array
Zarr supports resizing and appending to arrays. Let's resize that temperature array to be twice as long.
The appending process consists of two steps:
- resize the array
- write new data to the newly initialized region
Thanks to Arraylake's transaction system, both these steps can be done as part of a single ACID transaction. Other readers of the data will never see the array in an intermediate state (e.g. resized but not yet populated with data), as they would with file storage or cloud object storage.
Step 1 - Resize the array:
repo.root_group.atlantic.temperature.resize(200)
repo.root_group.atlantic.temperature
<zarr.core.Array '/atlantic/temperature' (200,) float32>
Step 2 - Write new data to the new region of the array:
repo.root_group.atlantic.temperature[100:] = 99
Let's look at the repo status to see what was changed:
repo.status()
🧊 Using repo earthmover/ocean
📟 Session 187e5955926042719616fe51cd50a0f1 started at 2024-02-26T15:25:10.017000
🌿 On branch main
paths modified in session
- 📝 data/root/atlantic/temperature/c15
- 📝 data/root/atlantic/temperature/c11
- 📝 data/root/atlantic/temperature/c18
- 📝 data/root/atlantic/temperature/c19
- 📝 data/root/atlantic/temperature/c17
- 📝 data/root/atlantic/temperature/c12
- 📝 meta/root/atlantic/temperature.array.json
- 📝 data/root/atlantic/temperature/c10
- 📝 data/root/atlantic/temperature/c13
- 📝 data/root/atlantic/temperature/c16
- 📝 data/root/atlantic/temperature/c14
As we can see, we modified a metadata document (as a result of resizing the array) and wrote new chunks.
Now that our change is complete, let's commit.
repo.commit("extended temperature array")
65dcad5efb58969e1179ef99
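The resize-then-write pattern used in this section can be wrapped in a small helper. The sketch below assumes a 1-D array object that exposes shape, resize, and slice assignment, as the Zarr array in this tutorial does. Both append_1d and the ListBackedArray stand-in are hypothetical names introduced here so the example runs without a repo; neither is part of Zarr or Arraylake.

```python
def append_1d(array, values):
    """Grow a 1-D Zarr-style array by len(values) and write values into the new region.

    Returns the (old_length, new_length) pair.
    """
    old_len = array.shape[0]
    new_len = old_len + len(values)
    array.resize(new_len)               # step 1: resize
    array[old_len:new_len] = values     # step 2: fill the newly created region
    return old_len, new_len


class ListBackedArray:
    """Hypothetical stand-in so the sketch runs without a repo; not a Zarr class."""

    def __init__(self, length, fill_value=float("nan")):
        self._data = [fill_value] * length
        self._fill_value = fill_value

    @property
    def shape(self):
        return (len(self._data),)

    def resize(self, new_len):
        # Truncate when shrinking, pad with the fill value when growing.
        grow = max(0, new_len - len(self._data))
        self._data = self._data[:new_len] + [self._fill_value] * grow

    def __setitem__(self, key, values):
        self._data[key] = values

    def __getitem__(self, key):
        return self._data[key]


demo = ListBackedArray(100)
old_len, new_len = append_1d(demo, [99.0] * 100)
print(old_len, new_len, demo.shape)   # 100 200 (200,)
```

With the tutorial's array, the equivalent call would be append_1d(repo.root_group.atlantic.temperature, np.full(100, 99.0)), followed by a repo.commit(...) so that other readers see both steps at once.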
Move an Array or Group
The interface for moving arrays and groups is the zarr.Group.move method, which Arraylake makes fast and efficient. In standard Zarr, this operation is costly because every chunk in the moved hierarchy is copied from the source to the destination. For example, imagine moving a group on S3 that contains two arrays with a million chunks each. This operation would require reading and writing two million chunks, from and to S3, which could incur significant time and transfer costs.
Arraylake makes this operation very efficient: no chunks or metadata are copied. As a consequence, the operation is fast, cheap, and easily reversible through Arraylake's full versioning system.
Let's try it:
atlantic = repo.root_group.atlantic
atlantic.move("temperature", "temperature_array")
repo.root_group.move("atlantic", "atlantic_group")
This operation will be very quick, regardless of the data volume being moved.
Thanks to Arraylake's version control system, nobody outside of our session can see that change yet - to make the move visible to other users of the repository, we need to commit it:
repo.status()
🧊 Using repo earthmover/ocean
📟 Session cbf6c26eba3347859de4c60242467836 started at 2024-02-26T15:25:19.119000
🌿 On branch main
paths modified in session
- 📝 meta/root/atlantic_group/temperature_array.array.json
- ❌ meta/root/atlantic/temperature.array.json
- 📝 meta/root/atlantic_group.group.json
- 📝 data/root/atlantic_group/temperature_array
- ❌ data/root/atlantic/temperature
- ❌ meta/root/atlantic.group.json
repo.commit("Moved group and array")
65dcad63fb58969e1179efa2
zarr.Group.move is a tool for organizing your repository hierarchy. As in many other systems, behavior is undefined for concurrent moves within the same session. We recommend you review your changes using tree() or other listing operations after moving. As with other operations in Arraylake, if a change turns out to be undesirable, it's trivial to discard your session and start afresh.
Delete an Array or Group
If we want to remove objects from our repo's Zarr hierarchy, we use the del statement.
del repo.root_group.atlantic_group["temperature_array"]
We can see that the array has been deleted.
list(repo.root_group.atlantic_group)
[]
We can apply the same operation to a group:
del repo.root_group["atlantic_group"]
list(repo.root_group)
[]
You have to use square bracket syntax for del operations to work. It would not work to say del repo.root_group.atlantic_group.
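One plausible explanation for this restriction can be shown with a toy class. In Python, del obj[key] calls __delitem__, while del obj.name calls __delattr__. If a container offers dot access to its children through __getattr__ but implements deletion only through __delitem__, then attribute-style deletion fails. ToyGroup below is a hypothetical stand-in; whether Zarr's Group works exactly this way is an assumption here.

```python
class ToyGroup:
    """Toy container; a sketch of the access pattern, not Zarr's Group class."""

    def __init__(self):
        self._children = {}

    def __getitem__(self, name):
        return self._children[name]

    def __setitem__(self, name, value):
        self._children[name] = value

    def __delitem__(self, name):
        del self._children[name]

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails: fall back to children.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._children[name]
        except KeyError:
            raise AttributeError(name) from None


group = ToyGroup()
group["atlantic"] = "a child"
assert group.atlantic == "a child"   # dot access works for reading

try:
    del group.atlantic               # uses __delattr__, which knows nothing about children
    deleted_via_attribute = True
except AttributeError:
    deleted_via_attribute = False

del group["atlantic"]                # uses __delitem__, which removes the child
print(deleted_via_attribute, "atlantic" in group._children)   # False False
```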
Finally, we are going to clean up by deleting the entire repo we created for this tutorial.
client.delete_repo("earthmover/ocean", imsure=True, imreallysure=True)