Version Control with Arraylake

Arraylake carries over concepts from other version control software (e.g. Git) to multidimensional arrays. Doing so helps ease the burden of managing multiple versions of your data, and helps you be precise about which version of your dataset is being used for downstream purposes.

Core concepts of Arraylake's version control system are:

  1. A commit bundles together related data and metadata changes in a single "transaction".
  2. A branch points to the latest commit in a series of commits. Multiple branches can co-exist at a given time, and multiple users can add commits to a single branch. One common pattern is to use dev, stage, and prod branches to separate versions of a dataset.
  3. A tag is an immutable reference to a commit, usually used to represent an "important" version of the dataset such as a release.

Commits, branches, and tags all refer to specific versions of your dataset. You can time-travel back to any version of your data by passing a commit ID, a branch name, or a tag name to Repo.checkout.
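To make the three kinds of reference concrete, here is a minimal sketch in plain Python of how a checkout-style lookup could resolve a branch name, a tag name, or a commit ID to a single commit. It is a toy model, not the Arraylake implementation: the dictionaries, the resolve_ref function, and the lookup order are all assumptions made for illustration.

```python
# Toy model only: these dictionaries and resolve_ref are hypothetical,
# not part of the Arraylake API.
commits = {"c1": "first state", "c2": "second state"}
branches = {"main": "c2"}  # mutable: the pointer moves as commits are added
tags = {"v0": "c1"}        # immutable: always points at the same commit


def resolve_ref(ref: str) -> str:
    """Return the commit ID that a branch name, tag name, or commit ID refers to."""
    if ref in branches:
        return branches[ref]
    if ref in tags:
        return tags[ref]
    if ref in commits:
        return ref
    raise KeyError(f"unknown ref: {ref!r}")


resolved = [resolve_ref(ref) for ref in ("main", "v0", "c1")]
```

Whatever kind of name is passed in, the result is one commit ID, which is why all three can be handed to the same checkout call.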

tip

See here for more on commits and branches.

Setup

First, let's create a new repo for demonstration purposes.

import arraylake as al

client = al.Client()
repo = client.get_or_create_repo("earthmover-demos/vcs")
repo
Output
<arraylake.repo.Repo 'earthmover-demos/vcs'>

Committing

Concepts

  1. A commit is created by an Author, and links a set of data and metadata changes together as a single transaction with the data store.
  2. Each commit is associated with a unique auto-generated commit_id; the exact timestamp for when it was created; parent_commit, a pointer to the commit that came before; and a message.
  3. The commit_id is an immutable identifier for the state of the repo. Multiple clients can check out any commit and are guaranteed to always see the exact same data.
  4. Any changes you make to your dataset are not reflected remotely until a commit has been made.
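The structure described in this list can be sketched as a small linked record. The following is a toy model under stated assumptions, not Arraylake's data model: ToyCommit, make_commit, and log are hypothetical names, and the ID is generated with uuid purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional
import uuid


@dataclass(frozen=True)  # frozen: a commit never changes once created
class ToyCommit:
    commit_id: str
    message: str
    author: str
    commit_time: datetime
    parent_commit: Optional[str]  # None for the very first commit


def make_commit(store: Dict[str, ToyCommit], message: str, author: str,
                parent: Optional[str]) -> str:
    """Create a commit record, save it in the store, and return its ID."""
    commit = ToyCommit(uuid.uuid4().hex, message, author,
                       datetime.now(timezone.utc), parent)
    store[commit.commit_id] = commit
    return commit.commit_id


def log(store: Dict[str, ToyCommit], head: Optional[str]) -> List[str]:
    """Walk parent pointers from head back to the first commit."""
    messages = []
    while head is not None:
        commit = store[head]
        messages.append(commit.message)
        head = commit.parent_commit
    return messages


store: Dict[str, ToyCommit] = {}
first = make_commit(store, "I made my first change", "me", None)
second = make_commit(store, "I made my second change", "me", first)
history = log(store, second)
```

Following parent_commit pointers from any commit reproduces the history leading up to it, newest first.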

Create a commit

First make a change.

tip

For simplicity, this tutorial only modifies the attributes for the Zarr group.

repo.root_group.attrs["what"] = "a change"
repo.root_group.attrs.asdict()
Output
{'what': 'a change'}

Inspect any uncommitted changes by calling the Repo.status() method.

repo.status()

🧊 Using repo earthmover-demos/vcs
📌 Base commit is 6697354dd8e9064ad7e6c6d6
📟 Session df68a5f10793450281893b364b421860 started at 2024-07-17T03:07:04.293000
🌿 On branch main

paths modified in session

  • 📝 meta/root.group.json

Now create a new commit using Repo.commit(). This will return the commit ID that we will use later.

first_commit_id = repo.commit(message="I made my first change")
first_commit_id
Output
6697355ce00e05ed1cce08d2

Make a second change, and a second commit.

repo.root_group.attrs["what"] = "another change"
repo.commit(message="I made my second change")
Output
6697355f3f5775e74408ab47

Call repo.status() to see the updated base commit.

repo.status()

🧊 Using repo earthmover-demos/vcs
📌 Base commit is 6697355f3f5775e74408ab47
📟 Session 7d1b9f87a1404f0dbce9623c055901bc started at 2024-07-17T03:07:11.985000
🌿 On branch main

No changes in current session
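The two status reports above show changes being listed while they are uncommitted and cleared once a commit is made. A minimal sketch of that behavior, assuming a simple in-memory model (ToySession and its methods are hypothetical and say nothing about how Arraylake stores sessions):

```python
class ToySession:
    """Hypothetical stand-in for a session; not the Arraylake API."""

    def __init__(self):
        self.remote = {}   # committed state, visible to other clients
        self.pending = {}  # local changes that have not been committed yet

    def set_attr(self, key, value):
        self.pending[key] = value

    def status(self):
        """Return the keys modified in the current session."""
        return sorted(self.pending)

    def commit(self):
        """Publish all pending changes together, then start a clean session."""
        self.remote.update(self.pending)
        self.pending.clear()


session = ToySession()
session.set_attr("what", "a change")
status_before = session.status()      # the change is listed, remote is untouched
remote_before = dict(session.remote)
session.commit()
status_after = session.status()       # nothing left to commit
```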

List commits

View the commit history using Repo.commit_log. Notice the associated author and timestamp.

repo.commit_log
  • Commit ID: 6697355f3f5775e74408ab47
    Author: Emma Marshall <emma@earthmover.io>
    Date: 2024-07-17T03:09:40.340000

    I made my second change

  • Commit ID: 6697355ce00e05ed1cce08d2
    Author: Emma Marshall <emma@earthmover.io>
    Date: 2024-07-17T03:09:37.024000

    I made my first change

  • Commit ID: 6697354dd8e9064ad7e6c6d6
    Author: Emma Marshall <emma@earthmover.io>
    Date: 2024-07-17T03:09:21.979000

    I made my first change

  • Commit ID: 6632e95b26b204d68d6226ab
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:10.950000
    Tags: v1, v1_second_time

    update for v1

  • Commit ID: 6632e95726b204d68d6226a8
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:07.379000
    Tags: v0

    yet another change

  • Commit ID: 6632e95526b204d68d6226a7
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:05.419000

    I made my second change

  • Commit ID: 6632e95426b204d68d6226a6
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:04.572000

    I made my first change

tip

Arraylake commit IDs are not completely random as in some other version control systems (e.g. Git). Identify different commits using the last few characters, rather than the first few.
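To see why the trailing characters are the better handle, the sketch below compares how many leading versus trailing characters are needed to tell apart the three most recent commit IDs from the log above. The helper unique_length is a hypothetical name written for this illustration, not an Arraylake function.

```python
def unique_length(commit_ids, from_end=True):
    """Smallest n such that the last (or first) n characters of every ID are distinct."""
    ids = list(commit_ids)
    if not ids:
        return 0
    if len(set(ids)) != len(ids):
        raise ValueError("commit IDs must be distinct")
    longest = max(len(i) for i in ids)
    for n in range(1, longest + 1):
        # Take n characters from the chosen end of each ID
        pieces = {i[-n:] if from_end else i[:n] for i in ids}
        if len(pieces) == len(ids):
            return n
    return longest


ids = [
    "6697355f3f5775e74408ab47",
    "6697355ce00e05ed1cce08d2",
    "6697354dd8e9064ad7e6c6d6",
]
suffix_len = unique_length(ids, from_end=True)
prefix_len = unique_length(ids, from_end=False)
print(suffix_len, prefix_len)
```

For these three IDs a single trailing character is enough, while eight leading characters are needed because the IDs share a long common prefix.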

Time-travel to a commit

Time travel back to a specific commit with Repo.checkout.

Here's where we are currently:

repo.root_group.attrs.asdict()
Output
{'what': 'another change'}

Check out the first commit by passing the commit ID.

tip

You could also copy this ID from the commit log, or the web app.

repo.checkout(first_commit_id)
Output
/home/emmamarshall/Desktop/earthmover/arraylake/client/arraylake/asyn.py:135: UserWarning: You are not on a branch tip, so you can't commit changes.
result[0] = await coro
Output
6697355ce00e05ed1cce08d2

Voilà! We have traveled back in time.

repo.root_group.attrs.asdict()
Output
{'what': 'a change'}
warning

Note the warning about not being on a "branch tip", which the next section describes. For now, remember that you can travel back to any commit, but you cannot commit changes from there.
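When time-traveling only to inspect an old version, it can be convenient to return to a branch afterwards so that later commits are possible. The helper below is a small sketch that uses only the calls shown in this tutorial (Repo.checkout and root_group.attrs.asdict()). The name attrs_at and the FakeRepo stand-in are hypothetical; the stand-in exists only so the example runs without an Arraylake account.

```python
from types import SimpleNamespace


def attrs_at(repo, ref, return_to="main"):
    """Read root-group attributes at `ref`, then check out `return_to` again."""
    repo.checkout(ref)
    try:
        return repo.root_group.attrs.asdict()
    finally:
        # Always go back to a branch, even if reading the attributes fails
        repo.checkout(return_to)


class FakeRepo:
    """Stand-in that mimics the two calls used above; not the real Repo class."""

    def __init__(self, states):
        self._states = states
        self.checked_out = []

    def checkout(self, ref):
        self.checked_out.append(ref)
        attrs = SimpleNamespace(asdict=lambda: dict(self._states[ref]))
        self.root_group = SimpleNamespace(attrs=attrs)
        return ref


fake = FakeRepo({"first": {"what": "a change"}, "main": {"what": "another change"}})
result = attrs_at(fake, "first")
```

With a real repo, the checkout of a commit ID would also emit the "not on a branch tip" warning shown above; the helper does not suppress it.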

Branching

Branches are great for testing out updates for your data before you commit them to your main branch.

Concepts

  1. A branch is a mutable, or modifiable, reference to a commit_id. By default, all repos begin with a main branch.
  2. There is no specific Author associated with a branch.
  3. A "branch tip" refers to the most recent commit on a branch.
  4. You can only make changes to your data while on a branch tip. A new commit will update the branch pointer to point to that new commit.
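The rules in this list can be sketched with a toy in-memory model: a commit moves the pointer of the current branch, and committing is refused when a bare commit, rather than a branch, is checked out. ToyRepo is hypothetical and is not the Arraylake API; in particular, raising RuntimeError is an assumption chosen for the illustration.

```python
class ToyRepo:
    """Hypothetical in-memory stand-in for a repo; not the Arraylake API."""

    def __init__(self):
        self.parents = {"c0": None}     # commit ID -> parent commit ID
        self.branches = {"main": "c0"}  # branch name -> tip commit ID
        self.branch = "main"            # None when a bare commit is checked out
        self.head = "c0"

    def checkout(self, ref):
        if ref in self.branches:
            self.branch, self.head = ref, self.branches[ref]
        elif ref in self.parents:
            self.branch, self.head = None, ref
        else:
            raise KeyError(ref)
        return self.head

    def commit(self):
        if self.branch is None:
            raise RuntimeError("not on a branch tip, cannot commit")
        new_id = f"c{len(self.parents)}"
        self.parents[new_id] = self.head
        self.branches[self.branch] = new_id  # the branch pointer moves forward
        self.head = new_id
        return new_id


toy = ToyRepo()
first = toy.commit()
second = toy.commit()
toy.checkout(first)        # time-travel to an older commit
try:
    toy.commit()
    blocked = False
except RuntimeError:
    blocked = True
toy.checkout("main")       # back on the branch tip
third = toy.commit()
```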

Arraylake branches differ from GitHub branches in one key way: GitHub branches are designed to be merged with main, while Arraylake branches are not. In Arraylake, creating a branch from a reference point initializes an independent development path with a copy of the data at the reference state. On the new branch, you can safely prototype changes without affecting the branch you started from. New writes to a branch are saved separately from the state when the branch was created.

Below is a step-by-step walkthrough of creating branches, committing changes to different branches, switching between branches, and deleting branches.

Check out the main branch with Repo.checkout.

repo.checkout("main")
Output
6697355f3f5775e74408ab47

View the current branch, together with any uncommitted modifications, with Repo.status().

repo.status()

🧊 Using repo earthmover-demos/vcs
📌 Base commit is 6697355f3f5775e74408ab47
📟 Session b788b0c1619c4cad9129f81a277546c6 started at 2024-07-17T03:07:37.195000
🌿 On branch main

No changes in current session

Updating a branch

Making a commit also updates the branch pointer.

repo.root_group.attrs["what"] = "more change"
repo.commit("yet another change")
Output
6697357f3f5775e74408ab8d
repo.status()

🧊 Using repo earthmover-demos/vcs
📌 Base commit is 6697357f3f5775e74408ab8d
📟 Session ee3f4f96594f41a39fbe8f22eb72d60e started at 2024-07-17T03:07:44.233000
🌿 On branch main

No changes in current session

At this point, main is the only branch, and it points to the newest commit, 6697357f3f5775e74408ab8d.

Create a branch

Create a new branch with Repo.new_branch

repo.new_branch("stage")

List branches

View a list of branches with Repo.branches

repo.branches
Output
(Branch(id='main', commit_id=6697357f3f5775e74408ab8d),)
caution

Note that the new branch "stage" is not present in the list of branches!

Currently, new branches are created remotely only when a new commit is created.
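Because a new branch is not listed until its first commit, a script that depends on a branch may want to check for it by name. The sketch below assumes only that each entry of Repo.branches exposes an id attribute, as in the output above. The helper branch_exists and the namedtuple stand-in are hypothetical, used here so the example runs without a live repo.

```python
from collections import namedtuple

# Stand-in for the objects shown in the Repo.branches output above
Branch = namedtuple("Branch", ["id", "commit_id"])


def branch_exists(branches, name):
    """Return True if any branch in the iterable has the given name."""
    return any(branch.id == name for branch in branches)


# Before the first commit on "stage", only main is listed
before = (Branch("main", "6697357f3f5775e74408ab8d"),)
# After a commit on "stage", both branches are listed
after = before + (Branch("stage", "66973586d8e9064ad7e6c759"),)

print(branch_exists(before, "stage"), branch_exists(after, "stage"))
```

With a real repo, the first argument would be repo.branches.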

Make a new commit on the new branch

repo.root_group.attrs["updated_on"] = "stage"
repo.commit("Committed to 'stage'")
Output
66973586d8e9064ad7e6c759

Now check the list of branches again.

repo.branches
Output
(Branch(id='main', commit_id=6697357f3f5775e74408ab8d),
Branch(id='stage', commit_id=66973586d8e9064ad7e6c759))

View commit history for a branch with commit_log.

repo.commit_log
  • Commit ID: 66973586d8e9064ad7e6c759
    Author: Emma Marshall <emma@earthmover.io>
    Date: 2024-07-17T03:10:19.456000

    Committed to 'stage'

  • Commit ID: 6697357f3f5775e74408ab8d
    Author: Emma Marshall <emma@earthmover.io>
    Date: 2024-07-17T03:10:12.295000

    yet another change

  • Commit ID: 6697355f3f5775e74408ab47
    Author: Emma Marshall <emma@earthmover.io>
    Date: 2024-07-17T03:09:40.340000

    I made my second change

  • Commit ID: 6697355ce00e05ed1cce08d2
    Author: Emma Marshall <emma@earthmover.io>
    Date: 2024-07-17T03:09:37.024000

    I made my first change

  • Commit ID: 6697354dd8e9064ad7e6c6d6
    Author: Emma Marshall <emma@earthmover.io>
    Date: 2024-07-17T03:09:21.979000

    I made my first change

  • Commit ID: 6632e95b26b204d68d6226ab
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:10.950000
    Tags: v1, v1_second_time

    update for v1

  • Commit ID: 6632e95726b204d68d6226a8
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:07.379000
    Tags: v0

    yet another change

  • Commit ID: 6632e95526b204d68d6226a7
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:05.419000

    I made my second change

  • Commit ID: 6632e95426b204d68d6226a6
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:04.572000

    I made my first change

Now that the new branch has been committed, stage is one commit ahead of main. The root group attributes on stage include the new key:

repo.root_group.attrs.asdict()
Output
{'updated_on': 'stage', 'what': 'more change'}

Switch branches

Switch branches with Repo.checkout

repo.checkout("main")
Output
6697357f3f5775e74408ab8d
repo.root_group.attrs.asdict()
Output
{'what': 'more change'}

Let's check the commit log again. The commit made on stage does not appear in the history of main.

repo.commit_log
  • Commit ID: 6697357f3f5775e74408ab8d
    Author: Emma Marshall <emma@earthmover.io>
    Date: 2024-07-17T03:10:12.295000

    yet another change

  • Commit ID: 6697355f3f5775e74408ab47
    Author: Emma Marshall <emma@earthmover.io>
    Date: 2024-07-17T03:09:40.340000

    I made my second change

  • Commit ID: 6697355ce00e05ed1cce08d2
    Author: Emma Marshall <emma@earthmover.io>
    Date: 2024-07-17T03:09:37.024000

    I made my first change

  • Commit ID: 6697354dd8e9064ad7e6c6d6
    Author: Emma Marshall <emma@earthmover.io>
    Date: 2024-07-17T03:09:21.979000

    I made my first change

  • Commit ID: 6632e95b26b204d68d6226ab
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:10.950000
    Tags: v1, v1_second_time

    update for v1

  • Commit ID: 6632e95726b204d68d6226a8
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:07.379000
    Tags: v0

    yet another change

  • Commit ID: 6632e95526b204d68d6226a7
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:05.419000

    I made my second change

  • Commit ID: 6632e95426b204d68d6226a6
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:04.572000

    I made my first change
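The difference between the two logs can be computed from parent pointers alone. The sketch below is a toy model, not an Arraylake feature: the function names are hypothetical, the IDs are the last four characters of the commit IDs shown above, and the history is truncated at the first commit of this session for brevity.

```python
def history(parents, tip):
    """List commit IDs from tip back to the oldest commit, newest first."""
    out = []
    while tip is not None:
        out.append(tip)
        tip = parents[tip]
    return out


def commits_only_on(parents, tip, other_tip):
    """Commits reachable from tip but not from other_tip."""
    other = set(history(parents, other_tip))
    return [c for c in history(parents, tip) if c not in other]


# commit ID suffix -> parent suffix (truncated history, for illustration only)
parents = {
    "08d2": None,    # I made my first change
    "ab47": "08d2",  # I made my second change
    "ab8d": "ab47",  # yet another change (tip of main)
    "c759": "ab8d",  # Committed to 'stage' (tip of stage)
}

only_on_stage = commits_only_on(parents, "c759", "ab8d")
only_on_main = commits_only_on(parents, "ab8d", "c759")
```

Here stage contains one commit that main lacks, and main contains nothing that stage lacks, which matches the two commit logs.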

Delete a branch

Delete a branch using Repo.delete_branch().

tip

The main branch is special and cannot be deleted. Please let us know if you have a need for this feature.

repo.delete_branch("stage")

Confirm that the branch was deleted with Repo.branches.

repo.branches
Output
(Branch(id='main', commit_id=6697357f3f5775e74408ab8d),)

Tagging

Concepts

  1. Tags are created by an Author, and associated with a timestamp and a particular commit through a commit_id.
  2. Tags may be associated with an optional message.
  3. Tags are immutable.
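Immutability here means that a tag, once created, keeps pointing at the same commit. A minimal sketch of that idea uses a frozen dataclass from the Python standard library. ToyTag is hypothetical and only borrows field names that appear in the tag output in this tutorial; it is not Arraylake's Tag class.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class ToyTag:
    label: str
    commit_id: str
    message: Optional[str] = None  # the message is optional


tag = ToyTag("v0", "6632e95726b204d68d6226a8")

try:
    # Attempt to point the tag at a different commit
    tag.commit_id = "6632e95b26b204d68d6226ab"
    moved = True
except FrozenInstanceError:
    moved = False
```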

Create a tag

Create a new tag with Repo.tag. By default, this tags the latest commit on the current branch.

repo.tag("v0")
Output
Tag(id='v0', label='v0', created_at=datetime.datetime(2024, 5, 2, 1, 16, 10, 305000), commit=Commit(id=6632e95726b204d68d6226a8, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 6, 683000), parent_commit=6632e95526b204d68d6226a7, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 7, 379000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='yet another change'), message=None, author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95726b204d68d6226a8)

View the tag in the commit log

repo.commit_log
  • Commit ID: 6632e95726b204d68d6226a8
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:07.379000
    Tags: v0

    yet another change

  • Commit ID: 6632e95526b204d68d6226a7
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:05.419000

    I made my second change

  • Commit ID: 6632e95426b204d68d6226a6
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:04.572000

    I made my first change

List tags

View a list of tags with Repo.tags

repo.tags
Output
(Tag(id='v0', label='v0', created_at=datetime.datetime(2024, 5, 2, 1, 16, 10, 305000), commit=Commit(id=6632e95726b204d68d6226a8, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 6, 683000), parent_commit=6632e95526b204d68d6226a7, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 7, 379000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='yet another change'), message=None, author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95726b204d68d6226a8),)

Customizing tags

Tags can be customized with a "message". First, let's make a new commit and then tag it.

repo.root_group.attrs["what"] = "a change for version 1"
v1_commit = repo.commit("update for v1")
repo.tag("v1", message="I like this version. Tagging it as v1")
Output
Tag(id='v1', label='v1', created_at=datetime.datetime(2024, 5, 2, 1, 16, 11, 424000), commit=Commit(id=6632e95b26b204d68d6226ab, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 9, 720000), parent_commit=6632e95726b204d68d6226a8, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 10, 950000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='update for v1'), message='I like this version. Tagging it as v1', author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95b26b204d68d6226ab)

View all tags in the commit log

repo.commit_log
  • Commit ID: 6632e95b26b204d68d6226ab
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:10.950000
    Tags: v1

    update for v1

  • Commit ID: 6632e95726b204d68d6226a8
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:07.379000
    Tags: v0

    yet another change

  • Commit ID: 6632e95526b204d68d6226a7
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:05.419000

    I made my second change

  • Commit ID: 6632e95426b204d68d6226a6
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:04.572000

    I made my first change

Tag a specific commit by passing the commit_id

repo.tag("v1_second_time", commit_id=v1_commit, message="Oops, I did it again.")
Output
Tag(id='v1_second_time', label='v1_second_time', created_at=datetime.datetime(2024, 5, 2, 1, 16, 11, 688000), commit=Commit(id=6632e95b26b204d68d6226ab, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 9, 720000), parent_commit=6632e95726b204d68d6226a8, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 10, 950000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='update for v1'), message='Oops, I did it again.', author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95b26b204d68d6226ab)

The commit_log will list all tags associated with a commit

repo.commit_log
  • Commit ID: 6632e95b26b204d68d6226ab
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:10.950000
    Tags: v1, v1_second_time

    update for v1

  • Commit ID: 6632e95726b204d68d6226a8
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:07.379000
    Tags: v0

    yet another change

  • Commit ID: 6632e95526b204d68d6226a7
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:05.419000

    I made my second change

  • Commit ID: 6632e95426b204d68d6226a6
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:04.572000

    I made my first change

Time-travel to a tag

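Pass a tag name to Repo.checkout, for example repo.checkout("v0"). After checking out the v0 tag, the commit log ends at the tagged commit.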

repo.commit_log
  • Commit ID: 6632ac858bab9b29700d02b0
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-01T20:56:37.446000
    Tags: v0

    yet another change

  • Commit ID: 6632ac838bab9b29700d02af
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-01T20:56:35.517000

    I made my second change

  • Commit ID: 6632ac828bab9b29700d02ae
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-01T20:56:34.663000

    I made my first change

Let's get back to the main branch.

repo.checkout("main")
Output
6632ac898bab9b29700d02b3

Delete a tag

A tag name can only be used once in a repo, but a tag with the same name can exist in multiple repos of the same organization. Tags are immutable, so they cannot be edited to point to a different commit, but they can be deleted with Repo.delete_tag.

repo.delete_tag('v1')

Verify that the v1 tag is deleted by checking Repo.commit_log.

repo.commit_log
  • Commit ID: 6632ac898bab9b29700d02b3
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-01T20:56:41.156000
    Tags: v1_second_time

    update for v1

  • Commit ID: 6632ac858bab9b29700d02b0
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-01T20:56:37.446000
    Tags: v0

    yet another change

  • Commit ID: 6632ac838bab9b29700d02af
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-01T20:56:35.517000

    I made my second change

  • Commit ID: 6632ac828bab9b29700d02ae
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-01T20:56:34.663000

    I made my first change

The v1 tag is also gone from Repo.tags:

repo.tags
Output
(Tag(id='v0', label='v0', created_at=datetime.datetime(2024, 5, 1, 20, 56, 40, 502000), commit=Commit(id=6632ac858bab9b29700d02b0, session_start_time=datetime.datetime(2024, 5, 1, 20, 56, 36, 721000), parent_commit=6632ac838bab9b29700d02af, commit_time=datetime.datetime(2024, 5, 1, 20, 56, 37, 446000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='yet another change'), message=None, author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632ac858bab9b29700d02b0),
Tag(id='v1_second_time', label='v1_second_time', created_at=datetime.datetime(2024, 5, 1, 20, 56, 42, 113000), commit=Commit(id=6632ac898bab9b29700d02b3, session_start_time=datetime.datetime(2024, 5, 1, 20, 56, 39, 725000), parent_commit=6632ac858bab9b29700d02b0, commit_time=datetime.datetime(2024, 5, 1, 20, 56, 41, 156000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='update for v1'), message='Oops, I did it again.', author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632ac898bab9b29700d02b3))
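The naming rules for tags described in this section (a name is used at most once per repo, a tag cannot be re-pointed, but it can be deleted) can be sketched with a small registry. This is a toy model under stated assumptions: ToyTagRegistry and its methods are hypothetical, and the ValueError on a duplicate name is an illustration rather than documented Arraylake behavior.

```python
class ToyTagRegistry:
    """Hypothetical per-repo tag store; not the Arraylake API."""

    def __init__(self):
        self._tags = {}  # tag name -> commit ID

    def tag(self, name, commit_id):
        if name in self._tags:
            # A name can be used only once, and existing tags are never re-pointed
            raise ValueError(f"tag {name!r} already exists")
        self._tags[name] = commit_id

    def delete_tag(self, name):
        del self._tags[name]

    def names(self):
        return sorted(self._tags)


registry = ToyTagRegistry()
registry.tag("v0", "6632ac858bab9b29700d02b0")
registry.tag("v1", "6632ac898bab9b29700d02b3")
registry.tag("v1_second_time", "6632ac898bab9b29700d02b3")  # two tags, one commit

try:
    registry.tag("v1", "6632ac858bab9b29700d02b0")
    duplicate_allowed = True
except ValueError:
    duplicate_allowed = False

registry.delete_tag("v1")
remaining = registry.names()
```

After the deletion, the remaining names match the two tags listed in the Repo.tags output above.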