
Version Control with Arraylake

Arraylake carries over concepts from other version control software (e.g. Git) to multidimensional arrays. Doing so helps ease the burden of managing multiple versions of your data, and helps you be precise about which version of your dataset is being used for downstream purposes.

Core concepts of Arraylake's version control system are:

  1. A commit bundles together related data and metadata changes in a single "transaction".
  2. A branch points to the latest commit in a series of commits. Multiple branches can co-exist at a given time, and multiple users can add commits to a single branch. One common pattern is to use dev, stage, and prod branches to separate versions of a dataset.
  3. A tag is an immutable reference to a commit, usually used to represent an "important" version of the dataset such as a release.

Commits, branches, and tags all refer to specific versions of your dataset. You can time-travel back to any version of your data referenced by a commit, a branch, or a tag by passing a commit ID, a branch name, or a tag name to Repo.checkout.

tip

See here for more on commits and branches.

Setup

First, let's create a new repo for demonstration purposes.

import arraylake as al

client = al.Client()
repo = client.create_repo("earthmover-demos/vcs")
repo
Output
<arraylake.repo.Repo 'earthmover-demos/vcs'>

Committing

Concepts

  1. A commit is created by an Author, and links a set of data and metadata changes together as a single transaction with the data store.
  2. Each commit is associated with a unique auto-generated commit_id; the exact timestamp for when it was created; parent_commit, a pointer to the commit that came before; and a message.
  3. The commit_id is an immutable identifier for the state of the repo. Multiple clients can check out any commit and are guaranteed to always see the exact same data.
  4. Any changes you make to your dataset are not reflected remotely until a commit has been made.
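The commit fields listed above can be pictured with a small in-memory toy. This is only a sketch to illustrate the concepts: `ToyCommit`, `make_commit`, and `history` are hypothetical names, not part of the Arraylake API, and the random IDs here do not mimic Arraylake's ID scheme.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)  # frozen: a commit cannot be modified once created
class ToyCommit:
    commit_id: str
    created_at: datetime
    parent_commit: Optional["ToyCommit"]  # None for the very first commit
    message: str
    author: str


def make_commit(parent, message, author):
    """Create a new commit that points back at `parent`."""
    return ToyCommit(
        commit_id=uuid.uuid4().hex,  # toy ID only, assumed for illustration
        created_at=datetime.now(timezone.utc),
        parent_commit=parent,
        message=message,
        author=author,
    )


def history(tip):
    """Follow parent pointers from `tip` back to the first commit, newest first."""
    log = []
    node = tip
    while node is not None:
        log.append(node)
        node = node.parent_commit
    return log


first = make_commit(None, "I made my first change", "A. Author")
second = make_commit(first, "I made my second change", "A. Author")
print([c.message for c in history(second)])
```

Because each commit stores a pointer to its parent, the full history can be recovered from the most recent commit alone.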

Create a commit

First make a change.

tip

For simplicity, this tutorial only modifies the attributes for the Zarr group.

repo.root_group.attrs["what"] = "a change"
repo.root_group.attrs.asdict()
Output
{'what': 'a change'}

Inspect any uncommitted changes by calling the Repo.status() method

repo.status()

🧊 Using repo earthmover-demos/vcs
📟 Session f53e861dc3974f548662e575e1cf20d7 started at 2024-05-02T01:16:03.245000
🌿 On branch main

paths modified in session

  • 📝 meta/root.group.json

Now create a new commit using Repo.commit(). This will return the commit ID that we will use later.

first_commit_id = repo.commit(message="I made my first change")
first_commit_id
Output
6632e95426b204d68d6226a6

Make a second change, and a second commit.

repo.root_group.attrs["what"] = "another change"
repo.commit(message="I made my second change")
Output
6632e95526b204d68d6226a7

List commits

View the commit history using Repo.commit_log. Notice the associated author and timestamp.

repo.commit_log
  • Commit ID: 6632e95526b204d68d6226a7
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:05.419000

    I made my second change

  • Commit ID: 6632e95426b204d68d6226a6
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:04.572000

    I made my first change

tip

Arraylake commit IDs are not completely random as in some other version control systems (e.g. Git). Identify different commits using the last few characters, rather than the first few.
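To see why the tail of the ID is the more useful part, the sketch below computes how many characters are needed to tell a set of IDs apart, counting either from the end or from the start. `shortest_unique_length` is a hypothetical helper written for this tutorial; the IDs are the ones that appear in this walkthrough.

```python
def shortest_unique_length(commit_ids, from_end=True):
    """Smallest n such that the last (or first) n characters of every ID differ."""
    ids = [str(commit_id) for commit_id in commit_ids]
    if len(set(ids)) != len(ids):
        raise ValueError("commit IDs must be distinct")
    if not ids:
        return 0
    longest = max(len(i) for i in ids)
    for n in range(1, longest + 1):
        pieces = {(i[-n:] if from_end else i[:n]) for i in ids}
        if len(pieces) == len(ids):
            return n
    return longest


tutorial_ids = [
    "6632e95426b204d68d6226a6",
    "6632e95526b204d68d6226a7",
    "6632e95726b204d68d6226a8",
    "6632e95826b204d68d6226a9",
    "6632e95b26b204d68d6226ab",
]
print(shortest_unique_length(tutorial_ids, from_end=True))   # 1
print(shortest_unique_length(tutorial_ids, from_end=False))  # 8
```

For these five IDs a single trailing character is enough, whereas eight leading characters are needed.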

Time-travel to a commit

Time travel back to a specific commit with Repo.checkout.

Here's where we are currently

repo.root_group.attrs.asdict()
Output
{'what': 'another change'}

Check out the first commit by passing the commit ID.

tip

You could also copy this ID from the commit log, or the web app.

repo.checkout(first_commit_id)
Output
/Users/deepak/repos/arraylake/client/arraylake/repo.py:426: UserWarning: You are not on a branch tip, so you can't commit changes.
warnings.warn("You are not on a branch tip, so you can't commit changes.")
Output
6632e95426b204d68d6226a6

Voilà! We have traveled back in time.

repo.root_group.attrs.asdict()
Output
{'what': 'a change'}

warning

Note the warning about being on a "branch tip". We will describe this next. For now, remember that while you can travel back to any commit, you cannot commit any changes from there.
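If a script needs to know whether a checkout left it off a branch tip, one option is to capture the warning shown above. The following is a minimal sketch, assuming only what the output above shows: that `repo.checkout` returns the commit ID and emits a `UserWarning` mentioning "branch tip". `checkout_and_report` is a hypothetical helper, and matching on warning text is brittle if the message changes between versions.

```python
import warnings


def checkout_and_report(repo, ref):
    """Check out `ref` and return (commit_id, on_branch_tip).

    `repo` is assumed to be a Repo like the one created earlier in this
    tutorial; only its `checkout` method is used.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure the warning is recorded
        commit_id = repo.checkout(ref)
    off_tip = any(
        issubclass(w.category, UserWarning) and "branch tip" in str(w.message)
        for w in caught
    )
    return commit_id, not off_tip


# Example usage (requires a real repo, so it is left commented out):
# commit_id, can_commit = checkout_and_report(repo, first_commit_id)
```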

Branching

Concepts

  1. A branch is a mutable, or modifiable, reference to a commit_id. By default, all repos begin with a main branch.
  2. There is no specific Author associated with a branch.
  3. A "branch tip" refers to the most recent commit on a branch.
  4. You can only make changes to your data while on a branch tip. A new commit will update the branch pointer to point to that new commit.
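The branch rules above can be illustrated with a toy in-memory model. `ToyRepo` is hypothetical and is not how Arraylake is implemented: it ignores data, authors, and the local/remote distinction, and it treats every checkout by commit ID as read-only. It only shows a branch as a movable pointer that advances on each commit.

```python
import itertools


class ToyRepo:
    """Toy model: a branch is just a name that points at a commit ID."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.parents = {}                 # commit_id -> parent commit_id (or None)
        self.messages = {}                # commit_id -> message
        self.branches = {"main": None}    # branch name -> tip commit_id
        self.current_branch = "main"      # None when a plain commit is checked out
        self.head = None

    def commit(self, message):
        if self.current_branch is None:
            raise RuntimeError("not on a branch tip, so changes cannot be committed")
        new_id = f"c{next(self._counter)}"
        self.parents[new_id] = self.head
        self.messages[new_id] = message
        self.branches[self.current_branch] = new_id  # move the branch pointer
        self.head = new_id
        return new_id

    def checkout(self, ref):
        if ref in self.branches:
            self.current_branch = ref
            self.head = self.branches[ref]
        elif ref in self.parents:
            self.current_branch = None  # read-only view of an older commit
            self.head = ref
        else:
            raise KeyError(ref)
        return self.head


toy = ToyRepo()
c1 = toy.commit("first")
c2 = toy.commit("second")
toy.checkout(c1)               # time travel: reading is fine
try:
    toy.commit("not allowed")  # committing away from a branch tip fails
except RuntimeError as err:
    print(err)
toy.checkout("main")
c3 = toy.commit("third")
print(toy.branches)
```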

Check out the main branch with Repo.checkout

repo.checkout("main")
Output
6632e95526b204d68d6226a7

Repo.status shows the current branch along with any uncommitted modifications.

repo.status()

🧊 Using repo earthmover-demos/vcs
📟 Session dfca330687e54443a13b1949c0a12c97 started at 2024-05-02T01:16:06.683000
🌿 On branch main

No changes in current session

Updating a branch

Making a commit will update the branch pointer too

repo.root_group.attrs["what"] = "more change"
repo.commit("yet another change")
Output
6632e95726b204d68d6226a8

repo.status()

🧊 Using repo earthmover-demos/vcs
📟 Session 67f4d61863dc4d6dbb2259233ad9ed0c started at 2024-05-02T01:16:07.677000
🌿 On branch main

No changes in current session

Create a branch

Create a new branch with Repo.new_branch

repo.new_branch("stage")

List branches

View a list of branches with Repo.branches

repo.branches
Output
(Branch(id='main', commit_id=6632e95726b204d68d6226a8),)

caution

Note that the new branch "stage" is not present in the list of branches!

Currently, new branches are created remotely only when a new commit is created.
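A script that depends on a branch being visible to other users can check the branch list first. The sketch below relies only on what the output above shows, namely that `repo.branches` yields objects with an `id` attribute; `listed_branch_names` and `branch_is_listed` are hypothetical helper names.

```python
def listed_branch_names(repo):
    """Names of the branches currently reported by repo.branches."""
    return [branch.id for branch in repo.branches]


def branch_is_listed(repo, name):
    """True once `name` shows up in repo.branches (i.e. after its first commit)."""
    return name in listed_branch_names(repo)


# Example usage (requires a real repo, so it is left commented out):
# repo.new_branch("stage")
# branch_is_listed(repo, "stage")   # expected to be False until a commit is made
```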

Make a new commit on the new branch

repo.root_group.attrs["updated_on"] = "stage"
repo.commit("Committed to 'stage'")
Output
6632e95826b204d68d6226a9

Now check the list of branches again.

repo.branches
Output
(Branch(id='main', commit_id=6632e95726b204d68d6226a8),
Branch(id='stage', commit_id=6632e95826b204d68d6226a9))

View commit history for a branch with commit_log.

repo.commit_log
  • Commit ID: 6632e95826b204d68d6226a9
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:08.607000

    Committed to 'stage'

  • Commit ID: 6632e95726b204d68d6226a8
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:07.379000

    yet another change

  • Commit ID: 6632e95526b204d68d6226a7
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:05.419000

    I made my second change

  • Commit ID: 6632e95426b204d68d6226a6
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:04.572000

    I made my first change

repo.root_group.attrs.asdict()
Output
{'updated_on': 'stage', 'what': 'more change'}

Switch branches

Switch branches with Repo.checkout

repo.checkout("main")
Output
6632e95726b204d68d6226a8

repo.root_group.attrs.asdict()
Output
{'what': 'more change'}

Let's check the commit log again

repo.commit_log
  • Commit ID: 6632e95726b204d68d6226a8
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:07.379000

    yet another change

  • Commit ID: 6632e95526b204d68d6226a7
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:05.419000

    I made my second change

  • Commit ID: 6632e95426b204d68d6226a6
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:04.572000

    I made my first change

Delete a branch

Delete a branch using Repo.delete_branch().

tip

The main branch is special and cannot be deleted. Please let us know if you have a need for this feature.

repo.delete_branch("stage")

Confirm that branch was deleted with Repo.branches

repo.branches
Output
(Branch(id='main', commit_id=6632e95726b204d68d6226a8),)

Tagging

Concepts

  1. Tags are created by an Author, and associated with a timestamp and a particular commit through a commit_id.
  2. Tags may be associated with an optional message.
  3. Tags are immutable.

Create a tag

Create a new tag with Repo.tag. By default, this tags the latest commit on the current branch.

repo.tag("v0")
Output
Tag(id='v0', label='v0', created_at=datetime.datetime(2024, 5, 2, 1, 16, 10, 305000), commit=Commit(id=6632e95726b204d68d6226a8, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 6, 683000), parent_commit=6632e95526b204d68d6226a7, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 7, 379000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='yet another change'), message=None, author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95726b204d68d6226a8)

View the tag in the commit log

repo.commit_log
  • Commit ID: 6632e95726b204d68d6226a8
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:07.379000
    Tags: v0

    yet another change

  • Commit ID: 6632e95526b204d68d6226a7
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:05.419000

    I made my second change

  • Commit ID: 6632e95426b204d68d6226a6
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:04.572000

    I made my first change

List tags

View a list of tags with Repo.tags

repo.tags
Output
(Tag(id='v0', label='v0', created_at=datetime.datetime(2024, 5, 2, 1, 16, 10, 305000), commit=Commit(id=6632e95726b204d68d6226a8, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 6, 683000), parent_commit=6632e95526b204d68d6226a7, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 7, 379000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='yet another change'), message=None, author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95726b204d68d6226a8),)

Customizing tags

Tags can be customized with a "message". First, let's make a new commit and then tag it.

repo.root_group.attrs["what"] = "a change for version 1"
v1_commit = repo.commit("update for v1")
repo.tag("v1", message="I like this version. Tagging it as v1")
Output
Tag(id='v1', label='v1', created_at=datetime.datetime(2024, 5, 2, 1, 16, 11, 424000), commit=Commit(id=6632e95b26b204d68d6226ab, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 9, 720000), parent_commit=6632e95726b204d68d6226a8, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 10, 950000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='update for v1'), message='I like this version. Tagging it as v1', author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95b26b204d68d6226ab)

View all tags in the commit log

repo.commit_log
  • Commit ID: 6632e95b26b204d68d6226ab
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:10.950000
    Tags: v1

    update for v1

  • Commit ID: 6632e95726b204d68d6226a8
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:07.379000
    Tags: v0

    yet another change

  • Commit ID: 6632e95526b204d68d6226a7
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:05.419000

    I made my second change

  • Commit ID: 6632e95426b204d68d6226a6
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:04.572000

    I made my first change

Tag a specific commit by passing the commit_id

repo.tag("v1_second_time", commit_id=v1_commit, message="Oops, I did it again.")
Output
Tag(id='v1_second_time', label='v1_second_time', created_at=datetime.datetime(2024, 5, 2, 1, 16, 11, 688000), commit=Commit(id=6632e95b26b204d68d6226ab, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 9, 720000), parent_commit=6632e95726b204d68d6226a8, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 10, 950000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='update for v1'), message='Oops, I did it again.', author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95b26b204d68d6226ab)

The commit_log will list all tags associated with a commit

repo.commit_log
  • Commit ID: 6632e95b26b204d68d6226ab
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:10.950000
    Tags: v1, v1_second_time

    update for v1

  • Commit ID: 6632e95726b204d68d6226a8
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:07.379000
    Tags: v0

    yet another change

  • Commit ID: 6632e95526b204d68d6226a7
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:05.419000

    I made my second change

  • Commit ID: 6632e95426b204d68d6226a6
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-02T01:16:04.572000

    I made my first change

Time-travel to a tag

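Pass a tag name to Repo.checkout (for example, repo.checkout("v0")); the commit log then starts at the tagged commit. Note that the commit IDs and timestamps in the listings from here on differ from those shown earlier in this tutorial.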

repo.commit_log
  • Commit ID: 6632ac858bab9b29700d02b0
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-01T20:56:37.446000
    Tags: v0

    yet another change

  • Commit ID: 6632ac838bab9b29700d02af
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-01T20:56:35.517000

    I made my second change

  • Commit ID: 6632ac828bab9b29700d02ae
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-01T20:56:34.663000

    I made my first change

Let's get back to the main branch

repo.checkout("main")
Output
6632ac898bab9b29700d02b3

Delete a tag

A tag name can only be used once in a repo, but a tag with the same name can exist in multiple repos of the same organization. Tags are immutable, so they cannot be edited to point to a different commit, but they can be deleted with Repo.delete_tag.

repo.delete_tag('v1')

Verify that the v1 tag is deleted by checking Repo.commit_log

repo.commit_log
  • Commit ID: 6632ac898bab9b29700d02b3
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-01T20:56:41.156000
    Tags: v1_second_time

    update for v1

  • Commit ID: 6632ac858bab9b29700d02b0
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-01T20:56:37.446000
    Tags: v0

    yet another change

  • Commit ID: 6632ac838bab9b29700d02af
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-01T20:56:35.517000

    I made my second change

  • Commit ID: 6632ac828bab9b29700d02ae
    Author: Deepak Cherian <deepak@earthmover.io>
    Date: 2024-05-01T20:56:34.663000

    I made my first change

And in Repo.tags

repo.tags
Output
(Tag(id='v0', label='v0', created_at=datetime.datetime(2024, 5, 1, 20, 56, 40, 502000), commit=Commit(id=6632ac858bab9b29700d02b0, session_start_time=datetime.datetime(2024, 5, 1, 20, 56, 36, 721000), parent_commit=6632ac838bab9b29700d02af, commit_time=datetime.datetime(2024, 5, 1, 20, 56, 37, 446000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='yet another change'), message=None, author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632ac858bab9b29700d02b0),
Tag(id='v1_second_time', label='v1_second_time', created_at=datetime.datetime(2024, 5, 1, 20, 56, 42, 113000), commit=Commit(id=6632ac898bab9b29700d02b3, session_start_time=datetime.datetime(2024, 5, 1, 20, 56, 39, 725000), parent_commit=6632ac858bab9b29700d02b0, commit_time=datetime.datetime(2024, 5, 1, 20, 56, 41, 156000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='update for v1'), message='Oops, I did it again.', author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632ac898bab9b29700d02b3))
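
Since tags cannot be edited in place, pointing an existing tag name at a different commit means deleting the tag and creating it again. Below is a minimal sketch using only calls shown in this tutorial (Repo.tags, Repo.delete_tag, and Repo.tag with commit_id and message). `move_tag` is a hypothetical helper, and the two steps are not atomic: between the delete and the re-create, the tag briefly does not exist.

```python
def move_tag(repo, name, commit_id, message=None):
    """Re-point tag `name` at `commit_id` by deleting and recreating it.

    `repo` is assumed to be a Repo like the one used throughout this tutorial.
    """
    existing = {tag.label for tag in repo.tags}
    if name in existing:
        repo.delete_tag(name)  # tags are immutable, so remove the old one first
    return repo.tag(name, commit_id=commit_id, message=message)


# Example usage (requires a real repo, so it is left commented out):
# move_tag(repo, "v1", v1_commit, message="Re-tagged v1")
```

Use this sparingly: anyone who relied on the old tag will silently get a different version of the data afterwards.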