Version Control with Arraylake
Arraylake carries over concepts from other version control software (e.g. Git) to multidimensional arrays. Doing so helps ease the burden of managing multiple versions of your data, and helps you be precise about which version of your dataset is being used for downstream purposes.
Core concepts of Arraylake's version control system are:
- A commit bundles together related data and metadata changes in a single "transaction".
- A branch points to the latest commit in a series of commits. Multiple branches can co-exist at a given time, and multiple users can add commits to a single branch. One common pattern is to use dev, stage, and prod branches to separate versions of a dataset.
- A tag is an immutable reference to a commit, usually used to represent an "important" version of the dataset such as a release.
Commits, branches, and tags all refer to specific versions of your dataset. You can time-travel back to any version of your data by passing a commit ID, a branch name, or a tag name to Repo.checkout.
See here for more on commits and branches.
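To make the relationship between the three kinds of reference concrete, here is a minimal toy model in plain Python. It is not Arraylake code: the ToyRefs class, its resolve method, and the lookup order (branch, then tag, then commit ID) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRefs:
    """Hypothetical stand-in for a repo's references; not an Arraylake class."""

    commits: set = field(default_factory=set)     # known commit IDs
    branches: dict = field(default_factory=dict)  # branch name -> commit ID (mutable)
    tags: dict = field(default_factory=dict)      # tag name -> commit ID (fixed once set)

    def resolve(self, ref: str) -> str:
        """Return the commit ID that a branch name, tag name, or commit ID refers to."""
        if ref in self.branches:  # assumed lookup order: branch first
            return self.branches[ref]
        if ref in self.tags:
            return self.tags[ref]
        if ref in self.commits:
            return ref
        raise KeyError(f"unknown ref: {ref!r}")


refs = ToyRefs(commits={"c1", "c2"}, branches={"main": "c2"}, tags={"v0": "c1"})
print(refs.resolve("main"))  # a branch name resolves to its latest commit
print(refs.resolve("v0"))    # a tag name resolves to the tagged commit
print(refs.resolve("c1"))    # a commit ID resolves to itself
```

Whatever kind of reference is passed in, the result is always a single commit, which is why any of the three can be used to pin a dataset version.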
Setup
First, let's create a new repo for demonstration purposes.
import arraylake as al
client = al.Client()
repo = client.create_repo("earthmover-demos/vcs")
repo
<arraylake.repo.Repo 'earthmover-demos/vcs'>
Committing
Concepts
- A commit is created by an Author, and links a set of data and metadata changes together as a single transaction with the data store.
- Each commit is associated with a unique auto-generated commit_id; the exact timestamp for when it was created; parent_commit, a pointer to the commit that came before; and a message.
- The commit_id is an immutable identifier for the state of the repo. Multiple clients can check out any commit and are guaranteed to always see the exact same data.
- Any changes you make to your dataset are not reflected remotely until a commit has been made.
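The concepts above can be sketched with a small in-memory model. This is a toy written for this tutorial, not Arraylake's implementation: ToyCommit, ToyStore, the sequential "c1", "c2" IDs, and the author string are all hypothetical.

```python
import itertools
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ToyCommit:
    """Hypothetical commit record holding the fields listed above."""

    commit_id: str
    parent_commit: Optional[str]
    author: str
    message: str
    committed_at: datetime
    attrs: tuple  # immutable snapshot of (key, value) pairs


class ToyStore:
    """Hypothetical store: pending changes stay local until commit() is called."""

    def __init__(self):
        self.commits = {}
        self.head = None
        self.pending = {}
        self._ids = itertools.count(1)

    def set_attr(self, key, value):
        self.pending[key] = value  # only visible in this session

    def committed_attrs(self):
        """Attributes as any other client would see them."""
        if self.head is None:
            return {}
        return dict(self.commits[self.head].attrs)

    def commit(self, author, message):
        snapshot = {**self.committed_attrs(), **self.pending}
        commit_id = f"c{next(self._ids)}"
        self.commits[commit_id] = ToyCommit(
            commit_id=commit_id,
            parent_commit=self.head,
            author=author,
            message=message,
            committed_at=datetime.now(timezone.utc),
            attrs=tuple(sorted(snapshot.items())),
        )
        self.head = commit_id
        self.pending = {}
        return commit_id


store = ToyStore()
store.set_attr("what", "a change")
before = store.committed_attrs()  # still empty: nothing committed yet
first = store.commit("A. Author", "I made my first change")
store.set_attr("what", "another change")
second = store.commit("A. Author", "I made my second change")
print(before, store.committed_attrs(), store.commits[second].parent_commit)
```

Because each ToyCommit is frozen and stores its own snapshot, reading the first commit later returns the same attributes no matter how many commits follow it.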
Create a commit
First make a change.
For simplicity, this tutorial only modifies the attributes for the Zarr group.
repo.root_group.attrs["what"] = "a change"
repo.root_group.attrs.asdict()
{'what': 'a change'}
Inspect any uncommitted changes by calling the Repo.status() method.
repo.status()
🧊 Using repo earthmover-demos/vcs
📟 Session f53e861dc3974f548662e575e1cf20d7 started at 2024-05-02T01:16:03.245000
🌿 On branch main
paths modified in session
- 📝 meta/root.group.json
Now create a new commit using Repo.commit(). This will return the commit ID that we will use later.
first_commit_id = repo.commit(message="I made my first change")
first_commit_id
6632e95426b204d68d6226a6
Make a second change, and a second commit.
repo.root_group.attrs["what"] = "another change"
repo.commit(message="I made my second change")
6632e95526b204d68d6226a7
List commits
View the commit history using Repo.commit_log. Notice the associated author and timestamp.
repo.commit_log
Commit ID 6632e95526b204d68d6226a7 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:05.419000 I made my second change
Commit ID 6632e95426b204d68d6226a6 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:04.572000 I made my first change
Arraylake commit IDs are not completely random, unlike the hashes used by some other version control systems (e.g. Git): IDs created close together share a long common prefix, as in the log above. Identify different commits using the last few characters, rather than the first few.
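As an illustration of why the trailing characters are the useful ones, the helper below finds the shortest distinguishing suffix for each ID. The short_suffixes function is a hypothetical utility written for this page, not part of Arraylake; the IDs are the ones printed earlier in this tutorial.

```python
def short_suffixes(commit_ids, min_len=4):
    """Map each commit ID to its shortest suffix (>= min_len chars) unique in the list."""
    result = {}
    for cid in commit_ids:
        for n in range(min_len, len(cid) + 1):
            suffix = cid[-n:]
            # The suffix is usable if exactly one ID in the list ends with it
            if sum(other.endswith(suffix) for other in commit_ids) == 1:
                result[cid] = suffix
                break
        else:
            result[cid] = cid  # no unique suffix (e.g. duplicate IDs): keep the full ID
    return result


ids = [
    "6632e95426b204d68d6226a6",
    "6632e95526b204d68d6226a7",
    "6632e95726b204d68d6226a8",
]
suffixes = short_suffixes(ids)
shared_prefixes = {cid[:6] for cid in ids}
print(suffixes)
print(shared_prefixes)  # a single shared prefix: the first characters do not distinguish commits
```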
Time-travel to a commit
Time travel back to a specific commit with Repo.checkout.
Here's where we are currently
repo.root_group.attrs.asdict()
{'what': 'another change'}
Check out the first commit by passing the commit ID.
You could also copy this ID from the commit log, or the web app.
repo.checkout(first_commit_id)
/Users/deepak/repos/arraylake/client/arraylake/repo.py:426: UserWarning: You are not on a branch tip, so you can't commit changes.
warnings.warn("You are not on a branch tip, so you can't commit changes.")
6632e95426b204d68d6226a6
Voilà! We have traveled back in time.
repo.root_group.attrs.asdict()
{'what': 'a change'}
Note the warning about being on a "branch tip". We will describe this next. For now, remember that while you can travel back to any commit, you cannot commit changes unless you are on a branch tip.
Branching
Concepts
- A branch is a mutable, or modifiable, reference to a commit_id. By default, all repos begin with a main branch.
- There is no specific Author associated with a branch.
- A "branch tip" refers to the most recent commit on a branch.
- You can only make changes to your data while on a branch tip. A new commit will update the branch pointer to point to that new commit.
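The branch-tip rule can be sketched with a toy model. ToyRepo below is hypothetical and is not how Arraylake is implemented. In particular, this toy raises an exception when you commit away from a branch tip, whereas the Arraylake session shown earlier emits a warning at checkout time.

```python
class ToyRepo:
    """Hypothetical repo: a branch is just a name pointing at a commit ID."""

    def __init__(self):
        self.parents = {"c1": None, "c2": "c1"}  # commit ID -> parent commit ID
        self.branches = {"main": "c2"}           # branch name -> tip commit ID
        self.branch = "main"                     # None when not on a branch tip
        self.head = "c2"

    def checkout(self, ref):
        if ref in self.branches:
            self.branch, self.head = ref, self.branches[ref]
        elif ref in self.parents:
            self.branch, self.head = None, ref   # plain commit: read-only position
        else:
            raise KeyError(f"unknown ref: {ref!r}")
        return self.head

    def commit(self, new_id):
        if self.branch is None:
            raise RuntimeError("not on a branch tip, cannot commit")
        self.parents[new_id] = self.head
        self.branches[self.branch] = new_id      # the branch pointer moves forward
        self.head = new_id
        return new_id


repo = ToyRepo()
repo.checkout("c1")
try:
    repo.commit("c3")
    off_tip_commit_rejected = False
except RuntimeError:
    off_tip_commit_rejected = True

repo.checkout("main")
repo.commit("c3")
print(off_tip_commit_rejected, repo.branches)
```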
Check out the main branch with Repo.checkout
repo.checkout("main")
6632e95526b204d68d6226a7
View the current branch, together with any uncommitted modifications, in Repo.status.
repo.status()
🧊 Using repo earthmover-demos/vcs
📟 Session dfca330687e54443a13b1949c0a12c97 started at 2024-05-02T01:16:06.683000
🌿 On branch main
No changes in current session
Updating a branch
Making a commit will update the branch pointer too
repo.root_group.attrs["what"] = "more change"
repo.commit("yet another change")
6632e95726b204d68d6226a8
repo.status()
🧊 Using repo earthmover-demos/vcs
📟 Session 67f4d61863dc4d6dbb2259233ad9ed0c started at 2024-05-02T01:16:07.677000
🌿 On branch main
No changes in current session
Create a branch
Create a new branch with Repo.new_branch
repo.new_branch("stage")
List branches
View a list of branches with Repo.branches
repo.branches
(Branch(id='main', commit_id=6632e95726b204d68d6226a8),)
Note that the new branch "stage" is not present in the list of branches!
Currently, new branches are created remotely only when a new commit is created.
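The behaviour just described, where a new branch exists only locally until its first commit, can be mimicked with a small toy. ToySession and the short commit IDs are hypothetical and only illustrate the idea; they say nothing about how Arraylake stores branches.

```python
class ToySession:
    """Hypothetical client session working against a shared branch table."""

    def __init__(self, remote_branches, branch):
        self.remote = remote_branches           # shared: branch name -> commit ID
        self.branch = branch
        self.head = remote_branches[branch]

    def new_branch(self, name):
        if name in self.remote:
            raise ValueError(f"branch {name!r} already exists")
        self.branch = name                      # local switch only, nothing written remotely

    def commit(self, new_id):
        self.remote[self.branch] = new_id       # first commit publishes the branch
        self.head = new_id
        return new_id


remote = {"main": "a8"}
session = ToySession(remote, "main")
session.new_branch("stage")
listed_before_commit = sorted(remote)           # "stage" is not listed yet
session.commit("a9")
listed_after_commit = sorted(remote)
print(listed_before_commit, listed_after_commit)
```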
Make a new commit on the new branch
repo.root_group.attrs["updated_on"] = "stage"
repo.commit("Committed to 'stage'")
6632e95826b204d68d6226a9
Now check the list of branches again.
repo.branches
(Branch(id='main', commit_id=6632e95726b204d68d6226a8),
Branch(id='stage', commit_id=6632e95826b204d68d6226a9))
View commit history for a branch with commit_log.
repo.commit_log
Commit ID 6632e95826b204d68d6226a9 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:08.607000 Committed to 'stage'
Commit ID 6632e95726b204d68d6226a8 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:07.379000 yet another change
Commit ID 6632e95526b204d68d6226a7 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:05.419000 I made my second change
Commit ID 6632e95426b204d68d6226a6 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:04.572000 I made my first change
repo.root_group.attrs.asdict()
{'updated_on': 'stage', 'what': 'more change'}
Switch branches
Switch branches with Repo.checkout
repo.checkout("main")
6632e95726b204d68d6226a8
repo.root_group.attrs.asdict()
{'what': 'more change'}
Let's check the commit log again
repo.commit_log
Commit ID 6632e95726b204d68d6226a8 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:07.379000 yet another change
Commit ID 6632e95526b204d68d6226a7 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:05.419000 I made my second change
Commit ID 6632e95426b204d68d6226a6 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:04.572000 I made my first change
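The two logs differ because each one starts at the checked-out branch tip and follows parent_commit pointers backwards. A minimal sketch of that walk is below; it assumes a linear, single-parent history, and the toy_commit_log function and abbreviated IDs are hypothetical.

```python
def toy_commit_log(parents, tip):
    """Walk parent pointers from a tip back to the root, newest commit first."""
    log = []
    current = tip
    while current is not None:
        log.append(current)
        current = parents[current]
    return log


# Abbreviated commit IDs (last characters only); each maps to its parent
parents = {"a6": None, "a7": "a6", "a8": "a7", "a9": "a8"}
branches = {"main": "a8", "stage": "a9"}

stage_log = toy_commit_log(parents, branches["stage"])
main_log = toy_commit_log(parents, branches["main"])
print(stage_log)
print(main_log)
```

The commit made on stage is reachable only from the stage tip, so it never appears when walking back from main.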
Delete a branch
Delete a branch using Repo.delete_branch().
The main branch is special and cannot be deleted. Please let us know if you have a need for this feature.
repo.delete_branch("stage")
Confirm that branch was deleted with Repo.branches
repo.branches
(Branch(id='main', commit_id=6632e95726b204d68d6226a8),)
Tagging
Concepts
- Tags are created by an Author, and associated with a timestamp and a particular commit through a commit_id.
- Tags may be associated with an optional message.
- Tags are immutable.
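One way to picture an immutable tag is a frozen dataclass, sketched below. ToyTag is a hypothetical stand-in whose field names loosely follow the Tag output shown later on this page; it is not the Arraylake Tag class.

```python
from dataclasses import FrozenInstanceError, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ToyTag:
    """Hypothetical immutable tag record."""

    label: str
    commit_id: str
    author: str
    created_at: datetime
    message: Optional[str] = None  # the message is optional


tag = ToyTag("v0", "a8", "A. Author", datetime.now(timezone.utc))

try:
    tag.commit_id = "a9"  # attempt to re-point the tag
    retarget_failed = False
except FrozenInstanceError:
    retarget_failed = True

print(tag.label, tag.commit_id, tag.message, retarget_failed)
```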
Create a tag
Create a new tag with Repo.tag. By default this will tag the latest commit on the current branch.
repo.tag("v0")
Tag(id='v0', label='v0', created_at=datetime.datetime(2024, 5, 2, 1, 16, 10, 305000), commit=Commit(id=6632e95726b204d68d6226a8, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 6, 683000), parent_commit=6632e95526b204d68d6226a7, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 7, 379000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='yet another change'), message=None, author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95726b204d68d6226a8)
View the tag in the commit log
repo.commit_log
Commit ID 6632e95726b204d68d6226a8 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:07.379000 Tags v0 yet another change
Commit ID 6632e95526b204d68d6226a7 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:05.419000 I made my second change
Commit ID 6632e95426b204d68d6226a6 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:04.572000 I made my first change
List tags
View a list of tags with Repo.tags
repo.tags
(Tag(id='v0', label='v0', created_at=datetime.datetime(2024, 5, 2, 1, 16, 10, 305000), commit=Commit(id=6632e95726b204d68d6226a8, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 6, 683000), parent_commit=6632e95526b204d68d6226a7, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 7, 379000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='yet another change'), message=None, author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95726b204d68d6226a8),)
Customizing tags
Tags can be customized with a "message". First let's make a new commit and then tag it.
repo.root_group.attrs["what"] = "a change for version 1"
v1_commit = repo.commit("update for v1")
repo.tag("v1", message="I like this version. Tagging it as v1")
Tag(id='v1', label='v1', created_at=datetime.datetime(2024, 5, 2, 1, 16, 11, 424000), commit=Commit(id=6632e95b26b204d68d6226ab, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 9, 720000), parent_commit=6632e95726b204d68d6226a8, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 10, 950000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='update for v1'), message='I like this version. Tagging it as v1', author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95b26b204d68d6226ab)
View all tags in the commit log
repo.commit_log
Commit ID 6632e95b26b204d68d6226ab Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:10.950000 Tags v1 update for v1
Commit ID 6632e95726b204d68d6226a8 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:07.379000 Tags v0 yet another change
Commit ID 6632e95526b204d68d6226a7 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:05.419000 I made my second change
Commit ID 6632e95426b204d68d6226a6 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:04.572000 I made my first change
Tag a specific commit by passing the commit_id
repo.tag("v1_second_time", commit_id=v1_commit, message="Oops, I did it again.")
Tag(id='v1_second_time', label='v1_second_time', created_at=datetime.datetime(2024, 5, 2, 1, 16, 11, 688000), commit=Commit(id=6632e95b26b204d68d6226ab, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 9, 720000), parent_commit=6632e95726b204d68d6226a8, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 10, 950000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='update for v1'), message='Oops, I did it again.', author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95b26b204d68d6226ab)
The commit_log will list all tags associated with a commit.
repo.commit_log
Commit ID 6632e95b26b204d68d6226ab Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:10.950000 Tags v1, v1_second_time update for v1
Commit ID 6632e95726b204d68d6226a8 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:07.379000 Tags v0 yet another change
Commit ID 6632e95526b204d68d6226a7 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:05.419000 I made my second change
Commit ID 6632e95426b204d68d6226a6 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:04.572000 I made my first change
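Since several tags can point at the same commit, a log display has to group tag names by commit. A minimal sketch of that grouping follows; the tags_by_commit helper and the abbreviated commit IDs are hypothetical, and the alphabetical ordering is this sketch's choice.

```python
from collections import defaultdict


def tags_by_commit(tags):
    """Invert {tag name: commit ID} into {commit ID: sorted list of tag names}."""
    grouped = defaultdict(list)
    for label, commit_id in tags.items():
        grouped[commit_id].append(label)
    return {commit_id: sorted(labels) for commit_id, labels in grouped.items()}


# Abbreviated commit IDs (last characters only)
tags = {"v0": "a8", "v1": "ab", "v1_second_time": "ab"}
grouped = tags_by_commit(tags)
print(grouped)
```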
Time-travel to a tag
Pass a tag name to Repo.checkout.
repo.checkout("v0")
repo.commit_log
Commit ID 6632ac858bab9b29700d02b0 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-01T20:56:37.446000 Tags v0 yet another change
Commit ID 6632ac838bab9b29700d02af Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-01T20:56:35.517000 I made my second change
Commit ID 6632ac828bab9b29700d02ae Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-01T20:56:34.663000 I made my first change
Let's get back to the main branch
repo.checkout("main")
6632ac898bab9b29700d02b3
Delete a tag
A tag name can be used only once in a repo, but a tag with the same name can exist in multiple repos of the same organization.
Tags are immutable, so they cannot be edited to point to a different commit.
But they can be deleted with Repo.delete_tag.
repo.delete_tag('v1')
Verify that the v1 tag is deleted by checking Repo.commit_log
repo.commit_log
Commit ID 6632ac898bab9b29700d02b3 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-01T20:56:41.156000 Tags v1_second_time update for v1
Commit ID 6632ac858bab9b29700d02b0 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-01T20:56:37.446000 Tags v0 yet another change
Commit ID 6632ac838bab9b29700d02af Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-01T20:56:35.517000 I made my second change
Commit ID 6632ac828bab9b29700d02ae Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-01T20:56:34.663000 I made my first change
The tag is also absent from Repo.tags
repo.tags
(Tag(id='v0', label='v0', created_at=datetime.datetime(2024, 5, 1, 20, 56, 40, 502000), commit=Commit(id=6632ac858bab9b29700d02b0, session_start_time=datetime.datetime(2024, 5, 1, 20, 56, 36, 721000), parent_commit=6632ac838bab9b29700d02af, commit_time=datetime.datetime(2024, 5, 1, 20, 56, 37, 446000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='yet another change'), message=None, author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632ac858bab9b29700d02b0),
Tag(id='v1_second_time', label='v1_second_time', created_at=datetime.datetime(2024, 5, 1, 20, 56, 42, 113000), commit=Commit(id=6632ac898bab9b29700d02b3, session_start_time=datetime.datetime(2024, 5, 1, 20, 56, 39, 725000), parent_commit=6632ac858bab9b29700d02b0, commit_time=datetime.datetime(2024, 5, 1, 20, 56, 41, 156000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='update for v1'), message='Oops, I did it again.', author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632ac898bab9b29700d02b3))
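The tag rules in this section (a name is unique within a repo, a tag cannot be re-pointed, but it can be deleted) can be summarized with a toy registry. ToyTags is hypothetical and only mirrors the rules stated above; it is not Arraylake's implementation, and the error type it raises is an assumption.

```python
class ToyTags:
    """Hypothetical per-repo tag registry enforcing the rules described above."""

    def __init__(self):
        self._tags = {}  # tag name -> commit ID

    def tag(self, label, commit_id):
        if label in self._tags:
            # No overwrite: an existing tag is never re-pointed
            raise ValueError(f"tag {label!r} already exists")
        self._tags[label] = commit_id

    def delete_tag(self, label):
        del self._tags[label]

    def labels(self):
        return sorted(self._tags)


registry = ToyTags()
registry.tag("v0", "a8")
registry.tag("v1", "ab")
registry.tag("v1_second_time", "ab")

try:
    registry.tag("v1", "a8")  # same name, different commit
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

registry.delete_tag("v1")
print(duplicate_rejected, registry.labels())
```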