Version Control with Arraylake
Arraylake carries over concepts from other version control software (e.g. Git) to multidimensional arrays. Doing so helps ease the burden of managing multiple versions of your data, and helps you be precise about which version of your dataset is being used for downstream purposes.
Core concepts of Arraylake's version control system are:
- A commit bundles together related data and metadata changes in a single "transaction".
- A branch points to the latest commit in a series of commits. Multiple branches can co-exist at a given time, and multiple users can add commits to a single branch. One common pattern is to use dev, stage, and prod branches to separate versions of a dataset.
- A tag is an immutable reference to a commit, usually used to represent an "important" version of the dataset such as a release.
Commits, branches, and tags all refer to specific versions of your dataset. You can time-travel back to any version of your data by passing a commit ID, a branch name, or a tag name to Repo.checkout.
See here for more on commits and branches.
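To make the relationship between the three kinds of reference concrete, here is a minimal in-memory sketch. It is a toy model, not Arraylake's implementation: the class name ToyRefs, the commit IDs c1 to c3, and the lookup order (branch, then tag, then commit ID) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ToyRefs:
    """Toy lookup table; not how Arraylake stores references internally."""

    commit_ids: Set[str] = field(default_factory=set)
    branches: Dict[str, str] = field(default_factory=dict)  # name -> commit ID, moves on commit
    tags: Dict[str, str] = field(default_factory=dict)      # name -> commit ID, never moves

    def resolve(self, ref: str) -> str:
        """Return the commit ID that a branch name, tag name, or commit ID refers to."""
        # Lookup order (branch, then tag, then commit ID) is an assumption of this sketch.
        if ref in self.branches:
            return self.branches[ref]
        if ref in self.tags:
            return self.tags[ref]
        if ref in self.commit_ids:
            return ref
        raise KeyError(f"unknown reference: {ref!r}")


refs = ToyRefs(
    commit_ids={"c1", "c2", "c3"},
    branches={"main": "c3"},
    tags={"v0": "c1"},
)
print(refs.resolve("main"))  # c3
print(refs.resolve("v0"))    # c1
print(refs.resolve("c2"))    # c2
```

Whatever name is passed in, the result is always a commit ID, which is why any of the three can be used to select a version.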
Setup
First, let's create a new repo for demonstration purposes.
import arraylake as al
client = al.Client()
repo = client.get_or_create_repo("earthmover-demos/vcs")
repo
<arraylake.repo.Repo 'earthmover-demos/vcs'>
Committing
Concepts
- A commit is created by an Author, and links a set of data and metadata changes together as a single transaction with the data store.
- Each commit is associated with a unique auto-generated commit_id; the exact timestamp for when it was created; parent_commit, a pointer to the commit that came before; and a message.
- The commit_id is an immutable identifier for the state of the repo. Multiple clients can check out any commit and are guaranteed to always see the exact same data.
- Any changes you make to your dataset are not reflected remotely until a commit has been made.
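The fields listed above can be pictured as a small immutable record, and following parent_commit pointers from any commit reproduces its history. The sketch below is a toy model under stated assumptions: ToyCommit and walk_history are hypothetical names, and the IDs, author, and timestamps are placeholders rather than values from a real repo.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterator, Optional


@dataclass(frozen=True)  # frozen: a commit cannot be modified after creation
class ToyCommit:
    commit_id: str
    parent_commit: Optional[str]  # None for the very first commit
    message: str
    author: str
    commit_time: datetime


def walk_history(commits: Dict[str, ToyCommit], start: str) -> Iterator[ToyCommit]:
    """Yield commits from `start` back to the first commit, newest first."""
    current: Optional[str] = start
    while current is not None:
        commit = commits[current]
        yield commit
        current = commit.parent_commit


# Placeholder data for illustration only.
history = {
    "c1": ToyCommit("c1", None, "I made my first change", "A. Author", datetime(2024, 1, 1, 12, 0)),
    "c2": ToyCommit("c2", "c1", "I made my second change", "A. Author", datetime(2024, 1, 1, 12, 5)),
}
for commit in walk_history(history, "c2"):
    print(commit.commit_id, commit.message)
```

Because each commit records only its parent, the log for a given commit is fully determined by that commit's ID.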
Create a commit
First make a change.
For simplicity, this tutorial only modifies the attributes for the Zarr group.
repo.root_group.attrs["what"] = "a change"
repo.root_group.attrs.asdict()
{'what': 'a change'}
Inspect any uncommitted changes by calling the Repo.status() method.
repo.status()
🧊 Using repo earthmover-demos/vcs
📌 Base commit is 6697354dd8e9064ad7e6c6d6
📟 Session df68a5f10793450281893b364b421860 started at 2024-07-17T03:07:04.293000
🌿 On branch main
paths modified in session
- 📝 meta/root.group.json
Now create a new commit using Repo.commit(). This will return the commit ID that we will use later.
first_commit_id = repo.commit(message="I made my first change")
first_commit_id
6697355ce00e05ed1cce08d2
Make a second change, and a second commit.
repo.root_group.attrs["what"] = "another change"
repo.commit(message="I made my second change")
6697355f3f5775e74408ab47
Call repo.status() to see the updated base commit.
repo.status()
🧊 Using repo earthmover-demos/vcs
📌 Base commit is 6697355f3f5775e74408ab47
📟 Session 7d1b9f87a1404f0dbce9623c055901bc started at 2024-07-17T03:07:11.985000
🌿 On branch main
No changes in current session
List commits
View the commit history using Repo.commit_log. Notice the associated author and timestamp.
repo.commit_log
Commit ID 6697355f3f5775e74408ab47 Author Emma Marshall <emma@earthmover.io> Date 2024-07-17T03:09:40.340000 I made my second change
Commit ID 6697355ce00e05ed1cce08d2 Author Emma Marshall <emma@earthmover.io> Date 2024-07-17T03:09:37.024000 I made my first change
Commit ID 6697354dd8e9064ad7e6c6d6 Author Emma Marshall <emma@earthmover.io> Date 2024-07-17T03:09:21.979000 I made my first change
Commit ID 6632e95b26b204d68d6226ab Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:10.950000 Tags v1, v1_second_time update for v1
Commit ID 6632e95726b204d68d6226a8 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:07.379000 Tags v0 yet another change
Commit ID 6632e95526b204d68d6226a7 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:05.419000 I made my second change
Commit ID 6632e95426b204d68d6226a6 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:04.572000 I made my first change
Arraylake commit IDs are not completely random as in some other version control systems (e.g. Git). Identify different commits using the last few characters, rather than the first few.
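As a quick check of this advice, the sketch below computes how many characters are needed to tell the commit IDs in the log above apart, counting from the end and from the start. The helper name shortest_unique_length is hypothetical and is not part of Arraylake.

```python
from typing import Iterable


def shortest_unique_length(commit_ids: Iterable[str], from_end: bool = True) -> int:
    """Smallest n such that the last (or first) n characters of every ID are distinct."""
    ids = list(commit_ids)
    if len(set(ids)) != len(ids):
        raise ValueError("commit IDs must be distinct")
    if not ids:
        return 0
    longest = max(len(i) for i in ids)
    for n in range(1, longest + 1):
        pieces = {(i[-n:] if from_end else i[:n]) for i in ids}
        if len(pieces) == len(ids):
            return n
    return longest


# Commit IDs copied from the commit log shown above.
commit_ids = [
    "6697355f3f5775e74408ab47",
    "6697355ce00e05ed1cce08d2",
    "6697354dd8e9064ad7e6c6d6",
    "6632e95b26b204d68d6226ab",
    "6632e95726b204d68d6226a8",
    "6632e95526b204d68d6226a7",
    "6632e95426b204d68d6226a6",
]
print(shortest_unique_length(commit_ids, from_end=True))   # 2
print(shortest_unique_length(commit_ids, from_end=False))  # 8
```

For this log, two trailing characters are enough to identify a commit, while eight leading characters are needed.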
Time-travel to a commit
Time travel back to a specific commit with Repo.checkout.
Here's where we are currently
repo.root_group.attrs.asdict()
{'what': 'another change'}
Check out the first commit by passing the commit ID.
You could also copy this ID from the commit log, or the web app.
repo.checkout(first_commit_id)
/home/emmamarshall/Desktop/earthmover/arraylake/client/arraylake/asyn.py:135: UserWarning: You are not on a branch tip, so you can't commit changes.
result[0] = await coro
6697355ce00e05ed1cce08d2
Voilà! We have traveled back in time.
repo.root_group.attrs.asdict()
{'what': 'a change'}
Note the warning about not being on a "branch tip". We will describe this next. For now, remember that while you can travel back to any commit, you cannot commit changes from a commit that is not a branch tip.
Branching
Branches are great for testing out updates for your data before you commit them to your main branch.
Concepts
- A branch is a mutable, or modifiable, reference to a commit_id. By default, all repos begin with a main branch.
- There is no specific Author associated with a branch.
- A "branch tip" refers to the most recent commit on a branch.
- You can only make changes to your data while on a branch tip. A new commit will update the branch pointer to point to that new commit.
Arraylake branches differ from GitHub branches in one key way: GitHub branches are designed to be merged with main, while Arraylake branches are not. In Arraylake, creating a branch from a reference point initializes an independent development path that starts from a copy of the data at the reference state. On the new branch, you can safely prototype changes without affecting the branch it was created from. New writes to a branch are saved separately from the state when the branch was created.
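The branch rules above (commits are only allowed on a branch tip, a commit advances the branch pointer, and branches evolve independently) can be sketched with a small in-memory model. This is a toy under stated assumptions, not Arraylake's implementation: ToyBranchRepo, its attributes, and the commit IDs c0, c1, c2 are hypothetical, and the real client warns at checkout time instead of raising at commit time.

```python
from typing import Dict, Optional


class ToyBranchRepo:
    """Toy model: commits form a parent chain and branches point at tips."""

    def __init__(self) -> None:
        self.parents: Dict[str, Optional[str]] = {"c0": None}  # commit ID -> parent ID
        self.branches: Dict[str, str] = {"main": "c0"}          # branch name -> tip commit ID
        self.head: str = "c0"                # commit currently checked out
        self.branch: Optional[str] = "main"  # None when not on a branch tip
        self._counter = 0

    def checkout(self, ref: str) -> str:
        if ref in self.branches:
            self.branch, self.head = ref, self.branches[ref]
        elif ref in self.parents:
            self.branch, self.head = None, ref  # a plain commit: read-only
        else:
            raise KeyError(f"unknown reference: {ref!r}")
        return self.head

    def new_branch(self, name: str) -> None:
        if name in self.branches:
            raise ValueError(f"branch {name!r} already exists")
        # Only the local session switches; the branch is recorded on its first commit.
        self.branch = name

    def commit(self) -> str:
        if self.branch is None:
            raise RuntimeError("not on a branch tip, so changes cannot be committed")
        self._counter += 1
        new_id = f"c{self._counter}"
        self.parents[new_id] = self.head
        self.branches[self.branch] = new_id  # advance (or create) the branch pointer
        self.head = new_id
        return new_id


repo_model = ToyBranchRepo()
repo_model.commit()             # c1 on main
repo_model.new_branch("stage")
repo_model.commit()             # c2 on stage; main still points at c1
print(repo_model.branches)      # {'main': 'c1', 'stage': 'c2'}

repo_model.checkout("c0")       # an older commit, not a branch tip
try:
    repo_model.commit()
except RuntimeError as err:
    print(err)
```

In this model nothing ever moves main when stage receives a commit, which mirrors the independence described above.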
Below is a step-by-step walkthrough of creating branches, committing changes to different branches, switching between branches, and deleting branches.
Check out the main branch with Repo.checkout
repo.checkout("main")
6697355f3f5775e74408ab47
View the current branch, together with any uncommitted modifications, in Repo.status.
repo.status()
🧊 Using repo earthmover-demos/vcs
📌 Base commit is 6697355f3f5775e74408ab47
📟 Session b788b0c1619c4cad9129f81a277546c6 started at 2024-07-17T03:07:37.195000
🌿 On branch main
No changes in current session
Updating a branch
Making a commit will update the branch pointer too
repo.root_group.attrs["what"] = "more change"
repo.commit("yet another change")
6697357f3f5775e74408ab8d
repo.status()
🧊 Using repo earthmover-demos/vcs
📌 Base commit is 6697357f3f5775e74408ab8d
📟 Session ee3f4f96594f41a39fbe8f22eb72d60e started at 2024-07-17T03:07:44.233000
🌿 On branch main
No changes in current session
At this point, our working tree looks like this:
Create a branch
Create a new branch with Repo.new_branch
repo.new_branch("stage")
List branches
View a list of branches with Repo.branches
repo.branches
(Branch(id='main', commit_id=6697357f3f5775e74408ab8d),)
Note that the new branch "stage" is not present in the list of branches!
Currently, new branches are created remotely only when a new commit is created.
Make a new commit on the new branch
repo.root_group.attrs["updated_on"] = "stage"
repo.commit("Committed to 'stage'")
66973586d8e9064ad7e6c759
Now check the list of branches again.
repo.branches
(Branch(id='main', commit_id=6697357f3f5775e74408ab8d),
Branch(id='stage', commit_id=66973586d8e9064ad7e6c759))
View commit history for a branch with commit_log.
repo.commit_log
Commit ID 66973586d8e9064ad7e6c759 Author Emma Marshall <emma@earthmover.io> Date 2024-07-17T03:10:19.456000 Committed to 'stage'
Commit ID 6697357f3f5775e74408ab8d Author Emma Marshall <emma@earthmover.io> Date 2024-07-17T03:10:12.295000 yet another change
Commit ID 6697355f3f5775e74408ab47 Author Emma Marshall <emma@earthmover.io> Date 2024-07-17T03:09:40.340000 I made my second change
Commit ID 6697355ce00e05ed1cce08d2 Author Emma Marshall <emma@earthmover.io> Date 2024-07-17T03:09:37.024000 I made my first change
Commit ID 6697354dd8e9064ad7e6c6d6 Author Emma Marshall <emma@earthmover.io> Date 2024-07-17T03:09:21.979000 I made my first change
Commit ID 6632e95b26b204d68d6226ab Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:10.950000 Tags v1, v1_second_time update for v1
Commit ID 6632e95726b204d68d6226a8 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:07.379000 Tags v0 yet another change
Commit ID 6632e95526b204d68d6226a7 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:05.419000 I made my second change
Commit ID 6632e95426b204d68d6226a6 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:04.572000 I made my first change
Now that a commit has been made on the new branch, the commit history looks like this:
repo.root_group.attrs.asdict()
{'updated_on': 'stage', 'what': 'more change'}
Switch branches
Switch branches with Repo.checkout
repo.checkout("main")
6697357f3f5775e74408ab8d
repo.root_group.attrs.asdict()
{'what': 'more change'}
Let's check the commit log again
repo.commit_log
Commit ID 6697357f3f5775e74408ab8d Author Emma Marshall <emma@earthmover.io> Date 2024-07-17T03:10:12.295000 yet another change
Commit ID 6697355f3f5775e74408ab47 Author Emma Marshall <emma@earthmover.io> Date 2024-07-17T03:09:40.340000 I made my second change
Commit ID 6697355ce00e05ed1cce08d2 Author Emma Marshall <emma@earthmover.io> Date 2024-07-17T03:09:37.024000 I made my first change
Commit ID 6697354dd8e9064ad7e6c6d6 Author Emma Marshall <emma@earthmover.io> Date 2024-07-17T03:09:21.979000 I made my first change
Commit ID 6632e95b26b204d68d6226ab Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:10.950000 Tags v1, v1_second_time update for v1
Commit ID 6632e95726b204d68d6226a8 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:07.379000 Tags v0 yet another change
Commit ID 6632e95526b204d68d6226a7 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:05.419000 I made my second change
Commit ID 6632e95426b204d68d6226a6 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:04.572000 I made my first change
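The two attribute dictionaries printed above for stage and main differ by one key. A small pure-Python comparison makes such differences explicit when switching between versions. The helper name diff_attrs is hypothetical. It operates on plain dictionaries like those returned by asdict() and does not call Arraylake.

```python
from typing import Any, Dict


def diff_attrs(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Summarize how `new` differs from `old`."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Attribute dictionaries copied from the outputs shown above.
main_attrs = {"what": "more change"}
stage_attrs = {"updated_on": "stage", "what": "more change"}

print(diff_attrs(main_attrs, stage_attrs))
# {'added': {'updated_on': 'stage'}, 'removed': {}, 'changed': {}}
```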
Delete a branch
Delete a branch using Repo.delete_branch().
The main branch is special and cannot be deleted. Please let us know if you have a need for this feature.
repo.delete_branch("stage")
Confirm that the branch was deleted with Repo.branches
repo.branches
(Branch(id='main', commit_id=6697357f3f5775e74408ab8d),)
Tagging
Concepts
- Tags are created by an Author, and associated with a timestamp and a particular commit through a commit_id.
- Tags may be associated with an optional message.
- Tags are immutable.
Create a tag
Create a new tag with Repo.tag. By default this will tag the latest commit on the current branch.
repo.tag("v0")
Tag(id='v0', label='v0', created_at=datetime.datetime(2024, 5, 2, 1, 16, 10, 305000), commit=Commit(id=6632e95726b204d68d6226a8, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 6, 683000), parent_commit=6632e95526b204d68d6226a7, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 7, 379000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='yet another change'), message=None, author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95726b204d68d6226a8)
View the tag in the commit log
repo.commit_log
Commit ID 6632e95726b204d68d6226a8 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:07.379000 Tags v0 yet another change
Commit ID 6632e95526b204d68d6226a7 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:05.419000 I made my second change
Commit ID 6632e95426b204d68d6226a6 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:04.572000 I made my first change
List tags
View a list of tags with Repo.tags
repo.tags
(Tag(id='v0', label='v0', created_at=datetime.datetime(2024, 5, 2, 1, 16, 10, 305000), commit=Commit(id=6632e95726b204d68d6226a8, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 6, 683000), parent_commit=6632e95526b204d68d6226a7, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 7, 379000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='yet another change'), message=None, author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95726b204d68d6226a8),)
Customizing tags
Tags can be customized with a "message". First, let's make a new commit and then tag it.
repo.root_group.attrs["what"] = "a change for version 1"
v1_commit = repo.commit("update for v1")
repo.tag("v1", message="I like this version. Tagging it as v1")
Tag(id='v1', label='v1', created_at=datetime.datetime(2024, 5, 2, 1, 16, 11, 424000), commit=Commit(id=6632e95b26b204d68d6226ab, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 9, 720000), parent_commit=6632e95726b204d68d6226a8, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 10, 950000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='update for v1'), message='I like this version. Tagging it as v1', author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95b26b204d68d6226ab)
View all tags in the commit log
repo.commit_log
Commit ID 6632e95b26b204d68d6226ab Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:10.950000 Tags v1 update for v1
Commit ID 6632e95726b204d68d6226a8 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:07.379000 Tags v0 yet another change
Commit ID 6632e95526b204d68d6226a7 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:05.419000 I made my second change
Commit ID 6632e95426b204d68d6226a6 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:04.572000 I made my first change
Tag a specific commit by passing the commit_id
repo.tag("v1_second_time", commit_id=v1_commit, message="Oops, I did it again.")
Tag(id='v1_second_time', label='v1_second_time', created_at=datetime.datetime(2024, 5, 2, 1, 16, 11, 688000), commit=Commit(id=6632e95b26b204d68d6226ab, session_start_time=datetime.datetime(2024, 5, 2, 1, 16, 9, 720000), parent_commit=6632e95726b204d68d6226a8, commit_time=datetime.datetime(2024, 5, 2, 1, 16, 10, 950000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='update for v1'), message='Oops, I did it again.', author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632e95b26b204d68d6226ab)
The commit_log will list all tags associated with a commit
repo.commit_log
Commit ID 6632e95b26b204d68d6226ab Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:10.950000 Tags v1, v1_second_time update for v1
Commit ID 6632e95726b204d68d6226a8 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:07.379000 Tags v0 yet another change
Commit ID 6632e95526b204d68d6226a7 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:05.419000 I made my second change
Commit ID 6632e95426b204d68d6226a6 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-02T01:16:04.572000 I made my first change
Time-travel to a tag
Pass a tag name to Repo.checkout.
repo.checkout("v0")
repo.commit_log
Commit ID 6632ac858bab9b29700d02b0 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-01T20:56:37.446000 Tags v0 yet another change
Commit ID 6632ac838bab9b29700d02af Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-01T20:56:35.517000 I made my second change
Commit ID 6632ac828bab9b29700d02ae Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-01T20:56:34.663000 I made my first change
Let's get back to the main branch
repo.checkout("main")
6632ac898bab9b29700d02b3
Delete a tag
A tag name can only be used once in a repo, but a tag with the same name can exist in multiple repos of the same organization.
Tags are immutable, so they cannot be edited to point to a different commit.
But they can be deleted with Repo.delete_tag.
repo.delete_tag('v1')
Verify that the v1 tag is deleted by checking Repo.commit_log
repo.commit_log
Commit ID 6632ac898bab9b29700d02b3 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-01T20:56:41.156000 Tags v1_second_time update for v1
Commit ID 6632ac858bab9b29700d02b0 Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-01T20:56:37.446000 Tags v0 yet another change
Commit ID 6632ac838bab9b29700d02af Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-01T20:56:35.517000 I made my second change
Commit ID 6632ac828bab9b29700d02ae Author Deepak Cherian <deepak@earthmover.io> Date 2024-05-01T20:56:34.663000 I made my first change
And in Repo.tags
repo.tags
(Tag(id='v0', label='v0', created_at=datetime.datetime(2024, 5, 1, 20, 56, 40, 502000), commit=Commit(id=6632ac858bab9b29700d02b0, session_start_time=datetime.datetime(2024, 5, 1, 20, 56, 36, 721000), parent_commit=6632ac838bab9b29700d02af, commit_time=datetime.datetime(2024, 5, 1, 20, 56, 37, 446000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='yet another change'), message=None, author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632ac858bab9b29700d02b0),
Tag(id='v1_second_time', label='v1_second_time', created_at=datetime.datetime(2024, 5, 1, 20, 56, 42, 113000), commit=Commit(id=6632ac898bab9b29700d02b3, session_start_time=datetime.datetime(2024, 5, 1, 20, 56, 39, 725000), parent_commit=6632ac858bab9b29700d02b0, commit_time=datetime.datetime(2024, 5, 1, 20, 56, 41, 156000), author_name='Deepak Cherian', author_email='deepak@earthmover.io', message='update for v1'), message='Oops, I did it again.', author_name='Deepak Cherian', author_email='deepak@earthmover.io', commit_id=6632ac898bab9b29700d02b3))
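Because a tag cannot be edited, pointing an existing tag name at a different commit means deleting the tag and creating it again. The sketch below wraps those two steps using only the Repo.delete_tag and Repo.tag calls shown in this tutorial. The helper name move_tag is hypothetical, passing message=None is assumed to be accepted, and FakeRepo is an offline stand-in (not Arraylake code) included only so the example runs without a connection.

```python
from typing import Dict, Optional


def move_tag(repo, name: str, commit_id, message: Optional[str] = None):
    """Re-point `name` by deleting the old tag and creating a new one.

    Not atomic: between the two calls the tag does not exist.
    """
    repo.delete_tag(name)
    return repo.tag(name, commit_id=commit_id, message=message)


class FakeRepo:
    """Offline stand-in exposing only the two tag methods used above."""

    def __init__(self) -> None:
        self.tag_targets: Dict[str, str] = {}

    def tag(self, name, commit_id=None, message=None):
        if name in self.tag_targets:
            # Assumed behavior: a tag name can only be used once per repo.
            raise ValueError(f"tag {name!r} already exists")
        self.tag_targets[name] = commit_id
        return name

    def delete_tag(self, name):
        del self.tag_targets[name]


fake = FakeRepo()
fake.tag("v1", commit_id="6632e95726b204d68d6226a8")
move_tag(fake, "v1", "6632e95b26b204d68d6226ab", message="update for v1")
print(fake.tag_targets)  # {'v1': '6632e95b26b204d68d6226ab'}
```

Anyone who checks out the tag between the delete and the re-create will not find it, so a new tag name is often the safer choice for published versions.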