
Understanding Write Conflicts

Arraylake's version control system means that multiple users can safely collaborate on the same datasets. However, this also means that multiple clients editing the same metadata files or array data objects can give rise to situations where changes to the same objects conflict with one another. This tutorial will illustrate scenarios that give rise to these conflicts. See Transactions and Version Control for a detailed discussion of these concepts.

Learning Goals

  • Understand how Arraylake:
    • Uses version control to allow multiple users to modify the same Zarr dataset while ensuring that the dataset is always consistent.
    • Detects whether multiple writes modify the same object.
    • Uses sessions for version control.
    • Employs an 'optimistic concurrency' strategy for conflict resolution.
  • Understand when conflicts may or may not arise in the following situations:
    • Multiple users modify metadata.
    • Multiple users modify chunk data.
    • Multiple users modify chunk data and metadata.
# condense error messages
%xmode minimal

import arraylake as al
import xarray as xr
import numpy as np
import zarr
Output
Exception reporting mode: Minimal

Overview

The examples in this section will relate to the scenario of two (or more) users in an organization actively working on and writing common data objects. This is also analogous to multiple writers on different machines within a distributed system modifying the same datasets. To illustrate these examples, we will create two connections to the same Arraylake Repo. Throughout these examples, they will be referred to as 'User A' and 'User B', and they will interact with the Arraylake Repository through Repo objects, repo_a and repo_b, respectively.

Two clients interacting with Arraylake

First we will create two users connected to the same Arraylake Repository. We will walk through the initial process of users making changes, committing them to a Repository, and checking out changes made by another user before looking at commit conflicts in the next section.

client = al.Client()
repo_a = client.get_or_create_repo('earthmover/conflict-resolution')
repo_b = client.get_or_create_repo('earthmover/conflict-resolution')

This will create two different sessions which we can see by looking at the session_id given by repo.status().

repo_a.status()

🧊 Using repo earthmover/conflict-resolution
📟 Session 26400ba5957a4512a8764d3ae0229c66 started at 2024-06-07T14:47:31.688000
🌿 On branch main

No changes in current session

repo_b.status()

🧊 Using repo earthmover/conflict-resolution
📟 Session a262d303fe68460ea30d9b701a2b0305 started at 2024-06-07T14:47:32.459000
🌿 On branch main

No changes in current session

First User A adds a metadata attribute to the Repo's root group:

repo_a.root_group.attrs['foo'] = 'bar'
repo_a.root_group.attrs.asdict()
Output
{'foo': 'bar'}
repo_a.status()

🧊 Using repo earthmover/conflict-resolution
📟 Session 26400ba5957a4512a8764d3ae0229c66 started at 2024-06-07T14:47:31.688000
🌿 On branch main

paths modified in session

  • 📝 meta/root.group.json

repo.status() gives us a number of helpful pieces of information:

  • Which Arraylake Repo we are connected to.
  • The session ID and when it was created.
  • The paths to the files modified in this session.

In this example, meta/root.group.json is the object we modified when User A made the metadata addition above.

repo_a.commit(message = 'User A added metadata')
Output
66631d876eb92c4b69442013

Committing changes to the Arraylake Repository creates a new session, so status() shows an updated session ID with no currently modified paths:

repo_a.status()

🧊 Using repo earthmover/conflict-resolution
📟 Session 126c271294fe4076bcecd1ff1f589db2 started at 2024-06-07T14:47:35.600000
🌿 On branch main

No changes in current session

The commit printed above is now the base commit for this Repo. User A is viewing this version of the dataset. If we print User B's commit log we'll see that it is empty and User B is now behind User A.

repo_b.commit_log

    Commit diagrams

    The diagrams below illustrate the commit log from the perspective of each user, as well as the server (your Arraylake Repository). We'll use these GitGraphs to illustrate the state of the Repo as checked out by each user in the rest of this document. The labels next to each commit will refer to which user made the change, the commit's order in the commit sequence (C1,C2,...), and whether they modify chunk data objects [C] or metadata objects [M] or both [C, M]. For example, C1: UA[M] is the first commit C1; it is made by User A UA, and it modifies a metadata object [M].

    User A is up-to-date with the commit history recorded on the server, but User B is behind because they have not checked out the change committed by User A.

    In order to bring both users to the same starting point for this demo, User B needs to run checkout(). Checkout creates a new session and picks up any changes that were committed before User B's new session was created. Running .status() will show the ID of the new session.

    repo_b.checkout()
    Output
    66631d876eb92c4b69442013
    repo_b.status()

    🧊 Using repo earthmover/conflict-resolution
    📟 Session 9508966f646747148ac0c9cdb7243b58 started at 2024-06-07T14:47:40.048000
    🌿 On branch main

    No changes in current session

    Now, we can see that both Repos have the same base commit, meaning that both Repos are currently in the same state.

    repo_a.commit_log
    • Commit ID: 66631d876eb92c4b69442013
      Author: Emma Marshall <emma@earthmover.io>
      Date: 2024-06-07T14:50:08.318000

      User A added metadata

    repo_b.commit_log
    • Commit ID: 66631d876eb92c4b69442013
      Author: Emma Marshall <emma@earthmover.io>
      Date: 2024-06-07T14:50:08.318000

      User A added metadata

    Here are the above commit logs represented as commit diagrams:

    We could also verify that both users are working with the same data by viewing the metadata present in both Repos:

    repo_a.root_group.attrs.asdict()
    Output
    {'foo': 'bar'}
    repo_b.root_group.attrs.asdict()
    Output
    {'foo': 'bar'}

    Commit process and conflict resolution

    This section will demonstrate Users A and B both making edits to the Arraylake Repo, and examine how Arraylake handles commits and detects conflicts. First, let's take a closer look at Arraylake's internal processes when a user makes a commit.

    Commits in Arraylake

    When a commit is made, Arraylake executes two steps:

    1. Arraylake makes a record of the commit, creating a new commit_id (C2). The parent commit of the new commit will be C1.

    2. Arraylake will try to move the tip of the branch to the new commit, C2. This is only possible if the parent commit of C2 matches the base commit of the branch (that is, if C1 is the current tip of the branch). If a commit has been made by another user since this session began, the branch's base commit ID will not match C1, and Arraylake will attempt to resolve the conflict.

      • When a conflict is detected:
        If the base commits do not match, Arraylake will examine the paths of the modified objects to determine if the new changes can be safely rebased onto the existing branch tip.
    tip

    Read a more detailed description of the committing process in the Concepts section.
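To make the two commit steps concrete, here is a minimal pure-Python sketch of the idea. It is a toy model written for this tutorial, not Arraylake's implementation: the names ToyServer and ConflictError are hypothetical, and the only rule it encodes is the one described above (a commit from an out-of-date session is accepted only if its modified paths do not overlap with paths modified by newer commits).

```python
class ConflictError(ValueError):
    """Raised when a stale commit touches paths changed by newer commits."""


class ToyServer:
    """Toy stand-in for a Repo branch: commit records plus a single branch tip."""

    def __init__(self):
        self.commits = {}  # commit_id -> (parent_id, frozenset of modified paths)
        self.tip = None    # commit_id the branch currently points to

    def _paths_since(self, base):
        """Collect every path modified by commits made after `base`."""
        changed, cursor = set(), self.tip
        while cursor != base:
            parent, paths = self.commits[cursor]
            changed |= paths
            cursor = parent
        return changed

    def commit(self, base, paths):
        """Try to commit `paths` from a session that started at commit `base`."""
        paths = frozenset(paths)
        if base != self.tip:
            # The tip moved after this session began: rebase only if no path overlaps.
            overlap = sorted(paths & self._paths_since(base))
            if overlap:
                raise ConflictError(f"Conflicting paths found for rebase: {overlap}")
        commit_id = f"C{len(self.commits) + 1}"
        self.commits[commit_id] = (self.tip, paths)  # parent is the current tip
        self.tip = commit_id
        return commit_id


server = ToyServer()
c1 = server.commit(base=None, paths={"meta/root.group.json"})
c2 = server.commit(base=c1, paths={"meta/root.group.json"})          # up-to-date session
c3 = server.commit(base=c1, paths={"data/root/toy_data/ones/c0/0"})  # stale, disjoint paths

conflict_message = None
try:
    server.commit(base=c1, paths={"meta/root.group.json"})           # stale, same path as C2
except ConflictError as err:
    conflict_message = str(err)
print(server.tip, conflict_message)
```

In this sketch the third commit starts from an out-of-date base but is accepted, with C2 as its parent, because it touches a different path. The fourth is rejected because it modifies the same path as C2.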

    The flowchart below illustrates the sequence of multiple users interacting with the server. Creating a new session serves the current version of the Arraylake Repository to the user. In the diagram below, because User A commits their changes before User B, User B is now working in an out-of-date session. If they commit their changes, it may raise an error depending on whether their changes overlap with User A's.

    Multiple users make concurrent changes

    As described above, multiple users concurrently modifying data can lead to commit conflicts if users modify the same data. For example, assume User A makes a change (C2). User B then attempts to commit changes (C3) without running repo_b.checkout(). Now User B is working with an out-of-date view of the data: their session was created before User A's change (C2) was committed. When User B commits their change, Arraylake will see that User B's session does not have the new commit (C2) and that C3's parent commit_id (C1) doesn't match the Repo's base commit (C2).

    Here, there is a possibility of conflicts. Arraylake will use the 'optimistic concurrency' strategy (more about this here) to determine if it is safe to rebase the change we are trying to commit (C3) on User A's commit (C2). In this context, rebasing is similar to the git rebase operation, where one branch is appended to the tip (latest commit) of another branch. The following examples will demonstrate this scenario and show situations where a conflict can and cannot be automatically resolved.

    Example 1: Two changes that don't conflict

    First, we'll attempt to make a change that doesn't conflict. User A will modify metadata attributes. User B will create a new Zarr array and write that data to the Repo.

    note

    When creating the Zarr array, User B passes a chunks argument. This divides the dataset into chunks with length 2 along the x and y dimensions, which will be used in a later section.

    Remember, both users' Repos have the same view of the commit history:

    We can verify this by checking the most recent commit ID seen by each user and verifying that they are identical:

    repo_a.commit_log.commit_id == repo_b.commit_log.commit_id
    Output
    True

    Now User A makes a metadata change:

    repo_a.root_group.attrs['a new key'] = 'a new val' 

    Use repo_a.status() to see the object modified by this change (meta/root.group.json):

    repo_a.status()

    🧊 Using repo earthmover/conflict-resolution
    📟 Session 126c271294fe4076bcecd1ff1f589db2 started at 2024-06-07T14:47:35.600000
    🌿 On branch main

    paths modified in session

    • 📝 meta/root.group.json

    User A commits their change:

    repo_a.commit(message='User A made a metadata update') #C2
    Output
    66631d99ad09014d211339be

    User B is now behind.

    Next, User B will write array data to the Repo:

    #create group
    repo_b.root_group.create_group('toy_data/')

    #create data
    z_arr = zarr.array(np.ones((3, 3)), chunks=(2, 2))

    #write data to Zarr store as User B
    repo_b.root_group.toy_data['ones'] = z_arr

    Take a look at the paths that were modified:

    repo_b.status()

    🧊 Using repo earthmover/conflict-resolution
    📟 Session 9508966f646747148ac0c9cdb7243b58 started at 2024-06-07T14:47:40.048000
    🌿 On branch main

    paths modified in session

    • 📝 meta/root/toy_data.group.json
    • 📝 data/root/toy_data/ones/c0/0
    • 📝 data/root/toy_data/ones/c0/1
    • 📝 data/root/toy_data/ones/c1/1
    • 📝 meta/root/toy_data/ones.array.json
    • 📝 data/root/toy_data/ones/c1/0
    • 📝 data/root/toy_data/ones

    We can see each of the chunks that User B wrote to the Zarr store. We also see that creating a group (toy_data) and writing the ones array created the associated metadata JSON files (toy_data.group.json and ones.array.json). .status() also shows us that the path modified by User A (meta/root.group.json) is not in the list of paths modified by User B, meaning that none of User B's changes overlap with User A's modification.
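As a rough check on the chunk paths listed above, the sketch below enumerates the chunk grid for a (3, 3) array stored with chunks=(2, 2). The helper chunk_keys is hypothetical (it is not part of Arraylake or Zarr), and its key format simply mirrors the c<i>/<j> names printed by status() in this tutorial.

```python
import math
from itertools import product


def chunk_keys(shape, chunks):
    """List chunk keys ('c<i>/<j>') for an array of `shape` split into `chunks`."""
    # Number of chunks along each axis; the last chunk may be only partly filled.
    grid = [math.ceil(n / c) for n, c in zip(shape, chunks)]
    return ["c" + "/".join(map(str, idx)) for idx in product(*(range(g) for g in grid))]


keys = chunk_keys(shape=(3, 3), chunks=(2, 2))
print(keys)  # ['c0/0', 'c0/1', 'c1/0', 'c1/1']

# Paths copied from the status() outputs above.
user_a_paths = {"meta/root.group.json"}
user_b_paths = {f"data/root/toy_data/ones/{key}" for key in keys} | {
    "meta/root/toy_data.group.json",
    "meta/root/toy_data/ones.array.json",
}
print(user_a_paths.isdisjoint(user_b_paths))  # True
```

Three rows split into chunks of length 2 need two chunks per axis, which is why a 3 x 3 array produces four chunk objects. None of them share a path with User A's change.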

    User B commits their change:

    repo_b_chunk_commit = repo_b.commit(message='User B wrote chunk data to repo') #C3

    The above commit succeeds even though User B's session is out-of-sync with the changes committed by User A because Users A and B modified different objects: metadata (root.group.json) in the case of User A (C2) and chunk data and associated metadata in the case of User B (C3). This means that User B's changes (C3) can be safely replayed on top of User A's commit (C2), and the commit operation succeeds. In this example, the rebase creates a new commit, whose ID is returned by repo_b.commit() above. Keep in mind that User A is now behind User B.

    note

    While Arraylake is able to resolve these commits, doing so can be a computationally intensive process. If you are scaling your Arraylake workflow to include tasks distributed across multiple workers, it is better to consider 'Cooperative Concurrency mode' than to rely on optimistic concurrency conflict resolution.

    Example 2: A change that conflicts

    What about the case of two users making overlapping changes?

    We'll first bring both users up to speed:

    repo_a.checkout(); repo_b.checkout();

    Then, User A modifies metadata at the group level in the toy_data group created by User B.

    repo_a.root_group['toy_data'].attrs['toy_key'] = 'toy_val'

    Use repo.status() to see the path of User A's changes.

    repo_a.status()

    🧊 Using repo earthmover/conflict-resolution
    📟 Session 891980a9e6a2490daaeeef901c3970d7 started at 2024-06-07T14:48:08.262000
    🌿 On branch main

    paths modified in session

    • 📝 meta/root/toy_data.group.json
    repo_a.commit(message='User A modified toy_data metadata') #C4
    Output
    66631daa6eb92c4b69442023

    Now, User B makes an overlapping change:

    repo_b.root_group['toy_data'].attrs['another toy key'] = 'another toy val'
    repo_b.status()

    🧊 Using repo earthmover/conflict-resolution
    📟 Session 4a0baab8cd6b4ade97cca4fd03686dcc started at 2024-06-07T14:48:08.570000
    🌿 On branch main

    paths modified in session

    • 📝 meta/root/toy_data.group.json

    Note that the path is identical to User A's changes above. Because User B modified the same path as User A without starting an up-to-date session, the commit will fail:

    repo_b.commit(message= 'User B added attr to metadata') #C5

    HTTPStatusError: Client error '422 Unprocessable Entity' for url 'https://api.earthmover.io/repos/earthmover/conflict-resolution/rebase?session_id=4a0baab8cd6b4ade97cca4fd03686dcc&branch_name=main&base_commit=66631da546079a62f62c6e1a' For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/422

    During handling of the above exception, another exception occurred:

    ValueError: Conflicting paths found for rebase 66631da546079a62f62c6e1a->main, including ['meta/root/toy_data.group.json']

    The error message thrown by the unsuccessful commit also tells us why the error occurred. You can see that Arraylake attempted a rebase but because both changes modified the object ['meta/root/toy_data.group.json'], the commit was unsuccessful.

    Visually, here are both users' views of the commit history. Note that User B's attempted commit, C5, is not present in either view:

    In this situation, Arraylake uses the 'optimistic concurrency' strategy to determine if it is safe to rebase the change we are trying to commit (C5) on User A's commit (C4). This commit failed because both commits modify the same object (the toy_data group's attrs), so C5 cannot be safely rebased on C4, and the conflict can't be resolved automatically. Similarly, if both commits modify the same chunk data, the conflict resolution will also fail.
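Users A and B set different attribute keys, so it may be surprising that their changes conflict. The sketch below illustrates why comparing whole objects is the safe choice. It is an illustration only: the "attributes" layout of the JSON document is an assumption (this tutorial does not show the contents of toy_data.group.json), and set_attr is a hypothetical helper.

```python
import json

# Assumed (hypothetical) content of meta/root/toy_data.group.json at commit C3.
base_doc = json.dumps({"attributes": {}})


def set_attr(document, key, value):
    """Return a new version of the whole JSON document with one attribute set."""
    doc = json.loads(document)
    doc["attributes"][key] = value
    return json.dumps(doc, sort_keys=True)


# Each session starts from the same base and writes a complete new document.
doc_a = set_attr(base_doc, "toy_key", "toy_val")
doc_b = set_attr(base_doc, "another toy key", "another toy val")

# If User B's document simply replaced User A's, User A's attribute would be lost.
naive_result = json.loads(doc_b)["attributes"]
print("toy_key" in naive_result)  # False

# Re-applying User B's edit on top of User A's version keeps both attributes.
merged = json.loads(set_attr(doc_a, "another toy key", "another toy val"))["attributes"]
print(sorted(merged))  # ['another toy key', 'toy_key']
```

Under these assumptions, rejecting the second commit prevents a silent lost update. User B can redo the edit from an up-to-date session, which corresponds to the second half of the sketch.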

    Independent chunk writes with multiple writers

    The concept of non-conflicting changes applies to chunk objects too. Concurrent users can modify the same dataset as long as they do not write changes to the same chunks. Understanding which writes will create conflicts and how to avoid them is important to scaling workflows with Arraylake.

    tip

    The discussion of 'When to use which mode' in the Scaling with Arraylake tutorial is a helpful resource.

    Here, we will use the same chunked dataset we created in the previous example. We'll then simulate User A making a change to the chunk data.

    Start with both users up to speed:

    repo_a.checkout(); repo_b.checkout();

    Great, both sessions should be up-to-date with no current changes.

    #read array data we created earlier
    ones_arr = repo_a.root_group.toy_data['ones']
    np.array(ones_arr)
    Output
    array([[1., 1., 1.],
           [1., 1., 1.],
           [1., 1., 1.]])

    Now we'll have User A modify a subset of the dataset that only spans one chunk.

    ones_arr[0, slice(0,2)] = 10
    np.array(ones_arr)
    Output
    array([[10., 10.,  1.],
           [ 1.,  1.,  1.],
           [ 1.,  1.,  1.]])

    Run status() to see the changes made by User A during this session. Note that the changes only modified one chunk object.

    repo_a.status()

    🧊 Using repo earthmover/conflict-resolution
    📟 Session eea88452639f46e09dd78e4019a3702e started at 2024-06-07T14:48:10.761000
    🌿 On branch main

    paths modified in session

    • 📝 data/root/toy_data/ones/c0/0

    Now commit them:

    repo_a.commit('User A modified 1 chunk') #C5 (the original C5 never succeeded)
    Output
    66631db946079a62f62c6e1b

    User views of commit history are:

    User B now makes a change to the same array but modifies a different chunk:

    array = repo_b.root_group["toy_data/ones"]
    array[-1, -1] = 3
    np.array(array)
    Output
    array([[1., 1., 1.],
           [1., 1., 1.],
           [1., 1., 3.]])
    repo_b.status()

    🧊 Using repo earthmover/conflict-resolution
    📟 Session 932667bb7aeb491291aebf1b000bc01e started at 2024-06-07T14:48:18.165000
    🌿 On branch main

    paths modified in session

    • 📝 data/root/toy_data/ones/c1/1

    Notice that User B modified chunk ones/c1/1 and User A modified ones/c0/0. This means that both users can safely commit their changes independently, and Arraylake will verify that the changes don't overlap.

    repo_b.commit('User B modified chunk data') #C6
    Output
    66631dc846079a62f62c6e2f

    Cool! Even though User B wasn't working on a current copy of the dataset when they pushed their modification to chunk data, Arraylake was able to verify that the commits C5 (from User A) and C6 (from User B) did not modify overlapping chunks. Because this conflict could be safely resolved, Arraylake accepted User B's commit and modified User B's commit history to include User A's commit.
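To predict which chunk objects a write will touch before committing, the following NumPy sketch maps an index selection to chunk keys. The helper touched_chunks is hypothetical (it is not an Arraylake or Zarr function), it assumes basic integer and slice indexing, and its key format mirrors the paths printed by status() above.

```python
import numpy as np


def touched_chunks(selection, shape, chunks):
    """Return the set of chunk keys that a write to `selection` would modify."""
    coords = np.indices(shape)                        # coordinate grid, shape (ndim, *shape)
    selected = coords[(slice(None),) + tuple(selection)]
    # Integer-divide each selected coordinate by the chunk length along its axis.
    chunk_idx = selected.reshape(len(shape), -1) // np.array(chunks)[:, None]
    return {"c" + "/".join(str(i) for i in column) for column in chunk_idx.T}


shape, chunks = (3, 3), (2, 2)
user_a = touched_chunks((0, slice(0, 2)), shape, chunks)  # ones_arr[0, slice(0,2)] = 10
user_b = touched_chunks((-1, -1), shape, chunks)          # array[-1, -1] = 3
print(user_a, user_b, user_a.isdisjoint(user_b))

# A write across the whole first row would also touch c0/0, overlapping User A's change.
whole_row = touched_chunks((0, slice(None)), shape, chunks)
print(sorted(whole_row))
```

With the selections used in this section, User A's write falls entirely in c0/0 and User B's in c1/1, so the two sets are disjoint. A write to the full first row would touch c0/0 and c0/1, and would therefore conflict with User A's commit.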

    Note that User A is now one commit behind User B:

    Taking a look at the array data, we can see that User B's current version includes the modifications from both User A and User B.

    np.array(repo_b.root_group['toy_data/ones'])
    Output
    array([[10., 10.,  1.],
           [ 1.,  1.,  1.],
           [ 1.,  1.,  3.]])

    User A's view doesn't include User B's changes:

    np.array(repo_a.root_group['toy_data/ones'])
    Output
    array([[10., 10.,  1.],
           [ 1.,  1.,  1.],
           [ 1.,  1.,  1.]])

    But if we run .checkout() it will:

    repo_a.checkout()
    np.array(repo_a.root_group['toy_data/ones'])
    Output
    array([[10., 10.,  1.],
           [ 1.,  1.,  1.],
           [ 1.,  1.,  3.]])

    Now both users have complete views of the commit history.

    Summary

    This page focused on situations that arise when multiple users write to the same Arraylake Repository. We discussed how Arraylake's version control system tracks concurrent changes, detects possible conflicts, and resolves them where possible. The examples showed when users who concurrently modify a Repository can both commit their changes successfully and when they cannot:

    • Commits are successful when:
      • One user edits metadata while another edits chunk data
      • Both users modify chunk data, but their commits modify non-overlapping chunks in the dataset.
    • Commits are unsuccessful when both users attempt to commit changes to the same object. In this case the first committer "wins" and successfully commits, while the second user receives an error informing them of the conflict.

    We also illustrated how Arraylake methods such as .status() and .checkout() are helpful for viewing information about a user's session and synchronizing between a user and the main Arraylake Repository.

    client.delete_repo("earthmover/conflict-resolution", imsure=True, imreallysure=True)