Best Practices

Scope of Repos and Commits

How much data should I store in one repo?

A repo is a Zarr group, and you can put whatever you want in it. At one extreme, you could put all of your organization's data in a single repo. At the other extreme, you could have hundreds of repos, each containing just one array. How do you decide?

The important principle is that all data within a repo share a single commit history. Transactions are scoped to the repo level. So you should keep together in one repo data that need to be updated in a coordinated way and tracked with a single version history. Beyond this, we recommend that repos should be as small as possible, for simplicity. Data that are not related or interdependent should be kept in separate repos.

For example, when ingesting data from a weather prediction model, all the variables from the model (e.g. temperature, humidity, wind speed) are generated by the same underlying process and are physically related and interdependent. So all of these data should be versioned together in a single repo. However, if you are ingesting data from multiple different weather prediction models, we recommend keeping each model in a separate repo.
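The grouping rule above can be sketched in a few lines of plain Python. This is only an illustration of the planning step, not Arraylake API usage: the `datasets` catalog, the `plan_repos` helper, the `my-org` prefix, and the model names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical catalog of incoming data: (model name, variable name).
datasets = [
    ("model-a", "temperature"),
    ("model-a", "humidity"),
    ("model-a", "wind_speed"),
    ("model-b", "temperature"),
    ("model-b", "humidity"),
]


def plan_repos(datasets, org="my-org"):
    """Group variables by model so that each model maps to one repo name."""
    repos = defaultdict(list)
    for model, variable in datasets:
        # Variables from the same model are interdependent, so they share
        # a repo (and therefore a single commit history).
        repos[f"{org}/{model}"].append(variable)
    return dict(repos)


repo_plan = plan_repos(datasets)
for repo_name, variables in repo_plan.items():
    print(repo_name, variables)
```

Each key in `repo_plan` stands for one repo with its own version history. Variables from different models never share a repo.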

How big should commits be?

Commits should represent a significant update to the state of the repo which occurs as part of a single "job." For example, if you maintain a nightly cron job which ingests data into an Arraylake repo, the entire job should occur within a single session and make a single commit. Avoid making many small commits within a single job. Likewise, when updating an array in a planned, coordinated way, all of the updates should be part of a single session and make a single commit. To achieve this, the writing process should coordinate its writes to align with chunk boundaries (see "cooperative mode" in Concurrency Modes for more details).
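As a rough sketch of what aligning writes with chunk boundaries can look like, the helper below splits one array axis among several writers so that no two writers touch the same chunk. It is plain Python and does not call Arraylake. The function name `chunk_aligned_slices` and its parameters are assumptions made for illustration.

```python
import math


def chunk_aligned_slices(length, chunk_size, n_writers):
    """Split range(length) into at most n_writers contiguous slices whose
    internal boundaries fall on multiples of chunk_size."""
    if length <= 0 or chunk_size <= 0 or n_writers <= 0:
        raise ValueError("length, chunk_size and n_writers must be positive")
    n_chunks = math.ceil(length / chunk_size)
    chunks_per_writer = math.ceil(n_chunks / n_writers)
    slices = []
    for first_chunk in range(0, n_chunks, chunks_per_writer):
        start = first_chunk * chunk_size
        # The last slice may end mid-chunk if length is not a multiple
        # of chunk_size; it is clipped to the array length.
        stop = min((first_chunk + chunks_per_writer) * chunk_size, length)
        slices.append(slice(start, stop))
    return slices


# Example: an axis of 1000 elements, stored in chunks of 100, 3 writers.
slices = chunk_aligned_slices(length=1000, chunk_size=100, n_writers=3)
print(slices)
```

Each writer would then write only its own slice, and the job would make one commit once all writers have finished, rather than one commit per writer.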

Distinct jobs managed by separate individuals should not share a session and should make separate commits.