Managing Storage
As a cloud-native data management platform, Arraylake stores all data in cloud object storage. See Storage in the Concepts section to learn more about how Arraylake works.
An Arraylake BucketConfig houses the settings for one or more object storage locations.
Each BucketConfig holds the configuration (e.g. object store bucket name, key prefix, access credentials, etc.) that enables the Arraylake client and services to read and write array data.
All the Repositories that use a given bucket config are safely isolated from each other. A BucketConfig allows organizations to easily and securely manage storage configuration that is shared between repositories. An organization may have multiple bucket configurations.
Important properties for BucketConfigs include:
- nickname: a nickname for easy referencing in code, on the command line, and on the web
- platform: the object storage provider
- auth_config: additional authorization inputs for the storage backend
- name: the name of the bucket in object storage where chunks will be stored
- prefix: an optional key prefix
- extra_config: additional configuration options for the storage backend (e.g. endpoint_url)
For convenience, the Arraylake Python client also supports expressing the platform, name, and prefix components in URI form (i.e. {platform}://{name}[/{prefix}]). Each BucketConfig must have a unique URI. Arraylake will not allow you to create two BucketConfigs pointing to the same location in object storage.
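As a rough illustration of the URI form described above, the following sketch splits a bucket URI into its platform, name, and prefix parts and joins them back together. The helper names `split_bucket_uri` and `join_bucket_uri` are hypothetical and are not part of the Arraylake client; the sketch only uses the Python standard library.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class BucketLocation(NamedTuple):
    """The three components encoded in a {platform}://{name}[/{prefix}] URI."""
    platform: str
    name: str
    prefix: Optional[str]


def split_bucket_uri(uri: str) -> BucketLocation:
    """Split a bucket URI into platform, bucket name, and optional prefix.

    Hypothetical helper for illustration; not an Arraylake API.
    """
    parts = urlsplit(uri)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"expected {{platform}}://{{name}}[/{{prefix}}], got {uri!r}")
    # The prefix is optional, so an empty path maps to None.
    prefix = parts.path.strip("/") or None
    return BucketLocation(parts.scheme, parts.netloc, prefix)


def join_bucket_uri(platform: str, name: str, prefix: Optional[str] = None) -> str:
    """Build the URI form from its components."""
    uri = f"{platform}://{name}"
    if prefix:
        uri += "/" + prefix.strip("/")
    return uri
```

For example, `split_bucket_uri("s3://my-production-data")` yields the platform `s3`, the name `my-production-data`, and no prefix.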
You can modify a BucketConfig as much as you like so long as there are no Repos that rely on it. Once it is in use by one or more Repos, only the BucketConfig's nickname can be modified.
Create a BucketConfig
If you're just getting started, you probably only need one bucket config for your entire organization (see Organizations, Users, and Access Management for more detail). For the purposes of this example, our org name will be earthmover. If running these commands interactively, replace earthmover with your org name.
For this example, we are going to create a bucket config nicknamed production
to hold the chunks for all the repositories with production-quality datasets.
- Web App
- Python
- Python (asyncio)

The create bucket dialog.
In the web app, Org Admins can add BucketConfigs by clicking on the "Add Bucket" button in the Buckets section of the Organization Settings page (see below).
# Synchronous client
from arraylake import Client

client = Client()
client.create_bucket_config(
    org="earthmover",
    nickname="production",
    uri="s3://my-production-data",
    extra_config={'region_name': 'us-east-1'}
)

# Asyncio client: run inside an async context (e.g. a notebook or a coroutine)
from arraylake import AsyncClient

aclient = AsyncClient()
await aclient.create_bucket_config(
    org="earthmover",
    nickname="production",
    uri="s3://my-production-data",
    extra_config={'region_name': 'us-east-1'}
)
Object store credentials can be managed in three ways: self-managed credentials, role-based access delegation, and hash-based message authentication codes (HMAC).
By default, self-managed authorization is used for object storage access. Under this configuration, Arraylake does not manage bucket credentials, and the client environment must be configured with appropriate permissions to read (and, if desired, write) to this bucket.
For all platforms, Arraylake can store HMAC keys for bucket access ("hmac"). Under this bucket configuration, Arraylake stores the access key ID and secret access key needed to access the chunkstore bucket.
For AWS S3 or Google Cloud, a user-managed IAM role or service account can be used to manage bucket access ("role-based access delegation"). Under this bucket authorization configuration, the user grants the Earthmover authorization service permission to assume that role (or impersonate that service account) and generate temporary, scoped credentials to access the object store bucket.
To use role-based access delegation for object store credential management, chunkstore.use_delegated_credentials must be set to True in the config. This option is True by default. To disable delegated credentials, set chunkstore.use_delegated_credentials to False in the config.
Instructions on how to configure these types of credentials are below.
Configuring Self-Managed Credentials
Under this bucket credential configuration, object store credentials are managed solely by the user. This can be done in many ways:
- Environment variables: Object store credentials can be set via environment variables. For AWS S3, that means setting both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. See the Storage Integrations page for object-store specific examples.
- Arraylake config: Object store credentials can be set via the Arraylake config. Below is an example for AWS S3. See the Storage Integrations page for object-store specific examples.

service:
  uri: https://api.earthmover.io
s3:
  verify: False
  aws_access_key_id: ...
  aws_secret_access_key: ...

- Native service configs: Object store credentials can be set using native service configurations, such as using the aws CLI to configure your AWS credentials config or the gcloud CLI to configure Google Cloud credentials. See the Storage Integrations page for object-store specific examples.
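For the environment-variable option above, a quick pre-flight check can catch a missing variable before any read or write is attempted. This is a minimal sketch: `missing_aws_env_vars` is a hypothetical helper, not part of Arraylake, and it only inspects the two variables named above. Credentials supplied through the Arraylake config or a native service config would not be detected by it.

```python
import os
from typing import List, Mapping

# The two variables the text above names for AWS S3.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required AWS variables that are unset or empty in `env`.

    Hypothetical helper for illustration; it does not validate the
    credentials themselves, only that the variables are present.
    """
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```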
Configuring HMAC
Under this bucket credential configuration, credentials are stored directly in Arraylake. This works for all S3-compatible object storage services, including GCS (see Configuring GCS Buckets with HMAC Credentials). The credentials provided to Arraylake must have access to the object store bucket associated with the bucket config.
To configure Arraylake to use HMAC credentials, create a BucketConfig for your bucket, setting the credential type to "hmac" and passing in the access_key_id and secret_access_key HMAC credentials. These keys must be able to access the object store bucket.
- Web App
- Python
- Python (asyncio)

The create bucket dialog using HMAC access delegation.
# Synchronous client
from arraylake import Client

client = Client()
client.create_bucket_config(
    org="earthmover",
    nickname="production",
    uri="s3://my-production-data",
    extra_config={'region_name': 'us-east-1'},
    auth_config={
        "method": "hmac",
        "access_key_id": "my-access-key-id",
        "secret_access_key": "my-secret-access-key"
    }
)

# Asyncio client: run inside an async context
from arraylake import AsyncClient

aclient = AsyncClient()
await aclient.create_bucket_config(
    org="earthmover",
    nickname="production",
    uri="s3://my-production-data",
    extra_config={'region_name': 'us-east-1'},
    auth_config={
        "method": "hmac",
        "access_key_id": "my-access-key-id",
        "secret_access_key": "my-secret-access-key"
    }
)
Configuring GCS Buckets with HMAC Credentials
First, set up a set of HMAC keys following this tutorial from Google Cloud. Note that in Google Cloud, HMAC keys are used for interoperability (e.g. compatibility with Amazon S3) and are not supported by GCS-native storage libraries.
To configure a GCS bucket to use HMAC credentials in Arraylake, you must set the bucket platform to S3 Compatible instead of Google Cloud. You must also set the credential type to HMAC, pass in the access_key_id and secret_access_key HMAC credentials generated in the above step, and set the endpoint URL in the extra configuration to https://storage.googleapis.com. The bucket will be accessed through S3 APIs using the provided HMAC keys.
- Web App
- Python
- Python (asyncio)

The create bucket dialog using HMAC access delegation for GCS buckets.
# Synchronous client
from arraylake import Client

client = Client()
client.create_bucket_config(
    org="earthmover",
    nickname="gcs-hmac-bucket",
    uri="s3://my-gcs-bucket",  # use s3:// in place of gs://
    extra_config={'endpoint_url': 'https://storage.googleapis.com'},
    auth_config={
        "method": "hmac",
        "access_key_id": "my-access-key-id",
        "secret_access_key": "my-secret-access-key"
    }
)

# Asyncio client: run inside an async context
from arraylake import AsyncClient

aclient = AsyncClient()
await aclient.create_bucket_config(
    org="earthmover",
    nickname="gcs-hmac-bucket",
    uri="s3://my-gcs-bucket",  # use s3:// in place of gs://
    extra_config={'endpoint_url': 'https://storage.googleapis.com'},
    auth_config={
        "method": "hmac",
        "access_key_id": "my-access-key-id",
        "secret_access_key": "my-secret-access-key"
    }
)
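The three GCS-specific adjustments (an s3:// URI in place of gs://, the Google endpoint URL, and the hmac method) are easy to get partly wrong, so one option is to assemble the arguments in a single place. The sketch below is an assumption-laden illustration: `gcs_hmac_bucket_kwargs` is a hypothetical helper, and it only builds the keyword arguments shown in the example above without calling Arraylake.

```python
from typing import Any, Dict

# Endpoint given in the text above for S3-compatible access to GCS.
GCS_S3_ENDPOINT = "https://storage.googleapis.com"


def gcs_hmac_bucket_kwargs(
    org: str,
    nickname: str,
    gcs_uri: str,
    access_key_id: str,
    secret_access_key: str,
) -> Dict[str, Any]:
    """Build create_bucket_config keyword arguments for a GCS bucket using HMAC keys.

    Hypothetical helper for illustration; not part of the Arraylake client.
    """
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"expected a gs:// URI, got {gcs_uri!r}")
    return {
        "org": org,
        "nickname": nickname,
        # The bucket is addressed as S3 Compatible, so swap gs:// for s3://.
        "uri": "s3://" + gcs_uri[len("gs://"):],
        "extra_config": {"endpoint_url": GCS_S3_ENDPOINT},
        "auth_config": {
            "method": "hmac",
            "access_key_id": access_key_id,
            "secret_access_key": secret_access_key,
        },
    }
```

The resulting dictionary could then be passed along as `client.create_bucket_config(**kwargs)`, mirroring the call shown above.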
Configuring Role-Based Access Delegation
Under this bucket credential configuration, Earthmover is given limited access to the AWS or Google Cloud account that owns the bucket, for the purpose of generating temporary, scoped credentials for bucket access. This works for AWS S3 buckets (see AWS S3 Buckets) and Google Cloud Storage buckets (see Google Cloud Storage Buckets).
AWS S3 Buckets
To use role-based access delegation, you must configure an AWS IAM role with appropriate permissions to the S3 bucket used to store your Arraylake data and grant Earthmover’s authorization service the ability to assume the role.
After assuming this role, the Earthmover authorization service can generate temporary, scoped credentials to access this S3 bucket.
These credentials have a lifetime of 1 hour and will automatically refresh when expired.
IAM Policy
Earthmover requires the following actions on the S3 bucket:
s3:ListBucket
s3:GetBucketLocation
s3:GetBucketNotification
s3:PutBucketNotification
s3:GetObject
s3:PutObject
s3:PutObjectAcl
s3:DeleteObject
s3:AbortMultipartUpload
Create an IAM policy using the following permissions JSON template (substituting <my-arraylake-bucket> with your S3 bucket) to grant these actions. This policy will be attached to the IAM role in the next step.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetBucketNotification",
        "s3:PutBucketNotification"
      ],
      "Resource": [
        "arn:aws:s3:::<my-arraylake-bucket>"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObjectAcl",
        "s3:AbortMultipartUpload"
      ],
      "Resource": [
        "arn:aws:s3:::<my-arraylake-bucket>/*"
      ]
    }
  ]
}
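If you manage several buckets, filling in the template by hand can be error-prone. The following sketch generates the same policy document for a given bucket name. The function name `build_bucket_policy` is hypothetical; the actions and resource ARNs are copied from the template above, with bucket-level actions scoped to the bucket ARN and object-level actions scoped to its objects.

```python
import json

# Bucket-level actions apply to the bucket ARN itself.
BUCKET_ACTIONS = [
    "s3:ListBucket",
    "s3:GetBucketLocation",
    "s3:GetBucketNotification",
    "s3:PutBucketNotification",
]

# Object-level actions apply to keys inside the bucket (the /* resource).
OBJECT_ACTIONS = [
    "s3:PutObject",
    "s3:GetObject",
    "s3:DeleteObject",
    "s3:PutObjectAcl",
    "s3:AbortMultipartUpload",
]


def build_bucket_policy(bucket: str) -> dict:
    """Return the IAM policy document from the template for `bucket`.

    Hypothetical helper for illustration only.
    """
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": BUCKET_ACTIONS, "Resource": [bucket_arn]},
            {"Effect": "Allow", "Action": OBJECT_ACTIONS, "Resource": [f"{bucket_arn}/*"]},
        ],
    }


if __name__ == "__main__":
    # Print JSON that can be pasted into the IAM policy editor.
    print(json.dumps(build_bucket_policy("my-arraylake-bucket"), indent=2))
```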
IAM Role
Step 1: Select trusted entity
Create a new AWS IAM role using the following settings:
- Trusted Entity Type: AWS Account
- AWS Account ID: 842143331303 (Earthmover's AWS account ID)
- External ID: <my-shared-secret> (any string you share only with Earthmover)
- MFA is not supported at this time
Step 2: Add permissions
Select the IAM policy created above to attach it to this role.
Step 3: Name, review, and create
- Role details: Create a meaningful role name and description.
- Select trusted entities: Copy the trust policy below into the role's Trust Policy to allow the Earthmover authorization service to assume this role and tag its sessions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::842143331303:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<my-shared-secret>"
        },
        "ArnLike": {
          "aws:PrincipalArn": "arn:aws:iam::842143331303:role/EarthmoverSignerServiceRole-production"
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::842143331303:root"
      },
      "Action": "sts:TagSession",
      "Condition": {
        "ArnLike": {
          "aws:PrincipalArn": "arn:aws:iam::842143331303:role/EarthmoverSignerServiceRole-production"
        }
      }
    }
  ]
}
- Tags: Add any tags that you want associated with this role.
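The trust policy differs between roles only in the external ID, so it can be generated rather than edited by hand. The sketch below is illustrative: `build_trust_policy` is a hypothetical helper, while the account ID and signer role ARN are the ones given in the trust policy above.

```python
import json

# Values taken from the trust policy shown above.
EARTHMOVER_ACCOUNT_ID = "842143331303"
EARTHMOVER_ROOT_ARN = f"arn:aws:iam::{EARTHMOVER_ACCOUNT_ID}:root"
SIGNER_ROLE_ARN = (
    f"arn:aws:iam::{EARTHMOVER_ACCOUNT_ID}:role/EarthmoverSignerServiceRole-production"
)


def build_trust_policy(external_id: str) -> dict:
    """Return the role trust policy with `external_id` as the shared secret.

    Hypothetical helper for illustration only.
    """
    if not external_id:
        raise ValueError("external_id (the shared secret) must be non-empty")
    principal = {"AWS": EARTHMOVER_ROOT_ARN}
    only_signer_role = {"aws:PrincipalArn": SIGNER_ROLE_ARN}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Assuming the role requires both the external ID and the signer role.
                "Effect": "Allow",
                "Principal": principal,
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"sts:ExternalId": external_id},
                    "ArnLike": only_signer_role,
                },
            },
            {
                # Session tagging is restricted to the signer role only.
                "Effect": "Allow",
                "Principal": principal,
                "Action": "sts:TagSession",
                "Condition": {"ArnLike": only_signer_role},
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_trust_policy("<my-shared-secret>"), indent=2))
```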
Configuring Arraylake to use an IAM Role
Once this IAM role has been created, you can create a BucketConfig for your bucket, setting the credential type to "Role-based access delegation" in the web app or aws_customer_managed_role in Python, and passing your AWS account ID (external_customer_id), IAM role name (external_role_name), and shared secret (shared_secret) in the auth_config:
- Web App
- Python
- Python (asyncio)

The create bucket dialog using role-based access delegation for AWS S3.
# Synchronous client
from arraylake import Client

client = Client()
client.create_bucket_config(
    org="earthmover",
    nickname="production",
    uri="s3://my-production-data",
    extra_config={'region_name': 'us-east-1'},
    auth_config={
        "method": "aws_customer_managed_role",
        "external_customer_id": "my-aws-account-id",
        "external_role_name": "my-iam-role-name",
        "shared_secret": "my-shared-secret",
    }
)

# Asyncio client: run inside an async context
from arraylake import AsyncClient

aclient = AsyncClient()
await aclient.create_bucket_config(
    org="earthmover",
    nickname="production",
    uri="s3://my-production-data",
    extra_config={'region_name': 'us-east-1'},
    auth_config={
        "method": "aws_customer_managed_role",
        "external_customer_id": "my-aws-account-id",
        "external_role_name": "my-iam-role-name",
        "shared_secret": "my-shared-secret",
    }
)
That's it! You can now start interacting with your bucket via Arraylake, which will manage S3 credentials via this IAM role.
Google Cloud Storage Buckets
For Google Cloud buckets, cross-account service account impersonation is used to give Earthmover's authorization service account temporary, scoped access to the GCS bucket.
To enable service account impersonation, you must configure a service account with appropriate permissions to the GCS bucket used to store your Arraylake data and grant Earthmover’s authorization service the ability to impersonate the account.
After impersonating this service account, the Earthmover authorization service account can generate temporary, scoped credentials to access this GCS bucket.
These credentials have a lifetime of 1 hour and will not refresh automatically.
Step 1: Create a Service Account
- Go to the Google Cloud Console.
- Navigate to the IAM & Admin > Service accounts page.
- Click the Create Service Account button.
- Provide a name and ID for the service account, and click Create.
- Under Grant this service account access to project, grant the Service Account Token Creator role in the project. This will allow the service account to retrieve credentials from the impersonated account.
- Click Done to create the service account.
Step 2: Grant Earthmover Impersonation Access to the Service Account
- Navigate to the IAM & Admin > Service accounts page and click on the service account created above.
- Under the Permissions tab, click Grant Access.
- Add the earthmover-signer-service@arraylake.iam.gserviceaccount.com service account as a principal and grant it the Service Account Token Creator role in the project.
- Click Save.
Step 3: Grant Service Account Access to the Bucket
- Navigate to the Cloud Storage > Buckets page and click on the bucket you wish to give the service account access to. This bucket should correspond to the bucket config that you want to grant Arraylake access to.
- Under the Permissions tab, click Grant Access.
- Add the service account created in Step 1 (e.g. my-al-service-account@my-project.iam.gserviceaccount.com) as a principal and grant it the Storage Object User role on the bucket.
- Click Save.
Configuring Arraylake to use a Google Service Account
Once you have created a service account with the appropriate permissions, you can create a BucketConfig for your bucket, setting the credential type to "Role-based access delegation" in the web app or gcp_customer_managed_role in Python, and passing your service account (target_service_account) in the auth_config:
- Web App
- Python
- Python (asyncio)

The create bucket dialog using role-based access delegation for Google Cloud Storage.
from arraylake import Client

client = Client()
client.create_bucket_config(
    org="earthmover",
    nickname="production",
    uri="gs://my-production-data",
    extra_config={TODO},
    auth_config={
        "method": "gcp_customer_managed_role",
        "target_service_account": "my-service-account@my-project.iam.gserviceaccount.com"
    }
)

from arraylake import AsyncClient

aclient = AsyncClient()
await aclient.create_bucket_config(
    org="earthmover",
    nickname="production",
    uri="gs://my-production-data",
    extra_config={TODO},
    auth_config={
        "method": "gcp_customer_managed_role",
        "target_service_account": "my-service-account@my-project.iam.gserviceaccount.com"
    }
)
List BucketConfigs
You can list the BucketConfigs associated with an organization.
- Web App
- Python
- Python (asyncio)

The organization buckets list
In the web app, users can view existing buckets for their org from the org buckets section (app.earthmover.io/[orgname]/buckets).
The default BucketConfig is marked with a purple badge.

The organization settings buckets section
Org Admins can also access and manage the buckets list via the Buckets section on the Organization Settings page (app.earthmover.io/[orgname]/settings).
Clicking on the gear icon in the bucket's row will bring up the settings dialog for that bucket.
client.list_bucket_configs("earthmover")
In the Python client, the default BucketConfig is indicated by the is_default flag on the resulting list item.
await aclient.list_bucket_configs("earthmover")
In the Python client, the default BucketConfig is indicated by the is_default flag on the resulting list item.
Delete a BucketConfig
Finally, we can delete a BucketConfig.
A BucketConfig cannot be deleted while it is in use by any Repo. Deleting a BucketConfig cannot be undone, so use this operation carefully.
- Web App
- Python
- Python (asyncio)

The bucket settings dialog
In the web app, a BucketConfig can be deleted from the bucket's settings dialog.
client.delete_bucket_config(
    org="earthmover",
    nickname="production",
    imsure=True, imreallysure=True
)

await aclient.delete_bucket_config(
    org="earthmover",
    nickname="production",
    imsure=True, imreallysure=True
)