Managing Storage

As a cloud-native data management platform, Arraylake stores all data in cloud object storage. See Storage in the Concepts section to learn more about how Arraylake works.

An Arraylake BucketConfig houses the settings for one or more object storage locations. Each BucketConfig holds the configuration (e.g. object store bucket name, key prefix, access credentials, etc.) that enables the Arraylake client and services to read and write array data.

All the Repositories that use a given bucket config are safely isolated from each other. A BucketConfig allows organizations to easily and securely manage storage configuration that is shared between repositories. An organization may have multiple bucket configurations.

Important properties for BucketConfigs include:

  • nickname: a nickname for easy referencing in code, on the command line, and on the web
  • platform: the object storage provider
  • auth_config: additional authorization inputs for the storage backend
  • name: the name of the bucket in object storage where chunks will be stored
  • prefix: an optional key prefix
  • extra_config: additional configuration options for the storage backend (e.g. endpoint_url)

For convenience, the Arraylake Python client also supports expressing the platform, name, and prefix in URI form (i.e. {platform}://{name}[/{prefix}]). Each BucketConfig must have a unique URI: Arraylake will not allow you to create two BucketConfigs pointing to the same location in object storage.
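The URI form can be illustrated with a small, self-contained sketch. The helper names bucket_uri and split_bucket_uri are hypothetical and are not part of the Arraylake client; they only show how platform, name, and the optional prefix map onto the {platform}://{name}[/{prefix}] pattern.

```python
from typing import Optional, Tuple


def bucket_uri(platform: str, name: str, prefix: Optional[str] = None) -> str:
    """Join platform, bucket name, and optional key prefix into a URI string."""
    uri = f"{platform}://{name}"
    if prefix:
        # Avoid doubled or trailing slashes around the prefix.
        uri += "/" + prefix.strip("/")
    return uri


def split_bucket_uri(uri: str) -> Tuple[str, str, Optional[str]]:
    """Split a URI back into (platform, name, prefix); prefix may be None."""
    platform, sep, rest = uri.partition("://")
    if not sep or not platform or not rest:
        raise ValueError(f"not a bucket URI: {uri!r}")
    name, _, prefix = rest.partition("/")
    return platform, name, prefix or None


# Example values only; substitute your own bucket name and prefix.
print(bucket_uri("s3", "my-arraylake-bucket", "production"))
```

Two configurations that produce the same string here would point at the same object storage location, which is the case Arraylake rejects.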

tip

You can modify a BucketConfig as much as you like so long as there are no Repos that rely on it. Once it is in use by one or more Repos, only the BucketConfig's nickname can be modified.

Create a BucketConfig

If you're just getting started, you probably only need one bucket config for your entire organization (see Organizations, Users, and Access Management for more detail). For the purposes of this example, our org name will be earthmover. If running these commands interactively, replace earthmover with your org name.

For this example, we are going to create a bucket config nicknamed production to hold the chunks for all the repositories with production-quality datasets.

Create a bucket in the webapp.

The create bucket dialog.

In the web app, Org Admins can add BucketConfigs by clicking on the "Add Bucket" button in the Buckets section of the Organization Settings page (see below).

Object store credentials can be managed in three ways: self-managed credentials, role-based access delegation, and hash-based message authentication codes (HMAC).

By default, self-managed authorization is used for object storage access. Under this configuration, Arraylake does not manage bucket credentials, and the client environment must be configured with appropriate permissions to read (and, if desired, write) to this bucket.

For all platforms, Arraylake can store HMAC keys for bucket access ("hmac"). Under this bucket configuration, Arraylake stores the access key ID and secret access key needed to access the chunkstore bucket.

For AWS S3, a user-managed IAM role can be used to manage bucket access ("role-based access delegation"). Under this bucket authorization configuration, the Earthmover authorization service is permitted to assume a user-managed AWS IAM role and generate temporary, scoped credentials to access the chunkstore bucket.

info

To use role-based access delegation for object store credential management, chunkstore.use_delegated_credentials must be set to True in the config. This option is True by default. To disable delegated credentials, set chunkstore.use_delegated_credentials to False in the config.

Instructions on how to configure each of the three credential types are below.

Configuring Self-Managed Credentials

Under this bucket credential configuration, object store credentials are managed solely by the user. This can be done in many ways:

  • Environment variables: Object store credentials can be set via environment variables. For AWS S3, that means setting both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. See the Storage Integrations page for object-store specific examples.
  • Arraylake config: Object store credentials can be set via the Arraylake config. Below is an example for AWS S3. See the Storage Integrations page for object-store specific examples.
    service:
      uri: https://api.earthmover.io
    s3:
      verify: False
      aws_access_key_id: ...
      aws_secret_access_key: ...
  • Native service configs: Object store credentials can be set using native service configurations, such as using the aws CLI to configure your AWS credentials config or the gcloud CLI to configure Google Cloud credentials. See the Storage Integrations page for object-store specific examples.
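Before opening a repo with self-managed credentials, it can help to confirm that the environment variables mentioned above are actually set. The following is a minimal sketch for the AWS S3 case; the function name missing_aws_credentials is hypothetical, and the check only looks at environment variables, not at the Arraylake config or native service configs.

```python
import os

# The two variables named above for AWS S3.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(environ=None):
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


missing = missing_aws_credentials()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("AWS credentials found in the environment.")
```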

Configuring HMAC

Under this bucket credential configuration, credentials are stored directly in Arraylake. This works for all S3-compatible object storage services, including GCS (see Configuring GCS Buckets with HMAC Credentials). The credentials provided to Arraylake must have access to the object store bucket associated with the bucket config.

To configure Arraylake to use HMAC credentials, create a BucketConfig for your bucket, setting the credential type to "hmac" and passing in the access_key_id and secret_access_key_id HMAC credentials. These keys must be able to access the object store bucket.

Create a bucket in the webapp using HMAC access delegation.

The create bucket dialog using HMAC access delegation.

Configuring GCS Buckets with HMAC Credentials

First, set up a set of HMAC keys following this tutorial from Google Cloud. Note that in Google Cloud, HMAC keys are used for interoperability (e.g. compatibility with Amazon S3) and are not supported by GCS-native storage libraries.

To configure a GCS bucket to use HMAC credentials in Arraylake, you must set the bucket platform to S3 Compatible instead of Google Cloud. You must also set the credential type to HMAC, pass in the access_key_id and secret_access_key_id HMAC credentials generated in the above step, and set the endpoint url in the extra configuration to https://storage.googleapis.com. The bucket will be accessed using S3 APIs using the provided HMAC keys.
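The settings listed above can be gathered in one place before entering them in the create bucket dialog. This is only a sketch: the function name and the dictionary layout are hypothetical and do not represent an Arraylake API. The values (the S3 Compatible platform, the HMAC credential type, the two key fields, and the endpoint URL) come from the paragraph above.

```python
GCS_S3_ENDPOINT = "https://storage.googleapis.com"


def gcs_hmac_settings(name, access_key_id, secret_access_key_id, prefix=None):
    """Collect the GCS-over-S3 settings into a plain dictionary.

    The key layout is illustrative only; it is not an Arraylake schema.
    """
    if not access_key_id or not secret_access_key_id:
        raise ValueError("both HMAC keys are required")
    return {
        "platform": "S3 Compatible",  # not "Google Cloud"
        "name": name,
        "prefix": prefix,
        "credential_type": "HMAC",
        "auth_config": {
            "access_key_id": access_key_id,
            "secret_access_key_id": secret_access_key_id,
        },
        "extra_config": {"endpoint_url": GCS_S3_ENDPOINT},
    }


# Placeholder values; never hard-code real keys in source files.
settings = gcs_hmac_settings("my-gcs-bucket", "<access-key-id>", "<secret>")
print(settings["extra_config"]["endpoint_url"])
```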

Create a GCS bucket in the webapp using HMAC access delegation.

The create bucket dialog using HMAC access delegation for GCS buckets.

Configuring Role-Based Access Delegation (AWS S3 Only)

To use role-based access delegation, you must configure an AWS IAM role with appropriate permissions to the S3 bucket used to store your Arraylake data and grant Earthmover’s authorization service the ability to assume the role.

After assuming this role, the Earthmover authorization service can generate temporary, scoped credentials to access this S3 bucket.

These credentials have a lifetime of 12 hours and do not automatically refresh.
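Because the credentials are not refreshed automatically, a long-running job may want to record when they were obtained and check their age before starting further work. A minimal sketch of that bookkeeping follows; credentials_expired is a hypothetical helper, and how new credentials are obtained is outside its scope.

```python
from datetime import datetime, timedelta, timezone

# Lifetime stated above for delegated credentials.
CREDENTIAL_LIFETIME = timedelta(hours=12)


def credentials_expired(issued_at, now=None):
    """Return True once 12 hours or more have passed since issued_at."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - issued_at >= CREDENTIAL_LIFETIME


# Record the time when credentials are first obtained, then check later.
issued_at = datetime.now(timezone.utc)
print(credentials_expired(issued_at))
```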

IAM Policy

Earthmover requires the following actions on the S3 bucket:

  • s3:ListBucket
  • s3:GetBucketLocation
  • s3:GetBucketNotification
  • s3:PutBucketNotification
  • s3:GetObject
  • s3:PutObject
  • s3:PutObjectAcl
  • s3:DeleteObject
  • s3:AbortMultipartUpload

Create an IAM policy using the following permissions JSON template (substituting <my-arraylake-bucket> with your S3 bucket) to grant these actions. This policy will be attached to the IAM role in the next step.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetBucketNotification",
        "s3:PutBucketNotification"
      ],
      "Resource": [
        "arn:aws:s3:::<my-arraylake-bucket>"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObjectAcl",
        "s3:AbortMultipartUpload"
      ],
      "Resource": [
        "arn:aws:s3:::<my-arraylake-bucket>/*"
      ]
    }
  ]
}
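If you manage several buckets, the template can be filled in programmatically rather than by hand. The sketch below builds the same policy document for a given bucket name using only the Python standard library; the function name arraylake_bucket_policy is hypothetical. Note that bucket-level actions apply to the bucket ARN, while object-level actions apply to the ARN with a /* suffix.

```python
import json

# Actions that apply to the bucket itself.
BUCKET_ACTIONS = [
    "s3:ListBucket",
    "s3:GetBucketLocation",
    "s3:GetBucketNotification",
    "s3:PutBucketNotification",
]

# Actions that apply to objects inside the bucket.
OBJECT_ACTIONS = [
    "s3:PutObject",
    "s3:GetObject",
    "s3:DeleteObject",
    "s3:PutObjectAcl",
    "s3:AbortMultipartUpload",
]


def arraylake_bucket_policy(bucket: str) -> dict:
    """Return the permissions policy from the template for one bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": BUCKET_ACTIONS, "Resource": [bucket_arn]},
            {"Effect": "Allow", "Action": OBJECT_ACTIONS, "Resource": [bucket_arn + "/*"]},
        ],
    }


# Example bucket name; replace with your own.
print(json.dumps(arraylake_bucket_policy("my-arraylake-bucket"), indent=2))
```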

IAM Role

Step 1: Select trusted entity

Create a new AWS IAM role using the following settings:

  • Trusted Entity Type: AWS Account
  • AWS Account ID: 842143331303 (Earthmover's AWS account ID)
  • External ID: <my-shared-secret> (any string you share only with Earthmover)
  • MFA is not supported at this time

Step 2: Add permissions

Select the IAM policy created above to attach it to this role.

Step 3: Name, review, and create

  1. Role details: Create a meaningful role name and description.

  2. Select trusted entities: Copy the trust policy below into the role's Trust Policy so that the Earthmover authorization service can assume this role and tag its sessions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::842143331303:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<my-shared-secret>"
        },
        "ArnLike": {
          "aws:PrincipalArn": "arn:aws:iam::842143331303:role/EarthmoverSignerServiceRole-production"
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::842143331303:root"
      },
      "Action": "sts:TagSession",
      "Condition": {
        "ArnLike": {
          "aws:PrincipalArn": "arn:aws:iam::842143331303:role/EarthmoverSignerServiceRole-production"
        }
      }
    }
  ]
}
  3. Tags: Add any tags that you want associated with this role.
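The trust policy from Step 3 can also be generated with your shared secret filled in, which avoids copy-and-paste mistakes in the external ID. This is a sketch using only the Python standard library; the function name earthmover_trust_policy is hypothetical, while the account ID and signer role ARN are the ones given in the trust policy above.

```python
import json

EARTHMOVER_ACCOUNT_ID = "842143331303"
SIGNER_ROLE_ARN = (
    f"arn:aws:iam::{EARTHMOVER_ACCOUNT_ID}:role/EarthmoverSignerServiceRole-production"
)


def earthmover_trust_policy(external_id: str) -> dict:
    """Return the trust policy with the given external ID (shared secret)."""
    principal = {"AWS": f"arn:aws:iam::{EARTHMOVER_ACCOUNT_ID}:root"}
    # Both statements restrict the caller to Earthmover's signer role.
    caller = {"ArnLike": {"aws:PrincipalArn": SIGNER_ROLE_ARN}}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": principal,
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"sts:ExternalId": external_id},
                    **caller,
                },
            },
            {
                "Effect": "Allow",
                "Principal": principal,
                "Action": "sts:TagSession",
                "Condition": caller,
            },
        ],
    }


# Placeholder secret; use the value you share only with Earthmover.
print(json.dumps(earthmover_trust_policy("<my-shared-secret>"), indent=2))
```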

Configuring Arraylake to use an IAM Role

Once this IAM role has been created, you can create a BucketConfig for your bucket, setting the credential type to "Role-based access delegation" in the web app OR customer_managed_role in Python and passing in your AWS account ID, shared secret (external_customer_id) and IAM role name (external_role_name) into the auth_config:

Create a bucket in the webapp using role-based access delegation.

The create bucket dialog using role-based access delegation.

That's it! You can now start interacting with your bucket via Arraylake, which will manage S3 credentials via this IAM role.

List BucketConfigs

You can list BucketConfigs associated with an organization.

List buckets in web app

The organization buckets list

In the web app, users can view existing buckets for their org from the org buckets section (app.earthmover.io/[orgname]/buckets).

The default BucketConfig is marked with a purple badge.

Administer buckets in web app

The organization settings buckets section

Org Admins can also access and manage the buckets list via the Buckets section on the Organization Settings page (app.earthmover.io/[orgname]/settings).

Clicking on the gear icon in the bucket's row will bring up the settings dialog for that bucket.

Delete a BucketConfig

Finally, we can delete a BucketConfig.

warning

A BucketConfig cannot be deleted while it is in use by any Repo. Deleting a BucketConfig also cannot be undone! Use this operation carefully.

Bucket settings dialog

The bucket settings dialog

In the web app, a BucketConfig can be deleted from the bucket's settings dialog.