Managing Storage

As a cloud-native data management platform, Arraylake stores all data in cloud object storage. See Storage in the Concepts section to learn more about how Arraylake works.

An Arraylake BucketConfig houses the settings for one or more object storage locations. Each BucketConfig holds the configuration (e.g. object store bucket name, key prefix, access credentials, etc.) that enables the Arraylake client and services to read and write array data.

All the Repositories that use a given bucket config are safely isolated from each other. A BucketConfig allows organizations to easily and securely manage storage configuration that is shared between repositories. An organization may have multiple bucket configurations.

Important properties for BucketConfigs include:

nickname: a nickname for easy referencing in code, on the command line, and on the web
platform: the object storage provider
auth_config: additional authorization inputs for the storage backend
name: the name of the bucket in object storage where chunks will be stored
prefix: an optional key prefix
extra_config: additional configuration options for the storage backend (e.g. endpoint_url)

For convenience, the Arraylake Python client also supports expressing the platform, name, and prefix components in URI form (i.e. {platform}://{name}[/{prefix}]), for example s3://my-production-data. Each BucketConfig must have a unique URI; Arraylake will not allow you to create two BucketConfigs pointing to the same location in object storage.
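
As a rough illustration of the URI form, the sketch below composes and splits such a URI using only the Python standard library. The helper names to_uri and from_uri are hypothetical and are not part of the Arraylake client.

```python
from urllib.parse import urlsplit


def to_uri(platform: str, name: str, prefix: str = "") -> str:
    """Compose {platform}://{name}[/{prefix}] from its parts."""
    uri = f"{platform}://{name}"
    prefix = prefix.strip("/")
    # The prefix is optional, so only append it when one was given.
    return f"{uri}/{prefix}" if prefix else uri


def from_uri(uri: str) -> tuple[str, str, str]:
    """Split a bucket URI back into (platform, name, prefix)."""
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path.strip("/")


print(to_uri("s3", "my-production-data"))
print(from_uri("s3://my-production-data/team-a/chunks"))
```

Two bucket configs that resolve to the same platform, name, and prefix would produce the same URI, which is the case Arraylake rejects.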

tip

You can modify a BucketConfig as much as you like so long as there are no Repos that rely on it. Once it is in use by one or more Repos, only the BucketConfig's nickname can be modified.

Create a BucketConfig

If you're just getting started, you probably only need one bucket config for your entire organization (see Organizations, Users, and Access Management for more detail). For the purposes of this example, our org name will be earthmover. If running these commands interactively, replace earthmover with your org name.

For this example, we are going to create a bucket config nicknamed production to hold the chunks for all the repositories with production-quality datasets.

Web App
Python
Python (asyncio)

Create a bucket in the webapp. — The create bucket dialog.

In the web app, Org Admins can add BucketConfigs by clicking on the "Add Bucket" button in the Buckets section of the Organization Settings page (see below).

from arraylake import Client


client = Client()

client.create_bucket_config(
  org="earthmover",
  nickname="production",
  uri="s3://my-production-data",
  extra_config={'region_name': 'us-east-1'}
)

from arraylake import AsyncClient


aclient = AsyncClient()

await aclient.create_bucket_config(
  org="earthmover",
  nickname="production",
  uri="s3://my-production-data",
  extra_config={'region_name': 'us-east-1'}
)

Object store credentials can be managed in three ways: self-managed credentials, role-based access delegation, and hash-based message authentication codes (HMAC).

By default, self-managed authorization is used for object storage access. Under this configuration, Arraylake does not manage bucket credentials, and the client environment must be configured with appropriate permissions to read (and, if desired, write) to this bucket.

For all platforms, Arraylake can store HMAC keys for bucket access ("hmac"). Under this bucket configuration, Arraylake stores the access key ID and secret access key needed to access the chunkstore bucket.

For AWS S3 or Google Cloud Storage, bucket access can be delegated to Earthmover ("role-based access delegation"). Under this bucket authorization configuration, you create a user-managed IAM role (AWS) or service account (Google Cloud) and give the Earthmover authorization service permission to assume it and generate temporary, scoped credentials to access the object store bucket.

info

To use role-based access delegation for object store credential management, chunkstore.use_delegated_credentials must be set to True in the config. This option is True by default. To disable delegated credentials, set chunkstore.use_delegated_credentials to False in the config.

Instructions on how to configure these types of credentials are below.

Configuring Self-Managed Credentials

Under this bucket credential configuration, object store credentials are managed solely by the user. This can be done in many ways:

Environment variables: Object store credentials can be set via environment variables. For AWS S3, that means setting both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. See the Storage Integrations page for object-store specific examples.
Arraylake config: Object store credentials can be set via the Arraylake config. Below is an example for AWS S3. See the Storage Integrations page for object-store specific examples.

service:
  uri: https://api.earthmover.io
s3:
  verify: False
  aws_access_key_id: ...
  aws_secret_access_key: ...

Native service configs: Object store credentials can be set using native service configurations, such as using the aws CLI to configure your AWS credentials config or the gcloud CLI to configure Google Cloud credentials. See the Storage Integrations page for object-store specific examples.
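
Before creating a client with self-managed credentials, it can help to confirm that the environment variables named above are actually set. The check below is a minimal sketch using only the standard library; the function name missing_aws_env_vars is hypothetical and not part of Arraylake.

```python
import os

# The two variables the environment-variable option above requires for AWS S3.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_env_vars(environ=None) -> list[str]:
    """Return the required AWS variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_AWS_VARS if not environ.get(name)]


missing = missing_aws_env_vars()
if missing:
    print("Not set in this environment:", ", ".join(missing))
```

A missing variable is not necessarily an error, since credentials may instead come from the Arraylake config or a native service configuration.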

Configuring HMAC

Under this bucket credential configuration, credentials are stored directly in Arraylake. This works for all S3-compatible object storage services, including GCS (see Configuring GCS Buckets with HMAC Credentials).

To configure Arraylake to use HMAC credentials, create a BucketConfig for your bucket, setting the credential type to "hmac" and passing in the access_key_id and secret_access_key HMAC credentials. These keys must have access to the object store bucket associated with the bucket config.

Web App
Python
Python (asyncio)

Create a bucket in the webapp using HMAC access delegation. — The create bucket dialog using HMAC access delegation.

from arraylake import Client


client = Client()

client.create_bucket_config(
  org="earthmover",
  nickname="production",
  uri="s3://my-production-data",
  extra_config={'region_name': 'us-east-1'},
  auth_config={
    "method": "hmac",
    "access_key_id": "my-access-key-id",
    "secret_access_key": "my-secret-access-key"
  }
)

from arraylake import AsyncClient


aclient = AsyncClient()

await aclient.create_bucket_config(
  org="earthmover",
  nickname="production",
  uri="s3://my-production-data",
  extra_config={'region_name': 'us-east-1'},
  auth_config={
    "method": "hmac",
    "access_key_id": "my-access-key-id",
    "secret_access_key": "my-secret-access-key"
  }
)

Configuring GCS Buckets with HMAC Credentials

First, set up a set of HMAC keys following this tutorial from Google Cloud. Note that in Google Cloud, HMAC keys are used for interoperability (e.g. compatibility with Amazon S3) and are not supported by GCS-native storage libraries.

To configure a GCS bucket to use HMAC credentials in Arraylake, you must set the bucket platform to S3 Compatible instead of Google Cloud. You must also set the credential type to HMAC, pass in the access_key_id and secret_access_key HMAC credentials generated in the above step, and set the endpoint URL (endpoint_url) in the extra configuration to https://storage.googleapis.com. The bucket will then be accessed through S3 APIs using the provided HMAC keys.

Web App
Python
Python (asyncio)

Create a GCS bucket in the webapp using HMAC access delegation. — The create bucket dialog using HMAC access delegation for GCS buckets.

from arraylake import Client


client = Client()

client.create_bucket_config(
  org="earthmover",
  nickname="gcs-hmac-bucket",
  uri="s3://my-gcs-bucket",  # use s3:// in place of gs://
  extra_config={'endpoint_url': 'https://storage.googleapis.com'},
  auth_config={
    "method": "hmac",
    "access_key_id": "my-access-key-id",
    "secret_access_key": "my-secret-access-key"
  }
)

from arraylake import AsyncClient


aclient = AsyncClient()

await aclient.create_bucket_config(
  org="earthmover",
  nickname="gcs-hmac-bucket",
  uri="s3://my-gcs-bucket",  # use s3:// in place of gs://
  extra_config={'endpoint_url': 'https://storage.googleapis.com'},
  auth_config={
    "method": "hmac",
    "access_key_id": "my-access-key-id",
    "secret_access_key": "my-secret-access-key"
  }
)

Configuring Role-Based Access Delegation

Under this bucket credential configuration, Earthmover is given limited access to the bucket's AWS or Google Cloud account for the purposes of generating temporary, scoped credentials for bucket access. This works for AWS S3 buckets (see AWS S3 Buckets) and Google Cloud Storage buckets (see Google Cloud Storage Buckets).

AWS S3 Buckets

To use role-based access delegation, you must configure an AWS IAM role with appropriate permissions to the S3 bucket used to store your Arraylake data and grant Earthmover’s authorization service the ability to assume the role.

After assuming this role, the Earthmover authorization service can generate temporary, scoped credentials to access this S3 bucket.

These credentials have a lifetime of 1 hour and will automatically refresh when expired.

IAM Policy

Earthmover requires the following actions on the S3 bucket:

s3:ListBucket
s3:GetBucketLocation
s3:GetBucketNotification
s3:PutBucketNotification
s3:GetObject
s3:PutObject
s3:PutObjectAcl
s3:DeleteObject
s3:AbortMultipartUpload

Create an IAM policy using the following permissions JSON template (substituting <my-arraylake-bucket> with your S3 bucket) to grant these actions. This policy will be attached to the IAM role in the next step.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:GetBucketNotification",
                "s3:PutBucketNotification"
            ],
            "Resource": [
                "arn:aws:s3:::<my-arraylake-bucket>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObjectAcl",
                "s3:AbortMultipartUpload"
            ],
            "Resource": [
                "arn:aws:s3:::<my-arraylake-bucket>/*"
            ]
        }
    ]
}
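
If you manage several buckets, the template above can be filled in programmatically. The following is a minimal sketch that reproduces the same statements for a given bucket name; build_bucket_policy is a hypothetical helper, not an Arraylake or AWS API.

```python
import json


def build_bucket_policy(bucket: str) -> dict:
    """Fill the IAM policy template above for one S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                    "s3:GetBucketNotification",
                    "s3:PutBucketNotification",
                ],
                "Resource": [bucket_arn],
            },
            {
                # Object-level actions apply to every key in the bucket.
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:DeleteObject",
                    "s3:PutObjectAcl",
                    "s3:AbortMultipartUpload",
                ],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


print(json.dumps(build_bucket_policy("my-production-data"), indent=4))
```

The printed JSON can be pasted into the IAM policy editor in place of the hand-edited template.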

IAM Role

Step 1: Select trusted entity

Create a new AWS IAM role using the following settings:

Trusted Entity Type: AWS Account
AWS Account ID: 842143331303 (Earthmover's AWS account ID)
External ID: <my-shared-secret> (any string you share only with Earthmover)
MFA is not supported at this time

Step 2: Add permissions

Select the IAM policy created above to attach it to this role.

Step 3: Name, review, and create

Role details: Create a meaningful role name and description.
Select trusted entities: Copy the trust policy below into the role's Trust Policy to allow the Earthmover authorization service to assume this role and tag its sessions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::842143331303:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "<my-shared-secret>"
                },
                "ArnLike": {
                    "aws:PrincipalArn": "arn:aws:iam::842143331303:role/EarthmoverSignerServiceRole-production"
                }
            }
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::842143331303:root"
            },
            "Action": "sts:TagSession",
            "Condition": {
                "ArnLike": {
                    "aws:PrincipalArn": "arn:aws:iam::842143331303:role/EarthmoverSignerServiceRole-production"
                }
            }
        }
    ]
}

Tags:

Add any tags that you want associated with this role.

Configuring Arraylake to use an IAM Role

Once this IAM role has been created, you can create a BucketConfig for your bucket, setting the credential type to "Role-based access delegation" in the web app or aws_customer_managed_role in Python, and passing your AWS account ID (external_customer_id), IAM role name (external_role_name), and shared secret (shared_secret) into the auth_config:

Web App
Python
Python (asyncio)

Create a bucket in the webapp using role-based access delegation for AWS S3. — The create bucket dialog using role-based access delegation for AWS S3.

from arraylake import Client


client = Client()

client.create_bucket_config(
  org="earthmover",
  nickname="production",
  uri="s3://my-production-data",
  extra_config={'region_name': 'us-east-1'},
  auth_config={
    "method": "aws_customer_managed_role",
    "external_customer_id": "my-aws-account-id",
    "external_role_name": "my-iam-role-name",
    "shared_secret": "my-shared-secret",
  }
)

from arraylake import AsyncClient


aclient = AsyncClient()

await aclient.create_bucket_config(
  org="earthmover",
  nickname="production",
  uri="s3://my-production-data",
  extra_config={'region_name': 'us-east-1'},
  auth_config={
    "method": "aws_customer_managed_role",
    "external_customer_id": "my-aws-account-id",
    "external_role_name": "my-iam-role-name",
    "shared_secret": "my-shared-secret",
  }
)

That's it! You can now start interacting with your bucket via Arraylake, which will manage S3 credentials via this IAM role.

Google Cloud Storage Buckets

For Google Cloud buckets, cross-account service account impersonation is used to give Earthmover's authorization service account temporary, scoped access to the GCS bucket.

To enable service account impersonation, you must configure a service account with appropriate permissions to the GCS bucket used to store your Arraylake data and grant Earthmover’s authorization service the ability to impersonate the account.

After impersonating this service account, the Earthmover authorization service account can generate temporary, scoped credentials to access this GCS bucket.

These credentials have a lifetime of 1 hour and will not refresh automatically.
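
Because these credentials are not refreshed automatically, long-running jobs may want to track when they were obtained. The sketch below is illustrative only: CredentialClock is a hypothetical helper that is not part of Arraylake, the one-hour lifetime is taken from the statement above, and the five-minute safety margin is an arbitrary assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Lifetime stated in the text above.
CREDENTIAL_LIFETIME = timedelta(hours=1)


@dataclass
class CredentialClock:
    """Remembers when credentials were obtained (hypothetical helper)."""

    issued_at: datetime

    def expires_at(self) -> datetime:
        return self.issued_at + CREDENTIAL_LIFETIME

    def is_expired(
        self,
        now: Optional[datetime] = None,
        margin: timedelta = timedelta(minutes=5),  # assumed safety margin
    ) -> bool:
        """True once the current time is within `margin` of expiry."""
        now = now or datetime.now(timezone.utc)
        return now >= self.expires_at() - margin


clock = CredentialClock(issued_at=datetime.now(timezone.utc))
print(clock.is_expired())
```

How fresh credentials are then obtained is outside the scope of this sketch.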

Step 1: Create a Service Account

Go to the Google Cloud Console.
Navigate to the IAM & Admin > Service accounts page.
Click the Create Service Account button.
Provide a name and ID for the service account, and click Create.
Under Grant this service account access to project, grant the Service Account Token Creator role in the project. This will allow the service account to retrieve credentials from the impersonated account.
Click Done to create the service account.

Step 2: Grant Earthmover Impersonation Access to the Service Account

Navigate to the IAM & Admin > Service accounts page and click on the service account created above.
Under the Permissions tab, click Grant Access.
Add the earthmover-signer-service@arraylake.iam.gserviceaccount.com service account as a principal and grant it the Service Account Token Creator role in the project.
Click Save.

Step 3: Grant Service Account Access to the Bucket

Navigate to the Cloud Storage > Buckets page and click on the bucket you wish to give the service account access to. This bucket should correspond to the bucket config that you want to grant Arraylake access to.
Under the Permissions tab, click Grant Access.
Add the service account created in Step 1 (e.g. my-al-service-account@my-project.iam.gserviceaccount.com) as a principal and grant it the Storage Object User role in the bucket.
Click Save.

Configuring Arraylake to use a Google Service Account

Once you have created a service account with the appropriate permissions, you can create a BucketConfig for your bucket, setting the credential type to "Role-based access delegation" in the web app or gcp_customer_managed_role in Python, and passing your service account (target_service_account) into the auth_config:

Web App
Python
Python (asyncio)

Create a bucket in the webapp using role-based access delegation for GCS. — The create bucket dialog using role-based access delegation for Google Cloud Storage.

from arraylake import Client


client = Client()

client.create_bucket_config(
  org="earthmover",
  nickname="production",
  uri="gs://my-production-data",
  extra_config={},  # TODO: add backend-specific options if needed
  auth_config={
    "method": "gcp_customer_managed_role",
    "target_service_account": "my-service-account@my-project.iam.gserviceaccount.com"
  }
)

from arraylake import AsyncClient


aclient = AsyncClient()

await aclient.create_bucket_config(
  org="earthmover",
  nickname="production",
  uri="gs://my-production-data",
  extra_config={},  # TODO: add backend-specific options if needed
  auth_config={
    "method": "gcp_customer_managed_role",
    "target_service_account": "my-service-account@my-project.iam.gserviceaccount.com"
  }
)
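
The auth_config dictionaries used in the examples on this page differ only in their method and accompanying keys. As a hedged sketch, a pre-flight check along the following lines could catch a missing key before calling create_bucket_config. The key sets are copied from the examples on this page and may not be the complete set the service accepts; missing_auth_keys is a hypothetical helper, not part of the Arraylake client.

```python
# Keys taken from the examples on this page (an assumption, not a full schema).
EXAMPLE_AUTH_KEYS = {
    "hmac": {"access_key_id", "secret_access_key"},
    "aws_customer_managed_role": {
        "external_customer_id",
        "external_role_name",
        "shared_secret",
    },
    "gcp_customer_managed_role": {"target_service_account"},
}


def missing_auth_keys(auth_config: dict) -> set:
    """Return keys the page's examples include but auth_config lacks."""
    method = auth_config.get("method")
    if method not in EXAMPLE_AUTH_KEYS:
        raise ValueError(f"method not covered by this sketch: {method!r}")
    return EXAMPLE_AUTH_KEYS[method] - set(auth_config)


print(missing_auth_keys({"method": "hmac", "access_key_id": "my-access-key-id"}))
```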

List BucketConfigs

You can list BucketConfigs associated with an organization.

Web App
Python
Python (asyncio)

List buckets in web app — The organization buckets list

In the web app, users can view existing buckets for their org from the org buckets section (app.earthmover.io/[orgname]/buckets).

The default BucketConfig is marked with a purple badge.

Administer buckets in web app — The organization settings buckets section

Org Admins can also access and manage the buckets list via the Buckets section on the Organizations Settings page (app.earthmover.io/[orgname]/settings).

Clicking on the gear icon in the bucket's row will bring up the settings dialog for that bucket.

client.list_bucket_configs("earthmover")

In the Python Client, the default BucketConfig will be indicated by the is_default flag on the resulting list item.
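
For example, the default can be picked out of the returned list by checking that flag. The sketch below uses stand-in objects so that it runs without a client; find_default is a hypothetical helper, and the assumption that list items expose nickname as an attribute is not confirmed by this page.

```python
from types import SimpleNamespace


def find_default(bucket_configs):
    """Return the first item whose is_default flag is set, or None."""
    return next(
        (b for b in bucket_configs if getattr(b, "is_default", False)),
        None,
    )


# Stand-in objects for illustration; real items come from list_bucket_configs.
configs = [
    SimpleNamespace(nickname="scratch", is_default=False),
    SimpleNamespace(nickname="production", is_default=True),
]

default = find_default(configs)
print(default.nickname if default else "no default bucket config")
```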

await aclient.list_bucket_configs("earthmover")


Delete a BucketConfig

Finally, we can delete a BucketConfig.

warning

A BucketConfig cannot be deleted while it is in use by any Repo. Deleting a BucketConfig also cannot be undone! Use this operation carefully.

Web App
Python
Python (asyncio)

Bucket settings dialog — The bucket settings dialog

In the web app, a BucketConfig can be deleted from the bucket's settings dialog.

client.delete_bucket_config(
  org="earthmover",
  nickname="production",
  imsure=True, imreallysure=True
)

await aclient.delete_bucket_config(
  org="earthmover",
  nickname="production",
  imsure=True, imreallysure=True
)
