Coiled
Arraylake and Coiled work well together: use Coiled to manage your cloud infrastructure, run your computations in parallel with Dask, and use Arraylake as your cloud data lake platform. Coiled provides four interfaces for launching cloud resources:
- Dask clusters
- Serverless functions
- CLI jobs
- Jupyter notebooks
General pattern
The code snippet below demonstrates a general pattern of how to use Coiled and Arraylake in a simple workflow.
import coiled
import arraylake as al
import xarray as xr
cluster = coiled.Cluster(
    n_workers=100,  # Start 100 machines on AWS, GCP, or Azure
)
dask_client = cluster.get_client()
# Connect to Arraylake by specifying 'organization/repo'
al_client = al.Client()
repo = al_client.get_repo("my-climate-company/ocean-data")
# Read array data from Arraylake
ds = xr.open_dataset(
    repo.store,
    group="xarray/ocean-temp",
    engine="zarr",
    chunks="auto",  # Use Dask for parallelism
)
# Run your computation in parallel on the cloud
temps = ds.groupby("time.season").mean("time").compute()
# Write result to Arraylake
temps.to_zarr(
    repo.store,
    group="xarray/avg-season-temps",
)
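To make the seasonal reduction in this pattern concrete, here is a stdlib-only sketch of the same kind of grouping and averaging. The sample data and helper function are hypothetical illustrations, not part of the Arraylake or xarray APIs:

```python
from statistics import mean

# Map month number to meteorological season, as xarray's "time.season" does.
SEASONS = {12: "DJF", 1: "DJF", 2: "DJF",
           3: "MAM", 4: "MAM", 5: "MAM",
           6: "JJA", 7: "JJA", 8: "JJA",
           9: "SON", 10: "SON", 11: "SON"}

def seasonal_means(records):
    """records: iterable of (month, temperature) pairs."""
    groups = {}
    for month, temp in records:
        groups.setdefault(SEASONS[month], []).append(temp)
    return {season: mean(temps) for season, temps in groups.items()}

print(seasonal_means([(1, -5.0), (2, -3.0), (7, 25.0), (8, 27.0)]))
# {'DJF': -4.0, 'JJA': 26.0}
```

The real workflow does the same thing lazily across Dask chunks; this sketch only shows the grouping logic.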
Specific examples
The following sections detail how to use Arraylake with the different Coiled APIs. To start, you will need an Arraylake API token (these begin with "ema_").
- Dask cluster
- Serverless functions
- CLI jobs
- Jupyter notebooks
Dask cluster
Arraylake access: Set the ARRAYLAKE_TOKEN environment variable to your Arraylake API token.
Dask is a general purpose library for parallel computing that is closely integrated with the PyData ecosystem (Zarr, Xarray, GeoTIFF, etc.) to scale out your workflows. Coiled deploys Dask clusters on the cloud.
Parallelize workflows involving Arraylake by spinning up a Dask cluster with a set number of workers. Before initializing the cluster, set the ARRAYLAKE_TOKEN environment variable to your API token so the workers can authenticate with Arraylake. The following example demonstrates initiating a cluster of Dask workers, reading a dataset, and writing it as a Zarr data cube to an Arraylake repo.
In a Python session:
import coiled
import arraylake as al
import xarray as xr
cluster = coiled.Cluster(n_workers=10)
This will prompt Coiled to create a cluster of Dask workers:
╭───────────────────────── Package Sync for arraylake ─────────────────────────╮
│ Fetching latest package priorities ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Scanning 208 conda packages ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Scanning 338 python packages ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Running pip check ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:02 │
│ Validating environment ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:03 │
│ Creating wheel for arraylake ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:07 │
│ Creating wheel for arraylake-mongo-metastore ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:06 │
│ Uploading arraylake ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Uploading arraylake-mongo-metastore ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Requesting package sync build ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
╰──────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────── Package Info ────────────────────────────────╮
│ ╷ │
│ Package │ Note │
│ ╶───────────────────────────┼──────────────────────────────────────────────╴ │
│ arraylake │ Wheel built from │
│ │ ~/Desktop/earthmover/arraylake/client │
│ arraylake-mongo-metastore │ Wheel built from │
│ │ ~/Desktop/earthmover/arraylake/mongo-metasto │
│ │ re │
│ ╵ │
╰──────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────────── Coiled Cluster ───────────────────────────────╮
│ https://cloud.coiled.io/clusters/537310?account=dask │
╰──────────────────────────────────────────────────────────────────────────────╯
╭────────────── Overview ──────────────╮╭─────────── Configuration ────────────╮
│ ││ │
│ Name: dask-1b0d5c8f ││ Region: us-east-2 │
│ ││ │
│ Scheduler Status: started ││ Scheduler: m6i.xlarge │
│ ││ │
│ Dashboard: ││ Workers: m6i.xlarge (2) │
│ https://cluster-upast.dask.host?toke ││ │
│ n=U-3fkZ5GRwezON1C ││ Workers Requested: 2 │
│ ││ │
╰──────────────────────────────────────╯╰──────────────────────────────────────╯
╭───────────────────────── (2024/07/26 12:54:51 MDT) ──────────────────────────╮
│ │
│ All workers ready. │
│ │
│ │
╰──────────────────────────────────────────────────────────────────────────────╯
Once all workers are ready, we can connect the cluster to the Dask client, connect to the Arraylake client, and begin working with our data.
dask_client = cluster.get_client()
al_client = al.Client()
repo = al_client.get_or_create_repo('earthmover/coiled_example')
ds = xr.tutorial.open_dataset('air_temperature').chunk(
    {'time': 1000, 'lat': 5, 'lon': 5}
)
ds
<xarray.Dataset> Size: 31MB
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
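The .chunk() call above splits the (time, lat, lon) = (2920, 25, 53) array into ceil(2920/1000) × ceil(25/5) × ceil(53/5) = 3 × 5 × 11 = 165 Dask chunks. A quick stdlib check (shape and chunk sizes copied from the example; this is just arithmetic, not xarray API):

```python
import math

shape = (2920, 25, 53)  # (time, lat, lon) from the dataset above
chunks = (1000, 5, 5)   # chunk sizes from the .chunk() call

n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
print(n_chunks)
# 165
```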
This reads the air temperature tutorial dataset as a chunked xr.Dataset. Next, write the data to the Arraylake repo:
ds.to_zarr(
    repo.store,
    group='xr_tutorial_ds/air_temperature',
    zarr_version=3,
    mode='w'
)
We've now written the data, but these changes won't be visible without committing the change to our Arraylake repo.
repo.commit('wrote air temp data to repo')
66a3f2093f5775e7440d3e62
Serverless functions
Arraylake access: @coiled.function(..., environ={"ARRAYLAKE_TOKEN":YOUR_API_TOKEN})
Coiled allows you to decorate Python functions so that they run in the cloud. This is a powerful pattern for combining Coiled with Arraylake. See our blog post for an example that uses serverless tools to build large-scale datacubes.
We will need to pass Arraylake credentials to the serverless function. Write a function decorated with @coiled.function(), and pass the token as a dictionary to the environ parameter:
import random

import coiled
import arraylake as al

@coiled.function(environ={'ARRAYLAKE_TOKEN': "ema_XXXXXXXX"})
def create_array(n, s):
    client = al.Client()
    # Connect to the Arraylake repo
    repo = client.get_or_create_repo('earthmover/coiled_example')
    # Create a group in the repo
    foo_group = repo.root_group.create_group('foo')
    # Create array data in the group
    zero_array = foo_group.create(
        'bar',
        shape=n, chunks=10, dtype='f4',
        fill_value=0,
    )
    idx = [random.randint(0, n - 1) for _ in range(s)]
    zero_array[idx] = 2
    repo.commit('move array')
    return repo

r = create_array(100, 15)
╭───────────────────────── Package Sync for arraylake ─────────────────────────╮
│ Fetching latest package priorities ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Scanning 208 conda packages ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Scanning 338 python packages ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Running pip check ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:02 │
│ Validating environment ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:03 │
│ Creating wheel for arraylake ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:05 │
│ Creating wheel for arraylake-mongo-metastore ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:04 │
│ Uploading arraylake ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Uploading arraylake-mongo-metastore ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Requesting package sync build ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
╰──────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────── Package Info ────────────────────────────────╮
│ ╷ │
│ Package │ Note │
│ ╶───────────────────────────┼──────────────────────────────────────────────╴ │
│ arraylake │ Wheel built from │
│ │ ~/Desktop/earthmover/arraylake/client │
│ arraylake-mongo-metastore │ Wheel built from │
│ │ ~/Desktop/earthmover/arraylake/mongo-metasto │
│ │ re │
│ ╵ │
╰──────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────────── Coiled Cluster ───────────────────────────────╮
│ https://cloud.coiled.io/clusters/537524?account=dask │
╰──────────────────────────────────────────────────────────────────────────────╯
╭────────────── Overview ──────────────╮╭─────────── Configuration ────────────╮
│ ││ │
│ Name: function-7dc1e56a ││ Region: us-east-2 │
│ ││ │
│ Scheduler Status: started ││ Scheduler: m6i.large │
│ ││ │
│ Dashboard: ││ Workers Requested: 0 │
│ https://cluster-zanjt.dask.host?toke ││ │
│ n=wvgxuBjkZQkT30Wc ││ │
│ ││ │
╰──────────────────────────────────────╯╰──────────────────────────────────────╯
2024-07-26 16:33 - distributed.deploy.adaptive - INFO - Adaptive scaling started: minimum=0 maximum=500
We can now see the changes that were made and committed in the function reflected in the state of the repo:
r.tree().__rich__()
/
├── 📁 foo
│ └── 🇦 bar (100,) float32
├── 📁 bar
└── 📁 xr_tutorial_ds
└── 📁 air_temperature
├── 🇦 lat (25,) float32
├── 🇦 lon (53,) float32
├── 🇦 time (2920,) float32
└── 🇦 air (2920, 25, 53) int16
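One detail worth noting in create_array above: random.randint draws with replacement, so idx can contain duplicate indices. If you need distinct indices, random.sample is the stdlib alternative. A small illustrative sketch (not part of the Arraylake API):

```python
import random

random.seed(0)  # for reproducibility
n, s = 100, 15

# random.randint samples with replacement: indices may repeat
with_replacement = [random.randint(0, n - 1) for _ in range(s)]

# random.sample draws without replacement: all indices are distinct
without_replacement = random.sample(range(n), s)

assert len(set(without_replacement)) == s
```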
CLI Jobs
Arraylake access: coiled run ... --env ARRAYLAKE_TOKEN=YOUR_TOKEN
Use the Coiled CLI to run Python scripts on the cloud. To connect to Arraylake, pass credentials to the Coiled CLI with the --env flag. The example below demonstrates using coiled run to list all repos in an Arraylake organization.
$ coiled run arraylake repo list earthmover-demos --env ARRAYLAKE_TOKEN=ema_XXXXXXXX
The exact arguments and their purpose are:
- coiled run: Run a command on the cloud.
- arraylake repo list earthmover-demos: The command you want to run. In this example, we list all repos within the earthmover-demos organization.
- --env ARRAYLAKE_TOKEN=ema_XXXXXXXX: Use Coiled's --env flag to securely transmit environment variables from your local environment to the command's environment on the cloud. Here, your Arraylake token is passed to authenticate with Arraylake from a cloud VM.
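Inside a script launched with coiled run, a variable passed via --env arrives as an ordinary environment variable. A minimal, hypothetical helper for reading and sanity-checking it (the helper name and checks are illustrative, not part of the Arraylake API):

```python
import os

def get_arraylake_token(env=os.environ):
    """Read the Arraylake token passed through Coiled's --env flag."""
    token = env.get("ARRAYLAKE_TOKEN")
    if token is None:
        raise RuntimeError("ARRAYLAKE_TOKEN is not set; pass it with --env")
    if not token.startswith("ema_"):
        raise RuntimeError("Arraylake API tokens begin with 'ema_'")
    return token

print(get_arraylake_token({"ARRAYLAKE_TOKEN": "ema_XXXXXXXX"}))
# ema_XXXXXXXX
```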
$ coiled run arraylake repo list earthmover-demos --env ARRAYLAKE_TOKEN=ema_XXXXXXXX
╭──────────────── Running arraylake repo list earthmover-demos ────────────────╮
│ │
│ Details: https://cloud.coiled.io/clusters/537308?account=dask │
│ │
│ Ready ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ │
│ Synced local Python environment: arraylake │
│ Region: us-east-2 Uptime: 1m 42s │
│ VM Type: m6i.xlarge Approx cloud cost: $0.19/hr │
│ │
╰──────────────────────────────────────────────────────────────────────────────╯
Output
------
✓ Listing repos for earthmover-demos...succeeded
Arraylake Repositories for earthmov
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━
┃ Name ┃ Created ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━
│ gfs │ 2024-05-09T22:35:36+00:00 │ 20
│ hrrr │ 2024-05-09T21:39:49+00:00 │ 20
│ weather-data-demo │ 2024-07-19T17:00:55+00:00 │ 20
│ vcs │ 2024-05-02T01:16:01+00:00 │ 20
└───────────────────────────────────────────────┴───────────────────────────┴───
Great, we were able to connect to our Arraylake organization and list the org's Repos from the Coiled CLI.
Jupyter notebooks
Arraylake access: coiled notebook start --env ARRAYLAKE_TOKEN=YOUR_TOKEN
Similar to Google Colab or Amazon SageMaker, you can also use Coiled to start a Jupyter notebook in the cloud. To start a Coiled Jupyter notebook with an Arraylake connection, use the --env flag to pass your Arraylake credentials just as you would object storage credentials:
coiled notebook start --env ARRAYLAKE_TOKEN=ema_XXXXXXXX \
    --env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
    --env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY
╭─────────────────────── Notebook notebook-cd82d938... ────────────────────────╮
│ │
│ Jupyter: https://cluster-srmtf.dask.host/jupyter/lab?token=ewBg58I3CiH25Uma │
│ Details: https://cloud.coiled.io/clusters/537336?account=dask │
│ │
│ Ready ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ │
│ Synced local Python environment: arraylake │
│ Region: us-east-2 Uptime: 1m 10s │
│ VM Type: m6i.xlarge Approx cloud cost: $0.19/hr │
│ Approx cloud total: $0.00 │
│ │
│ Use Control-C to stop this notebook server │
╰──────────────────────────────────────────────────────────────────────────────╯
This launches a Jupyter notebook with your Arraylake credentials already configured, so you can connect to Arraylake right away.
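A quick sanity check to run in the first notebook cell: confirm that everything passed via --env actually reached the notebook environment. The helper below is a hypothetical sketch, and the AWS variable names assume the standard AWS credential variables:

```python
import os

def missing_env(required, env=os.environ):
    """Return the names of required environment variables that are not set."""
    return [name for name in required if name not in env]

# In the launched notebook this list should come back empty; here we
# demonstrate with an illustrative, partially populated environment.
print(missing_env(
    ["ARRAYLAKE_TOKEN", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    env={"ARRAYLAKE_TOKEN": "ema_XXXXXXXX"},
))
# ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY']
```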