Coiled
Arraylake and Coiled work well together: use Coiled to manage your cloud infrastructure, run your computations in parallel with Dask, and use Arraylake as your cloud data lake platform. Coiled provides four interfaces for launching cloud resources:
- Dask clusters
- Serverless functions
- CLI jobs
- Jupyter notebooks
General pattern
The code snippet below demonstrates a general pattern of how to use Coiled and Arraylake in a simple workflow.
import coiled
import arraylake as al
import xarray as xr
cluster = coiled.Cluster(
    n_workers=100,  # Start 100 machines on AWS, GCP, or Azure
)
dask_client = cluster.get_client()
# Connect to Arraylake by specifying 'organization/repo'
al_client = al.Client()
repo = al_client.get_repo("my-climate-company/ocean-data")
# Read array data from Arraylake
ds = xr.open_dataset(
    repo.store,
    group="xarray/ocean-temp",
    engine="zarr",
    chunks="auto",  # Use Dask for parallelism
)
# Run your computation in parallel on the cloud
temps = ds.groupby("time.season").mean("time").compute()
# Write result to Arraylake
temps.to_zarr(
    repo.store,
    group="xarray/avg-season-temps",
)
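To make the seasonal reduction in this pattern concrete, here is a stdlib-only sketch of the same kind of grouping and averaging. The sample data and helper function are hypothetical illustrations, not part of the Arraylake or xarray APIs:

```python
from statistics import mean

# Map month number to meteorological season, as xarray's "time.season" does.
SEASONS = {12: "DJF", 1: "DJF", 2: "DJF",
           3: "MAM", 4: "MAM", 5: "MAM",
           6: "JJA", 7: "JJA", 8: "JJA",
           9: "SON", 10: "SON", 11: "SON"}

def seasonal_means(records):
    """records: iterable of (month, temperature) pairs."""
    groups = {}
    for month, temp in records:
        groups.setdefault(SEASONS[month], []).append(temp)
    return {season: mean(temps) for season, temps in groups.items()}

print(seasonal_means([(1, -5.0), (2, -3.0), (7, 25.0), (8, 27.0)]))
# {'DJF': -4.0, 'JJA': 26.0}
```

The real workflow does the same thing lazily across Dask chunks; this sketch only shows the grouping logic.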
Specific examples
The following sections detail how to use Arraylake with the different Coiled APIs. To start, you will need an Arraylake API token (these begin with "ema_").
- Dask cluster
- Serverless functions
- CLI jobs
- Jupyter notebooks
Dask cluster
Arraylake access: Set the ARRAYLAKE_TOKEN environment variable to your Arraylake API token.
Dask is a general purpose library for parallel computing that is closely integrated with the PyData ecosystem (Zarr, Xarray, GeoTIFF, etc.) to scale out your workflows. Coiled deploys Dask clusters on the cloud.
Parallelize workflows involving Arraylake by spinning up a Dask cluster with a set number of workers. Before initializing the cluster, set the ARRAYLAKE_TOKEN environment variable to your API token so the workers can authenticate with Arraylake. The following example demonstrates initiating a cluster of Dask workers, reading a dataset, and writing it as a Zarr data cube to an Arraylake repo.
In a Python session:
import coiled
import arraylake as al
import xarray as xr
cluster = coiled.Cluster(n_workers=10)
This will prompt Coiled to create a cluster of Dask workers:
╭───────────────────────── Package Sync for arraylake ─────────────────────────╮
│ Fetching latest package priorities ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Scanning 208 conda packages ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Scanning 338 python packages ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Running pip check ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:02 │
│ Validating environment ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:03 │
│ Creating wheel for arraylake ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:07 │
│ Creating wheel for arraylake-mongo-metastore ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:06 │
│ Uploading arraylake ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Uploading arraylake-mongo-metastore ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Requesting package sync build ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
╰──────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────── Package Info ────────────────────────────────╮
│ ╷ │
│ Package │ Note │
│ ╶───────────────────────────┼──────────────────────────────────────────────╴ │
│ arraylake │ Wheel built from │
│ │ ~/Desktop/earthmover/arraylake/client │
│ arraylake-mongo-metastore │ Wheel built from │
│ │ ~/Desktop/earthmover/arraylake/mongo-metasto │
│ │ re │
│ ╵ │
╰──────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────────── Coiled Cluster ───────────────────────────────╮
│ https://cloud.coiled.io/clusters/537310?account=dask │
╰──────────────────────────────────────────────────────────────────────────────╯
╭────────────── Overview ──────────────╮╭─────────── Configuration ────────────╮
│ ││ │
│ Name: dask-1b0d5c8f ││ Region: us-east-2 │
│ ││ │
│ Scheduler Status: started ││ Scheduler: m6i.xlarge │
│ ││ │
│ Dashboard: ││ Workers: m6i.xlarge (2) │
│ https://cluster-upast.dask.host?toke ││ │
│ n=U-3fkZ5GRwezON1C ││ Workers Requested: 2 │
│ ││ │
╰──────────────────────────────────────╯╰──────────────────────────────────────╯
╭───────────────────────── (2024/07/26 12:54:51 MDT) ──────────────────────────╮
│ │
│ All workers ready. │
│ │
│ │
╰──────────────────────────────────────────────────────────────────────────────╯
Once all workers are ready, we can connect the cluster to the Dask client, connect to the Arraylake client, and begin working with our data.
dask_client = cluster.get_client()
al_client = al.Client()
repo = al_client.get_or_create_repo('earthmover/coiled_example')
ds = xr.tutorial.open_dataset('air_temperature').chunk(
    {'time': 1000, 'lat': 5, 'lon': 5}
)
ds
<xarray.Dataset> Size: 31MB
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
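The .chunk() call above splits the (time, lat, lon) = (2920, 25, 53) array into ceil(2920/1000) × ceil(25/5) × ceil(53/5) = 3 × 5 × 11 = 165 Dask chunks. A quick stdlib check (shape and chunk sizes copied from the example; this is just arithmetic, not xarray API):

```python
import math

shape = (2920, 25, 53)  # (time, lat, lon) from the dataset above
chunks = (1000, 5, 5)   # chunk sizes from the .chunk() call

n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
print(n_chunks)
# 165
```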
This reads the air temperature tutorial dataset as a chunked xr.Dataset. Next, write the data to the Arraylake repo:
ds.to_zarr(
    repo.store,
    group='xr_tutorial_ds/air_temperature',
    zarr_version=3,
    mode='w'
)
We've now written the data, but these changes won't be visible without committing the change to our Arraylake repo.
repo.commit('wrote air temp data to repo')
66a3f2093f5775e7440d3e62
Serverless functions
Arraylake access: @coiled.function(..., environ={"ARRAYLAKE_TOKEN":YOUR_API_TOKEN})
Coiled allows you to decorate Python functions so that they run in the cloud. This is a powerful pattern for combining Coiled with Arraylake. See our blog post for an example that uses serverless tools to build large-scale datacubes.
We will need to pass Arraylake credentials to the serverless function. Write a function decorated with @coiled.function(), and pass the token as a dictionary to the environ parameter:
import random

import coiled
import arraylake as al

@coiled.function(environ={'ARRAYLAKE_TOKEN': "ema_XXXXXXXX"})
def create_array(n, s):
    client = al.Client()
    # Connect to the Arraylake repo
    repo = client.get_or_create_repo('earthmover/coiled_example')
    # Create a group in the repo
    foo_group = repo.root_group.create_group('foo')
    # Create array data in the group
    zero_array = foo_group.create(
        'bar',
        shape=n, chunks=10, dtype='f4',
        fill_value=0,
    )
    idx = [random.randint(0, n - 1) for _ in range(s)]
    zero_array[idx] = 2
    repo.commit('move array')
    return repo

r = create_array(100, 15)
╭───────────────────────── Package Sync for arraylake ─────────────────────────╮
│ Fetching latest package priorities ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Scanning 208 conda packages ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Scanning 338 python packages ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Running pip check ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:02 │
│ Validating environment ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:03 │
│ Creating wheel for arraylake ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:05 │
│ Creating wheel for arraylake-mongo-metastore ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:04 │
│ Uploading arraylake ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Uploading arraylake-mongo-metastore ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
│ Requesting package sync build ━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 │
╰──────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────── Package Info ────────────────────────────────╮
│ ╷ │
│ Package │ Note │
│ ╶───────────────────────────┼──────────────────────────────────────────────╴ │
│ arraylake │ Wheel built from │
│ │ ~/Desktop/earthmover/arraylake/client │
│ arraylake-mongo-metastore │ Wheel built from │
│ │ ~/Desktop/earthmover/arraylake/mongo-metasto │
│ │ re │
│ ╵ │
╰──────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────────── Coiled Cluster ───────────────────────────────╮
│ https://cloud.coiled.io/clusters/537524?account=dask │
╰──────────────────────────────────────────────────────────────────────────────╯
╭────────────── Overview ──────────────╮╭─────────── Configuration ────────────╮
│ ││ │
│ Name: function-7dc1e56a ││ Region: us-east-2 │
│ ││ │
│ Scheduler Status: started ││ Scheduler: m6i.large │
│ ││ │
│ Dashboard: ││ Workers Requested: 0 │
│ https://cluster-zanjt.dask.host?toke ││ │
│ n=wvgxuBjkZQkT30Wc ││ │
│ ││ │
╰──────────────────────────────────────╯╰──────────────────────────────────────╯
2024-07-26 16:33 - distributed.deploy.adaptive - INFO - Adaptive scaling started: minimum=0 maximum=500
We can now see the changes that were made and committed in the function reflected in the state of the repo:
r.tree().__rich__()
/
├── 📁 foo
│ └── 🇦 bar (100,) float32
├── 📁 bar
└── 📁 xr_tutorial_ds
└── 📁 air_temperature
├── 🇦 lat (25,) float32
├── 🇦 lon (53,) float32
├── 🇦 time (2920,) float32
└── 🇦 air (2920, 25, 53) int16
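One detail worth noting in create_array above: random.randint draws with replacement, so idx can contain duplicate indices. If you need distinct indices, random.sample is the stdlib alternative. A small illustrative sketch (not part of the Arraylake API):

```python
import random

random.seed(0)  # for reproducibility
n, s = 100, 15

# random.randint samples with replacement: indices may repeat
with_replacement = [random.randint(0, n - 1) for _ in range(s)]

# random.sample draws without replacement: all indices are distinct
without_replacement = random.sample(range(n), s)

assert len(set(without_replacement)) == s
```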
CLI Jobs
Arraylake access: coiled run ... --env ARRAYLAKE_TOKEN=YOUR_TOKEN
Use the Coiled CLI to run Python scripts on the cloud. To connect to Arraylake, pass credentials to the Coiled CLI with the --env flag. The example below demonstrates using coiled run to list all repos in an Arraylake organization.
$ coiled run arraylake repo list earthmover-demos --env ARRAYLAKE_TOKEN=ema_XXXXXXXX
The exact arguments and their purpose are:
- coiled run: Run a command on the cloud.
- arraylake repo list earthmover-demos: The command you want to run. In this example, we list all repos within the earthmover-demos organization.
- --env ARRAYLAKE_TOKEN=ema_XXXXXXXX: Use Coiled's --env flag to securely transmit environment variables from your local environment to the command's environment on the cloud. Here, your Arraylake token is passed to authenticate with Arraylake from a cloud VM.
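Inside a script launched with coiled run, a variable passed via --env arrives as an ordinary environment variable. A minimal, hypothetical helper for reading and sanity-checking it (the helper name and checks are illustrative, not part of the Arraylake API):

```python
import os

def get_arraylake_token(env=os.environ):
    """Read the Arraylake token passed through Coiled's --env flag."""
    token = env.get("ARRAYLAKE_TOKEN")
    if token is None:
        raise RuntimeError("ARRAYLAKE_TOKEN is not set; pass it with --env")
    if not token.startswith("ema_"):
        raise RuntimeError("Arraylake API tokens begin with 'ema_'")
    return token

print(get_arraylake_token({"ARRAYLAKE_TOKEN": "ema_XXXXXXXX"}))
# ema_XXXXXXXX
```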
$ coiled run arraylake repo list earthmover-demos --env ARRAYLAKE_TOKEN=ema_XXXXXXXX
╭──────────────── Running arraylake repo list earthmover-demos ────────────────╮
│ │
│ Details: https://cloud.coiled.io/clusters/537308?account=dask │
│ │
│ Ready ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ │
│ Synced local Python environment: arraylake │
│ Region: us-east-2 Uptime: 1m 42s │
│ VM Type: m6i.xlarge Approx cloud cost: $0.19/hr │
│ │
╰──────────────────────────────────────────────────────────────────────────────╯
Output
------
✓ Listing repos for earthmover-demos...succeeded
Arraylake Repositories for earthmov
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━
┃ Name ┃ Created ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━
│ gfs │ 2024-05-09T22:35:36+00:00 │ 20
│ hrrr │ 2024-05-09T21:39:49+00:00 │ 20
│ weather-data-demo │ 2024-07-19T17:00:55+00:00 │ 20
│ vcs │ 2024-05-02T01:16:01+00:00 │ 20
└───────────────────────────────────────────────┴───────────────────────────┴───
Great, we were able to connect to our Arraylake organization and list the org's Repos from the Coiled CLI.
Jupyter notebooks
Arraylake access: coiled notebook start --env ARRAYLAKE_TOKEN=YOUR_TOKEN
Similar to Google Colab or Amazon SageMaker, you can also use Coiled to start a Jupyter notebook in the cloud. To start a Coiled Jupyter notebook with an Arraylake connection, use the --env flag to pass your Arraylake credentials just as you would object storage credentials:
coiled notebook start --env ARRAYLAKE_TOKEN=ema_XXXXXXXX \
    --env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
    --env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY
╭─────────────────────── Notebook notebook-cd82d938... ────────────────────────╮
│ │
│ Jupyter: https://cluster-srmtf.dask.host/jupyter/lab?token=ewBg58I3CiH25Uma │
│ Details: https://cloud.coiled.io/clusters/537336?account=dask │
│ │
│ Ready ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ │
│ Synced local Python environment: arraylake │
│ Region: us-east-2 Uptime: 1m 10s │
│ VM Type: m6i.xlarge Approx cloud cost: $0.19/hr │
│ Approx cloud total: $0.00 │
│ │
│ Use Control-C to stop this notebook server │
╰──────────────────────────────────────────────────────────────────────────────╯
This launches a Jupyter notebook with your Arraylake credentials already configured, so you can connect to Arraylake right away.
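A quick sanity check to run in the first notebook cell: confirm that everything passed via --env actually reached the notebook environment. The helper below is a hypothetical sketch, and the AWS variable names assume the standard AWS credential variables:

```python
import os

def missing_env(required, env=os.environ):
    """Return the names of required environment variables that are not set."""
    return [name for name in required if name not in env]

# In the launched notebook this list should come back empty; here we
# demonstrate with an illustrative, partially populated environment.
print(missing_env(
    ["ARRAYLAKE_TOKEN", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    env={"ARRAYLAKE_TOKEN": "ema_XXXXXXXX"},
))
# ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY']
```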