Weights & Biases

Weights & Biases (W&B) is a popular MLOps platform for managing, tracking, and analyzing model training runs.

Weights & Biases lets you track and version the data used in a modeling run as an Artifact. Using W&B Artifacts, you can create a reference to your data in Arraylake so that it will be associated with the W&B run that uses it. Subsequent runs can then use this artifact to access Arraylake data.

The code snippet below demonstrates how to configure a W&B artifact to point to an Arraylake repo:

import wandb
import arraylake as al

# Connect to Arraylake and open a repo by its 'organization/repo' name
al_client = al.Client()
repo = al_client.get_repo("earthmover/wandb-demo")

# Connect to W&B and initialize a run
run = wandb.init(project="wandb-al-integration")

# Create a W&B dataset Artifact from repo
a = wandb.Artifact(
    name="my-arraylake-dataset-artifact",
    type="dataset",
    description="The data used for this run and associated arraylake metadata",
    metadata={
        "arraylake_repo": repo.repo_name,
        "arraylake_ref": str(repo.session.base_commit),
    },
)

# Add a reference to the dataset "location" in Arraylake.
# This is an alternative to exporting the repo and adding files to W&B directly.
# Note that W&B does not understand the arraylake:// scheme out of the box,
# so this reference is for internal tracking purposes only.
a.add_reference(
    uri=f"arraylake://{repo.repo_name}@{repo.session.base_commit}",
    name="arraylake",
)

# Save the artifact to W&B
run.log_artifact(a).wait()

# Close the run
run.finish()
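
Because W&B stores the arraylake:// reference as an opaque string, any code that wants to act on it has to parse it itself. The following is a minimal sketch of that round trip, assuming the same `arraylake://<organization>/<repo>@<commit>` layout used in the `add_reference` call above. The names `ArraylakeRef`, `format_arraylake_uri`, and `parse_arraylake_uri` are hypothetical helpers for illustration; they are not part of the Arraylake or W&B APIs.

```python
from typing import NamedTuple

# Custom scheme used for the tracking-only reference (not understood by W&B)
ARRAYLAKE_SCHEME = "arraylake://"


class ArraylakeRef(NamedTuple):
    """Hypothetical container for the two parts of the tracking URI."""
    repo_name: str  # "organization/repo"
    commit: str     # commit ID as a string


def format_arraylake_uri(repo_name: str, commit: str) -> str:
    """Build a URI in the same shape as the add_reference call."""
    return f"{ARRAYLAKE_SCHEME}{repo_name}@{commit}"


def parse_arraylake_uri(uri: str) -> ArraylakeRef:
    """Split an arraylake:// URI back into repo name and commit."""
    if not uri.startswith(ARRAYLAKE_SCHEME):
        raise ValueError(f"not an arraylake URI: {uri!r}")
    body = uri[len(ARRAYLAKE_SCHEME):]
    # Split on the last "@" so the repo name keeps its "organization/repo" form
    repo_name, sep, commit = body.rpartition("@")
    if not sep or not repo_name or not commit:
        raise ValueError(
            f"expected arraylake://<org>/<repo>@<commit>, got {uri!r}"
        )
    return ArraylakeRef(repo_name, commit)
```

For example, `parse_arraylake_uri(format_arraylake_uri("earthmover/wandb-demo", "abc123"))` yields the repo name and the commit as separate fields, where `"abc123"` is a placeholder commit ID rather than a real one.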

Once the run has finished, the artifact and its associated Arraylake metadata appear in the W&B UI:

Artifact in W&B

This artifact can then be used by subsequent runs to open and explore the data in the Arraylake repo.

# Initialize a new run in the same project containing the Arraylake dataset artifact
run2 = wandb.init(project="wandb-al-integration")

# Get the Artifact
run_artifact = run2.use_artifact("my-arraylake-dataset-artifact:latest", type="dataset")

# Use the Artifact metadata to check out the Arraylake repo at the recorded commit
# (al_client is the Arraylake client created in the first snippet)
repo_from_wandb = al_client.get_repo(run_artifact.metadata["arraylake_repo"], checkout=False)
repo_from_wandb.checkout(ref=run_artifact.metadata["arraylake_ref"])

# Open the repo data with Xarray
group = "air_temperature"
repo_from_wandb.to_xarray(group)
<xarray.Dataset> Size: 31MB
Dimensions:  (time: 2920, lat: 25, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float64 31MB ...
Attributes:
    Conventions:  COARDS
    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
    title:        4x daily NMC reanalysis (1948)

Be sure to close the W&B run when you are finished:

run2.finish()
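
A later run depends entirely on the two metadata keys written when the artifact was created, so it can be worth checking them before calling `get_repo` and `checkout`. The sketch below is one way to do that, assuming the `arraylake_repo` and `arraylake_ref` keys used in this guide. The helper `arraylake_target` is hypothetical and works on any plain mapping, so it does not need a live W&B or Arraylake connection.

```python
from typing import Mapping, Tuple

# Metadata keys written when the artifact was created earlier in this guide
REQUIRED_KEYS = ("arraylake_repo", "arraylake_ref")


def arraylake_target(metadata: Mapping[str, object]) -> Tuple[str, str]:
    """Return (repo_name, ref) from artifact metadata, or raise if incomplete."""
    # Treat absent and empty values alike: neither can be checked out
    missing = [key for key in REQUIRED_KEYS if not metadata.get(key)]
    if missing:
        raise KeyError(f"artifact metadata is missing: {', '.join(missing)}")
    return str(metadata["arraylake_repo"]), str(metadata["arraylake_ref"])
```

In the second run, `repo_name, ref = arraylake_target(run_artifact.metadata)` would then supply the arguments for `get_repo` and `checkout`, failing early with a clear message if the artifact was logged without Arraylake metadata.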