DataHub
DataHub is an open-source metadata platform that consolidates data discovery, observability, and governance across the modern data stack.
Arraylake integrates with DataHub via the arraylake-datahub ingestion source plugin.
The plugin crawls the Arraylake catalog over HTTPS and emits one DataHub Dataset per xarray-compatible Zarr group in your repos. Access control, credential vending, and querying stay in Arraylake — DataHub becomes the discovery and metadata-search surface, with externalUrl links back into the Arraylake web app for the actual data.
What you get in DataHub
For every xarray-compatible group, a Dataset named <org>/<repo>/<group_path> with:
- Schema: one field per Zarr array. Coordinates and data variables are distinguished via a classification flag in each field's jsonProps, alongside shape, chunk_shape, dimension_names, codecs, and the full CF attribute bag (GRIB_* keys are filtered as noise).
- Description: the group's CF title + summary when present, otherwise the repo description.
- externalUrl: direct link into the Arraylake page for that group.
- customProperties spread:
  - Arraylake metadata: provider, product_type, spatial/temporal coverage, spatial_resolution, update_freq, etc.
  - Storage: bucket platform/name/region, computed storage_uri.
  - CF group attributes: license, institution, creator/publisher, time and geospatial coverage, references, history.
  - Marketplace subscription details if the repo is from a listing.
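As a rough illustration of how the per-field jsonProps could be consumed downstream, the sketch below groups schema fields by their classification value. The sample fields, the field names, and the classification values ("coordinate", "data_variable") are hypothetical; only the key names (classification, shape, chunk_shape, dimension_names) come from the description above, and jsonProps is treated as a JSON-encoded string.

```python
import json
from collections import defaultdict

# Hypothetical schema fields; the classification values are assumed,
# not taken from the plugin's actual output.
fields = [
    {"fieldPath": "time", "jsonProps": json.dumps(
        {"classification": "coordinate", "shape": [8760],
         "chunk_shape": [720], "dimension_names": ["time"]})},
    {"fieldPath": "lat", "jsonProps": json.dumps(
        {"classification": "coordinate", "shape": [721],
         "chunk_shape": [721], "dimension_names": ["lat"]})},
    {"fieldPath": "t2m", "jsonProps": json.dumps(
        {"classification": "data_variable", "shape": [8760, 721],
         "chunk_shape": [720, 721], "dimension_names": ["time", "lat"]})},
]


def group_by_classification(schema_fields):
    """Group field paths by the classification value found in jsonProps."""
    groups = defaultdict(list)
    for field in schema_fields:
        # jsonProps is a JSON string; tolerate fields that lack it.
        props = json.loads(field.get("jsonProps") or "{}")
        groups[props.get("classification", "unknown")].append(field["fieldPath"])
    return dict(groups)


groups = group_by_classification(fields)
```

Grouping by whatever value is present, rather than hard-coding the expected values, keeps the sketch usable even if the real flag uses different strings.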
Repos with no xarray-compatible groups still emit one Dataset (repo landing only) so every catalog entry is discoverable.
Orphan repos — catalog entries whose underlying Icechunk storage no longer exists — are tagged with arraylake_storage_status=orphan.
Install
pip install arraylake-datahub 'acryl-datahub[datahub-rest]'
Compatible with acryl-datahub 0.15.x on Python 3.10–3.13.
One-time platform registration
Register the earthmover custom data platform in your DataHub instance once. This is what gives Arraylake datasets the Earthmover logo and a dedicated platform entry in DataHub's UI.
datahub put platform \
  --name earthmover \
  --display_name "Earthmover" \
  --logo https://app.earthmover.io/icon.svg
Run
Save the following as recipe.yml:
source:
  type: earthmover
  config:
    # token: ${ARRAYLAKE_TOKEN}   # default: read from env
    # api_url: https://api.earthmover.io
    orgs:                         # omit to crawl every org the token sees
      - earthmover-public
    repo_pattern:
      allow: [".*"]
      # deny: [".*-archive$"]
    env: PROD                     # DataHub fabric
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
    token: ${DATAHUB_GMS_TOKEN}
Then run:
export ARRAYLAKE_TOKEN=ema_xxxxxxxxxxxx
export DATAHUB_GMS_TOKEN=...
datahub ingest -c recipe.yml --dry-run # dry run: nothing is written to the sink
datahub ingest -c recipe.yml # for real
The most useful knobs are orgs (allowlist) and repo_pattern (regex allow/deny). Full config below.
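To make the allow/deny behaviour of repo_pattern concrete, here is a minimal standard-library sketch. It assumes the usual DataHub AllowDenyPattern semantics as I understand them: a deny match always wins, patterns are matched from the start of the string, and matching is case-insensitive. The function name repo_allowed and the repo names are hypothetical.

```python
import re


def repo_allowed(name, allow=(".*",), deny=()):
    """Return True if name matches an allow regex and no deny regex.

    Assumed semantics: deny takes precedence, and patterns are
    anchored at the start of the string (re.match), ignoring case.
    """
    if any(re.match(pattern, name, re.IGNORECASE) for pattern in deny):
        return False
    return any(re.match(pattern, name, re.IGNORECASE) for pattern in allow)


# The pattern is matched against "<org>/<repo>".
repos = ["earthmover-public/era5", "earthmover-public/era5-archive"]
kept = [r for r in repos if repo_allowed(r, deny=[r".*-archive$"])]
```

With the commented-out deny rule from the recipe enabled, only the non-archive repo would be crawled in this example.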
Configuration
| Field | Default | Notes |
|---|---|---|
| token | $ARRAYLAKE_TOKEN | Arraylake API token (ema_*). Read-only is sufficient. |
| api_url | https://api.earthmover.io | Arraylake catalog API base URL. |
| web_url | https://app.earthmover.io | Used for externalUrl when a repo's web_url is missing. |
| orgs | all visible | Allowlist of org slugs. Omit to crawl every org the token sees. |
| repo_pattern | allow .* | AllowDenyPattern matched against <org>/<repo>. |
| env | PROD | DataHub fabric segment of the Dataset URN. |
| platform | earthmover | Must match the platform registered above. |
| walk_max_workers | 8 | Parallel HTTP fetches per repo when walking groups. |
| request_timeout_s | 30 | |
| max_retries | 3 | |
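The platform and env settings, together with the dataset name, determine the Dataset URN. The sketch below shows how such a URN is laid out using DataHub's standard dataset URN format. It assumes the plugin uses <org>/<repo>/<group_path> verbatim as the name component; the helper dataset_urn and the example repo and group path are hypothetical.

```python
def dataset_urn(org, repo, group_path, platform="earthmover", env="PROD"):
    """Build a DataHub dataset URN for an Arraylake group.

    Assumes the dataset name is "<org>/<repo>/<group_path>" with no
    further escaping, which is not confirmed by the plugin's docs.
    """
    name = "/".join([org, repo, group_path.strip("/")])
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


urn = dataset_urn("earthmover-public", "era5", "surface/hourly")
```

Changing env or platform in the recipe changes the URN, so datasets ingested under different values appear as distinct entities in DataHub.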
Required Arraylake API access
The token needs read access to:
- GET /user/orgs
- GET /orgs/{org}/repos/paginated
- GET /repos/{org}/{repo}
- GET /repos/icechunk/{org}/{repo}/dataset-node
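When checking that a token can reach these endpoints, it can help to expand the path templates into full URLs first. The following sketch does only that, using the default api_url from the configuration table. The helper name endpoint_urls and the example org and repo are hypothetical, and the authentication header is deliberately left out because its scheme is not specified here.

```python
from urllib.parse import quote

# Path templates copied from the list of required endpoints.
ENDPOINT_TEMPLATES = [
    "/user/orgs",
    "/orgs/{org}/repos/paginated",
    "/repos/{org}/{repo}",
    "/repos/icechunk/{org}/{repo}/dataset-node",
]


def endpoint_urls(org, repo, api_url="https://api.earthmover.io"):
    """Expand each endpoint template into a full URL for one org/repo."""
    base = api_url.rstrip("/")
    org_q, repo_q = quote(org, safe=""), quote(repo, safe="")
    # str.format ignores unused keyword arguments, so templates that
    # lack {org} or {repo} pass through unchanged.
    return [base + t.format(org=org_q, repo=repo_q) for t in ENDPOINT_TEMPLATES]


urls = endpoint_urls("earthmover-public", "era5")
```

Each resulting URL can then be requested with an HTTP client of your choice; a 401 or 403 on any of them would suggest the token lacks the read access the plugin needs.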
Verification
After a successful ingest, in DataHub's UI you should see:
- The Earthmover platform with logo.
- One Dataset per xarray-compatible Zarr group, named <org>/<repo>/<group>.
- A Schema panel listing every coordinate and data variable with units and CF descriptions where available.
- A clickable "View in Source" link that lands on the Arraylake page for that group, where authentication and querying happen.