DataHub

DataHub is an open-source metadata platform that consolidates data discovery, observability, and governance across the modern data stack. Arraylake integrates with DataHub via the arraylake-datahub ingestion source plugin.

The plugin crawls the Arraylake catalog over HTTPS and emits one DataHub Dataset per xarray-compatible Zarr group in your repos. Access control, credential vending, and querying stay in Arraylake — DataHub becomes the discovery and metadata-search surface, with externalUrl links back into the Arraylake web app for the actual data.

What you get in DataHub

For every xarray-compatible group, a Dataset named <org>/<repo>/<group_path> with:

  • Schema — one field per Zarr array. Coordinates and data variables are distinguished via a classification flag in each field's jsonProps, alongside shape, chunk_shape, dimension_names, codecs, and the full CF attribute bag (GRIB_* keys are filtered as noise).
  • Description — the group's CF title + summary when present, otherwise the repo description.
  • externalUrl — direct link into the Arraylake page for that group.
  • customProperties, covering:
    • Arraylake metadata: provider, product_type, spatial/temporal coverage, spatial_resolution, update_freq, etc.
    • Storage: bucket platform/name/region, computed storage_uri.
    • CF group attributes: license, institution, creator/publisher, time and geospatial coverage, references, history.
    • Marketplace subscription details if the repo is from a listing.

Repos with no xarray-compatible groups still emit a single repo-level Dataset (a landing entry only), so every catalog entry is discoverable.

Orphan repos — catalog entries whose underlying Icechunk storage no longer exists — are tagged with arraylake_storage_status=orphan.
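
To make the naming scheme concrete, here is a minimal sketch of how a Dataset name and URN could be assembled from an org, a repo, and a group path. It assumes the plugin follows DataHub's standard dataset URN layout, urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>), and that a repo-level Dataset is named <org>/<repo>. The helpers dataset_name and dataset_urn and the repo name my-repo are hypothetical and not part of arraylake-datahub.

```python
def dataset_name(org: str, repo: str, group_path: str = "") -> str:
    """Build the <org>/<repo>/<group_path> name used for a Dataset."""
    parts = [org, repo]
    group = group_path.strip("/")  # tolerate leading/trailing slashes
    if group:
        parts.append(group)
    # Assumption: with no group path, the repo-level Dataset is <org>/<repo>.
    return "/".join(parts)


def dataset_urn(name: str, platform: str = "earthmover", env: str = "PROD") -> str:
    """Format a name as a standard DataHub dataset URN."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Hypothetical repo and group, for illustration only.
name = dataset_name("earthmover-public", "my-repo", "/forecasts/surface")
print(name)
print(dataset_urn(name))
```

A URN built this way is what you would paste into DataHub's search or API calls to look up a specific group, provided the platform and env values match your recipe.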

Install

pip install arraylake-datahub 'acryl-datahub[datahub-rest]'

Compatible with acryl-datahub 0.15.x on Python 3.10–3.13.

One-time platform registration

Register the earthmover custom data platform in your DataHub instance once. This is what gives Arraylake datasets the Earthmover logo and a dedicated platform entry in DataHub's UI.

datahub put platform \
  --name earthmover \
  --display_name "Earthmover" \
  --logo https://app.earthmover.io/icon.svg

Run

Save the following as recipe.yml:

source:
  type: earthmover
  config:
    # token: ${ARRAYLAKE_TOKEN}  # default: read from env
    # api_url: https://api.earthmover.io
    orgs:  # omit to crawl every org the token sees
      - earthmover-public
    repo_pattern:
      allow: [".*"]
      # deny: [".*-archive$"]
    env: PROD  # DataHub fabric

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
    token: ${DATAHUB_GMS_TOKEN}

Then run:

export ARRAYLAKE_TOKEN=ema_xxxxxxxxxxxx
export DATAHUB_GMS_TOKEN=...

datahub ingest -c recipe.yml --preview # limited sample run; use --dry-run instead to skip writing to the sink
datahub ingest -c recipe.yml # for real

The most useful knobs are orgs (allowlist) and repo_pattern (regex allow/deny). Full config below.
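
As a rough illustration of how a regex allow/deny filter such as repo_pattern typically behaves, the sketch below accepts an <org>/<repo> string only if it matches at least one allow pattern and no deny pattern. It is a simplified stand-in, not DataHub's AllowDenyPattern class: the function name repo_allowed is hypothetical, matching here is case-sensitive and anchored at the start of the string via re.match, and the real class may differ in details such as case handling.

```python
import re


def repo_allowed(full_name, allow=(".*",), deny=()):
    """Return True if full_name matches an allow regex and no deny regex."""
    # Deny wins: any matching deny pattern excludes the repo.
    if any(re.match(pattern, full_name) for pattern in deny):
        return False
    # Otherwise at least one allow pattern must match.
    return any(re.match(pattern, full_name) for pattern in allow)


# Hypothetical repo names, for illustration only.
for candidate in ["earthmover-public/era5", "earthmover-public/era5-archive"]:
    print(candidate, repo_allowed(candidate, deny=(".*-archive$",)))
```

Because the pattern is matched against the full <org>/<repo> string, an allow entry like "earthmover-public/.*" restricts by org as well as by repo name.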

Configuration

| Field | Default | Notes |
|---|---|---|
| token | $ARRAYLAKE_TOKEN | Arraylake API token (ema_*). Read-only is sufficient. |
| api_url | https://api.earthmover.io | Arraylake catalog API base URL. |
| web_url | https://app.earthmover.io | Used for externalUrl when a repo's web_url is missing. |
| orgs | all visible | Allowlist of org slugs. Omit to crawl every org the token sees. |
| repo_pattern | allow .* | AllowDenyPattern matched against <org>/<repo>. |
| env | PROD | DataHub fabric segment of the Dataset URN. |
| platform | earthmover | Must match the platform registered above. |
| walk_max_workers | 8 | Parallel HTTP fetches per repo when walking groups. |
| request_timeout_s | 30 | |
| max_retries | 3 | |

Required Arraylake API access

The token needs read access to:

  • GET /user/orgs
  • GET /orgs/{org}/repos/paginated
  • GET /repos/{org}/{repo}
  • GET /repos/icechunk/{org}/{repo}/dataset-node
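
If you want to check a token against these endpoints before running an ingest, a sketch along the following lines may help. It only builds the URLs and request objects and does not send anything. The helper names required_endpoints and build_request are hypothetical, the repo name my-repo is a placeholder, and the Bearer authorization header is an assumption about how the Arraylake API accepts tokens. The dataset-node endpoint may also need query parameters that are not shown here.

```python
from urllib.request import Request

API_URL = "https://api.earthmover.io"  # default api_url from the config table


def required_endpoints(org, repo, api_url=API_URL):
    """List the catalog URLs the plugin needs read access to."""
    base = api_url.rstrip("/")
    return [
        f"{base}/user/orgs",
        f"{base}/orgs/{org}/repos/paginated",
        f"{base}/repos/{org}/{repo}",
        f"{base}/repos/icechunk/{org}/{repo}/dataset-node",
    ]


def build_request(url, token):
    """Build (but do not send) an authenticated GET request."""
    # Assumption: the API accepts the token as a Bearer credential.
    return Request(url, headers={"Authorization": f"Bearer {token}"}, method="GET")


# Hypothetical org/repo and token, for illustration only.
for url in required_endpoints("earthmover-public", "my-repo"):
    request = build_request(url, "ema_xxxxxxxxxxxx")
    print(request.get_method(), request.full_url)
```

Passing each request to urllib.request.urlopen with a timeout and checking for a 200 response would turn this into a simple access check.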

Verification

After a successful ingest, in DataHub's UI you should see:

  • The Earthmover platform with logo.
  • One Dataset per xarray-compatible Zarr group, named <org>/<repo>/<group_path>.
  • A Schema panel listing every coordinate and data variable with units and CF descriptions where available.
  • A clickable "View in Source" link that lands on the Arraylake page for that group, where authentication and querying happen.