
Arraylake Overview

What problem does Arraylake solve?

Existing data warehouses and data lake platforms only understand tabular data, but most scientific data is best represented in a more complex data model — multidimensional arrays with rich metadata. Teams that use scientific data for research or operations today have to build their own foundational data infrastructure. Key features like data ingest, versioning, cataloging, and governance must be built from scratch.

Arraylake provides users with a home for their scientific data in the cloud, enabling teams to quickly derive insights from large, complex datasets. The Arraylake platform helps manage the full life cycle of scientific data in the cloud.

How does it work?

Arraylake is designed to make it easier to store multidimensional array data in cloud object storage (e.g. AWS S3). Arraylake is based on the Zarr data model, in which data are organized into groups, arrays, and metadata. Arrays are split into chunks.
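
To make the chunking idea concrete, here is a minimal sketch of how an array's shape and chunk shape determine its chunk grid and per-chunk keys. It uses only the Python standard library rather than the Zarr package, the array dimensions are hypothetical, and the dot-separated keys follow the Zarr V2 default convention.

```python
import itertools
import math


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension (edge chunks may be partial)."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


def chunk_keys(shape, chunks, separator="."):
    """Yield one key per chunk, e.g. '0.0.0', '1.0.0', ... (Zarr V2 style)."""
    grid = chunk_grid(shape, chunks)
    for index in itertools.product(*(range(n) for n in grid)):
        yield separator.join(str(i) for i in index)


# Hypothetical array: 365 daily time steps on a 180 x 360 grid,
# chunked into blocks of 100 time steps each.
shape = (365, 180, 360)
chunks = (100, 180, 360)

grid = chunk_grid(shape, chunks)
keys = list(chunk_keys(shape, chunks))
print(grid)  # chunks per dimension
print(keys)  # one key per stored chunk
```

Each key corresponds to one object in the object store, which is why the chunk shape has a direct effect on the number and size of stored objects.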

Without Arraylake, clients store all of this information directly in the object store, with the metadata stored in .json files. This can make it hard to search and explore a large archive of data. With Arraylake, chunks are still stored in the object store, but Arraylake handles the metadata. This enables Arraylake to build catalogs, provide search, and layer many other services on top of your data. However, the raw data (chunks) are still accessed directly from the object store, enabling the highly scalable performance and relatively low cost that users have come to expect from cloud storage.

The use of a separate database for metadata (sometimes called a "Metastore") is what allows Arraylake to provide all of the advanced features described below.
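
The split between metadata and chunks can be illustrated with a toy model. This is only a sketch under stated assumptions: the two dicts stand in for the metastore and the object store, and the function names, paths, and attributes are hypothetical rather than part of the Arraylake API.

```python
# Toy stand-ins: plain dicts, not Arraylake's actual storage layers.
metastore = {}     # path -> metadata document (small, searchable)
object_store = {}  # key  -> raw chunk bytes (large, opaque)


def write_array(path, attrs, chunk_data):
    """Record metadata in the metastore and chunk bytes in the object store."""
    metastore[path] = {"node_type": "array", "attributes": attrs}
    for chunk_key, data in chunk_data.items():
        object_store[f"{path}/{chunk_key}"] = data


def search(**criteria):
    """Find arrays by attribute using only the metastore; no chunk is read."""
    return sorted(
        path
        for path, doc in metastore.items()
        if all(doc["attributes"].get(k) == v for k, v in criteria.items())
    )


write_array("ocean/sst", {"units": "K", "source": "model"}, {"0.0": b"\x00" * 8})
write_array("ocean/salinity", {"units": "psu", "source": "model"}, {"0.0": b"\x01" * 8})
write_array("land/temperature", {"units": "K", "source": "obs"}, {"0.0": b"\x02" * 8})

kelvin_arrays = search(units="K")
print(kelvin_arrays)
```

Because the search touches only the metadata documents, its cost does not depend on how much chunk data sits in the object store.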

[Figure: Arraylake architecture]

Major Features

Data Catalog

Without Arraylake, it's very difficult to obtain an accurate catalog of your data in object storage. That's because object storage has no inherent understanding of the contents or structure of the data it stores; it's just a big key-value store. To solve this problem, teams usually build and maintain separate, independent catalogs of their data holdings. It can be challenging to keep these separate catalogs in sync with the actual data in object storage.

Arraylake solves this problem by keeping track of all the metadata in its own database at the time it's written. This enables users to easily explore the contents of their data lake and see what's actually inside. With Arraylake, you can browse the entire contents of your data lake using the standard Zarr API. Furthermore, since all of the metadata are in a database (instead of locked up inside binary files), we can provide rich metadata search over all the data in a repo.



info

We have many more catalog features on our roadmap, including:

  • Exposing data catalogs via standard REST APIs such as STAC and OGC
  • Interactive browsing via the Arraylake web application

Version Control

As a further advantage over raw object storage, Arraylake implements a version control system for your data and metadata, inspired by Git but designed specifically for the Zarr data model and for collaboration across multiple writers.

Some of the main features of the version control system include:

  • An Arraylake repo (Zarr hierarchy) shares a single common version history
  • Writes to Arraylake are immutable
  • Writes are only visible to a specific client session until committed
  • Serializable isolation between commits from different client sessions
  • Optimistic concurrency, so that multiple writers can safely write to the same data concurrently
  • Time travel between different states of your data as represented by commits
  • Data are deduplicated automatically and verified via content-addressable storage
  • Schema evolution for Zarr arrays and groups
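
Several of the ideas in the list above (content-addressable storage, session-local writes, optimistic concurrency, and time travel) can be sketched in a few lines. This is an illustrative toy, not Arraylake's implementation: the class and function names are hypothetical, and the conflict rule here simply rejects any commit whose session started from an outdated head.

```python
import hashlib


class ToyRepo:
    """Illustrative only: content-addressed chunks plus immutable commits."""

    def __init__(self):
        self.blobs = {}      # sha256 hex digest -> chunk bytes
        self.commits = [{}]  # commit id (list index) -> {chunk key: digest}

    def session(self):
        return ToySession(self, base=len(self.commits) - 1)


class ToySession:
    def __init__(self, repo, base):
        self.repo = repo
        self.base = base  # commit this session started from
        self.staged = {}  # uncommitted writes, visible only to this session

    def write(self, key, data):
        digest = hashlib.sha256(data).hexdigest()
        self.repo.blobs.setdefault(digest, data)  # identical bytes stored once
        self.staged[key] = digest

    def commit(self):
        head = len(self.repo.commits) - 1
        if head != self.base:  # optimistic check at commit time, no locks
            raise RuntimeError("conflict: another session committed first")
        self.repo.commits.append({**self.repo.commits[head], **self.staged})
        self.base, self.staged = head + 1, {}
        return self.base


def read(repo, key, commit_id):
    """Time travel: read a key as of any earlier commit."""
    return repo.blobs[repo.commits[commit_id][key]]


repo = ToyRepo()
a, b = repo.session(), repo.session()
a.write("temp/0.0", b"v1")
a.write("temp/0.1", b"v1")  # same bytes as above, so only one blob is stored
first = a.commit()

b.write("temp/0.0", b"v2")
try:
    b.commit()  # b started before a committed, so this is rejected
    conflict = False
except RuntimeError:
    conflict = True

c = repo.session()  # a fresh session based on the new head succeeds
c.write("temp/0.0", b"v2")
second = c.commit()
```

Old commits are never modified, so reading `temp/0.0` at `first` still returns the original bytes after `second` has been written.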

info

Our roadmap includes more version-control features, such as:

  • Support for multiple branches and tags
  • Merge commits

Flexible File Format Support

Zarr is Arraylake's native format and internal data model. However, other array file formats can also be stored in Arraylake. That's because most array file formats use an internal structure and data model that can be mapped 1:1 to the Zarr data model of groups, arrays, and chunks. We call this capability virtual files.

Under the hood, Arraylake uses the popular Kerchunk package to generate indexes ("references" in Kerchunk lingo) of different file formats. Arraylake automatically handles the storage of the references and integrates them seamlessly into your data catalog.
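
As a rough illustration of what a reference is, the sketch below builds a small reference set by hand in the style of Kerchunk's version 1 JSON layout, where metadata is stored inline as a string and each chunk is a `[url, offset, length]` triple pointing into an existing file. It does not call Kerchunk or Arraylake. The temporary local file stands in for a real NetCDF or HDF5 file in object storage, and the keys, offsets, and `read_ref` helper are assumptions made for the example.

```python
import json
import os
import tempfile

# Stand-in for an existing archival file: an 8-byte header, then two chunks.
payload = b"HEADER--" + b"chunk-00" + b"chunk-01"
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(payload)
    path = f.name

# Hand-written references: inline strings for metadata,
# [url, offset, length] for chunk bytes that stay in the original file.
references = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temp/0": [path, 8, 8],
        "temp/1": [path, 16, 8],
    },
}


def read_ref(refs, key):
    """Resolve a Zarr key to bytes, reading only the referenced byte range."""
    entry = refs["refs"][key]
    if isinstance(entry, str):  # inline content, e.g. JSON metadata
        return entry.encode()
    url, offset, length = entry
    with open(url, "rb") as src:
        src.seek(offset)
        return src.read(length)


first_chunk = read_ref(references, "temp/0")
second_chunk = read_ref(references, "temp/1")
group_meta = json.loads(read_ref(references, ".zgroup"))
os.remove(path)  # clean up the temporary stand-in file
```

The original file is never rewritten or copied; only the small index of offsets and lengths needs to be stored alongside the catalog.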

Arraylake supports the following types of virtual files:

  • Zarr V2 / V3 (i.e. references to Zarr data outside of Arraylake)
  • NetCDF3 / NetCDF4
  • HDF5
  • GRIB
  • TIFF / GeoTIFF

Access Control and Data Governance

Arraylake makes it easy to control who can read and write to your data lake. You authenticate with Arraylake using your organization's Identity Provider. Permissions for repository access are configured based on your Arraylake user ID. Users don't need any cloud-provider IAM credentials in order to read and write metadata.

note

Currently, cloud-provider IAM credentials are required to interact with the underlying chunk data. Removing this requirement is on our roadmap.