Arraylake Overview

Arraylake is a data lake platform for collaborating on multidimensional arrays and metadata in the cloud.

What problem does Arraylake solve?

Existing cloud data warehouses and data lake platforms only understand tabular data, but most scientific data is best represented in a more complex data model: multidimensional arrays with rich metadata. This type of data is ubiquitous in the weather, climate, and geospatial domains, and it is also prevalent throughout bioinformatics, materials science, physics, and engineering.

Teams that use scientific data for research or operations often struggle to use cloud storage and computing effectively. Many organizations spend significant time and money building and maintaining bespoke, in-house data management solutions.

Arraylake provides a home for your scientific data in the cloud, enabling your team to efficiently store, manage, query, and collaborate around your data.

Major Features

Cloud Native

Arraylake is a cloud native data management platform. That means that it was designed from the ground up to enable you to leverage the transformative possibilities of cloud computing. Specifically, Arraylake was designed around the following principles:

Object storage as the primary data layer. All data are read and written directly via object storage, bypassing the expensive and limiting constraints of file-based access.
Separation of storage and compute. Compute and storage in Arraylake always scale separately.
Open standards and formats. Arraylake is based around the Zarr format, a community standard of the Open Geospatial Consortium.

Data Catalog

Arraylake provides a central catalog for your organization's scientific data. With a native understanding of multidimensional array data and the Zarr data model, Arraylake makes it easy to know what data you have, how it got there, and how it's changing over time. To learn more about how data are organized in Arraylake, check out the Arraylake data model.

Safe, Consistent Transactional Updates

Zarr is a great format for storing large, evolving scientific datasets, such as weather forecasts. But if you're already managing large volumes of Zarr data in the cloud, you might be aware of some of the consistency problems with Zarr.

Icechunk allows you to create and update Zarr datasets via atomic transactions with serializable isolation between snapshots. This makes your Zarr data safe to read and write concurrently from multiple uncoordinated processes. Writes to Icechunk are immutable, and writes are only visible to a single client session until committed. This allows you to use Zarr like a distributed database, rather than just a file format. This capability is essential for any team using Zarr data in production.
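
To make the session and commit behaviour described above concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the Icechunk API: the names `ToyRepo`, `ToySession`, and `ConflictError` are hypothetical, and the rule "reject any commit from a stale session" is a simplifying assumption that says nothing about how Icechunk actually detects or resolves conflicts.

```python
class ConflictError(Exception):
    """Raised when a session tries to commit on top of a stale snapshot."""


class ToyRepo:
    """Hypothetical in-memory stand-in for a transactional store."""

    def __init__(self):
        self.snapshots = [{}]  # snapshot 0 is the empty initial state
        self.head = 0          # index of the latest committed snapshot

    def session(self):
        return ToySession(self)


class ToySession:
    def __init__(self, repo):
        self.repo = repo
        self.base = repo.head  # snapshot this session started from
        self.changes = {}      # uncommitted writes, private to this session

    def set(self, key, value):
        self.changes[key] = value

    def get(self, key):
        # A session sees its own pending writes plus its base snapshot only.
        if key in self.changes:
            return self.changes[key]
        return self.repo.snapshots[self.base].get(key)

    def commit(self):
        # Toy rule: refuse to commit if someone else committed first.
        if self.repo.head != self.base:
            raise ConflictError("repo changed since this session started")
        new_state = dict(self.repo.snapshots[self.base])
        new_state.update(self.changes)
        self.repo.snapshots.append(new_state)  # old snapshots are never modified
        self.repo.head = len(self.repo.snapshots) - 1
        self.base = self.repo.head
        self.changes = {}
        return self.repo.head


repo = ToyRepo()
writer = repo.session()
reader = repo.session()

writer.set("temperature/c/0", b"\x01\x02")
seen_before_commit = reader.get("temperature/c/0")  # writer's data is not visible yet
writer.commit()
seen_by_new_session = repo.session().get("temperature/c/0")
```

In this sketch, `reader` keeps seeing the snapshot it started from even after `writer` commits, while a session opened after the commit sees the new data. A real system also has to make the commit step atomic across processes, which an in-memory toy cannot show.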

Learn more about these capabilities in the Icechunk docs on parallel writes.

Version Control

Icechunk implements a version control system for your data and metadata, inspired by Git but specifically designed for the Zarr data model and for collaboration across multiple writers.

Some of the main features of the version control system include:

An Icechunk repo (Zarr hierarchy) shares a single common version history
Time travel between different states of your data (as represented by snapshots)
Branches (mutable pointers to snapshots) allow you to easily prototype changes to your data and evolve datasets carefully (e.g. dev, staging, and prod branches)
Tags (immutable pointers to snapshots) allow you to publish and reference specific versions that will never change (great for reproducibility and machine-learning workflows)
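
The relationship between snapshots, branches, and tags listed above can be sketched with a small toy model. This is plain Python for illustration only, not the Icechunk API; `ToyVersionedStore` and its methods are hypothetical names, and storing a full copy of the state per snapshot is a simplification.

```python
class ToyVersionedStore:
    """Hypothetical toy model of snapshots, branches, and tags."""

    def __init__(self):
        self.snapshots = {0: {}}     # snapshot id -> full state at that point
        self.branches = {"main": 0}  # mutable pointers to snapshots
        self.tags = {}               # immutable pointers to snapshots
        self._next_id = 1

    def commit(self, branch, changes):
        """Create a new snapshot and move the branch pointer to it."""
        state = dict(self.snapshots[self.branches[branch]])
        state.update(changes)
        snapshot_id = self._next_id
        self._next_id += 1
        self.snapshots[snapshot_id] = state
        self.branches[branch] = snapshot_id
        return snapshot_id

    def create_branch(self, name, snapshot_id):
        if name in self.branches:
            raise ValueError(f"branch {name!r} already exists")
        self.branches[name] = snapshot_id

    def create_tag(self, name, snapshot_id):
        # Tags can be created once and never moved afterwards.
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists")
        self.tags[name] = snapshot_id

    def read(self, ref):
        """Read the state at a branch name, tag name, or snapshot id."""
        if isinstance(ref, int):
            snapshot_id = ref
        elif ref in self.branches:
            snapshot_id = self.branches[ref]
        else:
            snapshot_id = self.tags[ref]
        return dict(self.snapshots[snapshot_id])


store = ToyVersionedStore()
first = store.commit("main", {"forecast": "run 1"})
store.create_tag("v1", first)      # published version that will not change
store.create_branch("dev", first)  # prototype changes away from main
store.commit("dev", {"forecast": "run 2 (experimental)"})
store.commit("main", {"forecast": "run 3"})
```

After these calls, `store.read("v1")` and `store.read(first)` both return the first committed state (time travel), while `main` and `dev` have each moved on independently.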

Learn more about these capabilities in the Icechunk docs on version control.

Broad File Format Support

Arraylake's native format and internal data model is Zarr/Icechunk. However, other array file formats can also be stored as Zarr/Icechunk. That's because, under the hood, most array file formats use an internal structure and data model that can be mapped 1:1 to the Zarr data model of groups, arrays, and chunks. We call this capability virtual files.
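
One way to picture this mapping is a manifest that points each Zarr chunk at a byte range inside an existing file. The sketch below assumes exactly that, using only the Python standard library; the manifest layout, the `read_virtual_chunk` helper, and the stand-in `legacy.bin` file are all hypothetical and are not taken from the Icechunk or Arraylake APIs.

```python
import os
import tempfile


def read_virtual_chunk(manifest, chunk_key):
    """Read one chunk by following its (path, offset, length) reference."""
    path, offset, length = manifest[chunk_key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in "legacy file": a 4-byte header followed by two 8-byte chunks.
    path = os.path.join(tmp, "legacy.bin")
    with open(path, "wb") as f:
        f.write(b"HDR!" + b"A" * 8 + b"B" * 8)

    # Hypothetical manifest: chunk key -> (file path, byte offset, byte length).
    manifest = {
        "temperature/c/0": (path, 4, 8),
        "temperature/c/1": (path, 12, 8),
    }

    first_chunk = read_virtual_chunk(manifest, "temperature/c/0")
    second_chunk = read_virtual_chunk(manifest, "temperature/c/1")
```

The original file is never rewritten in this sketch; only the small manifest is new, which is what makes the approach attractive for large existing archives.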

Learn more about these capabilities in the Icechunk docs on virtual datasets.

Access Control and Data Governance

Arraylake makes it easy to control who can read and write to your data lake. You authenticate with Arraylake using a Single-Sign-On (SSO) provider (e.g. Google) or your organization's custom Identity Provider. Permissions for repository access are configured based on your Arraylake user ID. Arraylake supports API keys and service accounts. Arraylake's delegated credentials system means that users don't need any cloud-provider credentials in order to read or write data.

How does it work?

Arraylake catalogs multidimensional array data stored in cloud object storage (e.g. AWS S3). Arraylake data is stored in the Icechunk format, which is based on the Zarr data model, in which data are organized into groups, arrays, and metadata. Arrays are further split into chunks. (See Data Model for more details.)
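
As a rough illustration of how an array is split into chunks, the sketch below computes a chunk grid and locates the chunk that holds a given element. The array path, shapes, and helper names are made up for the example. The `c/<i>/<j>/...` key layout mirrors the default chunk key encoding in Zarr v3 and is used here only as a labelling convention; it is not a statement about how Icechunk arranges objects in storage.

```python
import math
from itertools import product


def chunk_grid_shape(array_shape, chunk_shape):
    """Number of chunks along each dimension (edge chunks may be partial)."""
    return tuple(math.ceil(n / c) for n, c in zip(array_shape, chunk_shape))


def chunk_index(element_index, chunk_shape):
    """Grid coordinates of the chunk that contains a given element."""
    return tuple(i // c for i, c in zip(element_index, chunk_shape))


def chunk_key(array_path, index):
    """Label a chunk as '<array path>/c/<i>/<j>/...' (assumed key layout)."""
    return array_path + "/c/" + "/".join(str(i) for i in index)


# Hypothetical array: (time, lat, lon) with 100 x 360 x 720 element chunks.
array_shape = (1000, 720, 1440)
chunk_shape = (100, 360, 720)

grid = chunk_grid_shape(array_shape, chunk_shape)
all_keys = [
    chunk_key("era5/temperature", idx)
    for idx in product(*(range(n) for n in grid))
]
key_for_element = chunk_key(
    "era5/temperature", chunk_index((250, 400, 100), chunk_shape)
)
```

Because each chunk is addressed independently, a reader that needs one element only has to fetch the single chunk that contains it rather than the whole array.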

Whilst your actual data lives in your object storage, Arraylake stores metadata about your data, and a record of who is allowed to view and access your data and how.
