Arraylake Overview
Arraylake is a data lake platform for managing multidimensional arrays and metadata in the cloud.
What problem does Arraylake solve?
Existing cloud data warehouses and data lake platforms only understand tabular data, but most scientific data is best represented in a more complex data model — multidimensional arrays with rich metadata. This type of data is ubiquitous in weather, climate and geospatial domains. (It's also prevalent throughout bioinformatics, materials science, physics, and engineering.)
Teams that rely on scientific data for research or operations struggle to use cloud storage and computing effectively. Organizations spend significant time and money building and maintaining bespoke, in-house data management solutions.
Arraylake provides a home for your scientific data in the cloud, enabling your team to efficiently store, manage, query, and collaborate around your data.
Major Features
Cloud Native
Arraylake is a cloud native data management platform. That means that it was designed from the ground up to enable you to leverage the transformative possibilities of cloud computing. Specifically, Arraylake was designed around the following principles:
- Object storage as the primary data layer. All data are read and written directly via object storage, bypassing the expensive and limiting constraints of file-based access.
- Separation of storage and compute. Compute and storage in Arraylake scale independently.
- Open standards and formats. Arraylake is based around the Zarr format, a community standard of the Open Geospatial Consortium.
Data Catalog
Arraylake provides a central catalog for your organization's scientific data. With a native understanding of multidimensional array data and the Zarr data model,
Arraylake makes it easy to know what data you have, how it got there, and how it's changing over time. To learn more about how data are organized in Arraylake, check out the Arraylake data model.
Arraylake also offers rich metadata search capabilities, enabling you to search against arbitrary user-defined metadata fields. To learn more about search, check out Searching and Filtering.
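To illustrate the idea of searching against arbitrary user-defined metadata, here is a minimal pure-Python sketch. The dataset records, field names, and `search` helper are hypothetical examples for illustration, not Arraylake's actual API.

```python
# Toy illustration of metadata search: filter dataset records by arbitrary
# user-defined metadata fields. All names below are made-up examples.

datasets = [
    {"name": "era5-hourly", "metadata": {"domain": "climate", "resolution_km": 31}},
    {"name": "gfs-forecast", "metadata": {"domain": "weather", "resolution_km": 13}},
    {"name": "landsat-scenes", "metadata": {"domain": "geospatial", "resolution_km": 0.03}},
]

def search(records, **filters):
    """Return records whose metadata matches every key/value filter."""
    return [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in filters.items())
    ]

print([r["name"] for r in search(datasets, domain="weather")])
```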
Safe, Consistent Transactional Updates
Zarr is a great format for storing large, evolving scientific datasets, such as weather forecasts. But if you're already managing large volumes of Zarr data in the cloud, you might be aware of some of the consistency problems with Zarr.
Arraylake allows you to create and update Zarr datasets via atomic transactions with serializable isolation between snapshots. This makes your Zarr data safe to read and write concurrently from multiple uncoordinated processes. Writes to Arraylake are immutable, and writes are only visible to a single client session until committed. This allows you to use Zarr like a distributed database, rather than just a file format. This capability is essential for any team using Zarr data in production.
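The session-isolation idea above can be sketched with a toy store: writes land in a per-session staging area and only become visible to other readers when `commit()` publishes them atomically. This is a conceptual illustration, not Arraylake's implementation.

```python
# Minimal sketch of session-isolated, atomic commits over a chunk store.

class TransactionalStore:
    def __init__(self):
        self._committed = {}   # visible to every reader

    def session(self):
        return Session(self)

class Session:
    def __init__(self, store):
        self._store = store
        self._staged = {}      # visible only to this session

    def put(self, key, value):
        self._staged[key] = value

    def get(self, key):
        # A session sees its own staged writes first, then committed data.
        return self._staged.get(key, self._store._committed.get(key))

    def commit(self):
        # Publish all staged writes in one step.
        self._store._committed.update(self._staged)
        self._staged.clear()

store = TransactionalStore()
writer, reader = store.session(), store.session()

writer.put("temperature/0.0", b"new-chunk")
print(reader.get("temperature/0.0"))   # None: uncommitted writes are invisible
writer.commit()
print(reader.get("temperature/0.0"))   # b'new-chunk': visible after commit
```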
Version Control
Arraylake implements a version control system for your data and metadata, inspired by Git but specifically designed for the Zarr data model and for collaboration across multiple writers.
Some of the main features of the version control system include:
- Each Arraylake repo (Zarr hierarchy) has a single common version history
- Time travel between different states of your data (as represented by snapshots)
- Branches (mutable pointers to snapshots) allow you to easily prototype changes to your data and evolve datasets carefully (e.g. dev, staging, and prod branches)
- Tags (immutable pointers to snapshots) allow you to publish and reference specific versions that will never change (great for reproducibility and machine-learning workflows)
Learn more about these capabilities in our docs on version control.
Broad File Format Support
Arraylake's native format and internal data model is Zarr. However, other array file formats can also be stored in Arraylake. That's because, under the hood, most array file formats use an internal structure and data model that can be mapped 1:1 to the Zarr data model of groups, arrays, and chunks. We call this capability virtual files.
Under the hood, Arraylake uses the popular Kerchunk package to generate indexes ("references" in Kerchunk lingo) of different file formats. Arraylake automatically handles the storage of the references and integrates them seamlessly into your data catalog.
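To make the idea of references concrete, here is a sketch of what a Kerchunk-style reference set looks like: Zarr metadata keys map to inline JSON documents, while chunk keys map to (url, byte offset, length) triples pointing into an existing file in object storage. The file path and byte ranges below are made-up examples.

```python
import json

# Sketch of a Kerchunk-style reference set for a virtual file.

refs = {
    "version": 1,
    "refs": {
        # Zarr metadata is stored inline as JSON strings:
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temperature/.zarray": json.dumps({
            "shape": [720, 1440], "chunks": [360, 720],
            "dtype": "<f4", "compressor": None, "fill_value": None,
            "order": "C", "filters": None, "zarr_format": 2,
        }),
        # A chunk key maps to a byte range inside the original file
        # (hypothetical path and offsets):
        "temperature/0.0": ["s3://my-bucket/era5.nc", 8192, 1036800],
    },
}

url, offset, length = refs["refs"]["temperature/0.0"]
print(url, offset, length)
```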
Arraylake supports the following types of virtual files:
- Zarr V2 / V3 (i.e. references to Zarr data outside of Arraylake)
- NetCDF3 / NetCDF4
- HDF5
- GRIB
- TIFF / GeoTIFF
Learn more about these capabilities in our docs on virtual files.
Access Control and Data Governance
Arraylake makes it easy to control who can read and write to your data lake. You authenticate with Arraylake using a Single Sign-On (SSO) provider (e.g. Google) or your organization's custom Identity Provider. Permissions for repository access are configured based on your Arraylake user ID. Arraylake also supports API keys and service accounts. Arraylake's delegated credentials system means that users don't need any cloud-provider credentials in order to read or write data.
How does it work?
Arraylake stores multidimensional array data in cloud object storage (e.g. AWS S3). Arraylake is based on the Zarr data model, in which data are organized into groups, arrays, and metadata. Arrays are further split into chunks. (See Data Model for more details.)
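The group/array/chunk layout described above can be illustrated with a small helper that enumerates the object-store keys for a chunked array: the array's shape and chunk shape determine a grid of chunk indices, each stored as a separate object. The path and shapes below are illustrative, and the dot-separated key convention follows Zarr v2.

```python
from itertools import product
from math import ceil

def chunk_keys(array_path, shape, chunks):
    """Yield object-store keys for every chunk of a Zarr v2 array."""
    grid = [ceil(s / c) for s, c in zip(shape, chunks)]
    for idx in product(*(range(n) for n in grid)):
        yield f"{array_path}/{'.'.join(map(str, idx))}"

# A 720x1440 array split into 360x720 chunks -> a 2x2 grid of chunk objects.
keys = list(chunk_keys("climate/temperature", (720, 1440), (360, 720)))
print(keys)
# ['climate/temperature/0.0', 'climate/temperature/0.1',
#  'climate/temperature/1.0', 'climate/temperature/1.1']
```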
In standard Zarr, without Arraylake, clients store all of this information directly in the object store, with the metadata stored in .json files.
Arraylake keeps the Zarr data model but makes some changes to how the data and metadata are stored.
Version 1
With Arraylake, chunks are still stored in the object store, but Arraylake handles the metadata. This enables Arraylake to build catalogs, provide search, and layer many other services on top of your data. However, the raw data (chunks) are still accessed directly from the object store, enabling the highly scalable performance and relatively low cost that users have come to expect from cloud storage.
The use of a separate database for metadata (sometimes called a "Metastore") is what allows Arraylake to provide its advanced features.
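The metadata/chunk split can be sketched as a two-store read path: the client consults the metastore for array layout, then fetches chunk bytes directly from object storage. Both stores are simulated with dicts here; this is a conceptual illustration, not the actual architecture.

```python
# Toy sketch of the Version 1 split: array metadata lives in a "metastore"
# database while chunks stay in object storage.

metastore = {
    "temperature/.zarray": {"shape": [720, 1440], "chunks": [360, 720]},
}
object_store = {
    "temperature/0.0": b"chunk-bytes",
}

def read_chunk(array, index):
    # 1. Look up the array's layout in the metastore.
    meta = metastore[f"{array}/.zarray"]
    # 2. Fetch the chunk bytes straight from object storage (no proxying).
    data = object_store[f"{array}/{index}"]
    return meta, data

meta, data = read_chunk("temperature", "0.0")
print(meta["chunks"], data)
```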
Version 2 (Icechunk)
Beginning in 2025, Arraylake will migrate to an "open-core" architecture with the core storage engine and data format provided by the Icechunk project.
Icechunk is Earthmover's new open-source transactional storage engine for tensor / ND-array data designed for use on cloud object storage. Icechunk works together with Zarr, augmenting the Zarr core data model with features that enhance performance, collaboration, and safety in a multi-user cloud-computing context.
Arraylake Version 1 split customer data between object storage (chunks) and the Arraylake platform database (metadata). Icechunk allows us to provide all of the same features that Arraylake Version 1 provided around transactions and data version control, while storing all chunk data and metadata in the object store. In this new configuration, it's no longer necessary to talk to the Arraylake backend in order to read, write, and commit data; therefore, it's also possible for the format and implementation to be completely open source. This greatly simplifies the platform architecture and brings Arraylake more in line with industry trends. Architecturally, Icechunk works very similarly to Apache Iceberg or Delta Lake, the leading open table formats in the tabular-data world.
Read more about Icechunk in the Icechunk launch blog post or check out the Icechunk docs.