Arraylake is a data lake platform for managing multidimensional arrays and metadata in the cloud.
What problem does Arraylake solve?
Existing data warehouses and data lake platforms only understand tabular data, but most scientific data is best represented in a more complex data model — multidimensional arrays with rich metadata. Teams that use scientific data for research or operations today have to build their own foundational data infrastructure. Key features like data ingest, versioning, cataloging, and governance must be built from scratch.
Arraylake provides users with a home for their scientific data in the cloud, enabling teams to quickly derive insights from large, complex datasets. The Arraylake platform helps manage the full life cycle of scientific data in the cloud.
How does it work?
Arraylake is designed to make it easier to store multidimensional array data in cloud object storage (e.g. AWS S3). Arraylake is based on the Zarr data model, in which data are organized into groups, arrays, and metadata. Arrays are split into chunks.
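The Zarr layout described above can be sketched with plain Python: a key-value store holding small JSON metadata documents plus binary chunk objects. This is an illustrative toy, not the `zarr` library itself; the key names follow the Zarr V2 convention.

```python
import json

# Toy key-value store mimicking how the Zarr V2 data model lays out a
# hierarchy in object storage: JSON metadata documents plus binary chunks.
store = {}

# Group metadata at the root of the hierarchy.
store[".zgroup"] = json.dumps({"zarr_format": 2})

# Array metadata: a 4x4 float64 array split into 2x2 chunks.
store["temperature/.zarray"] = json.dumps({
    "zarr_format": 2,
    "shape": [4, 4],
    "chunks": [2, 2],
    "dtype": "<f8",
})

# Each chunk is stored under a key derived from its position in the
# chunk grid (2*2 float64 values = 32 bytes per chunk).
store["temperature/0.0"] = b"\x00" * 32
store["temperature/0.1"] = b"\x00" * 32

meta = json.loads(store["temperature/.zarray"])
print(meta["shape"], meta["chunks"])  # [4, 4] [2, 2]
```

Note that both the metadata documents and the chunks live side by side in the same flat key space, which is exactly the situation the next section addresses.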
Without Arraylake, clients store all of this information directly in the object store, with the metadata stored in small JSON documents alongside the chunks.
This can make it hard to search and explore a large archive of data.
With Arraylake, chunks are still stored in the object store, but Arraylake handles the metadata.
This enables Arraylake to build catalogs, provide search, and layer many other services on top of your data.
However, the raw data (chunks) are still accessed directly from the object store, enabling the highly scalable
performance and relatively low cost that users have come to expect from cloud storage.
The use of a separate database for metadata (sometimes called a "Metastore") is what allows Arraylake to provide all of the advanced features described below.
Without Arraylake, it's very difficult to obtain an accurate catalog of your data in object storage. That's because object storage has no inherent understanding of the contents or structure of the data it stores; it's just a big key-value store. To solve this problem, teams usually build and maintain separate, independent catalogs of their data holdings. It can be challenging to keep these separate catalogs in sync with the actual data in object storage.
Arraylake solves this problem by keeping track of all the metadata in its own database at the time it's written. This enables users to easily explore the contents of their data lake and see what's actually inside. With Arraylake, you can browse the entire contents of your data lake using the standard Zarr API. Furthermore, since all of the metadata are in a database (instead of locked up inside binary files), we can provide rich metadata search over all the data in a repo.
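The metastore split can be sketched in a few lines of Python: metadata lives in a queryable database while chunks stay in object storage. Plain dicts stand in for both here; the helper names (`write_array`, `search`) are hypothetical, chosen only to show why searchable metadata falls out of this design.

```python
import json

object_store = {}   # holds only binary chunks, keyed by path
metastore = {}      # holds all metadata documents, queryable

def write_array(path, shape, chunks, attrs, chunk_bytes):
    """Record array metadata in the metastore; send chunks to object storage."""
    metastore[path] = {"shape": shape, "chunks": chunks, "attrs": attrs}
    for key, data in chunk_bytes.items():
        object_store[f"{path}/{key}"] = data

write_array("ocean/sst", [4, 4], [2, 2],
            {"units": "degC", "long_name": "sea surface temperature"},
            {"0.0": b"\x00" * 32})

def search(term):
    """Metadata search is possible because metadata sits in a database,
    not locked up inside binary files in the object store."""
    return [path for path, doc in metastore.items()
            if term in json.dumps(doc["attrs"])]

print(search("temperature"))  # ['ocean/sst']
```

The key point is the asymmetry: every read or write of metadata goes through the database, while bulk chunk traffic still flows directly to and from object storage.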
We have lots more catalog features on our roadmap, including
- Exposing data catalogs via standard REST APIs such as STAC and OGC
- Interactive browsing via the Arraylake web application
As a further advantage over raw object storage, Arraylake implements a version control system for your data and metadata inspired by Git but specifically designed for the Zarr data model and for collaboration across multiple writers.
Some of the main features of the version control system include:
- All data in an Arraylake repo (a Zarr hierarchy) share a single common version history
- Writes to Arraylake are immutable
- Writes are only visible to a specific client session until committed
- Serializable isolation between commits from different client sessions
- Optimistic concurrency, so that multiple writers can safely write to the same data concurrently
- Time travel between different states of your data as represented by commits
- Data are deduplicated automatically and verified via content-addressable storage
- Schema evolution for Zarr arrays and groups
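A few of the features above (content-addressable deduplication, immutable commits, and time travel) can be sketched together in pure Python. The data structures here are hypothetical stand-ins for illustration, not Arraylake's actual storage format.

```python
import hashlib

chunk_store = {}   # hash -> bytes: content-addressable chunk storage
commits = []       # list of immutable snapshots: {logical key: chunk hash}

def put_chunk(data: bytes) -> str:
    """Store a chunk under the hash of its contents.
    Identical chunks map to the same key, so dedup is automatic."""
    digest = hashlib.sha256(data).hexdigest()
    chunk_store[digest] = data
    return digest

def commit(manifest: dict) -> int:
    """Snapshot the current key->hash mapping as an immutable commit."""
    commits.append(dict(manifest))
    return len(commits) - 1

# Session 1: write two chunks with identical bytes, then commit.
manifest = {"a/0": put_chunk(b"zeros"), "a/1": put_chunk(b"zeros")}
c0 = commit(manifest)

# Session 2: overwrite one chunk and commit again.
manifest["a/1"] = put_chunk(b"ones")
c1 = commit(manifest)

print(len(chunk_store))                 # 2 -- duplicate chunks stored once
print(chunk_store[commits[c0]["a/1"]])  # b'zeros' -- time travel to commit 0
print(chunk_store[commits[c1]["a/1"]])  # b'ones'  -- the latest state
```

Because commits never mutate earlier snapshots, reading an old commit is just a dictionary lookup, and verifying a chunk is just re-hashing its bytes.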
Our roadmap includes more version-control features, such as
- Support for multiple branches and tags
- Merge commits
Flexible File Format Support
Arraylake's native format and internal data model is Zarr. However, other array file formats can also be stored in Arraylake. That's because, under the hood, most array file formats use an internal structure and data model that can be mapped 1:1 to the Zarr data model of groups, arrays, and chunks.
Under the hood, Arraylake uses the popular Kerchunk package to generate indexes ("references" in Kerchunk lingo) of different file formats. Arraylake automatically handles the storage of the references and integrates them seamlessly into your data catalog.
If you already have Zarr V2 datasets, HDF5 / NetCDF4 files, or NetCDF3 files in object storage, you can make them available directly in Arraylake, without copying or duplicating the data! 🎉
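The reference mechanism can be sketched as follows: instead of copying data, an index maps each virtual Zarr chunk key to a (source file, byte offset, length) triple. A local bytes blob stands in for a NetCDF/HDF5 file in object storage, and the URL and key names are illustrative, not real Kerchunk output.

```python
# Pretend this blob is a legacy HDF5/NetCDF file: a 3-byte header
# followed by 64 bytes of array data.
legacy_file = b"HDR" + bytes(range(64))

# Reference index: virtual chunk key -> (url, byte offset, length).
references = {
    "temp/0.0": ("s3://bucket/legacy.nc", 3, 32),
    "temp/0.1": ("s3://bucket/legacy.nc", 35, 32),
}

def read_chunk(key):
    """Resolve a virtual chunk key through the reference index.
    A real client would issue a ranged GET against `url`; here we
    just slice the local stand-in bytes."""
    url, offset, length = references[key]
    return legacy_file[offset:offset + length]

print(len(read_chunk("temp/0.0")))  # 32
```

The file itself is never rewritten; the index alone makes its internal chunks addressable as if they were native Zarr chunks.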
Other formats, such as GRIB and TIFF / GeoTIFF / COG, are on our roadmap.
Access Control and Data Governance
Arraylake makes it easy to control who can read and write to your data lake. You authenticate with Arraylake using your organization's Identity Provider. Permissions for repository access are configured based on your Arraylake user ID. Users don't need any cloud-provider IAM credentials in order to read and write metadata.
Currently, cloud-provider IAM credentials are still required to interact with the underlying chunk data. Removing this requirement is on our roadmap.