Performance Tuning
Arraylake is designed to help you read and write data as fast as your network connection allows. Performance with Arraylake is measured in throughput, specifically megabits per second (Mbps). Different machines and network configurations will obtain different levels of throughput. The most important factor is network proximity to your organization's cloud storage bucket(s). A secondary factor is the amount of CPU and memory available on your machine.
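To make the units concrete, here is a small sketch of the arithmetic behind an Mbps figure. The helper names and the 100 Mbps / 1 GB inputs are illustrative assumptions, not part of Arraylake.

```python
def mbps_to_megabytes_per_second(throughput_mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return throughput_mbps / 8


def transfer_seconds(size_bytes: int, throughput_mbps: float) -> float:
    """Estimate how long a transfer takes at a sustained throughput."""
    size_bits = size_bytes * 8
    return size_bits / (throughput_mbps * 1_000_000)


# Hypothetical inputs: a 1 GB dataset over a 100 Mbps link
rate = mbps_to_megabytes_per_second(100)
seconds = transfer_seconds(1_000_000_000, 100)
print(f"{rate} MB/s, about {seconds:.0f} s for 1 GB")
```

This is a best-case estimate: it assumes the throughput is sustained for the whole transfer and ignores per-request latency.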
Parameters to Tune
The main parameter that controls I/O performance is the amount of asyncio concurrency used by Zarr for read and write operations. The concurrency parameter can be configured as follows:
import zarr
# recommended value for a small machine with a slow connection to object storage
zarr.config.set({"async.concurrency": 16})
# recommended value for a high-performance machine in the cloud
zarr.config.set({"async.concurrency": 64})
Using too little concurrency may result in sub-optimal performance. But using too much concurrency may lead to errors related to network timeouts. For this reason, some tuning may be required.
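To illustrate what a concurrency limit means in practice, the following standard-library sketch caps the number of simulated chunk requests in flight at once. It is not Zarr's implementation: fetch_chunk, fetch_all, and the simulated latency are hypothetical stand-ins for real object storage requests.

```python
import asyncio


async def fetch_chunk(index, semaphore, state):
    # Hypothetical stand-in for one chunk request to object storage.
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulated network latency
        state["active"] -= 1
    return index


async def fetch_all(n_chunks, concurrency):
    # The semaphore allows at most `concurrency` requests in flight at once.
    semaphore = asyncio.Semaphore(concurrency)
    state = {"active": 0, "peak": 0}
    tasks = (fetch_chunk(i, semaphore, state) for i in range(n_chunks))
    results = await asyncio.gather(*tasks)
    return results, state["peak"]


results, peak = asyncio.run(fetch_all(n_chunks=100, concurrency=16))
print(f"fetched {len(results)} chunks, peak requests in flight: {peak}")
```

With a higher limit, more requests overlap and the total wall time drops, until the network or the storage service becomes the bottleneck and requests start to time out.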
Performance Benchmarking and Tuning Utility
Arraylake makes it easy to measure your performance and find an optimal I/O configuration via a command-line utility called arraylake repo tune.
Its options are as follows:
% al repo tune --help
Usage: al repo tune [OPTIONS] ORG_NAME
Tune I/O configuration for optimal performance
Examples
• Tune the organization with a specific bucket config nickname
$ arraylake repo tune my-org --bucket-config-nickname my-bucket-config
╭─ Arguments ──────────────────────────────────────────────────────────────────────╮
│ * org_name TEXT Name of organization │
│ [default: None] │
│ [required] │
╰──────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────╮
│ --bucket-config-nickname TEXT Chunkstore bucket config nickname │
│ [default: None] │
│ --help -h Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────╯
This utility will create a temporary repo to measure the I/O performance of the host where it is running. It will then perform a series of experiments which write and read data to find the optimal chunk size and concurrency. The script outputs a recommended configuration along with the maximum throughput achieved.
An example output from a laptop running on a home network is:
OptimalIOConfig(
read=IOConfig(async_concurrency=16, chunksize_bytes=100000, throughput_mbps=9.837626326920796),
write=IOConfig(async_concurrency=4, chunksize_bytes=100000, throughput_mbps=0.8309464158422604)
)
Here we can see that the network can barely deliver 10 Mbps of read throughput, and gives less than 1 Mbps of write throughput.
(This is quite terrible!)
In this case, we would set zarr.config.set({"async.concurrency": 16}).
In contrast, a large EC2 node located in the same AWS region as our storage bucket might give the following results:
OptimalIOConfig(
read=IOConfig(async_concurrency=32, chunksize_bytes=10000000, throughput_mbps=11640.287796111286),
write=IOConfig(async_concurrency=64, chunksize_bytes=10000000, throughput_mbps=10375.117613155975)
)
Here we are achieving throughput around 10 Gbps, on par with the network bandwidth of our EC2 instance.
In this case, we would set zarr.config.set({"async.concurrency": 64}).
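In both examples above, the chosen value equals the larger of the recommended read and write concurrencies. The sketch below encodes that observation. The rule is an inference from these two examples rather than a documented Arraylake policy, and the IOConfig dataclass here is a hypothetical stand-in that only mirrors the fields printed by the tuning utility.

```python
from dataclasses import dataclass


@dataclass
class IOConfig:
    # Hypothetical stand-in mirroring the fields shown in the tuning output.
    async_concurrency: int
    chunksize_bytes: int
    throughput_mbps: float


def pick_concurrency(read: IOConfig, write: IOConfig) -> int:
    """Use the larger recommendation (assumed rule, inferred from the examples)."""
    return max(read.async_concurrency, write.async_concurrency)


laptop = pick_concurrency(
    IOConfig(16, 100_000, 9.837626326920796),
    IOConfig(4, 100_000, 0.8309464158422604),
)
ec2 = pick_concurrency(
    IOConfig(32, 10_000_000, 11640.287796111286),
    IOConfig(64, 10_000_000, 10375.117613155975),
)
print(laptop, ec2)
# The result can then be passed to zarr.config.set({"async.concurrency": ...}).
```

If a workload is dominated by reads or by writes, using that side's recommended value alone may be the better choice.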