
Garbage Collection and Expiration

Arraylake provides automated data lifecycle management through Icechunk's expiration and garbage collection features. These processes help optimize storage costs and performance by removing old, unnecessary data from your repositories.

Expiration removes old snapshots from your repository, while Garbage Collection (GC) permanently deletes the underlying data files that are no longer referenced. Both operations are irreversible, so use them with care.

When expiration runs, it consolidates your repository's commit history by removing old snapshots from the history tree, similar to how a git squash operation works. This process helps reduce commit history while preserving your data's current state. Old snapshots that are no longer part of the history can later be deleted by garbage collection, which will free their storage space.
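To make the effect of expiration concrete, here is a toy model in plain Python. It is a sketch, not Icechunk's implementation: the `Snapshot` class, the `expire` function, and the assumption of a single linear history whose newest snapshot is always kept are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for one commit in a linear history."""
    id: str
    written_at: datetime


def expire(history: list[Snapshot], older_than: datetime) -> tuple[list[Snapshot], list[Snapshot]]:
    """Split a newest-first history into (kept, expired).

    The newest snapshot (the branch tip) is always kept so the current
    state of the data survives. This is an assumption of the toy model.
    """
    tip, rest = history[0], history[1:]
    kept = [tip] + [s for s in rest if s.written_at >= older_than]
    expired = [s for s in rest if s.written_at < older_than]
    return kept, expired


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
history = [
    Snapshot("tip", now - timedelta(days=1)),
    Snapshot("batch-2", now - timedelta(days=10)),
    Snapshot("batch-1", now - timedelta(days=40)),
    Snapshot("batch-0", now - timedelta(days=41)),
]

kept, expired = expire(history, older_than=now - timedelta(days=30))
print([s.id for s in kept])     # ['tip', 'batch-2']
print([s.id for s in expired])  # ['batch-1', 'batch-0']
```

In this model the expired snapshots are only removed from the history. Their storage is not reclaimed until a later garbage collection pass.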

Repository commit history after expiration

Example of expiration consolidating history: This repo originally had thousands of commits like "chunk batch 0" to "chunk batch 3107". After expiration, they were consolidated, reducing history and improving performance.

For detailed information about how these features work, see the Icechunk expiration documentation and this comprehensive guide.

Arraylake runs expiration and GC jobs automatically on configured repositories using Earthmover-managed compute resources. There's no additional compute cost to you.

Optimization Window

The optimization window is a scheduled time period during which Arraylake performs maintenance operations on your repository to improve performance and reduce storage costs. Currently, these optimizations include data expiration and garbage collection.

During the optimization window, it's still safe to read and write from your repository using branches and recent snapshots. However, you should avoid attempting to access old snapshots that are scheduled for expiration during the optimization window, as this may lead to errors.

Key Requirements:

  • The window must be at least 4 hours long
  • Configure the window during times of low repository activity to minimize disruption
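As a rough illustration of the 4-hour minimum, the check below validates a window given as a start and end time of day. This is a sketch under assumptions: the source does not say how the window is represented, so the daily start/end format and the names `window_length` and `is_valid_window` are hypothetical.

```python
from datetime import time, timedelta

MIN_WINDOW = timedelta(hours=4)  # minimum stated in the requirements above


def window_length(start: time, end: time) -> timedelta:
    """Length of a daily window, allowing it to wrap past midnight."""
    start_s = start.hour * 3600 + start.minute * 60 + start.second
    end_s = end.hour * 3600 + end.minute * 60 + end.second
    # Modulo one day handles windows such as 22:00 to 03:00.
    return timedelta(seconds=(end_s - start_s) % 86400)


def is_valid_window(start: time, end: time) -> bool:
    """True if the window meets the 4-hour minimum."""
    return window_length(start, end) >= MIN_WINDOW


print(is_valid_window(time(22, 0), time(3, 0)))  # True: 5 hours across midnight
print(is_valid_window(time(1, 0), time(4, 0)))   # False: only 3 hours
```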

The optimization window for your organization can be configured in the organization settings.

Organization optimization window

Set the organization-wide optimization window in the organization settings.

Configuring GC and Expiration for Individual Repos

Expiration and garbage collection are configured on a repository-by-repository basis through the "Optimization" tab in the repository settings within the Arraylake UI.

Repo optimization settings

Configure repo optimizations in the Optimizations tab located in the repo settings.

Enabling and Configuring Jobs

  1. Enable the feature: Check "Enabled" to activate expiration or GC for the repository
  2. Configure job frequency: Set how often Arraylake should run the GC or expiration job
    • If expiration and GC are scheduled to run on the same cadence, expiration will always be scheduled to run before GC.
  3. Set data age thresholds:
    • For expiration: Configure how old a snapshot must be before it's expired
    • For garbage collection: Configure how old data must be before it's permanently deleted

    info

    Important: Garbage collection will never delete data that is accessible from a non-expired snapshot, regardless of how old the data is. Only data that is no longer referenced by any active snapshot will be considered for deletion.

warning

Garbage collection actually deletes data from the backend storage bucket and is irreversible. Use with caution.
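The garbage collection rule described above combines two conditions: an object must be unreferenced by every remaining snapshot, and it must be older than the configured threshold. The sketch below models that rule in plain Python. It is not Icechunk's or Arraylake's code; the `collectable` function and the dictionary layout are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def collectable(
    objects: dict[str, datetime],
    live_snapshots: dict[str, set[str]],
    older_than: datetime,
) -> set[str]:
    """Return keys of objects that are both unreferenced and old enough."""
    # Everything reachable from any non-expired snapshot is protected.
    referenced = set().union(*live_snapshots.values())
    return {
        key
        for key, created_at in objects.items()
        if key not in referenced and created_at < older_than
    }


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
objects = {
    "chunk-a": now - timedelta(days=400),  # old, but still referenced
    "chunk-b": now - timedelta(days=400),  # old and unreferenced
    "chunk-c": now - timedelta(days=2),    # unreferenced, but too recent
}
live_snapshots = {"tip": {"chunk-a"}}

deletable = collectable(objects, live_snapshots, older_than=now - timedelta(days=30))
print(deletable)  # {'chunk-b'}
```

Note that `chunk-a` survives despite its age because a live snapshot still references it.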

Available Configuration Options:

  • Data age thresholds: 1 week, 2 weeks, 1 month, 3 months, 6 months, 1 year
  • Job frequency: 1 week, 2 weeks, 1 month, 3 months, 6 months, 1 year

If you need different settings, contact Earthmover support.
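To see how an age threshold translates into a cutoff timestamp, consider the sketch below. The durations are assumptions for illustration only: the source lists the option labels but not their exact lengths, so treating "1 month" as 30 days (and so on) is a guess, and `OPTIONS` and `cutoff` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Assumed durations for each label; the source does not define them.
OPTIONS = {
    "1 week": timedelta(weeks=1),
    "2 weeks": timedelta(weeks=2),
    "1 month": timedelta(days=30),
    "3 months": timedelta(days=90),
    "6 months": timedelta(days=180),
    "1 year": timedelta(days=365),
}


def cutoff(option: str, now: datetime) -> datetime:
    """Anything written before the returned time counts as old enough."""
    if option not in OPTIONS:
        raise ValueError(f"unsupported option: {option!r}")
    return now - OPTIONS[option]


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
print(cutoff("2 weeks", now).date())  # 2025-01-17
```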

Expiration settings configuration modal

Configure expiration settings by clicking the "Configure" button. Set how old versions must be before expiration and how frequently the job should run.

Monitoring and History

Arraylake lets you monitor garbage collection and expiration runs through job history, schedule tracking, and metrics plots.

History and Schedule

Once configured, the "View History" button in the repo settings optimization tab shows:

  • The next scheduled run based on your configuration settings
  • Previous job runs with status indicators:
    • 🟢 Finished: Job completed successfully
    • 🟡 Timeout: Job exceeded time limits
    • 🔴 Failed: Job encountered an error

If you see timeout or failed statuses, contact Earthmover support for investigation.

Scheduling System Limitations

Optimization scheduling is a new feature. Known limitations include:

  • Disabling expiration or GC after a job is already scheduled will not prevent the scheduled job from executing.
  • There is currently no mechanism to unschedule queued jobs.
  • The scheduler may fail to schedule the next job if the previous job starts after the optimization window.
  • Running jobs outside the optimization window may affect subsequent scheduling. If the next job is not already queued, the scheduler calculates its time from the manual run time plus the configured cadence, then aligns it with the next optimization window.

We are actively addressing these scheduling edge cases. If job scheduling behavior does not meet your expectations, contact Earthmover support.

Recent expiration jobs history

The "View History" button shows recent job runs with their scheduling information, execution times, and status. This example shows one scheduled job and two completed expiration jobs.

Metrics

Arraylake provides built-in metrics that track data that was expired or garbage collected for your repositories and organization. These metrics are available through plots in the Arraylake UI.

Available Metrics:

Garbage Collection Metrics:

  • Bytes Deleted: Tracks the amount of data (in bytes) removed from storage during GC operations
    • Shows the actual storage savings achieved over time

Expiration Metrics:

  • Snapshots Expired: Tracks the number of snapshots that have been expired from your repository
    • Helps you understand how many historical versions are being removed

Viewing Metrics:

Metrics are available at two levels:

Repository Level:

  • View metrics for individual repositories through the repository's metrics dashboard
  • Shows detailed data about what has been deleted for that specific repository

Organization Level:

  • View aggregated metrics across all repositories in your organization
  • Represents the sum of all repository-level metrics, giving you an organization-wide view of total deletions
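The relationship between the two levels is a plain sum. The sketch below shows that aggregation with made-up numbers; the metric keys `bytes_deleted` and `snapshots_expired`, the repo names, and the `org_totals` function are hypothetical and do not reflect an Arraylake API.

```python
from collections import Counter

# Illustrative per-repository values, not real measurements.
repo_metrics = {
    "repo-a": {"bytes_deleted": 1_500, "snapshots_expired": 12},
    "repo-b": {"bytes_deleted": 500, "snapshots_expired": 3},
}


def org_totals(per_repo: dict[str, dict[str, int]]) -> dict[str, int]:
    """Sum each metric across all repositories."""
    totals: Counter[str] = Counter()
    for metrics in per_repo.values():
        totals.update(metrics)  # adds values key by key
    return dict(totals)


print(org_totals(repo_metrics))
# {'bytes_deleted': 2000, 'snapshots_expired': 15}
```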

These time-series plots allow you to:

  • Monitor what your optimization jobs have accomplished
  • Track storage savings and cleanup activity over time
  • Identify patterns in data growth and cleanup
  • Validate that your GC and expiration configurations are working as expected

Metrics are updated after an optimization job completes, providing near real-time visibility into your data lifecycle management.

Optimization metrics dashboard

Metrics dashboard showing time-series plots for both garbage collection (bytes deleted) and snapshot expiration (snapshots expired).

Manual Job Execution

The "Run Now" button bypasses the schedule and runs the job immediately according to your configured settings.

Limitations:

  • Will not run if another expiration or GC job is actively running for the repository
  • Will not run if a job is scheduled to run before the manual job would complete (within the job's timeout window)
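The two limitations above can be read as a single predicate. The sketch below is one interpretation, not Arraylake's scheduler: `can_run_now` and its parameters are hypothetical, and the timeout values are the ones listed in the Timeouts section of this page.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Timeout limits as documented for each job type.
TIMEOUTS = {
    "expiration": timedelta(minutes=20),
    "gc": timedelta(hours=1),
}


def can_run_now(
    job_type: str,
    now: datetime,
    job_running: bool,
    next_scheduled: Optional[datetime],
) -> bool:
    """Decide whether a manual run would be accepted in this toy model."""
    if job_running:
        # Another expiration or GC job is active for the repository.
        return False
    latest_finish = now + TIMEOUTS[job_type]
    if next_scheduled is not None and next_scheduled < latest_finish:
        # A scheduled job would start before the manual run is sure to be done.
        return False
    return True


now = datetime(2025, 1, 31, 12, 0, tzinfo=timezone.utc)
soon = now + timedelta(minutes=30)

print(can_run_now("gc", now, job_running=False, next_scheduled=soon))          # False
print(can_run_now("expiration", now, job_running=False, next_scheduled=soon))  # True
```

With a job scheduled 30 minutes out, a manual GC run (1-hour timeout) is refused, while a manual expiration run (20-minute timeout) is allowed.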

Impact on Scheduling:

Manually running a job may disrupt the regular cadence of scheduled jobs:

  • If a job is already scheduled to run according to the regular schedule, it will still run as planned
  • If no job is currently scheduled, the scheduler will use the most recent run (including manual "Run Now" executions) as the basis for calculating when to schedule the next job
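The effect of a manual run on the next scheduled time can be sketched as follows. This is one reading of the behavior described above, under assumptions: the optimization window is taken to start at a fixed time each day, and the `next_run` function is hypothetical rather than the actual scheduler logic.

```python
from datetime import datetime, time, timedelta, timezone


def next_run(last_run: datetime, cadence: timedelta, window_start: time) -> datetime:
    """First window start at or after last_run + cadence."""
    earliest = last_run + cadence
    candidate = datetime.combine(earliest.date(), window_start, tzinfo=earliest.tzinfo)
    if candidate < earliest:
        # That day's window has already started, so use the next day's.
        candidate += timedelta(days=1)
    return candidate


# A manual "Run Now" at 15:30 becomes the basis for the next weekly job.
manual_run = datetime(2025, 1, 10, 15, 30, tzinfo=timezone.utc)
print(next_run(manual_run, timedelta(weeks=1), window_start=time(2, 0)))
# 2025-01-18 02:00:00+00:00
```

Because the manual run happened in the afternoon, the next job lands a week and a few hours later, at the first window start after the cadence has elapsed.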

Timeouts

Each job type has specific timeout limits to ensure system stability:

  • Expiration jobs: 20 minutes
  • Garbage collection jobs: 1 hour

If you see a "Timeout" status in the job history, it means the operation exceeded these limits before completing. This typically occurs with very large repositories that have:

  • A large number of commits
  • Significant amounts of garbage data to process

If you encounter timeout issues, contact Earthmover support for investigation and potential optimization.