We are generating Zarr stores and pushing the final data products to an AWS S3 bucket. We were hit by unexpectedly high S3 costs (~3K to store 40 data products) when storing our test data sets.
When using (250, 10, 10) chunking for the (date, x, y) dimensions, we ended up with an average of ~3-4M objects per Zarr store, hence the high S3 cost.
Our normal mode of operation is to generate the data set locally and push it to the S3 bucket. When new data that contributes to the data set becomes available, we copy the existing Zarr store from S3 to local storage, update it with the new data, remove the original Zarr store from the S3 bucket, and push the updated store back to S3. Since we need to update our data sets often, there is a concern that the high S3 costs will recur every time we push updated data to the bucket.
In the past, we tried writing data directly to S3, but it was rather slow. We may need to revisit this approach, since updating a data set in place in the S3 bucket would minimize the number of objects pushed each time we update it. I guess we would be choosing cost over runtime here.
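For what it's worth, the potential savings of updating in place can be estimated with simple arithmetic: when new dates are appended along the date axis, only the chunk objects intersecting the new time slices need to be written, whereas re-pushing the whole store rewrites everything. A stdlib-only sketch (the grid dimensions and existing-date count below are hypothetical, not our actual data):

```python
import math

def objects_written_on_append(old_len, new_len, shape_xy, chunks):
    """Chunk objects (re)written when appending along the date axis.

    chunks = (date_chunk, x_chunk, y_chunk); only the date-chunk rows
    that intersect the index range [old_len, new_len) are touched.
    """
    t_chunk, x_chunk, y_chunk = chunks
    first = old_len // t_chunk               # first affected date-chunk index
    last = math.ceil(new_len / t_chunk) - 1  # last affected date-chunk index
    spatial = math.ceil(shape_xy[0] / x_chunk) * math.ceil(shape_xy[1] / y_chunk)
    return (last - first + 1) * spatial

# Hypothetical store: 1000 dates so far on a 5000 x 6000 grid,
# chunked (250, 10, 10) as in our current setup.
touched = objects_written_on_append(1000, 1010, (5000, 6000), (250, 10, 10))
total_after = math.ceil(1010 / 250) * math.ceil(5000 / 10) * math.ceil(6000 / 10)
print(touched)      # objects written with an in-place update
print(total_after)  # objects pushed if the whole updated store is re-uploaded
```

Under these assumptions an in-place append writes one date-chunk row of objects, while re-pushing the whole store uploads several times that, and the gap keeps growing as history accumulates.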
Just for the record, we tested a new (250, 50, 50) chunking when writing the Zarr store. It does not affect data access time and definitely cuts the number of objects in the final store (from 3M to 120K). But since we will be updating the Zarr stores with new data on a regular basis, and we will be generating thousands of data sets, S3 cost will still be an issue. It also seems to limit how often we can update these data products if we push the whole updated Zarr store to the bucket each time.
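The object-count reduction from the coarser spatial chunks can be sanity-checked with simple arithmetic; the grid dimensions below are made up, but the ~25x ratio (matching our observed 3M to 120K drop) depends only on the chunk shapes:

```python
import math

def n_objects(shape, chunks):
    """Number of chunk objects a Zarr array of `shape` produces with `chunks`."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

shape = (250, 5000, 6000)  # hypothetical (date, x, y) dimensions

fine = n_objects(shape, (250, 10, 10))    # one object per 10x10 spatial tile
coarse = n_objects(shape, (250, 50, 50))  # one object per 50x50 spatial tile
print(fine, coarse, fine / coarse)        # 25x fewer objects with coarser chunks
```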
Has anybody run into similar issues storing Zarr stores in AWS S3?
I would appreciate it if anybody could point us in the right direction on:
- Any known tips and tricks for avoiding high S3 costs when pushing a Zarr store with a large number of objects, or frequently pushing a Zarr store with a moderate number of objects, to the cloud
- The best practice for updating a Zarr store that lives in an AWS S3 bucket (update locally and push the whole store back, or update the Zarr store in S3 directly despite it being very slow)
Many thanks in advance,