Any suggestions for S3 upload optimizations for large 3D Zarr datasets?

We run a collaboration space and archive for neurophysiology data (https://dandiarchive.org) and have been working on implementing Zarr support for microscopy datasets. Each Zarr NestedDirectoryStore contains about 800K to 1.2M objects with compressed sizes of 40 KB to 100 KB each, with a chunk size of 64^3 (optimized for viewers like neuroglancer). Each store ranges in size from 10 GB to 150 GB. Our reason for moving towards Zarr is that the microscopy community is developing a Zarr-based spec, Next Generation File Format, and this would eventually allow us to stitch these datasets into an Earth-like view of each of these brain datasets. Each dataset is expected to be about 250 TB in size when stitched/registered together.

At the moment we are focused on getting these individual slabs of 10 GB to 150 GB online on AWS S3 so that others can focus on processing and stitching the data. We currently have a version online, produced by converting Zarr to HDF5 and then uploading the data. Available here.

Users have direct control over uploading their data, but they do not write directly to S3 using keys. The client mints presigned URLs for upload. This makes the data available instantly, without any further processing or relocation needed on the server side.
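For readers who have not used presigned URLs, here is a minimal sketch of what a single chunk upload could look like from the client side, using only the Python standard library. It assumes one presigned PUT URL per chunk object; the function names and the content type are illustrative assumptions, not a description of the actual upload client.

```python
import urllib.request


def build_presigned_put(url: str, payload: bytes) -> urllib.request.Request:
    """Build an HTTP PUT request for one chunk against a presigned URL."""
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        # Assumption: if the content type was part of the signature,
        # this value has to match what was signed.
        headers={"Content-Type": "application/octet-stream"},
    )


def upload_chunk(url: str, payload: bytes, timeout: float = 30.0) -> int:
    """Send one chunk and return the HTTP status code of the response."""
    request = build_presigned_put(url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Each chunk costs one full HTTP round trip here, which is why per-request latency, not bandwidth, tends to dominate when the objects are only tens of kilobytes.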

But this has been painfully slow for Zarr uploads relative to tarred uploads. We are getting about 20 to 40 MB/s (only about 5 MB/s using the AWS CLI), when the full bandwidth can be 150+ MB/s for standard multipart uploads of large files. As you can imagine, a 5x drop in bandwidth can result in extra months when uploading full datasets of this size.
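To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes decimal units (1 TB = 1e12 bytes, 1 MB = 1e6 bytes), a sustained rate with no retries or pauses, and uses 30 MB/s as a representative midpoint of the 20 to 40 MB/s range.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(total_bytes: float, rate_mb_per_s: float) -> float:
    """Days needed to move total_bytes at a sustained rate in MB/s."""
    return total_bytes / (rate_mb_per_s * 1e6) / SECONDS_PER_DAY


DATASET_BYTES = 250e12  # one stitched dataset, about 250 TB

for rate in (5, 30, 150):
    days = transfer_days(DATASET_BYTES, rate)
    print(f"{rate:>4} MB/s -> {days:6.1f} days")
```

Under these assumptions a 250 TB dataset takes roughly 19 days at 150 MB/s and roughly 96 days at 30 MB/s, so the gap is about two and a half months per dataset.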

So we are looking for pointers from other communities that have used Zarr and S3 to see if we can optimize the transfer for our users.

Sorry couldn’t post more than two links.

Have you had a chance to look at Globus with the Globus S3 connector? We (@mgrover1 and I) have used it to transfer terabytes of data for the CESM1 Large Ensemble and CESM2 Large Ensemble datasets from our HPC system to AWS S3.

Regarding the speed, the last time I checked we were getting 85 MB/s.

[Image: 20190712.lens-aws.ocn.globus.overview]

thank you @andersy005 - we haven’t looked at the globus s3 connector. we were hoping for more general options based on s3 before bringing another technology into the picture. but i will try out globus on some of these datasets as well.

Have you explored other command-line clients like s5cmd (GitHub - peak/s5cmd: Parallel S3 and local filesystem execution tool)? Given the way the S3 HTTP API works, there are tricks that can make a huge difference in upload performance with many small files compared to the default CLI tool (at least that was the case for the default client as of a couple of years ago).

Another option would be to set up a Lambda that untars the files after upload into the correct schema.
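A rough sketch of the unpacking step, assuming the tar holds the store's relative paths as member names. The `put_object` callable is a hypothetical stand-in for whatever writes one object to the bucket; in a real Lambda it would wrap an S3 client call, and the tar stream would come from the uploaded object's body.

```python
import tarfile
from typing import BinaryIO, Callable


def unpack_tar_to_store(
    tar_stream: BinaryIO,
    prefix: str,
    put_object: Callable[[str, bytes], None],
) -> int:
    """Write each regular file in the tar as its own object; return the count."""
    base = prefix.rstrip("/")
    count = 0
    # Mode "r|*" reads the archive sequentially, so a non-seekable
    # stream works and compression is detected automatically.
    with tarfile.open(fileobj=tar_stream, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and so on
            name = member.name
            if name.startswith("./"):
                name = name[2:]  # keep leading dots of .zarray, .zgroup, .zattrs
            # Each chunk is small (tens of KB), so reading it whole is fine.
            data = archive.extractfile(member).read()
            put_object(f"{base}/{name}", data)
            count += 1
    return count
```

The PUT requests still happen, one per chunk, but they run next to the bucket instead of over the user's connection, so the user-facing transfer becomes a single large multipart upload.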

It’s good to hear that the microscopy community is embracing Zarr 🙂

thank you @Jake_Bolewski - indeed s5cmd is significantly faster, even with default parameters. will work on optimizing. our thought process was leading towards a bulk transfer blob and then “unpacking” on the s3 side. we have some homework to do.

Not sure if it helps, but I had a similar issue when uploading Zarr stores in parallel (e.g. one per date) and hitting request rate limits: "to_zarr to s3 with asynchronous=False" (Discussion #5869, pydata/xarray, GitHub).

s3fs uses concurrent connections to amortize latency costs when uploading many files at once. In the Zarr world, you can use this in conjunction with dask to parallelise the CPU effort at the same time, if each of your dask array partitions contains many Zarr chunks. However, if you are being throttled by AWS (as @raybellwaves mentions), none of this will help! I believe it is technically possible to bundle many uploads into a multi-part HTTP request, but I don’t actually know how AWS counts this against your rate quota.
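To illustrate the idea of overlapping many small uploads, here is a minimal sketch using a thread pool. This is not how s3fs is implemented, just a stand-in for the same principle; `upload_one` is a hypothetical callable that sends a single object (for example a PUT to a presigned URL).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple


def upload_many(
    items: Iterable[Tuple[str, bytes]],
    upload_one: Callable[[str, bytes], None],
    max_workers: int = 32,
) -> Dict[str, Exception]:
    """Upload (key, payload) pairs concurrently; return failures by key."""
    failures: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # All items are submitted up front, so for very large stores
        # feed this function in batches to bound memory use.
        futures = {pool.submit(upload_one, key, data): key for key, data in items}
        for future, key in futures.items():
            error = future.exception()  # blocks until this upload finishes
            if error is not None:
                failures[key] = error
    return failures
```

With 32 workers, up to 32 round trips are in flight at once, so throughput on small objects can scale roughly with the worker count until bandwidth or the provider's request limits become the bottleneck.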

That is similar to what s5cmd is doing. The other trick is to use HTTP pipelining (not sure if s3fs supports this). But yes, eventually you do hit request limits, and then you are in the nightmare world of trying to rationalize distributed backoff timeouts.
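For a single client, the usual starting point is exponential backoff with jitter. A minimal sketch, assuming the upload code raises a `Throttled` exception (a hypothetical marker, not a real library class) when the service answers with a throttling response such as S3's 503 Slow Down:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class Throttled(Exception):
    """Hypothetical marker for a throttling response from the object store."""


def with_backoff(
    action: Callable[[], T],
    max_attempts: int = 6,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run action, retrying on Throttled with capped, jittered delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return action()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # out of retries, surface the error
            # "Full jitter": wait a random time up to the exponential cap,
            # so many workers do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0.0, cap))
    raise AssertionError("unreachable")
```

The random delay is what matters when many workers are throttled at the same moment: without it they all retry together and get throttled again.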

There is a lot of good information here, so I’m just chiming in with my experience. I’ve used Rclone to upload Zarr datasets to DigitalOcean Spaces (object storage similar to S3) and was throttled by the concurrent request limit. This played into making sure that our Zarr chunks weren’t too small, to keep the total number of files reasonable.
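A quick way to see how chunk size drives the file count, assuming a store that writes one object per chunk. The array shape below is a made-up example, not one of the datasets discussed in this thread.

```python
import math
from typing import Sequence


def chunk_count(shape: Sequence[int], chunks: Sequence[int]) -> int:
    """Number of chunk objects for an array with the given chunk shape."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


shape = (4096, 4096, 4096)  # hypothetical 3D volume

for edge in (64, 128, 256):
    print(f"{edge}^3 chunks -> {chunk_count(shape, (edge,) * 3):,} objects")
```

Doubling the chunk edge in all three dimensions cuts the object count by a factor of eight, at the cost of larger reads for viewers that only need a small region.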