Copy > 5GB zarr file to S3

My head is spinning when I look at the potential mix of xarray, zarr, fsspec, s3fs, boto3/glob for copying a large Zarr file from a local filesystem to S3. By "large" I mean files > 5 GB, which need to be split into multipart uploads or they won't get through.

Since I'm just copying a Zarr file, it seems redundant to open the local Zarr as 'data' with xarray and then "save as" Zarr over on S3, however that might be done. Before delving into my tests, can anyone give me the high-level view of

cp big_local_zarr to my_s3_bucket ?

TIA

If the file already exists in Zarr format on a local disk, you don’t have to use Xarray, Zarr, or Python at all. Just use your cloud provider’s transfer utility. With the AWS CLI, it’s something like this

aws s3 cp --recursive local_zarr_dir s3://myBucket/remote/path

There are many ways to tweak this to try to make it go faster. You could use rsync, s3p, etc.

The Python equivalent would be:

import fsspec

fs = fsspec.filesystem("s3")
fs.put(local_zarr_dir, "myBucket/remote/path", recursive=True)

which does use multi-part uploads for files that are big enough. Logging/callbacks for feedback are optional.
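For example, a progress callback can be attached to the transfer. A minimal sketch (TqdmCallback comes from fsspec.callbacks and needs tqdm installed; the paths are placeholders):

import fsspec
from fsspec.callbacks import TqdmCallback

# recursive upload with a progress bar; fsspec handles the multi-part
# uploads transparently for files over the part-size threshold
fs = fsspec.filesystem("s3")
fs.put(
    "local_zarr_dir",
    "myBucket/remote/path",
    recursive=True,
    callback=TqdmCallback(),
)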

The AWS CLI might be faster, but for very large files it may make no difference; you'll just max out the bandwidth.

fsspec's generic module also has an "rsync" with a few of the options of its namesake utility, and it will figure out which filesystem to pick for each URL. You don't really need it here.
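For reference, a minimal sketch of that (the URLs are placeholders, and the available keyword options differ a little between fsspec versions):

from fsspec.generic import rsync

# the protocol in each URL ("file://", "s3://") determines which
# filesystem fsspec instantiates for that side of the transfer
rsync("file:///my/full/path/my.zarr", "s3://myBucket/remote/path/my.zarr")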

Thank you both. I try to work in notebooks and not worry about setting up CLIs. (I'm using a Ceph S3 bucket.)

For the record, here are three ways I uploaded the same 14 GB Zarr file, with this common setup:

import os
import fsspec
import s3fs
import zarr

# access_key / secret_access_key are defined elsewhere (not shown)
base_dir = "my/full/path/"
zarr_directory = "my.zarr"
endpoint_url = 'http://my_url:my_port'
os.environ['AWS_ACCESS_KEY_ID'] = access_key
os.environ['AWS_SECRET_ACCESS_KEY'] = secret_access_key
s3 = s3fs.S3FileSystem(key=access_key,
                       secret=secret_access_key,
                       endpoint_url=endpoint_url)
fs = fsspec.filesystem('s3', endpoint_url=endpoint_url)
local_zarr_dir = base_dir + zarr_directory

fsspec way

fs.put(local_zarr_dir, "zarr-fsspec", recursive=True)

zarr-fsspec bucket gets:
Total Objects: 3091
Total Size: 13.3 GiB

Zarr way

store1 = zarr.DirectoryStore(base_dir + zarr_directory)  # source store
store2 = s3fs.S3Map(root='zarr-tt', s3=s3, check=False)   # destination store on S3
zarr.copy_store(store1, store2, if_exists='replace')

zarr-tt bucket gets:
Total Objects: 3091
Total Size: 13.3 GiB

Fuse mount way

I mounted the zarr-fuse bucket to a fresh local directory and copied the Zarr file into that mounted directory.
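For illustration, the mount can be done from Python with fsspec's FUSE bridge; this is only a sketch (it needs the fusepy package, the mount point is a placeholder, and it is not necessarily the exact FUSE tool used here):

import shutil
import fsspec.fuse

# expose the bucket (reusing `fs` from the setup above) at a local mount point
fsspec.fuse.run(fs, "zarr-fuse/", "/mnt/zarr-fuse/", foreground=False)

# then an ordinary copy into the mounted directory
shutil.copytree(local_zarr_dir, "/mnt/zarr-fuse/" + zarr_directory)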

zarr-fuse bucket gets:
Total Objects: 3095
Total Size: 13.2 GiB

I end up with buckets that don't contain the Zarr directory name, which I can of course fix before uploading the contents, but it makes the copying trickier than, for example, copying a PMTiles file or a COG.

How much of an access penalty would I get if I were to use a gzipped Zarr, which would give me a "filename with the contents inside"?

Cheers

Do not use gzip with Zarr. Instead, use a Zip file. It is supported natively by the format (e.g. ZipStore).
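A minimal sketch with the zarr-python v2 API (the local paths and bucket name are placeholders):

import fsspec
import zarr

# pack the existing directory store into one zip archive that Zarr reads natively
src = zarr.DirectoryStore("my/full/path/my.zarr")
dst = zarr.ZipStore("my/full/path/my.zarr.zip", mode="w")
zarr.copy_store(src, dst)
dst.close()  # the archive is only finalized on close

# a single object to upload instead of thousands of chunk files
fs = fsspec.filesystem("s3")
fs.put("my/full/path/my.zarr.zip", "myBucket/remote/path/my.zarr.zip")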

Similar discussion happening at Any tips on avoiding high AWS S3 cost when storing Zarr's with lots of objects - #13 by rabernat
