Copy > 5GB zarr file to S3

My head is spinning when I look at the potential mix of xarray, zarr, fsspec, s3fs, boto3/glob for copying a large Zarr file from a local filesystem to S3. By "large" I mean files > 5 GB, which need to be split into multipart uploads or they won't get through.

Since I'm just copying a Zarr file, it seems redundant to open the local Zarr as 'data' with xarray and then "save as" Zarr over on S3, however that might be done. Before delving into my tests, can anyone give me the high-level view of

cp big_local_zarr to my_s3_bucket ?

TIA

If the file already exists in Zarr format on a local disk, you don’t have to use Xarray, Zarr, or Python at all. Just use your cloud provider’s transfer utility. With the AWS CLI, it’s something like this

aws s3 cp --recursive local_zarr_dir s3://myBucket/remote/path

There are many ways to tweak this to try to make it go faster. You could use rsync, s3p, etc.

The Python equivalent would be:

import fsspec

fs = fsspec.filesystem("s3")
fs.put(local_zarr_dir, "myBucket/remote/path", recursive=True)

which does use multi-part uploads for files that are big enough. Logging/callbacks for feedback are optional.
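For example, a progress callback can be attached to the transfer. A minimal sketch (TqdmCallback comes from fsspec.callbacks and needs tqdm installed; the paths are placeholders):

import fsspec
from fsspec.callbacks import TqdmCallback

# recursive upload with a progress bar; fsspec handles the multi-part
# uploads transparently for files over the part-size threshold
fs = fsspec.filesystem("s3")
fs.put(
    "local_zarr_dir",
    "myBucket/remote/path",
    recursive=True,
    callback=TqdmCallback(),
)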

The AWS CLI might be faster, but for very large files it may make no difference; you'll just max out the bandwidth.

fsspec's generic module also has an "rsync" with a few of the options of its namesake utility, and it will figure out which filesystem to pick for each URL. You don't really need it here.
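For reference, a minimal sketch of that (the URLs are placeholders, and the available keyword options differ a little between fsspec versions):

from fsspec.generic import rsync

# the protocol in each URL ("file://", "s3://") determines which
# filesystem fsspec instantiates for that side of the transfer
rsync("file:///my/full/path/my.zarr", "s3://myBucket/remote/path/my.zarr")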

Thank you both. I try to work in notebooks and not worry about setting up CLIs. (I'm using a Ceph S3 bucket.)

For the record, here are three ways I uploaded the same 14 GB Zarr file, with this common setup:

import os
import fsspec
import s3fs
import zarr

# access_key / secret_access_key are defined elsewhere (not shown)
base_dir = "my/full/path/"
zarr_directory = "my.zarr"
endpoint_url = 'http://my_url:my_port'
os.environ['AWS_ACCESS_KEY_ID'] = access_key
os.environ['AWS_SECRET_ACCESS_KEY'] = secret_access_key
s3 = s3fs.S3FileSystem(key=access_key,
                       secret=secret_access_key,
                       endpoint_url=endpoint_url)
fs = fsspec.filesystem('s3', endpoint_url=endpoint_url)
local_zarr_dir = base_dir + zarr_directory

fsspec way

fs.put(local_zarr_dir, "zarr-fsspec", recursive=True)

zarr-fsspec bucket gets:
Total Objects: 3091
Total Size: 13.3 GiB

Zarr way

store1 = zarr.DirectoryStore(base_dir + zarr_directory)  # source store
store2 = s3fs.S3Map(root='zarr-tt', s3=s3, check=False)   # destination store on S3
zarr.copy_store(store1, store2, if_exists='replace')

zarr-tt bucket gets:
Total Objects: 3091
Total Size: 13.3 GiB

Fuse mount way

I mounted the zarr-fuse bucket to a fresh local directory and copied the Zarr file into that mounted directory.
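For illustration, the mount can be done from Python with fsspec's FUSE bridge; this is only a sketch (it needs the fusepy package, the mount point is a placeholder, and it is not necessarily the exact FUSE tool used here):

import shutil
import fsspec.fuse

# expose the bucket (reusing `fs` from the setup above) at a local mount point
fsspec.fuse.run(fs, "zarr-fuse/", "/mnt/zarr-fuse/", foreground=False)

# then an ordinary copy into the mounted directory
shutil.copytree(local_zarr_dir, "/mnt/zarr-fuse/" + zarr_directory)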

zarr-fuse bucket gets:
Total Objects: 3095
Total Size: 13.2 GiB

I end up with buckets that don't contain the Zarr directory name, which I can of course fix before uploading the contents, but it makes the copying trickier than, for example, copying a PMTiles file or a COG.

How much of an access penalty would I get if I were to use a gzipped Zarr, which would give me a "filename with the contents inside"?

Cheers

Do not use gzip with Zarr. Instead, use a Zip file. It is supported natively by the format (e.g. ZipStore).
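A minimal sketch with the zarr-python v2 API (the local paths and bucket name are placeholders):

import fsspec
import zarr

# pack the existing directory store into one zip archive that Zarr reads natively
src = zarr.DirectoryStore("my/full/path/my.zarr")
dst = zarr.ZipStore("my/full/path/my.zarr.zip", mode="w")
zarr.copy_store(src, dst)
dst.close()  # the archive is only finalized on close

# a single object to upload instead of thousands of chunk files
fs = fsspec.filesystem("s3")
fs.put("my/full/path/my.zarr.zip", "myBucket/remote/path/my.zarr.zip")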

Similar discussion happening at Any tips on avoiding high AWS S3 cost when storing Zarr's with lots of objects - #13 by rabernat
