Upload zarr data directly to S3

Murugesh · December 29, 2021, 5:13pm

Hi all,

I have been trying to upload zarr data directly to S3, but I have not managed it yet.

I can read the data from S3 using fsspec filestream. There is no problem with that. But when I try to upload, I can’t use the same method.

data.to_zarr("s3://sample-storage/new.zarr",
             consolidated=True, mode='w',
             storage_options={
                 "key": access_key,
                 "secret": secret,
                 "endpoint": end_point
             })

Also, there is not much information available on the internet, other than the xarray and zarr documentation and the Pangeo Discourse forum.

Can someone help with this?

Thanks in advance!

martindurant · December 30, 2021, 8:37pm

For some reason, to_zarr never got the storage_options treatment. What you can do instead is create the mapper by hand:

mapper = fsspec.get_mapper("s3://sample_storage/new.zarr",
    key=, secret=, endpoint=)
data.to_zarr(mapper)

Murugesh · December 31, 2021, 8:00am

Thanks for the help @martindurant

I tried the same. But unfortunately it is throwing the following error.

TypeError: __init__() got an unexpected keyword argument 'endpoint'

I tried several names for each of the three arguments, such as key, access_key, access, secret_key and endpoint_url, but they all return the same error.

martindurant · December 31, 2021, 4:46pm

Ah, I didn’t actually check that the arguments you provided for storage_options were reasonable. I believe it should be

mapper = fsspec.get_mapper("s3://sample_storage/new.zarr",
    key=, secret=, client_kwargs=dict(endpoint_url= ))
data.to_zarr(mapper)

rsignell · December 31, 2021, 4:58pm

I just tried a small test case on our USGS HPC system, reading a 1.2GB uncompressed NetCDF4 file and then writing zarr directly to s3, but it took a super long time – 23 minutes.
I then tried my usual method of writing locally and then copying to S3 using the AWS CLI: it took 6 s to write and then 20 s to transfer, so 26 s total.

This dataset has 13 MB chunks, but only 1 chunk per variable, with 188 data variables.

@martindurant, do you have an explanation for this behavior?
(the fsspec version I used was 2021.10.1, in case that’s relevant)
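The "write locally, then copy" route described above could also be driven from Python. The sketch below swaps the AWS CLI for fsspec's recursive put, so it is not the exact method that was timed here, and nothing in it is measured. The function name write_locally_then_upload and the staging directory name are made up for illustration; fs is assumed to be an fsspec filesystem such as the one returned by fsspec.filesystem("s3", ...).

```python
import os
import tempfile


def write_locally_then_upload(data, fs, remote_path):
    """Write `data` to a temporary local Zarr store, then copy it with `fs`.

    `data` is expected to be an xarray Dataset and `fs` an fsspec
    filesystem (hypothetical helper, not part of xarray or fsspec).
    """
    with tempfile.TemporaryDirectory() as tmp:
        local_store = os.path.join(tmp, "staging.zarr")
        # Fast local write first.
        data.to_zarr(local_store, mode="w", consolidated=True)
        # Then copy the finished store in a single recursive call.
        fs.put(local_store, remote_path, recursive=True)
    return remote_path
```

It is worth checking the remote layout afterwards: if remote_path already exists, the handling of the target directory can differ between fsspec versions.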

martindurant · December 31, 2021, 5:13pm

The dask local client and dashboard could be used to profile, or even just %%snakeviz. I don’t have an immediate suggestion. It would be useful to know how many total chunks there are in the data.
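One rough way to get that chunk count from an xarray Dataset is sketched below. The helper name count_chunks is made up for illustration. It only looks at dask chunking, and counting a variable that is not dask-backed as a single chunk is an assumption: the chunks actually written can differ if the encoding sets them or zarr picks its own.

```python
import math


def count_chunks(ds):
    """Return the total number of dask chunks across all data variables."""
    total = 0
    for name, var in ds.data_vars.items():
        if var.chunks is None:
            # Not dask-backed: counted as one chunk (assumption).
            total += 1
        else:
            # var.chunks holds one tuple of block sizes per dimension, so
            # the chunk count is the product of the tuple lengths.
            total += math.prod(len(blocks) for blocks in var.chunks)
    return total
```

Each chunk becomes at least one object in the store, so this number gives a feel for how many separate writes to_zarr has to make.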

Murugesh · January 3, 2022, 8:26am

It worked perfectly fine for me. I needed to add “https://” before the endpoint_url as well.

Thank you so much for the help!!
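For later readers, here is a minimal sketch of the combination that worked in this thread: credentials passed as key and secret, and the endpoint passed as endpoint_url inside client_kwargs with an explicit scheme. The helper names s3_mapper_kwargs and write_zarr_to_s3 are made up for illustration, and defaulting to https:// when no scheme is given is an assumption that may not suit every endpoint.

```python
def s3_mapper_kwargs(key, secret, endpoint):
    """Build keyword arguments for fsspec.get_mapper on an s3:// URL."""
    # The endpoint needs a scheme; add https:// if none is given
    # (assumption: the endpoint is served over HTTPS).
    if "://" not in endpoint:
        endpoint = "https://" + endpoint
    return {
        "key": key,
        "secret": secret,
        # s3fs forwards client_kwargs to the underlying S3 client.
        "client_kwargs": {"endpoint_url": endpoint},
    }


def write_zarr_to_s3(data, url, key, secret, endpoint):
    """Write an xarray Dataset to a Zarr store at an s3:// URL."""
    # Imported here so the helper above works without fsspec installed.
    import fsspec  # the s3 protocol also requires s3fs

    mapper = fsspec.get_mapper(url, **s3_mapper_kwargs(key, secret, endpoint))
    data.to_zarr(mapper, mode="w", consolidated=True)
```

Calling write_zarr_to_s3(data, "s3://sample_storage/new.zarr", access_key, secret, end_point) would then mirror the snippets earlier in the thread.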

rabernat · January 3, 2022, 5:26pm

23 minutes vs 26 seconds is indeed a huge difference! Rich, could you post the profile information from your experiments? That would be super helpful.
