Upload zarr data directly to S3

Hi all,

I have been trying to upload zarr data directly to S3, but I have not managed to get it working yet.

I can read the data from S3 using fsspec filestream. There is no problem with that. But when I try to upload, I can’t use the same method.

data.to_zarr("s3://sample-storage/new.zarr",
             consolidated=True, mode='w',
             storage_options={
                 "key": access_key,
                 "secret": secret,
                 "endpoint": end_point
             })

Also, there is not much information available on the internet, other than the xarray and zarr documentation and the Pangeo Discourse forum.

Can someone help with this?

Thanks in advance!

For some reason, to_zarr never got the storage_options treatment. What you can do instead is create the mapper by hand:

import fsspec

mapper = fsspec.get_mapper("s3://sample_storage/new.zarr",
    key=access_key, secret=secret, endpoint=end_point)
data.to_zarr(mapper)

Thanks for the help @martindurant

I tried the same. But unfortunately it is throwing the following error.

TypeError: __init__() got an unexpected keyword argument 'endpoint'

I tried multiple naming conventions for each of the three arguments, such as key, access_key, access, secret_key, and endpoint_url, but they all return the same error.

Ah, I didn’t actually check that the arguments you provided for storage_options were reasonable. I believe it should be

import fsspec

mapper = fsspec.get_mapper("s3://sample_storage/new.zarr",
    key=access_key, secret=secret, client_kwargs=dict(endpoint_url=end_point))
data.to_zarr(mapper)

I just tried a small test case on our USGS HPC system, reading a 1.2 GB uncompressed NetCDF4 file and then writing zarr directly to S3, but it took a very long time: 23 minutes.
I then tried my usual method of writing locally and then copying to S3 using the AWS CLI. That took 6 s to write and 20 s to transfer, so 26 s total.

This dataset has 13 MB chunks, but only 1 chunk per variable, with 188 data variables.

@martindurant, do you have an explanation for this behavior?
(the fsspec version I used was 2021.10.1, in case that’s relevant)

The dask local client and dashboard could be used to profile, or even just %%snakeviz. I don’t have an immediate suggestion. It would be useful to know how many total chunks there are in the data.
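
For reference, here is a rough way to estimate the total chunk count without opening the store. This is only a sketch: the helper names and the `layout` dictionary are made up for illustration, and it assumes each chunk ends up as its own object in the bucket, which is how zarr directory-style stores typically behave.

```python
import math


def count_chunks(shape, chunk_shape):
    """Number of chunks in one array: product of ceil(dim / chunk) per axis."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunk_shape))


def total_chunks(arrays):
    """Sum chunk counts over a {name: (shape, chunk_shape)} mapping."""
    return sum(count_chunks(shape, chunks) for shape, chunks in arrays.values())


# Hypothetical layout: 188 variables, each stored as a single chunk.
# The shapes are placeholders, not the real dataset dimensions.
layout = {f"var{i}": ((100, 200), (100, 200)) for i in range(188)}
print(total_chunks(layout))
```

With an actual xarray dataset, the shape and chunk shape of each variable would have to be read from the dataset itself rather than typed in by hand.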

It worked perfectly fine for me. I needed to add “https://” before the endpoint_url as well.

Thank you so much for the help!!
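
For anyone landing here later, the pieces from this thread can be put together roughly like this. It is a minimal sketch, not a definitive recipe: `s3_mapper_options` and `write_zarr_to_s3` are hypothetical helper names, defaulting a bare host name to `https://` is an assumption based on what worked above, and s3fs needs to be installed alongside fsspec for `s3://` URLs.

```python
def s3_mapper_options(key, secret, endpoint):
    """Build keyword arguments for fsspec.get_mapper on an S3-compatible store."""
    # Assumption: an endpoint without a scheme should default to https://.
    if not endpoint.startswith(("http://", "https://")):
        endpoint = "https://" + endpoint
    return {
        "key": key,
        "secret": secret,
        # endpoint_url goes inside client_kwargs, not as a top-level keyword.
        "client_kwargs": {"endpoint_url": endpoint},
    }


def write_zarr_to_s3(data, url, key, secret, endpoint):
    """Write an xarray object to `url` through a hand-built fsspec mapper."""
    import fsspec  # imported lazily so the helper above works without it

    mapper = fsspec.get_mapper(url, **s3_mapper_options(key, secret, endpoint))
    data.to_zarr(mapper, mode="w", consolidated=True)
```

Usage would look something like `write_zarr_to_s3(data, "s3://sample_storage/new.zarr", access_key, secret, end_point)`, with the variable names taken from the first post.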

23 minutes vs 26 seconds is indeed a huge difference! Rich, could you post the profile information from your experiments? That would be super helpful.
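
If it helps, one way to capture a profile outside a notebook is the standard-library cProfile module. This is just a sketch: `profile_call` is a hypothetical helper, and `slow_write` is a stand-in for the real write, which is not reproduced here.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the 20 entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(20)
    return result, buffer.getvalue()


def slow_write():
    # Stand-in workload; replace with the actual write to S3.
    return sum(i * i for i in range(10000))


result, report = profile_call(slow_write)
print(report)
```

For the case in this thread, the call would be something like `profile_call(data.to_zarr, mapper)`, and the printed report is what would be useful to paste here.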