Upload zarr data directly to S3

Hi all,

I have been trying to upload zarr data directly to S3, but I have not managed to get it working yet.

I can read the data from S3 using fsspec filestream. There is no problem with that. But when I try to upload, I can’t use the same method.

    consolidated=True, mode='w',
    storage_options={
        "key": access_key,
        "secret": secret,
        "endpoint": end_point

Also, there is not much information available on the internet, other than the xarray and zarr documentation and the Pangeo Discourse forum.

Can someone help with this?

Thanks in advance!

For some reason, to_zarr never got the storage_options treatment. What you can do instead is create the mapper by hand:

mapper = fsspec.get_mapper("s3://sample_storage/new.zarr",
    key=, secret=, endpoint=)

Thanks for the help @martindurant

I tried the same. But unfortunately it is throwing the following error.

TypeError: __init__() got an unexpected keyword argument 'endpoint'

I tried multiple naming conventions for each of the three variables, like key, access_key, access, secret_key, endpoint_url etc. But they all return the same error.

Ah, I didn’t actually check that the arguments you provided for storage_options were reasonable. I believe it should be

mapper = fsspec.get_mapper("s3://sample_storage/new.zarr",
    key=, secret=, client_kwargs=dict(endpoint_url= ))
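For readers who want the whole round trip in one place, here is a minimal sketch of how the mapper above might be built and then handed to to_zarr. The helper names `s3_mapper_kwargs` and `write_zarr` are hypothetical, `ds` is assumed to be an xarray Dataset, and s3fs is assumed to be installed so that fsspec can resolve `s3://` URLs. Treat it as an illustration of the suggestion in this thread rather than a tested recipe for every fsspec/xarray version.

```python
def s3_mapper_kwargs(key, secret, endpoint_url):
    """Build keyword arguments for fsspec.get_mapper on an S3-compatible store.

    Hypothetical helper: the argument layout follows the corrected call
    in this thread (endpoint_url nested inside client_kwargs).
    """
    return {
        "key": key,
        "secret": secret,
        # endpoint_url is not a top-level argument of the S3 filesystem;
        # it is passed through client_kwargs instead.
        "client_kwargs": {"endpoint_url": endpoint_url},
    }


def write_zarr(ds, url, **mapper_kwargs):
    """Write an xarray Dataset to `url` through an fsspec mapper."""
    import fsspec  # imported lazily so the helper above has no dependencies

    mapper = fsspec.get_mapper(url, **mapper_kwargs)
    # Pass the mapper as the store instead of using storage_options.
    ds.to_zarr(mapper, mode="w", consolidated=True)
    return mapper
```

A call might then look like `write_zarr(ds, "s3://sample_storage/new.zarr", **s3_mapper_kwargs(access_key, secret, end_point))`, reusing the variable names from the original question.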

I just tried a small test case on our USGS HPC system, reading a 1.2GB uncompressed NetCDF4 file and then writing zarr directly to s3, but it took a super long time – 23 minutes.
I then tried it with my usual method of writing locally and then copying to s3 using the AWS CLI, and it took 6 s to write, and then 20 s to transfer, so 26s total.
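The two-step workflow described here (write locally, then copy with the AWS CLI) could be sketched roughly as follows. The exact CLI command used in the experiment is not given in the post, so `aws s3 cp --recursive` is an assumption, as are the helper names `aws_copy_command` and `write_then_copy`; `ds` is assumed to be an xarray Dataset.

```python
import subprocess


def aws_copy_command(local_dir, s3_url, endpoint_url=None):
    """Build the AWS CLI command that copies a local Zarr directory to S3."""
    cmd = ["aws", "s3", "cp", "--recursive", local_dir, s3_url]
    if endpoint_url:
        # Only needed for non-AWS, S3-compatible endpoints.
        cmd += ["--endpoint-url", endpoint_url]
    return cmd


def write_then_copy(ds, local_dir, s3_url, endpoint_url=None):
    """Write the dataset to local disk first, then upload it in one CLI call."""
    ds.to_zarr(local_dir, mode="w", consolidated=True)
    subprocess.run(aws_copy_command(local_dir, s3_url, endpoint_url), check=True)
```

Timing each of the two steps separately, as done above, makes it easier to tell whether the time goes into encoding the data or into the transfer itself.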

This dataset has 13 MB chunks, but only 1 chunk per variable, with 188 data variables.

@martindurant, do you have an explanation for this behavior?
(the fsspec version I used was 2021.10.1, in case that’s relevant)

The dask local client and dashboard could be used to profile, or even just %%snakeviz. I don’t have an immediate suggestion. It would be useful to know how many total chunks there are in the data.
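As a rough way to answer the chunk-count question, the number of chunks per array follows from its shape and chunk shape. The sketch below uses plain Python so it does not depend on any particular library version; `count_chunks` and `total_chunks` are hypothetical helper names, and the shapes in the usage note are made-up illustrations, not values from the dataset discussed here.

```python
import math


def count_chunks(shape, chunks):
    """Number of chunks in one array: product of ceil(dim / chunk) per axis."""
    n = 1
    for dim, chunk in zip(shape, chunks):
        n *= math.ceil(dim / chunk)
    return n


def total_chunks(arrays):
    """Sum chunk counts over a mapping of name -> (shape, chunks)."""
    return sum(count_chunks(shape, chunks) for shape, chunks in arrays.values())
```

For example, `count_chunks((100, 100), (50, 30))` gives 8, since the first axis splits into 2 pieces and the second into 4. In a directory-style Zarr store each chunk is typically written as a separate object, alongside small metadata objects per array, so the total gives a rough idea of how many requests a direct write to S3 involves.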

It worked perfectly fine for me. I needed to add “https://” before the endpoint_url as well.
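The missing scheme is easy to overlook, so a small guard like the following could be applied before passing the endpoint on. This is only a sketch: `ensure_scheme` is a hypothetical helper, and defaulting to `https` is an assumption that matches the fix reported above but may not suit every endpoint.

```python
def ensure_scheme(endpoint, scheme="https"):
    """Return the endpoint with a URL scheme, adding one if it is missing."""
    endpoint = endpoint.strip()
    if "://" in endpoint:
        # Already has a scheme (http:// or https://); leave it untouched.
        return endpoint
    return f"{scheme}://{endpoint}"
```

The result would then be used as the `endpoint_url` value inside `client_kwargs`.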

Thank you so much for the help!!

23 minutes vs 26 seconds is indeed a huge difference! Rich, could you post the profile information from your experiments? That would be super helpful.