Hi all,
I have been trying to write Zarr data directly to S3, but I have not managed it yet.
I can read the data from S3 through an fsspec file stream without any problem, but the same approach does not work when I try to upload. This is what I tried:
data.to_zarr("s3://sample-storage/new.zarr",
             consolidated=True, mode='w',
             storage_options={
                 "key": access_key,
                 "secret": secret,
                 "endpoint": end_point
             })
Also, there is not much information available online beyond the xarray and zarr documentation and the Pangeo Discourse forum.
Can someone help with this?
Thanks in advance!
For some reason, to_zarr never got the storage_options treatment. What you can do instead is create the mapper by hand:
import fsspec
mapper = fsspec.get_mapper("s3://sample_storage/new.zarr",
                           key=access_key, secret=secret, endpoint=end_point)
data.to_zarr(mapper)
Thanks for the help @martindurant
I tried the same. But unfortunately it is throwing the following error.
TypeError: __init__() got an unexpected keyword argument 'endpoint'
I tried several names for each of the three arguments, such as key, access_key, access, secret_key, and endpoint_url, but they all return the same error.
Ah, I didn’t actually check that the arguments you provided for storage_options were reasonable. I believe it should be
mapper = fsspec.get_mapper("s3://sample_storage/new.zarr",
                           key=access_key, secret=secret,
                           client_kwargs=dict(endpoint_url=end_point))
data.to_zarr(mapper)
I just tried a small test case on our USGS HPC system, reading a 1.2 GB uncompressed NetCDF4 file and then writing zarr directly to S3, but it took a super long time: 23 minutes.
I then tried my usual method of writing locally and then copying to S3 with the AWS CLI. It took 6 s to write and 20 s to transfer, so 26 s in total.
This dataset has 13 MB chunks, but only 1 chunk per variable, with 188 data variables.
@martindurant, do you have an explanation for this behavior?
(the fsspec version I used was 2021.10.1, in case that’s relevant)
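For reference, here is a rough sketch of how the total chunk count could be estimated from each variable's shape and chunk shape. In the Zarr v2 layout every chunk is stored as its own object, so the count gives a feel for how many S3 requests a write involves (metadata objects come on top). The function name `count_chunks` and the array shape are made up for illustration; only the "188 variables, 1 chunk each" layout comes from the description above.

```python
import math


def count_chunks(shape, chunks):
    """Chunks in one array: product of ceil(dim / chunk) over all axes."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


# Hypothetical layout: 188 variables, each stored as a single chunk.
# The shape is invented purely for illustration.
variables = {
    f"var{i}": ((1, 1300, 1300), (1, 1300, 1300)) for i in range(188)
}

total = sum(count_chunks(shape, chunks) for shape, chunks in variables.values())
print(total)
```

With a real dataset, the shapes and chunk shapes would be read from the variables themselves rather than hard-coded.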
The dask local client and dashboard could be used to profile, or even just %%snakeviz. I don’t have an immediate suggestion. It would be useful to know how many chunks there are in the data in total.
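Outside a notebook, a similar profile could be collected with the standard library's cProfile. This is only a minimal sketch: `profile_call` is a hypothetical helper, and the commented usage assumes `data` and `mapper` are the Dataset and fsspec mapper from earlier in the thread.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so slow call chains show up first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Hypothetical usage with the objects from this thread:
# _, report = profile_call(data.to_zarr, mapper, mode="w", consolidated=True)
# print(report)
```

Sorting by cumulative time should make it easier to see whether the time goes into many small S3 requests or into something else.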
It worked perfectly fine for me. I needed to add “https://” before the endpoint_url as well.
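In case it helps others, here is a small sketch of how the arguments could be assembled so that the endpoint always carries a scheme. The helper name `s3_storage_options` and the example host are hypothetical; the `key`, `secret` and `client_kwargs`/`endpoint_url` names are the ones used earlier in this thread.

```python
def s3_storage_options(key, secret, endpoint, default_scheme="https"):
    """Build keyword arguments for an s3fs-backed fsspec mapper."""
    # Prepend a scheme only if the endpoint does not already have one.
    if "://" not in endpoint:
        endpoint = f"{default_scheme}://{endpoint}"
    return {
        "key": key,
        "secret": secret,
        "client_kwargs": {"endpoint_url": endpoint},
    }


# Placeholder credentials and host, for illustration only.
options = s3_storage_options("my-key", "my-secret", "s3.example.com")
print(options["client_kwargs"]["endpoint_url"])

# With fsspec and s3fs installed, the options would be passed like this:
# import fsspec
# mapper = fsspec.get_mapper("s3://sample_storage/new.zarr", **options)
# data.to_zarr(mapper, mode="w", consolidated=True)
```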
Thank you so much for the help!!
23 minutes vs 26 seconds is indeed a huge difference! Rich, could you post the profile information from your experiments? That would be super helpful.