Storing CMIP6 data on JASMIN's object store

Hello,

This is my first post to this community. I’m working on a pilot project with JASMIN, a data analysis facility for the NERC community in the UK. It’s a hybrid of a traditional batch system and an OpenStack cloud. We’ve recently set up a system so that groups can create their own instances of Pangeo in OpenStack projects on JASMIN. We are also experimenting with storing CMIP6 data on an object store that is also part of JASMIN and is S3-compatible.

I’ve looked at https://pangeo.io/data.html to see how you’ve gone about uploading data to Google Cloud object storage with xarray and zarr. We would like a workflow where we transfer CMIP6 data from our POSIX file system to our object store. Looking at the guide, it looks like for step 2) we could skip a step and write directly to our object store. I think we can do that with the to_zarr call described, but would appreciate some tips. I’m not sure how we give to_zarr a handle to tell it the location of the object store rather than a POSIX directory path.

We’d also appreciate any pointers with using Dask to scale this out into a larger workflow.

Perhaps you also have some general experience you could share from the work you’ve done uploading CMIP6 data into the public cloud?

Thanks,
Phil


Hi Phil,

Theo from the Met Office Informatics Lab here.

Some things that might help.

Rich Signell has a gist (hope you don’t mind the share, Rich) describing a workflow for creating Zarrs on object stores from many individual files on POSIX. We found it needs a tweak here and there for your own use case, but it’s a good paradigm.

The main gotchas with scaling out that I’ve found:

  • You can only append to Zarrs, so make sure you have a process that starts writing at index 0.
  • Choosing good chunking options is always hard. If you have many small files it can be tempting to write each one as a chunk, but you might get significantly better write performance if you first combine many files into one xarray object and then write out in bigger chunks (though not too big). Even if you don’t want bigger chunks, because of the way dask/zarr parallelise this operation you will do better combining more data into one xarray object and doing a big-ish append rather than many small ones.
  • On “I’m not sure how we give to_zarr a handle to tell it the location of the object store…”: the to_zarr method can take a store argument. This is an object that implements a dictionary-like API; xarray tells it the key and the binary data to write to that key, and the store is responsible for doing the write. Zarr ships with lots of stores, but you might need to implement your own or extend/override the S3 one. If my memory serves, the S3 store can take a “filesystem” object (typically from s3fs) which is responsible for writing to S3, so you might be able to implement an object like that for your store (see the sketch after this list). Happy to talk more on this subject.
  • My general experience is that this is non-trivial. It seems simple enough, but somehow the reality of the data always throws up interesting aspects/artefacts. It’s simpler to create single-phenomenon Zarr objects (by creating single-phenomenon arrays), but this may or may not be your ideal access pattern. Exposing whatever you do through intake allows you to ‘tidy up’ if necessary by organising things into catalogues, nested catalogues, etc.
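A minimal sketch of that append pattern, assuming xarray, s3fs and zarr are available (the endpoint, credentials, bucket, paths and chunk sizes below are all made-up placeholders, not your actual setup):

import s3fs
import xarray as xr

# connect to the S3-compatible object store (placeholder endpoint and credentials)
fs = s3fs.S3FileSystem(key='ACCESS_KEY', secret='SECRET_KEY',
                       client_kwargs={'endpoint_url': 'https://objectstore.example.ac.uk'})
store = fs.get_mapper('my-bucket/cmip6/tas.zarr')

# first batch: combine several small files into one dataset, rechunk,
# and create the zarr from index 0 with mode='w'
ds0 = xr.open_mfdataset('/posix/path/batch0/*.nc', combine='by_coords')
ds0.chunk({'time': 120}).to_zarr(store, mode='w')

# later batches: append along the time dimension in big-ish pieces
# rather than doing many tiny writes
ds1 = xr.open_mfdataset('/posix/path/batch1/*.nc', combine='by_coords')
ds1.chunk({'time': 120}).to_zarr(store, mode='a', append_dim='time')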

Good luck!!

  • Sorry, I’m only allowed two links, so you might need to search the web a bit more…

Hi @philipkershaw - we are super excited about this project!

This is definitely possible. You have two choices:

  • write the zarr to POSIX disk and then copy it to the object store with an S3 command-line utility (a sketch of this option follows below)
  • stream the data directly to the object store

Dask can parallelize both operations.
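For the first option, here is a rough sketch (the local path and bucket name are placeholders, and fsspec’s recursive put is just one way to do the copy; a command-line tool works just as well):

import s3fs
import xarray as xr

# build the dataset and write the zarr to local POSIX disk first
ds = xr.open_mfdataset('/all_files/*.nc')
ds.to_zarr('/scratch/my_dataset.zarr', consolidated=True)

# then copy the whole zarr directory tree up to the object store
fs = s3fs.S3FileSystem(key=s3_access_key, secret=s3_secret,
                       client_kwargs={'endpoint_url': 'https://your-object-store.example'})
fs.put('/scratch/my_dataset.zarr', 'bucket_name/my_dataset.zarr', recursive=True)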

For the streaming option, you will need to create an fsspec-style mapper object for your object store. I recently did something like this with Wasabi cloud storage.

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(key=s3_access_key, secret=s3_secret,
                       client_kwargs={'endpoint_url': 'https://s3.us-east-2.wasabisys.com'})
mapper = fs.get_mapper('bucket_name/whatever/path/you/want')

# create an xarray dataset somehow, e.g.
ds = xr.open_mfdataset('/all_files/*.nc')

# write to object store
# (if you're connected to a dask distributed cluster, this will run in parallel)
# (if you pass a regular path instead of `mapper`, you would write to disk instead)
ds.to_zarr(mapper, consolidated=True)

# to read back the data after writing, you need to restart your kernel for some reason
# I think this has to do with s3fs caching
ds_s3 = xr.open_zarr(mapper, consolidated=True)
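On the Dask side, here is a minimal sketch of wiring up a distributed cluster first (LocalCluster is just a stand-in for illustration; on JASMIN you would point the Client at your own scheduler, e.g. one started with dask-jobqueue or dask-gateway):

from dask.distributed import Client, LocalCluster

# stand-in cluster; replace with your own scheduler address or cluster object
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)

# with the client active, the chunked writes inside ds.to_zarr(mapper, ...)
# run in parallel across the workers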

We need to refresh our docs with some updated info.

Thanks @Theo_McCaie, @rabernat! This help is much appreciated. Almost certainly there’ll be more questions to follow ;). We definitely prefer streaming directly to our object store. My colleague Matt here tried s3fs earlier in the week and did an initial test write to our store. We’ll need to get Dask working in this setup so that we can scale it up.

We see a lot of potential in object storage for our on-premise infrastructure so this kind of pilot is important for us.


zarr + object_store = :fire: