How to read multiple zarr archives at once from s3?

Hello,
I am successfully using zarr+s3, but now I want to improve my solution.

def get_xarray_from_s3(bucket_name: str, dataset_name: str) -> xarray.Dataset:
    """
    Basic function to take xarray data from s3 bucket

    Args:
        bucket_name: name of the bucket
        dataset_name: refined name of the dataset, can be a path too

    Returns:
        xarray.Dataset object with the data from s3

    """
    check_aws_env_vars()

    s3_out = s3fs.S3FileSystem(anon=False)
    return xarray.open_zarr(
        store=s3fs.S3Map(
            root=f"s3:///{bucket_name}/{dataset_name}.zarr", s3=s3_out, check=False
        )
    )

This is my latest working version.

Now I want something like this, where I can pass a list of datasets in the bucket:

def get_xarray_from_s3_multiple(bucket_name: str, dataset_names: List[str]) -> xarray.Dataset:
    """
    Basic function to take xarray data from s3 bucket

    Args:
        bucket_name: name of the bucket
        dataset_names: refined names of the datasets, each can be a path too

    Returns:
        xarray.Dataset object with the data from s3

    """
    check_aws_env_vars()

    s3_out = s3fs.S3FileSystem(anon=False)
    fileset = [s3_out.open(f"s3:///{bucket_name}/{dataset_name}.zarr") for dataset_name in dataset_names]
    return xarray.open_mfdataset(fileset, engine='zarr', consolidated=True)

But this is not working due to this issue:

ValueError: Starting with Zarr 2.11.0, stores must be subclasses of BaseStore, if your store exposes the MutableMapping interface wrap it in Zarr.storage.KVStore. Got ()

I tried wrapping the object returned by s3_out.open() in zarr.storage.KVStore, but then I run into a TypeError.

I hope someone here knows how to access multiple zarr archives at once.


Would you mind trying with an older version of zarr? It sounds like xarray may need to be updated for the multi-dataset case with newer zarr.
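
Before downgrading, it may help to confirm which versions are actually installed. The following is only a rough sketch: the helper names (parse_release, is_at_least, installed_version) are made up for illustration, and the 2.11.0 threshold is taken from the error message above rather than from any check against the zarr source.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_release(version_string: str) -> Tuple[int, ...]:
    """Extract the leading numeric release, padded to three parts."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unrecognised version: {version_string!r}")
    parts = [int(part) for part in match.group().split(".")]
    # Pad so that "2.11" compares equal to "2.11.0".
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def is_at_least(version_string: str, minimum: Tuple[int, ...] = (2, 11, 0)) -> bool:
    """True if the release is at or above `minimum` (default: 2.11.0,
    the version named in the error message; an assumption, not verified)."""
    return parse_release(version_string) >= minimum


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    for package in ("zarr", "xarray", "s3fs"):
        found = installed_version(package)
        note = ""
        if package == "zarr" and found is not None and is_at_least(found):
            note = "  (at or above 2.11.0, where the store check applies)"
        print(f"{package}: {found}{note}")
```

If zarr turns out to be at or above 2.11.0, pinning an older release in a scratch environment would be one way to test the suggestion above.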

Note that kerchunk allows you to build a single virtual dataset out of many datasets, specifying how to merge them. For the case of zarr inputs, this is not particularly useful, but you could consider it a workaround in this case.

Is the .open() call intentional here? You may want to create a mapper object instead.

...
fileset = [s3fs.S3Map(
            root=f"s3:///{bucket_name}/{dataset_name}.zarr", s3=s3_out, check=False
        ) for dataset_name in dataset_names]
return xarray.open_mfdataset(fileset, engine='zarr', consolidated=True)

Another option is to pass the list of store URLs to xarray.open_mfdataset() directly:

....
fileset = [f"s3:///{bucket_name}/{dataset_name}.zarr" for dataset_name in dataset_names]
return xarray.open_mfdataset(fileset, engine='zarr', consolidated=True)
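
Put together, this second option might look roughly like the sketch below. A few assumptions: build_zarr_urls is a hypothetical helper added here so the URL construction can be checked on its own, the URLs use the conventional two-slash s3:// form rather than the three slashes in the snippets above, and AWS credentials are expected to be picked up from the environment by s3fs. The check_aws_env_vars() call from the original function is left out because its definition was not posted.

```python
from typing import List


def build_zarr_urls(bucket_name: str, dataset_names: List[str]) -> List[str]:
    """Return one s3:// URL per dataset name (hypothetical helper)."""
    return [f"s3://{bucket_name}/{name}.zarr" for name in dataset_names]


def get_xarray_from_s3_multiple(bucket_name: str, dataset_names: List[str]):
    """Open several zarr stores lazily and combine them into one dataset."""
    # Imported inside the function so the URL helper above can be used
    # (and tested) without xarray installed.
    import xarray

    urls = build_zarr_urls(bucket_name, dataset_names)
    # xarray passes each URL on to fsspec/s3fs; no file handles or
    # mapper objects need to be created by hand.
    return xarray.open_mfdataset(urls, engine="zarr", consolidated=True)
```

Calling get_xarray_from_s3_multiple("my-bucket", ["a", "b"]) would then try to open s3://my-bucket/a.zarr and s3://my-bucket/b.zarr; the bucket and dataset names here are placeholders. Whether the stores combine cleanly depends on their coordinates, so the combine-related arguments of open_mfdataset may need adjusting.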

Thanks @andersy005, this works for me so far. Thanks a lot.
