How to read multiple zarr archives at once from s3?

Hello,
I am successfully using zarr + s3, but now I want to improve my solution.

def get_xarray_from_s3(bucket_name: str, dataset_name: str) -> xarray.Dataset:
    """
    Basic function to take xarray data from s3 bucket

    Args:
        bucket_name: name of the bucket
        dataset_name: refined name of the dataset, can be a path too

    Returns:
        xarray.Dataset object with the data from s3

    """
    check_aws_env_vars()

    s3_out = s3fs.S3FileSystem(anon=False)
    return xarray.open_zarr(
        store=s3fs.S3Map(
            root=f"s3:///{bucket_name}/{dataset_name}.zarr", s3=s3_out, check=False
        )
    )

This is my current working version.

Now I would like something like the following, where I can pass a list of datasets in the bucket:

def get_xarray_from_s3_multiple(bucket_name: str, dataset_names: List[str]) -> xarray.Dataset:
    """
    Basic function to take xarray data from s3 bucket

    Args:
        bucket_name: name of the bucket
        dataset_names: refined names of the datasets, can be paths too

    Returns:
        xarray.Dataset object with the data from s3

    """
    check_aws_env_vars()

    s3_out = s3fs.S3FileSystem(anon=False)
    fileset = [s3_out.open(f"s3:///{bucket_name}/{dataset_name}.zarr") for dataset_name in dataset_names]
    return xarray.open_mfdataset(fileset, engine='zarr', consolidated=True)

But this does not work; it fails with the following error:

ValueError: Starting with Zarr 2.11.0, stores must be subclasses of BaseStore, if your store exposes the MutableMapping interface wrap it in Zarr.storage.KVStore. Got ()

I tried to wrap the object returned by s3_out.open() in Zarr.storage.KVStore, but then I run into a TypeError.

I hope someone here knows how to access multiple zarr archives at once.

Would you mind trying with an older version of zarr? It sounds like xarray may need to be updated for the multi-dataset case with newer zarr.
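To check whether the installed zarr is affected, something like the following could help. It is only a sketch: the 2.11.0 threshold is taken from the error message above, and the helper names (release_tuple, requires_base_store, installed_zarr_version) are made up for illustration.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def release_tuple(version_string: str) -> Tuple[int, ...]:
    """Parse the leading numeric release part, e.g. '2.13.0a1' -> (2, 13, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    parts = [int(p) for p in match.group(0).split(".")] if match else []
    # Pad short versions such as '2.11' so tuple comparison behaves as expected.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def requires_base_store(version_string: str) -> bool:
    """True if this zarr version is at or above the 2.11.0 named in the error."""
    return release_tuple(version_string) >= (2, 11, 0)


def installed_zarr_version() -> Optional[str]:
    """Return the installed zarr version string, or None if zarr is missing."""
    try:
        return version("zarr")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_zarr_version()
    if found is None:
        print("zarr is not installed")
    else:
        print(found, "enforces BaseStore:", requires_base_store(found))
```

If this reports a version at or above 2.11.0, pinning an older zarr in a test environment is one way to confirm that the version is the cause.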

Note that kerchunk allows you to build a single virtual dataset out of many datasets, specifying how to merge them. For the case of zarr inputs, this is not particularly useful, but you could consider it a workaround in this case.

Is the .open() call intentional here? You may want to create a mapper object instead.

...
fileset = [
    s3fs.S3Map(root=f"s3:///{bucket_name}/{dataset_name}.zarr", s3=s3_out, check=False)
    for dataset_name in dataset_names
]
return xarray.open_mfdataset(fileset, engine='zarr', consolidated=True)

Another option is to pass the list of store URLs (plain strings) to xarray.open_mfdataset() directly:

....
fileset = [f"s3:///{bucket_name}/{dataset_name}.zarr" for dataset_name in dataset_names]
return xarray.open_mfdataset(fileset, engine='zarr', consolidated=True)
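
Put together as a complete function, the string-URL variant might look roughly like this. Treat it as a sketch under a few assumptions: build_zarr_urls is a hypothetical helper that is not in the original code, the URLs use the conventional two-slash s3:// form rather than the three slashes in the snippets above, and AWS credentials are expected to be available in the environment as before.

```python
from typing import List


def build_zarr_urls(bucket_name: str, dataset_names: List[str]) -> List[str]:
    """Build one s3:// URL per zarr archive (hypothetical helper)."""
    return [f"s3://{bucket_name}/{name}.zarr" for name in dataset_names]


def get_xarray_from_s3_multiple(bucket_name: str, dataset_names: List[str]):
    """Open several zarr archives from one bucket as a single xarray.Dataset."""
    # Imported here so the URL helper can be used without xarray installed.
    import xarray

    urls = build_zarr_urls(bucket_name, dataset_names)
    # With engine="zarr", xarray opens each URL string itself,
    # so no file object or mapper has to be created by hand.
    return xarray.open_mfdataset(urls, engine="zarr", consolidated=True)
```

Note that open_mfdataset still has to combine the individual datasets, so this only works when their coordinates and dimensions are compatible.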

Thanks @andersy005, this works for me so far. Thanks a lot.
