How to grab data from Amazon?

Hi folks,

Apologies for the overly simple question, but how would you go about loading data that is hosted on Amazon in the form ds = xr.open_zarr(fname)? I’m familiar with working with Zarr stores that are on my local HPC cluster, but I’m new to working with data in the cloud and I’m having a hard time understanding all the different types of links. I’m ultimately trying to open the ERA5 dataset on AWS to hopefully do sensible heat flux analysis.

There are a number of great examples in the Gallery that use data that is hosted on AWS, like the CESM LENS demo. The data is described at https://registry.opendata.aws/ncar-cesm-lens/. However, intake reads data from https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json, which is a different link. The AWS Open Data Registry page lists some other info about the S3 bucket in the sidebar (arn:aws:s3:::ncar-cesm-lens, us-west-2, aws s3 ls s3://ncar-cesm-lens/ --no-sign-request), but I’m having a difficult time seeing how those values translate into the link that we feed to intake.

I think part of my confusion also comes from the multitude of libraries that are used to read S3 data. For example, this SST demo uses fsspec to read from the MUR SST dataset on AWS, whereas this ERA5 demo uses s3fs.

Welcome @rybchuk! Indeed it is a bit confusing.

First, it’s important to understand the layers. From low to high, we have

  • Zarr: this is the format that a lot of cloud data is stored in on object stores (S3, GCS, etc.). As you know, a Zarr store is actually a directory, not a single file. Cloud datasets often use consolidated metadata to remove the need to “list” the directories in the object store, which can be slow. (There’s a small low-level sketch of this near the end of this post.)
  • Fsspec implementation: fsspec is a library that lets us read from different storage services through a common API. It’s a key piece in making all of this work. There are many different implementations of fsspec for different storage services (e.g. local files, http, s3, gcs, dropbox, ftp, etc.). s3fs is the fsspec implementation for s3. When you do fsspec.get_mapper('s3://bucket/path'), it actually dispatches to s3fs. It’s equivalent to:
    import s3fs
    fs = s3fs.S3FileSystem()
    mapper = fs.get_mapper('s3://bucket/path')
    
    These mapper objects are the things we pass to Xarray or Zarr to actually open the dataset.
  • Xarray is the analysis library that can interpret Zarr stores as netCDF-like datasets. For Zarr, Xarray takes an fsspec mapping as the input to open_zarr, opens the store with Zarr, and then decodes it according to xarray conventions.
  • Intake is a catalog for data. It is 100% optional! You do not need intake to open any cloud data, but it can make it more convenient. For LENS and CMIP6, we provide intake-esm compatible catalogs. (That’s what https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json is.) But you can always bypass that and go directly for the data, if you know where to look. For example, that json file points at a CSV file (https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.csv) that has entries like this:
    variable,long_name,component,experiment,frequency,vertical_levels,spatial_domain,units,start_time,end_time,path
    FLNS,net longwave flux at surface,atm,20C,daily,1.0,global,W/m2,1920-01-01 12:00:00,2005-12-31 12:00:00,s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS.zarr
    FLNSC,clearsky net longwave flux at surface,atm,20C,daily,1.0,global,W/m2,1920-01-01 12:00:00,2005-12-31 12:00:00,s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC.zarr
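    
    If you do want to go through intake rather than bypass it, a rough sketch with intake-esm (untested, and the exact keyword arguments differ a bit between intake-esm versions) would be something like:
    
    import intake
    
    # open the intake-esm catalog (the json file above)
    col = intake.open_esm_datastore(
        'https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json'
    )
    # subset the catalog, then load the matching zarr stores as xarray datasets
    subset = col.search(variable='FLNS', experiment='20C', frequency='daily')
    dsets = subset.to_dataset_dict(
        zarr_kwargs={'consolidated': True}, storage_options={'anon': True}
    )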
    

From that CSV, you can see the links to the actual zarr stores, e.g. s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC.zarr. If you’re using recent versions of xarray, zarr, and fsspec, you could just do

xr.open_zarr('s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC.zarr', consolidated=True)

or more verbosely

fs = s3fs.S3FileSystem(anon=True)
mapper = fs.get_mapper('s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC.zarr')
xr.open_zarr(mapper, consolidated=True)
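
And if you want to see the consolidated-metadata trick from the Zarr bullet above without xarray in the picture, the low-level equivalent is roughly (also untested):

import s3fs
import zarr

fs = s3fs.S3FileSystem(anon=True)
mapper = fs.get_mapper('s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC.zarr')
# open_consolidated reads the single .zmetadata object instead of listing the store
group = zarr.open_consolidated(mapper, mode='r')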

Hope this helps. Disclaimer: all this code is untested.

Thanks @rabernat, this was super helpful on so many levels! I didn’t realize that intake catalogs have to be created separately for each dataset, so that was useful to clarify.

Aside from the libraries, one of the other things I was confused about was how the S3 URLs work, but I think I understand now after looking at the LENS documentation. The “root directory” can be found in the sidebar of the Open Data page for each dataset (s3://ncar-cesm-lens/ for LENS, s3://era5-pds/ for ERA5, etc.). However, to find the location of individual variables, you need to look elsewhere. If an intake catalog exists, you can grab that info there. Otherwise, you will probably need to look at the documentation to find the naming convention (here for LENS, here for ERA5). It looks like this documentation is usually linked from the Open Data page.
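
Another way to poke around and discover those paths is to just list the bucket with s3fs (untested sketch; I’m only assuming the era5-pds layout matches its docs):

import s3fs

fs = s3fs.S3FileSystem(anon=True)              # public buckets, no credentials needed
print(fs.ls('ncar-cesm-lens'))                 # top-level "directories" of the LENS bucket
print(fs.ls('ncar-cesm-lens/atm/daily')[:5])   # a few of the daily atmosphere zarr stores
print(fs.ls('era5-pds'))                       # ERA5 bucket (top level organized by year, per its docs)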

Thanks!

Hello @rybchuk - My group is the one that published the CESM LENS data on AWS, so we are glad to hear it may be of use to you, even if only as an example. If your main interest is in reanalyses such as ERA5, you should know that we are currently working on converting the CAM6 DART Reanalysis to Zarr and publishing it on AWS (NCAR releases new, realistic atmospheric reanalysis | Computational and Information Systems Laboratory).

Hi @jeffdlb, thanks for publishing the LENS data! And it’s exciting to hear that the CAM6 DART data is getting pushed to the cloud too 🙂
