Reading GOES-R s3 netCDFs from an AWS EC2 instance - is it possible to get faster speeds than from my local machine?

Hi everyone,

I’ve been trying to develop a tropical storm satellite image browsing tool using the public GOES-R s3 buckets, and rather predictably I ran into a significant slowdown when changing from reading a sample of locally-downloaded files to reading the s3 objects directly. I moved my code to an EC2 instance on us-east-1 (which should be the same region as the noaa-goes16 bucket), but the loading times actually increased from 3-4 seconds to 5-6. I’m not sure yet why that is.

Ideally I wanted to be able to do simple processing on the images without having to process and host .zarr or .json files for each NetCDF. I did try the method for creating reference jsons outlined in Lucas Sterzinger’s Medium post “Fake it until you make it — Reading GOES NetCDF4 data on AWS S3 as Zarr for rapid data access”, but I’m concerned about the cost of uploading and hosting something like that - the mesoscale data is scanned once a minute, and even taking into account the subsets scanned during active tropical cyclones in the Atlantic, there are something like 11 million files. I could reduce that number a bit by only referencing the mesoscale scans that actually contain a cyclone, but not down to a size that would make sense for an individual to host online.

Since this is essentially a browsing tool, I would need to be able to switch from file to file pretty quickly - I’d be fine with a 1-2 second lag, but I’m getting more like 5-6 on EC2. I’m pretty new to netCDF and AWS (this was a final project for a machine learning bootcamp) so any suggestions to either speed things up or change how I’m approaching the problem would be quite welcome!

For reference, a rough demo of the browser is available at: http://mesoscale-env.eba-np7r4e3n.us-east-1.elasticbeanstalk.com

(Please forgive the upside-down image plots and broken download buttons - I’m waiting to fix those until I have a better handle on what is feasible for speed).


Hi @mathematigal! How exactly are you accessing these S3 files with xarray? Are you using fsspec/s3fs?

That demo looks neat :slight_smile:

I think @lsterzinger’s question is on the right path for diagnosing the slowness of reading the data.

I don’t know if Azure is an option, but we do host GOES-R on the Planetary Computer as both COG and NetCDF. I suspect that the STAC API might be helpful for finding the right imagery for an area / datetime. And you might even be able to use our data API to load the imagery and avoid needing to set up your own compute in Azure. If I did the query right, the Planetary Computer Explorer turns up what seems to be Hurricane Elsa over Cuba on July 5th.

Lucas: Yes - I’ve also tried using the requests library on HTTP URLs, but it was a tad slower. My current code looks something like this:

# Initialize s3 client with s3fs
fs = s3fs.S3FileSystem(anon=True)

# Open with s3fs
f = fs.open("s3://noaa-goes16/ABI-L2-MCMIPM/2021/241/14/OR_ABI-L2-MCMIPM1-M6_G16_s20212411400278_e20212411400347_c20212411400421.nc")

# open xarray dataset
with xr.open_dataset(f, engine='h5netcdf') as data:
    px.imshow(data.rgb.NaturalColor())

That code takes about 2 seconds to run on my local machine. Replacing the s3 path with a local file runs in milliseconds. The browser is plotting three things at once, so it’s a little slower than this.

Hi @lsterzinger! I replied, just not as a reply to you. Still getting used to this interface.

Thanks, Tom! The Planetary Computer project looks awesome. I’m not very familiar with it - is it something that would allow for customizing color options? Part of the reason I’m doing my own compute is so I can have full control over the image plotting.

Most likely s3fs is making multiple HTTP requests to fetch the metadata it needs to open the dataset (I think there’s a logging configuration / environment variable that you can set that’ll log any network requests). That’s one of the downsides of trying to read HDF5 files over the network. I don’t recall if s3fs has any parameters to tune this behavior.
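
Something along these lines might do it (a rough sketch using the standard logging module; I haven’t double-checked the exact logger names):

import logging

# Rough sketch: turn up logging for the s3fs / fsspec loggers so that
# each metadata request shows up while xarray opens the file
logging.basicConfig()
logging.getLogger("s3fs").setLevel(logging.DEBUG)
logging.getLogger("fsspec").setLevel(logging.DEBUG)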

If you’re going to read the whole file anyway to create the image, you might be best off downloading the file to disk and then reading it in with xarray, or reading all the bytes into memory in one shot with `data = f.read()`.
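
For example, roughly like this (untested, but I believe the h5netcdf engine accepts a file-like object):

import io

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
url = "s3://noaa-goes16/ABI-L2-MCMIPM/2021/241/14/OR_ABI-L2-MCMIPM1-M6_G16_s20212411400278_e20212411400347_c20212411400421.nc"

# Pull the whole ~5 MB object into memory in one read, then let
# xarray / h5netcdf work on the in-memory bytes instead of making
# many small ranged requests to s3
with fs.open(url) as remote:
    buf = io.BytesIO(remote.read())

data = xr.open_dataset(buf, engine="h5netcdf")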

is it something that would allow for customizing color options?

In this case, the Planetary Computer is just using TiTiler to dynamically render the COGs from Azure Blob Storage. TiTiler in turn uses rio-tiler, which lets you customize some things (see the Introduction to rio-tiler docs), and TiTiler might have some additional parameters on its /cog endpoints to control the rendering. I’ve struggled with learning its expression system, but it seems to be pretty flexible.

Are you referring to s3fs.core.setup_logging("DEBUG")? The example I shared makes two fetch/get combos to read the file. What I find kind of confusing is that when that same code is executed from my EC2 instance, it takes twice as long to do the same number of fetch/get combos. I would have thought that running the code from the same AWS region would be faster, based on what the AWS docs claim.

I could rework the code to download each selected file and read it locally on the EC2 instance, but due to space constraints I’d want to delete the file after using it. It seems like that would really limit what I could do with the tool, too (I wouldn’t want to share this publicly during, say, an active Atlantic hurricane season).
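
Roughly what I have in mind there (just an untested sketch using a temporary directory):

import os
import tempfile

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
url = "s3://noaa-goes16/ABI-L2-MCMIPM/2021/241/14/OR_ABI-L2-MCMIPM1-M6_G16_s20212411400278_e20212411400347_c20212411400421.nc"

# Download the selected scan to a temp file, read it locally, and let the
# temp-directory cleanup delete it again to stay within the disk limits
with tempfile.TemporaryDirectory() as tmpdir:
    local_path = os.path.join(tmpdir, os.path.basename(url))
    fs.get(url, local_path)
    with xr.open_dataset(local_path, engine="netcdf4") as ds:
        ds.load()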

What structural changes would be necessary to scale this so it could reasonably ‘browse’ the full set of cyclone mesoscales, do you think? I’m assuming that if I can’t get better performance reading the s3 files directly, I’d have to do something like convert those files to zarr or create a reference metadata database and host that online. That would definitely move this out of the realm of a personal project!

I am only using 9 of the 16 channel variables in each file (and only reading between 1-5 of them for any particular plot), but my understanding is that the reason for the slowdown is that xarray can’t do the lazy loading it does with local files because the netCDF metadata isn’t consolidated, so it has to load the entire s3 object. Does that sound right?
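
For reference, this is roughly what I mean by only reading a few of the channels (assuming drop_variables behaves the way I think it does, and that the per-band variables are named CMI_C01 through CMI_C16):

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
f = fs.open("s3://noaa-goes16/ABI-L2-MCMIPM/2021/241/14/OR_ABI-L2-MCMIPM1-M6_G16_s20212411400278_e20212411400347_c20212411400421.nc")

# Keep only the bands needed for one plot and drop the rest at open time
wanted = {"CMI_C01", "CMI_C02", "CMI_C03"}
drop = [f"CMI_C{b:02d}" for b in range(1, 17) if f"CMI_C{b:02d}" not in wanted]

with xr.open_dataset(f, engine="h5netcdf", drop_variables=drop) as data:
    data.load()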

I was expecting a bit of a slowdown based on reading about netCDF/xarray/s3 here, but I went from plotting three figures from local files in about 800 ms to plotting the same three in about 6 seconds from code and data in the cloud. (Those three plots use data from two ~5MB compressed netCDFs.) Is that pretty standard for running in the cloud? I could also just package this as a downloadable tool that runs locally on my machine, since reading s3 is significantly faster from my laptop.

It’s a genuine concern:
a) those files are (of course) much smaller than the originals
b) when you produce combined files from many inputs, the size ratio is usually far greater
c) the files usually compress very well (Zstd consistently seems the best)
d) you can tailor what you keep to whatever in the dataset is of interest to you - this requires some deeper knowledge of the workings of zarr, etc.
e) we intend to work on higher-efficiency binary storage, with lazy loading of just what you need
f) pangeo-forge will become the de-facto place to generate these reference sets, so the compute cost to produce them, plus the storage and egress charges, will fall on someone else, so long as you are prepared to write and maintain the recipe.
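
For completeness, generating and using a reference set for a single file looks roughly like this (a sketch from memory - check the kerchunk docs for the current API):

import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://noaa-goes16/ABI-L2-MCMIPM/2021/241/14/OR_ABI-L2-MCMIPM1-M6_G16_s20212411400278_e20212411400347_c20212411400421.nc"

# Scan the HDF5 file once and build a JSON-able dict of chunk references
with fsspec.open(url, anon=True) as inf:
    refs = SingleHdf5ToZarr(inf, url).translate()

# Later, read the original bytes through the reference filesystem as zarr
fs = fsspec.filesystem(
    "reference", fo=refs, remote_protocol="s3", remote_options={"anon": True}
)
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)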

Hmmm. Are you telling me this because it’s normal for xarray to slow down this much when reading objects from an s3 bucket in the same region as the cloud server running xarray? That’s the part I find confusing - I’m not expecting miracles! But I also wasn’t expecting the performance to be fully three times worse than when I was reading those same s3 objects from my laptop through my home network. I understand the value of the reference files, but since I’m only trying to open 5MB files one at a time in a browsing tool, I’m not sure I want to go down that road quite yet. Is using xarray from the cloud to read small s3-hosted netcdfs completely out of the question?

I agree that makes little sense. And I agree with your assessment that the reference files / kerchunk are not really needed for your application.

I tried running the code above, but it didn’t quite work for me, for the following two reasons:

  • px was not defined
  • rgb is not a data variable of the dataset

I modified it as follows:

import s3fs
import xarray as xr

# Initialize s3 client with s3fs
fs = s3fs.S3FileSystem(anon=True)

# Open with s3fs
f = fs.open("s3://noaa-goes16/ABI-L2-MCMIPM/2021/241/14/OR_ABI-L2-MCMIPM1-M6_G16_s20212411400278_e20212411400347_c20212411400421.nc")

# this is the time it takes to initialize this dataset
%time data = xr.open_dataset(f, engine='h5netcdf')
# this is the time it takes to load all of the variables into memory
%time data.load()

I think calling data.load() is a better benchmark, because it isolates the time it takes to download the data from s3 into memory. We should separate this from how long it takes to plot. I ran this code from my own laptop in NYC, Binder, and a 2i2c JupyterHub in Google Cloud US-Central-1.

As an additional data point, I tried just calling

time wget https://s3.amazonaws.com/noaa-goes16/ABI-L2-MCMIPM/2021/241/14/OR_ABI-L2-MCMIPM1-M6_G16_s20212411400278_e20212411400347_c20212411400421.nc

from the command line as a baseline for bandwidth that does not involve python at all.

Here is the timing I got

location            xr.open_dataset   data.load()   wget
laptop              1.17 s            175 ms        699 ms
GESIS Binder        2.17 s            349 ms        1.42 s
GCS US-Central-1    1.0 s             311 ms        499 ms

Some takeaways:

  • Xarray + s3fs is on the same order of magnitude as a vanilla wget (2-3x slower, but xarray is doing more work)
  • My laptop was just a tiny bit faster than either of the cloud environments, but not significantly
  • There seemed to be quite a bit of variance between runs (but I didn’t try to quantify it)

In general, the time here should be governed almost completely by network properties: latency and bandwidth.

Unhelpfully, I don’t actually have access to an AWS instance from which to try this right now. But I agree completely that it should definitely be fastest from the same AWS region because that’s where you have the lowest network latency and highest bandwidth to s3. I have no specific hypothesis for why you’re not seeing that.

Another thing you can try is to cache the file first:

import fsspec
import xarray as xr
url = "s3://noaa-goes16/ABI-L2-MCMIPM/2021/241/14/OR_ABI-L2-MCMIPM1-M6_G16_s20212411400278_e20212411400347_c20212411400421.nc"
with fsspec.open(f"simplecache::{url}", s3=dict(anon=True)) as f:
    data = xr.open_dataset(f, engine='h5netcdf')
    data.load()

This seemed a little faster for me. Or you could bypass h5py / h5netcdf at read time and just give the locally cached path (f.name) to the netcdf4 engine:

# Download to a local cache via fsspec, then open the cached file (f.name) with the netcdf4 engine
import fsspec
import xarray as xr
url = "s3://noaa-goes16/ABI-L2-MCMIPM/2021/241/14/OR_ABI-L2-MCMIPM1-M6_G16_s20212411400278_e20212411400347_c20212411400421.nc"
with fsspec.open(f"simplecache::{url}", s3=dict(anon=True)) as f:
    data = xr.open_dataset(f.name, engine='netcdf4')
    data.load()

@mathematigal - just as a follow up to the above post, I would be very interested to see the results of the exact same benchmarks (including wget) from your AWS environment. If wget is also slower, then you know it is a network issue.

Hey folks, this is a somewhat old post but I wanted to drop in to say thanks and post my own experience for anyone else. I was very confused why xr.open_dataset was so slow with pulling from s3 - it would take about 4 minutes, whereas wget took ~5 seconds.

Thanks @rabernat for your suggestions regarding local caching. This made a dramatic difference for me, reducing the runtime of xr.open_dataset and ds.load() from 4 minutes to ~5 seconds, right on par with wget. I found no difference in speed between the two versions you posted, i.e. whether the h5netcdf or netcdf4 engines are used.

For any future readers, here’s my experience (I’m running this from my laptop in Boulder, Colorado):

This took 4 minutes!

import s3fs
import xarray as xr
url = "s3://noaa-ufs-gefsv13replay-pds/1deg/1994/01/1994010100/bfg_1994010100_fhr00_control"
fs = s3fs.S3FileSystem(anon=True)
f = fs.open(url)
ds = xr.open_dataset(f, engine="h5netcdf")
ds.load()

whereas this took 4-5 seconds :rocket: (on par with wget)

import fsspec
import xarray as xr
url = "s3://noaa-ufs-gefsv13replay-pds/1deg/1994/01/1994010100/bfg_1994010100_fhr00_control"
with fsspec.open(f"simplecache::{url}", s3={"anon":True}) as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    ds.load()

I am thinking that fsspec should have an HDF5-specific caching strategy, as it does for parquet. Readahead works particularly badly for the typical case of little bits of metadata spread through the file; since much of that is in the header, the "first" strategy is what is used when kerchunking.
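
In the meantime, you can pick the caching strategy yourself when opening the file - something like this sketch (the keyword is cache_type, if I remember the signature right):

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
url = "s3://noaa-goes16/ABI-L2-MCMIPM/2021/241/14/OR_ABI-L2-MCMIPM1-M6_G16_s20212411400278_e20212411400347_c20212411400421.nc"

# "first" keeps the first block (where most of the HDF5 metadata lives)
# cached, instead of the default readahead behaviour
f = fs.open(url, cache_type="first", block_size=1024 * 1024)
ds = xr.open_dataset(f, engine="h5netcdf")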