I’ve been trying to build a tropical storm satellite image browsing tool on top of the public GOES-R S3 buckets, and rather predictably I ran into a significant slowdown when switching from reading a sample of locally downloaded files to reading the S3 objects directly. I moved my code to an EC2 instance in us-east-1 (which should be the same region as the noaa-goes16 bucket), but the loading times actually increased from 3-4 seconds to 5-6. I’m not sure yet why that is.
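My current guess is that the netCDF4/HDF5 layer issues many small range requests per file, each paying S3’s per-request latency. One workaround I’m considering (not yet in the demo) is fetching the whole object in a single GET and opening it from memory - mesoscale files are only a few MB, so this trades a little bandwidth for far fewer round trips. A minimal sketch, using only the standard library for the fetch (the object key in the comment is a placeholder, and the xarray call at the end is how I’d expect to open the buffer, untested here):

```python
import io
import urllib.request

BUCKET_URL = "https://noaa-goes16.s3.amazonaws.com"  # public bucket, no auth needed

def object_url(key: str) -> str:
    """Build the public HTTPS URL for an object in the noaa-goes16 bucket."""
    return f"{BUCKET_URL}/{key}"

def fetch_whole_object(key: str) -> io.BytesIO:
    """Download the entire NetCDF object with a single GET into memory,
    instead of letting the HDF5 reader issue many small range requests."""
    with urllib.request.urlopen(object_url(key)) as resp:
        return io.BytesIO(resp.read())

# Then open from memory, with no further S3 requests, e.g.:
# ds = xr.open_dataset(fetch_whole_object("ABI-L2-CMIPM/.../OR_....nc"),
#                      engine="h5netcdf")

print(object_url("ABI-L2-CMIPM/2022/250/18/example.nc"))
```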
Ideally I wanted to be able to do simple processing on the images without having to generate and host .zarr or .json files for each NetCDF. I did try the reference-JSON method outlined in “Fake it until you make it — Reading GOES NetCDF4 data on AWS S3 as Zarr for rapid data access” by Lucas Sterzinger (Pangeo blog on Medium), but I’m concerned about the cost of uploading and hosting something like that - the mesoscale data is scanned once a minute, and even taking into account only the subsets scanned during active tropical cyclones in the Atlantic, there are something like 11 million files. I could reduce that number a bit by only referencing the mesoscale scans that actually contain a cyclone, but not to any size that would actually make sense for an individual to host online.
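To put a rough number on that concern - the per-file reference size below is a guess on my part (it depends on chunking and how many variables get referenced), not something I’ve measured:

```python
# Back-of-envelope for hosting one kerchunk-style reference JSON per file.
n_files = 11_000_000   # approximate mesoscale file count from the post
ref_size_kb = 50       # ASSUMED average size of one reference JSON

total_gb = n_files * ref_size_kb / 1_000_000
print(f"~{total_gb:.0f} GB of reference files")  # ~550 GB under these assumptions
```

Even if the real per-file size is several times smaller, it still lands in the hundreds-of-GB range, which is why self-hosting the references doesn’t seem realistic for me.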
Since this is essentially a browsing tool, I need to be able to switch from file to file fairly quickly - I’d be fine with a 1-2 second lag, but I’m getting more like 5-6 seconds on EC2. I’m pretty new to netCDF and AWS (this started as a final project for a machine learning bootcamp), so any suggestions to speed things up, or to change how I’m approaching the problem entirely, would be very welcome!
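One mitigation I’m considering, independent of the raw read speed: while the user is looking at frame i, prefetch frames i+1 and i-1 in the background and keep recent frames in a small LRU cache, so stepping through scans feels instant even if each individual S3 read is slow. A stdlib-only sketch - `load_frame` is a placeholder for whatever actually fetches and decodes one NetCDF file, and a real version would need locking since the pool thread touches the cache:

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor

class FrameCache:
    """LRU cache of decoded frames with background prefetch.
    Not thread-safe as written; illustrative only."""

    def __init__(self, load_frame, max_items=16):
        self.load_frame = load_frame      # placeholder: fetch + decode one file
        self.cache = OrderedDict()        # key -> decoded frame, in LRU order
        self.max_items = max_items
        self.pool = ThreadPoolExecutor(max_workers=2)

    def get(self, key):
        """Return the frame for `key`, loading it on a cache miss."""
        if key not in self.cache:
            self.cache[key] = self.load_frame(key)
        self.cache.move_to_end(key)       # mark as most recently used
        while len(self.cache) > self.max_items:
            self.cache.popitem(last=False)  # evict least recently used
        return self.cache[key]

    def prefetch(self, key):
        """Warm the cache for a neighboring frame without blocking the UI."""
        if key not in self.cache:
            self.pool.submit(self.get, key)
```

For example, when the viewer lands on scan key `k`, the handler would call `cache.get(k)` for display and then `cache.prefetch(next_key)` / `cache.prefetch(prev_key)` so adjacent steps are usually already in memory.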
For reference, a rough demo of the browser is available at: http://mesoscale-env.eba-np7r4e3n.us-east-1.elasticbeanstalk.com
(Please forgive the upside-down image plots and broken download buttons - I’m waiting to fix those until I have a better handle on what is feasible for speed).