Low-budget Cloud Architecture for CESM Ensemble Analysis

playertr · July 15, 2020, 5:27pm

Hello!

I am an intern at Ice911 Research tasked with visualizing CESM model output for a nine-member ensemble of climate simulations, in a project investigating the impact of localized surface albedo modification. Thanks to several Medium articles by @jhamman and others in the Pangeo ecosystem, I discovered that Pangeo’s zarr-xarray-dask pipeline for cloud-based analysis is very well suited to this task. I have a couple of questions on how Ice911, a 501(c)(3) nonprofit, can move its model analysis to Pangeo while minimizing cost. I’ve enjoyed the Pangeo community’s approachable documentation, but would appreciate people’s input on a couple of design considerations.

1. Upgrading to a Cluster: We are able to produce visualizations on a single AWS EC2 instance, but most Pangeo deployments use Kubernetes clusters. Is it worth it to upgrade to a cluster?
- 1.1 My budget for computing services is around $50/month. It is hard to parse AWS’ pricing models to know whether we may deploy a cluster on this budget. Does anyone with experience with AWS and Kubernetes have recommendations on whether it is worth it?
- 1.2 Do folks recommend applying for a Pangeo Cloud account? This solution seems cheaper and easier, but its future seems uncertain.

2. Storing and Accessing Data in the Cloud: The variables of interest in the CESM 1.2 model output are stored in monthly time-slice .nc files in S3 Glacier, in different “folders” for each ensemble member. What is the best way to make these files accessible on the cloud?
- 2.1 It may be convenient to download and convert the ensemble members’ output to Zarr datastores individually then upload each store to an S3 bucket, but then each ensemble member would have a separate datastore. Is there a more convenient way, perhaps by leveraging intake-esm, to make the raw model output accessible?
- 2.2 Having a compressed and intelligently-chunked Zarr datastore would minimize the expense of costly Standard S3 storage, but I’m wary of compromising data integrity through lossy compression. Is this a valid concern?
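
A minimal sketch of the distinction behind question 2.2. Zarr's default compressor (Blosc) and general-purpose codecs such as Zlib are lossless, so a round trip returns the original bytes. Loss only enters if you opt into it, for example by packing values into a smaller dtype or rounding them before storage. The snippet uses the standard-library `zlib` as a stand-in for a Zarr codec, and the values are made up for illustration:

```python
import zlib
from array import array

# Hypothetical stand-in for one chunk of a float32 model field.
values = array("f", [273.15 + 0.001 * i for i in range(10_000)])

# Lossless path: compress the raw bytes, then decompress them again.
raw = values.tobytes()
compressed = zlib.compress(raw, 5)
restored = array("f")
restored.frombytes(zlib.decompress(compressed))

# True when every element survived the round trip unchanged.
lossless_ok = restored == values

# Lossy path, for contrast: rounding before storage discards information.
quantized = array("f", [round(v, 1) for v in values])
lossy_changed = quantized != values

print(f"raw bytes: {len(raw)}, compressed bytes: {len(compressed)}")
print(f"lossless round trip identical: {lossless_ok}")
print(f"rounded copy differs from original: {lossy_changed}")
```

So the concern applies to lossy encodings chosen explicitly, not to compression as such.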

Thank you for your input and for hosting such a welcoming community. If these questions are easily answered by documentation or other posts, please let me know. Since they regard design tradeoffs which may be important to other users, I’m asking here.

Sincerely,
Tim

playertr · November 18, 2020, 7:05pm

For posterity:

In our use case, it turned out that a hefty AWS r5n.2xlarge was sufficient for all steps of our Pangeo pipeline. I was unable to get Kubernetes up and running and sank about five days into the task, but I’m new to devops.
I started with @rabernat’s Zarr conversion pipeline and modified it for the AWS ecosystem. For us, it made sense to:
- Restore 15TB (9 ensemble members) of source NetCDF files from AWS S3 Glacier
- Create a 2TB EBS drive attached to an EC2 instance capable of storing one ensemble member’s worth of files
- For each ensemble member in turn, download the files to the EBS drive, create an appropriately-chunked Zarr dataset using a variant of the linked GitHub gist, and upload the Zarr dataset to the cloud.
- Delete the expensive EBS drive
- Open the cloud-based Zarr datasets and concatenate them into a single useful Zarr dataset.
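
The first step above (restoring from Glacier) can be sketched with boto3 roughly as follows. This is an illustration under assumptions, not the code actually used: the function names, the default of 7 days, and the `Bulk` tier are all hypothetical choices, and the bucket and prefix are whatever your archive uses.

```python
def build_restore_request(days: int = 7, tier: str = "Bulk") -> dict:
    """Build the RestoreRequest payload passed to S3 restore_object."""
    if tier not in {"Bulk", "Standard", "Expedited"}:
        raise ValueError(f"unknown retrieval tier: {tier}")
    return {"Days": days, "GlacierJobParameters": {"Tier": tier}}


def restore_prefix(bucket: str, prefix: str, days: int = 7, tier: str = "Bulk") -> int:
    """Request a temporary restore of every archived object under a prefix.

    Returns the number of restore requests issued. Needs AWS credentials.
    """
    import boto3  # imported here so build_restore_request works without boto3

    s3 = boto3.client("s3")
    request = build_restore_request(days, tier)
    issued = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Only archived objects need (or accept) a restore request.
            if obj.get("StorageClass") in {"GLACIER", "DEEP_ARCHIVE"}:
                s3.restore_object(Bucket=bucket, Key=obj["Key"], RestoreRequest=request)
                issued += 1
    return issued
```

Restores are asynchronous, so the files only become downloadable some time after the request, and re-requesting an object whose restore is still in progress raises an error.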
I ran into an xarray bug where the time_bnds variable from the CESM model output prevented xarray from appropriately concatenating all files. See issue here.
Moving to the Zarr architecture was awesome. It enabled us to perform realtime visualization via Holoviews and a Panel webapp. Lightning-fast parallel IO and thoughtful visualization tools were lightyears ahead of legacy methods – without Pangeo, to host the webapp we would have had to hand-extract each climate variable using bash scripts of nco-tools calls run on a supercomputer.

Hope this helps and happy hacking!

rabernat · November 18, 2020, 7:17pm

Thanks so much for sharing @playertr! This is a fantastic success story!

Is any of your data public? Could you share any examples of the code you used to generate your datasets or analyze the resulting Zarr data?

playertr · November 18, 2020, 7:44pm

I’m afraid that some of my code and documentation may expose implementation details that could be a security concern – I’ll try to spend some time going through it and seeing if there’s any I could share.

RichardScottOZ · November 19, 2020, 6:41am

Yes, would definitely like to see some of that.
