Low-budget Cloud Architecture for CESM Ensemble Analysis


I am an intern at Ice911 Research tasked with visualizing CESM model output for a nine-member ensemble of climate simulations, in a project investigating the impact of localized surface albedo modification. Thanks to several Medium articles by @jhamman and others in the Pangeo ecosystem, I discovered that Pangeo’s zarr-xarray-dask pipeline for cloud-based analysis is very well suited to this task. I’ve enjoyed the Pangeo community’s approachable documentation, and I would appreciate people’s input on two design considerations for how Ice911, a 501(c)(3) nonprofit, can move its model analysis to Pangeo while minimizing cost.

  1. Upgrading to a Cluster: We are able to produce visualizations on a single AWS EC2 instance, but most Pangeo deployments use Kubernetes clusters. Is it worth it to upgrade to a cluster?
    • 1.1 My budget for computing services is around $50/month. I find AWS’s pricing models hard to parse, so I can’t tell whether we can deploy a cluster on this budget. Does anyone with AWS and Kubernetes experience have a sense of whether a cluster is feasible, and worthwhile, at this price?
    • 1.2 Do folks recommend applying for a Pangeo Cloud account? This solution seems cheaper and easier, but its future seems uncertain.
  2. Storing and Accessing Data in the Cloud: The variables of interest in the CESM 1.2 model output are stored in monthly time-slice .nc files in S3 Glacier, in different “folders” for each ensemble member. What is the best way to make these files accessible in the cloud?
    • 2.1 It may be convenient to download each ensemble member’s output, convert it to a Zarr datastore, and upload that store to an S3 bucket, but then each ensemble member would have a separate datastore. Is there a more convenient way, perhaps by leveraging intake-esm, to make the raw model output accessible?
    • 2.2 A compressed and intelligently chunked Zarr datastore would minimize what we pay for Standard S3 storage, but I’m wary of compromising data integrity through lossy compression. Is this a valid concern?

Thank you for your input and for hosting such a welcoming community. If these questions are easily answered by documentation or other posts, please let me know. I’m asking here because they concern design tradeoffs that may matter to other users as well.

