What I would like to do: My plan is to study future changes in the dependence between atmospheric variables simulated by CMIP6 models, such as wind speed (‘sfcWind’) and precipitation (‘pr’). I would like to analyze these variables at daily mean frequency, at a predefined list of approximately 200 tide gauges, for a set of 15 CMIP6 models, including multiple initial-condition variants of each model where available. The steps I think this would involve are:
- Extract daily mean wind speed and precipitation at the grid cells nearest to the 200 tide gauge locations, i.e., reduce the data from (lon,lat,time) to (tg_location,time) for each simulation.
- Perform extremes analysis on the reduced time series in rolling windows of, say, 40 years, and compute dependence metrics in each of these windows.
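For concreteness, the two steps above can be sketched with xarray's vectorized ("pointwise") nearest-neighbour indexing. This is a minimal illustration on synthetic data: the three gauge coordinates are made up, Pearson correlation of annual maxima stands in for whatever dependence metric is actually used, and the window is shortened to 10 years so the example runs quickly:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily gridded fields standing in for one CMIP6 simulation
time = pd.date_range("1990-01-01", "2019-12-31", freq="D")
lat = np.arange(-90.0, 91.0, 10.0)
lon = np.arange(0.0, 360.0, 10.0)
rng = np.random.default_rng(0)
shape = (time.size, lat.size, lon.size)
ds = xr.Dataset(
    {
        "sfcWind": (("time", "lat", "lon"), rng.gamma(2.0, 3.0, shape)),
        "pr": (("time", "lat", "lon"), rng.gamma(1.5, 2e-5, shape)),
    },
    coords={"time": time, "lat": lat, "lon": lon},
)

# Hypothetical tide gauge coordinates (in reality ~200 of them);
# passing DataArrays sharing a "tg_location" dim selects pointwise
tg_lat = xr.DataArray([51.4, 52.0, 40.6], dims="tg_location")
tg_lon = xr.DataArray([356.2, 4.7, 288.9], dims="tg_location")

# Step 1: nearest-neighbour extraction reduces
# (time, lat, lon) -> (time, tg_location)
reduced = ds.sel(lat=tg_lat, lon=tg_lon, method="nearest")

# Step 2: a placeholder dependence metric -- Pearson correlation of
# annual maxima -- in rolling windows (10 years here, 40 in the real
# analysis); .construct() materializes each window along a new dim
ann_max = reduced.resample(time="YS").max()
win = 10
wind_win = ann_max["sfcWind"].rolling(time=win).construct("window")
pr_win = ann_max["pr"].rolling(time=win).construct("window")
corr = xr.corr(wind_win, pr_win, dim="window")  # (time, tg_location)
```

Applied lazily to a dask-backed dataset opened from the cloud, the same `.sel` call defers the reduction until `.compute()`, so only the reduced result ever needs to reach local storage.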
Question: I would appreciate some advice on a good workflow for this analysis. My naive approach would be to download the daily mean gridded fields for each model, simulation, and variant to an HPC system and perform the analysis on the downloaded data. However, this would amount to downloading approximately 14 TB, which I would like to avoid if possible, since I only need information at the tide gauge locations rather than the full global gridded fields.

Would Pangeo Cloud (e.g., on the Google Cloud Platform) be better suited for such an analysis? Would it be possible to use Pangeo Cloud to extract, for instance, time series of CMIP6 variables at a limited list of coordinates (i.e., to go from (lon,lat,time) to (tg_location,time)) and download only the reduced dataset, or even only the output of the extremes analysis performed on that reduced dataset?

I have tried to run some examples from the CMIP6 gallery that use intake-esm to fetch CMIP6 data from Google Cloud on my own laptop, but even after lazily selecting the time series for just one grid cell, loading it into memory took very long. Would this be faster on the cloud? Thanks!
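On the laptop slowness: this is likely a consequence of how the cloud Zarr stores are chunked rather than of lazy selection failing. The CMIP6 stores are typically chunked along time with each chunk spanning the full horizontal grid, so pulling a single grid cell still transfers whole spatial chunks over the network; running next to the data removes that transfer. A small dask sketch (a synthetic array, not the real store, and the chunk shape is an assumption) illustrates the effect:

```python
import dask.array as da

# Stand-in for one lazily opened CMIP6 variable with dims
# (time, lat, lon), chunked along time only, so that each chunk
# spans the full horizontal grid
field = da.zeros((36500, 180, 360), chunks=(365, 180, 360))

# Selecting a single grid cell is lazy and looks cheap...
point = field[:, 90, 180]

# ...but every task behind `point` must still read a full
# (365, 180, 360) chunk before discarding all but one cell, so the
# whole spatial field crosses the network on .compute() from a laptop
chunk_mb = 365 * 180 * 360 * 8 / 1e6  # ~189 MB per float64 chunk
```

Running the same selection on a cluster colocated with the data turns those chunk reads into fast in-region object-store traffic, and dask can process many chunks in parallel.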