Hey everyone! I need someone to guide me and share expertise on how to manipulate and work with NetCDF files. To give you more context:
I am dealing with 10,000+ NetCDF files (1994-2024) from NOAA OISST v2.1 hosted on Google Cloud Storage (noaa-cdr-sea-surface-temp-optimum-interpolation/data/v2.1/avhrr).
There are 10,957 files at roughly 1.6 MB each, for a total of about 17.5 GB.
I am working on the Google Colab environment.
I only need to filter data within certain coordinates.
I have been experimenting, but it seems the calibration is lost when I export the data, since the metadata in the NetCDF file is excluded.
What is the best method to collate these 10K+ filtered files into a single file?
I will use the collated and filtered data to train DL models.
Shout out to Jack Kelly from Climate Change AI for sharing this forum.
Try to open all of these files (or a subset to start) with Xarray’s open_mfdataset. This can be difficult with a large number of files (a rough sketch of this route, combined with the Zarr option below, follows after this list).
Create a ‘virtual reference dataset’ of all of these files with Kerchunk (guide here). This can be quite a lot of work up front, but then you will have ‘Zarr-like access’.
Create a copy of the dataset as a Zarr store with Xarray+Dask or with some other tools. Quite a lot of work and you need a place to store a copy of the dataset, but you can choose the chunking schema and you will have a performant dataset. https://guide.cloudnativegeo.org/
Others might have tips or a link to an already existing Zarr copy of NOAA OISST.
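If it helps, here is a rough, untested sketch of the open_mfdataset route combined with writing a Zarr copy, reading straight from the public GCS bucket. The glob pattern, the h5netcdf engine, the bounding box and the chunk sizes are all assumptions you would need to adapt, and gcsfs, h5netcdf, dask and zarr need to be installed in Colab first.

import gcsfs
import xarray as xr

# Anonymous access to the public NOAA bucket
fs = gcsfs.GCSFileSystem(token="anon")

# Start with a single year to check memory and timing before scaling up.
# The pattern is a guess at the layout (YYYYMM subfolders of daily files).
paths = fs.glob("noaa-cdr-sea-surface-temp-optimum-interpolation/data/v2.1/avhrr/1994*/*.nc")
files = [fs.open(p) for p in paths]

# Open lazily with Dask chunks so nothing is loaded until it is needed
ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",     # reads netCDF4/HDF5 from file-like objects
    combine="by_coords",
    parallel=True,
    chunks={"time": 31},
)

# Bounding-box filter (note: OISST longitudes run 0-360, not -180 to 180)
subset = ds[["sst"]].sel(lat=slice(-10, 10), lon=slice(120, 160))

# Choose a chunking scheme and write one Zarr store to reuse for DL training
subset = subset.chunk({"time": 365, "lat": 64, "lon": 64})
subset.to_zarr("oisst_subset.zarr", mode="w")

For the full 1994-2024 range you could loop over the years and append to the same store with to_zarr(..., append_dim="time"), which keeps memory use bounded.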
I know this sort of goes against the recommended approach of being cloud-native here but for data of this size, it’s relatively straightforward to download & do this locally.
Here’s a bash script to download & merge the files by year, so that you have one file per year (~31 files) to work with, which would be much more manageable to open with Xarray’s open_mfdataset function.
#! /bin/bash
# Download data from Google into folder 'avhrr'
gsutil -m cp -r "gs://noaa-cdr-sea-surface-temp-optimum-interpolation/data/v2.1/avhrr" .
# Merge each year into a single NetCDF file
for y in {1994..2024}; do
cdo --sortname -z zstd1 mergetime avhrr/$y*/*.nc $y.nc
done
# (Optional) Merge into a single file
cdo -z zstd1 mergetime *.nc avhrr.nc
# (Optional) Chunk for faster time-series access if looking at specific locations.
nccopy -c lat/128,lon/128,time/365 avhrr.nc avhrr_c.nc
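As a quick sanity check of the rechunked file, pulling the full time series at a single grid point should now only touch a handful of chunks. A minimal sketch in Python (the coordinates are arbitrary, and I’m assuming the variable is called sst with a length-1 zlev dimension, as in the source files):

import xarray as xr

ds = xr.open_dataset("avhrr_c.nc")

# Nearest grid point to an arbitrary location (OISST longitudes are 0-360)
point = ds["sst"].sel(lat=25.125, lon=200.125, method="nearest")
point = point.squeeze("zlev", drop=True)

# Values come out in degC because xarray applies the scale_factor/add_offset
# stored in the file's metadata by default
point.to_dataframe().to_csv("sst_point_timeseries.csv")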
I’m not too clear what you mean by ‘calibration is lost’ though?
@josephyang I wanted to do it in the cloud, especially for the training later on, and to keep everything in one place without switching environments, but I will consider doing it locally.
I also shared my concern in the CCAI community and got suggestions to use CDO. I asked a generative model whether CDO can be used in Google Colab and it said yes, but I still have to verify that.
Regarding the specific location, does this only work for a bounding box, or can I also specify an arbitrary shape?
On the ‘calibration is lost’ point: I experimented by filtering the SST values within a bounding box, storing them in a NumPy array, and eventually saving that to a CSV. When I plotted it again, the SST values were different, because the calibration, which lives in the metadata of the netCDF file, was lost.
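To illustrate, here is a rough sketch of the difference I’m seeing (the file name and bounding box are just placeholders): with xarray’s default decoding the values come out as physical degC, while reading with decoding switched off gives the packed integers, which look like what ended up in my CSV.

import xarray as xr

path = "oisst-avhrr-v02r01.20240101.nc"              # placeholder file name
box = dict(lat=slice(-10, 10), lon=slice(120, 160))  # placeholder bounding box

decoded = xr.open_dataset(path)                      # scale_factor/add_offset applied
raw = xr.open_dataset(path, mask_and_scale=False)    # stored integers, no 'calibration'

print(decoded["sst"].sel(**box).values.ravel()[:5])  # physical degC values
print(raw["sst"].sel(**box).values.ravel()[:5])      # raw packed values

# Exporting via xarray (instead of a bare NumPy array) keeps the decoded values
decoded["sst"].sel(**box).to_dataframe().to_csv("sst_box.csv")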