Need guidance on manipulating NetCDF files

Hey everyone! I need someone to guide me and share expertise on how to manipulate and work with NetCDF files. To give you more context:

  • I am dealing with 10,000+ NetCDF files (1994-2024) from NOAA OISST v2.1 hosted on Google Cloud Storage (noaa-cdr-sea-surface-temp-optimum-interpolation/data/v2.1/avhrr).
  • There are 10,957 files at roughly 1.6 MB each, for a total of about 17.5 GB.
  • I am working on the Google Colab environment.
  • I only need to filter data within certain coordinates.
  • I have been experimenting, but it seems the calibration is lost when I export the data, since the metadata in the NetCDF files is excluded.
  • What is the best method to collate the filtered data from these 10K+ files into one file?
  • I will use the collated and filtered data to train DL models.

Shout out to Jack Kelly from Climate Change AI for sharing this forum.


Hey there @isa, welcome to the Pangeo Forum :wave:

A few ideas for getting all of these data into a single Xarray dataset:

  1. If it’s helpful as a starting point, there is a Kerchunk'ed version of NOAA OISST AVHRR stored on the Pangeo OSN pod. You can access it with:

! pip install kerchunk xarray s3fs

import xarray as xr

# Open the virtual (Kerchunk reference) dataset straight from the OSN pod
ds = xr.open_dataset(
    'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/pangeo-forge/aws-noaa-oisst-feedstock/aws-noaa-oisst-avhrr-only.zarr/reference.json',
    engine='kerchunk',
)

  2. Try to open all of these files (or a subset to start) with Xarray’s open_mfdataset (see the sketch after this list). This can be difficult with a large number of files.

  3. Create a ‘virtual reference dataset’ of all of these files with Kerchunk (guide here). This can be quite a lot of work up front, but then you will have ‘zarr-like access’.

  4. Create a copy of the dataset as a Zarr store with Xarray+Dask or with some other tools. Quite a lot of work, and you need a place to store a copy of the dataset, but you can choose the chunking schema and you will have a performant dataset. https://guide.cloudnativegeo.org/
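
To make options 2 and 4 a bit more concrete, here is a minimal sketch rather than a drop-in solution: the example month, the lat/lon bounds, the chunk sizes, and the output path are all placeholders you would adjust, and you would want to scale the file listing up gradually. It opens a batch of files from the GCS bucket with open_mfdataset, subsets a bounding box during preprocessing, and writes the result to a single Zarr store.

import xarray as xr
import gcsfs

# Hypothetical bounding box -- replace with your region (OISST longitudes run 0-360).
LAT_SLICE = slice(-10, 10)
LON_SLICE = slice(120, 160)

# Anonymous access to the public bucket; one month listed here as an example.
fs = gcsfs.GCSFileSystem(token="anon")
paths = fs.glob(
    "noaa-cdr-sea-surface-temp-optimum-interpolation/data/v2.1/avhrr/199401/*.nc"
)

def subset(ds):
    # Applied to each file before concatenation, so only the region of interest is kept.
    return ds.sel(lat=LAT_SLICE, lon=LON_SLICE)

ds = xr.open_mfdataset(
    [fs.open(p) for p in paths],  # file-like objects need the h5netcdf engine (pip install h5netcdf)
    engine="h5netcdf",
    combine="by_coords",
    preprocess=subset,
)

# Option 4: persist the filtered data as one Zarr store with chunking you choose.
ds = ds.chunk({"time": 365, "lat": -1, "lon": -1})
ds.to_zarr("oisst_subset.zarr", mode="w")

Note that xarray applies scale_factor/add_offset by default when reading, so anything you write out this way is already in physical units, which should also take care of the ‘calibration’ worry.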

Others might have tips or a link to an already existing Zarr copy of NOAA OISST.


I know this sort of goes against the recommended cloud-native approach here, but for data of this size it’s relatively straightforward to download the files and work with them locally.

Here’s a bash script to download & merge the files by year so that you end up with 30-40 files, which is much more manageable to open with Xarray’s open_mfdataset function.

When doing things locally, the cdo tool (Overview - CDO - Project Management Service) is quite handy; it’s what I use below.

#!/bin/bash

# Download data from Google into folder 'avhrr'
gsutil -m cp -r "gs://noaa-cdr-sea-surface-temp-optimum-interpolation/data/v2.1/avhrr" .

# Merge each year into a single NetCDF file
for y in {1994..2024}; do
   cdo --sortname -z zstd1 mergetime "avhrr/${y}"*/*.nc "${y}.nc"
done

# (Optional) Merge into a single file
cdo -z zstd1 mergetime *.nc avhrr.nc

# (Optional) Chunk for faster time-series access if looking at specific locations.
nccopy -c lat/128,lon/128,time/365 avhrr.nc avhrr_c.nc
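
If you go this route, the yearly files are small enough to open directly with Xarray afterwards. A minimal sketch, assuming the file names produced by the script above and placeholder coordinates:

import xarray as xr

# Open the ~30 per-year files written by the script above.
ds = xr.open_mfdataset("????.nc", combine="by_coords")

# Bounding-box subset (placeholder bounds; OISST longitudes run 0-360).
box = ds.sel(lat=slice(-10, 10), lon=slice(120, 160))

# Or a single-point time series, which the chunking step above is meant to speed up.
ts = ds["sst"].sel(lat=0.125, lon=140.125, method="nearest")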

I’m not too clear what you mean by ‘calibration is lost’ though?


@norlandrhagen Thank you for this! Will be trying your suggestions out.


@josephyang I wanted to do it in the cloud, especially for training later on; I wanted to keep everything in one place with no switching of environments, but I will consider doing it locally.

I also shared my concern in the CCAI community and got suggestions to use CDO. I asked a generative model whether CDO can be used in Google Colab and it said yes, but I still have to verify that.

Regarding the specific location, is it only for a bounding box, or can I specify a specific shape?

On the ‘calibration is lost’ point, here is what I experimented with: I filtered the SST values within a bounding box, stored them in a NumPy array, and eventually saved that to a CSV. When I plotted it again, the SST values were different because the calibration, which lives in the metadata of the NetCDF file, was lost.
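
For illustration, a minimal sketch of the difference, assuming the SST variable is stored as packed integers with scale_factor/add_offset attributes (the file name and bounds below are placeholders):

import xarray as xr

# xarray applies scale_factor/add_offset (and the fill value) by default,
# so values read this way are already in degrees Celsius.
ds = xr.open_dataset("oisst-avhrr-v02r01.19940101.nc")  # hypothetical local file name
sst = ds["sst"].sel(lat=slice(-10, 10), lon=slice(120, 160)).squeeze()
sst.to_dataframe().to_csv("sst_subset.csv")  # exported values stay physically meaningful

# If the raw packed integers are read instead (decoding switched off), they have to be
# unpacked by hand, and fill values masked, before saving.
raw = xr.open_dataset("oisst-avhrr-v02r01.19940101.nc", mask_and_scale=False)["sst"]
unpacked = raw * raw.attrs.get("scale_factor", 1.0) + raw.attrs.get("add_offset", 0.0)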