Hey everyone! I need someone to guide me and share expertise on how to manipulate and work with NetCDF files. To give you more context:
I am dealing with 10,000+ NetCDF files (1994-2024) from NOAA OISST v2.1 hosted on Google Cloud Storage (noaa-cdr-sea-surface-temp-optimum-interpolation/data/v2.1/avhrr).
There are 10,957 files at roughly 1.6 MB each, for a total of about 17.5 GB.
I am working on the Google Colab environment.
I only need to filter data within certain coordinates.
I have been experimenting, but it seems the calibration is lost when I export the data, since the metadata in the NetCDF file is excluded.
What is the best method to collate these 10K+ filtered files into a single file?
I will use the collated and filtered data to train DL models.
Shout out to Jack Kelly from Climate Change AI for sharing this forum.
Try to open all of these files (or a subset to start) with Xarray’s open_mfdataset. This can be difficult with a large number of files (a rough sketch of this route, combined with the Zarr option below, follows after this list).
Create a ‘virtual reference dataset’ of all of these files with Kerchunk (guide here). This can be quite a lot of work up front, but then you will have ‘Zarr-like access’.
Create a copy of the dataset as a Zarr store with Xarray+Dask or with some other tools. Quite a lot of work and you need a place to store a copy of the dataset, but you can choose the chunking schema and you will have a performant dataset. https://guide.cloudnativegeo.org/
Others might have tips or a link to an already existing Zarr copy of NOAA OISST.
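If it helps, here is a rough, untested sketch of the open_mfdataset route combined with writing a Zarr copy, reading straight from the public GCS bucket. The glob pattern, the h5netcdf engine, the bounding box and the chunk sizes are all assumptions you would need to adapt, and gcsfs, h5netcdf, dask and zarr need to be installed in Colab first.

import gcsfs
import xarray as xr

# Anonymous access to the public NOAA bucket
fs = gcsfs.GCSFileSystem(token="anon")

# Start with a single year to check memory and timing before scaling up.
# The pattern is a guess at the layout (YYYYMM subfolders of daily files).
paths = fs.glob("noaa-cdr-sea-surface-temp-optimum-interpolation/data/v2.1/avhrr/1994*/*.nc")
files = [fs.open(p) for p in paths]

# Open lazily with Dask chunks so nothing is loaded until it is needed
ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",     # reads netCDF4/HDF5 from file-like objects
    combine="by_coords",
    parallel=True,
    chunks={"time": 31},
)

# Bounding-box filter (note: OISST longitudes run 0-360, not -180 to 180)
subset = ds[["sst"]].sel(lat=slice(-10, 10), lon=slice(120, 160))

# Choose a chunking scheme and write one Zarr store to reuse for DL training
subset = subset.chunk({"time": 365, "lat": 64, "lon": 64})
subset.to_zarr("oisst_subset.zarr", mode="w")

For the full 1994-2024 range you could loop over the years and append to the same store with to_zarr(..., append_dim="time"), which keeps memory use bounded.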
I know this sort of goes against the recommended approach of being cloud-native here but for data of this size, it’s relatively straightforward to download & do this locally.
Here’s a bash script to download & merge the files by year, so that you have one file per year (~31 files) to work with, which would be much more manageable to open with Xarray’s open_mfdataset function.
#! /bin/bash
# Download data from Google into folder 'avhrr'
gsutil -m cp -r "gs://noaa-cdr-sea-surface-temp-optimum-interpolation/data/v2.1/avhrr" .
# Merge each year into a single NetCDF file
for y in {1994..2024}; do
cdo --sortname -z zstd1 mergetime avhrr/$y*/*.nc $y.nc
done
# (Optional) Merge into a single file
cdo -z zstd1 mergetime *.nc avhrr.nc
# (Optional) Chunk for faster time-series access if looking at specific locations.
nccopy -c lat/128,lon/128,time/365 avhrr.nc avhrr_c.nc
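As a quick sanity check of the rechunked file, pulling the full time series at a single grid point should now only touch a handful of chunks. A minimal sketch in Python (the coordinates are arbitrary, and I’m assuming the variable is called sst with a length-1 zlev dimension, as in the source files):

import xarray as xr

ds = xr.open_dataset("avhrr_c.nc")

# Nearest grid point to an arbitrary location (OISST longitudes are 0-360)
point = ds["sst"].sel(lat=25.125, lon=200.125, method="nearest")
point = point.squeeze("zlev", drop=True)

# Values come out in degC because xarray applies the scale_factor/add_offset
# stored in the file's metadata by default
point.to_dataframe().to_csv("sst_point_timeseries.csv")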
I’m not too clear what you mean by ‘calibration is lost’ though?
@josephyang I wanted to do it in the cloud, especially for the training later on, and to keep everything in one place without switching environments, but I will consider doing it locally.
I also shared my concern in the CCAI community and got suggestions to use CDO. I asked a generative model whether CDO can be used in Google Colab and it said yes, but I still have to verify that.
Regarding the specific location, does this only work for a bounding box, or can I also specify an arbitrary shape?
On the ‘calibration is lost’ point: I experimented by filtering the SST values within a bounding box, storing them in a NumPy array, and eventually saving that to a CSV. When I plotted it again, the SST values were different, because the calibration, which lives in the metadata of the netCDF file, was lost.
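To illustrate, here is a rough sketch of the difference I’m seeing (the file name and bounding box are just placeholders): with xarray’s default decoding the values come out as physical degC, while reading with decoding switched off gives the packed integers, which look like what ended up in my CSV.

import xarray as xr

path = "oisst-avhrr-v02r01.20240101.nc"              # placeholder file name
box = dict(lat=slice(-10, 10), lon=slice(120, 160))  # placeholder bounding box

decoded = xr.open_dataset(path)                      # scale_factor/add_offset applied
raw = xr.open_dataset(path, mask_and_scale=False)    # stored integers, no 'calibration'

print(decoded["sst"].sel(**box).values.ravel()[:5])  # physical degC values
print(raw["sst"].sel(**box).values.ravel()[:5])      # raw packed values

# Exporting via xarray (instead of a bare NumPy array) keeps the decoded values
decoded["sst"].sel(**box).to_dataframe().to_csv("sst_box.csv")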