Using grib2 files with `open_mfdataset`: is there a better workflow than converting to netcdf?

I’m working on a project using a particular dataset where each time step is stored as a separate file. I want all the time steps!

The files are stored as .grib2 files, which is cool because they don't take up much space. A typical snapshot (this is 1 km radar precipitation data for most of the Continental US) is about 850 KB. But I can't work with lots of .grib2 files at once using tools like open_mfdataset and friends.

At the moment my workflow is to go through each file and (1) download the remote .grib2 file (it's actually zipped, so I unzip it), (2) read it into xarray using xr.open_dataarray(grib2_fname, engine="cfgrib"), (3) do some subsetting, and (4) save the resulting .nc file to disk (a rough sketch follows the list below). This works but has two downsides:

  1. Reading in each grib2 file takes a long time – on the order of a couple of seconds per file. This adds up.
  2. The resulting .nc files are much larger – almost 100MB per snapshot (ie, ~100x bigger than the grib2 files)
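
For concreteness, here's a rough sketch of that per-file loop (the URL handling, coordinate names, and bounding box are placeholders; my actual code, linked below, also deals with file naming conventions):

```python
import gzip
import shutil
import urllib.request

import xarray as xr


def process_snapshot(url: str, grib2_fname: str, nc_fname: str) -> None:
    # (1) download the remote, gzipped .grib2 file and unzip it
    gz_fname = grib2_fname + ".gz"
    urllib.request.urlretrieve(url, gz_fname)
    with gzip.open(gz_fname, "rb") as src, open(grib2_fname, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # (2) read it into xarray with cfgrib
    da = xr.open_dataarray(grib2_fname, engine="cfgrib")

    # (3) do some subsetting (placeholder bounding box and coordinate names)
    da = da.sel(latitude=slice(50, 25), longitude=slice(-110, -70))

    # (4) save the resulting .nc file to disk
    da.to_netcdf(nc_fname)
```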

I have (work in progress!) code that implements this (and handles file naming conventions and some other stuff, probably still badly): GitHub - dossgollin-lab/nexrad-xarray at package

Is there a better way to build a large and easy-to-handle (ie, open_mfdataset-friendly) dataset from these highly compressed grib files?


@jdossgollin So I’m simultaneously working on the exact same problem here. In general, instead of using open_mfdataset, I’ve gone a bit lower level and use dask.delayed with a custom function to open each grib file, pull out the messages I want, and then concat at a later step into a Dataset - so conceptually identical to what open_mfdataset does under the hood, but I directly manage the file access.
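
Roughly, the pattern looks like the sketch below (a simplified illustration rather than my actual code: the glob pattern, the cfgrib filter keys, and the time concat dimension are all assumptions to adapt to your data):

```python
import glob

import dask
import xarray as xr


@dask.delayed
def open_one(grib_fname: str) -> xr.Dataset:
    # Open a single GRIB file and pull out only the messages of interest;
    # filter_by_keys here is just an example -- adjust it for your data.
    ds = xr.open_dataset(
        grib_fname,
        engine="cfgrib",
        backend_kwargs={"filter_by_keys": {"typeOfLevel": "surface"}},
    )
    return ds.load()


grib_files = sorted(glob.glob("data/*.grib2"))
datasets = dask.compute(*[open_one(f) for f in grib_files])

# Concatenate the per-file Datasets along time, then write a Zarr store
# that is cheap to open lazily later on.
combined = xr.concat(datasets, dim="time")
combined.to_zarr("archive.zarr", mode="w")
```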

Would definitely refer you to the nascent pangeo-forge discussion at #387 - I’m keen to build a pangeo-forge recipe that automates all of this to produce a final Zarr archive that is easier to read from. Would love to collaborate!

  1. The resulting .nc files are much larger – almost 100MB per snapshot (ie, ~100x bigger than the grib2 files)

GRIB data is very highly compressed. It’s not surprising that the final files are so much bigger. Are you using any compression on the resulting NetCDF files? In general I don’t bother with compressing them if I’m just using them as an intermediate for other processes since short-term storage is cheaper than the CPU cost for many of my workflows. But if you intend to retain them, you should definitely apply at least deflate level 1 compression.
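
For example, with xarray you can request deflate level 1 through the encoding argument of to_netcdf (file and variable names here are placeholders):

```python
import xarray as xr

ds = xr.open_dataset("snapshot.grib2", engine="cfgrib")

# Apply deflate (zlib) compression, level 1, to every data variable on write.
encoding = {var: {"zlib": True, "complevel": 1} for var in ds.data_vars}
ds.to_netcdf("snapshot.nc", encoding=encoding)
```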


Very cool, thanks for the detailed response, and it's great to know that others are working on related challenges.

Thanks for the suggestion on compressing the NetCDF files. I’d naively assumed that the files were automatically compressed, but obviously that’s not the case – a little playing around with the encoding parameter in xr.DataArray.to_netcdf got my file size down by well over an order of magnitude, which is fantastic.

I will try to follow your project. Ultimately, getting this (NOAA-created and University of Iowa-hosted) dataset onto the cloud would be a really cool enabling technology. And since so much meteorological data is stored in .grib2 format, developing a general solution makes tons of sense. Thanks!

Not sure if this would be helpful, but we do this fairly regularly and have found CDO (Overview - CDO - Project Management Service) to be quite a handy CLI tool for it. Here's an example that selects a single parameter (surface pressure) from a bunch of GRIB files and saves it into a single file (with options to create a NetCDF4 file and apply some compression).

cdo -f nc4 -z zip select,name=sp *.grb sp.nc

Part of the reason we do this is that the CDO tools use very little memory and tend to be reasonably fast. Once we've converted to NetCDF, we then use Xarray for further processing.
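
To tie this back to the per-time-step case, here's a minimal sketch of driving CDO from Python and then opening the converted files together (the paths are placeholders, and cdo must be on your PATH):

```python
import glob
import subprocess

import xarray as xr

# Convert each GRIB2 snapshot to a compressed NetCDF4 file with CDO.
for grib_path in sorted(glob.glob("data/*.grib2")):
    nc_path = grib_path.replace(".grib2", ".nc")
    subprocess.run(
        ["cdo", "-f", "nc4", "-z", "zip", "copy", grib_path, nc_path],
        check=True,
    )

# The converted files are now open_mfdataset-friendly.
ds = xr.open_mfdataset("data/*.nc", combine="by_coords")
```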


Wow, I just tried it on one of my files and cdo is way, way faster than reading in with xarray and cfgrib and then writing to netcdf!
