I’m working on a project using a particular dataset where each time step is stored as a separate file. I want all the time steps!
The files are stored as `.grib2` files, which is cool because they don't take up a lot of space. A typical snapshot (this is 1 km radar precipitation data for most of the continental US) is about 850 KB. But I can't work with lots of `.grib2` files at once using tools like `open_mfdataset` and friends.
At the moment my workflow is to go through each file and (1) download the remote `.grib2` file (it's actually zipped, so I unzip it), (2) read it into xarray using `xr.open_dataarray(grib2_fname, engine="cfgrib")`, (3) do some subsetting, and (4) save the resulting `.nc` file to disk (rough sketch below). This works but has two downsides:
- Reading in the grib2 file takes a long time – on the order of a couple of seconds each, which adds up.
- The resulting `.nc` files are much larger – almost 100 MB per snapshot (i.e., ~100x bigger than the grib2 files).
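For concreteness, here is a rough sketch of what one iteration of that loop looks like. The download URL, the gzip assumption, and the bounding box used for subsetting are placeholders, not the real dataset layout:

```python
import gzip
import shutil
import urllib.request

import xarray as xr


def process_snapshot(url: str, grib2_fname: str, nc_fname: str) -> None:
    """Download one zipped GRIB2 snapshot, subset it, and save it as NetCDF."""
    # (1) download the remote file (assumed here to be gzip-compressed)
    gz_fname = grib2_fname + ".gz"
    urllib.request.urlretrieve(url, gz_fname)
    with gzip.open(gz_fname, "rb") as f_in, open(grib2_fname, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

    # (2) read it into xarray with the cfgrib engine
    da = xr.open_dataarray(grib2_fname, engine="cfgrib")

    # (3) subset -- placeholder bounding box
    da = da.sel(latitude=slice(40, 30), longitude=slice(-100, -90))

    # (4) write the result to disk
    da.to_netcdf(nc_fname)
```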
I have (work-in-progress!) code that implements this (and handles file naming conventions and some other stuff, probably still badly): dossgollin-lab/nexrad-xarray on GitHub (`package` branch).
Is there a better way to build a large and easy-to-handle (i.e., `open_mfdataset`-friendly) dataset from these highly compressed GRIB files?
@jdossgollin So I'm simultaneously working on the exact same problem here. In general, instead of using `open_mfdataset`, I've gone a bit lower level and use `dask.delayed` with a custom function to open each GRIB file, pull out the messages I want, and then concat at a later step into a `Dataset` – conceptually identical to what `open_mfdataset` does under the hood, but I directly manage the file access.
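A minimal sketch of that pattern, assuming each file holds a single named variable that can be concatenated along a new time dimension (the glob pattern is a placeholder):

```python
import glob

import dask
import xarray as xr


@dask.delayed
def open_one(grib2_fname):
    """Open one GRIB2 file with cfgrib and load the variable into memory."""
    da = xr.open_dataarray(grib2_fname, engine="cfgrib")
    return da.load()  # materialize so the file handle can be released


files = sorted(glob.glob("*.grib2"))  # placeholder file list
lazy = [open_one(f) for f in files]   # nothing is read yet
arrays = dask.compute(*lazy)          # open the files in parallel
ds = xr.concat(arrays, dim="time").to_dataset()
```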
Would definitely refer you to the nascent pangeo-forge discussion at #387 - I’m keen to build a pangeo-forge recipe that automates all of this to produce a final Zarr archive that is easier to read from. Would love to collaborate!
> The resulting `.nc` files are much larger – almost 100 MB per snapshot (i.e., ~100x bigger than the grib2 files).
GRIB data is very highly compressed. It’s not surprising that the final files are so much bigger. Are you using any compression on the resulting NetCDF files? In general I don’t bother with compressing them if I’m just using them as an intermediate for other processes since short-term storage is cheaper than the CPU cost for many of my workflows. But if you intend to retain them, you should definitely apply at least deflate level 1 compression.
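For example, deflate (zlib) level-1 compression can be requested through the `encoding` argument when writing; the file names below are just placeholders:

```python
import xarray as xr

da = xr.open_dataarray("snapshot.grib2", engine="cfgrib")  # placeholder file name
da.to_netcdf(
    "snapshot.nc",
    encoding={da.name: {"zlib": True, "complevel": 1}},  # deflate level 1
)
```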
Very cool, thanks for the detailed response and great to know that others are working on related challenges.
Thanks for the suggestion on compressing the NetCDF files. I'd naively assumed that the files were automatically compressed, but obviously that's not the case – a little playing around with the `encoding` parameter in `xr.DataArray.to_netcdf` got my file size down by well over an order of magnitude, which is fantastic.
I will try to follow your project. Ultimately, getting this (NOAA-created and University of Iowa-hosted) dataset onto the cloud would be a really cool enabling technology. And since so much meteorological data is stored in `.grib2` format, developing a general solution makes tons of sense (thanks!).
Not sure if this would be helpful, but we do this fairly regularly and have found the CDO tool (Overview - CDO - Project Management Service) to be quite a handy CLI tool for this. Here's an example of selecting a single parameter (surface pressure) from a bunch of GRIB files and saving it into a single file (with options specified to create a NetCDF4 file and also apply some compression):
cdo -f nc4 -z zip select,name=sp *.grb sp.nc
Part of the reason we do this is that the CDO tools use very little memory and tend to be reasonably fast. Once we've converted to NetCDF, we then use Xarray for further processing.
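If it helps to drive this from Python, one way (a sketch, reusing the `sp` variable name and file patterns from the command above as placeholders) is to shell out to cdo and then open the result with xarray:

```python
import glob
import subprocess

import xarray as xr

grib_files = sorted(glob.glob("*.grb"))  # expand the glob ourselves, no shell needed
subprocess.run(
    ["cdo", "-f", "nc4", "-z", "zip", "select,name=sp", *grib_files, "sp.nc"],
    check=True,
)
ds = xr.open_dataset("sp.nc")
```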
Wow, just tried it for one of my files and `cdo` is way, way, way faster than reading in with `xarray` and `cfgrib` then writing to NetCDF!