Formatting Radio Occultation Data for the Cloud!

Hello all!

I’m the lead engineer on a data processing grant that AER has received through NASA ACCESS. We are looking to move Radio Occultation (RO) data from various sources such as UCAR, ROM SAF, and JPL into an Open Data S3 bucket. The first line of effort will be to transfer these NetCDF files in a standardized format, but in the second line of effort we would like to generate a more cloud-native format, which will lower the barrier to entry for researchers working with the files when they are stored in the cloud.

Here’s a data description from our Principal Scientist, Stephen Leroy:

GNSS radio occultation measures the bending of the signals of the Global Navigation Satellite System satellites (GPS, GLONASS, Galileo, BeiDou, etc.) as they transect the limb of the Earth’s atmosphere. The bending is measured by observations of the Doppler shifts of their microwave signals, which are calibrated using atomic clocks, and that bending can be inverted for extraordinarily high vertical resolution profiles of the atmosphere’s microwave index of refraction, temperature, pressure, and water vapor as functions of geopotential height. Its region of highest precision and accuracy is in the upper troposphere and lower stratosphere for temperature and the lower troposphere for water vapor.

Each file represents one occultation, or pass through the atmosphere, so the files do not follow a standard grid format.

Read more here:
UCAR GNSS Radio Occultation

Files Example:
COSMIC-2 Files

Moving these files into the open data program is largely motivated by our wish for more researchers to have access to and make use of them. With that goal in mind, I would like to ask whether the larger community here has any thoughts on the best data format to transition these files to.

We have so far considered Zarr and Parquet. Zarr does not seem to fit our use case well, though, since the data are not gridded and also not inherently very high-dimensional.

Rich Signell also suggested that we might simply add Zarr-like metadata alongside the NetCDF files, an approach he has experimented with and written about in some of his Pangeo blog posts.


If the data can be stored in NetCDF, then they can be stored in Zarr. Zarr is an N-dimensional data container, but it handles the 1D case just fine.

If your data are “tabular” in nature, Parquet is a good fit.

It really comes down to what API you want to use to analyze the data. If you want to use Xarray / Numpy / Dask.array, I would go with Zarr. If you want to use pandas / Dask.dataframe, I would go with Parquet.