Reading xArray datasets in groups

tedhabermann · June 18, 2021, 3:16pm

I have created a file with 32 xArray datasets in groups named after the stations they are from. Writing these datasets to netCDF groups is easy with the group parameter in xarray.Dataset.to_netcdf. I was surprised not to find a similar group argument in xarray.open_dataset. But… I read the fine print and discovered that group=groupName can be passed through the **kwargs, and all is well… Sometimes the obvious things work!
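For anyone landing here later, a minimal sketch of that round trip follows. The station IDs, variable name, and file name are made up for illustration, and it assumes the netCDF4 backend is installed.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_station(n_days):
    """Build a small made-up daily position dataset for one station."""
    time = pd.date_range("2021-01-01", periods=n_days, freq="D")
    return xr.Dataset(
        {"east": ("time", np.arange(n_days, dtype="float64"))},
        coords={"time": time},
    )


def write_groups(path, datasets):
    """Write each dataset in `datasets` (name -> Dataset) to its own group."""
    for i, (name, ds) in enumerate(datasets.items()):
        # mode="w" creates the file; mode="a" adds further groups to it.
        ds.to_netcdf(path, group=name, mode="w" if i == 0 else "a")


def read_group(path, name):
    """Read a single group back into memory as an xarray Dataset."""
    with xr.open_dataset(path, group=name) as ds:
        return ds.load()
```

With these helpers, `write_groups("stations.nc", {"SA00": make_station(3), "P041": make_station(5)})` followed by `read_group("stations.nc", "P041")` should give back the five-day dataset for that station.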

rabernat · June 18, 2021, 3:32pm

Thanks for sharing your experience Ted! (And welcome to the forum!) I’m glad you were able to get your data read.

While Xarray can read a single netCDF / HDF group, it cannot represent a nested tree of groups with related variables in a single object with its current data model. However, this feature is currently being discussed and is in fact included as part of a pending CZI EOSS proposal.

github.com/pydata/xarray

Feature Request: Hierarchical storage and processing in xarray

opened 08:52PM - 01 Jun 20 UTC

emilbiju

I am using xarray for processing geospatial data and have encountered two major challenges with existing data structures in xarray:

- Data arrays stored in an xarray Dataset cannot be grouped into hierarchical levels/logical subsets to reflect the internal organisation of the data. This makes it difficult to identify and process a subset of the data variables that pertain to a specific problem.
- When two data arrays having a shared dimension but different coordinate values along the dimension are merged into a Dataset, the union of coordinate values from the 2 data arrays becomes the new coordinate set corresponding to that dimension. Consequently, when the value of a variable in the dataset corresponding to a coordinate value is unknown, `nan` is used as a substitute which results in memory wastage.

I would like to suggest a tree-based data structure for xarray in which the leaves store individual data arrays and the other nodes store the hierarchical information. Since data arrays are stored independently, each dimension only needs to be associated with coordinate values that are valid for that data array. To meet these requirements, I have implemented a data structure that also supports the below capabilities:

- Standard xarray methods can be applied to the tree at all hierarchical levels, i.e., when a function is called at a hierarchical level, it is mapped over all data arrays that occur at the leaves under the corresponding node. For example, say I have a tree object (lets call it `dt`) with child nodes: `weather`, `satellite image` and `population`. Each of these nodes has data arrays/subtrees under it.

  > ![Screenshot 2020-06-02 at 2 10 28 AM](https://user-images.githubusercontent.com/39640592/83452402-42152680-a476-11ea-9e88-cfb4ddb80310.png)

  The mean over time of all data variables associated with weather can be obtained using `dt.weather.mean('time')` which applies the function to `sea_surface_temperature`, `dew_point_temperature`, `wind_speed` and `pressure`.
- It can be encoded into the netCDF format, like xarray Datasets.
- It supports item assignment at all hierarchical levels.

I would like to know of the possibility of introducing such a data structure in xarray and the challenges involved in the same.

tedhabermann · June 18, 2021, 3:50pm

Ryan,
Glad to be here! I am working with the Incorporated Research Institutions for Seismology (IRIS) and UNAVCO to design a container for many types of geophysical data, mostly timeseries. We are learning about the xArray data model and tools as a candidate data model for that work. The Pangeo community has been very helpful. Thanks!
Ted

cgentemann · June 18, 2021, 3:59pm

Ted,
as a workaround… one trick I use a lot is to read xarray datasets into a dictionary. It seems like this might be a nice way to handle all these station data groups. Something like this:

import xarray as xr

ds_dict = {}
for name in filelist:
    ds = xr.open_dataset(name)  # read in a group here, e.g. with group=...
    ds_dict[name] = ds  # add the dataset to the dictionary

chelle

tedhabermann · June 18, 2021, 4:26pm

Chelle,
I like that idea.

I wrote the file from a notebook that reads daily positions of a set of GNSS stations in some region from a UNAVCO web service into a set of dataframes. I created a dictionary that includes a metadata dictionary and a dataframe for each station (below), then I wrote the datasets out to the file with the station metadata dictionary as attributes in each group:

allData: {
    stationID 1: {
        positionMetadata {
            position metadata dictionary
        },
        position data dataframe
    },
    stationID 2: {
        positionMetadata {
            position metadata dictionary
        },
        position data dataframe
    },
    ....
}
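As a rough sketch, one entry of that dictionary could be turned into a dataset with the metadata as attributes along these lines. The column names and metadata keys are made up for illustration, and nested metadata would need flattening first, since netCDF attributes have to be simple values.

```python
import pandas as pd
import xarray as xr


def station_to_dataset(position_df, position_metadata):
    """Convert one station's dataframe to a Dataset carrying its metadata."""
    # The dataframe index (here a DatetimeIndex named "time") becomes the
    # dimension coordinate; each column becomes a data variable.
    ds = xr.Dataset.from_dataframe(position_df)
    # Attach the station metadata as global attributes of the dataset.
    ds.attrs.update(position_metadata)
    return ds


# Made-up example entry in the shape of allData[stationID].
df = pd.DataFrame(
    {"east": [0.1, 0.2, 0.3], "north": [1.0, 1.1, 1.2]},
    index=pd.date_range("2021-01-01", periods=3, freq="D", name="time"),
)
ds = station_to_dataset(df, {"ID": "SA00", "units": "m"})
```

Writing `ds` with `to_netcdf(..., group="SA00")` would then store those attributes on the group.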

I was also thinking of trying to merge them all into one xArray dataset with stationID as a string dimension so I could use xArray to select data from each station rather than reading the appropriate group… My next experiment…

Of course, all of the stations have data for different time periods and I am hoping that the xArray merge creates a single time dimension for all stations…
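A sketch of what that experiment might look like, using xr.concat along a new stationID dimension with an outer join on time. The two stations and their date ranges are invented here, and this is only one possible way to combine them.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_station(start, n_days):
    """Build a made-up daily dataset starting on `start`."""
    time = pd.date_range(start, periods=n_days, freq="D")
    return xr.Dataset({"east": ("time", np.ones(n_days))}, coords={"time": time})


stations = {
    "SA00": make_station("2021-01-01", 4),  # Jan 1 to Jan 4
    "P041": make_station("2021-01-03", 4),  # Jan 3 to Jan 6
}

# Stack the stations along a new string-valued "stationID" dimension.
# join="outer" takes the union of all time values and pads gaps with NaN.
station_index = pd.Index(list(stations), name="stationID")
combined = xr.concat(list(stations.values()), dim=station_index, join="outer")

# Select one station and drop the days it has no data for.
sa00 = combined.sel(stationID="SA00").dropna("time")
```

The padding means every station is stored over the full time range, so the combined dataset can be much larger than the separate groups when the stations cover very different periods.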

Thanks again for the idea and the snippet.
Ted

rsignell · June 18, 2021, 4:33pm

Maybe you could get the group names from h5py, then pass each group to xarray?

Here’s an example.
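A minimal sketch of that idea (not necessarily the same as the linked example), assuming the file is netCDF4, i.e. HDF5 underneath, and that the stations are top-level groups. The helper names are made up.

```python
import h5py
import xarray as xr


def list_groups(path):
    """Return the names of the top-level groups in a netCDF4/HDF5 file."""
    with h5py.File(path, "r") as f:
        # Keep only HDF5 groups; root-level variables are h5py.Dataset objects.
        return [name for name, obj in f.items() if isinstance(obj, h5py.Group)]


def open_all_groups(path):
    """Open every top-level group as its own xarray Dataset."""
    return {name: xr.open_dataset(path, group=name) for name in list_groups(path)}
```

The datasets come back lazily loaded, so each one should be closed (or loaded) when you are done with it.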

rabernat · June 18, 2021, 4:33pm

This behavior is customizable and documented in the align function: xarray.align

In general, it is not a trivial problem to align different timeseries. You may also want to consider interpolation: Interpolating data
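To make the join options concrete, here is a small sketch with two made-up daily series that only partly overlap:

```python
import numpy as np
import pandas as pd
import xarray as xr


def daily_series(start, n_days):
    """Build a made-up daily series with values 0, 1, 2, ..."""
    time = pd.date_range(start, periods=n_days, freq="D")
    return xr.DataArray(
        np.arange(n_days, dtype="float64"), coords={"time": time}, dims="time"
    )


a = daily_series("2021-01-01", 4)  # Jan 1 to Jan 4
b = daily_series("2021-01-03", 4)  # Jan 3 to Jan 6

# join="inner" keeps only the days both series share (Jan 3 and Jan 4).
a_in, b_in = xr.align(a, b, join="inner")

# join="outer" keeps every day from either series and pads gaps with NaN.
a_out, b_out = xr.align(a, b, join="outer")
```

The inner join throws data away and the outer join adds NaN padding, so which one is appropriate depends on what you do with the series afterwards.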

tedhabermann · June 18, 2021, 4:37pm

Ryan,

Thanks for the pointer to align… I will check it out. In this case the positions are all daily, i.e. low resolution, so I think it should be ok.

Ted

tedhabermann · June 18, 2021, 4:45pm

Rich,

I like that general approach…

This file includes a metadata group that is an xArray dataset with a column ID so, in this specific case, I can also get the IDs like:
import xarray as xr

dataFileName = 'coloradoStations.nc'
metadata_ds = xr.open_dataset(dataFileName, group='metadata')
metadata_df = metadata_ds.to_dataframe()
metadata_df['ID'].unique()
# array(['SA00', 'SG24', 'AMC2', 'P041', 'P037', 'NISU', 'P040', 'P044',
#        'P031', 'RG17', 'RG22', 'RG19', 'RG23', 'RG16', 'RG15', 'RG24',
#        'RG20', 'RG21', 'RG14', 'P029', 'RG18', 'MFP0', 'MFTN', 'MFTW',
#        'MFTS', 'MFTC', 'UNAC', 'P728', 'NIST', 'SA62', 'RG26', 'PRX5'],
#       dtype=object)
This also provides 30 other fields of information about each station…

I like the ability of xArray to easily include detailed metadata along with the data in these files…

Ted
