No problem! I was the one who took a long time to get to it.
Thanks for posting a data sample! I’m at the level of listing keys in the HDF5 files right now, and I noticed that only ATL10 had the same keys as the files from EarthData. The others have variations on the same themes (“atlas,” “heights,” “geolocations,” …)—all of these words sound sensible, but I don’t know which of them are the important ones to focus on.
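In case it helps to see what I mean by “listing keys,” here’s a minimal sketch of what I’m running, assuming h5py and a local copy of one of the files (the filename is just a placeholder):

```python
import h5py

# "some_file.h5" is a placeholder for one of the files from the bucket.
with h5py.File("some_file.h5", "r") as file:

    def show(name, obj):
        # Print every group path, plus shape/dtype for each dataset.
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
        else:
            print(name)

    file.visititems(show)
```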
From your description, it sounds like these files (both level 2A “raw” and level 3A “refined”) are not what you do exploratory data analysis on, but something that you bulk-process to produce summary files for the more interactive data analysis (e.g. plotting). What’s an example of the kind of bulk-processing that you do?
The advantage of Awkward Array over NumPy would come from any manipulations that involve variable-length data, such as jagged/ragged arrays. A single HDF5 file has a lot of 1D or fixed-dimension (rectilinear) arrays, which NumPy is good at manipulating. But if a group of different-length 1D arrays is supposed to represent parts of a larger, variable-length dataset, or if a set of HDF5 files collectively represents a jagged/ragged dataset because each file contains arrays of different sizes, then we could benefit from re-expressing that group or set of files as a single Awkward Array and doing the manipulations on that.
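For concreteness, here’s a rough sketch of that last scenario, assuming (hypothetically) that each file holds a different-length 1D array under the same key, so that the set of files is the ragged dataset. The filenames and the "some/measurement" key are made up for illustration:

```python
import awkward as ak
import h5py
import numpy as np

# Hypothetical filenames and key: each file holds one variable-length 1D
# array, and the collection of files is the jagged/ragged dataset.
filenames = ["unit_0001.h5", "unit_0002.h5", "unit_0003.h5"]

parts = []
for filename in filenames:
    with h5py.File(filename, "r") as file:
        parts.append(np.asarray(file["some/measurement"]))

# One ragged array: axis 0 runs over files/units, axis 1 is variable-length.
ragged = ak.Array(parts)

# NumPy-like slices now cut across file boundaries, e.g. the first value
# in each unit:
first_values = ragged[:, 0]
```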
The other two datasets that I referred to—Million Songs and Argus Floats—had very few array elements in each HDF5/NetCDF file (hundreds or thousands), but they were split up that way to encode the size of a meaningful unit in the data analysis: each HDF5 file in Million Songs was one song (its length depends on the duration of the song), and each NetCDF file in Argus Floats was one day of data-taking (different numbers of floats were accessible each day). Combining hundreds or thousands of kilobyte-to-megabyte HDF5 files into one Parquet file removed a lot of header overhead, so the total disk size was much smaller, and analyzing a single feature across many units (e.g. “starting note of each song”) could be expressed as NumPy-like slices, rather than iteration through a set of files.
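If that pattern applies here, the consolidation would look roughly like this (continuing the sketch above; "measurement" is still a made-up field name):

```python
import awkward as ak

# Wrap the ragged array in a named field and write it all to one Parquet
# file, in place of hundreds or thousands of small HDF5 files.
ak.to_parquet(ak.Array({"measurement": ragged}), "combined.parquet")

# Later analyses read the single file and slice across units, the analogue
# of "starting note of each song":
units = ak.from_parquet("combined.parquet")
first_per_unit = units["measurement"][:, 0]
```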
So I started by looking for something similar here, but each of these HDF5 files is reasonably large, so they’re not wasting disk space on headers the way the other datasets were. From the key names, these look like qualitatively different types of information—satellite positions, calibrations, measurements at sea, land, etc.—rather than things like “segment0000,” “segment0001,” “segment0002,” etc. that should be concatenated to make something meaningful. Do you typically analyze data across a set of files, in which datasets with the same key name in different files represent the same kind of data for different units, similar to the songs and days of Million Songs and Argus Floats? I ask particularly because your Google Cloud bucket contained three files that I think aren’t supposed to be concatenated like that, since they all have different sets of keys.
I also didn’t see any vlen_dtype data in any of these files, which would be better expressed in Parquet than in HDF5.
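(For reference, this is the quick check I used, assuming h5py; "some_file.h5" is a placeholder again:)

```python
import h5py

# Report any datasets with variable-length dtypes; check_vlen_dtype returns
# the base dtype for vlen data and None otherwise.
with h5py.File("some_file.h5", "r") as file:

    def check(name, obj):
        if isinstance(obj, h5py.Dataset) and h5py.check_vlen_dtype(obj.dtype) is not None:
            print(f"{name} is variable-length: {obj.dtype}")

    file.visititems(check)
```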
What’s an example of a pain point in your analysis that you think could be fixed with a jagged/ragged array library?
(No doubt Dask would help, regardless of whether the data themselves need jagged/ragged handling; the question is whether you need Dask-Awkward or just dask.array.)