Datasets for awkward

We are developing dask-awkward at Anaconda, a distributed, chunk-wise extension to awkward-array. Awkward is for processing data with nested and variable-length structures (“jagged arrays”) with numpy semantics (cpu and gpu) or numba-jit functions. Whilst there are excellent pydata solutions for Nd-array and tabular data processing, this large space of data types is hard to work with in python, often requiring iteration over dicts and lists, which is very slow. Typically, tutorials concentrate on normalising to arrays/tables as a first step towards vectorised compute, but with awkward you can work directly on the nested data.
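To make the contrast with dict/list iteration concrete, here is a minimal sketch (with made-up data) of the kind of vectorised operations awkward supports directly on jagged structures:

```python
import awkward as ak

# Illustrative nested records; any JSON-like structure works the same way.
events = ak.Array([
    {"label": "a", "values": [1.1, 2.2, 3.3]},
    {"label": "b", "values": []},
    {"label": "c", "values": [4.4, 5.5]},
])

# Vectorised operations on the nested field, with no Python-level loops:
lengths = ak.num(events["values"], axis=1)  # [3, 0, 2]
totals = ak.sum(events["values"], axis=1)   # [6.6, 0.0, 9.9] (approximately)
firsts = ak.firsts(events["values"])        # [1.1, None, 4.4]
```

The aim of dask-awkward, as described above, is to apply the same kinds of operations chunk-wise across partitions.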

I would like to ask the community if you have any datasets that you would like to experiment with in the awkward and dask-awkward framework. Initially we will be supporting JSON and parquet as input formats, the former ideally with a schema definition. Later we will work on other formats, both text (XML) and binary (avro, protobuf…). The size of the data is not too important; what matters is that processing it by python iteration is currently painful for you.

This is a very early stage for dask-awkward, and we are aiming for the first beta release in the next few months. Awkward itself is better established, especially in the high-energy physics field, but undergoing a rewrite to “v2”, so this is a good time to experiment and optimise, and to make feature requests.


Our ICESat-2 laser altimetry data may be relevant here! There are a bunch of different ICESat-2 data products, but generally they are nested hdf5 datasets which can include raw data and along-track averages nested within the same dataset (see the product listing at the National Snow and Ice Data Center). They are also, annoyingly, indexed by time, which can include duplicates (the time difference between shots is often lower than the data precision), which adds to the challenge.

It’s hard to really take advantage of dask/xarray given the file structure, so I would be interested in testing any new ideas out using this new library. All good if this doesn’t sound like exactly what you were thinking.

It would be great to see the nyc_buildings example reimplemented on top of dask-awkward, especially because you’ll then be able to compare performance between awkward and the ragged-array support provided by spatialpandas.

Indeed, for the NYC data, I see the following parquet schema

- schema: REQUIRED
| - geometry: LIST, LIST, OPTIONAL
|   - list: REPEATED
|     - item: LIST, LIST, OPTIONAL
|       - list: REPEATED
|         - item: LIST, LIST, OPTIONAL
|           - list: REPEATED
|             - item: DOUBLE, OPTIONAL
| - type: BYTE_ARRAY, STRING, UTF8, OPTIONAL
| - name: BYTE_ARRAY, STRING, UTF8, OPTIONAL
| - hilbert_distance: INT64, OPTIONAL

(note that the data is stored in http://s3.amazonaws.com/datashader-data/nyc_buildings.parq.zip )

There are 16.4M leaf values in the geometry column (list[list[list[float64]]], apparently the same as a “multipolygon”), across 32 partitions. If I recall correctly, the hilbert_distance column is a way to find which geometries are in which partitions. Bizarrely, the metadata also includes a base64-serialised version of the arrow schema.
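For anyone who wants to poke at it, here is a minimal sketch of reading that geometry column straight into awkward, assuming the zip above has been downloaded and extracted to a local nyc_buildings.parq; reading the nesting as polygons, then rings, then flat coordinate values is my interpretation of the schema:

```python
import awkward as ak

# Assumes the zip linked above has been extracted locally.
arr = ak.from_parquet("nyc_buildings.parq", columns=["geometry", "name"])

geometry = arr["geometry"]               # list[list[list[float64]]]
polygons_per_row = ak.num(geometry, axis=1)
rings_per_polygon = ak.num(geometry, axis=2)
total_leaf_values = ak.count(geometry)   # should come out around 16.4M
```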

For the ICESat-2 data, @akpetty, could you post a file of interest somewhere, or give some indication of how to get them? The interface seems a little complex and I don’t have a login.

Hi!

These both look like good applications. I’m looking at @akpetty’s ICESat-2 data right now because it looks like a case in which Parquet, Arrow, Awkward Array, and Dask can relieve some pain points, which may be a bigger deal than something spatialpandas can already do (though that’s worth looking at as well).

The “collection of HDF5 files to represent an array of variable-length lists” is a pattern that I’ve seen before, and in previous cases, switching to a format that can deal with nested variable-length data in a native way has made a big difference. The Million Song Dataset had a separate HDF5 file for each song; losslessly converting the compressed HDF5 to similarly compressed Parquet made it 3× smaller. Argo Float data is segmented by date into NetCDF files, and applying the same procedure there made it 20× smaller. Beyond the smaller file sizes, it’s less cumbersome to address data as a big array than as a bunch of little files.
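To make that concrete, here is a rough sketch of the kind of conversion I mean, for a hypothetical pile of per-unit HDF5 files that each hold a 1D dataset named “values” (the names and layout are made up, not the actual Million Song or Argo layouts):

```python
import glob
import h5py
import numpy as np
import awkward as ak

# Hypothetical layout: one small HDF5 file per unit (song, day, ...),
# each containing a 1D dataset called "values" of a different length.
parts, counts = [], []
for path in sorted(glob.glob("units/*.h5")):
    with h5py.File(path, "r") as f:
        data = f["values"][:]
    parts.append(data)
    counts.append(len(data))

# One jagged array in which each entry is one unit's variable-length list,
# written out as a single compressed Parquet file.
jagged = ak.unflatten(np.concatenate(parts), counts)
ak.to_parquet(jagged, "units.parquet")
```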

I just got an account on NASA’s EarthData and I’m downloading a 32 GB slice around Chicago (we have plenty of snow and ice) to get a sense of the data. In the first few files, I see that there are a lot of little Datasets (compressed with gzip level 6), which would likely benefit from being combined into Parquet. I’ll keep you posted if I get something interesting to look at.

From the downloader tool, I can see that there’s a lot of data in this service; it would also make a good demonstration of massively distributed processing with Dask, and a test case where I think we’d be able to help out.

Let me know if you’re interested in following up. Thanks!

(Actually, each of these HDF5 files is already relatively large, unlike the Million Song files and especially unlike the Argo Floats, so I’m predicting that at the same level of compression, they’ll be 25 GB instead of 32 GB (about a 22% reduction). It won’t be as extreme. But there are other benefits.)

Ugh, sorry for dropping the ball on this - fun covid/daycare issues (that have now resolved!). I just dumped some example ICESat-2 data in my personal google bucket to make it easier to grab, but it sounds like you’ve already had some joy accessing it through EarthData/NSIDC. I typically work with the higher-level (Level 3A) sea ice data on our own internal servers - typically just looping through granules, doing some analysis on each one, then saving the output as a simpler netcdf file to use with xarray/dask for higher-level data analysis/plotting. Nothing very fancy.
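In code, that per-granule loop looks roughly like the sketch below; the HDF5 group/variable paths are illustrative guesses, not the exact ATL07 layout:

```python
import glob
import h5py
import xarray as xr

# Granule-by-granule pattern: open each file, reduce, save a small summary.
# The dataset path below is a placeholder, not the real product structure.
for path in sorted(glob.glob("ATL07*.h5")):
    with h5py.File(path, "r") as f:
        heights = f["gt1r/sea_ice_segments/heights/height_segment_height"][:]
    summary = xr.Dataset({"mean_height": ((), heights.mean())})
    summary.to_netcdf(path.replace(".h5", "_summary.nc"))
```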

The lower-level (Level 2A) geolocated photon data (ATL03) is much larger, so typically one would use a subsetting tool like Earthdata, which grabs all the data within your region of interest (I think all the granules that cross the region?), or just work with a given granule of interest.

If you’re still interested, here is the bucket (it should have public read access):

ATL03* (~7 GB) - an example Level 2A raw geolocated photon cloud data granule from the laser altimeter. Pretty big data!
ATL07* (~160 MB) - an example Level 3A sea ice height/type granule for a given hemisphere.
ATL10* (~250 MB) - an example Level 3A sea ice freeboard granule for a given hemisphere.

Thanks again, definitely still very interested in learning more (as I’m assuming the NSIDC will be too - I think they have a couple of people who are active in the PANGEO community).

No problem! It was me that took a long time to get to it.

Thanks for posting a data sample! I’m at the level of listing keys in the HDF5 files right now, and I noticed that only ATL10 had the same keys as the files from EarthData. The others have variations on the same themes (“atlas,” “heights,” “geolocations,” …); all of these words sound sensible, but I don’t know which ones are the most important to focus on.

From your description, it sounds like these files (both level 2A “raw” and level 3A “refined”) are not what you do exploratory data analysis on, but something that you bulk-process to produce summary files for the more interactive data analysis (e.g. plotting). What’s an example of the kind of bulk-processing that you do?

The advantage of Awkward Array over NumPy comes from manipulations that involve variable-length (jagged/ragged) data. A single HDF5 file has a lot of 1D or fixed-dimension (rectilinear) arrays, which NumPy is good at manipulating. But if a group of different-length 1D arrays is supposed to represent parts of a larger dataset containing variable-length data, or if a set of HDF5 files collectively represents a jagged/ragged dataset because each file contains arrays of different sizes, then we could benefit from re-expressing the group or set of files as a single Awkward Array and doing manipulations on that.

The other two datasets that I referred to (Million Songs and Argo Floats) had very few array elements in each HDF5/NetCDF file (hundreds or thousands), but they were split up that way to encode the size of a meaningful unit in the data analysis: each HDF5 file in Million Songs was a song (its length depending on the duration of the song), and each NetCDF file in Argo Floats was a day of data-taking (different numbers of floats were accessible each day). Combining hundreds or thousands of kilobyte-to-megabyte HDF5 files into one Parquet file removed a lot of header overhead, so the total disk size was much smaller, and analyzing a single feature across many units (e.g. “starting note of each song”) could be expressed as NumPy-like slices, rather than iteration through a set of files.
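As a toy illustration of that last point (made-up numbers, with each inner list standing in for one song or one day of floats):

```python
import awkward as ak

# Each inner list plays the role of one unit (one song, one day, ...).
songs = ak.Array([
    [60, 62, 64, 65],       # notes of song 0
    [55, 57],               # notes of song 1
    [70, 69, 67, 65, 64],   # notes of song 2
])

first_notes = songs[:, 0]                        # "starting note of each song": [60, 55, 70]
notes_per_song = ak.num(songs, axis=1)           # [4, 2, 5]
longest_song = songs[ak.argmax(notes_per_song)]  # [70, 69, 67, 65, 64]
```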

So I started by looking for something similar here, but each of these HDF5 files is reasonably large, so they’re not wasting disk space on headers the way the other datasets were. From the key names, these look like qualitatively different types of information (satellite positions, calibrations, measurements at sea, land, etc.), rather than things like “segment0000,” “segment0001,” “segment0002,” etc. that should be concatenated to make something meaningful. Do you typically analyze data across a set of files, in which data in different files with the same key name represent the same quantity in different units, similar to the songs and days of Million Songs and Argo Floats? I ask this particularly because your Google Cloud bucket contained three files that I think aren’t supposed to be concatenated like that, since they all have different sets of keys.

I also didn’t see any vlen_dtype data in any of these files, which would be better expressed in Parquet than in HDF5.
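(A quick way to check for vlen data, in case anyone wants to repeat it on their own files; the filename below is just a placeholder:)

```python
import h5py

# Walk an HDF5 file and report any variable-length (vlen) datasets.
def report_vlen(name, obj):
    if isinstance(obj, h5py.Dataset):
        base = h5py.check_vlen_dtype(obj.dtype)
        if base is not None:
            print(f"{name}: vlen of {base}")

with h5py.File("ATL07_example.h5", "r") as f:  # placeholder filename
    f.visititems(report_vlen)
```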

What’s an example of a pain point in your analysis that you think could be fixed with a jagged/ragged array library?

(No doubt Dask would help, regardless of whether the data themselves need jagged/ragged handling; the question is whether you need Dask-Awkward or just dask.array.)