Datasets for awkward

We are developing dask-awkward at Anaconda, a distributed, chunk-wise extension to awkward-array. Awkward is for processing data with nested and variable-length structures (“jagged arrays”) with numpy semantics (CPU and GPU) or numba-jit functions. Whilst there are excellent PyData solutions for N-dimensional array and tabular data processing, this large space of data types is hard to work with in Python, often requiring iteration over dicts and lists, which is very slow. Tutorials typically concentrate on normalising to arrays/tables as a first step towards vectorised compute, but with awkward you can work directly on the nested data.
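
As a minimal sketch of the kind of vectorised operation awkward enables on jagged data (the array contents below are made up purely for illustration):

```python
import awkward as ak

# A jagged array: records with a variable-length list field.
events = ak.Array([
    {"id": 0, "hits": [1.1, 2.2, 3.3]},
    {"id": 1, "hits": []},
    {"id": 2, "hits": [4.4, 5.5]},
])

# Numpy-style, vectorised operations instead of looping over dicts and lists:
lengths = ak.num(events.hits)            # [3, 0, 2]
totals = ak.sum(events.hits, axis=1)     # ~[6.6, 0.0, 9.9]
multi = events[ak.num(events.hits) > 1]  # records with more than one hit
```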

I would like to ask the community whether you have any datasets that you would like to experiment with in the awkward and dask-awkward framework. Initially we will support JSON and parquet as input formats, the former ideally with a schema definition. Later we will work on other formats, both text (XML) and binary (avro, protobuf…). The size of the data is not too important; what matters is that processing it by Python iteration is currently painful for you.
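
For reference, loading in awkward itself currently looks roughly like the sketch below (the file names are placeholders); the chunk-wise dask-awkward equivalents are what we are building:

```python
import awkward as ak

# Parse JSON text into an awkward array (hypothetical file name).
with open("events.json") as f:
    records = ak.from_json(f.read())

# Read a (possibly nested) parquet file into an awkward array.
table = ak.from_parquet("events.parquet")
```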

This is a very early stage for dask-awkward, and we are aiming for the first beta release in the next few months. Awkward itself is better established, especially in the high-energy physics field, but is undergoing a rewrite to “v2”, so this is a good time to experiment and optimise, and to make feature requests.

Our ICESat-2 laser altimetry data may be relevant here! There are a bunch of different ICESat-2 data products, but generally they are nested HDF5 datasets which can include raw values and along-track averages nested within the same dataset (see the product pages at the National Snow and Ice Data Center). They are also, annoyingly, indexed by time, which can include duplicates (the time difference between shots is often smaller than the data precision), which adds to the challenge.
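
For concreteness, here is roughly how one of those granules can be inspected with h5py; the file and dataset names below are just placeholders, not a specific product:

```python
import h5py

# Open one ICESat-2 granule (hypothetical file name) and list its nested
# groups and datasets.
with h5py.File("ATL06_example.h5", "r") as f:
    f.visititems(lambda name, obj: print(name, obj))

    # Variables at different along-track resolutions live in sibling groups,
    # so pulling them out group by group is one way to hand them to awkward:
    # heights = f["gt1l/land_ice_segments/h_li"][:]  # hypothetical dataset path
```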

It’s hard to really take advantage of dask/xarray given the file structure, so I would be interested in testing any new ideas out using this new library. All good if this isn’t exactly what you were thinking of.

It would be great to see the nyc_buildings example reimplemented on top of dask-awkward, especially because you’d then be able to compare performance between awkward and the ragged-array support provided by spatialpandas.

Indeed, for the NYC data, I see the following parquet schema:

- schema: REQUIRED
| - geometry: LIST, LIST, OPTIONAL
|   - list: REPEATED
|     - item: LIST, LIST, OPTIONAL
|       - list: REPEATED
|         - item: LIST, LIST, OPTIONAL
|           - list: REPEATED
|             - item: DOUBLE, OPTIONAL
| - type: BYTE_ARRAY, STRING, UTF8, OPTIONAL
| - name: BYTE_ARRAY, STRING, UTF8, OPTIONAL
| - hilbert_distance: INT64, OPTIONAL

(note that the data is stored at http://s3.amazonaws.com/datashader-data/nyc_buildings.parq.zip)
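
For anyone wanting to reproduce that listing, something along these lines should print the schema once the download is unzipped (the exact formatting of the printed schema may differ from the listing above depending on the library used):

```python
import pyarrow.parquet as pq

# Print the parquet schema of the unzipped, multi-file NYC buildings dataset.
dataset = pq.ParquetDataset("nyc_buildings.parq")
print(dataset.schema)
```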

There are 16.4M leaf values in the geometry column (list[list[list[float64]]], apparently the same as a “multipolygon”), across 32 partitions. If I recall correctly, the hilbert_distance column is a way to find which geometries are in which partitions. Bizarrely, the metadata includes a base64-serialised version of the arrow schema.
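
As a sketch of what working on that column directly might look like (assuming ak.from_parquet can read the multi-file dataset, and using the column names from the schema above):

```python
import awkward as ak

# Load just the nested geometry column: list[list[list[float64]]].
buildings = ak.from_parquet("nyc_buildings.parq", columns=["geometry"])

# Vectorised questions about the ragged structure, with no Python-level loops:
n_polygons = ak.num(buildings.geometry, axis=1)  # polygons per multipolygon
n_rings = ak.num(buildings.geometry, axis=2)     # rings per polygon (jagged)
```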

For the ICESat-2 data, @akpetty could you post a file of interest somewhere, or give some indication of how to get one? The interface seems a little complex and I don’t have a login.