We are developing dask-awkward at Anaconda, a distributed, chunk-wise extension to awkward-array. Awkward is for processing data with nested and variable-length structure (“jagged arrays”) using numpy semantics, on CPU or GPU, or via numba-jit functions. Whilst there are excellent solutions in the PyData ecosystem for N-d array and tabular data processing, this larger space of data types is hard to work with in python, often requiring iteration over dicts and lists, which is very slow. Tutorials typically concentrate on normalising to arrays/tables as a first step towards vectorised compute, but with awkward you can work directly on the nested data.
I would like to ask the community whether you have any datasets you would like to experiment with in the awkward and dask-awkward framework. Initially we will support JSON and parquet as input formats, the former ideally with a schema definition. Later we will work on other formats, both text (XML) and binary (avro, protobuf…). The size of the data is not too important; what matters is that processing it by python iteration is currently painful for you.
This is a very early stage for dask-awkward, and we are aiming for a first beta release in the next few months. Awkward itself is better established, especially in the high-energy physics field, but is undergoing a rewrite to “v2”, so this is a good time to experiment, to optimise, and to make feature requests.