Data format for a nested 2-D big array?

Hi Pangeo community,

I have a 2-D numpy array with ~10 million elements. Each element is a custom object that can essentially be represented as an N-by-6 array, where N varies from element to element.

What is the recommended data format for this kind of data? I am using Pickle now, but I don’t think it is interoperable enough, not to mention Pickle’s security issues.

Also, is there a better package than numpy for this data in terms of ease of labeling and slicing?

I am looking into xarray (and its to_netcdf method) but am not sure if it really fits my needs. Any thoughts are appreciated!

You might want to have a look at awkward-array, which was built for this kind of irregular (ragged) array. It can also write to a number of different formats (e.g. Parquet).

xarray itself might become a layer on top of a subset of awkward, see pydata/xarray#4285 for more discussion on that idea.


More recently, there’s this discussion: pydata/xarray#7988.

The idea of using Awkward Arrays in xarray has been floating around for a while; it hasn’t happened yet because the Awkward Array type system is too general for it to have meaningful properties like shape and dtype (i.e. it’s too “awkward”). So we’ve been talking about a subset that only has three type elements: variable-length lists (which contribute non-integers to a shape), regular-length lists (which contribute integers to a shape), and numerical data (which has a dtype). The trouble I’ve had is knowing where to draw a line: should we have missing data? Should only the numerical data be allowed to be missing, or can lists be missing as well? The types that the data structure has determine what functions will be possible.

Recently, I learned that the new Array API standard specifies that a shape can be

shape: Tuple[Optional[int], ...]

Their purpose for allowing None in the shape is so that the length of a dimension can be not-known, for instance because JAX needs shapes to be compile-time constants and the length of a dimension might depend on values that are only known at runtime. But I could reinterpret dimensions with None as dimensions that are not-uniform, instead of not-known. I don’t know if there will be consequences of this that are inconsistent with the rest of the specification, but I’m going to try implementing it and seeing what happens. I’m planning to try it out this upcoming Christmas break.
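
To make the reinterpretation concrete, here is a small pure-Python sketch. The helper `ragged_shape` is hypothetical (it is not part of Awkward Array, xarray, or the Array API); it only illustrates what a shape would look like if `None` meant "non-uniform length" rather than "unknown length":

```python
from typing import Optional, Tuple


def ragged_shape(data) -> Tuple[Optional[int], ...]:
    """Shape of nested lists, with None for dimensions whose length varies."""
    shape = []
    level = [data]  # every object found at the current nesting depth
    while level and all(isinstance(x, list) for x in level):
        lengths = {len(x) for x in level}
        # One common length -> regular dimension; several lengths -> ragged.
        shape.append(lengths.pop() if len(lengths) == 1 else None)
        level = [item for x in level for item in x]
    return tuple(shape)


# Two elements holding 2 and 1 rows of 6 numbers each.
nested = [[[0.0] * 6, [1.0] * 6], [[2.0] * 6]]
print(ragged_shape(nested))            # (2, None, 6)
print(ragged_shape([[1, 2], [3, 4]]))  # (2, 2)
```

In this reading, the variable-length dimension shows up as `None`, while the regular-length dimensions keep their integer sizes.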

Something that I’ve known for a while is that xarray can accept any Array API compliant array library as a backend. Maybe it would accept this pared-down ragged array library without effort, or maybe it would require a little effort to let it accept the missing shape items as irregular, rather than regular-but-unknown. We’ll see!

Meanwhile, I should point out that Awkward Array 2.5.0 onward has attrs that are propagated through calculations, which were prompted by an issue with trying to make Awkward Array work with xarray (scikit-hep/awkward#1391). This doesn’t include named axes, though.


I’m planning to try it out this upcoming Christmas break.

You’ll probably notice when you play around with it, but xarray might not be ready for None-sized dimensions. Getting this to work would help with other things like conversion from / to dask.dataframe, but it may turn out to be a huge undertaking.

I see. But at least it would be a fixed point, a target that can be addressed. If it’s self-consistent to interpret None in shapes this way, the Array API dictates how these ragged arrays have to behave, so my questions about where to draw the line would be answered. With a working example, maybe we could see what, specifically, breaks when xarray tries to use it as a backend. (At least, it might be clearer how big of an undertaking it is.)

This sounds great @jpivarski. I agree with @keewis that xarray might not be ready for None dimensions yet, but also that it’s a targeted and well-justified generalization that we can discuss in detail.

Simple tests should be the place to start, but then after that you might be interested in the automated duckarray tests we’re building in https://github.com/pydata/xarray/pull/6903. That could help with finding out how much of the codebase would need to change to support None dimensions.

Thank you all for your ideas and discussion! It looks like there is a goal for xarray to accommodate variable-length arrays, but more time is needed to get there. I will try awkward-array for now, and I would be happy to be kept posted on this discussion (and maybe find a way to contribute to it!).