Kerchunk planning

Hello, I have been looking at VirtualiZarr; I think it makes sense and I really like the idea. One question that I have is: wouldn’t VirtualiZarr be a replacement for rechunker? If you generate indexes of some format, then you can rechunk it efficiently as needed.

I guess the metadata is part of zarr and is coming with v3, is that right?

For kerchunk, I was wondering: is there any plan to add more formats?
I don’t know about one-dimensional data and how it would be represented using Zarr/xarray (for example FASTQ or FASTA files from genomics, though I think TileDB handles those anyway), but what about other formats such as the imzML format for spectrometry data? I see that there’s a need for distributed and parallel processing in scientific pipelines (I worked on a genomics pipeline and am currently working on a radio interferometry pipeline), and I feel that multidimensional arrays are a really powerful abstraction.

Thanks for your amazing work.

The extent to which a format will be useful to kerchunk depends on the specifics (like how well it fits the zarr/xarray model) and how complex the encoding is, but I am happy to consider any binary array format.

I think imzML specifically is one I came across recently: base64-ASCII blocks within an XML file, right? This can be kerchunked; whether it’s useful will depend on how big the chunks are compared to the references that represent them.
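To make the scanning idea concrete, here is a stdlib-only sketch that locates a base64 payload inside a toy XML snippet and records its byte range the way a kerchunk scanner would. The tag layout and file name are invented for illustration, not real imzML structure:

```python
# Stdlib-only sketch: find a base64 payload inside an XML document and
# record its byte range, the way a kerchunk scanner would.
# The tag layout and file name are invented, not real imzML structure.
import base64
import re

xml = b"<spectrum><binary>SGVsbG8sIGltek1MIQ==</binary></spectrum>"

match = re.search(rb"<binary>([A-Za-z0-9+/=]+)</binary>", xml)
offset, length = match.start(1), len(match.group(1))

# A kerchunk-style reference stores (file, offset, length); a reader
# would still need a base64 codec step to decode the payload on load.
ref = ("example.imzML", offset, length)
payload = base64.b64decode(xml[offset:offset + length])
```

The point about chunk size versus reference size shows up here: each reference costs a fixed few tens of bytes, so it only pays off when the payload it points at is much larger.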

Yes, you might find this useful:

It’s from a colleague of mine, but the principle is the same as kerchunk’s: parallel processing and data indexing.

It covers more formats, if you want to take a look.

What I miss is some abstraction underneath with a “universal” API; I really like the VirtualiZarr idea since it might fit in there. Currently the framework requires the developer to create a new “data abstraction” each time a new format is added.

Yes, kerchunk aims for a universal API across formats, but the specifics of each format still need to be handled somewhere (once per format, with scanning run once per input file).

Sorry, I wasn’t clear; I was referring to the interaction with datasets after they’re ingested with kerchunk. Basically, I like how kerchunk integrates with xarray: it doesn’t matter what the user wants to ingest, the resulting dataset will be in xarray format.

The dataplug framework, by contrast, needs the developer to define the abstraction.

I’m not sure your suggestion makes sense, at least for compressed chunks. Kerchunk/VirtualiZarr manipulate references to compressed chunks on disk, but they can’t change the content of those chunks, and a rechunk would imply changing the contents of the chunks.
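To make that concrete: a kerchunk-style reference set is essentially a mapping from chunk keys to (file, offset, length) triples, so the only legal operations are on the references themselves. A minimal sketch, with made-up file names:

```python
# A minimal kerchunk-style reference set (file names are made up):
# each chunk key maps to [source file, byte offset, byte length].
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "data/0.0": ["measurements_2020.nc", 1024, 4096],
    "data/1.0": ["measurements_2021.nc", 1024, 4096],
}

# Virtualization can rearrange which chunk key points at which byte
# range, but the bytes themselves (usually compressed) are untouched,
# so a true rechunk, which changes chunk contents, is out of scope.
refs["data/0.0"], refs["data/1.0"] = refs["data/1.0"], refs["data/0.0"]
```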

In the VirtualiZarr model, metadata either becomes part of the xarray model (i.e. the dimension_names) or gets carried along as .attrs. It can then be written back out as part of a zarr store, yes, but v3 won’t really change anything about how that works.

If kerchunk keeps adding more backends to support creating reference dicts from other file formats, VirtualiZarr should still be able to consume those. That would be a nice separation of concerns.

So if my understanding is correct, rechunker comes from the need to resize zarr chunks, since some chunk configurations can be suboptimal.

If you use kerchunk to read some format and hold the reference index as a VirtualiZarr dataset, doesn’t it make sense, if the user wants chunks of arbitrary size, to use that index? Since VirtualiZarr and kerchunk create a “virtual view” of the dataset, wouldn’t that allow a more efficient rechunking algorithm?

This would basically mean rearranging the chunk metadata/index to create bigger or smaller chunks.

Maybe I am missing something? I haven’t really used rechunker.

@abourramouss this approach might work for “fusing” chunks together to make bigger chunks, but IIUC it won’t allow you to make smaller chunks than you started with, because you would then be trying to read only part of a compressed chunk.
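A sketch of why fusing can work while splitting cannot: merging two references is plain byte-range arithmetic, but only when the chunks are uncompressed and physically contiguous in the file. File names and offsets below are hypothetical:

```python
# Sketch: fusing two chunk references is plain byte-range arithmetic,
# but only when the chunks are uncompressed and physically contiguous.
# File names and offsets below are hypothetical.
def fuse(ref_a, ref_b):
    """Merge two [file, offset, length] references into one, if possible."""
    file_a, off_a, len_a = ref_a
    file_b, off_b, len_b = ref_b
    if file_a != file_b or off_a + len_a != off_b:
        raise ValueError("chunks are not contiguous; cannot fuse")
    return [file_a, off_a, len_a + len_b]

fused = fuse(["data.bin", 0, 4096], ["data.bin", 4096, 4096])

# There is no analogous shortcut for splitting a compressed chunk:
# a byte range into the middle of a compressed blob is not
# independently decodable.
```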

Perhaps you mean that you can use rechunker across a set of files that don’t otherwise form a logical dataset, but that kerchunk/virtualizarr can “combine” for you? In that case, yes, you could use the tools before rechunker, but I’m not sure it provides any benefit over open_mfdataset or the other xarray combine APIs. After all, you would normally only be doing this once.


On a different topic, do you think measurement sets could be potentially read by kerchunk?

They are datasets used in radio interferometry. The complexity here is that a dataset consists of different tables; internally it is like a relational database.

Another thing is that different formats are used physically, so this would involve implementing different algorithms for each table format (if I remember correctly, there were four).

Given the structure of the dataset, is this something worth pursuing?

Measurement set structure:

Kerchunk can probably find the binary blocks corresponding to the tables (are they chunked? compressed/encoded?) and assign a compound dtype to each. But in the zarr model, these would all just be free-standing arrays without any relations between them. Zarr works best for chunked multi-dimensional arrays, offering easy parallelism.
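For the compound-dtype idea, a stdlib-only sketch of how a fixed-size table row might be described and read back from a binary block; the (int32 id, float64 value) layout is invented for illustration:

```python
# Stdlib-only sketch of a "compound dtype" over a binary table block:
# the (int32 id, float64 value) row layout is invented for illustration.
import struct

row_fmt = "<id"                      # little-endian: int32 then float64
row_size = struct.calcsize(row_fmt)  # 12 bytes, no padding
blob = struct.pack(row_fmt, 7, 3.5) * 3  # three rows back to back

rows = [struct.unpack_from(row_fmt, blob, i * row_size) for i in range(3)]
```

A numpy structured dtype would play the same role in a real zarr array, one record per table row.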

Kerchunk is not entirely tied to zarr: you can make a reference filesystem from anything and pass it to fsspec-expecting libraries, but there isn’t really much else you can do with these at the moment. Does CASA deal with zarr at all?

I have been trying to learn more about the format, as having an xarray interface to it, as well as parallel processing, would be amazing.

The measurement set tables are written and read via different formats, called storage managers; each storage manager handles different tables. They are responsible for the physical storage of the data, including aspects like chunking and compression:

  1. IncrementalStMan: used for columns where data changes incrementally; the idea is to use it with tables that repeat data, so the data can be compressed.
  2. StandardStMan: serves as a general-purpose manager without specific compression techniques.
  3. TiledColumnStMan: manages multidimensional arrays by dividing them into tiles (chunks). This aligns with the concept of chunking in Zarr.

The intricate part is that storage managers are applied per table, not per dataset.

My idea is to have byte-range references (what VirtualiZarr is doing) into these types of files; this would ease parallel processing of the data as well as pipeline operations. Currently I am doing exactly that, but using casacore tools, which aren’t cloud-native.

From our point of view, each table would be a different variable/array, and having a different encoding for each is fine. We just have to write the three codecs as numcodecs-style classes (which is super easy if you already have an implementation). For the specific case of “tiled”, the chunking in the virtual zarr dataset would match the original chunking. For the other two, you can probably only have one chunk per table in each input dataset. Whether multiple sets can be combined in any logical way, I leave to the domain experts 🙂.
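To illustrate the shape of such a codec class: the interface numcodecs expects is essentially a `codec_id` plus `encode()`/`decode()` over raw bytes. The toy run-length scheme below is a placeholder, NOT the real IncrementalStMan encoding:

```python
# Toy numcodecs-style codec: the interface numcodecs expects is
# essentially a codec_id plus encode()/decode() over raw bytes.
# The run-length scheme below is a placeholder, NOT the real
# IncrementalStMan encoding.
class ToyIncrementalCodec:
    codec_id = "toy_incremental"

    def encode(self, buf: bytes) -> bytes:
        # run-length encode repeated bytes as (count, value) pairs
        out = bytearray()
        i = 0
        while i < len(buf):
            j = i
            while j < len(buf) and j - i < 255 and buf[j] == buf[i]:
                j += 1
            out += bytes([j - i, buf[i]])
            i = j
        return bytes(out)

    def decode(self, buf: bytes) -> bytes:
        # expand (count, value) pairs back into the original bytes
        out = bytearray()
        for k in range(0, len(buf), 2):
            out += bytes([buf[k + 1]]) * buf[k]
        return bytes(out)

codec = ToyIncrementalCodec()
data = b"\x00\x00\x00\x00\x07\x07\x07"
roundtrip = codec.decode(codec.encode(data))
```

A real implementation would subclass `numcodecs.abc.Codec` and register itself so that zarr can look the codec up by its `codec_id`.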

Yes, that’s another discussion: the data set is a collection of measurement sets, and in the end a data set would be processed together in parallel.

I worked a bit on the problem a couple of months ago; although I didn’t get any results, I started to understand the format.

Do you think it would be feasible to make a kerchunk reader and create references using VirtualiZarr? It would immensely ease parallel data processing in the cloud.

From what you have said here, I imagine a kerchunk scanner for the format is feasible, yes. Since it’s quite complicated, it will take some effort to get all the details right; perhaps you would start by just extracting one big tiled/chunked array to prove the concept.

Since the dataset is not regular, you would probably end up writing your own combine routine rather than using kerchunk.combine or VirtualiZarr, but that part can wait.

Well, you give me some confidence then, thank you!

Sorry, I was wrong here: measurement set storage managers (tiled, standard, incremental) are used per column within a single table.

When reading columns from a table, different storage managers are used, so tables are composed of columns managed by different storage managers.

Although I still find some of it confusing, I am still working on the kerchunk measurement-set reader.