Kerchunk planning

Hello, I have been looking at VirtualiZarr; I think it makes sense and I really like the idea. One question that I have is: wouldn’t VirtualiZarr be a replacement for rechunker? If you generate indexes of some format, then you can rechunk it efficiently as needed.

I guess the metadata is part of zarr and is coming with v3, is that right?

For kerchunk, I was wondering: is there any plan to add more formats?
I don’t know about one-dimensional data and how it would be represented using Zarr/xarray (for example FASTQ or FASTA files from genomics, though I think TileDB handles those anyway), but what about other formats such as the imzML format for spectrometry data? I see that there’s a need for distributed and parallel processing in scientific pipelines (I worked on a genomics pipeline and am currently working on a radio interferometry pipeline), and I feel that multidimensional arrays are a really powerful abstraction.

Thanks for your amazing work.

The extent to which a format will be useful to kerchunk depends on the specifics (like how well it fits the zarr/xarray model) and how complex the encoding is, but I am happy to consider any binary array format.

I think imzML specifically is one I came across recently: base64-ASCII blocks within an XML file, right? This can be kerchunked; whether it’s useful will depend on how big the chunks are compared to the references that represent them.
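To make the scanning idea concrete, here is a stdlib-only sketch that locates a base64 payload inside a toy XML snippet and records its byte range the way a kerchunk scanner would. The tag layout and file name are invented for illustration, not real imzML structure:

```python
# Stdlib-only sketch: find a base64 payload inside an XML document and
# record its byte range, the way a kerchunk scanner would.
# The tag layout and file name are invented, not real imzML structure.
import base64
import re

xml = b"<spectrum><binary>SGVsbG8sIGltek1MIQ==</binary></spectrum>"

match = re.search(rb"<binary>([A-Za-z0-9+/=]+)</binary>", xml)
offset, length = match.start(1), len(match.group(1))

# A kerchunk-style reference stores (file, offset, length); a reader
# would still need a base64 codec step to decode the payload on load.
ref = ("example.imzML", offset, length)
payload = base64.b64decode(xml[offset:offset + length])
```

The point about chunk size versus reference size shows up here: each reference costs a fixed few tens of bytes, so it only pays off when the payload it points at is much larger.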

Yes, you might find this useful:

It’s from a colleague of mine, but the principle is the same as kerchunk’s: parallel processing and data indexing.

It covers more formats, if you want to take a look.

What I miss is some abstraction underneath with a “universal” API; I really like the VirtualiZarr idea since it might fit in there. Currently the framework requires the developer to create a new “data abstraction” each time a new format is added.

Yes, kerchunk aims for a universal API across formats, but the specifics of each format still need to be handled somewhere (once per format, with scanning run once per input file).

Sorry, I wasn’t clear; I was referring to the interaction with datasets after they’re ingested with kerchunk. Basically, I like how kerchunk integrates with xarray: it doesn’t matter what the user wants to ingest, the resulting dataset will be in xarray format.

The dataplug framework, by contrast, needs the developer to define the abstraction.

I’m not sure your suggestion makes sense, at least for compressed chunks. Kerchunk/VirtualiZarr manipulate references to compressed chunks on disk, but they can’t change the content of those chunks, and a rechunk would imply changing the contents of the chunks.
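To make that concrete: a kerchunk-style reference set is essentially a mapping from chunk keys to (file, offset, length) triples, so the only legal operations are on the references themselves. A minimal sketch, with made-up file names:

```python
# A minimal kerchunk-style reference set (file names are made up):
# each chunk key maps to [source file, byte offset, byte length].
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "data/0.0": ["measurements_2020.nc", 1024, 4096],
    "data/1.0": ["measurements_2021.nc", 1024, 4096],
}

# Virtualization can rearrange which chunk key points at which byte
# range, but the bytes themselves (usually compressed) are untouched,
# so a true rechunk, which changes chunk contents, is out of scope.
refs["data/0.0"], refs["data/1.0"] = refs["data/1.0"], refs["data/0.0"]
```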

In the VirtualiZarr model, metadata either becomes part of the xarray model (i.e. the dimension_names) or gets carried along as .attrs. It can then be written back out as part of a zarr store, yes, but v3 won’t really change anything about how that works.

If kerchunk keeps adding more backends to support creating reference dicts from other file formats, VirtualiZarr should still be able to consume those. That would be a nice separation of concerns.

So if my understanding is correct, rechunker comes from the need to resize zarr chunks, since some chunk configurations can be suboptimal.

If you use kerchunk to read some format and hold the reference index as a VirtualiZarr dataset, doesn’t it make sense, if the user wants chunks of arbitrary size, to use that index? Since VirtualiZarr and kerchunk create a “virtual view” of the dataset, wouldn’t that allow a more efficient rechunking algorithm?

This would basically mean rearranging the chunk metadata/index to create bigger or smaller chunks.

Maybe I am missing something? I haven’t really used rechunker.

@abourramouss this approach might work for “fusing” chunks together to make bigger chunks, but IIUC it won’t allow you to make smaller chunks than you started with, because you would then be trying to read only part of a compressed chunk.
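A sketch of why fusing can work while splitting cannot: merging two references is plain byte-range arithmetic, but only when the chunks are uncompressed and physically contiguous in the file. File names and offsets below are hypothetical:

```python
# Sketch: fusing two chunk references is plain byte-range arithmetic,
# but only when the chunks are uncompressed and physically contiguous.
# File names and offsets below are hypothetical.
def fuse(ref_a, ref_b):
    """Merge two [file, offset, length] references into one, if possible."""
    file_a, off_a, len_a = ref_a
    file_b, off_b, len_b = ref_b
    if file_a != file_b or off_a + len_a != off_b:
        raise ValueError("chunks are not contiguous; cannot fuse")
    return [file_a, off_a, len_a + len_b]

fused = fuse(["data.bin", 0, 4096], ["data.bin", 4096, 4096])

# There is no analogous shortcut for splitting a compressed chunk:
# a byte range into the middle of a compressed blob is not
# independently decodable.
```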

Perhaps you mean that you can use rechunker across a set of files that don’t otherwise form a logical dataset, but that kerchunk/virtualizarr can “combine” for you? In that case, yes, you could use the tools before rechunker, but I’m not sure it provides any benefit over open_mfdataset or the other xarray combine APIs. After all, you would normally only be doing this once.


On a different topic, do you think measurement sets could be potentially read by kerchunk?

They are datasets used in radio interferometry. The complexity here is that a dataset consists of different tables; internally it is like a relational database.

Another thing is that different formats are used physically, so this would involve implementing different algorithms for each table format (if I remember correctly, there were four).

Given the structure of the dataset, is this something worth pursuing?

Measurement set structure:

Kerchunk can probably find the binary blocks corresponding to the tables (are they chunked? compressed/encoded?) and assign a compound dtype to each. But in the zarr model, these would all just be free-standing arrays without any relations between them. Zarr works best for chunked multi-dimensional arrays, offering easy parallelism.
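For the compound-dtype idea, a stdlib-only sketch of how a fixed-size table row might be described and read back from a binary block; the (int32 id, float64 value) layout is invented for illustration:

```python
# Stdlib-only sketch of a "compound dtype" over a binary table block:
# the (int32 id, float64 value) row layout is invented for illustration.
import struct

row_fmt = "<id"                      # little-endian: int32 then float64
row_size = struct.calcsize(row_fmt)  # 12 bytes, no padding
blob = struct.pack(row_fmt, 7, 3.5) * 3  # three rows back to back

rows = [struct.unpack_from(row_fmt, blob, i * row_size) for i in range(3)]
```

A numpy structured dtype would play the same role in a real zarr array, one record per table row.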

Kerchunk is not entirely tied to zarr: you can make a reference filesystem from anything and pass it to fsspec-expecting libraries, but there isn’t really much else you can do with these at the moment. Does CASA deal with zarr at all?

I have been trying to learn more about the format, as having an xarray interface to it, as well as parallel processing, would be amazing.

The measurement set tables are written and read via different formats, called storage managers; each storage manager handles different tables. They are responsible for the physical storage of the data, including aspects like chunking and compression:

  1. IncrementalStMan: used for columns where data changes incrementally; the idea is to use it with tables that repeat data, so the data can be compressed.
  2. StandardStMan: serves as a general-purpose manager without specific compression techniques.
  3. TiledColumnStMan: manages multidimensional arrays by dividing them into tiles (chunks). This aligns with the concept of chunking in Zarr.

The intricate part is that storage managers are applied per table, not per dataset.

My idea is to have byte-range references (what VirtualiZarr is doing) into these types of files; this would ease parallel processing of the data as well as pipeline operations. Currently I am doing exactly that, but using casacore tools, which aren’t cloud-native.

From our point of view, each table would be a different variable/array, and having a different encoding for each is fine. We just have to write the three codecs as numcodecs-style classes (which is super easy if you already have an implementation). For the specific case of “tiled”, the chunking in the virtual zarr dataset would match the original chunking. For the other two, you can probably only have one chunk per table in each input dataset. Whether multiple sets can be combined in any logical way, I leave to the domain experts 🙂.
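To illustrate the shape of such a codec class: the interface numcodecs expects is essentially a `codec_id` plus `encode()`/`decode()` over raw bytes. The toy run-length scheme below is a placeholder, NOT the real IncrementalStMan encoding:

```python
# Toy numcodecs-style codec: the interface numcodecs expects is
# essentially a codec_id plus encode()/decode() over raw bytes.
# The run-length scheme below is a placeholder, NOT the real
# IncrementalStMan encoding.
class ToyIncrementalCodec:
    codec_id = "toy_incremental"

    def encode(self, buf: bytes) -> bytes:
        # run-length encode repeated bytes as (count, value) pairs
        out = bytearray()
        i = 0
        while i < len(buf):
            j = i
            while j < len(buf) and j - i < 255 and buf[j] == buf[i]:
                j += 1
            out += bytes([j - i, buf[i]])
            i = j
        return bytes(out)

    def decode(self, buf: bytes) -> bytes:
        # expand (count, value) pairs back into the original bytes
        out = bytearray()
        for k in range(0, len(buf), 2):
            out += bytes([buf[k + 1]]) * buf[k]
        return bytes(out)

codec = ToyIncrementalCodec()
data = b"\x00\x00\x00\x00\x07\x07\x07"
roundtrip = codec.decode(codec.encode(data))
```

A real implementation would subclass `numcodecs.abc.Codec` and register itself so that zarr can look the codec up by its `codec_id`.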

Yes, that’s another discussion: the data set is a collection of measurement sets, and in the end a data set would be processed together in parallel.

I worked a bit on the problem a couple of months ago; although I didn’t get any results, I started to understand the format.

Do you think it would be feasible to make a kerchunk reader and create references using VirtualiZarr? It would immensely ease parallel data processing in the cloud.

From what you have said here, I imagine a kerchunk scanner for the format is feasible, yes. Since it’s quite complicated, it will take some effort to get all the details right; perhaps you would start by just extracting one big tiled/chunked array to prove the concept.

Since the dataset is not regular, you would probably end up writing your own combine routine rather than using kerchunk.combine or VirtualiZarr, but that part can wait.

Well, you give me some confidence then, thank you!

Sorry, I was wrong here: measurement set storage managers (tiled, standard, incremental) are used per column within a single table.

When reading columns from a table, different storage managers are used, so tables are composed of columns managed by different storage managers.

Although I still find some of it confusing, I am still working on the kerchunk measurement-set reader.