Hello community,
I hope I’m right here!
My goal is to compute quantiles of satellite datasets depending on a mask encoding certain plant species (those can occur anywhere in the whole dataset). This mask is also quite sparse; roughly 80 % of it is no data at all. The satellite dataset itself is a few TB in size.
I already tried multiple approaches:
Using xarray's groupby (with and without flox)
This failed because the mask must not contain NaN anywhere. If I replace the NaN mask values with an int it runs, but all results are NaN (I assume this happens because the only implemented method is 'blockwise', but I'm not sure). My code is something like this (for flox):
import xarray as xr
import numpy as np
from flox.xarray import xarray_reduce

def compute_species_quantiles(
    ds,
    speciesvar_name="speciesmap",
    target_var_name="band_data",
    dim=("y", "x"),
    expected_groups=None,
    quantiles=None,
    method="blockwise",
):
    if expected_groups is None:
        expected_groups = np.arange(1, 7)
    if quantiles is None:
        quantiles = np.arange(0.01, 1.0, 0.01)
    return xarray_reduce(
        ds[target_var_name],
        ds[speciesvar_name].rename("species"),
        dim=dim,
        expected_groups=(expected_groups,),
        method=method,
        func="nanquantile",
        q=quantiles,
    )
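For what it's worth, the NaN-free integer mask can be prepared up front. Here is a minimal numpy sketch of that idea (all array names and sizes are made up): fill the NaN mask values with a sentinel that is deliberately left out of `expected_groups`, so those pixels never land in any group, plus a plain-numpy reference computation for sanity-checking flox output on a small subset.

```python
import numpy as np

# Hypothetical small stand-ins for the real data: a 2-D "band" and a
# species mask where NaN means "no species mapped here".
rng = np.random.default_rng(0)
band = rng.random((4, 5))
species = np.full((4, 5), np.nan)
species[0, :3] = 1.0
species[2, 2:] = 3.0

# flox needs an integer by-array with no NaN, so fill with a sentinel
# (0 here) that is deliberately NOT in expected_groups; those pixels
# then never get assigned to any group instead of poisoning them.
species_int = np.where(np.isnan(species), 0, species).astype("int64")

# Plain-numpy reference: per-species quantiles over the valid pixels,
# handy for checking whatever flox returns on a small subset.
quantiles = [0.25, 0.5, 0.75]
for s in (1, 3):
    vals = band[species_int == s]
    print(s, np.quantile(vals, quantiles))
```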
Dumping to parquet
My next approach was dumping the data to a parquet file. I assumed that since most of the data is NaN anyway (and will be dropped), I could then easily compute the quantiles using pandas. But this just explodes in RAM and all my HPC jobs get killed.
My code:
import xarray as xr
ddf = ds.to_dask_dataframe()
ddf.dropna().to_parquet('result.parquet', write_index=False, engine='pyarrow')
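The flatten-then-drop idea can also be kept memory-light per chunk by masking before ever building a dataframe. A toy sketch with numpy and pandas (random stand-in arrays, sizes made up): only the ~20 % of mapped pixels become rows, and pandas' groupby/quantile then does the rest.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for one chunk of the real arrays (sizes made up):
# ~80 % of the species mask is NaN, matching the post.
rng = np.random.default_rng(1)
band = rng.random((100, 100))
species = np.where(rng.random((100, 100)) < 0.8,
                   np.nan,
                   rng.integers(1, 7, (100, 100)))

# Mask BEFORE building the dataframe, so only the ~20 % of mapped
# pixels ever become rows.
keep = ~np.isnan(species)
df = pd.DataFrame({
    "species": species[keep].astype("int64"),
    "band_data": band[keep],
})
result = df.groupby("species")["band_data"].quantile([0.25, 0.5, 0.75])
print(result)
```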
I also tried writing one file per block using pandas & map_blocks, but this currently fails since the result does not align with my template (maybe this will work after some tinkering?):
import uuid
from pathlib import Path

output_dir = Path('/some/path')

def filter_and_save(block):
    df = block.to_dataframe().dropna()
    # Write to disk if not empty
    if len(df) > 0:
        unique_id = uuid.uuid4()
        filename = output_dir / f'part_{unique_id}.parquet'
        df.to_parquet(filename, engine='pyarrow', index=False)
        print(f"Wrote {filename}")
    return block.isel(x=slice(0, 0), y=slice(0, 0))

template = ds.isel(x=slice(0, 0), y=slice(0, 0))
result = ds.map_blocks(filter_and_save, template=template)
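If the template alignment keeps failing, one escape hatch is to skip map_blocks and loop over chunk windows yourself. Below is a hedged pure-numpy sketch of that pattern (`chunk_slices` is a helper I made up; in the real pipeline each tile would be pulled lazily with `ds.isel(...)` and the non-empty tiles written out to parquet):

```python
import itertools
import numpy as np

def chunk_slices(shape, chunk):
    """Yield index tuples covering an array of `shape` in `chunk`-sized tiles."""
    ranges = [range(0, s, c) for s, c in zip(shape, chunk)]
    for starts in itertools.product(*ranges):
        yield tuple(slice(st, min(st + c, s))
                    for st, c, s in zip(starts, chunk, shape))

# Toy arrays; in the real pipeline each tile would come from
# ds.isel(y=sl[0], x=sl[1]) and be loaded one at a time.
band = np.random.default_rng(2).random((6, 8))
species = np.full((6, 8), np.nan)
species[1, 1] = 2.0

parts = []
for sl in chunk_slices(band.shape, (4, 4)):
    keep = ~np.isnan(species[sl])
    if keep.any():
        # the real code would build a dataframe and to_parquet() it here
        parts.append((species[sl][keep], band[sl][keep]))
print(len(parts), "non-empty tiles")
```

Because nothing is returned to dask, there is no template to align with; the trade-off is that you schedule the tiles yourself (e.g. one HPC task per batch of tiles).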
So my questions are: am I doing something fundamentally wrong, or is xarray + dask just not fit for this task? (My datasets are a few TB in size, if that matters.)
My next approaches will probably go further in the direction of dumping the data as parquet, dropping the NaN rows, and then computing the quantiles in memory.
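In case it helps with that direction: if approximate quantiles are acceptable, a fixed-bin histogram per species can be accumulated chunk by chunk in constant memory, and the quantiles read off the inverted CDF afterwards. A toy single-species sketch (uniform random data standing in for one species' pixel values; assumes the value range is known up front):

```python
import numpy as np

# Approximate quantiles via a fixed-bin histogram: accumulate counts
# while streaming over chunks, then invert the CDF. Memory stays at
# n_species * n_bins no matter how big the dataset is.
bins = np.linspace(0.0, 1.0, 1001)   # assumes the value range [0, 1] is known
counts = np.zeros(len(bins) - 1)

rng = np.random.default_rng(3)
for _ in range(10):                  # pretend: one data chunk per loop
    chunk = rng.random(10_000)       # values of one species in that chunk
    counts += np.histogram(chunk, bins=bins)[0]

# Read the 50 % quantile off the cumulative histogram; the error is
# bounded by the bin width.
cdf = np.cumsum(counts) / counts.sum()
median_est = bins[1:][np.searchsorted(cdf, 0.5)]
print(median_est)   # close to 0.5 for uniform data
```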
I’m thankful for any kind of hint/help! Just write if you need any further information!