Machine Learning training with large Xarray Datasets

Hi all,

I’d like to share a common workflow I face in land cover mapping tasks and hear your thoughts on how best to handle part of it with Xarray/Dask.

The general idea is to use a global land cover product as ground truth for a reference year (e.g., ESA WorldCover 2021), and then train a machine learning model to predict land cover in other years based on Earth Observation data.

My setup typically looks like this:

  • I have a huge, lazy xarray.Dataset of EO features — for example, yearly aggregations (e.g., median, percentiles) of Sentinel-2 L2A bands over a large region.
  • I also have a lazy xarray.DataArray of labels covering the same region. These labels are derived from the land cover product (after some cleaning) and are fairly dense, i.e., most pixels carry a valid class.
  • I want to train an in-memory ML model such as Random Forest, XGBoost, or LightGBM; these tree-based models work well for this kind of task. A rough sketch of this setup is shown right after this list.
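
For concreteness, here is a minimal sketch of what the inputs look like. The paths, variable names, and Zarr format are placeholders, not my literal pipeline:

```python
import xarray as xr

# Hypothetical lazy inputs; paths and variable names are illustrative only.
features = xr.open_zarr("s2_l2a_2021_yearly_features.zarr")        # Dataset of per-band yearly aggregations
labels = xr.open_zarr("worldcover_2021_cleaned.zarr")["lc_class"]  # DataArray of land cover class codes

# Both live on the same grid (dims assumed to be ("y", "x")), so the labels
# can be attached as an extra variable. Everything stays Dask-backed and lazy.
data = features.assign(label=labels)
```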

The challenge is that I obviously can’t use all the available labeled pixels due to memory constraints. So I’m looking to:

  • Perform random stratified subsampling of the labeled pixels, ideally keeping everything distributed/lazy until the actual sampling step.
  • Then compute and bring only the sampled pixels into memory for training the model. A rough sketch of what I have in mind follows this list.
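
For context, here is a rough sketch of the kind of approach I have in mind, continuing from the snippet above. The nodata value (0), the sampling fraction, the column names, and the per-partition strategy are all assumptions rather than a settled design:

```python
from sklearn.ensemble import RandomForestClassifier

# Flatten pixels into a lazy table: one row per pixel, with one column per
# feature plus the "label" column attached earlier and the x/y coordinates.
ddf = data.to_dask_dataframe()

# Drop unlabeled pixels lazily (assuming 0 encodes "no label" after cleaning).
ddf = ddf[ddf["label"] != 0]

def stratified_sample(pdf, frac, seed=42):
    """Sample the same fraction of rows from every class within one partition."""
    return (
        pdf.groupby("label", group_keys=False)
           .apply(lambda g: g.sample(frac=frac, random_state=seed))
    )

# Sampling per partition keeps the operation embarrassingly parallel, but it
# only approximates a globally stratified sample (and only if chunks are
# spatially mixed enough that every partition sees every class).
sampled = ddf.map_partitions(stratified_sample, frac=0.001)

# Only at this point does the sampled subset actually come into memory.
train = sampled.compute()

feature_cols = [c for c in train.columns if c not in ("x", "y", "label")]
model = RandomForestClassifier(n_estimators=200, n_jobs=-1)
model.fit(train[feature_cols], train["label"])
```

This "flatten to a Dask DataFrame and sample per partition" pattern seems workable, but it only preserves class balance approximately and gives no control over exact per-class counts (rare classes especially), which is part of what prompts the questions below.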

Has anyone tackled something similar? Do you have suggestions on:

  • Efficiently implementing stratified sampling across Dask-backed Xarray objects?
  • Any tools or best practices you recommend for subsampling while retaining class balance?
  • Potential gotchas in performing this kind of reduction before loading the sampled data into memory?

Thanks in advance for any insights!