Problem plotting a larger-than-memory dataset

Hello All,

I’m trying to plot a 60 GB tiled dataset (EO-derived MODIS tiles) on a 64 GB machine and am consistently running out of memory when plotting. Is there a standard way to plot such a large dataset with Dask, so that everything isn’t loaded into memory at once (a single tile is 23 MB)? I’m asking because the same approach would later enable resampling and statistics before plotting.

I have tried

import xarray as xr
import hvplot
import hvplot.xarray  # registers the .hvplot accessor on xarray objects

# open all tiles lazily as one Dask-backed dataset
ds = xr.open_mfdataset(files,
                       combine='by_coords',
                       chunks={
                           "band": 1,
                           "x": 512,
                           "y": 512
                       },
                       parallel=True)

# rasterize=True hands the aggregation to datashader
plot = ds.sel(band=1).hvplot.image(x='x',
                                   y='y',
                                   rasterize=True,
                                   cmap='viridis',
                                   title='SEN4GPP - 2020-01-01\n ',
                                   frame_width=800,
                                   frame_height=800)

hvplot.save(plot, "foo.png", fmt="png")

and

import matplotlib.pyplot as plt

ds = xr.open_mfdataset(files_dir,
                       combine='by_coords',
                       chunks={
                           "band": 1,
                           "x": 512,
                           "y": 512
                       },
                       parallel=True)

# imshow wants a DataArray, so select the data variable first
ds.band_data.isel(band=0).plot.imshow(x='x', y='y')
plt.savefig('foo.png')

as they’re the common examples found online. Alternatively, it seems like I should be able to convert the output data into a point-cloud and then plot that, though that feels like too much effort…

I can make a sample set available, if it helps.

Thanks very much for your help!

Kind regards

PS: xarray dataset info below, using up-to-date conda-forge xarray packages.

Dataset tile info:
<xarray.Dataset> Size: 23MB
Dimensions:      (band: 1, x: 2400, y: 2400)
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 19kB -2.001e+07 -2.001e+07 ... -1.89e+07 -1.89e+07
  * y            (y) float64 19kB 1.112e+06 1.111e+06 1.111e+06 ... 695.0 231.7
    spatial_ref  int64 8B ...
Data variables:
    band_data    (band, y, x) float32 23MB ...

Whole dataset info:
<xarray.Dataset> Size: 60GB
Dimensions:      (band: 2, y: 43200, x: 86400)
Coordinates:
  * band         (band) int64 16B 1 2
  * x            (x) float64 691kB -2.001e+07 -2.001e+07 ... 2.001e+07 2.001e+07
  * y            (y) float64 346kB 1.001e+07 1.001e+07 ... -1.001e+07 -1.001e+07
    spatial_ref  int64 8B 0
Data variables:
    band_data    (band, y, x) float64 60GB dask.array<chunksize=(1, 512, 480), meta=np.ndarray>

Make the whole thing float32 and it would be 30 GB? That is, if it’s this specific example you’re interested in.
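For example (a minimal sketch; it assumes ds is the Dask-backed dataset from the open_mfdataset call above, and the cast stays lazy):

# cast lazily to float32 to halve the footprint (60 GB -> 30 GB);
# nothing is computed until the data is actually pulled
ds = ds.astype("float32")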

Perhaps try using Dask plus coarsen to reduce the resolution of your dataset before trying to plot it with matplotlib.
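Something along these lines, perhaps (a rough sketch; the 60x60 window is just an illustrative downsampling factor):

import matplotlib.pyplot as plt

# block-average 60x60 pixel windows; with Dask backing, the reduction
# runs chunk by chunk, so the full-resolution array never sits in RAM
coarse = (ds.band_data
            .isel(band=0)
            .coarsen(x=60, y=60, boundary="trim")
            .mean())

coarse.plot.imshow(x='x', y='y', cmap='viridis')
plt.savefig('foo_coarse.png')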

Perhaps datashader through hvplot?

Definitely helps! Are you working with tifs or netcdf/h5?

Perhaps datashader through hvplot?

I thought that is what rasterize=True does (in the first code snippet above)? I haven’t used hvplot in a while, but if they are Tiffs, it reminds me of this old issue about taking advantage of overviews/pyramids if present: Visualizing cloud-optimized geotiffs · Issue #370 · holoviz/geoviews · GitHub. At some level, either manually coarsening to an appropriate resolution via dask, or using a visualization tool that knows about pyramids, is key for datasets larger than available RAM.
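If the tifs do carry internal overviews, something like this might let you read a decimated level directly (a sketch: OVERVIEW_LEVEL is a GDAL GeoTIFF open option, "tile.tif" is a placeholder, and it assumes overviews were actually built, e.g. with gdaladdo):

import rioxarray

# open overview level 2 (each level roughly halves the resolution),
# so only the decimated raster is ever read from disk
da = rioxarray.open_rasterio("tile.tif",
                             chunks={"x": 512, "y": 512},
                             OVERVIEW_LEVEL=2)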

Sorry for the delay and thanks for the replies! (And sorry for the at_ mentions below - new-user mention limit.)

@RichardScottOZ thanks for the catch - should have been float32. Will change it. While it helps in this case, the next step would be global at 20m resolution, so I’m aiming for a more general working approach.

at_rabernat will try, thanks.

at_ahuang11 as scottyhq says, the “rasterize=True” option is supposed to already do that.

@scottyhq am currently working with tifs, but the final version will also include NetCDF.

Dataset sample for one time point: sample_data_s4g - Google Drive

Oops yeah, skimmed it and didn’t see rasterize=True.

There was some work done on building pyramids: https://blog.holoviz.org/posts/czi_r5/

Oh, so literally MODIS sinusoidal tiles - I looked at a small sample and it looks like they get mosaicked correctly, i.e.

<xarray.Dataset> Size: 2GB
Dimensions:      (band: 2, y: 14400, x: 9600)
Coordinates:
  * band         (band) int64 16B 1 2
  * x            (x) float64 77kB -2.001e+07 -2.001e+07 ... -1.557e+07
  * y            (y) float64 115kB 3.336e+06 3.335e+06 ... -3.335e+06 -3.336e+06
    spatial_ref  int64 8B 0
Data variables:
    band_data    (band, y, x) float64 2GB dask.array<chunksize=(1, 96, 2400), meta=np.ndarray>


## from just this list of tifs
['tifs/QISCARF_GPP_20200101_h00v10_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h01v10_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h01v11_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h01v08_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h00v09_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h00v08_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h02v11_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h03v06_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h02v06_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h01v09_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h02v09_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h02v08_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h02v10_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h01v07_m0.830_b0.053_pc31.000_qi0.080_v02.tif']

and plots like this (I just grabbed the first files, so I’m limited to a western edge in the ocean).

I’d suggest using chunks = {"x": 2400, "y": 2400} to match the tile size and seeing if that does any better. But (and it’s a big one) there’s no organization by time here: you probably have time steps but haven’t told xarray about them. How many files are there in total?
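For instance (the same open_mfdataset call as above, just with tile-aligned chunks):

# one chunk per 2400x2400 MODIS tile, so chunk boundaries line up with
# file boundaries and each Dask task reads exactly one tile
ds = xr.open_mfdataset(files,
                       combine='by_coords',
                       chunks={"band": 1, "x": 2400, "y": 2400},
                       parallel=True)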

also I’m not getting valid data out of the tifs themselves, so presumably there’s a lot of sparsity here. I don’t know if there’s a best way here with xarray, but I’d first group by time (if there is time), create a VRT mosaic for each of those, and then inject those into xarray on a new time axis (but I’m not qualified to do that myself, this is just how a GDAL person thinks about it).