Problem plotting large memory dataset

Hello All,

I’m trying to plot a 60 GB tiled dataset (EO-derived MODIS tiles) on a 64 GB machine and am consistently running out of memory when plotting. Is there a standard way to plot such a large dataset while using Dask to avoid loading everything into memory (one tile is 23 MB)? I’m asking because that would later let me resample and compute statistics on the data before plotting.

I have tried

import xarray as xr
import hvplot
import hvplot.xarray  # registers the .hvplot accessor on xarray objects

# Open all tiles lazily as a single dask-backed dataset
ds = xr.open_mfdataset(files,
                       combine='by_coords',
                       chunks={
                           "band": 1,
                           "x": 512,
                           "y": 512
                       },
                       parallel=True)

# rasterize=True aggregates the image with datashader before rendering
plot = ds.sel(band=1).hvplot.image(x='x',
                                   y='y',
                                   rasterize=True,
                                   cmap='viridis',
                                   title='SEN4GPP - 2020-01-01\n ',
                                   frame_width=800,
                                   frame_height=800)

hvplot.save(plot, "foo.png", fmt="png")

and

import matplotlib.pyplot as plt

ds = xr.open_mfdataset(files_dir,
                       combine='by_coords',
                       chunks={
                           "band": 1,
                           "x": 512,
                           "y": 512
                       },
                       parallel=True)

# imshow expects a DataArray, so select the data variable first
xr.plot.imshow(ds['band_data'].isel(band=0), x='x', y='y')
plt.savefig('foo.png')

as those are the common examples I found online. Alternatively, it seems like I could convert the data into a point cloud and plot that, though that feels like too much effort…

I can make a sample set available, if it helps.

Thanks very much for your help!

Kind regards

PS: xarray dataset info below, using up-to-date conda-forge xarray packages.

Dataset tile info:
<xarray.Dataset> Size: 23MB
Dimensions:      (band: 1, x: 2400, y: 2400)
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 19kB -2.001e+07 -2.001e+07 … -1.89e+07 -1.89e+07
  * y            (y) float64 19kB 1.112e+06 1.111e+06 1.111e+06 … 695.0 231.7
    spatial_ref  int64 8B …
Data variables:
    band_data    (band, y, x) float32 23MB …

Whole dataset info:
<xarray.Dataset> Size: 60GB
Dimensions:      (band: 2, y: 43200, x: 86400)
Coordinates:
  * band         (band) int64 16B 1 2
  * x            (x) float64 691kB -2.001e+07 -2.001e+07 … 2.001e+07 2.001e+07
  * y            (y) float64 346kB 1.001e+07 1.001e+07 … -1.001e+07 -1.001e+07
    spatial_ref  int64 8B 0
Data variables:
    band_data    (band, y, x) float64 60GB dask.array<chunksize=(1, 512, 480), meta=np.ndarray>

Make the whole thing float32 and it would be 30 GB? That is, if it is this specific example you are interested in.
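Something along these lines (an untested sketch; `files` and the chunking are taken from the original post) would do the cast lazily, so the float64 values are only ever materialised one chunk at a time:

import xarray as xr

ds = xr.open_mfdataset(files,
                       combine='by_coords',
                       chunks={"band": 1, "x": 512, "y": 512},
                       parallel=True)
# astype on a dask-backed DataArray is lazy, so this roughly halves
# the size of every chunk as it is computed (~60 GB -> ~30 GB overall)
ds["band_data"] = ds["band_data"].astype("float32")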

Perhaps try using Dask plus coarsen to reduce the resolution of your dataset before trying to plot it with matplotlib.
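For instance (a rough sketch, assuming the dask-backed `ds` from the original post; the 30×30 block size is just an example that lands near screen resolution):

import matplotlib.pyplot as plt

# Average 30x30 pixel blocks before handing anything to matplotlib:
# 86400/30 x 43200/30 gives a 2880 x 1440 array, which is tiny.
small = (ds["band_data"]
         .isel(band=0)
         .coarsen(x=30, y=30, boundary="trim")
         .mean())                      # still lazy at this point
small.plot.imshow(x="x", y="y")        # triggers the dask computation
plt.savefig("foo_coarse.png")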

Perhaps datashader through hvplot?

That definitely helps! Are you working with TIFFs or NetCDF/HDF5?

Perhaps datashader through hvplot?

I thought that is what rasterize=True does (see the first code snippet above)? I haven’t used hvplot in a while, but if they are TIFFs, it reminds me of this old issue, where it would be nice to take advantage of overviews/pyramids if present: Visualizing cloud-optimized geotiffs · Issue #370 · holoviz/geoviews · GitHub. At some level, either manually coarsening to an appropriate resolution via dask, or using a visualization tool that knows about pyramids, is key for datasets larger than available RAM.
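If the bokeh/hvplot rendering step itself is part of the memory problem, another option is to call datashader directly and write a static PNG. A rough, untested sketch (reusing the dask-backed `ds` from the first snippet; datashader's raster aggregation accepts dask-backed DataArrays, though I haven't profiled it at this scale, and the output filename is just an example):

import datashader
import datashader.transfer_functions as tf
from datashader.utils import export_image
from matplotlib import cm

da = ds["band_data"].isel(band=0)            # 2D dask-backed DataArray
canvas = datashader.Canvas(plot_width=800, plot_height=800)
agg = canvas.raster(da)                      # mean-downsamples onto the 800x800 grid
img = tf.shade(agg, cmap=cm.viridis)
export_image(img, "foo", export_path=".")    # writes ./foo.png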