Problem plotting a larger-than-memory dataset

Hello All,

I’m trying to plot a 60 GB tiled dataset (EO-derived MODIS tiles) on a 64 GB machine and am consistently running out of memory when plotting. Is there a standard way to plot such a large dataset with Dask, so that everything isn’t loaded into memory at once (a single tile is 23 MB)? I’m asking because the same approach would later enable resampling and statistics before plotting.

I have tried

import xarray as xr
import hvplot
import hvplot.xarray  # registers the .hvplot accessor on xarray objects

# open all tiles lazily as one Dask-backed dataset
ds = xr.open_mfdataset(files,
                       combine='by_coords',
                       chunks={
                           "band": 1,
                           "x": 512,
                           "y": 512
                       },
                       parallel=True)

# rasterize=True hands the aggregation to datashader
plot = ds.sel(band=1).hvplot.image(x='x',
                                   y='y',
                                   rasterize=True,
                                   cmap='viridis',
                                   title='SEN4GPP - 2020-01-01\n ',
                                   frame_width=800,
                                   frame_height=800)

hvplot.save(plot, "foo.png", fmt="png")

and

import matplotlib.pyplot as plt

ds = xr.open_mfdataset(files_dir,
                       combine='by_coords',
                       chunks={
                           "band": 1,
                           "x": 512,
                           "y": 512
                       },
                       parallel=True)

# imshow wants a DataArray, so select the data variable first
ds.band_data.isel(band=0).plot.imshow(x='x', y='y')
plt.savefig('foo.png')

as they’re the common examples found online. Alternatively, it seems like I should be able to convert the output data into a point-cloud and then plot that, though that feels like too much effort…

I can make a sample set available, if it helps.

Thanks very much for your help!

Kind regards

PS: xarray dataset info below, using up-to-date conda-forge xarray packages.

Dataset tile info:
<xarray.Dataset> Size: 23MB
Dimensions:      (band: 1, x: 2400, y: 2400)
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 19kB -2.001e+07 -2.001e+07 ... -1.89e+07 -1.89e+07
  * y            (y) float64 19kB 1.112e+06 1.111e+06 1.111e+06 ... 695.0 231.7
    spatial_ref  int64 8B ...
Data variables:
    band_data    (band, y, x) float32 23MB ...

Whole dataset info:
<xarray.Dataset> Size: 60GB
Dimensions:      (band: 2, y: 43200, x: 86400)
Coordinates:
  * band         (band) int64 16B 1 2
  * x            (x) float64 691kB -2.001e+07 -2.001e+07 ... 2.001e+07 2.001e+07
  * y            (y) float64 346kB 1.001e+07 1.001e+07 ... -1.001e+07 -1.001e+07
    spatial_ref  int64 8B 0
Data variables:
    band_data    (band, y, x) float64 60GB dask.array<chunksize=(1, 512, 480), meta=np.ndarray>

Make the whole thing float32 and it would be 30 GB? That is, if it’s this specific example you’re interested in.
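For example (a minimal sketch; it assumes ds is the Dask-backed dataset from the open_mfdataset call above, and the cast stays lazy):

# cast lazily to float32 to halve the footprint (60 GB -> 30 GB);
# nothing is computed until the data is actually pulled
ds = ds.astype("float32")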

Perhaps try using Dask plus coarsen to reduce the resolution of your dataset before trying to plot it with matplotlib.
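Something along these lines, perhaps (a rough sketch; the 60x60 window is just an illustrative downsampling factor):

import matplotlib.pyplot as plt

# block-average 60x60 pixel windows; with Dask backing, the reduction
# runs chunk by chunk, so the full-resolution array never sits in RAM
coarse = (ds.band_data
            .isel(band=0)
            .coarsen(x=60, y=60, boundary="trim")
            .mean())

coarse.plot.imshow(x='x', y='y', cmap='viridis')
plt.savefig('foo_coarse.png')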

Perhaps datashader through hvplot?

Definitely helps! Are you working with tifs or netcdf/h5?

Perhaps datashader through hvplot?

I thought that is what rasterize=True does (in the first code snippet above)? I haven’t used hvplot in a while, but if they are Tiffs, it reminds me of this old issue about taking advantage of overviews/pyramids if present: Visualizing cloud-optimized geotiffs · Issue #370 · holoviz/geoviews · GitHub. At some level, either manually coarsening to an appropriate resolution via dask, or using a visualization tool that knows about pyramids, is key for datasets larger than available RAM.
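If the tifs do carry internal overviews, something like this might let you read a decimated level directly (a sketch: OVERVIEW_LEVEL is a GDAL GeoTIFF open option, "tile.tif" is a placeholder, and it assumes overviews were actually built, e.g. with gdaladdo):

import rioxarray

# open overview level 2 (each level roughly halves the resolution),
# so only the decimated raster is ever read from disk
da = rioxarray.open_rasterio("tile.tif",
                             chunks={"x": 512, "y": 512},
                             OVERVIEW_LEVEL=2)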

Sorry for the delay and thanks for the replies! (And sorry for the at_ mentions below - new-user mention limit.)

@RichardScottOZ thanks for the catch - should have been float32. Will change it. While it helps in this case, the next step would be global at 20m resolution, so I’m aiming for a more general working approach.

at_rabernat will try, thanks.

at_ahuang11 as scottyhq says, the “rasterize=True” option is supposed to already do that.

@scottyhq am currently working with tifs, but the final version will also include NetCDF.

Dataset sample for one time point: sample_data_s4g - Google Drive

Oops yeah, skimmed it and didn’t see rasterize=True.

There was some work done on building pyramids: https://blog.holoviz.org/posts/czi_r5/

Oh, so literally MODIS sinusoidal tiles - I looked at a small sample and it looks like they get mosaicked correctly, i.e.

<xarray.Dataset> Size: 2GB
Dimensions:      (band: 2, y: 14400, x: 9600)
Coordinates:
  * band         (band) int64 16B 1 2
  * x            (x) float64 77kB -2.001e+07 -2.001e+07 ... -1.557e+07
  * y            (y) float64 115kB 3.336e+06 3.335e+06 ... -3.335e+06 -3.336e+06
    spatial_ref  int64 8B 0
Data variables:
    band_data    (band, y, x) float64 2GB dask.array<chunksize=(1, 96, 2400), meta=np.ndarray>


## from just this list of tifs
['tifs/QISCARF_GPP_20200101_h00v10_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h01v10_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h01v11_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h01v08_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h00v09_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h00v08_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h02v11_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h03v06_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h02v06_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h01v09_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h02v09_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h02v08_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h02v10_m0.830_b0.053_pc31.000_qi0.080_v02.tif',
 'tifs/QISCARF_GPP_20200101_h01v07_m0.830_b0.053_pc31.000_qi0.080_v02.tif']

and plots like this (I just grabbed the first files, so I’m limited to a western edge in the ocean).

I’d suggest using chunks = {"x": 2400, "y": 2400} to match the tile size and seeing if that does any better. But (and it’s a big one) there’s no organization by time here: you probably have time steps but haven’t told xarray about them. How many files are there in total?
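For instance (the same open_mfdataset call as above, just with tile-aligned chunks):

# one chunk per 2400x2400 MODIS tile, so chunk boundaries line up with
# file boundaries and each Dask task reads exactly one tile
ds = xr.open_mfdataset(files,
                       combine='by_coords',
                       chunks={"band": 1, "x": 2400, "y": 2400},
                       parallel=True)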

also I’m not getting valid data out of the tifs themselves, so presumably there’s a lot of sparsity here. I don’t know if there’s a best way here with xarray, but I’d first group by time (if there is time), create a VRT mosaic for each of those, and then inject those into xarray on a new time axis (but I’m not qualified to do that myself, this is just how a GDAL person thinks about it).