How can I merge two xarray datasets without crashing when plotting in 2D?

I have two xarray datasets, shown below:

First dataset:

<xarray.Dataset>
Dimensions:            (cluster_labels: 300, time: 8784)
Coordinates:
  * time               (time) datetime64[ns] 2000-01-01 ... 2000-12-31T23:00:00
  * cluster_labels     (cluster_labels) int64 0 1 2 3 4 ... 295 296 297 298 299
    reference_time     datetime64[ns] ...
Data variables: (12/17)
    t                  (cluster_labels, time) float64 dask.array<chunksize=(10, 8784), meta=np.ndarray>
    u                  (cluster_labels, time) float64 dask.array<chunksize=(10, 8784), meta=np.ndarray>
    v                  (cluster_labels, time) float64 dask.array<chunksize=(10, 8784), meta=np.ndarray>
    q                  (cluster_labels, time) float64 dask.array<chunksize=(10, 8784), meta=np.ndarray>
    p                  (cluster_labels, time) float64 dask.array<chunksize=(10, 8784), meta=np.ndarray>
    precip_lapse_rate  (cluster_labels, time) float64 dask.array<chunksize=(10, 8784), meta=np.ndarray>
    ...                 ...
    cse                (cluster_labels, time) float64 dask.array<chunksize=(10, 8784), meta=np.ndarray>
    LW                 (cluster_labels, time) float64 dask.array<chunksize=(10, 8784), meta=np.ndarray>
    SW_diffuse         (cluster_labels, time) float64 dask.array<chunksize=(10, 8784), meta=np.ndarray>
    cos_illumination   (cluster_labels, time) float64 dask.array<chunksize=(10, 8784), meta=np.ndarray>
    SW_direct          (cluster_labels, time) float64 dask.array<chunksize=(10, 8784), meta=np.ndarray>
    SW                 (cluster_labels, time) float64 dask.array<chunksize=(10, 8784), meta=np.ndarray>

Second dataset:

<xarray.Dataset>
Dimensions:         (y: 530, x: 855)
Coordinates:
  * y               (y) float64 31.62 31.61 31.6 31.6 ... 27.23 27.22 27.21
  * x               (x) float64 49.85 49.85 49.86 49.87 ... 56.95 56.95 56.96
Data variables:
    elevation       (y, x) int16 ...
    slope           (y, x) float64 ...
    aspect          (y, x) float64 ...
    aspect_cos      (y, x) float64 ...
    aspect_sin      (y, x) float64 ...
    svf             (y, x) float64 ...
    cluster_labels  (y, x) int32 ...

I want to add the x and y coordinates from the second dataset to the first dataset. The second dataset has a variable named cluster_labels, and the first dataset has cluster_labels as a coordinate, so I used that variable as the indexer with the code below:

first_dataset.sel(cluster_labels=second_dataset.cluster_labels)

This line ran successfully and produced a new dataset as follows:

<xarray.Dataset>
Dimensions:            (y: 530, x: 855, time: 8784)
Coordinates:
  * time               (time) datetime64[ns] 2000-01-01 ... 2000-12-31T23:00:00
    cluster_labels     (y, x) int64 96 235 130 130 104 176 ... 34 34 16 266 266
    reference_time     datetime64[ns] ...
  * y                  (y) float64 31.62 31.61 31.6 31.6 ... 27.23 27.22 27.21
  * x                  (x) float64 49.85 49.85 49.86 49.87 ... 56.95 56.95 56.96
Data variables: (12/17)
    t                  (y, x, time) float64 dask.array<chunksize=(530, 855, 8784), meta=np.ndarray>
    u                  (y, x, time) float64 dask.array<chunksize=(530, 855, 8784), meta=np.ndarray>
    v                  (y, x, time) float64 dask.array<chunksize=(530, 855, 8784), meta=np.ndarray>
    q                  (y, x, time) float64 dask.array<chunksize=(530, 855, 8784), meta=np.ndarray>
    p                  (y, x, time) float64 dask.array<chunksize=(530, 855, 8784), meta=np.ndarray>
    precip_lapse_rate  (y, x, time) float64 dask.array<chunksize=(530, 855, 8784), meta=np.ndarray>
    ...                 ...
    cse                (y, x, time) float64 dask.array<chunksize=(530, 855, 8784), meta=np.ndarray>
    LW                 (y, x, time) float64 dask.array<chunksize=(530, 855, 8784), meta=np.ndarray>
    SW_diffuse         (y, x, time) float64 dask.array<chunksize=(530, 855, 8784), meta=np.ndarray>
    cos_illumination   (y, x, time) float64 dask.array<chunksize=(530, 855, 8784), meta=np.ndarray>
    SW_direct          (y, x, time) float64 dask.array<chunksize=(530, 855, 8784), meta=np.ndarray>
    SW                 (y, x, time) float64 dask.array<chunksize=(530, 855, 8784), meta=np.ndarray>

But when I try to plot a 2D map over x and y for a single time step:

df.sel(time='2000-01-01T00:00:00.000000000')['t']

it used all of the RAM and then crashed. How can I plot a single time step of the merged dataset?

As you can see from the output you posted, the chunksize of your merged dataset is very large ((530, 855, 8784)). You’re running out of memory when you try to load this for plotting.
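To put a rough number on that, here is a back-of-the-envelope estimate of what one chunk costs. It assumes 8 bytes per value (the variables are float64) and ignores any overhead from dask or xarray, so treat it as a lower bound rather than an exact figure.

```python
from math import prod

# Chunk shape reported in the merged dataset's repr: (y, x, time)
chunk_shape = (530, 855, 8784)
itemsize = 8  # bytes per float64 value

# One chunk of one variable holds every grid cell at every time step
n_elements = prod(chunk_shape)
chunk_bytes = n_elements * itemsize
print(f"one chunk: {chunk_bytes / 1e9:.1f} GB")

# A single time step only needs the (y, x) plane
slice_bytes = prod(chunk_shape[:2]) * itemsize
print(f"one time step: {slice_bytes / 1e6:.1f} MB")
```

The 2D map you actually want to plot is only a few megabytes; the problem is that each variable sits in a single chunk spanning all of y, x and time, so producing one time step may require computing the whole chunk first.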

In general, I think you need to be a bit more clever about how you chunk and join your datasets. Perhaps you could say a few words about what you’re trying to accomplish (not code, just describe it in plain English) and we can suggest some different approaches.

Thanks for your suggestion. Let me try to explain my problem. I have a dataset (NetCDF format) with multiple ID numbers. Each ID has several climate variables, such as temperature, but this dataset has no latitude or longitude. I want to attach longitude (x) and latitude (y) from a second dataset, which holds the ID numbers together with the x and y corresponding to each ID. The ID numbers are common to both datasets, but in the first dataset they are a coordinate, while in the second they are a variable. I remember that in QGIS we can join two tables on a common field; in the same way, I want to join the x and y of each ID number from the second dataset onto the first, so that the climate variables for each ID have x, y (longitude, latitude). I can do this, but when I select a single time from the time series of the joined dataset for plotting, I run out of memory. I hope I have described my problem clearly.
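
One possible way to do the join described above without building the full (y, x, time) array is to select the time step first, while the data is still small, and only then index by the 2D cluster_labels grid. The sketch below uses tiny made-up stand-ins for the two datasets (the names first and second, the sizes, and the values are all hypothetical), and it assumes only one time step is needed per plot.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the first dataset: 5 clusters, 4 hourly steps
time = np.arange(
    "2000-01-01T00", "2000-01-01T04", dtype="datetime64[h]"
).astype("datetime64[ns]")
first = xr.Dataset(
    {"t": (("cluster_labels", "time"), np.arange(20.0).reshape(5, 4))},
    coords={"cluster_labels": np.arange(5), "time": time},
)

# Hypothetical stand-in for the second dataset: a 2 x 3 grid of cluster IDs
second = xr.Dataset(
    {"cluster_labels": (("y", "x"), np.array([[0, 1, 1], [4, 2, 0]]))},
    coords={"y": [31.62, 31.61], "x": [49.85, 49.86, 49.87]},
)

# Step 1: pick the time step while the array is still (cluster_labels,)
t_one_time = first["t"].sel(time=np.datetime64("2000-01-01T00:00"))

# Step 2: look up each grid cell's cluster ID; the result has dims (y, x)
t_map = t_one_time.sel(cluster_labels=second["cluster_labels"])

print(t_map.dims)
# With matplotlib installed, t_map.plot() would draw the 2D map.
```

With the real data, the same two steps would be applied to first_dataset and second_dataset.cluster_labels, so the array that gets computed is a single 530 x 855 plane rather than all 8784 time steps.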