Issue with Dask-xarray migration

Hello all!

I am relatively new to Dask and I was trying to implement a (I think) heavy computation following some tutorials online. Turns out that the outcome was not what I expected so I would like some help if anyone can give me some ideas.

I am trying to mask a 4D matrix of temperature variable by multiplying it with another 4D mask matrix (which is full of 1s and NaNs).So:
Variable matrix of temperature: temp=[365,32,58,50] = [time,deptht,lat,lon]
Mask file: D= [365,32,58,50] = [time,deptht,lat,lon]

These are for the moment loaded lazy as xarrays like this:
temp = xr.open_dataset(’/home/directory/T_1993_2D.nc’)['temp][:,:,lat1:lat2,lon1:lon2]

When I try to do the multiplication between the 2 xarrays: mask1 = temp * D
this took: CPU times: user 1.83 s, sys: 12.1 s, total: 14 s Wall time: 6min 46s

So I decided to do it with dask like this:

AA = da.from_array(temp,chunks=(1,5,340,481))
MM = da.from_array(MLDMASKF,chunks=(1,5,340,481))
result = AA*MM
%% time
result.compute()
This took : CPU times: user 20.8 s, sys: 10.6 s, total: 31.4 sWall time: 4min 7s

In that case I was expecting the result to be an xarray. However It was not:
type(result)
Out[7]: dask.array.core.Array
**In [14]: result **
Out[14]: dask.array<mul, shape=(365, 32, 58, 50), dtype=float64, chunksize=(1, 5, 58, 50), chunktype=xarray.DataArray>

which I do not understand why.I thought that .compute() was there to load the values.

so I tried to convert it to one DataArraybut without success using:
TEST= xr.DataArray(result, dims=my_dataarray.dims, attrs=my_dataarray.attrs.copy())
But I got an error:
AttributeError: ‘Array’ object has no attribute 'dims’

What I really need is to access the values of this new masked matrix, in order to:

  1. check that the multiplication is doing what I think is doing
  2. use this new matrix in other following computations

Can someone help me into transforming dask arrays to real values/xarrays??

The reason I used dask is to to save time from the time-consuming calculation/creation of the 4D masked temperature.After that I just need to go back to the real values of the new matrix and use them.

(## The number of chunks I chose them based on the attribute “_chunkSizes” inside the .nc file themselves. I assumed these numbers where there to “flag” the right chunking of data, if needed,maybe i was wrong in that)

I also tried to mask temperature by using another way:
testmask = xr.where(temp.deptht<=temp.deptht[idx2dT],temp,np.nan)
which took: CPU times: user 1.39 s, sys: 7.08 s, total: 8.46 s Wall time: 2min 42s

The idx2dT is a 3D matrix idx2dT=[time,lat,lon] which indicates the index of vertical level that the mask has to start at each grid point.So it is my mask of vertical limits kind of.

The temperature files come with their own mask in x,y,z, by default, but I want to implement on top of this another mask which basically masks out some more vertical levels in some grid points. And this masking is not uniform throughout the grid points.

Can someone help me with that? Is there a way to improve the performance of what I am trying to ?Is it correct?Wrong?( I have other 7 variables to do the same masking on and then computations - in the same program- which means ~15 min approximatly to run a program for just a very small portion of the ocea.

thank you in advance for your time and help,
Sofi

Hi Sofi,

Others can jump in and correct me here if I’ve misunderstood, but I think what you’re seeing is a result of just calling

result.compute()

This should return the final, computed result, after having loaded everything and done all your calculations. But, it doesn’t automatically save it in the object result. Instead, what I think you need to do is call

result = result.compute()

Then, after the line executes, result should have all the data from the computation eagerly loaded and available for using in downstream calculations. As for why you don’t see as large of a speed-up when you operate on lazily versus eagerly loaded data, this has a lot to do with the hardware you run on and the chunking of the data (either on disk or when you choose to load it into memory). You should try different chunk sizes when you read in via xarray, e.g.

chunks = {'time': ... , 'lat': ..., 'lon': ..., 'depth': ...}
temp = xr.open_dataset("/path/to/my/data.nc", chunks=chunks)

When masking data, one of the cool things you can do is combine masks together. If you could share a small snippet of your dataset and a description of the exact masking operation you want to do, I’m sure someone might be able to volunteer to help you work out a simple example that you could build from.

For instance, based on your description, I think what you actually want to do is a bit of vectorized indexing, since you described temp['idx2dT'] as a pre-computed “index” of the vertical level you need at each grid cell. But if you could write out a bit more detail on what you have and how you want to ultimately manipulate your data, it would help us refine a recommendation for you.

  • Daniel
1 Like

Hello Daniel,

Thank you for answering to me. Probably what you said is right about the “result” variable. Indeed. I will try to perform the syntax you suggested.

So, thanks for helping me to figure this out. I am sorry I was not very clear before . Here is more details about it:

1. I have temperature variable temp = [365,32,58,50] = [time,depth,lat,lon] and this is a slice/3D-box of the ocean. Below is its output (I am not sure how else I could share a snippet of my dataset, the files are too big to upload):

In [4]: temp
Out[4]:
<xarray.DataArray ‘votemper’ (time_counter: 365, deptht: 32, y: 58, x: 50)>
[33872000 values with dtype=float32
Coordinates:

  • deptht (deptht) float32 0.49402538 1.5413754 … 453.93774 541.0889
    nav_lat (y, x) float32 …
    nav_lon (y, x) float32 …
  • time_counter (time_counter) datetime64[ns] 1993-01-01T12:00:00 … 1993-12-31T12:00:00**
  • x (x) int32 469 470 471 472 473 474 … 513 514 515 516 517 518
  • y (y) int32 1088 1089 1090 1091 1092 … 1142 1143 1144 1145
    Attributes:
    units: degree_Celsius
    valid_min: -20.0
    valid_max: 42.0
    long_name: Temperature
    standard_name: sea_water_potential_temperature
    short_name: votemper
    online_operation: N/A
    interval_operation: 86400
    interval_write: 86400
    associate: time_counter deptht nav_lat nav_lon
    _ChunkSizes: [ 1 5 340 481]

I want to use this file (after I mask it) to compute temperature tendency inside this box,every day. . So i want to have oC/day as a result. That is:

dt=86400 #seconds
VT[day,depth,lat,lon] = temp[day,depth,lat,lon] * Volume[day]
integralT[day] = (VT.sum(dim=(“deptht”,“x”,“y”),skipna=True) ) /(Volume[day])
dT_dt [day] = (integralT[day+1] - integralT[day] ) /dt

Now, I only need to reach a certain vertical level at each grid point every day, which however, is different at each lat,lon and different for each day at each lat/lon. The way I tried to create that, is by first finding the index of the “accepted” vertical level at each lat,lon and every day like this:

A[depth,lat,lon,time] = np.fabs(depthsmask[depth,lat,lon]- MLD[day,lat,lon]) ## find minimum difference between depth file and mixed layer depth
B = A.transpose(“time”,“deptht”,“y”,“x”)
D = B.fillna(1e20)
idx2dT = D.argmin(dim=“deptht”)

In [5]: idx2dT
Out[5]:
<xarray.DataArray (time_counter: 365, y: 58, x: 50)>


Coordinates:

  • y (y) int64 1088 1089 1090 1091 1092 … 1142 1143 1144 1145
  • x (x) int64 469 470 471 472 473 474 … 513 514 515 516 517 518
    nav_lat (y, x) float32 …
    nav_lon (y, x) float32 …
  • time_counter (time_counter) datetime64[ns] 1993-01-01T12:00:00 … 1993-12-31T12:00:00

Then, I used this idx2dT to create the mask for my temperature file like this:

tempnew = temp.where(temp.deptht<=temp.deptht[idx2dT],np.nan)

What I tried to say by this is: “For all the values from surface until the “accepted” vertical level keep the temperature values as they are, the rest, put them to nan.”
Essentially it is only a mask, as you said, over the depth dimension, only that it is a time-space-varying mask. As I am new to xarray, I am trying to see how vectorizing indexing can help me. Can you please elaborate with that?

Finally, I am trying to run all this in a server full of people and the RAM I have available every time, depends on the number of people connected. Therefore, I am trying to switch to dask,in order to save time, because usually the creation of the mask-file takes a lot of time (I need to repeat this process for 7 other variables and many more ocean boxes).

I would appreciate it a lot if you could give me ideas on how to make my program more efficient with dask?.

Kind regards,
Sofi