Hello all!
I am relatively new to Dask and was trying to implement what I think is a heavy computation, following some tutorials online. The outcome was not what I expected, so I would appreciate any ideas.
I am trying to mask a 4D temperature array by multiplying it with another 4D mask array (which contains only 1s and NaNs). So:
Variable matrix of temperature: temp=[365,32,58,50] = [time,deptht,lat,lon]
Mask file: D= [365,32,58,50] = [time,deptht,lat,lon]
For the moment these are loaded lazily as xarray objects, like this:
temp = xr.open_dataset('/home/directory/T_1993_2D.nc')['temp'][:, :, lat1:lat2, lon1:lon2]
When I try to do the multiplication between the 2 xarrays: mask1 = temp * D
this took: CPU times: user 1.83 s, sys: 12.1 s, total: 14 s Wall time: 6min 46s
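To make the operation concrete, here is a minimal sketch of the same masking on small synthetic data. The names `temp_demo` and `mask_demo` and the tiny shapes are hypothetical stand-ins, not my real files. It also compares the product against `DataArray.where`, which should give the same values without the multiplication.

```python
import numpy as np
import xarray as xr

# Hypothetical small stand-ins for the real arrays (same dim order, tiny shapes).
dims = ("time", "deptht", "lat", "lon")
rng = np.random.default_rng(0)
temp_demo = xr.DataArray(rng.normal(10.0, 2.0, size=(3, 4, 5, 6)), dims=dims)

# Mask of 1s and NaNs: keep the top two levels, drop the rest.
mask_values = np.ones((3, 4, 5, 6))
mask_values[:, 2:, :, :] = np.nan
mask_demo = xr.DataArray(mask_values, dims=dims)

# Multiplying by the mask turns every masked point into NaN.
masked_by_product = temp_demo * mask_demo

# Equivalent selection without arithmetic: keep values where the mask is not NaN.
masked_by_where = temp_demo.where(mask_demo.notnull())

print(masked_by_product.shape)
print(int(masked_by_product.isnull().sum()))
print(masked_by_product.equals(masked_by_where))
```

On data this small both forms are instantaneous, so the sketch only checks the logic, not the timing.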
So I decided to do it with dask like this:
AA = da.from_array(temp,chunks=(1,5,340,481))
MM = da.from_array(MLDMASKF,chunks=(1,5,340,481))
result = AA*MM
%%time
result.compute()
This took: CPU times: user 20.8 s, sys: 10.6 s, total: 31.4 s Wall time: 4min 7s
In that case I was expecting the result to be an xarray object. However, it was not:
type(result)
Out[7]: dask.array.core.Array
In [14]: result
Out[14]: dask.array<mul, shape=(365, 32, 58, 50), dtype=float64, chunksize=(1, 5, 58, 50), chunktype=xarray.DataArray>
I do not understand why. I thought that .compute() was there to load the values.
So I tried to convert it to a DataArray, but without success, using:
TEST= xr.DataArray(result, dims=my_dataarray.dims, attrs=my_dataarray.attrs.copy())
But I got an error:
AttributeError: 'Array' object has no attribute 'dims'
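A minimal sketch of what seems to be going on, using synthetic data: as far as I can tell, `.compute()` returns a new object and leaves `result` itself lazy, so the returned value has to be assigned to a name. The error message also suggests that `my_dataarray` was itself a dask Array rather than a DataArray, so the sketch takes `dims` and `attrs` from the original DataArray instead. `temp_demo`, `mask_demo`, `computed` and `restored` are hypothetical names, and the chunk sizes are arbitrary.

```python
import numpy as np
import dask.array as da
import xarray as xr

# Hypothetical small stand-ins for the real temperature and mask arrays.
dims = ("time", "deptht", "lat", "lon")
shape = (4, 6, 5, 5)
temp_demo = xr.DataArray(np.full(shape, 2.0), dims=dims, attrs={"units": "degC"})
mask_demo = xr.DataArray(np.ones(shape), dims=dims)

# Wrap the underlying NumPy arrays (not the DataArray objects) in dask arrays.
AA = da.from_array(temp_demo.values, chunks=(1, 3, 5, 5))
MM = da.from_array(mask_demo.values, chunks=(1, 3, 5, 5))
result = AA * MM  # still lazy, nothing has been computed yet

# compute() returns the values as a new NumPy array; `result` stays a dask array.
computed = result.compute()
print(type(result).__name__, type(computed).__name__)

# Rebuild a labelled array, taking dims and attrs from the original DataArray.
restored = xr.DataArray(computed, dims=temp_demo.dims, attrs=temp_demo.attrs.copy())
print(restored.dims, restored.attrs)
```

If the DataArray is kept as the outer container instead (for example by opening the file with a `chunks=` argument), the product should stay a DataArray throughout, but I have not checked that on my real files.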
What I really need is to access the values of this new masked matrix, in order to:
- check that the multiplication is doing what I think it is doing
- use this new matrix in other following computations
Can someone help me transform dask arrays back into real values/xarray objects?
The reason I used dask is to save time on the time-consuming creation of the 4D masked temperature. After that I just need to get back to the real values of the new array and use them.
(Note: I chose the chunk sizes based on the “_chunkSizes” attribute inside the .nc files themselves. I assumed these numbers were there to flag the right chunking of the data if needed; maybe I was wrong about that.)
I also tried to mask temperature by using another way:
testmask = xr.where(temp.deptht<=temp.deptht[idx2dT],temp,np.nan)
which took: CPU times: user 1.39 s, sys: 7.08 s, total: 8.46 s Wall time: 2min 42s
idx2dT is a 3D array, idx2dT=[time,lat,lon], which gives, at each grid point, the index of the vertical level where the mask has to start. So it is effectively my mask of vertical limits.
The temperature files come with their own mask in x,y,z, by default, but I want to implement on top of this another mask which basically masks out some more vertical levels in some grid points. And this masking is not uniform throughout the grid points.
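To illustrate the kind of non-uniform vertical masking described above, here is a minimal NumPy sketch on synthetic data. `temp_demo` and `idx_demo` are hypothetical stand-ins for the temperature and for idx2dT. It compares level numbers rather than depth values, which assumes that deptht increases monotonically with the level index, and it keeps the limit level itself, mirroring the `<=` in the xr.where line.

```python
import numpy as np

# Hypothetical tiny grid: (time, deptht, lat, lon).
nt, nz, ny, nx = 2, 5, 3, 4
rng = np.random.default_rng(0)
temp_demo = rng.normal(10.0, 2.0, size=(nt, nz, ny, nx))

# Hypothetical index of the last level to keep at each (time, lat, lon) point.
idx_demo = rng.integers(0, nz, size=(nt, ny, nx))

# Broadcast level numbers against the per-column limit:
# (1, nz, 1, 1) <= (nt, 1, ny, nx) gives a boolean array of shape (nt, nz, ny, nx).
levels = np.arange(nz)[None, :, None, None]
keep = levels <= idx_demo[:, None, :, :]

# Keep temperature where the level is at or above the limit, NaN elsewhere.
masked = np.where(keep, temp_demo, np.nan)

print(masked.shape)
print(idx_demo[0, 0, 0], np.isnan(masked[0, :, 0, 0]))
```

The boolean `keep` array depends only on idx2dT, so in principle it could be built once and reused for all the variables that share the same grid.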
Can someone help me with that? Is there a way to improve the performance of what I am trying to do? Is my approach correct or wrong? (I have 7 other variables to mask in the same way, followed by computations in the same program, which means ~15 min to run a program for just a very small portion of the ocean.)
Thank you in advance for your time and help,
Sofi