Hi all! I am a physical oceanographer, new to Python, and I recently watched a tutorial about reducing the processing time of computationally expensive operations (if I understood it correctly) by using dask="parallelized".
My issue is that I am dealing with 4D (sliced) files of dimensions [time, depth, lat, lon] = [365, 32, 56, 48]. I have found two ways of handling the data:
A) The “lazy” xarray reading of the data, like this:

temp = xr.open_dataset('/home/directory/T_2011_2D.nc')['votemper'][:,:,:,:]

where IPython reads the file instantly, but later, when I want to use temp (e.g. in a for loop), it takes ages.
B) The very slow xarray reading of the data, like this:

temp[:,:,:,:] = xr.open_dataset('/home/directory/T_2011_2D.nc')['votemper'][:,:,:,:]

where IPython takes a long time to open/read/load the temperature values, but then processing temp (e.g. in a for loop) takes only a minute.
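If I understand it correctly, (A) keeps the data on disk and re-reads it at every access, while (B) forces everything into memory up front. I think the explicit version of (B) would be something like this (just my understanding, reusing the file and variable from above):

import xarray as xr

## (A) lazy handle: nothing is read from disk yet
temp = xr.open_dataset('/home/directory/T_2011_2D.nc')['votemper']

## (B) eager: pull the whole variable into memory once,
## so the later for loop works on an in-memory array
temp = temp.load()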
Now I am not sure if there is a solution to this problem, but I recently learned about the dask="parallelized" option of xr.apply_ufunc and some higher-order programming that could potentially offer a compromise in the time IPython takes to read and process the large dataset (I have to repeat the same procedure for 7 other variables of the same dimensions and over many boxes in the ocean).
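From what I understood, xr.apply_ufunc with dask="parallelized" applies a plain NumPy function chunk by chunk. This is my attempt at sketching it for the computation in the loop below; the dimension names ('depth', 'time', etc.) and the DataArray names depthsmask_da / mld_da are my guesses (assuming both were opened with chunks= so they are dask-backed), and I have not managed to test it:

import numpy as np
import xarray as xr

def nearest_depth_index(depths_col, mld):
    ## depths_col: (..., ndepth) masked depth levels; mld: (...) mixed-layer depth
    ## returns the index of the depth level closest to the MLD,
    ## NaN where the MLD is NaN (same rule as the loop below)
    idx = np.abs(depths_col - mld[..., None]).argmin(axis=-1).astype(float)
    return np.where(np.isnan(mld), np.nan, idx)

idx2dT = xr.apply_ufunc(
    nearest_depth_index,
    depthsmask_da,                    ## DataArray (depth, lat, lon)
    mld_da,                           ## DataArray (time, lat, lon), chunked in time
    input_core_dims=[['depth'], []],  ## 'depth' is consumed inside the function
    dask='parallelized',              ## run the NumPy function per dask chunk
    output_dtypes=[float],
)
## note: the 'depth' dimension must stay in a single chunk for dask='parallelized'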
The for loop I am using looks like this:
import numpy as np

mld2d = np.asarray(MLD)      ## mixed-layer depth as a numpy array (time, lat, lon)
depths2 = np.asarray(depths) ## 1D array of the model depth levels
depths2d = np.tile(depths2[:,None,None], (1, len(latp)+2, len(lonp)+2)) ## 3D (depth, lat, lon)
idx2dT = np.nan*np.zeros(temp[:,0,:,:].shape) ## (time, lat, lon), NaN-filled
idx2d = np.nan*np.zeros(mld2d[:,:,:].shape)
depthsmask = depths2d*maskfile2 ## apply the land/bottom mask to the depth levels
for day in range(len(time)):
    for j in range(len(latp)):
        for i in range(len(lonp)):
            if np.isnan(mld2d[day,j,i]):
                idx2dT[day,j,i] = np.nan
            else:
                idx2dT[day,j,i] = np.abs(depthsmask[:,j,i] - mld2d[day,j,i]).argmin()
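While writing this up I also wondered whether the triple loop itself could be replaced by broadcasting, which is what I tried to mirror in the apply_ufunc sketch above; something like this (again a sketch, I have not verified it gives exactly the same result as the loop):

## insert axes so (depth, lat, lon) and (time, lat, lon) broadcast
## to (time, depth, lat, lon), then take the argmin over the depth axis
diff = np.abs(depthsmask[None,:,:,:] - mld2d[:,None,:,:])
idx2dT = diff.argmin(axis=1).astype(float)
idx2dT[np.isnan(mld2d)] = np.nan  ## keep land points as NaN

The intermediate array is (365, 32, 56, 48), roughly 250 MB in float64, so I think it should still fit in memory for one box.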
Can anyone help me with this?
Thank you in advance for your time and help,
Kind regards,
Sofi