Struggling with large dataset loading/reading using xarray

Hi all! I am a physical oceanographer, new to Python, and I recently watched a tutorial about minimizing the processing time of computationally expensive data (if I understood it correctly) by using dask="parallelized".

My issue here is that I am dealing with 4D (sliced) files of dimensions:[time,depth,lat,lon] = [365,32,56,48]. I discovered 2 ways of handling the data:

A) The “lazy” xarray reading of data like this:
temp = xr.open_dataset('/home/directory/')['votemper'][:,:,:,:], where IPython reads the file instantly, but later on when I want to use temp, e.g. in a for loop, it takes ages.
B) The very slow xarray reading of data like this:
temp[:,:,:,:] = xr.open_dataset('/home/directory/')['votemper'][:,:,:,:], where IPython takes too much time to open/read/load the temperature values, but then processing temp, e.g. in a for loop, takes only a minute.

Now I am not sure if there is a solution to this problem, but I recently learned about the dask="parallelized" option and some higher-order programming that could potentially offer a compromise in the time that IPython takes to read & process the large dataset (because I have to repeat the same procedure for 7 other variables of the same dimensions and throughout many boxes in the ocean).

The example of for loop is like this:

mld2d = np.asarray(MLD)  # make it a numpy array
depths2 = np.asarray(depths)
depths2d = np.tile(depths2[:, None, None], (1, len(latp[:]) + 2, len(lonp[:]) + 2))  # 3D (depth, lat, lon)
idx2dT = np.nan * np.zeros(temp[:, 0, :, :].shape)
idx2d = np.nan
depthsmask = np.nan * np.zeros(depths2d[:, :, :].shape)
depthsmask = depths2d[:, :, :] * maskfile2[:, :, :]

for day in range(0, len(time)):
    for j in range(0, len(latp)):
        for i in range(0, len(lonp)):
            if np.isnan(mld[day, j, i]):
                idx2dT[day, j, i] = np.nan
            else:
                idx2dT[day, j, i] = np.abs(depthsmask[:, j, i] - mld2d[day, j, i]).argmin()

Can anyone help me with this?
Thank you in advance for your time and help,
Kind regards,


Hi Sofi. Thanks for posting this question! We’d love to help you learn to use Xarray more effectively for your work.

Based on examining your code, I have the impression that you are new to Xarray and are coming from a language like MATLAB, which uses a very different programming paradigm. In Xarray, we tend not to write loops at all. Rather, we use high-level methods like reduction operators (e.g. mean, argmin, etc.) and groupby to operate on our data. This allows us to take advantage of Dask’s automatic parallelization.
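A toy illustration of that loop-free style (the data here is synthetic, just to show the pattern):

```python
import numpy as np
import pandas as pd
import xarray as xr

# synthetic [time, lat, lon] temperature field standing in for real model output
temp = xr.DataArray(
    np.arange(24.0).reshape(4, 3, 2),
    dims=["time", "lat", "lon"],
    coords={"time": pd.date_range("2020-01-01", periods=4, freq="MS")},
    name="temp",
)

# a reduction over a named dimension -- no loops over indices
time_mean = temp.mean(dim="time")

# a groupby reduction, e.g. a monthly climatology
monthly = temp.groupby("time.month").mean()
```

With a Dask-backed array, these same two lines would run in parallel over chunks without any change to the code.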

I would recommend you spend some time brushing up on the fundamentals of Xarray and Dask. This workshop would be a great way to do so:

Xarray’s tutorial material is also shared here:

Also, a quick note / recommendation on Xarray usage. You almost never need to write code like this:

temp = xr.open_dataset('/home/directory/')['votemper'][:,:,:,:]

Instead, write:

ds = xr.open_dataset('/home/directory/')
# example calculation:
# calculate the mean temperature in regions of shallow mixed layer
temp_mean = ds.temp.where(ds.mld < 100).mean(dim='time')


I’m not sure what you’re trying to accomplish in the code example in your post. If you share more details, we could provide more specific suggestions about how to do it using good Xarray syntax.

Hello rabernat!

Thanks for your reply. Yes, I have already subscribed to this new tutorial. I am actually coming from ncl language.

So what exactly does Dask's parallelization do? How does it operate on the data? I am a bit confused.


Yes, you'll have to start by breaking your dataset into chunks by providing chunks to open_dataset. The [:, :, :, :] is not needed.

See this tutorial for how to start:

Once you are comfortable with dask, I would do as Ryan says and try to avoid for loops. If that ends up being too hard, then I would go down the apply_ufunc route.
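For reference, a minimal sketch of the apply_ufunc pattern mentioned above, using made-up toy data (depth, mld, and nearest_level are invented here for illustration):

```python
import numpy as np
import xarray as xr

# toy inputs: a 1D depth axis and a 2D mixed-layer-depth field
depth = xr.DataArray([5.0, 15.0, 30.0, 60.0], dims="z")
mld = xr.DataArray([[10.0, 55.0], [np.nan, 20.0]], dims=("lat", "lon"))

def nearest_level(depths, target):
    # index of the depth level closest to target (NaN-safe)
    if np.isnan(target):
        return np.nan
    return float(np.abs(depths - target).argmin())

# apply_ufunc consumes the core dimension "z" and loops over lat/lon for us
idx = xr.apply_ufunc(
    nearest_level,
    depth,
    mld,
    input_core_dims=[["z"], []],
    vectorize=True,
)
```

Adding dask="parallelized" to the same call would let it run chunk-by-chunk on a chunked array.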


Dask’s parallelization is often not explicit. It happens automatically, “under the hood.” Your data must be “chunked”; these chunks will determine how the computation is parallelized. For example, if you write code like this

# create one chunk per timestep
ds = xr.open_dataset('/home/directory/', chunks={'time': 1})
# create a "lazy" representation of the mean
temp_mean = ds.temp.mean(dim='time')
# trigger the computation; dask will parallelize over the time dimension
temp_mean.compute()

This can be a lot to digest if you’re new to this type of programming. That’s why a tutorial can be helpful for getting started.

Ok yeah, I understand these lines of code here. A much neater way of writing things. I agree, and I will look at the tutorials that you sent. Thanks a lot.

But just in case you can help me more, here is what I am trying to do:

To first order, I would like to create a 4D mask file (with NaNs and 1s). This 4D file will have dimensions of [time, depth, lat, lon] and I will later multiply it with every variable that I have (temperature, velocity, etc.). This multiplication is also what takes so much time in IPython when I have read all my temperature and velocity files the "lazy" xarray way.

Now, this 4D mask file marks with 1s all the valid grid points (horizontally and vertically) that are inside the mixed layer depth; whatever is NaN is outside the mixed layer depth.

I am open to suggestions if there is a much quicker way to write this without for loops, etc.
Thank you,


Thanks for taking the time to share your application. I’m sure it will be useful to others to see as a reference!

If you don't mind iterating a bit more, I would encourage you to back up even further and describe, at a higher level, what you want to accomplish. I suspect your end goal is not truly to have a 4D matrix of NaNs and ones. Instead, you want to answer some scientific question. For example:

I want to calculate an average of ocean temperature over the mixed layer by combining a mixed-layer dataset with a temperature dataset.

(I don’t know what you want to do–I’m just making up an example.)

Your post above is more about how you want to implement the calculation. It still contains many assumptions about the programming pattern you want to use (e.g. creating large 4D matrices with 1s and NaNs, etc.). It may be the case that Xarray has some functions that will greatly simplify the analysis.

In general, with Xarray, you almost never have to create an empty array and fill it up using a loop. There are alternative, more efficient strategies that take advantage of xarray's broadcasting capabilities.

For example, you might be able to do what you want with code like this:

ds = xr.open_dataset('...')
# let's imagine that ds has three variables with the following dimensions
# - depth: [z]
# - mld: [time, lat, lon]
# - temp: [time, z, lat, lon]

# this DataArray will have dimensions [time, z, lat, lon] thanks to broadcasting
mld_mask = ds.depth < ds.mld
temp_mld_avg = ds.temp.where(mld_mask).mean(dim='z')

No loops, no explicitly assigning values to NaN, etc.

Ah ok. I will need to look more into it, in order to understand what the last 2 lines do.

So, ok, what I want to do is basically use these valid grid points, the ones inside the mixed layer depth, and calculate the temperature tendency with time, that is dT/dt. But dT/dt needs to be calculated in a volume, which means that I need to integrate temperature three times over a volume that is fixed in the x and y dimensions, while the z dimension changes with time and with each grid point (because the mixed layer depth is not the same everywhere).
And I wanted to do this by using this mask, which (I thought) would quickly mask out all the unnecessary vertical levels for each grid point and give me, for every day, a specific 3D volume to perform my integrations in. In any case, I will have to use a lot of for loops for these 3D integrations, which is also something that takes time, e.g.:

for day in range(0, len(time)):
    Tav[day] = (np.nansum(np.nansum(np.nansum(
        temp[day, :, :, :] * dz_fin[day, :, :, :] * dy_fin[day, :, :, :] * dx_fin[day, :, :, :],
        axis=0), axis=0), axis=0)) / Volume[day]

and that takes a long time if I just open the temp file the "lazy" way. And I do not understand why it takes so much time.



The reason it takes lots of time may be because you are not following xarray best practices in terms of broadcasting, indexing, etc. This results in inefficient code. That’s why I recommended starting from the basics (e.g. the tutorials) and working up to a more complex scenario.

Many of the things you think you have to do manually (e.g. loop over day) are done automatically by xarray, using the most efficient possible implementation. For example

Tav_per_day = ds.temp.mean(dim=['x', 'y', 'z'])

Masking can be done with where. Weighted averages can be done with weighted array reductions.
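For instance, a thickness-weighted depth average can be sketched like this (the toy values and the name dz are made up here):

```python
import numpy as np
import xarray as xr

# toy temperature on two depth levels and two columns
temp = xr.DataArray([[1.0, 2.0], [3.0, 4.0]], dims=("z", "x"))
# layer thicknesses used as weights
dz = xr.DataArray([10.0, 30.0], dims="z")

# the weighted reduction replaces the manual nansum(temp * dz) / sum(dz) pattern
temp_wavg = temp.weighted(dz).mean(dim="z")
```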

The theme here is that you may have to “unlearn” many of the programming patterns you are used to using with NCL in order to get the most out of Xarray.

For working with gridded / GCM data, you may also enjoy using xgcm:

But I recommend learning Xarray well first before diving into xgcm. :wink:

Ok thanks a lot. I will do that. I will start from the tutorials you sent before.

But could you please explain to me the difference, in terms of data processing, between these 2 lines here?
1. temp = xr.open_dataset('/home/directory/')['votemper'][:,:,lat1:lat2,lon1:lon2]
2. temp[:,:,:,:] = xr.open_dataset('/home/directory/')['votemper'][:,:,lat1:lat2,lon1:lon2]

Do they represent two fundamentally different things, or is it just my programming style that is not adapted to xarray's capabilities in the 2nd case, which takes so much time?

Finally, I was wondering whether there is an efficient way with xarray to multiply two 4D matrices? Or is it computationally expensive no matter what I do?



It’s likely that 1 is lazy while 2 is not. 2 uses assignment to a numpy array, which triggers eager computation. In general, you should try to avoid assigning to slices of arrays if you want to keep your xarray / dask code efficient.
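A toy in-memory illustration of the difference (with a file-backed or chunked dataset, the first pattern stays lazy, while the second forces every value to be read):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("x", "y"))

# pattern 1: binds a name to the xarray object itself (lazy if file/dask-backed)
temp = da

# pattern 2: copies every value into a pre-allocated numpy buffer (eager)
buf = np.empty((2, 3))
buf[:, :] = da
```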

I assume you mean the elementwise / Hadamard product, rather than a true matrix multiplication. This is what is done by default in numpy and xarray when you multiply two arrays:

# a and b are 4D arrays
c = a * b

The arrays will automatically be aligned based on their dimensions and coordinates.
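A small toy example of that alignment (the coordinate values are invented):

```python
import numpy as np
import xarray as xr

a = xr.DataArray([1.0, 2.0, 3.0], dims="x", coords={"x": [0, 1, 2]})
b = xr.DataArray([10.0, 20.0, 30.0], dims="x", coords={"x": [1, 2, 3]})

# multiplication aligns on the shared labels x = 1, 2 before computing
c = a * b
```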

Whether this is "efficient" depends on a number of factors. If you are using numpy arrays, you need to have enough memory (RAM) available to hold all of the different arrays in memory. If you are using dask, you can do it on arrays that are larger than your RAM and create a lazy representation of the result (another Dask array). The efficiency will depend on the chunking of a and b.

Normally a 4D array is not the end result of our analysis, since it’s hard to visualize. We often want to combine it with a reduction, e.g.

c = (a * b).mean(dim='time')

Ok thank you very much for this. It was very helpful!
Looks like I have a lot of catching up to do.


It takes a lot of courage to pose beginner questions like this on a public forum. Hopefully the discussion can be useful for others who are learning. So thank you! :pray: