Struggling with large dataset loading/reading using xarray

Hi all! I am a physical oceanographer, new to Python, and I recently watched a tutorial about reducing the processing time of computationally expensive calculations (if I understood it correctly) by using dask="parallelized".

My issue here is that I am dealing with 4D (sliced) files of dimensions:[time,depth,lat,lon] = [365,32,56,48]. I discovered 2 ways of handling the data:

A) The “lazy” xarray reading of data like this:
temp = xr.open_dataset('/home/directory/T_2011_2D.nc')['votemper'][:,:,:,:], where ipython reads the file instantly, but later on, when I want to use temp e.g. in a for loop, it takes ages.
B) The very slow xarray reading of data like this:
temp[:,:,:,:] = xr.open_dataset('/home/directory/T_2011_2D.nc')['votemper'][:,:,:,:], where ipython takes a very long time to open/read/load the temperature values, but then processing temp, e.g. in a for loop, takes only a minute.

Now I am not sure if there is a solution to this problem, but recently I learned about dask="parallelized" and some higher-order programming that could potentially offer a compromise in the time that ipython takes to read and process the large dataset (because I have to repeat the same procedure for 7 other variables of the same dimensions and over many boxes in the ocean).

The for loop looks like this:

mld2d = np.asarray(MLD)  ## make it a numpy array
depths2 = np.asarray(depths)
depths2d = np.tile(depths2[:,None,None],(1,len(latp[:])+2,len(lonp[:])+2))  ## 3D (depth,lat,lon)
idx2dT = np.nan*np.zeros(temp[:,0,:,:].shape)
idx2d = np.nan*np.zeros(mld2d[:,:,:].shape)
depthsmask = np.nan*np.zeros(depths2d[:,:,:].shape)
depthsmask = depths2d[:,:,:]*maskfile2[:,:,:]

for day in range(0,len(time)):
    for j in range(0,len(latp)):
        for i in range(0,len(lonp)):
            if np.isnan(mld[day,j,i]):
                idx2dT[day,j,i] = np.nan
            else:
                idx2dT[day,j,i] = np.abs(depthsmask[:,j,i]-mld2d[day,j,i]).argmin()

Can anyone help me with this?
Thank you in advance for your time and help,
Kind regards,
Sofi


Hi Sofi. Thanks for posting this question! We’d love to help you learn to use Xarray more effectively for your work.

Based on examining your code, I have the impression that you are new to Xarray and are coming from a language like MATLAB, which uses a very different programming paradigm. In Xarray, we tend not to write loops at all. Rather, we use high-level methods like reduction operators (e.g. mean, argmin, etc.) and groupby to operate on our data. This allows us to take advantage of Dask’s automatic parallelization.
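For example, a calculation that would require nested loops in other languages often reduces to a line or two of Xarray. Here is a minimal sketch (the variable name votemper is taken from your post; the 'time' dimension name and a datetime time coordinate are assumptions, so adjust to your file):

import xarray as xr

ds = xr.open_dataset('/home/directory/T_2011_2D.nc')

# reduce over a dimension instead of looping over it
mean_temp = ds['votemper'].mean(dim='time')

# or group by month and reduce each group
monthly_temp = ds['votemper'].groupby('time.month').mean()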

I would recommend you spend some time brushing up on the fundamentals of Xarray and Dask. This workshop would be a great way to do so:

Xarray’s tutorial material is also shared here:


Also a quick note / recommendation on Xarray usage. You almost never need to write code with explicit [:,:,:,:] indexing like in your post.

Instead, write

ds = xr.open_dataset('/home/directory/T_2011_2D.nc')
# example calculation:
# calculate the mean temperature in regions of shallow mixed layer
temp_mean = ds.temp.where(ds.mld < 100).mean(dim='time')

etc.

I’m not sure what you’re trying to accomplish in the code example in your post. If you share more details, we could provide more specific suggestions about how to do it using good Xarray syntax.

Hello rabernat!

Thanks for your reply. Yes, I have already signed up for this new tutorial. I am actually coming from the NCL language.

So what exactly does Dask’s parallelization do? How does it operate on the data? I am a bit confused.

Thanks,
Sofi

Yes, you’ll have to start by breaking your dataset into chunks by providing chunks to open_dataset. The [:, :, :, :] is not needed.

See this tutorial for how to start: https://xarray-contrib.github.io/xarray-tutorial/scipy-tutorial/06_xarray_and_dask.html

Once you are comfortable with dask, I would do as Ryan says and try to avoid for loops. If that ends up being too hard, then I would go down the apply_ufunc route.


Dask’s parallelization is often not explicit. It happens automatically, “under the hood.” Your data must be “chunked”; these chunks will determine how the computation is parallelized. For example, if you write code like this

# create one chunk per timestep
ds = xr.open_dataset('/home/directory/T_2011_2D.nc', chunks={'time': 1})
# create a "lazy" representation of the mean
temp_mean = ds.temp.mean(dim='time')
# trigger the computation; dask will parallelize over the time dimension
temp_mean.load()

This can be a lot to digest if you’re new to this type of programming. That’s why a tutorial can be helpful for getting started.

Ok yeah, I understand these lines of code here. It is a much neater way of writing things. I agree, and I will look at the tutorials that you sent. Thanks a lot.

But just in case you can help me more, here is what I am trying to do:

To first order, I would like to create a 4D mask file (with NaNs and 1s). This 4D file will have dimensions [time,depth,lat,lon], and I will later multiply it with every variable that I have (temperature, velocity, etc.). This multiplication is also what takes so much time in ipython when I have read all my temperature and velocity files the “lazy” xarray way.

Now, this 4D mask file marks with 1s all the valid grid points (horizontally and vertically) that are inside the mixed layer depth, and whatever is NaN is outside the mixed layer depth.

I am open to suggestions if there is a much quicker way to write this without for loops, etc.
Thank you,
Sofi


Thanks for taking the time to share your application. I’m sure it will be useful to others to see as a reference!

If you don’t mind iterating a bit more, I would encourage you to back up even further and describe, at a higher level, what you want to accomplish. I suspect your end goal is not truly to have a 4D matrix of NaNs and ones. Instead, you want to answer some scientific question. For example:

I want to calculate an average of ocean temperature over the mixed layer by combining a mixed-layer dataset with a temperature dataset.

(I don’t know what you want to do–I’m just making up an example.)

Your post above is more about how you want to implement the calculation. It still contains many assumptions about the programming pattern you want to use (e.g. creating large 4D matrices with 1s and NaNs). It may be the case that Xarray has some functions that will greatly simplify the analysis.

In general, with Xarray, you almost never have to create an empty array and fill it up using a loop. There are alternative, more efficient strategies, taking advantage of xarray’s broadcasting capabilities…

For example, you might be able to do what you want with code like this:

ds = xr.open_dataset('...')
# let's imagine that ds has three variables with the following dimensions
# - depth: [z]
# - mld: [time, lat, lon]
# - temp: [time, z, lat, lon]

# this DataArray will have dimensions [time, z, lat, lon] thanks to broadcasting
mld_mask = ds.depth < ds.mld
temp_mld_avg = ds.temp.where(mld_mask).mean(dim='z')

No loops, no explicitly assigning values to NaN, etc.

Ah ok. I will need to look more into it in order to understand what the last 2 lines do.

So, ok, what I want to do is basically use these valid grid points, those that are inside the mixed layer depth, and calculate the temperature tendency with time, that is dT/dt. But dT/dt needs to be calculated in a volume, which means that I need to integrate temperature three times over a volume that is fixed in the x and y dimensions, but whose z dimension changes with time and with each grid point (because the mixed layer depth is not the same everywhere).
And I wanted to do this by using this mask, which (I thought) would quickly mask out all the unnecessary vertical levels for each grid point and give me, for every day, a specific 3D volume to perform my integrations in. In any case, I will have to use a lot of for loops for these 3D integrations, which is also something that takes time, e.g.:

for day in range(0,len(time)):
    Tav[day] = (np.nansum(np.nansum(np.nansum(temp[day,:,:,:]*dz_fin[day,:,:,:]*dy_fin[day,:,:,:]*dx_fin[day,:,:,:],axis=0),axis=0),axis=0))/Volume[day]

and that takes a long time if I just open the temp file the “lazy” way, and I do not understand why it takes so much time.

Sofi


The reason it takes lots of time may be because you are not following xarray best practices in terms of broadcasting, indexing, etc. This results in inefficient code. That’s why I recommended starting from the basics (e.g. the tutorials) and working up to a more complex scenario.

Many of the things you think you have to do manually (e.g. loop over day) are done automatically by xarray, using the most efficient possible implementation. For example

Tav_per_day = ds.temp.mean(dim=['x', 'y', 'z'])

Masking can be done with where. Weighted averages can be done with weighted array reductions.
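For example, here is a sketch of a mixed-layer, volume-weighted average (dz, dy, dx stand in for your cell-size arrays and are just placeholder names; mld_mask is the broadcasted mask from the example above):

masked_temp = ds.temp.where(mld_mask)        # NaN outside the mixed layer
cell_volume = dz * dy * dx                   # placeholder cell-size DataArrays
Tav = masked_temp.weighted(cell_volume.fillna(0)).mean(dim=['z', 'lat', 'lon'])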

The theme here is that you may have to “unlearn” many of the programming patterns you are used to using with NCL in order to get the most out of Xarray.

For working with gridded / GCM data, you may also enjoy using xgcm:
http://xgcm.readthedocs.org/

But I recommend learning Xarray well first before diving into xgcm. :wink:


Ok thanks a lot. I will do that. I will start from the tutorials you sent before.

But could you please explain to me the difference, in terms of data processing, between these 2 lines?
1. temp = xr.open_dataset('/home/directory/T_2011_2D.nc')['votemper'][:,:,lat1:lat2,lon1:lon2]
2. temp[:,:,:,:] = xr.open_dataset('/home/directory/T_2011_2D.nc')['votemper'][:,:,lat1:lat2,lon1:lon2]

Do they represent two fundamentally different things, or is it just that my way of writing code is not adapted to xarray's capabilities in the 2nd case, which is why it takes so much time?

Finally, I was wondering whether there is an efficient way with xarray to multiply two 4D matrices, or is it computationally expensive no matter what I do?

Sofi


It’s likely that 1 is lazy while 2 is not. 2 uses assignment to a numpy array, which triggers eager computation. In general, you should try to avoid assigning to slices of arrays if you want to keep your xarray / dask code efficient.
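To make that concrete, here is a sketch of the two patterns (path and variable name copied from your post):

# pattern 1: a lazy reference to the data on disk; nothing is read yet
temp = xr.open_dataset('/home/directory/T_2011_2D.nc')['votemper']

# if you really do need all the values in memory (your pattern 2, done efficiently):
temp = temp.load()   # one explicit read, no pre-allocation or slice assignment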

I assume you mean the elementwise / Hadamard product, rather than a true matrix multiplication. This is what is done by default in numpy and xarray when you multiply two arrays:

# a and b are 4D arrays
c = a * b

The arrays will automatically be aligned based on their dimensions and coordinates.

Whether this is “efficient” depends on a number of factors. If you are using numpy arrays, you need to have enough memory (RAM) available to hold all of the different arrays in memory. If you are using dask, you can do it on arrays that are larger than your RAM and create a lazy representation of the result (another Dask array). The efficiency will depend on the chunking of a and b.

Normally a 4D array is not the end result of our analysis, since it’s hard to visualize. We often want to combine it with a reduction, e.g.

c = (a * b).mean(dim='time')
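With dask, that pattern could look like this (file and variable names are just placeholders; the chunk sizes are an example and need to be tuned to your data):

a = xr.open_dataset('a.nc', chunks={'time': 30})['var_a']
b = xr.open_dataset('b.nc', chunks={'time': 30})['var_b']

c = (a * b).mean(dim='time')   # still lazy: a graph of chunked tasks
c.load()                       # triggers the parallel computation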

Ok thank you very much for this. It was very helpful!
Looks like I have a lot of catching up to do.
:slight_smile:
Sofi


It takes a lot of courage to pose beginner questions like this on a public forum. Hopefully the discussion can be useful for others who are learning. So thank you! :pray:


Hello Everyone
I am a physical oceanographer and, like Sofi, I am quite new to Python. I have learned a few good things (e.g. xarray) and am still continuing my journey to improve my Python skills. Currently, I am facing a similar problem with a large dataset.

I want to compute the running trend over a large dataset of globally gridded monthly ocean temperature (time: 852, depth: 34, lat: 173, lon: 360). So I stacked my 4-D data into 3-D data, i.e. I merged my lat and lon axes into one dimension (call it z), so that my stacked data is now 3-D (time: 852, lev: 34, z: 62280), [173*360 = 62280].

Then I applied three for loops, over time, lev, and z, to compute my linear running trend for a sliding window of 10 years, i.e. 120 months. I tried running this code on a supercomputer and on my laptop, but the code is very slow: it keeps running for ages and I am unable to produce any analysis.

So my query is whether someone can look into my code and suggest how I can make it faster, for example by using dask with the loops, or anything else. I am sharing my code along with this query.

Also, I am a new member of this platform and do not know where to ask questions, so I am putting my question in Sofi's thread.

It will be very helpful if someone can suggest how to make my code efficient.

Cheers, Saurabh

import xarray as xr
import pandas as pd
import numpy as np
import netCDF4 as nc
import datetime

ohc1 = xr.open_mfdataset("ohc-enc13-anm.nc")
heat = ohc1.HEATANM

#Stack the dimension
heat1 = heat.stack(z = ("lat", "lon"))

#load the data in the memory for faster processing
heat2 = heat1.load()

#running trend
x = np.linspace(1,120,120) 
runt = np.zeros([61,34,62280])
for i in range(61):
    for j in range(34):
        for k in range(62280):
            r = np.polyfit(x,heat2[0 + 12*i : 120 + 12*i, j, k], 1)
            runt[i, j, k] = r[0]

#convert np array to xarray
runt1 = xr.DataArray(runt, coords=[heat1.time[0:61], heat1.lev[0:34], heat1.z], name='runt')

#unstack the dimensions using xarray
runt2 = runt1.unstack("z")

#convert xr dataarray to xr dataset
ds_ci = xr.Dataset({'heatrunt':runt2})
ds_ci.to_netcdf('heat-run-trend-enc13.nc')

Welcome @saurabh_rathore! You want to avoid looping over the individual cells. Try using xarray’s rolling and polyfit functions applied over the time dimension instead.
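Here is a rough sketch of the idea, untested on your data, using the heat2 array from your post: rolling(...).construct(...) builds the 120-month windows, and polyfit fits a line along each window.

import numpy as np

# build a new 'window' dimension holding each 120-month chunk
heat_windows = heat2.rolling(time=120).construct('window')

# give the window dimension a numeric coordinate so polyfit has an x-axis
heat_windows = heat_windows.assign_coords(window=np.arange(120))

# degree-1 fit along the window; selecting degree=1 gives the slope, i.e. the trend
trend = heat_windows.polyfit(dim='window', deg=1).polyfit_coefficients.sel(degree=1)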


Dear @rabernat, thank you for your suggestion. One thing I would like to know is how to use the rolling function. I want to compute the running trend over a window of 120 months, and after computing the trend for a chunk of 120 months, the window must slide by 12 months,
e.g. trend 1 = 0:120, trend 2 = 12:132, trend 3 = 24:144, and so on.
So how do I slide the 120-month window by 12 months using the rolling function?

Thanks again for your help and suggestions.

Cheers, Saurabh

That would probably be a good question for the xarray discussion forum:

Hi Saurabh, hi everyone,

This is an interesting question. First, let us say immediately: the computation should be fast, really fast, and there is no need to go on a supercomputer. How fast? A second, actually less than a second.

Why? Because computers are fast. My Intel Core i7, 6 years old, can deliver up to 27 GFlops per second: 27 billion operations (+ and * on floats) per second. Think about it for a while. That's a tremendous speed. How many operations are involved in your computation? Well, one multiplication (apply the 1/120 coefficient) and one addition per element. This means 2 operations per element. Your array has 852 × 34 × 62280 elements, so 1.8 billion elements. This means you have 3.6 GFlops to perform. Harvesting the 27 GFlops per second is tough and probably impossible in any realistic algorithm, but getting 12 GFlops/s is possible. Now you have your target time: 3.6 / 12 = 0.3 sec.

I believe this kind of back-of-the-envelope calculation should be routine for everyone in the community, to set bounds on the performance we should aim for in our computations. Clearly that is not yet the case.

Now the question is how to get this time. Well, in a word, you need to ensure data contiguity. Simple. The computation being done along the time axis, the time dimension should be the rightmost one. In addition, you may chunk over the latitude. I can see two loops. Assuming heat2.shape = (173, 360, 34, 852):

import numpy as np

nlat = 173
heat2 = np.zeros((nlat, 360, 34, 852))
runt = np.zeros((nlat, 360, 34, 61))
for i in range(nlat):
    data = heat2[i]
    res = runt[i]
    for kt in range(61):
        res[:, :, kt] = np.mean(data[:, :, kt*12 : 120 + kt*12], axis=2)

On my laptop, with nlat=20 instead of 173 (not enough RAM), the computation is done in less than a second. I'm probably around 2 GFlops/s. Getting 12 GFlops/s with numpy is hard but not impossible.

Hope it helps,

Guillaume

Guillaume–nice to see you on this forum!

Writing low-level loopy code like what you showed in your post will almost never be fast in Python. Because it is a dynamically interpreted language, there is a lot of overhead with each loop iteration. When you use numpy correctly, the low-level loops all happen in C code.

If you really want to write loopy code, you should look into numba (python compiler) or use a different language like Fortran or Julia.
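As a rough illustration of the numba route, here is a sketch under the same array layout as Guillaume's example (a running mean, not the full trend; all names here are made up for the illustration):

import numpy as np
from numba import njit

@njit
def running_mean(data, n_windows=61, window=120, stride=12):
    # data has shape (nlat, nlon, nlev, ntime); time is last, so each
    # window is contiguous in memory
    nlat, nlon, nlev, _ = data.shape
    out = np.empty((nlat, nlon, nlev, n_windows))
    for i in range(nlat):
        for j in range(nlon):
            for k in range(nlev):
                for t in range(n_windows):
                    out[i, j, k, t] = data[i, j, k, t*stride : t*stride + window].mean()
    return out

heat = np.zeros((20, 360, 34, 852))   # small nlat so it fits in RAM
result = running_mean(heat)

The compiled loops run at C-like speed, so this stays close to the back-of-the-envelope target discussed above.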