Hello all,
I am currently trying to subset, and then load with Dask chunking, the same ocean temperature variable from two different yearly files, using the following code:
Dimensions of subset files:
T_1993_2D.nc [365,32,65,48] (original global file dimensions: [365,32,4031,3057], ~400 GB in size)
T_1994_2D.nc [365,32,65,48]
Code for loading the subset temperature:
%%time
import xarray as xr

chunks = {'time_counter': 41, 'deptht': 5, 'y': 340, 'x': 481}
dataset = xr.open_dataset('/home/folder/T_1993_2D.nc', chunks=chunks, engine='h5netcdf')
# lat1, lat2, lon1, lon2 are index bounds defined earlier in the session
TEMP = dataset.votemper.isel(deptht=slice(1, 33), y=slice(lat1, lat2), x=slice(lon1, lon2))
temp = TEMP.compute()
Insert Temperature
CPU times: user 1.75 s, sys: 11.8 s, total: 13.6 s
Wall time: 12.7 s
and the second file like this:
%%time
import xarray as xr

chunks = {'time_counter': 41, 'deptht': 5, 'y': 340, 'x': 481}
dataset = xr.open_dataset('/home/folder/T_1994_2D.nc', chunks=chunks, engine='h5netcdf')
# same index bounds as for the 1993 file
TEMP = dataset.votemper.isel(deptht=slice(1, 33), y=slice(lat1, lat2), x=slice(lon1, lon2))
temp = TEMP.compute()
Insert Temperature
CPU times: user 6min 7s, sys: 31.2 s, total: 6min 38s
Wall time: 10min 10s
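In case it is relevant, below is a minimal sketch of how I think the on-disk chunk layout of the variable could be inspected (this assumes the files are NetCDF-4/HDF5, that h5py is installed, and that votemper sits in the root group). If the two files were written with different chunking or compression settings, that alone could explain very different read times.

import h5py

# Inspect how the variable is actually stored on disk; if the Dask chunks
# requested above do not align with these HDF5 chunks, every Dask chunk may
# touch many on-disk chunks and trigger far more reads/decompression.
with h5py.File('/home/folder/T_1994_2D.nc', 'r') as f:
    var = f['votemper']
    print(var.shape)        # full variable shape
    print(var.chunks)       # on-disk chunk shape, or None if stored contiguously
    print(var.compression)  # e.g. 'gzip' if the data are compressed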
These are two files that differ only in year, from which I am trying to load into Python memory the same variable (temperature), using the same chunking and the same subset dimensions, in two separate IPython sessions. So far, I suspect the only reason for the difference between the two loading times is that I have already loaded the first file (1993) countless times while looking for the optimal way to load a subset of that year's 4D array without having to read the entire global file first, so its data are presumably already sitting in the operating system's file cache.
Does anyone have any idea why loading a subset of such a small 4D array might be so slow? Does subsetting a big data file and then loading the subset mean that Python has to parse the entire global file first?
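For what it is worth, my understanding is that open_dataset with chunks= is lazy, so the isel() selection should only schedule reads of the chunks that overlap it. Here is a minimal sketch of how I would check that before calling compute() (the index bounds below are placeholders for my actual lat1/lat2/lon1/lon2):

import xarray as xr

chunks = {'time_counter': 41, 'deptht': 5, 'y': 340, 'x': 481}
dataset = xr.open_dataset('/home/folder/T_1994_2D.nc', chunks=chunks, engine='h5netcdf')

lat1, lat2, lon1, lon2 = 0, 65, 0, 48  # placeholder index bounds for this sketch

# Nothing has been read from disk yet: votemper is backed by a lazy dask array.
TEMP = dataset.votemper.isel(deptht=slice(1, 33), y=slice(lat1, lat2), x=slice(lon1, lon2))

print(TEMP.data.chunks)       # chunk layout of the selected region
print(TEMP.data.npartitions)  # number of chunks that compute() will actually read
print(TEMP.nbytes / 1e6)      # size of the selection in MB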
Thank you in advance for your time and help,
Kind regards,
Sofi