Hello all,
I am trying to subset and then load, using Dask chunking, the same ocean temperature variable from two yearly files (1993 and 1994) with the code below.
Dimensions of the subset files:
T_1993_2D.nc: [365,32,65,48] (original global file: [365,32,4031,3057], ~400 GB)
T_1994_2D.nc: [365,32,65,48]
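For scale, here is a rough back-of-the-envelope sketch of how much data the subset should amount to in memory. It assumes the values are stored as 4-byte floats, which is an assumption and not something stated above; the shapes are the ones quoted.

```python
from math import prod

# Shapes quoted in the post: (time_counter, deptht, y, x).
subset_shape = (365, 32, 65, 48)
global_shape = (365, 32, 4031, 3057)

# Assumption: each value is a 4-byte float; adjust if the file uses another dtype.
BYTES_PER_VALUE = 4

subset_elements = prod(subset_shape)
global_elements = prod(global_shape)

# Uncompressed in-memory size of the subset, in megabytes (10**6 bytes).
subset_mb = subset_elements * BYTES_PER_VALUE / 1e6
# Share of the global grid that the subset covers.
fraction = subset_elements / global_elements

print(f"subset: {subset_elements} values, about {subset_mb:.0f} MB")
print(f"fraction of the global grid: {fraction:.6f}")
```

Under that assumption the subset is on the order of a hundred-odd megabytes, so the array itself should fit in memory comfortably.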
Code for loading the subset temperature:
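import xarray as xr
# Dask chunk sizes per dimension
chunks = {'time_counter': 41, 'deptht': 5, 'y': 340, 'x': 481}
dataset = xr.open_dataset('/home/folder/T_1993_2D.nc', chunks=chunks, engine='h5netcdf')
# Lazy selection; lat1, lat2, lon1, lon2 are index bounds defined elsewhere
TEMP = dataset.votemper.isel(deptht=slice(1, 33), y=slice(lat1, lat2), x=slice(lon1, lon2))
# Trigger the actual read into memory
temp = TEMP.compute()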
Insert Temperature
CPU times: user 1.75 s, sys: 11.8 s, total: 13.6 s
Wall time: 12.7 s
and the 2nd file like this:
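import xarray as xr
# Dask chunk sizes per dimension (same as for the 1993 file)
chunks = {'time_counter': 41, 'deptht': 5, 'y': 340, 'x': 481}
dataset = xr.open_dataset('/home/folder/T_1994_2D.nc', chunks=chunks, engine='h5netcdf')
# Lazy selection; lat1, lat2, lon1, lon2 are index bounds defined elsewhere
TEMP = dataset.votemper.isel(deptht=slice(1, 33), y=slice(lat1, lat2), x=slice(lon1, lon2))
# Trigger the actual read into memory
temp = TEMP.compute()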
Insert Temperature
CPU times: user 6min 7s, sys: 31.2 s, total: 6min 38s
Wall time: 10min 10s
These are two files for different years, from which I am loading the same variable (temperature) into memory, using the same chunking and the same subsetting indices, in two separate IPython sessions. The only difference I can think of is that I have already loaded the first file (1993) many times while looking for the best way to read the subset of this 4D array without reading the entire global file first, so I suspect the 1993 data is being served from some cache while the 1994 data is not.
Does anyone have an idea why loading a small subset of a 4D array might be so slow? Does subsetting a large file and then loading the subset mean that Python has to read the entire global file first?
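To make the second question more concrete, here is a minimal sketch of the arithmetic for how many chunks a requested slice overlaps. The chunk sizes are the ones passed to open_dataset above; the y and x offsets are hypothetical placeholders for lat1/lat2 and lon1/lon2, and chunks_touched is just a helper name made up for this sketch. As far as I understand, this count matters most when it describes the on-disk HDF5 chunk layout, since a chunked (and especially compressed) dataset is read one whole chunk at a time; whether the file on disk is actually chunked like this is an assumption.

```python
def chunks_touched(start, stop, chunk_size):
    """Count chunks of length chunk_size overlapped by the half-open range [start, stop)."""
    if stop <= start:
        return 0
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return last - first + 1

# Chunk sizes from the post.
chunks = {"time_counter": 41, "deptht": 5, "y": 340, "x": 481}

# Requested index ranges. The deptht range follows slice(1, 33) from the post.
# The y and x offsets are hypothetical; only their extents (65 and 48) come from the post.
request = {
    "time_counter": (0, 365),
    "deptht": (1, 33),
    "y": (2100, 2165),  # hypothetical stand-in for lat1, lat2
    "x": (1500, 1548),  # hypothetical stand-in for lon1, lon2
}

# Chunks overlapped along each dimension, and in total.
per_dim = {dim: chunks_touched(*request[dim], chunks[dim]) for dim in chunks}
total_chunks = 1
for count in per_dim.values():
    total_chunks *= count

# Values in one full-size chunk (edge chunks are smaller, so this is an upper bound).
values_per_chunk = 1
for size in chunks.values():
    values_per_chunk *= size

print(per_dim)
print(f"chunks overlapped: {total_chunks}, up to {values_per_chunk} values each")
```

If whole chunks have to be read, a chunk that is much larger than the requested window in y and x means far more data is touched than the subset itself contains, even though the full global file is not parsed.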
Thank you in advance for your time and help,
Kind regards,