Question about xm.open_mdsdataset of xmitgcm

Hello all,

I would like to ask whether xm.open_mdsdataset has an option to read the .meta files from a different folder than the one containing the .data files, or whether the .meta files must be located in the same folder as the .data files.

In addition, I would like to point out that xm.open_mdsdataset is very slow when reading a file from a directory that contains many other binary files. Is there a way to speed it up?

This is the code I am using:

datatot = xm.open_mdsdataset(data_dir='/home/Entr',grid_dir='/home/xdar/GRID_DATA',prefix=['TT'],iters=[day],delta_t=90,read_grid=False,nx=1999,ny=1999,ref_date='1979-01-15 00:00',geometry='sphericalpolar')

Thank you in advance for your time and help,
Sofi

As described in the documentation, the meta and data files must live in the same directory.

Since xmitgcm has to peek at lots of small meta files, the speed of reading usually depends on your filesystem. A good baseline is to just run ls -l on the directory where the files live. If this takes a long time, your filesystem is the source of the slowness, and there is not much that can be done about that. If you continue to have problems with xmitgcm, please open an issue on the MITgcm/xmitgcm issue tracker on GitHub.
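
If you would rather measure that baseline from Python than from the shell, here is a minimal sketch. The function name time_directory_listing is made up for illustration and is not part of xmitgcm; it only lists the directory and stats each entry, which is roughly the work ls -l does.

```python
import os
import time


def time_directory_listing(path):
    """Return (number of entries, elapsed seconds) for listing `path`.

    Hypothetical helper, not part of xmitgcm. It stats every entry,
    which is roughly what `ls -l` has to do.
    """
    start = time.perf_counter()
    n_entries = 0
    with os.scandir(path) as entries:
        for entry in entries:
            # Fetch metadata without following symlinks, like `ls -l`
            entry.stat(follow_symlinks=False)
            n_entries += 1
    return n_entries, time.perf_counter() - start


if __name__ == "__main__":
    n, seconds = time_directory_listing(".")
    print(f"{n} entries listed in {seconds:.2f} s")
```

If this takes many seconds on the mounted directory but is near-instant on a local one, the filesystem is the likely bottleneck rather than xmitgcm itself.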

Thanks @rabernat.

I have one last question. I am trying to combine bash and Python in a single Python script that does three things:
1: Copy files from one directory to another using a bash script named "COPY_FILES.sh". The files are copied like this:

cp -r /home/FILES2/TOT_R."000"${day}.data  /home/FILES1/REGIONS

2: Then open the copied files in /home/FILES1/REGIONS using xm.open_mdsdataset
3: Run the same script for multiple timesteps.

The code I wrote is the following :

import subprocess
import numpy as np
import xmitgcm as xm
for day in range(7730880,12987840,960):
    output = subprocess.run(['bash', '/home/REGIONS/COPY_FILES.sh',str(ndays),str(day)])
    datatot = xm.open_mdsdataset(data_dir='/home/FILES1/REGIONS',grid_dir='/home/grid/GRID_DATA',prefix=['TTc_R'],iters=[day],delta_t=90,read_grid=False,nx=1999,ny=1999,ref_date='1979-01-15 00:00',geometry='sphericalpolar')
    dT_dt = datatot.TOTTEND2
    dTnan = np.ma.array(dT_dt,mask=np.isnan(dT_dt))

I noticed that when I run the code inside the loop over days, during the first 2 or 3 iterations after the files are copied (step 1), xm.open_mdsdataset doesn't seem to be able to read the data properly, and I get the following error:

AttributeError: 'Dataset' object has no attribute 'TOTTEND2'

After a while, if I am not in the loop, let the script rest, and then just run:

datatot = xm.open_mdsdataset(data_dir='/home/FILES1/REGIONS',grid_dir='/home/grid/GRID_DATA',prefix=['TTc_R'],iters=[day],delta_t=90,read_grid=False,nx=1999,ny=1999,ref_date='1979-01-15 00:00',geometry='sphericalpolar')
dT_dt = datatot.TOTTEND2
dTnan = np.ma.array(dT_dt,mask=np.isnan(dT_dt))

on its own, then the files are read without a problem. Any ideas why this could be happening? It seems that there is a lag in the response of xm.open_mdsdataset for some reason. Have you ever heard of this before?

Sofi

xmitgcm uses caching to avoid re-scanning the filesystem (which can be slow). If you move the files around, you may want to clear the cache manually.

from xmitgcm.file_utils import clear_cache
clear_cache()
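
In a loop like the one above, the call would go right before each open, after the new files have been copied in. A rough sketch, reusing the paths and arguments from the question; the wrapper name open_one_iteration is made up for illustration and is not part of xmitgcm:

```python
def open_one_iteration(day, data_dir="/home/FILES1/REGIONS"):
    """Open a single iteration after discarding xmitgcm's cached listing.

    Hypothetical wrapper: only clear_cache and open_mdsdataset come
    from xmitgcm, the rest is illustration.
    """
    # Imported inside the function so it can be defined even where
    # xmitgcm is not installed
    import xmitgcm as xm
    from xmitgcm.file_utils import clear_cache

    # Forget the cached directory listing so newly copied files are seen
    clear_cache()
    return xm.open_mdsdataset(
        data_dir=data_dir,
        grid_dir="/home/grid/GRID_DATA",
        prefix=["TTc_R"],
        iters=[day],
        delta_t=90,
        read_grid=False,
        nx=1999,
        ny=1999,
        ref_date="1979-01-15 00:00",
        geometry="sphericalpolar",
    )


# Usage inside the loop, after the copy step has finished:
#     for day in range(7730880, 12987840, 960):
#         ...copy the files for `day`...
#         datatot = open_one_iteration(day)
```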

This is unfortunately a poorly documented part of the package. You can see the source code here:

Thanks a lot for this @rabernat. clear_cache() works like a charm. Thank you so much.

Sofi

FWIW, I wish that xmitgcm would work properly without the workaround of copying files around. Ideally you should be able to open all the files into a single Xarray dataset. Having to loop over timesteps is not the pattern we want to encourage.

What happens if you just try to read all the timesteps at once?

So, the problem is that the original binary files I am trying to read with xmitgcm are saved as daily files (by default) on a server that I have to mount on my local workstation in order to access them.

The reason I first copy the daily files from this server to my workstation (and then use xmitgcm to read them) is that the folder in which the files are located contains many thousands of files (bad practice), and xm.open_mdsdataset takes ages to locate them.

So I decided to copy each file to my workstation (which doesn't take long at all), open it quickly with xm.open_mdsdataset from the local directory, and continue my processing.
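
For what it's worth, the copy step could also be done directly in Python instead of calling a bash script. This is only a sketch: copy_iteration is a made-up helper, and the 10-digit zero-padded iteration number is inferred from the cp command earlier in the thread ("000" followed by a 7-digit day). It copies the .meta file together with the .data file, since both have to live in the same directory.

```python
import shutil
from pathlib import Path


def copy_iteration(src_dir, dst_dir, prefix, day):
    """Copy the .data and .meta files of one iteration into dst_dir.

    Hypothetical helper. Assumes file names of the form
    <prefix>.<10-digit iteration>.data / .meta
    """
    stem = f"{prefix}.{day:010d}"  # e.g. TOT_R.0007730880
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for suffix in (".data", ".meta"):
        src = Path(src_dir) / (stem + suffix)
        if not src.exists():
            # Fail early instead of opening an incomplete directory later
            raise FileNotFoundError(src)
        copied.append(Path(shutil.copy2(src, dst / src.name)))
    return copied
```

For example, copy_iteration('/home/FILES2', '/home/FILES1/REGIONS', 'TOT_R', day) would stand in for the subprocess.run call, with those paths taken from the posts above.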

Sofi
