Question about xm.open_mdsdataset of xmitgcm

oceanusofi · February 14, 2022, 9:03am

Hello all,

I would like to ask whether “xm.open_mdsdataset” has an option to read the .meta files from a different folder than the one where the .data files are located, or whether the .meta files must be in the same folder as the .data files.

In addition, I would like to point out that “xm.open_mdsdataset” is very slow when reading a file from a directory that contains many other binary files. Is there a way to speed it up?

This is the code I am using:

datatot = xm.open_mdsdataset(data_dir='/home/Entr', grid_dir='/home/xdar/GRID_DATA',
                             prefix=['TT'], iters=[day], delta_t=90, read_grid=False,
                             nx=1999, ny=1999, ref_date='1979-01-15 00:00',
                             geometry='sphericalpolar')

Thank you in advance for your time and help,
Sofi

rabernat · February 14, 2022, 1:44pm

As described in the documentation, the meta and data files must live in the same directory.

Since xmitgcm has to peek at lots of small meta files, the speed of reading usually depends on your filesystem. A good baseline is to just run ls -l on the directory where the files live. If this takes a long time, it means that your filesystem is the source of the slowness. There is not much that can be done about that. If you continue to have problems with xmitgcm, please open an issue on the xmitgcm issue tracker on GitHub (MITgcm/xmitgcm).
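The same baseline can be measured from Python. This is a minimal sketch; time_listing is a hypothetical helper name, not part of xmitgcm. It lists a directory and stats every entry, which is roughly what ls -l does, and reports the elapsed time:

```python
import os
import time


def time_listing(path):
    """Return (number of entries, seconds spent listing and stat-ing them)."""
    start = time.perf_counter()
    count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            entry.stat()  # fetch size/mtime, as ls -l would
            count += 1
    return count, time.perf_counter() - start
```

If the elapsed time for the data directory is already large, the filesystem rather than xmitgcm is the likely bottleneck.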

oceanusofi · February 14, 2022, 5:31pm

Thanks @rabernat.

I have one last question. I am trying to use both bash and Python in one Python script that does three things:
1: Copy files from one directory to another using a bash script named "COPY_FILES.sh". The copying of the files is done like this:

cp -r /home/FILES2/TOT_R."000"${day}.data  /home/FILES1/REGIONS

2: Then open the copied files in /home/FILES1/REGIONS using xm.open_mdsdataset
3: Run the same script for multiple timesteps.
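The copy step could also be done from Python without the bash script. The following is only a sketch: copy_mds_pair is a hypothetical helper, and the prefix and directories are whatever the caller passes in. It copies the .meta file along with the .data file, since both must sit in the same directory, and it zero-pads the iteration number to the ten digits MITgcm uses in its output filenames. Note that a hard-coded "000" prefix only gives ten digits for seven-digit iteration numbers.

```python
import shutil
from pathlib import Path


def copy_mds_pair(src_dir, dst_dir, prefix, iteration):
    """Copy PREFIX.<10-digit iteration>.data and .meta; return the new paths."""
    stem = f"{prefix}.{iteration:010d}"  # e.g. TOT_R.0007730880
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for ext in (".data", ".meta"):
        src = Path(src_dir) / (stem + ext)
        copied.append(Path(shutil.copy2(src, dst / src.name)))
    return copied
```

For example, copy_mds_pair('/home/FILES2', '/home/FILES1/REGIONS', 'TOT_R', day) would stand in for the cp line above, assuming the .meta files live next to the .data files on the source side.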

The code I wrote is the following :

import subprocess
import numpy as np
import xmitgcm as xm

for day in range(7730880, 12987840, 960):
    # ndays is defined elsewhere in the script (not shown in this post)
    output = subprocess.run(['bash', '/home/REGIONS/COPY_FILES.sh', str(ndays), str(day)])
    datatot = xm.open_mdsdataset(data_dir='/home/FILES1/REGIONS', grid_dir='/home/grid/GRID_DATA',
                                 prefix=['TTc_R'], iters=[day], delta_t=90, read_grid=False,
                                 nx=1999, ny=1999, ref_date='1979-01-15 00:00',
                                 geometry='sphericalpolar')
    dT_dt = datatot.TOTTEND2
    dTnan = np.ma.array(dT_dt, mask=np.isnan(dT_dt))

I noticed that when I run the code in the loop over days, xm.open_mdsdataset doesn’t seem to read the data properly during the first 2 or 3 iterations (timesteps) after the files are copied (step 1), and I get the following error:

AttributeError: 'Dataset' object has no attribute 'TOTTEND2'

If I then leave the loop, let the script rest for a while, and just run:

datatot = xm.open_mdsdataset(data_dir='/home/FILES1/REGIONS', grid_dir='/home/grid/GRID_DATA',
                             prefix=['TTc_R'], iters=[day], delta_t=90, read_grid=False,
                             nx=1999, ny=1999, ref_date='1979-01-15 00:00',
                             geometry='sphericalpolar')
dT_dt = datatot.TOTTEND2
dTnan = np.ma.array(dT_dt, mask=np.isnan(dT_dt))

on its own, the files are read without a problem. Any idea why this could be happening? It seems that there is a lag in the response of xm.open_mdsdataset for some reason. Have you ever heard of this before?

Sofi

rabernat · February 14, 2022, 5:38pm

xmitgcm uses caching to avoid re-scanning the filesystem (which can be slow). If you move the files around, you may want to clear the cache manually.

from xmitgcm.file_utils import clear_cache
clear_cache()
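In a loop that copies files and then opens them, the call would go right before each open. A minimal sketch, where open_fresh is a hypothetical wrapper name and all other keyword arguments are passed through to xm.open_mdsdataset unchanged:

```python
def open_fresh(data_dir, **open_kwargs):
    """Clear xmitgcm's directory-listing cache, then open the dataset."""
    # Imported lazily so the wrapper can be defined where xmitgcm is absent
    import xmitgcm as xm
    from xmitgcm.file_utils import clear_cache

    clear_cache()  # forget cached listings so newly copied files are seen
    return xm.open_mdsdataset(data_dir=data_dir, **open_kwargs)
```

Inside the loop, the existing xm.open_mdsdataset(...) call would then become open_fresh('/home/FILES1/REGIONS', ...) with the same remaining arguments as before.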

This is unfortunately a poorly documented part of the package. You can see the source code here:

https://github.com/MITgcm/xmitgcm/blob/master/xmitgcm/file_utils.py

import cachetools.func
import os
import fnmatch

cache_maxsize = 100
cache_ttl = 600 # ten minutes

@cachetools.func.ttl_cache(maxsize=cache_maxsize, ttl=cache_ttl)
def listdir(path):
    return os.listdir(path)

@cachetools.func.ttl_cache(maxsize=cache_maxsize, ttl=cache_ttl)
def listdir_startswith(path, pattern):
    files = listdir(path)
    return [f for f in files if f.startswith(pattern)]

@cachetools.func.ttl_cache(maxsize=cache_maxsize, ttl=cache_ttl)
def listdir_endswith(path, pattern):
    files = listdir(path)
    return [f for f in files if f.endswith(pattern)]

This file has been truncated.
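Why a cached listing hides newly copied files can be shown with a small standalone sketch. It uses functools.lru_cache from the standard library as a stand-in for cachetools' ttl_cache (an assumption made only to keep the example dependency-free). The stale-result behavior is the same until the cache is cleared or, in the TTL case, the entry expires:

```python
import functools
import os
import tempfile


@functools.lru_cache(maxsize=100)
def cached_listdir(path):
    # Return a tuple so the cached value cannot be mutated by callers
    return tuple(sorted(os.listdir(path)))


with tempfile.TemporaryDirectory() as d:
    before = cached_listdir(d)           # directory is empty; result is cached
    open(os.path.join(d, "T.0000000001.data"), "w").close()
    stale = cached_listdir(d)            # served from cache; new file not seen
    cached_listdir.cache_clear()         # analogous to xmitgcm's clear_cache()
    fresh = cached_listdir(d)            # re-scans the directory

print(before, stale, fresh)
```

With the ten-minute TTL shown in the excerpt, a file copied right after a listing was cached can stay invisible for up to ten minutes, which matches the "lag" described earlier in the thread.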

oceanusofi · February 15, 2022, 9:28am

Thanks a lot for this, @rabernat. clear_cache() works like a charm. Thank you so much for this.

Sofi

rabernat · February 15, 2022, 1:37pm

FWIW, I wish that xmitgcm would work properly without the workaround of copying files around. Ideally you should be able to open all the files into a single Xarray dataset. Having to loop over timesteps is not the pattern we want to encourage.

What happens if you just try to read all the timesteps at once?
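As a rough sketch of what reading everything at once might look like, the same arguments used earlier in the thread can be passed with the full list of iteration numbers instead of one per call. Here open_all is a hypothetical helper name, and whether this performs acceptably depends on the filesystem discussed above:

```python
# Iteration numbers taken from the loop earlier in the thread
iters = list(range(7730880, 12987840, 960))

# 960 iterations x 90 s per iteration = 86400 s, i.e. one record per day
seconds_between_records = 960 * 90


def open_all(data_dir, grid_dir):
    """Open every listed iteration into a single dataset."""
    import xmitgcm as xm  # lazy import so the sketch loads without xmitgcm

    return xm.open_mdsdataset(
        data_dir=data_dir, grid_dir=grid_dir, prefix=['TTc_R'], iters=iters,
        delta_t=90, read_grid=False, nx=1999, ny=1999,
        ref_date='1979-01-15 00:00', geometry='sphericalpolar')
```

The result would be one dataset with a time dimension covering all requested iterations, so the per-timestep loop is no longer needed.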

oceanusofi · February 15, 2022, 2:06pm

So, the problem is that the original binary files that I try to read with xmitgcm are saved as daily ones (by default) on a server that I have to mount on my local workstation in order to access them.

The reason why I first copy the daily files from this server to my workstation (and then use xmitgcm to read them) is that the particular folder in which the files are located contains many thousands of files (bad practice), and xm.open_mdsdataset takes ages to locate them.

So I decided to copy each file into my workstation (doesn’t take long at all) and then open it fast with xm.open_mdsdataset in my local machine (and directory) and continue my processing.

Sofi
