Very slow selection of multiple points in large dataset using xarray

Hello,
I am trying to sample hourly model output to match the time and location of observations collected during a flight.
There are only 1000 points and it seems the “lazy” selection works fine. However, I have never managed to actually complete the calculation.

the dataset is created using open_mfdataset (6 files ~6.8Gb per file)

xarray.Dataset

    Dimensions:
        lat: 180lon: 288time: 52608pfull_sub03: 21phalf: 50
    Coordinates:
        lat
        (lat)
        float64
        -89.5 -88.5 -87.5 ... 88.5 89.5
        lon
        (lon)
        float64
        0.625 1.875 3.125 ... 358.1 359.4
        time
        (time)
        datetime64[ns]
        2007-01-01T00:30:00 ... 2012-12-...
        pfull_sub03
        (pfull_sub03)
        float64
        229.3 277.4 333.2 ... 992.8 997.9
        phalf
        (phalf)
        float64
        0.01 0.02697 ... 995.9 1e+03
    Data variables:
        H2
        (time, pfull_sub03, lat, lon)
        float32
        dask.array<chunksize=(8760, 21, 1, 2), meta=np.ndarray>
        CO
        (time, pfull_sub03, lat, lon)
        float32
        dask.array<chunksize=(8760, 21, 1, 2), meta=np.ndarray>
        ps
        (time, lat, lon)
        float32
        dask.array<chunksize=(8760, 3, 23), meta=np.ndarray>
        temp
        (time, pfull_sub03, lat, lon)
        float32
        dask.array<chunksize=(8760, 21, 1, 2), meta=np.ndarray>
        sphum
        (time, pfull_sub03, lat, lon)
        float32
        dask.array<chunksize=(8760, 21, 1, 2), meta=np.ndarray>
        ucomp
        (time, pfull_sub03, lat, lon)
        float32
        dask.array<chunksize=(8760, 21, 1, 2), meta=np.ndarray>
        vcomp
        (time, pfull_sub03, lat, lon)
        float32
        dask.array<chunksize=(8760, 21, 1, 2), meta=np.ndarray>
        u_ref
        (time, lat, lon)
        float32
        dask.array<chunksize=(8760, 3, 23), meta=np.ndarray>
        v_ref
        (time, lat, lon)
        float32
        dask.array<chunksize=(8760, 3, 23), meta=np.ndarray>
        zsurf
        (lat, lon)
        float32
        ...
        pk
        (phalf)
        float32
        ...
        bk
        (phalf)
        float32
        ...

I then select the observations using

obs_select = f1.sel(lat=target_lat,lon=target_lon,time=target_time,method='nearest')

where target_lat looks like

xarray.DataArray
'lat'

    points: 1003

    array([61.854 , 63.436 , 65.8994, ..., 62.26  , 60.96  , 59.35  ])

    Coordinates:
        points
        (points)
        int64
        0 1 2 3 4 ... 999 1000 1001 1002
    Attributes: (0)

This gives me

obs_select

xarray.Dataset

    Dimensions:
        points: 1003pfull_sub03: 21phalf: 50
    Coordinates:
        lat
        (points)
        float64
        61.5 63.5 65.5 ... 62.5 60.5 59.5
        lon
        (points)
        float64
        205.6 204.4 203.1 ... 203.1 205.6
        time
        (points)
        datetime64[ns]
        2009-10-27T17:30:00 ... 2012-10-...
        pfull_sub03
        (pfull_sub03)
        float64
        229.3 277.4 333.2 ... 992.8 997.9
        phalf
        (phalf)
        float64
        0.01 0.02697 ... 995.9 1e+03
        points
        (points)
        int64
        0 1 2 3 4 ... 999 1000 1001 1002
    Data variables:
        H2
        (points, pfull_sub03)
        float32
        dask.array<chunksize=(1003, 21), meta=np.ndarray>
        CO
        (points, pfull_sub03)
        float32
        dask.array<chunksize=(1003, 21), meta=np.ndarray>
        ps
        (points)
        float32
        dask.array<chunksize=(1003,), meta=np.ndarray>
        temp
        (points, pfull_sub03)
        float32
        dask.array<chunksize=(1003, 21), meta=np.ndarray>
        sphum
        (points, pfull_sub03)
        float32
        dask.array<chunksize=(1003, 21), meta=np.ndarray>
        ucomp
        (points, pfull_sub03)
        float32
        dask.array<chunksize=(1003, 21), meta=np.ndarray>
        vcomp
        (points, pfull_sub03)
        float32
        dask.array<chunksize=(1003, 21), meta=np.ndarray>
        u_ref
        (points)
        float32
        dask.array<chunksize=(1003,), meta=np.ndarray>
        v_ref
        (points)
        float32
        dask.array<chunksize=(1003,), meta=np.ndarray>
        zsurf
        (points)
        float32
        746.8 263.4 194.8 ... 240.2 262.1
        pk
        (phalf)
        float32
        1.0 2.697 5.171 ... 225.0 10.0 0.0
        bk
        (phalf)
        float32
        0.0 0.0 0.0 ... 0.9874 0.9958 1.0

I further select the altitude range using

OF = obsf.where(np.logical_and(obsf[‘alt_bottom’]<=target_alt,obsf[‘alt_top’]>target_alt)).mean(dim=‘pfull’)

This yields

OF = 
xarray.Dataset

    Dimensions:
        points: 1003
        phalf: 22
    Coordinates: 
    lat
    (points)
    float64
    61.5 63.5 65.5 ... 62.5 60.5 59.5
    lon
    (points)
    float64
    205.6 204.4 203.1 ... 203.1 205.6
    time
    (points)
    datetime64[ns]
    2009-10-27T17:30:00 ... 2012-10-...
    phalf
    (phalf)
    float64
    207.8 252.2 304.1 ... 995.9 1e+03
    points
    (points)
    int64
    0 1 2 3 4 ... 999 1000 1001 1002

Data variables:

    H2
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    CO
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    ps
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    temp
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    sphum
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    ucomp
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    vcomp
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    u_ref
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    v_ref
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    zsurf
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    pk
    (phalf, points)
    float32
    dask.array<chunksize=(22, 1003), meta=np.ndarray>
    bk
    (phalf, points)
    float32
    dask.array<chunksize=(22, 1003), meta=np.ndarray>
    alt_center
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    alt_edge
    (points, phalf)
    float32
    dask.array<chunksize=(1003, 21), meta=np.ndarray>
    alt_top
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    alt_bottom
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>

Attributes: (0)

But OF.compute() never completes.
I tried to first sample alongside the time dimension (200 unique timesteps) but it’s also extremely slow.
I think I am probably doing something wrong with dask. I am running a local cluster

client = Client(n_workers=4, threads_per_worker=1, memory_limit='50GB',local_directory=os.environ['TMPDIR'])

Thanks a lot for your help

Hi @sombrero2003. Do you have a full example, showing how you’re claling open_mfdataset? In particular, showing where the files are stored (in blob storage or locally?)

If it’s in blob storage you’re probably running into Pangeo and Data — Pangeo documentation.

The files are stored locally on fast scratch storage

f1 = xr.open_mfdataset('/vftmp/f1p/*.ASK_aircraft.nc')

I think this performs poorly due to the following issue.

Once you use open_mfsdataset, you are required to use Dask. The fact that the arrays are chunked means that you are basically loading the whole dataset into memory even to select just a few points.

To verify this, try just using one single file and using open_dataset without any dask or chunking. See if you can select data quickly from a single file. If the answer is no, then you’re looking at a fundamental I/O limitation of your filesystem related to how fast it can randomly access locations on disk. If the answer is yes, then the issue is likely the one linked above.

1 Like

Thank you very much.
If I load a single file, the selection is indeed very quick so the issue seems to be associated to the post you referred to.
In my case, would you suggest sampling each model file independently in a for loop?
Thanks

That is probably the best workaround available today.

Or you could roll up your sleeves and try to fix the upstream issue in Xarray! :wink: That’s how many of us here got into open source development.