Very slow selection of multiple points in large dataset using xarray

sombrero2003 · July 14, 2022, 7:44pm

Hello,
I am trying to sample hourly model output to match the time and location of observations collected during a flight.
There are only 1000 points and it seems the “lazy” selection works fine. However, I have never managed to actually complete the calculation.

the dataset is created using open_mfdataset (6 files ~6.8Gb per file)

xarray.Dataset

    Dimensions:
        lat: 180lon: 288time: 52608pfull_sub03: 21phalf: 50
    Coordinates:
        lat
        (lat)
        float64
        -89.5 -88.5 -87.5 ... 88.5 89.5
        lon
        (lon)
        float64
        0.625 1.875 3.125 ... 358.1 359.4
        time
        (time)
        datetime64[ns]
        2007-01-01T00:30:00 ... 2012-12-...
        pfull_sub03
        (pfull_sub03)
        float64
        229.3 277.4 333.2 ... 992.8 997.9
        phalf
        (phalf)
        float64
        0.01 0.02697 ... 995.9 1e+03
    Data variables:
        H2
        (time, pfull_sub03, lat, lon)
        float32
        dask.array<chunksize=(8760, 21, 1, 2), meta=np.ndarray>
        CO
        (time, pfull_sub03, lat, lon)
        float32
        dask.array<chunksize=(8760, 21, 1, 2), meta=np.ndarray>
        ps
        (time, lat, lon)
        float32
        dask.array<chunksize=(8760, 3, 23), meta=np.ndarray>
        temp
        (time, pfull_sub03, lat, lon)
        float32
        dask.array<chunksize=(8760, 21, 1, 2), meta=np.ndarray>
        sphum
        (time, pfull_sub03, lat, lon)
        float32
        dask.array<chunksize=(8760, 21, 1, 2), meta=np.ndarray>
        ucomp
        (time, pfull_sub03, lat, lon)
        float32
        dask.array<chunksize=(8760, 21, 1, 2), meta=np.ndarray>
        vcomp
        (time, pfull_sub03, lat, lon)
        float32
        dask.array<chunksize=(8760, 21, 1, 2), meta=np.ndarray>
        u_ref
        (time, lat, lon)
        float32
        dask.array<chunksize=(8760, 3, 23), meta=np.ndarray>
        v_ref
        (time, lat, lon)
        float32
        dask.array<chunksize=(8760, 3, 23), meta=np.ndarray>
        zsurf
        (lat, lon)
        float32
        ...
        pk
        (phalf)
        float32
        ...
        bk
        (phalf)
        float32
        ...

I then select the observations using

obs_select = f1.sel(lat=target_lat,lon=target_lon,time=target_time,method='nearest')

where target_lat looks like

xarray.DataArray
'lat'

    points: 1003

    array([61.854 , 63.436 , 65.8994, ..., 62.26  , 60.96  , 59.35  ])

    Coordinates:
        points
        (points)
        int64
        0 1 2 3 4 ... 999 1000 1001 1002
    Attributes: (0)

This gives me

obs_select

xarray.Dataset

    Dimensions:
        points: 1003pfull_sub03: 21phalf: 50
    Coordinates:
        lat
        (points)
        float64
        61.5 63.5 65.5 ... 62.5 60.5 59.5
        lon
        (points)
        float64
        205.6 204.4 203.1 ... 203.1 205.6
        time
        (points)
        datetime64[ns]
        2009-10-27T17:30:00 ... 2012-10-...
        pfull_sub03
        (pfull_sub03)
        float64
        229.3 277.4 333.2 ... 992.8 997.9
        phalf
        (phalf)
        float64
        0.01 0.02697 ... 995.9 1e+03
        points
        (points)
        int64
        0 1 2 3 4 ... 999 1000 1001 1002
    Data variables:
        H2
        (points, pfull_sub03)
        float32
        dask.array<chunksize=(1003, 21), meta=np.ndarray>
        CO
        (points, pfull_sub03)
        float32
        dask.array<chunksize=(1003, 21), meta=np.ndarray>
        ps
        (points)
        float32
        dask.array<chunksize=(1003,), meta=np.ndarray>
        temp
        (points, pfull_sub03)
        float32
        dask.array<chunksize=(1003, 21), meta=np.ndarray>
        sphum
        (points, pfull_sub03)
        float32
        dask.array<chunksize=(1003, 21), meta=np.ndarray>
        ucomp
        (points, pfull_sub03)
        float32
        dask.array<chunksize=(1003, 21), meta=np.ndarray>
        vcomp
        (points, pfull_sub03)
        float32
        dask.array<chunksize=(1003, 21), meta=np.ndarray>
        u_ref
        (points)
        float32
        dask.array<chunksize=(1003,), meta=np.ndarray>
        v_ref
        (points)
        float32
        dask.array<chunksize=(1003,), meta=np.ndarray>
        zsurf
        (points)
        float32
        746.8 263.4 194.8 ... 240.2 262.1
        pk
        (phalf)
        float32
        1.0 2.697 5.171 ... 225.0 10.0 0.0
        bk
        (phalf)
        float32
        0.0 0.0 0.0 ... 0.9874 0.9958 1.0

I further select the altitude range using

OF = obsf.where(np.logical_and(obsf[‘alt_bottom’]<=target_alt,obsf[‘alt_top’]>target_alt)).mean(dim=‘pfull’)

This yields

OF = 
xarray.Dataset

    Dimensions:
        points: 1003
        phalf: 22
    Coordinates: 
    lat
    (points)
    float64
    61.5 63.5 65.5 ... 62.5 60.5 59.5
    lon
    (points)
    float64
    205.6 204.4 203.1 ... 203.1 205.6
    time
    (points)
    datetime64[ns]
    2009-10-27T17:30:00 ... 2012-10-...
    phalf
    (phalf)
    float64
    207.8 252.2 304.1 ... 995.9 1e+03
    points
    (points)
    int64
    0 1 2 3 4 ... 999 1000 1001 1002

Data variables:

    H2
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    CO
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    ps
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    temp
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    sphum
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    ucomp
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    vcomp
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    u_ref
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    v_ref
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    zsurf
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    pk
    (phalf, points)
    float32
    dask.array<chunksize=(22, 1003), meta=np.ndarray>
    bk
    (phalf, points)
    float32
    dask.array<chunksize=(22, 1003), meta=np.ndarray>
    alt_center
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    alt_edge
    (points, phalf)
    float32
    dask.array<chunksize=(1003, 21), meta=np.ndarray>
    alt_top
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>
    alt_bottom
    (points)
    float32
    dask.array<chunksize=(1003,), meta=np.ndarray>

Attributes: (0)

But OF.compute() never completes.
I tried to first sample alongside the time dimension (200 unique timesteps) but it’s also extremely slow.
I think I am probably doing something wrong with dask. I am running a local cluster

client = Client(n_workers=4, threads_per_worker=1, memory_limit='50GB',local_directory=os.environ['TMPDIR'])

Thanks a lot for your help

TomAugspurger · July 14, 2022, 8:08pm

Hi @sombrero2003. Do you have a full example, showing how you’re claling open_mfdataset? In particular, showing where the files are stored (in blob storage or locally?)

If it’s in blob storage you’re probably running into Pangeo and Data — Pangeo documentation.

sombrero2003 · July 14, 2022, 8:09pm

The files are stored locally on fast scratch storage

f1 = xr.open_mfdataset('/vftmp/f1p/*.ASK_aircraft.nc')

rabernat · July 16, 2022, 3:32pm

I think this performs poorly due to the following issue.

github.com/pydata/xarray

Lazy concatenation of arrays

opened 10:32PM - 30 Nov 20 UTC

nbren12

topic-combine topic-lazy array

**Is your feature request related to a problem? Please describe.** Concatenatin…g xarray objects forces the data to load. I recently learned about this [object](https://github.com/pydata/xarray/blob/235b2e5bcec253ca6a85762323121d28c3b06038/xarray/core/indexing.py#L592) allowing lazy indexing into an DataArrays/sets without using dask. Concatenation along a single dimension is the inverse operation of slicing, so it seems natural to also support it. Also, concatenating along dimensions (e.g. "run"/"simulation"/"ensemble") can be a common merging workflow. **Describe the solution you'd like** `xr.concat([a, b], dim=...)` does not load any data in a or b. **Describe alternatives you've considered** One could rename the variables in a and b to allow them to be merged (e.g. `a['air_temperature'] -> "air_temperature_a"`), but it's more natural to make a new dimension. **Additional context** This is useful when not using dask for performance reasons (e.g. using another parallelism engine like Apache Beam).

Once you use open_mfsdataset, you are required to use Dask. The fact that the arrays are chunked means that you are basically loading the whole dataset into memory even to select just a few points.

To verify this, try just using one single file and using open_dataset without any dask or chunking. See if you can select data quickly from a single file. If the answer is no, then you’re looking at a fundamental I/O limitation of your filesystem related to how fast it can randomly access locations on disk. If the answer is yes, then the issue is likely the one linked above.

sombrero2003 · July 19, 2022, 1:53pm

Thank you very much.
If I load a single file, the selection is indeed very quick so the issue seems to be associated to the post you referred to.
In my case, would you suggest sampling each model file independently in a for loop?
Thanks

rabernat · July 19, 2022, 2:07pm

That is probably the best workaround available today.

Or you could roll up your sleeves and try to fix the upstream issue in Xarray! That’s how many of us here got into open source development.

Topic		Replies	Views
Struggling with large dataset loading/reading using xarray Science	39	15345	February 16, 2023
Any suggestions for efficiently operating over windows of data? Data	4	1197	February 2, 2023
Strange issue with .compute() of Dask array values Science	15	919	October 17, 2020
Xarray unable to allocate memory, How to "size up" problem Data location-uw	9	3058	July 27, 2023
How to be more efficient with subsetting data from multiple netcdf files? Data	4	1278	June 9, 2022

Very slow selection of multiple points in large dataset using xarray

Related topics