PangeoForge Xarray-to-Zarr recipe runs out of memory

I’m following the PangeoForge recipe tutorial Xarray-to-Zarr Sequential Recipe: NOAA OISST to create a recipe for CEDA monthly daytime land surface temperature data, but I’m running into trouble with pangeo-forge-recipes 0.10.1 (installed from conda-forge).

Here’s my code in recipe.py (I’m using Python 3.11):

import os
from tempfile import TemporaryDirectory

import apache_beam as beam
import pandas as pd
import xarray as xr
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import (
    OpenURLWithFSSpec,
    OpenWithXarray,
    StoreToZarr,
)

# One NetCDF file per month from the CEDA archive, August 1995 through
# December 2020.
url_pattern = (
    "https://dap.ceda.ac.uk/neodc/esacci/land_surface_temperature/data/"
    "MULTISENSOR_IRCDR/L3S/0.01/v2.00/monthly/{time:%Y}/{time:%m}/"
    "ESACCI-LST-L3S-LST-IRCDR_-0.01deg_1MONTHLY_DAY-{time:%Y%m}01000000-fv2.00.nc"
)
months = pd.date_range("1995-08", "2020-12", freq=pd.offsets.MonthBegin())
urls = tuple(url_pattern.format(time=month) for month in months)
# Each file holds a single time step; prune the pattern to its first item.
pattern = pattern_from_file_sequence(urls, "time", nitems_per_file=1).prune(1)

# Write the Zarr store under a temporary directory.
temp_dir = TemporaryDirectory()
target_root = temp_dir.name
store_name = "output.zarr"
target_store = os.path.join(target_root, store_name)

transforms = (
    # Yield (index, url) pairs, open each URL with fsspec, load it into an
    # xarray.Dataset, and write everything to a single Zarr store.
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        target_root=target_root,
        store_name=store_name,
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 1, "lat": 5, "lon": 5},
    )
)

print(f"{pattern=}")
print(f"{target_store=}")
print(f"{transforms=}")

with beam.Pipeline() as p:
    p | transforms  # type: ignore[reportUnusedExpression]

with xr.open_zarr(target_store) as ds:
    print(ds)

NOTE: To reduce the likelihood of a memory problem, I pruned my pattern to a single element. Unfortunately, this didn’t help.
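
A quick sanity check that the prune took effect (iterating the pruned pattern should yield exactly one (index, url) pair):

# Verify the pruned pattern contains a single item.
for index, url in pattern.items():
    print(index, url)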

When I run this, it is eventually killed because it consumes an obscene amount of memory. I saw the Python process exceed 40 GB (on my 16 GB machine, so it was presumably deep into swap), and it may well have gone beyond that while I wasn’t watching. The run took about 3.5 hours:

$ time python recipe.py
...
.../python3.11/site-packages/xarray/core/dataset.py:2461: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
  return to_zarr(  # type: ignore[call-overload,misc]
Killed: 9

real    216m31.108s
user    76m14.794s
sys     90m21.965s
.../python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
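
One thing that occurred to me while writing this up: the filenames say 0.01deg, so if the grid is global (roughly 18000 × 36000 points; that shape is my assumption, not something I read from the files), my 5 × 5 lat/lon target chunks imply a staggering number of tiny chunks per time step:

# Back-of-envelope chunk count, assuming a global 0.01-degree grid.
# The 18000 x 36000 shape is an assumption based on the "0.01deg" in
# the filename, not read from the data.
nlat, nlon = 18_000, 36_000
print((nlat // 5) * (nlon // 5))  # 25,920,000 chunks per time step

Could that alone account for the blow-up?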

I’m going to downgrade pangeo-forge-recipes to a version prior to the recently introduced breaking API changes (i.e., before the 0.10 Beam rewrite) to see whether I hit the same problem with the old API. In the meantime, is there anything glaringly wrong with what I’ve written above that would cause the memory issue?
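
For reference, this is the downgrade I plan to try (assuming the last pre-Beam release is in the 0.9 series):

$ conda install -c conda-forge "pangeo-forge-recipes<0.10"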


Thanks for reporting this! Definitely not normal or expected behavior.

For Pangeo Forge Recipes support, you will likely get a much better response on the issue tracker than on this forum: Issues · pangeo-forge/pangeo-forge-recipes · GitHub

Thanks @rabernat. I wasn’t sure whether I was just doing something wrong, and I didn’t want to open an issue against the repo in that case, but as you suggest, I’ll go that route.

Here’s the issue I created: Xarray-to-Zarr recipe runs out of memory · Issue #614 · pangeo-forge/pangeo-forge-recipes · GitHub