I’m following the Pangeo Forge tutorial "Xarray-to-Zarr Sequential Recipe: NOAA OISST" to create a recipe for CEDA monthly daytime land surface temperature data, but I’m running into trouble with pangeo-forge-recipes version 0.10.1 (installed from conda-forge).

Here’s my code in `recipe.py` (I’m using Python 3.11):
```python
import os
from tempfile import TemporaryDirectory

import apache_beam as beam
import pandas as pd
import xarray as xr

from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import (
    OpenURLWithFSSpec,
    OpenWithXarray,
    StoreToZarr,
)

url_pattern = (
    "https://dap.ceda.ac.uk/neodc/esacci/land_surface_temperature/data/"
    "MULTISENSOR_IRCDR/L3S/0.01/v2.00/monthly/{time:%Y}/{time:%m}/"
    "ESACCI-LST-L3S-LST-IRCDR_-0.01deg_1MONTHLY_DAY-{time:%Y%m}01000000-fv2.00.nc"
)
months = pd.date_range("1995-08", "2020-12", freq=pd.offsets.MonthBegin())
urls = tuple(url_pattern.format(time=month) for month in months)
pattern = pattern_from_file_sequence(urls, "time", nitems_per_file=1).prune(1)

temp_dir = TemporaryDirectory()
target_root = temp_dir.name
store_name = "output.zarr"
target_store = os.path.join(target_root, store_name)

transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        target_root=target_root,
        store_name=store_name,
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 1, "lat": 5, "lon": 5},
    )
)

print(f"{pattern=}")
print(f"{target_store=}")
print(f"{transforms=}")

with beam.Pipeline() as p:
    p | transforms  # type: ignore[reportUnusedExpression]

with xr.open_zarr(target_store) as ds:
    print(ds)
```
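For reference, here is what the URL template expands to for the first month (a quick standalone check; I'm using stdlib `datetime` here instead of `pandas`, but the formatting behavior is the same):

```python
from datetime import datetime

# Same template as in recipe.py
url_pattern = (
    "https://dap.ceda.ac.uk/neodc/esacci/land_surface_temperature/data/"
    "MULTISENSOR_IRCDR/L3S/0.01/v2.00/monthly/{time:%Y}/{time:%m}/"
    "ESACCI-LST-L3S-LST-IRCDR_-0.01deg_1MONTHLY_DAY-{time:%Y%m}01000000-fv2.00.nc"
)

# str.format passes the %-codes through to datetime's strftime-style
# formatting, so {time:%Y} -> "1995", {time:%m} -> "08", {time:%Y%m} -> "199508"
first_url = url_pattern.format(time=datetime(1995, 8, 1))
print(first_url)
```

This prints a URL ending in `.../monthly/1995/08/ESACCI-LST-L3S-LST-IRCDR_-0.01deg_1MONTHLY_DAY-19950801000000-fv2.00.nc`, which matches the file layout on the CEDA archive as far as I can tell.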
NOTE: To reduce the likelihood of a memory problem, I pruned my pattern to a single element. Unfortunately, this didn’t help.
When I run this, the process is eventually killed because it consumes an enormous amount of memory. I saw the Python process exceed 40 GB (on my 16 GB machine), and it may well have gone higher while I wasn’t watching; it ran for about 3.5 hours:
```
$ time python recipe.py
...
.../python3.11/site-packages/xarray/core/dataset.py:2461: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
  return to_zarr(  # type: ignore[call-overload,misc]
Killed: 9

real    216m31.108s
user    76m14.794s
sys     90m21.965s
.../python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
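One thing I did wonder about while staring at this: my `target_chunks` ask for 5 × 5 lat/lon chunks. If the 0.01° grid is global, that would be roughly 18,000 × 36,000 points (that grid size is my assumption; I haven't verified it against the files), which implies a huge number of chunks per time step:

```python
import math

# Assumed global 0.01-degree grid; these sizes are my guess, not verified
n_lat, n_lon = 18_000, 36_000

# target_chunks from the recipe above
chunk_lat, chunk_lon = 5, 5

# Number of Zarr chunks needed to cover a single time slice
chunks_per_timestep = math.ceil(n_lat / chunk_lat) * math.ceil(n_lon / chunk_lon)
print(f"{chunks_per_timestep:,}")  # 25,920,000
```

Even if my guess at the grid size is off, the order of magnitude makes me wonder whether the tiny lat/lon chunks are part of the problem, so I'd welcome a sanity check on that choice too.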
I’m going to downgrade pangeo-forge-recipes to a version prior to the recently introduced breaking API changes, to see whether I hit the same problem with the old API. In the meantime, is there anything glaringly wrong with what I’ve written above that would cause the memory issue?