MemoryError when trying to save a dataset to a NetCDF file

Sakina-A · September 18, 2024, 7:08am

Hello,

I am working with MOD16A2GF (evapotranspiration) data as an .nc file with an 8-day temporal resolution and a 500 m spatial resolution, as shown in the picture below.

I then converted the time coordinate from object dtype to datetime64[ns] and replaced the _FillValue values in the dataset with NaN using the code below:

# Convert the 'time' coordinate from cftime.DatetimeJulian to datetime64[ns]
time_values = mod16netcdf['time'].values
converted_time = pd.to_datetime([t.strftime('%Y-%m-%d') for t in time_values])

# Replace the 'time' coordinate in the dataset
mod16netcdf['time'] = converted_time

# Verify the conversion
mod16netcdf 

# Replace fill values (raw 32761 to 32767; compared below as 3276.1 to 3276.7, inclusive) in 'ET_500m' and 'PET_500m' with NaN
mod16netcdf['ET_500m'] = mod16netcdf['ET_500m'].where(~((mod16netcdf['ET_500m'] >= 3276.1) & (mod16netcdf['ET_500m'] <= 3276.7)), np.nan)
mod16netcdf['PET_500m'] = mod16netcdf['PET_500m'].where(~((mod16netcdf['PET_500m'] >= 3276.1) & (mod16netcdf['PET_500m'] <= 3276.7)), np.nan)

After that, I calculated the physical values by multiplying by the scale factor and then converted them to monthly data using the code below:

# Calculate the real value by multiplying by 0.1 (scale factor)
ET_scaled = mod16netcdf['ET_500m'] * 0.1
PET_scaled = mod16netcdf['PET_500m'] * 0.1

# Convert 8-day ET_scaled and PET_scaled data to monthly data
# For the monthly mean the unit is mm/8day (the average 8-day rate of evapotranspiration over all 8-day intervals in a month)
ET_monthly_mean_scaled = ET_scaled.resample(time='ME').mean()
PET_monthly_mean_scaled = PET_scaled.resample(time='ME').mean()
ET_Q_monthly_mean_scaled = mod16netcdf['ET_QC_500m'].resample(time='ME').mean()

# For the monthly sum the unit is mm/month (the sum of the 8-day evapotranspiration rates over a month)
ET_monthly_sum_scaled = ET_scaled.resample(time='ME').sum()
PET_monthly_sum_scaled = PET_scaled.resample(time='ME').sum()
ET_Q_monthly_sum_scaled = mod16netcdf['ET_QC_500m'].resample(time='ME').sum()

# Create a new Dataset to hold all the variables
ds = xr.Dataset({
    'ET_monthly_mean_scaled_mmper8day': ET_monthly_mean_scaled,
    'PET_monthly_mean_scaled_mmper8day': PET_monthly_mean_scaled,
    'ET_monthly_sum_scaled_mmpermonth': ET_monthly_sum_scaled,
    'PET_monthly_sum_scaled_mmpermonth': PET_monthly_sum_scaled,
    'ET_Q_monthly_mean_scaled': ET_Q_monthly_mean_scaled,
    'ET_Q_monthly_sum_scaled': ET_Q_monthly_sum_scaled,
})

ds
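A side note on the monthly sums above: after fill values are replaced with NaN, xarray's sum skips NaN by default, so a month in which every 8-day value is missing comes out as 0 rather than NaN. The sketch below illustrates this on a made-up single-pixel series (the values and dates are invented for illustration) and shows the min_count=1 option as one possible way to keep such months as NaN. It uses freq "MS" instead of the "ME" used in the post only so that it also runs on older pandas versions.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic 8-day series for one pixel (illustrative values only):
# four valid values in January, three missing values in February.
time = pd.date_range("2001-01-01", periods=7, freq="8D")
et = xr.DataArray(
    [1.0, 2.0, 3.0, 4.0, np.nan, np.nan, np.nan],
    dims="time",
    coords={"time": time},
)

# Default behaviour: NaN is skipped, so the all-NaN month sums to 0.0.
monthly_default = et.resample(time="MS").sum()

# With min_count=1, a month needs at least one valid value,
# otherwise the result for that month is NaN.
monthly_masked = et.resample(time="MS").sum(min_count=1)

print(monthly_default.values)
print(monthly_masked.values)
```

Whether 0 or NaN is the right answer for a fully masked month depends on the analysis; the point is only that the two are different.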

As a final step, I created a function to upscale the modified data to coarser resolutions (0.05°, 0.25°, and 0.05°) and then save each result to a new .nc file, using the code below:

def upscale_dataset(dataset, target_res):
    """
    Upscale all variables in the dataset to a coarser resolution.
    
    Parameters:
    dataset (xarray.Dataset): The input dataset with variables to upscale.
    target_res (float): The target spatial resolution (e.g., 0.25 or 1.0 degrees).
    
    Returns:
    xarray.Dataset: The upscaled dataset with all variables at the new resolution.
    """
    # Get the current resolution and coordinate steps
    lat_res = np.abs(dataset.lat[1] - dataset.lat[0])
    lon_res = np.abs(dataset.lon[1] - dataset.lon[0])
    
    # Calculate the number of original grid cells per target resolution
    lat_factor = int(target_res / lat_res)
    lon_factor = int(target_res / lon_res)
    
    # Dictionary to hold the upscaled variables
    upscaled_vars = {}
    
    # Loop over all variables in the dataset
    for var_name in dataset.data_vars:
        # Apply coarsen and mean to each variable
        upscaled_vars[var_name] = (
            dataset[var_name]
            .coarsen(lat=lat_factor, lon=lon_factor, boundary="trim")
            .mean()
        )
    
    # Create a new dataset with upscaled variables
    dataset_upscaled = xr.Dataset(
        upscaled_vars,
        coords={
            'lat': upscaled_vars[var_name].lat,
            'lon': upscaled_vars[var_name].lon,
            'time': dataset.time
        }
    )
    
    return dataset_upscaled

# Upscale to 0.05-degree resolution
evtp_ds_005 = upscale_dataset(ds, 0.05)

evtp_ds_005
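For comparison, the same coarsening can probably be written more compactly, because coarsen can be called on a whole Dataset instead of looping over variables. The sketch below is not the code from the post: upscale_dataset_simple is a hypothetical name, the tiny grid is synthetic, and round() is swapped in for int() on the assumption that the target resolution is meant to be an integer multiple of the native one (int() truncates, so a ratio such as 11.999999 caused by floating-point error would become 11).

```python
import numpy as np
import xarray as xr


def upscale_dataset_simple(dataset, target_res):
    """Hypothetical variant: coarsen every variable of the dataset at once."""
    # Native grid spacing taken from the first two coordinate values.
    lat_res = abs(float(dataset.lat[1] - dataset.lat[0]))
    lon_res = abs(float(dataset.lon[1] - dataset.lon[0]))

    # Number of native cells per target cell; round() guards against
    # floating-point ratios that land just below an integer.
    lat_factor = max(1, round(target_res / lat_res))
    lon_factor = max(1, round(target_res / lon_res))

    # Dataset.coarsen applies to all data variables and keeps the time axis.
    return dataset.coarsen(lat=lat_factor, lon=lon_factor, boundary="trim").mean()


# Tiny synthetic grid with 0.0125-degree spacing (illustrative only).
demo = xr.Dataset(
    {"et": (("time", "lat", "lon"), np.ones((2, 8, 12), dtype="float32"))},
    coords={
        "time": [0, 1],
        "lat": np.arange(8) * 0.0125,
        "lon": np.arange(12) * 0.0125,
    },
)

coarse = upscale_dataset_simple(demo, 0.05)
print(dict(coarse.sizes))
```

In this toy case every 4 x 4 block of native cells is averaged into one coarse cell.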

But when I try to save evtp_ds_005 into an .nc file I get an error saying:
MemoryError: Unable to allocate 40.1 GiB for an array with shape (1059, 3912, 2596) and data type float32
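The size in this message can be reproduced from the shape and dtype alone, which helps show what is being allocated: a single float32 array covering all 1059 time steps on a 3912 x 2596 grid. The arithmetic below uses only the numbers from the error message; the per-time-step figure is added for comparison.

```python
import math

# Shape and dtype exactly as reported in the MemoryError.
shape = (1059, 3912, 2596)
itemsize = 4  # bytes per float32 value

# Total size of the array that NumPy tried to allocate.
n_bytes = math.prod(shape) * itemsize
gib = n_bytes / 2**30

# Size of one time step (3912 x 2596 cells) for comparison.
one_step_mib = math.prod(shape[1:]) * itemsize / 2**20

print(f"full array: {gib:.1f} GiB")
print(f"one time step: {one_step_mib:.1f} MiB")
```

A single time step is under 39 MiB, so processing the data one time step at a time should fit comfortably in memory, whereas the full array does not.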

I tried to subset the data but also got the same error even when using one subset:

subset_1 = evtp_ds_005.isel(time=slice(0, 100))
subset_1.to_netcdf('EVTP_upscaled_0.05deg_subset_1.nc')

I also tried chunks but the error remained.

Sakina-A · September 18, 2024, 7:12am

The picture below shows the dataset I want to save as an .nc file.

rabernat · September 18, 2024, 1:07pm

Could you provide a link to the file?

Sakina-A · September 19, 2024, 4:20am

Sure, the file is around 13 GB:

dcherian · September 26, 2024, 5:30pm

You’ll need to specify chunks at read time: e.g. {"time": 1}.
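A minimal sketch of what this suggestion might look like in practice, assuming dask is installed. open_lazily is a hypothetical helper name, and a small in-memory dataset chunked the same way stands in for the real 13 GB file so the example can run on its own; the variable name ET_500m and the 0.1 scale factor are taken from the original post.

```python
import numpy as np
import xarray as xr


def open_lazily(path, time_chunk=1):
    """Hypothetical helper: open a NetCDF file as dask arrays,
    with `time_chunk` time steps per chunk, instead of loading it eagerly."""
    return xr.open_dataset(path, chunks={"time": time_chunk})


# Stand-in for the real file: a small dataset chunked the same way.
demo = xr.Dataset(
    {"ET_500m": (("time", "lat", "lon"), np.ones((4, 6, 6), dtype="float32"))},
    coords={"time": np.arange(4), "lat": np.arange(6.0), "lon": np.arange(6.0)},
).chunk({"time": 1})

# These operations only build a task graph; no data is computed yet.
scaled = demo["ET_500m"] * 0.1
coarse = scaled.coarsen(lat=3, lon=3, boundary="trim").mean()
is_lazy = coarse.chunks is not None

# The work happens chunk by chunk when the result is computed or written.
result = coarse.compute()
print(is_lazy, result.shape)
```

With the real data, the equivalent would be something like ds = open_lazily("path/to/file.nc") (the path is a placeholder), followed by the same processing and a final to_netcdf call, which writes dask-backed variables chunk by chunk rather than materializing the whole array first.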

Also see xarray-regrid ("Regridding utilities for xarray", xarray-regrid 0.4.0 documentation) for an easy way to do the coarsening.

Sakina-A · September 27, 2024, 9:06am

Hey @dcherian
Using chunks at read time worked perfectly! Thank you so much.
