I am interested in running a particle tracking software (OpenDrift) in a Pangeo cloud environment.
There are quite a few stages before this can be optimised. At the moment I am trying to move the output of the software to an xarray.Dataset.to_zarr approach. Basically, the software currently appends bits of buffer to a netCDF file, which limits the amount of memory used.
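For context, the current output pattern is roughly the following (a schematic sketch, not the actual OpenDrift code; file and variable names are made up):

```python
import numpy as np
from netCDF4 import Dataset

buffer = np.random.rand(10)        # one buffer's worth of freshly computed values

nc = Dataset("output.nc", "a")     # file created earlier with an unlimited 'time' dimension
n = nc.dimensions["time"].size     # number of records already on disk
nc.variables["prop1"][n:n + len(buffer)] = buffer  # append the buffer along the unlimited dimension
nc.close()                         # the buffer can then be discarded, so memory use stays flat
```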
I used this reference which, in my opinion, is a bit incomplete here; there is also a similar question here in the forum.
I want to reproduce these examples: initialise a dataset whose variables are empty dask arrays, initialise the store with to_zarr(compute=False), write bits of buffer into regions by time slice, and release the memory afterwards. This method is only interesting if the dask arrays are actually released after they have been written to the zarr store, but I have no way of making sure that this is the case.
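The pattern I am trying to reproduce looks roughly like this (a minimal sketch adapted from the xarray docs; the store path, sizes and chunking are placeholders):

```python
import numpy as np
import dask.array
import xarray as xr

path = "experiment.zarr"                                # placeholder store location
times = np.arange("2020-01-01", "2020-01-11", dtype="datetime64[h]")

# lazy template: the variable is an empty dask array, nothing is held in memory
template = xr.Dataset(
    {"prop": ("time", dask.array.zeros(len(times), dtype=np.float64, chunks=(24,)))},
    coords={"time": ("time", times)},
)

# initialise the store: metadata and coordinates are written, the dask zeros are never computed
template.to_zarr(path, compute=False)

# later, write one buffer of real values into its region of the store
region = slice(0, 24)
buffer = xr.Dataset({"prop": ("time", np.random.rand(24))})
buffer.to_zarr(path, region={"time": region})
```

In this minimal form the buffer is a plain numpy array, so it should simply be garbage collected once it goes out of scope.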
There are lots of documents on how to read and manipulate data with xarray/zarr, but I would really value a tutorial on best practices for building a dataset so that it follows the philosophy of the Pangeo cloud environment.
Here is an idea of how the thing is coded at the moment:
```python
import numpy as np
import dask.array
import xarray as xr

times = np.arange(start_time, end_time, timedelta)  # a np.datetime64 array of my experiment
buffer_size = n_timesteps                            # number of timesteps written per buffer
props = ['prop1', 'prop2']                           # etc. -- names of the output variables

# one empty (lazy) dask array per variable, chunked by buffer size
data_vars = {
    prop: ('time', dask.array.zeros(len(times), dtype=np.float64, chunks=(buffer_size,)))
    for prop in props
}

xr_dataset = xr.Dataset(data_vars, coords={'time': ('time', times)})
xr_dataset.time.attrs.update({'standard_name': 'time', 'long_name': 'time'})  # some CF compliant attributes
xr_dataset.time.encoding['units'] = 'seconds since 1970-01-01 00:00:00'

# initialise the zarr store: metadata and coordinates are written, the dask zeros are not computed
xr_dataset.to_zarr(path, compute=False)

def write_buffer(time_slice, small_arrays):
    for prop in props:
        # overwrite the lazy zeros of this slice with the freshly computed buffer
        xr_dataset[prop][dict(time=time_slice)] = small_arrays[prop]
    # write only this region of the store; the time coordinate is already on disk
    xr_dataset.isel(time=time_slice).drop_vars('time').to_zarr(path, region={'time': time_slice})

def loop_of_iterations_going_through_time():
    for start in range(0, len(times), buffer_size):
        time_slice = slice(start, min(start + buffer_size, len(times)))
        small_arrays = ...  # compute stuff and make small sized arrays of the variables
        write_buffer(time_slice, small_arrays)

loop_of_iterations_going_through_time()
xr_dataset.close()
```
How can I ensure that the parts of xr_dataset already written to zarr are released at the end of each buffer?
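So far the only check I can think of is to watch the resident memory of the process around each buffer write, something like this (using psutil, just as a rough diagnostic):

```python
import os
import psutil

process = psutil.Process(os.getpid())

def rss_mb():
    return process.memory_info().rss / 1e6  # resident memory of this process, in MB

print(f"before buffer: {rss_mb():.1f} MB")
write_buffer(time_slice, small_arrays)
del small_arrays                             # drop my own reference to the buffer
print(f"after buffer:  {rss_mb():.1f} MB")
```

But I am not sure this tells me whether xr_dataset itself still holds references to the slices that have already been written.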