Hi all!
Thank you for making this knowledge source available, it is very useful. I've heard about Pangeo since I'm a user of CNES facilities.
In short: when I try to fill a single zarr store from multiple netcdf files using concurrent.futures, the store is not written correctly, and I end up with an error about conflicting dimension sizes when I try to load it.
I have read some related topics here and there, including on this forum, but have not succeeded in applying the solutions to my specific problem. My apologies if this question has already been answered.
Below, I'll give the general context and then explain my issue.
Here is the context of my question:
Sentinel-3 altimetry files are distributed as netcdf files, one file per orbit.
The structure of a netcdf file is quite simple: some one-dimensional variables with a time dimension.
The length of the time dimension varies from one file to another.
My whole environment relies on xarray, so I would like to keep using xarray for I/O.
Each file is easily read with xarray.open_dataset, for example (the filename below is just a placeholder):
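import os
import xarray as xr

ds = xr.open_dataset(os.path.join(datadir, 'S3A_SR_2_WAT____example.nc'))  # placeholder filename, datadir defined elsewhere
print(ds)

which prints something like: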
<xarray.Dataset>
Dimensions:                (time: 2553)
Coordinates:
  * time                   (time) datetime64[ns] 2017...
    lat                    (time) float64 ...
    lon                    (time) float64 ...
Data variables: (12/53)
    surf_type_01           (time) float32 ...
    dist_coast_01          (time) float64 ...
    range_ocean_rms_01_ku  (time) float32 ...
    ...
Loading a large batch of individual netcdf files with open_mfdataset is sometimes very slow, so I'm trying to improve performance using zarr.
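For reference, my current approach looks roughly like this (a simplified sketch; the combine/concat_dim options are just one possible way to call open_mfdataset):

import glob
import os
import xarray as xr

# current (slow) approach: open on the order of 1000 orbit files at once
flist = sorted(glob.glob(os.path.join(datadir, 'S3A_SR_2_WAT____*_*_*_*_01[0]_???_*')))
ds_all = xr.open_mfdataset(flist, combine='nested', concat_dim='time')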
To do so, I'm trying to write each individual netcdf file into a single zarr store using concurrent.futures.
I found in various places:
- how to initialize the store with compute=False
- how to set the append mode and declare the dimension to expand (append_dim)
- that it can be necessary to use a zarr synchronizer
import os
import xarray as xr
import zarr

ZARR_SYNC = zarr.ProcessSynchronizer('zarr/sync_zarr.sync')

def load_netcdf_write_zarr_1store(fname):
    # read one netcdf file and append it along 'time' to the shared zarr store
    ds = xr.open_dataset(os.path.join(datadir, fname))  # datadir defined elsewhere
    store = os.path.join(zarrdir, 'zarr_store_from_netcdf.zarr')  # zarrdir defined elsewhere
    ds.to_zarr(store, mode='a', append_dim='time', synchronizer=ZARR_SYNC)
    return 1
import concurrent.futures
import glob
import time

flist = glob.glob(os.path.join(datadir, 'S3A_SR_2_WAT____*_*_*_*_01[0]_???_*'))
flist.sort()
print('#', len(flist))

# Use the first file as a template: only its dtypes, shapes and chunks matter here,
# the actual array values are irrelevant
fname = flist[0]
ds = xr.open_dataset(os.path.join(datadir, fname))

# Now we write the metadata without computing any array values
store = os.path.join(zarrdir, 'zarr_store_from_netcdf.zarr')
ds.to_zarr(store, compute=False, synchronizer=ZARR_SYNC)

# Parallel writes, one task per netcdf file
start = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    out = executor.map(load_netcdf_write_zarr_1store, flist)
print("Parallel write time:", time.time() - start)
Everything seems to run fine when this code is executed over about 1000 netcdf files.
But when I try to load the result with xarray.open_zarr, it seems that the zarr store is not properly filled.
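The failing call is simply (same store path as above):

ds = xr.open_zarr(os.path.join(zarrdir, 'zarr_store_from_netcdf.zarr'))

which raises: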
ValueError: conflicting sizes for dimension 'time': length 31616 on 'wtc' and length 39802 on {'time': 'att_ku'}
So here it is. This is probably a common situation, but I could not find a way to fix it.
I've seen some topics related to the 'region' option, but since the time dimension varies from one file to another and I use concurrent writing, I can't see how it would apply here.
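From what I understand, a region-based write would look roughly like the sketch below (the offsets computation and the write_region helper are just my illustration, not working code), but I don't see how to handle the per-file offsets cleanly when every file has a different length and the writes happen concurrently.

# sketch of the 'region' approach as I understand it:
# the position of each file along 'time' must be known before writing
sizes = [xr.open_dataset(os.path.join(datadir, f)).sizes['time'] for f in flist]
offsets = [0]
for s in sizes[:-1]:
    offsets.append(offsets[-1] + s)

# a full-size template would first be written with compute=False so that the store
# already has the final 'time' length; then each worker writes only its own slice
def write_region(fname, offset, size):
    ds = xr.open_dataset(os.path.join(datadir, fname))
    store = os.path.join(zarrdir, 'zarr_store_from_netcdf.zarr')
    ds.to_zarr(store, region={'time': slice(offset, offset + size)})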
Any suggestion is more than welcome, thank you!