Hi there,
I’m new to Pangeo and Dask, so any feedback or tips are welcome! I’m looking for best practices for writing large datasets to .tif. @TomAugspurger
I’m trying to save a large xarray dataset (multiple masks of a 15GB raster) to a .tif file using rioxarray and rasterio, and am currently forced to write each band to its own .tif file. This takes a long time (~10 minutes per band).
I’ve read a little about Zarr being recommended as a storage format for big data. However, some of my downstream operations (e.g. reprojecting from disk) rely on the .tif format (see rioxarray’s WarpedVRT).
I’m working on Amazon SageMaker with my own deployment of a Docker container. For this operation I’m on an ml.m5.4xlarge (16 vCPUs, 64 GiB). I have access to clusters if need be (via Coiled), but am not sure how to leverage Dask to utilise a cluster well.
I’m happy to share more details, but wanted to gauge interest first.