Writing large datasets to tif files - best practice?

apurba-biswas · August 1, 2022, 5:08pm

Hi there,

I’m new to Pangeo and Dask, so any feedback or tips are welcome! I’m looking for best practices for writing large datasets to .tif files. @TomAugspurger

I’m trying to save a large xarray dataset (multiple masks of a 15GB raster) to a .tif file using rioxarray & rasterio, and am currently forced to write each band to a separate .tif file. This takes a long time (~10 minutes per band).
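For concreteness, here is a minimal sketch of what a single multi-band write could look like. It is not my exact code. It assumes the masks are 2-D variables of one Dataset on a shared y/x grid, ideally Dask-chunked. The helper names `gtiff_write_options` and `write_multiband`, the block size, and the compression choice are all hypothetical. The keyword arguments are passed through to rasterio/GDAL as GeoTIFF creation options.

```python
import threading


def gtiff_write_options(block_size=512, compress="deflate"):
    """Build GeoTIFF creation options for a large, tiled output file.

    Hypothetical helper: the defaults are illustrative, not tuned values.
    """
    return {
        "driver": "GTiff",
        "tiled": True,             # internal tiles instead of strips
        "blockxsize": block_size,  # GeoTIFF tile sizes must be multiples of 16
        "blockysize": block_size,
        "compress": compress,
        "BIGTIFF": "IF_SAFER",     # files over 4 GB need BigTIFF
    }


def write_multiband(ds, path, **options):
    """Stack the variables of `ds` into bands and write one GeoTIFF.

    Assumes every variable in `ds` is 2-D (y, x) on the same grid.
    """
    import rioxarray  # noqa: F401  (registers the .rio accessor)

    # Stack the variables along a new leading "band" dimension.
    stacked = ds.to_array(dim="band")

    # With Dask-backed data, passing a lock lets rioxarray write chunk by
    # chunk; windowed=True only matters when the data is not Dask-backed.
    stacked.rio.to_raster(
        path,
        windowed=True,
        lock=threading.Lock(),
        **(options or gtiff_write_options()),
    )
```

Usage would be something like `write_multiband(ds, "masks.tif")`, where `ds` and the output path are placeholders. I don't know whether this would actually be faster than writing one file per band.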

I’ve read a few posts recommending Zarr as a storage format for big data. However, some of my downstream operations (e.g. reprojecting from disk) rely on the .tif format (see rioxarray’s WarpedVRT).

I’m working on Amazon SageMaker with my own deployment of a Docker container. For this operation I’m on an ml.m5.4xlarge (16 vCPUs, 64 GiB). I have access to clusters if need be (via Coiled), but am not sure how to leverage Dask to utilise a cluster well.

I’m happy to share more details, but wanted to gauge interest first.
