Extremely slow write to S3 bucket with xarray.Dataset.to_zarr

Hi Masha and Ryan, hi everyone,

I’ve been following Pangeo discussions regularly over the past two years. Several times I wanted to comment, and maybe I should have. The first time was on the famous rechunking issue. What a thread! Today, this thread is the one for my coming out.

First of all, I totally agree with Ryan’s answer #1: writing 12.8 GB of data should take a few seconds, not one hour. I saw Masha’s problem as a nice homework exercise. It took me 30 minutes to set up a script and optimize it a little bit, to finally reach a speed of 10.3 s to write the 12.8 GB (5,000 x 800 x 800 float32). Reading 100 time series (only 5,000 instants each) takes 5.16 s. 20,000 snapshots would simply take four times longer, viz. ~21 s. On the reading I’m matching your speed. On the writing I’m much faster, even faster than Ryan, who announced 48 s, but only for 1,000 snapshots.
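To fix ideas, here is roughly the kind of timing script I mean, written with plain xarray/dask/zarr rather than my own tools. This is just a minimal sketch: the chunk sizes, the random data and the local target path are placeholders, not my actual setup (for S3 you would point to an s3fs/fsspec store instead).

```python
import time

import dask.array as da
import numpy as np
import xarray as xr

nt, ny, nx = 5_000, 800, 800  # 5,000 x 800 x 800 float32 ~= 12.8 GB

# Lazy random data so the full array never sits in memory at once.
data = da.random.random((nt, ny, nx), chunks=(500, 100, 100)).astype("float32")
ds = xr.Dataset({"field": (("time", "y", "x"), data)})

# Write the whole dataset; swap the local path for an S3 store
# (e.g. "s3://bucket/benchmark.zarr" via s3fs) to target object storage.
t0 = time.perf_counter()
ds.to_zarr("benchmark.zarr", mode="w")
print(f"write 12.8 GB: {time.perf_counter() - t0:.1f} s")

# Read 100 full time series (5,000 instants each) at random grid points.
dsr = xr.open_zarr("benchmark.zarr")
iy = np.random.randint(0, ny, 100)
ix = np.random.randint(0, nx, 100)

t0 = time.perf_counter()
series = [dsr["field"][:, j, i].values for j, i in zip(iy, ix)]
print(f"read 100 time series: {time.perf_counter() - t0:.1f} s")
```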

Okay, what tools am I using, on what architecture? Well, these are my tools, not Pangeo’s. They are in pure Python and still at a prototype stage, but I believe they may be of interest to some of the community. They are adapted to the Lustre architecture. I know Lustre is not the kind of storage that receives the favours of Pangeo, but still, a lot of clusters do have a Lustre filesystem attached. I designed these tools to achieve fast read access to large datasets. And when I say large, I mean large: our experiment is … 1.6 PB. Yes, you read that right, petabytes. The policy on Lustre is to have a few large files rather than lots of small ones, so zarr, if I understand correctly, is not really an option (see the quick check below). Well, I should probably continue in another thread.
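Just to illustrate the small-files point, a quick count of the objects inside a zarr store like the sketch above (same placeholder path) shows one file per chunk plus metadata, which is exactly what the Lustre policy discourages:

```python
import os

# Count the objects inside the zarr store written in the sketch above.
# On Lustre every chunk ends up as its own small file; "benchmark.zarr"
# is the placeholder path from that sketch.
n_files = sum(len(files) for _, _, files in os.walk("benchmark.zarr"))
print(f"benchmark.zarr contains {n_files} files")
```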

To summarize: 10.3 s to write 12.8 GB of data, 5.3 s to read 1,000 time series of 5,000 points each. I’d be glad to see someone going faster! That would be worth a beer.

Cheers,

Guillaume
