Optimal object size in the cloud

It seems that a consensus has formed that ~100 MB (+/- something big) is the optimal size for objects, or reads, from object stores in the cloud. I am wondering what data there is to support that consensus?

Thanks in advance…

@tedhabermann This is a great question that comes up fairly often. (I was just asked this by a USGS colleague, and a Google search landed me here!)

I found several suggestions of optimal chunk sizes:

But like you, I couldn’t find actual data to back up these suggestions! I vaguely remember someone presenting data on this though. @rabernat? @dcherian? @andersy005 ?

I think this largely depends on the overall size of the dataset that the objects make up, or, in the end, just on the number of objects that results.

As @rabernat says in the 3rd link:

However, the tradeoff is that a huge number of chunks translates to a huge number of files / objects to manage. For a 1 TB in 4 MB chunks, there are 250_000 chunks to keep track of! This creates a big overhead on the object store if you ever want to delete / move / rename the data.

This holds for object stores and file systems alike, and also for distributed frameworks such as Dask or Spark.
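
To put numbers on the tradeoff quoted above, here is a small sketch that counts how many objects a dataset turns into at a few object sizes. It assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), and `object_count` is just an illustrative helper, not part of any library.

```python
MB = 10**6   # decimal megabyte (assumption: decimal units throughout)
TB = 10**12  # decimal terabyte


def object_count(dataset_bytes: int, object_bytes: int) -> int:
    """Number of objects needed to hold dataset_bytes at object_bytes each."""
    # Ceiling division with integers, so a partial last object still counts.
    return -(-dataset_bytes // object_bytes)


if __name__ == "__main__":
    for size_mb in (4, 10, 100, 1000):
        n = object_count(1 * TB, size_mb * MB)
        print(f"{size_mb:>5} MB objects: {n:>8,} objects per TB")
```

Going from 4 MB to 100 MB objects cuts the number of things to list, move, or delete by a factor of 25, which is the management overhead the quote is about.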

In the link from Amazon, they talk about

concurrent requests for byte ranges of an object at the granularity of 8–16 MB

which doesn’t mean that an entire object should be that small. But it does tell us that this is probably a good minimum size for an object, and that ideally we should be able to read an object in chunks of 8 or 16 MB.
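
As a rough illustration of what reading an object "by chunks" looks like, this sketch splits an object into byte ranges and formats the HTTP `Range` header value for each one. The 8 MiB part size is my assumption based on the 8–16 MB figure above, and `byte_ranges` / `range_header` are hypothetical helper names. Actually issuing the requests concurrently is left to whatever client you use.

```python
MIB = 1024 * 1024


def byte_ranges(object_size: int, part_size: int = 8 * MIB):
    """Yield inclusive (start, end) byte offsets that cover the whole object."""
    for start in range(0, object_size, part_size):
        # HTTP byte ranges are inclusive, hence the -1 on the end offset.
        end = min(start + part_size, object_size) - 1
        yield start, end


def range_header(start: int, end: int) -> str:
    """Format one range as an HTTP Range header value."""
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    # A 100 MiB object read in 8 MiB parts (the last part is shorter).
    for start, end in byte_ranges(100 * MIB):
        print(range_header(start, end))
```

So a ~100 MB object is still readable in 8 MB pieces; the object size and the read size don't have to be the same thing.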

So I’d say that those recommendations are converging:

  • Start with an object size of at least 8-10 MB. This may be OK for datasets up to 1 TB.
  • Increase it if your dataset is really big, up to something like 1 or 2 GB per object, or more if you’ve got a lot of RAM for your Dask workers (or equivalent) and a really, really big dataset.
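
The two bullets above can be turned into a toy heuristic. This is only a sketch of one way to read them, not a measured result: the cap of 100,000 objects is my own assumption (it is what 10 MB objects give you at 1 TB), the 8 MB floor and 2 GB ceiling come from the numbers in the list, and `suggest_object_size` is a made-up name.

```python
MB = 10**6
GB = 10**9
TB = 10**12


def suggest_object_size(
    dataset_bytes: int,
    max_objects: int = 100_000,  # assumption: keep the object count manageable
    min_bytes: int = 8 * MB,     # lower bound from the recommendations above
    max_bytes: int = 2 * GB,     # upper bound from the recommendations above
) -> int:
    """Pick an object size that keeps the object count near max_objects."""
    # Size needed to stay at or under max_objects (integer ceiling division).
    needed = -(-dataset_bytes // max_objects)
    # Clamp into the [min_bytes, max_bytes] window.
    return max(min_bytes, min(max_bytes, needed))


if __name__ == "__main__":
    for label, size in (("10 GB", 10 * GB), ("1 TB", TB), ("100 TB", 100 * TB)):
        print(f"{label:>7} dataset: ~{suggest_object_size(size) // MB} MB objects")
```

The upper bound is the part most worth adjusting: it should really follow from how much memory each worker has, which this sketch doesn't model.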

We wrote some rules of thumb for Dask chunks here:
https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes

Most of them probably apply to objects in the cloud too.

Thanks @geynard for chiming in – I had not seen that Dask blog post! Nice!
