Cloud is particularly useful if you have a lot of people who want to access the same BIG datasets and do distributed processing on them. A direct cost comparison of a hard drive on your desk vs. S3 is really apples-to-oranges. A fast SSD works great if you are the only person who needs the data and you are accessing it from the machine the drive is directly attached to. For feeding data to ML training on a single machine, I'm sure nothing beats downloading the data onto a fast local SSD.
But if you have a team of 100 people across the world who need to access the data simultaneously, you need 100 hard drives and 100 copies of the data. That’s where S3 starts to look more attractive.
Cloud storage goes over the network, so its single-machine throughput is usually limited by network bandwidth. 700 MB/s is a decently fast network speed. Maybe you could tweak something to do better, but not by an order of magnitude. The performance benefits of cloud storage are only evident when you move into distributed mode. We gave some results about this in the paper linked below.
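To make the single-machine bottleneck concrete, here is a minimal sketch of how you might time a read from GCS on one machine using fsspec/gcsfs. The bucket and object path are hypothetical placeholders, and the credentials setup will depend on your environment:

```python
# Minimal sketch: measure single-machine read throughput from cloud object storage.
# The bucket/object path is hypothetical; substitute your own data and credentials.
import time

import fsspec

fs = fsspec.filesystem("gs")            # uses default GCS credentials
path = "gs://my-bucket/big-chunk.bin"   # hypothetical object

start = time.perf_counter()
with fs.open(path, "rb") as f:
    data = f.read()
elapsed = time.perf_counter() - start

print(f"Read {len(data) / 1e6:.0f} MB in {elapsed:.1f} s "
      f"({len(data) / 1e6 / elapsed:.0f} MB/s)")
```

On a single node you would expect this number to top out around your network link speed (the ~700 MB/s figure above), no matter how fast the storage backend is.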
This figure shows the throughput from GCS (and other storage options) as a function of the number of distributed Dask workers. With modest levels of parallelism (~20 workers), we can easily get to 5 GB/s throughput, comparable to the fastest SSDs. That's because the distributed I/O overcomes the single-machine network bottleneck: each worker reads over its own network link, so aggregate bandwidth scales with the number of workers.
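As a rough illustration of that kind of aggregate-throughput measurement, here is a hedged sketch using Dask. The cluster setup, Zarr store path, and anonymous-access token are all assumptions for illustration (in the benchmarks above the workers ran on a real distributed cluster, not a LocalCluster), not the exact code from the paper:

```python
# Hedged sketch: aggregate read throughput from GCS across many Dask workers.
# Paths, worker counts, and credentials are placeholders, not the paper's setup.
import time

import dask.array as da
import gcsfs
from dask.distributed import Client, LocalCluster

# In practice this would be a Kubernetes/HPC cluster with ~20 workers;
# a LocalCluster stands in here so the sketch runs on one machine.
cluster = LocalCluster(n_workers=20, threads_per_worker=1)
client = Client(cluster)

fs = gcsfs.GCSFileSystem(token="anon")              # assumes a public-read bucket
store = fs.get_mapper("gs://my-bucket/data.zarr")   # hypothetical Zarr store
arr = da.from_zarr(store)

start = time.perf_counter()
arr.sum().compute()                                  # forces every chunk to be read
elapsed = time.perf_counter() - start
print(f"{arr.nbytes / 1e9 / elapsed:.1f} GB/s aggregate read throughput")
```

The key point is that each worker opens its own connection to GCS, so total throughput grows with worker count rather than being pinned to one machine's network interface.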
In Pangeo, we tend to focus on the case where there is a big dataset shared by lots of people (CMIP6 is the prime example). But this certainly isn't the scenario for every data science team.