My name is Saverio, and I work as a Data Engineer at Delft University of Technology, in the Netherlands. This is my first post!
As I work with data generated from various sensors (disdrometers, microwave radiometers, cloud radars, etc.), I'm looking into options for storing these data optimally. At the moment, I simply store them on a NAS (accessed through SFTP), but this obviously offers very limited possibilities for analysis…
I would like to ask the community for suggestions on a database, data lake, or filesystem for this kind of data.
Hi @saveriogzz, welcome to the community! Have you considered storage in a commercial cloud bucket? Python interfaces built on fsspec (such as s3fs for Amazon S3) make it quite convenient to read from and write to these endpoints.
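For instance, a minimal sketch with s3fs (the bucket and object names below are placeholders, and I'm assuming a publicly readable bucket):

```python
import s3fs

# Anonymous access to a (hypothetical) public bucket
fs = s3fs.S3FileSystem(anon=True)
print(fs.ls("my-bucket/sensor-data"))

# Stream a single object without downloading the whole file first
with fs.open("my-bucket/sensor-data/disdrometer_20210101.nc", "rb") as f:
    print(f.read(8))  # first bytes of the file (e.g. the HDF5/NetCDF4 magic number)
```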
but this obviously offers very limited possibilities for analysis…
As you mention analysis, one great advantage of cloud storage is the ability to spin up cloud compute resources adjacent to your data as needed, and then scale them down when the analysis is complete (i.e., “elastic scaling”).
Out of curiosity, what format(s) are your data stored in? This may also have implications for speed and ease of analysis on the cloud. Many in this community have had success with the cloud-optimized Zarr format, as described in this paper: Cloud-Native Repositories for Big Scientific Data - Authorea
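To give a flavour of what that looks like in practice, here's a rough sketch of opening a Zarr store on object storage with Xarray (the bucket path, variable name, and time coordinate are placeholders):

```python
import xarray as xr

# Hypothetical Zarr store; any fsspec-style URL works (s3://, gs://, ...)
# provided the matching filesystem package (s3fs, gcsfs, ...) is installed.
ds = xr.open_zarr("s3://my-bucket/cloud-radar.zarr", consolidated=True)
print(ds)

# Lazy, chunked access: only the requested chunks are read from the bucket
subset = ds["reflectivity"].sel(time="2021-01-01").mean().compute()
```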
Following up on the comments of @saveriogzz: if you have a lot of NetCDF4 data and want to make it available on the cloud in a performant way, you need to chunk it appropriately (e.g., chunk sizes of roughly 10-150 MB). You can use Rechunker to do this efficiently in Python. Rechunker writes Zarr, but if you want to distribute the data as NetCDF4 you can convert the Zarr back to NetCDF4 using Xarray and then create a ReferenceFileSystem JSON file to allow even faster access from Python.
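As a minimal sketch of the Rechunker API (assuming the data has already been written to a Zarr group; the paths, array names, and chunk sizes below are placeholders):

```python
import zarr
from rechunker import rechunk

# Hypothetical source: a Zarr group, e.g. written from the NetCDF files with xarray
source = zarr.open("sensor-data.zarr")

# Target chunks per array; aim for the ~10-150 MB range mentioned above.
# None leaves an array's chunking unchanged (e.g. small coordinate arrays).
target_chunks = {"reflectivity": (8760, 256), "time": None, "range": None}

plan = rechunk(
    source,
    target_chunks,
    max_mem="1GB",                  # memory budget per worker while rechunking
    target_store="rechunked.zarr",
    temp_store="rechunk-tmp.zarr",
)
plan.execute()                      # runs the plan (with Dask by default)
```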
At the moment the data is stored as binary and later converted to NetCDF.
About your point: unfortunately, there is a bit of reluctance about using cloud solutions here; this is the main reason I was looking for on-premise solutions like databases, which would add some search and archiving functionality.
@saveriogzz, if the data is in NetCDF4, you can also access the files directly from an HTTP server (as long as it supports byte-range requests) using the fsspec HTTPFileSystem class.
Here’s an example:
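(Sketch only; the URL below is a placeholder for any server that supports byte-range requests.)

```python
import fsspec
import xarray as xr

# Placeholder URL pointing at a NetCDF4 file on an HTTP server
url = "https://example.org/data/radar_20210101.nc"

with fsspec.open(url, mode="rb") as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)
```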
Here I’m using the h5netcdf engine, which supports the file-like objects that fsspec produces, and which works for netcdf4 files. I’m not sure whether some other engine would also allow netcdf3 files to work using this approach.
Other options for non-Cloud serving of NetCDF data would include Unidata’s THREDDS Data Server and Xpublish.
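For Xpublish, a minimal sketch might look like this (the local file name and port are placeholders):

```python
import xarray as xr
import xpublish

# Serve a local dataset over HTTP with a Zarr-compatible REST API
ds = xr.open_dataset("sensor-data.nc")        # hypothetical local NetCDF file
rest = xpublish.Rest({"sensor-data": ds})     # one or more named datasets
rest.serve(host="0.0.0.0", port=9000)
```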
@saveriogzz, I certainly don’t want to steer the thread off-topic (hoping others with on-premise solutions will chime in!), but when you get a moment, I am curious if you could expand on this a bit further
there is a bit of reluctance about using cloud solutions
Are there a few bullet points you could share as to where this "reluctance" stems from? What are the primary concerns? I don't expect we'll sway anyone on your team for this specific project, but in the bigger picture, this type of feedback is so valuable for those of us working on cloud storage. It's important that we're aware of the specifics of this no doubt well-founded skepticism, so we can work to address its underlying causes going forward.
In my little non-profit, we’ve been using about 10 TB of Zarr data in Google Cloud, and been pretty pleased with the experience (and, so far, it’s been free because we’re using cloud credits). We’re training deep learning models to forecast solar electricity generation using satellite imagery and numerical weather predictions.
But we’re seriously thinking of moving our machine learning R&D work to our own hardware once we run out of free cloud compute credits. This is for several reasons:
Cost! The cost of storing 10 TB of data for a month or two in the cloud is equivalent to buying the hard drive outright! (Although, admittedly, this isn’t a fair comparison because it ignores utility bills, labour, etc.) And we can’t move the data to “cold storage” because we’re constantly training ML models on the full 10 TB.
Storage performance: At best, I’ve been able to get about 700 MB/s from a cloud storage bucket to a single VM, and that was after quite a lot of tinkering to do stuff in multiple threads per process, and multiple processes. A single NVMe PCIe4 SSD can read data ten times faster than that! And, when training ML models in the cloud, keeping the GPU fed with data from disk is often the bottleneck.
Compute performance: The best CPUs that Google Cloud allows you to attach to GPUs are 24-core 6th-gen Xeons running at 2 GHz. They were released in 2017 and are significantly less powerful than today’s CPUs.
Ease of development: I love fsspec and gcsfs. But they do have some quirks which simply don’t exist when you’re reading from a “real” POSIX filesystem.
That said, the cloud definitely has some advantages over on-prem, so we haven’t made up our minds yet! (no need to diagnose and replace that broken stick of RAM!)
We may be able to help you at Oracle for Research - we are looking for researchers to partner with and we have options for free storage of large environmental data sets. Let me know if we can help you and if you are interested in finding out more.
We can address most of your concerns: we offer much faster virtual networking and attached NVMe block storage, larger CPU and GPU options (up to 100-core Ampere CPUs), and other much more modern processors.
Hi @cisaacstern, sorry for the late answer!
The points are very well summarized by @jack_kelly in his post here. First and foremost, cost is the reason. As he also pointed out, the apparently lower cost of having hardware on-premise is just the tip of the iceberg…!
I would also add that another reason for the reluctance is that cloud skills, despite becoming more and more popular and sought after, are not yet the bread and butter of every researcher/data manager/data engineer; i.e., if we use the cloud, we will need someone who knows the cloud.
I will keep you posted on the development of the project and the decisions we make.
Thanks for joining the conversation!!
Cloud is particularly useful if you have a lot of people who want to access the same BIG datasets and do distributed processing on them. The direct cost comparison of a hard drive on your desk vs S3 is really apples-to-oranges. A fast SSD works great if you are the only person who needs to access the data and you are accessing it from the computer the drive is plugged directly into. For feeding data to ML training on a single machine, I’m sure there is nothing faster than downloading the data onto a fast SSD.
But if you have a team of 100 people across the world who need to access the data simultaneously, you need 100 hard drives and 100 copies of the data. That’s where S3 starts to look more attractive.
Cloud storage goes over the network, so its single-machine throughput is usually limited by network bandwidth. 700 MB/s is a decently fast network speed. Maybe you could tweak something to do better, but not by an order of magnitude. The performance benefits of cloud storage are only evident when you move into distributed mode. We gave some results about this in the paper linked below.
This figure shows the throughput from GCS (and other storage options) as a function of the number of distributed Dask workers. With modest levels of parallelism (~20 workers), we can easily get to 5 GB/s throughput, comparable to the fastest SSDs. That’s because the distributed nature of the I/O overcomes the network bottleneck for a single machine.
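To make the pattern concrete, here's a rough sketch of that kind of parallel read (the cluster setup, bucket path, and variable name are placeholders; in the paper the workers ran on a cloud cluster rather than locally):

```python
import xarray as xr
from dask.distributed import Client

# Placeholder cluster: local here, but a Kubernetes / Dask Gateway cluster in practice
client = Client(n_workers=20)

# Each worker reads its own chunks straight from object storage,
# so aggregate throughput scales with the number of workers.
ds = xr.open_zarr("gs://my-bucket/big-dataset.zarr", consolidated=True)
result = ds["temperature"].mean().compute()
```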
In Pangeo, we tend to be focused on the case where there is a big dataset that is shared by lots of people (CMIP6 is the prime example). But this certainly isn’t the scenario for every data science team.
It’s also worth considering the scenario where you rent a compute node and attach a fast SSD in the cloud for your model training. Even if you conclude that the “single machine with a fast SSD” is right for your use case, it still might not make sense to purchase your own hardware, particularly given the speed of innovation around GPUs, TPUs, etc.
Backup costs are of course yours when you are local too, so 10 TB becomes 20 TB, 30 TB, etc., although the backups can of course live on cheaper storage than cutting-edge NVMe.