I’m building out an internal API that will serve temperature data at various altitudes on a predetermined grid (x deg lat × y deg long × z km alt). I have data for this grid at a cadence of n minutes.
I’ve worked with this kind of data before — what would be an ideal format for storing it, such that I can use metadata to pull the exact partition containing a given latitude, longitude, altitude, and timestamp?
I considered GRIB2 files partitioned by timestamp and altitude, resulting in many small files named temp_timestamp_altitude.grib, with the API backend quickly querying AWS S3 for the right file. However, this results in (1440/n timestamps) × (y altitudes) files per day, which is quite clunky.
Would anyone here have recommendations for a cloud-optimized format that can back an API with sub-500 ms response times?
NOTE: eventually, we’ll extend the API to accept lat/long/alt values that do not sit exactly on the grid points, so we’ll find the closest and then interpolate values.
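To make the access pattern concrete, here’s a minimal sketch of that lookup-then-interpolate step. Everything in it is hypothetical — the grid spacing, the `temperature_at` helper, and the stand-in lapse-rate data are illustrations, not part of any real dataset:

```python
import numpy as np

# Hypothetical grid: 1 deg lat/lon, 1 km altitude levels from 0-20 km
lats = np.arange(-90.0, 91.0, 1.0)
lons = np.arange(-180.0, 180.0, 1.0)
alts = np.arange(0.0, 21.0, 1.0)  # km

# Stand-in temperature cube for a single timestamp, using a
# standard-atmosphere-style lapse rate: T = 288.15 - 6.5 * altitude
temps = np.broadcast_to(288.15 - 6.5 * alts[:, None, None],
                        (alts.size, lats.size, lons.size))

def nearest_index(grid, value):
    """Index of the grid point closest to the requested value."""
    return int(np.abs(grid - value).argmin())

def temperature_at(lat, lon, alt_km):
    """Snap to the nearest lat/lon column, then interpolate linearly in altitude."""
    i, j = nearest_index(lats, lat), nearest_index(lons, lon)
    column = temps[:, i, j]
    return float(np.interp(alt_km, alts, column))

# On-grid request hits a stored value directly (~223.15 K at 10 km)
print(temperature_at(42.0, -71.0, 10.0))
# Off-grid altitude is interpolated between the 10 km and 11 km levels
print(temperature_at(42.3, -70.6, 10.4))
```

Whatever storage format is chosen, the backend only needs to fetch the handful of grid points surrounding the requested coordinate — which is what makes chunk/partition layout the key design decision.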
I think the solution our community has converged on for optimal performance and flexibility is Zarr. Combined with XPublish’s EDR plugin, it gives you a great architecture for exactly this. In fact, @jhamman and Alex Kerney recently gave a presentation on this exact subject at the Pangeo showcase.
If you’re interested in a hosted solution, rather than something you have to build and manage yourself, our company Earthmover offers a subscription-based managed service built on this type of architecture.