interesting! I definitely miss multi-resolution in Zarr, but also Zarr is not so well supported outside of Python. It’s fine obviously if you can tool folks up with a Python stack, but we support other users too. I’m not entirely into treating a set of files of daily data as an array with degenerate rectilinear coords when a list of files with a date stamp is entirely fine and works fast cross-language using analogous idioms, and we have existing workflows for visualization, extraction, aggregation, etc. We’re also trying to support those workflows, things we already do well. Getting away from NetCDF is a great first step, but Zarr doesn’t have the generic appeal that COGs do when we’re already heavily invested in GDAL as a foundation.
Do you think multi-resolution Zarr is going to catch on as a standard?
It’s funny that you characterize COGs as more generic than Zarr. COG is explicitly and narrowly defined as a format for imagery. What about data that isn’t well modeled as imagery, like climate model outputs? What about timeseries analysis? What about more flexible chunking schemes for high-dimensional data? Zarr is much more flexible and generic than COG as a general-purpose data container. This is also why it’s harder to make it “just work” automatically in every situation.
Just stacking up COGs in time is not, in my opinion, the optimal solution for cloud-native data cube analytics. Yes, COGs have better support across the GIS stack. But Zarr also works in many languages.
Zarr is absolutely already catching on as a standard. Many of the most innovative groups are already using it heavily in production.
This community (Pangeo) is quite invested in Zarr and will continue to advance the format and its interoperability across the ecosystem.
I agree about the genericity. I did not characterize COGs as more generic than Zarr.
I used “generic” quite specifically: “generic appeal”, which is from our perspective. To belabor my perhaps too-short sentence: “doesn’t have the generic appeal to us”. I’m sorry that wasn’t clear enough. I think I say too much most of the time.
Hi, many thanks for trying. We are always happy for real-world use cases and feedback and would love to look into your example in more detail, but this will need some days. From looking at your example, I would start with using open_dataset() with the ‘chunks’ argument (the default is None, which skips using dask), but the size and structure of the input file are not clear from the notebook.
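For concreteness, here is a minimal sketch of that suggestion; the file name and chunk sizes are placeholders, not values from the original notebook:

```python
import xarray as xr

# By default chunks=None, which loads eagerly without Dask.
# Passing a chunks mapping (or chunks={} to reuse the on-disk chunking)
# returns Dask-backed, lazily evaluated arrays instead.
ds = xr.open_dataset(
    "example.nc",        # placeholder path; substitute the actual input file
    chunks={"time": 1},  # hypothetical chunking; tune to the file's structure
)
print(ds)  # variables are now lazy Dask arrays; computation is deferred
```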
I certainly can’t predict the future, but we definitely see in the Earth Observation community that Zarr is gaining considerable traction beyond the Python data science community.
It may have gone unnoticed here, but the future nominal file format for the Sentinel products will be Zarr. The product specification and sample products can be found here. Likewise, SNAP, the standard software for exploitation of EO products from ESA, has used Zarr as its standard file format since v9.0.0.
I agree! I never said it wouldn’t. What I said was that we have workflows that are better suited for now and the foreseeable future without reformatting to actual or virtualized Zarr. I don’t understand why that is being reframed. Please consider reviewing what I said versus what Ryan said I said.
Degenerate rectilinear coords are a showstopper currently, and there are other problems, like actual technical functional availability in languages that aren’t Python or JavaScript or C++ or Rust. I want those to be solved and I’m exploring how they might be.
Thanks all for the really interesting discussions here. Reading all this, I’m under the impression that there might be two worlds (again) that might not have the same needs, and so not the same tools: raster imagery on one side (originally TIFF or the like), and climate models or data on the other side (e.g. NetCDF-like).
Maybe @maxrjones it would be really nice to characterize the tools by their initial target in your diagram?
Can these worlds be unified? Through Zarr v3 and GeoZarr? And what progress has been made on the GeoZarr spec?
I think they can be unified. From what I can tell by reading these threads and the GitHub discussions, GeoZarr is not set in stone, but I think we can figure it out. This is a good thread on some of the challenges.
Showing some prototype implementations that can be iterated on and discussed seems like the current next step.
Roundtripping the CRS data after writing the GeoZarr and reading back into the original Xarray object seems like the next step. This round-tripping demonstration has been discussed in a few places, or even implemented, but not in a generic way within xarray or its extensions.
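As a rough sketch of what that round trip could look like, using rioxarray’s CRS conventions as one possible encoding (GeoZarr may well settle on something different, and the paths here are placeholders):

```python
import xarray as xr
import rioxarray  # noqa: F401  (registers the .rio accessor)

# Attach CRS metadata, write to Zarr, read back, and check the CRS survived.
ds = xr.open_dataset("input.nc")    # placeholder source dataset
ds = ds.rio.write_crs("EPSG:4326")  # stores the CRS in a spatial_ref coordinate
ds.to_zarr("test.zarr", mode="w")

ds2 = xr.open_zarr("test.zarr")
assert ds2.rio.crs == ds.rio.crs    # did the CRS round-trip intact?
```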
This approach for ironing out the spec has been brought up in a couple of threads, and I think it’s a good one. It seems like the limiting factor is people power to provide implementations that can be discussed and iterated on.
I saw that on November 6th there is a meeting to discuss the GeoZarr spec: GeoZarr Spec Steering Working Group - HackMD. I’ll attend, hope to see many others! I’d like to contribute to prototyping GeoZarr. As a first step, I am working with a very sparse 2 GB dataset of Sentinel-2 raster chips across Europe, EuroSAT, and trying to make a prototype that addresses points 1. and 2. above.
We’ve added a new large-scale notebook example to the xcube repository! This example demonstrates how to use the resample_in_space() method to reproject the ESA CCI Land Cover classification dataset for all of Europe.
The dataset is stored in an AWS S3 bucket in xcube’s multi-resolution Zarr format and is accessed as chunked xarray.Datasets, which allows the entire operation to be performed lazily using Dask. Check it out and let us know what you think!
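For readers who want a feel for the API, here is a hedged sketch of that kind of workflow. The bucket URL, grid parameters, and the GridMapping construction are illustrative assumptions, not the actual notebook’s values:

```python
import xarray as xr
from xcube.core.gridmapping import GridMapping
from xcube.core.resampling import resample_in_space

# Open the chunked Zarr store lazily (placeholder URL, not the real bucket).
ds = xr.open_zarr("s3://example-bucket/land-cover.zarr")

# Define a hypothetical target grid: LAEA Europe (EPSG:3035) at 300 m.
target_gm = GridMapping.regular(
    size=(20000, 20000),        # illustrative raster width/height in pixels
    xy_min=(2500000, 1500000),  # illustrative origin in target CRS units
    xy_res=300,                 # 300 m resolution
    crs="EPSG:3035",
)

# resample_in_space builds a lazy Dask graph over the chunked dataset;
# nothing is computed until the result is written or explicitly loaded.
resampled = resample_in_space(ds, target_gm=target_gm)
resampled.to_zarr("land-cover-europe.zarr")  # computation happens here
```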