OPeNDAP vs. direct file access

Actually, it can read specific chunks from a collection of files, as described above.

But of course I agree this doesn’t detract in any way from the utility of formats like Zarr and TileDB.


This would all make a great topic for the OGC Testbed!

@rabernat I hope you did not take this the wrong way; I’m struggling with my English to say things appropriately…

Thanks anyway @rabernat and @rsignell for the clarification, it is much appreciated! But I think I may have misunderstood some elements here.

Isn’t that the approach of @rsignell and @martindurant, where the chunks are individual NetCDF files and the metadata is served through fsspec? Or maybe you mean that this main source of metadata is not the only one and is not complete? As with the STAC + COG approach, where I don’t think all the metadata is in the STAC catalog; some remains in the COG files?

And so, with the other approaches, appending would mean adding some individual NetCDF or COG files and updating the main metadata.

Anyway, I concur with you and @rsignell: cloud-native formats like Zarr offer great value!


I’ll add a point of clarification: the current behaviour of fsspec-reference-maker is to read the metadata (tags, units, etc.) and the chunk offsets (which may be one chunk per file) once, up front; thereafter Zarr can read any chunk directly, without having to open or otherwise scan the target files.
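For context, a version-1 reference set is just a small JSON document mapping Zarr keys either to inline metadata or to (url, offset, length) entries pointing into the original files. A schematic, hand-written example (every name, path and byte range below is made up for illustration):

```python
references = {
    "version": 1,
    "refs": {
        # Zarr metadata keys are stored inline and read once, up front:
        ".zgroup": '{"zarr_format": 2}',
        "temp/.zarray": '{"shape": [24, 100, 100], "chunks": [1, 100, 100], "...": "..."}',
        # Each data chunk maps to [url, byte offset, length] in an original file:
        "temp/0.0.0": ["s3://bucket/forecast_000.nc", 31424, 1048576],
        "temp/1.0.0": ["s3://bucket/forecast_001.nc", 31424, 1048576],
    },
}
```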

So you get the following advantages:

  • you can express a collection of files as a single dataset
  • you can access the chunks of a given file directly, without metadata scanning, which can mean a big speedup
  • Zarr allows efficient subselection of the data, reading only what you need (with the caveat of @rsignell’s chunking point above)
  • loading of many pieces happens concurrently, which can be a large speedup for small chunks
  • the analysis flow fits in well with parallel/distributed computing tools like Dask, which want the metadata up front and a set of very similar tasks differentiated only by their arguments, in this case path and offset.
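Once such a reference set exists, consuming it looks much like any other Zarr workflow. A minimal sketch, assuming the references live in a local references.json and the original files sit in a public S3 bucket (both assumptions are illustrative):

```python
import fsspec
import xarray as xr

# fsspec's "reference" filesystem resolves each Zarr key to a
# (url, offset, length) triple and fetches only those bytes on demand.
fs = fsspec.filesystem(
    "reference",
    fo="references.json",          # reference metadata built once, up front
    remote_protocol="s3",          # where the original NetCDF/HDF5 files live
    remote_options={"anon": True},
)
ds = xr.open_zarr(fs.get_mapper(""), consolidated=False)
print(ds)
```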

You may also want to explore the adjacent idea of using the original loading package (i.e., not numcodecs, but say cfgrib) for the many-small-files problem (Idea: Read Grib2 directly from object storage using the Zarr library? · Issue #189 · ecmwf/cfgrib · GitHub). My work doesn’t do this yet, but the comparison is interesting.


Furthermore, it’s reasonable and expected that the metadata will be the same across all the files of an ensemble dataset, so avoiding “opening” each file with HDF/h5py is really desirable.


@geynard , you’ve got it right IMHO.

It would be easy to update the fileReferenceSystem metadata with new chunks as new obs or forecast files come in.

@martindurant and I discussed yesterday that we should create just such an evolving fileReferenceSystem metadata file for the HRRR (3km cumulus-resolving) forecast model that would construct (virtually) a “best time series” for the last 7 days and forecast period from the collection of forecast files.

Not too bad with the current JSON spec, but you do need to read and rewrite the whole thing, of course. We have talked a little about making a new compact, binary representation for the reference metadata, but have come to no solution yet. Zarr itself might be a good choice, since we are relying on it anyway.
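With the JSON spec as it stands, appending a newly arrived forecast file amounts to a read-modify-rewrite cycle along these lines (the file name, variable name, chunk index and byte range are all placeholders):

```python
import json

# Load the existing reference set (hypothetical file name).
with open("hrrr_refs.json") as f:
    refs = json.load(f)

# Grow the time dimension recorded in the array's Zarr metadata.
zarray = json.loads(refs["refs"]["temperature/.zarray"])
zarray["shape"][0] += 1
refs["refs"]["temperature/.zarray"] = json.dumps(zarray)

# Point the new chunk at a byte range inside the incoming forecast file.
t = zarray["shape"][0] - 1
refs["refs"][f"temperature/{t}.0.0"] = ["s3://bucket/hrrr/latest.nc", 31424, 1048576]

# The whole document then has to be rewritten, which is the cost noted above.
with open("hrrr_refs.json", "w") as f:
    json.dump(refs, f)
```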

Fantastic discussion here. I’ve come rather late but will add my 2 cents anyway, and sorry if most of this is covered above.

I would second TileDB or Zarr as a great storage solution. If you are interested, my colleagues have written a piece about our experience with TileDB. It’s a little out of date now; TileDB is progressing at pace and looking to level up in the GIS space, and I think better xarray compatibility will be coming soon.

However, working at a large, risk-averse organisation that uses many legacy tools (and cutting-edge ones), I’ve often found it’s either practically or politically impossible to convert data into one of these newer formats.

@rsignell and @martindurant’s work in this area is really relevant, as it gives a way of keeping your data as is, saving time, cost and political turmoil whilst getting most, if not all, of the benefits. That work has moved on since I last checked in on it, so I will need to reconsider how it affects our work.

My colleague and I have worked on a related approach that doesn’t require reading any of the files in advance.

We’ve not written up the work yet but in short:

  • We take a collection of NetCDF files (or other formats) that adhere to a known naming convention (this is usually the case with most large datasets I’ve worked with).
  • Using Zarr and xarray, we create a layer that maps indexes in time and space to a filename (a minimal sketch of this mapping follows the list).
  • The layer downloads that file into memory (with an optional on-disk cache).
  • It then serves that file (or part of it) up as a ‘chunk’.
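The core of that layer is the index-to-filename translation. A minimal sketch, assuming a purely hypothetical naming convention with hourly time steps (this is not the actual intake_informaticslab code):

```python
from datetime import datetime, timedelta

# Hypothetical naming convention: one NetCDF file per hourly time step.
REFERENCE_TIME = datetime(2021, 1, 1)
TEMPLATE = "s3://bucket/model/{time:%Y%m%dT%H%MZ}-temperature.nc"

def time_index_to_url(time_index: int, step_hours: int = 1) -> str:
    """Translate a chunk index along the time axis into the URL of the
    NetCDF file that holds that chunk, using the naming convention."""
    valid_time = REFERENCE_TIME + timedelta(hours=time_index * step_hours)
    return TEMPLATE.format(time=valid_time)

# e.g. the chunk at time index 42 lives in the file for hour 42:
print(time_index_to_url(42))
```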

We’ve wrapped up this abstraction layer in an intake dataset and driver, so you can install it with pip or conda, e.g. conda install -c conda-forge -c informaticslab intake_informaticslab

If you’re curious to see it in action, there is a Binder.

Pros:

  • We don’t have to read all the data in advance. We are considering applying this to our tape archive; it’s likely only 2% of the data in there will ever be read (but we don’t know which 2%), so converting it or doing any pre-read pass is likely to be inefficient.
  • It works on moving datasets. I’ve written about some of the challenges with Zarr for moving data. Our dataset is likely to be a rolling representation of the last X days, and Zarr struggles with this currently.
  • Truly concurrent writes. Zarr does a great job of concurrent writing when managed by a single process, but in our system we have millions of files a day coming in, frequently concurrently and out of order, and we process each in a separate process. If this were a traditional Zarr representation of an xarray dataset, there would be issues with this.
  • By using intake, we can swap out and change the implementation as needed.

Cons:

  • It’s not any sort of standard; we’ve created, and will need to maintain (or not), a library to turn our millions of NetCDF files into an ARCO dataset. We can’t ‘outsource’ this to the Zarr or xarray devs, as it’s inherently linked to our dataset.
  • Chunk sizes are tied to however the source files are stored, which could be very inefficient.

As I say, the above is a work in progress, but I’m happy to share if you wish to know more.

Good luck @Aaron_Kaplan I’m really keen to hear how you get on.

Just the thing! Yes, we also want to solve the many-small-files case, which is not necessarily the same as the many-chunks case. So yes, it may well make sense to grab whole files and treat them each as a single chunk of a larger dataset. It will be interesting to see how intake_informaticslab can be extended to different path schemes and file types (such as grib2, which must be loaded from a local file).

I’ll chime in on the repo.

Thanks for sharing Theo! Sounds very cool!

I don’t understand the comment about truly concurrent writes. Could you elaborate? I read the linked blog post but didn’t quite get the crux of it. Distributed, async concurrent writing was really the main motivation for creating Zarr in the first place. If you are just updating the arrays (and not changing the shape / metadata), and you align your writes with chunk boundaries, you can scale out as far as you need.

Do you know about the new region option in xarray.to_zarr? It did not exist when your blog post was first published. Does that solve some of the issues you found with concurrent Zarr writes?
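For reference, a minimal sketch of the region pattern (store path, variable names and sizes are illustrative): one process lays down the full-shape metadata up front, and workers then fill in their own slices without touching the shared metadata.

```python
import numpy as np
import xarray as xr

# Step 1: write the full-shape store, metadata only (no chunk data yet).
ds = xr.Dataset({"temp": (("time", "y", "x"), np.zeros((4, 100, 100)))})
ds.chunk({"time": 1}).to_zarr("example.zarr", mode="w", compute=False)

# Step 2: independent workers each write only their own time slice,
# aligned with chunk boundaries, without rewriting the shape or metadata.
ds.isel(time=slice(2, 3)).to_zarr(
    "example.zarr", region={"time": slice(2, 3)}, mode="r+"
)
```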

I posted an issue to invite you or someone from the group to talk about this at a Pangeo meeting - I think many would find it interesting, and it would reach more people than this thread.

Love to! Replied on issue, just a matter of when and where.

@rabernat I may be muddling Zarr and xarray-with-Zarr, or I may just be muddled, but…

I have an array A with shape [1, 100, 100] and chunks [1, 100, 100].

It gets regularly appended with new chunks B, C, D:

A + B → A with shape [2, 100, 100]

The updates are ordered, but they may come in out of order (D, C, B) and simultaneously with each other.

The issue is due to the centralised metadata. Going back to A, suppose D and B come in in parallel:

Process handling D:
Pd.1: Load A
Pd.2: Append D
Pd.3: Update A metadata to shape [4,100,100] (we need to leave gaps for the data we know is coming in so it’s not [2, 100, 100])

Process handling B:
Pb.1: Load A
Pb.2: Append B
Pb.3: Update A metadata to shape [2,100,100]

If the processes run in parallel, it may be that step Pb.1 (load A in order to append B) happens before Pd.3 (save the new shape in light of D), and Pb.3 (update the shape in light of B) happens after Pd.3 (update the shape in light of D). Then you can end up with the metadata reporting the shape as [2, 100, 100], and the update from D is effectively ghosted.
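To make the failure mode concrete, here is a toy illustration of that lost update, using a plain dict to stand in for the shared .zarray metadata (purely illustrative, not real Zarr code):

```python
import json

# Shared store: A currently has shape [1, 100, 100].
store = {".zarray": json.dumps({"shape": [1, 100, 100]})}

# Pd.1 / Pb.1: both processes load the metadata before either has written.
meta_d = json.loads(store[".zarray"])
meta_b = json.loads(store[".zarray"])

# Pd.3: process D extends the time axis to leave room for B, C and D.
meta_d["shape"][0] = 4
store[".zarray"] = json.dumps(meta_d)

# Pb.3: process B, still holding its stale copy, writes the smaller shape.
meta_b["shape"][0] = 2
store[".zarray"] = json.dumps(meta_b)

print(store[".zarray"])  # reports shape [2, 100, 100]; D's update is ghosted
```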

I found this issue is exacerbated when you are writing large datasets with multiple parameters and shared coordinates, live, whilst the data is being accessed (so it can be read between updating the metadata for one Zarr array and another). It’s possible to get the data arrays out of sync with the coordinate arrays and so on, particularly in an eventually consistent data store.

Or so I think!!!

Does that make sense, am I confused?