OPeNDAP vs. direct file access

Hi Pangeo! I work on the IRI/LDEO Data Library. The group I belong to has been hosting hundreds of mostly climate-related datasets in the Data Library for 20+ years. We’re looking into replacing some of our in-house software with the tools you’ve been developing and integrating.

I’m not deep in the Pangeo weeds yet, but from the outside it looks like the Pangeo approach to data access doesn’t make a clear separation of interface from implementation. As of today, to read a Zarr dataset from the Pangeo catalog a client has to have the current version of the Zarr library. If tomorrow you were to switch from Zarr to TileDB, or even upgrade from the current version of Zarr to some future version that’s not backwards-compatible, then all the binders that have been written to work with today’s setup would stop working. I fear that this could hinder progress after a while.

In Closed Platforms vs. Open Architectures for Cloud-Native Earth System Analytics, Ryan and Joe acknowledge that

In geoscience, we have had an excellent remote-data-access protocol for a long time: the “Open-source Project for a Network Data Access Protocol” or OPeNDAP.

but their position is that the OPeNDAP protocol, while it’s nice for smaller datasets,

was simply not designed with petascale applications and massively parallel distributed processing in mind.

While it may be true that the designers of OPeNDAP didn’t explicitly have this kind of processing in mind, it’s not obvious to me why it wouldn’t be possible to use it at PB scale. I certainly agree that existing implementations of the protocol don’t scale well, but that’s a different assertion with different consequences. I’m interested in revisiting the question of whether OPeNDAP could be a viable implementation-agnostic interface to Pangeo data.

I have more to say on the subject, but I know that by reading a couple of blog posts I’m only seeing the tip of the iceberg of what’s been happening in this community. Before I write more, does anyone have more recent or more detailed information about the problems involved with scaling OPeNDAP? Or on why that wouldn’t be desirable even it were technically posible?

2 Likes

Aaron welcome to the forum!

If you’re looking for a high level overview of how we think cloud data architecture should look, you will find it in this recent paper (in press for CiSE)

It’s true that we advocate that cloud data should be stored directly in object storage and that users have direct access to the object storage, rather than going through an API. The reason for this is simple: doing it this way allows you to take advantage of the fantastic scalability and reliability of cloud object storage (e.g. S3). If you put the data behind an API (like opendap), you have to deploy those API servers and pay for their compute time. This is both a performance bottleneck and big a extra cost.

I would argue that the file formats we are using are stable and reliable. (Zarr will soon be approved by OGC.) So I’m not sure that this situation is a as fragile as you claim. Furthermore, we often use a layer of abstraction between the raw file format and the python client: intake. Intake allows us to define a separate path namespace for accessing data (allowing us to move the underlying data without breaking user code). It also acts as a broker between data formats and client libraries, giving hints about how to open the files. So in theory we could actually change formats without breaking user code. The main drawback I see with intake is that it is python only. That’s one reason we are trying to move to STAC.

All that said, I love Opendap and think it still has a big role to play. It would be awesome to run opendap on top of an object store / Zarr data library, as a way to provide access to legacy clients. The easiest way to implement that would probably be via pydap

This would be a great project for someone interested in contribution to the Pangeo cloud data ecosystem.

On the topic of pydap performance, I know that Dan Duffy’s group at NASA has been working on developing a high performance opendap server. I’m not sure how far they have gotten. I don’t think I they are on this forum, but I could send an email to put you in contact.

To summarize, it should be technically possible, and Opendap is still a great access protocol. But having the client talk directly to object storage will be very hard to beat in terms of both cost and performance. So let’s have both options!

Thanks for the welcome and for your detailed reply.

I wasn’t aware of the proposed OGC standard. I see that the spec defines a storage format, without reference to any particular tool implementation (such as the zarr python library), which is great.

But even if I take your word for it that there will never be a need for a new version of Zarr that’s not backwards-compatible with older clients, what about TileDB, or some other great format that hasn’t even been thought up yet? All else being equal, I would like to preserve our (the Data Library’s) freedom to adopt new, better storage options in the future without breaking existing applications.

As an API that intermediates between the client and the data, intake is definitely better than nothing. Assuming the intake API is stable, we can write a notebook that reads a Zarr store via intake, then change the on-disk data representation from Zarr to TileDB, and the notebook code won’t have to be modified. However, if we package the notebook’s dependencies using a mechanism like binder (and we certainly should), we can’t include dependencies that we don’t even know yet that we’re going to need, so changing the data storage format is likely to break existing binders.

It’s certainly easier and cheaper to rely on the cloud provider’s storage API than it is to create and administer your own high-performance OPeNDAP server. If I eventualy give in to Zarr-over-S3 as the API of our Data Library, this will probably be the reason. But before giving up on the idea of a storage-format-agnostic API, I would like to find out more precisely how big the difference in cost and complexity is. In particular, I’m curious whether we could use a “serverless” platform (AWS Lambda, Google Cloud Functions) to create an HTTP interface that can scale up and down with demand, and that isn’t onerous for the operator to maintain.

I’m very much interested in talking to the group at NASA that you mentioned. Please do put us in touch.

1 Like

Email sent to you + Dan Duffy.

I really support the idea of having OpenDAP as an access option for a cloud-based data store.

If I go to the IRIDL today, I see various options for access
https://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.ESRL/.PSD/.rean20thcent/.V1/.monthly/.monolevel/.analysis/datasetdatafiles.html

In a hypothetical future cloud-based data library, I would like to see several different options as well: OpenDAP (DODS), Zarr, NetCDF download, etc., plus documentation about which option is best for different scenarios.

We have also been interested in the idea of serving data via multiple protocols independent of the backend storage format/service. For this reason, we developed Xpublish, a simple restful interface that serves data described in Xarray containers via a flexible http server api. Our first test was to use this to serve Zarr datasets stored locally as netcdf files. Since then though, @benbovy has done some really interesting work extending the framework to allow for any number of additional formats to be served. We haven’t implemented a opendap router yet but it has been discussed: Extending xpublish with new routers · Issue #50 · xarray-contrib/xpublish · GitHub

1 Like

@jhamman thanks for the pointer to Xpublish. I see that besides OPeNDAP support, another possible future feature that’s mentioned is “publish on-demand or derived data products.” That would be of interest to us as well, as these are features the current Data Library software supports.

How much throughput have you managed to get from Xpublish?

@rabernat as you probably know, we store each dataset in only one format; we support download in multiple formats by translating on the fly. To add Zarr to the list of supported download formats, I would want to see that we can translate a dataset that’s stored in some other format (e.g. netCDF, TileDB) on the fly to something that looks like Zarr to the client. This is the reverse of what Xpublish currently does, if I understand correctly. Any idea how hard that would be?

We use Xarray as our Swiss Army knife of format conversion.

We use Xarray as our Swiss Army knife of format conversion.

OK, but then how much of the S3 API do I need to implement in order to serve the result to the client?

I thought zarr-python had a dependency on s3fs, but I double-checked just now and I see that I was mistaken. The tutorial example does use s3fs, though. To fetch zarr from a server that’s not S3-compatible, is it as simple as telling fsspec to use http instead of s3? Are there disadvantages to doing that?

Correct. No disadvantages afaik. This is exactly what xpublish does: serve Zarr over http. Here is the example from the xpublish docs:

import xarray as xr
import zarr
from fsspec.implementations.http import HTTPFileSystem

fs = HTTPFileSystem()
http_map = fs.get_mapper('http://0.0.0.0:9000')

# open as a zarr group
zg = zarr.open_consolidated(http_map, mode='r')

# or open as another Xarray Dataset
ds = xr.open_zarr(http_map, consolidated=True)

@martindurant could comment better on the differences.

FWIW, for a cloud-based data library, I would be using Zarr as the on-disk format and then transcoding to other formats as needed.

Over the summer, I was part of a team that wrote a report for the European Commission about how to “cloudify” their Global Land Monitoring Service. This could be a useful reference.

We made this figure of a hypothetical architecture diagram.

I don’t think we’ve done a proper benchmarking / tuning exercise yet. I think it will really depend on how the web server is configured. The web server itself is a fast-api application that can be deployed in various high performance configurations (e.g. gunicorn).

Actually, this is exactly what Xpublish would support. Serving data that is stored in any format Xarray can read (e.g. netcdf) as Zarr (or potentially OpenDAP) on the client side. The conversion is done in the various Xpublish routers.

Actually, this is exactly what Xpublish would support. Serving data that is stored in any format Xarray can read (e.g. netcdf) as Zarr (or potentially OpenDAP) on the client side.

Oh, that’s great.

NetCDF is a storage format, and OPeNDAP is a network protocol that uses netCDF as its data representation. What I think I’m hearing here is that Zarr defines both a storage format and (perhaps implicitly) a network protocol. The simplest way to implement the protocol is to store Zarr files in cloud storage, but that’s not the only way to do it–Xpublish implements the protocol without using the storage format on disk. Is that accurate?

Does pangeo-data have an explicit retention policy? How long do you expect the cloud-hosted datasets to remain available? Would you ever consider re-encoding existing Zarr datasets in something else like TileDB, forcing clients to adapt, or are you committed to keeping them in Zarr format until they’re no longer needed?

Now you are asking question far beyond what we have thought through. This project started from the performance perspective: how fast and smooth can we make our data analysis experience? Only now, as we contemplate providing an actual production-quality data service–are we starting to confront this type of issue. So we really welcome the involvement of folks like you and others with experience running these types of services.

My personal view, from my limited perspective, involves the following ides:

  • the on-disk format should be a fast, cloud-optimized analysis-ready thing like Zarr or TileDB, rather than a collection of many “granules” (small netCDFs, COGs, etc.)
  • with that foundation, there can be many services on top–transcoders to different formats, APIs, etc.
  • intake is also a useful abstraction layer that allows you to possibly change the underlying format and have things “still work” but WITHOUT having to squeeze all access through an API endpoint

Further complicating the discussion is this very cool new discover by @rsignell and collaborators about how to make vanilla HDF5 / NetCDF go as fast as Zarr on the cloud:

That is potentially a game changer. If it had existed three years ago, we might never have even gotten involved with Zarr. :upside_down_face: This partially solves the performance problem, but it still doesn’t solve the “analysis ready data cube” part of the problem, unless you’re willing to put an entire big dataset into a single NetCDF file. (That would be a pain to create, manage, and move around.)

@martindurant has proposed a simple ReferenceFileSystem JSON format where you can specify the URL for each chunk. So you don’t need to write a giant NetCDF file. A collection will work nicely.

This opens up exciting possibilities.

One is creating many different virtual datasets from the same collection of files. For example, time series at specific forecast taus from a collection of overlapping forecast data.

Another is switching the file type from Zarr to TileDB and the user accessing the ReferenceFileSystem wouldn’t even know!

So using the ReferenceFileSystem approach not only gives us the opportunity to access files like NetCDF and GRIB2 in a cloud-optimized way, but like Intake, it provides a layer of abstraction on top that could shield users from format changes below the hood.

1 Like

That is good part part of the motivation. Also HDF/netCDF is pretty heavy-weight a dependency for non pangeo types of people, whereas zarr and fsspec are simple enough that you could do the same in a number of languages over straight HTTP and access multiple file-types with the one library. A lot of work remains to be done, though!

This is all very helpful.

Answering one of my own questions, here are a couple more potential advantages to using cloud storage directly, compared to having an OPeNDAP server between the data and the client:

  • Cloud storage allows for the “requester pays” model, OPeNDAP doesn’t (unless you implement accounting and billing yourself, which would involve a huge amount of extra technical and administrative complexity).
  • If you get free storage through the provider’s public dataset program, they typically cover bandwidth costs, whereas if you deliver data through a server you might have to pay for bandwidth.

After reading about reading netcdf through the zarr library, I now understand that Zarr offers potential replacements for not two but three different elements of our existing Data Library system:

  1. A data encoding format for representing chunks of data on disk. In the existing DL we use a variety of different formats, including netcdf.
  2. A metadata format for representing as a single entity a dataset that is stored on disk as multiple chunk files. In the existing DL we use a custom, purpose-built language (Ingrid).
  3. A network protocol for fetching data. In the existing DL we use OPeNDAP.

Regarding 1, if we have a choice between netcdf and zarr’s chunk encoding, I think the argument in favor of zarr is simplicity of implementation?

For 2, I’m sold on the advantages of Zarr over Ingrid (but TileDB seems interesting as well–we have use cases where a sparse matrix representation would be useful).

For 3 I feel like there are pretty good arguments on each side.

1 Like

This is a great summary Aaron. At this point we are on the same page I think! :blush:

It’s important to note that there doesn’t just have to be a single choice here. With a solid foundation of a cloud-optimized format in object storage, it should be relatively easy to put as many different flavors of API in front of it as you want. In addition to OPeNDAP, there are other APIs, such as all the OGC stuff. (I personally have never managed to grok OGC APIs, but a lot of people seem to care about it.) This seems like a great project:

TileDB is definitely worth looking into. The TileDB embedded format (the open source part) is very powerful and technically superior to Zarr in many ways. There is also “TileDB Cloud,” a proprietary paid service.

TileDB cloud’s access control features address some of the “huge amount of extra technical and administrative complexity” you alluded to.

FWIW, here’s what we wrote comparing TileDB and Zarr in the Copernicus report:

While the TileDB core specification and library are open source, its community and development practices differ considerably from Zarr. TileDB is developed by a venture-backed private company, TileDB Inc., based in Boston, MA, USA. While user and community input are clearly considered in design questions, ultimately, this company controls the format and related libraries–there are no known independent implementations of TileDB. Further, the free, open-source “TileDB Embedded” product coexists with a closed-source, commercial product called “TileDB Cloud”. TileDB Cloud provides advanced data-access-control capabilities together with a serverless distributed analysis platform for TileDB data.

The choice of Zarr vs. TileDB is essentially a tradeoff between the more advanced features available in TileDB vs. the potential risk of lock-in to a single company for the service’s core data format.

Yeah, in principle, but I think some combinations make more sense than others. If we settle on TileDB as the storage format, would it really be a good idea to serve the data via the zarr protocol? Even if it’s technically possible using something like xpublish, could it be done efficiently and with high througuhput? And is there a strong argument in that case for supporting both the Zarr protocol and OPeNDAP, rather than just OPeNDAP?

FWIW, here’s what we wrote comparing TileDB and Zarr in the Copernicus report

I saw that, and I didn’t fully understand it. If the open source community is capable of developing and maintaining Zarr, why wouldn’t it also be capable of maintaining a fork of TileDB if the company ever went in a direction that the community didn’t like? Perhaps the answer is that TileDB is much more complex, such that it would be unsustainable without a corporate sponsor, whereas Zarr is simple enough to live on with minimal investment?

1 Like

Hey all,

Just wanted to say this is a really, really interesting discussion!!

About the assertion of @rabernat:

that he just more or less contratics right after, with the help of @rsignell and @martindurant, I think the summary of @Aaron_Kaplan is perfect (with some additions):

  1. Data encoding for chunks, that must be easily readable in the cloud,
  2. Metadata format for representing a complete Dataset.

With Zarr, you can have both points at once.

With NetCDF and @rsignell and @martindurant solution, you’ve got both, but I’m not sure if it would work for every NetCDF dataset. I was under the impression that sometimes, even reading one netCDF file in the cloud can be suboptimal due to the arbitrary metadata placement in the file.

For earth imagery, you’ve got CoG for 1. and STAC for 2. And here I think individual CoGs are maid to be accessed with S3 like APIs.

So for netCDF collections, we may just need an established standard like STAC, but also a specification on how netCDF chunks are created to optimized the reads operations.

@geynard, you are correct that our approach doesn’t work for every NetCDF dataset.

It only currently works currently if the compression and filtering schemes used in the NetCDF file are also available in the numcodecs package, which Zarr uses.

And it only works performantly for specific use case if the chunks are a reasonable size (10-200MB) and have appropriate shape. For example, nothing will allow you to efficiently extract a time series from a (time=20000, y=4000, x=4000) cube of data if your chunk shape is (time=1, y=4000, x=4000). (See Read multiple tiff image using zarr - #2 by rsignell)

1 Like

I think we are something important here. The datasets we are talking about are typically ~10 TB in size. The “reference filesystem” takes a single NetCDF file and makes it work similar speed as an equivalent Zarr. These doesn’t change the fact that we would still typically have hundreds or thousands of NetCDF files. Assembling these all into a coherent datacube would require reading the “reference filesystem” for every file and then checking the coordinates alignment (e.g. whatever xr.open_mfdataset does). This will make it expensive to do the dataset initialization.

The alternative is creating a single 10TB NetCDF file. However, that creates another problem: writing! How do you append a new variable or new record along a dimension to such a dataset? The “reference filesystem” approach does not help with writing at all. And what if it gets corrupted? Then your whole dataset is lost.

Zarr or TileDB both solve this by breaking out the dataset into multiple objects / files, with a single source of metadata. They can hold arbitrarily large arrays with a constant initialization cost. Appending to the dataset is a simple as adding more chunks to a directory.

Regarding the chunk alignment problem that @rsignell brought up, it was discussed at length in Best practices to go from 1000s of netcdf files to analyses on a HPC cluster?. Our soultion was the rechunker package:

https://rechunker.readthedocs.io/

Rechunker can read NetCDF sources, but it only writes Zarr, for the specific reason that parallel distributed writes to NetCDF in the cloud remains nearly impossible. In fact, distributed writing was the original motivation for Alistair to develop Zarr in the first place!

So while the reference filesystem approach is a great development, I don’t think it obviates the value of the cloud-native formats.