In-situ ocean observation data - where & how?

Aloha!

If you are like I used to be, you often pull your in-situ ocean data down to a local copy on your desktop or HPC centre. You have wget and FTP scripts running all over the place, filling up directories with NetCDF files. For a start, version control, tracking updates, and provenance can be painful. And then there is all the disk space and duplication.

I don't know what in-situ ocean observation data might already be available behind an S3 gateway somewhere. I am aware that lots of data is available on OPeNDAP / THREDDS servers - NOAA NODC, for example: https://data.nodc.noaa.gov/thredds/catalog.html - but my experience using xarray with OPeNDAP has been mixed at best. I do recognise that many factors can degrade OPeNDAP performance, not least my connection speed to the rest of the world.
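For context, the sort of remote access I have been attempting looks roughly like the sketch below; the THREDDS URL and the variable/coordinate names are placeholders, not a real dataset:

    import xarray as xr

    # Placeholder OPeNDAP endpoint on a THREDDS server - substitute a real
    # dataset URL from the catalogue you are browsing.
    url = "https://data.example.gov/thredds/dodsC/path/to/dataset.nc"

    # xarray opens the remote dataset lazily over OPeNDAP; only the slices
    # you actually index are transferred over the network.
    ds = xr.open_dataset(url)
    subset = ds["temperature"].sel(time="2019-01")  # assumed variable/coordinate names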

Is there already a resource somewhere that tries to document where best to remotely access public in-situ ocean datasets? And best practices for doing so, maybe with examples? (I note @rabernat has this )

Am I missing some obvious resources? If not, is this something some of us could work on?

PS - if someone has already put all of the NOAA NODC data into Zarr format, let us know! :smile:


You might send this to Don Setiawan at UW: landungs@uw.edu. He is working with the Regional Cabled Array component of OOI and has built an interface package called yodapy (Your Ocean Data Access PYthon). yodapy lives at github.com/cormorack/yodapy. I've also written a tutorial on using yodapy (on the Pangeo JupyterHub): github.com/robfatland/chlorophyll. The central notebook focuses on shallow profiler chlorophyll data, sampled at roughly one sample per second.

The last mile of getting these datasets into S3 in Zarr format is still in progress. I'd like the Education/Outreach part of Pangeo to help make this easy to do. But for now my stopgap is to order data from OOI and store what I'm working with 'locally' in the JupyterHub pod.

Let me know if you find any of this helpful even though I doubt it is a complete solution for you.


Another possible option might be pyoos, though to be honest, I haven’t used it.


Several people have tried to put Argo data into Zarr in the cloud: Nick Mortimer of CSIRO (we need to get him on this forum) and also someone from France - I can't remember exactly who.

If you're not dealing with Argo or gliders, most normal hydrographic data just isn't very big. FTP servers and other legacy infrastructure tend to be sufficient. It's more a problem of discovering and merging data.


Appreciate the replies. I'm not sure my OP was well posed - it was a reaction to trying to access NOAA NODC data directly in a notebook, without the "wget/FTP yourself a local copy and update/curate it yourself going forward" workflow. As @rabernat notes, discovery and merging of ocean obs data are the key issues - see also this Twitter thread: https://twitter.com/rabernat/status/1201914930997465097?s=20

To let you know: at Ifremer we're working hard to soon provide users with friendly, Pangeo-compatible access to ocean in-situ data such as Argo, but not only Argo.
This is part of two European projects (EA-RISE and BLUE-CLOUD) and will be powered by the Coriolis data center.
As Ryan pointed out, the limiting factor is not the size of these datasets but their structure and the complexity of manipulating them.
Solutions, use cases, and info will be available at http://github.com/euroargodev in 2020.

In the meantime, if you're not afraid of web APIs, Argo measurements are available at http://www.ifremer.fr/erddap/tabledap/ArgoFloats.html
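If it helps, here is a rough sketch of pulling a CSV subset from that tabledap endpoint with pandas. The variable names and constraints are only illustrative (check them against the dataset page), and the constraint operators may need URL-encoding depending on your HTTP client:

    import pandas as pd

    # tabledap request: comma-separated variables, then "&"-separated constraints.
    # Variable names (platform_number, pres, temp, psal, ...) are assumed here.
    url = (
        "http://www.ifremer.fr/erddap/tabledap/ArgoFloats.csv"
        "?platform_number,time,latitude,longitude,pres,temp,psal"
        "&time>=2014-01-01&time<=2014-01-31"
        "&latitude>=10&latitude<=20&longitude>=-85&longitude<=-45"
    )

    # ERDDAP CSV responses include a units row after the header; skip it.
    df = pd.read_csv(url, skiprows=[1], parse_dates=["time"])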


Guillaume, do you have a simple example, say a Jupyter notebook, demonstrating use of your Argo API? I am only familiar with the Coriolis manual map interface, which is great, but programmatic access would be pleasant.

As promised, here is an update:

If you're interested in accessing Argo data in Python, you can have a look at the argopy library we're building: https://github.com/euroargodev/argopy

It provides easy data fetchers (from the erddap, and soon from a local copy of the dataset) that return xarray objects, as simply as:

    import numpy as np
    from argopy import DataFetcher

    # Instantiate a fetcher (erddap is the default backend);
    # cachedir enables a local file cache of downloaded data.
    argo_loader = DataFetcher()
    argo_loader = DataFetcher(backend='erddap')
    argo_loader = DataFetcher(cachedir='tmp')

    # Fetch by region: [lon_min, lon_max, lat_min, lat_max, pres_min, pres_max],
    # optionally followed by a date range.
    argo_loader.region([-85,-45,10.,20.,0,1000.]).to_xarray()
    argo_loader.region([-85,-45,10.,20.,0,1000.,'2012-01','2014-12']).to_xarray()

    # Fetch by float WMO number and cycle number(s).
    argo_loader.profile(6902746, 34).to_xarray()
    argo_loader.profile(6902746, np.arange(12,45)).to_xarray()
    argo_loader.profile(6902746, [1,12]).to_xarray()

    # Fetch everything from one or more floats.
    argo_loader.float(6902746).to_xarray()
    argo_loader.float([6902746, 6902747, 6902757, 6902766]).to_xarray()
    argo_loader.float([6902746, 6902747, 6902757, 6902766], CYC=3).to_xarray()

Data are returned as a collection of measurements. If you want to transform them into the more familiar collection of profiles, use the xarray argo accessor:

    ds = argo_loader.region([-85,-45,10.,20.,0,1000.]).to_xarray()
    ds = ds.argo.point2profile()

Some notebook examples are available at: https://github.com/euroargodev/erddap_usecases


This is great. I have been working with OOI Regional Cabled Array profiler data, with Argo as a complementary resource, so I'll see about incorporating this. Thanks -r

Thanks I have joined!

Had a quick look - very interesting. It would be great to combine this with Zarr and Intake. From my tests with Zarr, it looked like all of the Argo data could fit into a single Zarr store of less than 20 GB. I have looked at https://github.com/biofloat/biofloat as well. It would be great to wrap this as an Intake catalogue so that the details of file handling and fetching are hidden from the end user (see the sketch below).

Here's a Medium post about where I got to
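To sketch the Intake idea (purely illustrative: the Zarr store path and entry name are made up, and this assumes the intake-xarray plugin is installed to provide the zarr driver):

    import textwrap
    import intake

    # A hypothetical catalogue entry describing an Argo zarr store on GCS.
    # In practice the YAML would live in a repository and be opened by URL.
    catalog_yaml = textwrap.dedent("""\
        sources:
          argo_example:
            description: Argo profiles as a single zarr store (made-up path)
            driver: zarr
            args:
              urlpath: gs://example-bucket/argo.zarr
              storage_options:
                token: anon
        """)

    with open("argo_catalog.yaml", "w") as f:
        f.write(catalog_yaml)

    cat = intake.open_catalog("argo_catalog.yaml")
    ds = cat.argo_example.to_dask()  # lazily-loaded xarray.Dataset

The end user then never deals with file paths or fetch logic, just the catalogue entry name.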

Serving curated Argo data in Zarr, integrated in an Intake catalogue, is on the agenda of the Argo data centers.
In the meantime, you can play around with:

    import gcsfs
    import xarray as xr

    fs = gcsfs.GCSFileSystem(project='alert-ground-261008', token='anon', access='read_only')
    gcsmap = fs.get_mapper('argodata/sdl/GLOBAL_ARGO_SDL2000')
    ds = xr.open_zarr(gcsmap)

Thanks yep will have a look

FYI, in recent versions of fsspec, you should be able to simply do

    import fsspec
    ds = xr.open_zarr(fsspec.get_mapper('gs://argodata/sdl/GLOBAL_ARGO_SDL2000'))

Which is a bit more concise.

If anyone would like to create an intake entry for the existing datasets in http://catalog.pangeo.io/, it’s as simple as making a PR to this repo:
