In-situ ocean observation data - where & how?

Aloha!

If you are like I once was, you often pull your in-situ ocean data down to a local copy on your desktop or HPC centre. You have wget and FTP scripts running all over the place, filling up directories with NetCDF files. Just for a start, version control and tracking updates and provenance can be painful. And then there is all that space and duplication.

I don’t know what in-situ ocean observation data might be available behind an S3 gateway somewhere. I am aware that lots of data is available on OPeNDAP/THREDDS servers - NOAA NODC, for example: https://data.nodc.noaa.gov/thredds/catalog.html - but my experience using xarray with OPeNDAP has been mixed, at best. I do recognise that many factors may impact or degrade OPeNDAP performance, not least my connection speed to the rest of the world.
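For concreteness, the access pattern I mean is roughly this - a minimal sketch where the dataset URL is a placeholder (any dodsC endpoint from a THREDDS catalog works the same way):

```python
# Minimal sketch of remote OPeNDAP access with xarray.
# The URL below is a placeholder, not a real NODC dataset;
# browse the THREDDS catalog for an actual dodsC endpoint.
import xarray as xr

url = "https://data.nodc.noaa.gov/thredds/dodsC/some/dataset.nc"  # hypothetical

# Opening over OPeNDAP reads metadata only; values are
# fetched lazily over the network as they are accessed.
ds = xr.open_dataset(url)

# Subset before loading to keep the transfer small.
subset = ds.isel(time=slice(0, 10)).load()
```

Even with careful subsetting like this, the per-request latency is what has made my experience so mixed.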

Is there already a resource somewhere that tries to document where best to remotely access public in-situ ocean datasets, and best practice for doing so, maybe with examples? (I note @rabernat has this )

Am I missing some obvious resources? If not, is this something some of us could work on?

PS - if someone has already put all of the NOAA NODC data into Zarr format, let us know! :smile:


You might send this to Don Setiawan at UW: landungs@uw.edu. He is working with the Regional Cabled Array component of OOI and has built an interface package called yodapy (Your Ocean Data Access PYthon). yodapy lives at github.com/cormorack/yodapy. Further, I’ve written a tutorial on using yodapy (on the Pangeo JupyterHub): github.com/robfatland/chlorophyll. The central notebook focuses on shallow profiler chlorophyll data, at about one sample per second.

The final mile - getting datasets into S3 in Zarr format - is still in progress. I’d like the Education/Outreach part of Pangeo to help make this easy to do, but for now my stopgap is to order data from OOI and store what I’m working with ‘locally’ in the JupyterHub pod.
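For what it’s worth, once a dataset is in xarray the write path itself is short; here’s a rough sketch where the file name and bucket path are placeholders, not real OOI locations:

```python
# Rough sketch: push an xarray Dataset to Zarr on S3.
# The file name and bucket/prefix are placeholders, not real OOI paths.
import s3fs
import xarray as xr

ds = xr.open_dataset("ooi_profiler_order.nc")  # hypothetical local download

fs = s3fs.S3FileSystem()  # picks up AWS credentials from the environment
store = s3fs.S3Map(root="my-bucket/ooi/profiler.zarr", s3=fs)

# Chunk along time so downstream reads can be parallelised.
ds.chunk({"time": 10000}).to_zarr(store, mode="w")
```

The hard part is less this snippet and more deciding on chunking, metadata conventions, and who pays for the bucket.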

Let me know if you find any of this helpful, even though I doubt it is a complete solution for you.


Another possible option might be pyoos, though to be honest, I haven’t used it.


Several people have tried to put Argo data into Zarr in the cloud: Nick Mortimer of CSIRO (we need to get him on this forum) and also someone from France; I can’t remember exactly who.

If you’re not dealing with Argo or gliders, most normal hydrographic data just isn’t very big. FTP servers and other legacy infrastructure tend to be sufficient. It’s more a problem of discovering and merging data.
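The mechanical concatenation is the easy part - a hedged sketch, assuming the files already share coordinates and conventions (they usually don’t, which is the real problem):

```python
# Illustrative only: merge a directory of downloaded hydrographic
# NetCDF files. The glob pattern is a placeholder for a local FTP mirror.
import xarray as xr

ds = xr.open_mfdataset(
    "data/hydrography/*.nc",  # hypothetical local download
    combine="by_coords",      # align files on their shared coordinates
    parallel=True,            # open files concurrently via dask
)
```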


Appreciate the replies. I’m not sure my OP was well posed - it was a reaction to trying to access NOAA NODC data directly in a notebook, without the usual “wget/FTP yourself a local copy and update/curate it yourself going forward”. As @rabernat notes, discovery and merging of ocean obs data are the key issues - as is this Twitter thread: https://twitter.com/rabernat/status/1201914930997465097?s=20

To let you know: at Ifremer we’re working hard to soon provide users with friendly, Pangeo-compatible access to ocean in-situ data such as Argo, and more.
This is part of two European projects (EA-RISE and BLUE-CLOUD) and will be powered by the Coriolis data centre.
As Ryan pointed out, the problem is limited not by the size of the dataset but by its structure and the complexity of manipulating it.
Solutions, use cases, and info will be available at http://github.com/euroargodev in 2020.

In the meantime, if you’re not afraid of web APIs, Argo measurements are available at http://www.ifremer.fr/erddap/tabledap/ArgoFloats.html
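For example, a constrained CSV request can be pulled straight into pandas. A minimal sketch - the variable names below are what the ArgoFloats tabledap page lists, but do check the dataset metadata for the exact set:

```python
# Minimal sketch: query the ERDDAP ArgoFloats tabledap endpoint as CSV.
# Verify variable and constraint names against the dataset page.
import pandas as pd

base = "http://www.ifremer.fr/erddap/tabledap/ArgoFloats.csv"
query = (
    "?platform_number,time,latitude,longitude,pres,temp,psal"
    "&time>=2019-01-01T00:00:00Z"
    "&time<=2019-01-02T00:00:00Z"
)

# ERDDAP CSV puts units in the second row; skip it.
df = pd.read_csv(base + query, skiprows=[1])
print(df.head())
```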
