Coordinating efforts on Lagrangian filtering for SWOT

Background

This thread was motivated by an exchange on Twitter:

This exchange reminded me that there are quite a few people working on the SWOT filtering problem, with a variety of approaches and models.

Within this broader topic, many of us are working on applying “Lagrangian filtering”, a computationally intensive technique for separating waves and other high-frequency motions from the low-frequency background flow. In oceanography, we often use the beautiful 2017 paper by Shakespeare and Hogg as a reference for the method. Earlier uses of this framework go back to Polzin and Lvov (2011) and Nagai et al. (2015).

For many SWOT-related projects, the Lagrangian filtering is only the beginning; the end goal is to use theory and / or data-driven methods to remove the IGW signal from the SWOT measurements, leaving behind the transport-relevant part of the flow. However, the Lagrangian filtering step is very computationally intensive, involving advecting a huge number of particles, applying signal processing in the Lagrangian frame, and then interpolating back to Eulerian space. There is an opportunity to work together to bring Pangeo tools to bear on this problem.

Proposal

I know lots of the groups involved are eager to take a collaborative approach to this problem.
It is very challenging, and perhaps we can be more effective by working together on certain aspects.
Also, there is a practical concern: we have students and postdocs who are expecting to publish novel research in this direction, and so it may be wise for project mentors to coordinate efforts to minimize accidental duplication / overlap.

Let’s use this forum as a place to openly discuss these related efforts, our scientific goals, and technical approaches. I would love to find new avenues for collaboration and interoperability between our efforts. COVID has forced us all to rethink how we do scientific collaboration, since we are no longer very constrained by geostrophic proximity.

If you’re interested in trying this approach, I’d propose that each group make a post outlining their goals, their data / models, and their computational tools. I’ll start now with a post on behalf of our SWOT sub-team.


NYU / Columbia SWOT Sub-team: Inferring Ocean Transport from SWOT

Our team consists of myself and @cspencerjones at Columbia, plus Shafer Smith and a TBN graduate student at NYU.

Project Goals and Approach

Our project title, “Inferring Ocean Transport from SWOT,” summarizes our central aim. Our proposal outlines three specific goals related to this theme:

  • inferring lateral surface velocities from SSH at scales where geostrophy accounts for only part of the signal
  • filtering the transport-active balanced component of the velocity field, in order to give meaningful estimates of the surface eddy kinetic energy and infer lateral stirring rates
  • connecting vertical fluxes of tracers to estimates of transport-active lateral velocities, and ultimately directly to SSH.

Building on earlier (yet still unpublished) work by former student Anirban Sinha, we will be trying out some neural-network-based approaches to the filtering problem.

Key to all of this is to have a “truth” signal of raw vs. filtered SSH and near-surface velocities. We plan to use model data and apply the Lagrangian filtering technique of Shakespeare and Hogg. We would like to produce a high-quality synthetic dataset that can be used broadly by the SWOT community and beyond, similar to what Nicolas described in his tweet above.

Data / Models

We will be using the MITgcm LLC4320 runs as our testbed. These data include hourly snapshots of SSH and 3D velocities. Eventually, we will also use the SWOT simulator to produce synthetic SWOT observations.

We are also doing some experiments in an idealized high-resolution channel setup to test the methods, before moving to the big model.

Tools and Methodology

We are planning to use MITgcm in offline mode to advect the Lagrangian particles on the Pleiades supercomputer. We are doing this because:

  • This operates well with the existing LLC4320 data on Pleiades, requiring minimal data transformation / preparation
  • It scales up very well, since MITgcm is parallelized with MPI. We expect we can handle 100 billion particles.

We will run the particles at 15m depth in 2D. This will also give us an opportunity to follow up on the earlier work by Yu et al. comparing with drifters. (Aurelien Ponte, @selipot, and others were involved in that.)

Once the particle trajectories are output, we will convert them to an analysis-optimized format (either Parquet or Zarr). We will analyze them using the Pangeo stack, including xrft for the spectral filtering part. We have not yet figured out the best way to do the regridding.
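
To make the conversion step concrete, here is a minimal sketch of the kind of thing we have in mind, assuming the particle output has already been read into in-memory arrays; the variable names, shapes, and chunk sizes below are placeholders rather than our actual pipeline:

```python
import numpy as np
import xarray as xr

# Hypothetical arrays of shape (n_particles, n_times) read from the
# MITgcm offline-advection output; names and shapes are placeholders.
lon, lat, u, v = (np.random.rand(1000, 240) for _ in range(4))
time = np.arange(240) * np.timedelta64(1, "h")

ds = xr.Dataset(
    data_vars=dict(
        lon=(("particle", "time"), lon),
        lat=(("particle", "time"), lat),
        u=(("particle", "time"), u),
        v=(("particle", "time"), v),
    ),
    coords=dict(time=("time", time), particle=("particle", np.arange(1000))),
)

# Chunk along particles so each Zarr chunk holds complete time series,
# which is the access pattern needed for per-trajectory (Lagrangian) filtering.
ds.chunk({"particle": 100, "time": -1}).to_zarr("trajectories.zarr", mode="w")
```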

Once we have produced the raw + filtered datasets, we hope to share them with the community. At that point we will move on to the more scientifically exciting part of assessing different data-driven methods for reconstructing the transport-relevant velocities from SWOT SSH.

Possible Collaboration Points

I think, but am not certain, that there are several other groups trying to apply Lagrangian filtering in a similar way. It would be great to share technical notes, and possibly even code, for doing this. I’d also love to get some of the data in the cloud via Pangeo so we can more fluidly collaborate.

I would also love to be involved in defining “challenge” problems for machine learning, as Nicolas described in his tweet. I think the data we plan to produce will be very appropriate for this. By coordinating efforts, I think we can get more visibility for this from the ML community. For example, we could create an “Ocean ML Hub,” similar to Radiant Earth’s MLHub:

Most of all, I really want to take a collaborative approach and make sure we don’t end up competing with other teams I would much rather be working with. :blush:


Looks like a fascinating project @rabernat. I am thinking that the “total” velocity estimates from drifters will again become useful for ground-truthing model results, or for comparison with velocity estimates derived from SWOT data? I missed the opportunity to propose that in the SWOT call, but maybe in the future.


We would absolutely love to collaborate with you on that aspect, although I know you might not be able to devote much time to it without funding. (Perhaps we could even try to find some supplementary funding somewhere for this specific task.)

I heard via email from Nicolas and Julien…they want to get involved but won’t have time to engage much on this forum for at least a few weeks. No rush…we will be standing by whenever folks are ready to discuss.

Thanks a lot Ryan for kick-starting this much-needed coordination effort.

We have a number of ongoing and future activities at Ifremer that are relevant to this.

A Ph.D. student (Zoé) is currently investigating the signature of internal tides on surface Lagrangian trajectories in idealized numerical simulations (the ones from Nico’s tweet, actually).
The study does not use Lagrangian filtering per se, but is obviously very connected.
We are wrapping up this study this fall and should submit a manuscript by the end of the year.

We are currently producing a dataset of surface drifter trajectories from LLC4320 surface velocity fields.
Data production should be concluded by the end of the month.
Technically, this production relies on parcels with dask parallelization on overlapping tiles.
It is our intention to communicate about this in the months to come and investigate whether our development could be useful for the community.
Our intention was also to share this dataset, but given that your group @rabernat has access to the full dataset (and can, for example, advect drifters at depth as well as at the surface) and that you are planning to perform Lagrangian filtering, this may be a worthless effort.

This dataset will be used for two studies:

  • Xiaolong (Yu) will be extending his study with the model-Lagrangian vs. model-Eulerian comparison
  • Zoé will be extending the results she obtained in an idealized setting to a realistic one.

Your thread has, in my opinion, a broader scope than Lagrangian filtering alone: that of extracting information from SWOT data and drifter trajectories.
For the LLC4320 drifter tracks, we are not actually intending to apply Lagrangian filtering, even though the processing we are implementing bears many similarities.
I would thus say there is not one but at least two relevant datasets (raw trajectories vs. a Lagrangian-filtered dataset).


@rabernat - thanks for taking the lead on this. A quick update from me on recent work that has been going on – mostly led by Callum Shakespeare. Callum will likely chime in with more information.

The main goal right now is to create a tool which is easier to apply than the original MATLAB scripts that Callum hacked together in 2017. Angus Gibson has been working on a method based on OceanParcels. It is Python-based, freely accessible, has decent documentation, and is in the process of being written up in a paper. Have a look here if you are interested:

Right now, we think this algorithm is ready for others to use and give feedback or contribute to the code.

One question is whether a Python-based tool is suitable for the LLC4320-type data. I think it might be, because of the method used. In short, the idea is that, for a given time and at every model gridpoint, we seed a particle and advect it forwards & backwards by X days, then apply the filter to the fields on that Lagrangian particle. This gives filtered output at the time of seeding (but not at any other time). This is well explained in the documentation. But the point is that you can apply the filter to small regions, and in a per-timestep fashion, which essentially makes it embarrassingly parallel. @rabernat - I would be very interested in seeing if this method could handle the LLC data?
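
Schematically, the procedure for one seed point looks something like the following; this is just the structure of the method with placeholder stand-ins (the `advect` function and the averaging "filter"), not the actual code in the tool:

```python
import numpy as np

def advect(lon0, lat0, t0, days, direction):
    """Placeholder for the particle advection: returns e.g. u(t) sampled hourly
    along the trajectory seeded at (lon0, lat0) at time t0."""
    return np.random.rand(int(days * 24))

def filtered_value_at(lon0, lat0, t0, days=5):
    # Advect backwards and forwards from the seed time, then join the two legs
    # into a single Lagrangian time series centred on t0.
    back = advect(lon0, lat0, t0, days, direction=-1)[::-1]
    fwd = advect(lon0, lat0, t0, days, direction=+1)
    series = np.concatenate([back, fwd])
    # Apply the temporal filter in the Lagrangian frame (placeholder: a mean)
    # and keep only the value at the seed time -> filtered field at t0.
    return series.mean()

# Each gridpoint and each seed time is independent, so the outer loops are
# embarrassingly parallel over tiles and over output times.
filtered = np.array([[filtered_value_at(lon, lat, t0=0.0)
                      for lon in np.linspace(150.0, 151.0, 4)]
                     for lat in np.linspace(-40.0, -39.0, 4)])
```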


Following from Andy’s introduction, I thought I’d just note some of my observations with working in the dask/parcels area, and where things could possibly be improved. These only pertain to the problem of actually obtaining the Lagrangian-space data through particle advection. Once we have these tracks, we can pass them into the filter of our choice, either a spectral method, or using scipy’s signal processing and filter design functionality.
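
For the scipy route, a zero-phase Butterworth high-pass applied along each trajectory is the sort of thing I mean; the sampling interval and cutoff below are placeholders, not values from our configuration:

```python
import numpy as np
from scipy import signal

fs = 1.0               # sampling frequency along the trajectory, in cycles per hour (assumed hourly output)
f_cutoff = 1.0 / 30.0  # high-pass cutoff in cycles per hour (placeholder, e.g. near the inertial band)

# 4th-order Butterworth high-pass in second-order-sections form,
# applied forwards and backwards for zero phase shift.
sos = signal.butter(4, f_cutoff, btype="highpass", fs=fs, output="sos")

u_traj = np.random.rand(24 * 20)          # u(t) sampled along one trajectory (placeholder data)
u_wave = signal.sosfiltfilt(sos, u_traj)  # high-frequency (wave) component
u_slow = u_traj - u_wave                  # low-frequency, "balanced" residual
```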

As mentioned, we seed a particle at every gridpoint at which we want filtered data. I have played with running the parcels advection through dask workers, but I’m not particularly happy with the way I’m doing it (see here).

In the problem of distributed advection using parcels’ deferred fields, we have two types of tasks: the actual advection tasks, which run some number of advection timesteps for a group of particles; and the loading of chunked data from a source xarray dataset. Focusing first on the advection tasks, they’re quite tightly integrated: parcels maintains a sparse chunk map of data required to interpolate the particle data. Every time a parcels kernel requests interpolation, it first checks whether it has loaded that chunk, and if not it will force a load (by inserting the numpy array for that chunk into a dictionary, essentially).

So, because of the tight integration of parcels kernel execution, we can’t cleanly use them directly as distributed tasks. I ended up creating the xarray dataset, and the parcels FieldSet as part of the advection task itself, since we can’t share this between workers. It’s pretty likely that we load the same chunk on multiple workers once the particles move from their initial positions.
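
Stripped down, the per-task structure I currently use looks roughly like this; the Zarr store name, variable names, and run parameters are placeholders for illustration:

```python
from datetime import timedelta
import numpy as np
import xarray as xr
import dask

@dask.delayed
def advect_tile(zarr_path, lons, lats):
    """Advect one tile's worth of particles. The dataset and FieldSet are
    built inside the task because they cannot be shared between workers."""
    from parcels import FieldSet, ParticleSet, JITParticle, AdvectionRK4

    ds = xr.open_zarr(zarr_path)  # hypothetical store of surface U/V
    fieldset = FieldSet.from_xarray_dataset(
        ds,
        variables={"U": "U", "V": "V"},
        dimensions={"lon": "lon", "lat": "lat", "time": "time"},
    )
    pset = ParticleSet(fieldset=fieldset, pclass=JITParticle, lon=lons, lat=lats)
    pset.execute(AdvectionRK4, runtime=timedelta(days=5), dt=timedelta(minutes=20))
    return np.array([[p.lon, p.lat] for p in pset])

# One delayed task per tile of seed locations; I/O for overlapping chunks is
# currently duplicated across workers, which is the inefficiency noted above.
tasks = [advect_tile("llc_surface_uv.zarr", lons, lats)
         for lons, lats in [(np.linspace(-60.0, -50.0, 10), np.full(10, 35.0))]]
results = dask.compute(*tasks)
```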

I think this is where we would be able to get the best improvements: instantiate parcels (and our data) only once, and push the distributed work further down. I thought that perhaps the recent addition of dask-aware interpolation to xarray might help here, but this doesn’t support some of the more complex (e.g. curvilinear, B/C-grid) interpolation present in parcels.


Thanks all for this very interesting discussion. Within my team, we don’t work on Lagrangian filtering per se, but of course we do support a lot of the Parcels code development. Really useful to read what limitations in the code you are facing, @angus-g and @apatlpo. Let us know if there’s anything we can help with from our side.


Thanks everyone for the replies! I want to note that Julien and Nicolas plan to contribute as well but are currently tied up with other stuff. So we should definitely wait for their inputs.

In the meantime, a few in-line replies from me:

Definitely not a worthless effort! But coordinating these sorts of expensive / difficult calculations is precisely the goal of this thread. My view is that multiple implementations using different tools is quite valuable, if only as a check that the methods converge to the same answer. We would be happy to host your data in Pangeo Cloud if / when you are ready to share it with a broader community. That will help facilitate intercomparison. With our Pleiades-based calculations, we could try to compare several different depths, so we can make a direct comparison with the parcels results. But you are also far ahead of us. We don’t even have our Pleiades allocation yet.

Yes, very good points. My thinking is that we would start with the data preparation and move on to the more “interesting” inference problems. It’s useful to know you are not intending to apply Lagrangian filtering, as this is a small but important distinction in scope.

Thanks so much @AndyHoggANU and @angus-g for sharing this. It seems like such a powerful and useful approach. Exactly the sort of tools I’m hoping we can collaborate around.

I want to understand your approach better. My thinking about how to do this problem on a large scale was basically a two-step process:

  • Advect Lagrangian particles and write the particle data to disk
  • Process the particle data (filtering, whatever) and write some new results to disk (e.g. filtered, interpolated velocities)

It sounds to me like you are doing something more clever, which is to essentially do the filtering “online” during the particle advection, such that you never have to write the full Lagrangian dataset to disk, but instead only output the final filtered results. Is this correct?

That approach really only works with parcels, where you can more easily plug in custom Python code.

The primary challenge of the LLC data is the complex topology of the connections between faces. That’s really the reason we went with MITgcm offline advection: since MITgcm wrote the velocity data in the first place, it can easily read it back and handle the topology natively. We have also implemented these connections in xgcm; if one wanted to support this type of topology in parcels, that might be a good starting point.
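
For reference, xgcm encodes this kind of topology through a face_connections dictionary; the fragment below is a deliberately minimal two-face illustration of the format, not the real thirteen-face LLC connectivity:

```python
# Minimal illustration of xgcm's face_connections format (NOT the real LLC
# topology): face 0's right X edge joins face 1's left X edge.
face_connections = {
    "face": {
        0: {"X": (None, (1, "X", False)), "Y": (None, None)},
        1: {"X": ((0, "X", False), None), "Y": (None, None)},
    }
}

# With an LLC-style dataset `ds` that has a "face" dimension, this would be
# passed as: xgcm.Grid(ds, face_connections=face_connections, periodic=False)
```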

Another advantage of MITgcm offline is that it really does scale and parallelize well on supercomputers.

Since we are stuck using MITgcm, we are also stuck with using the two-step method for now. However, one possible way forward, if we are willing to neglect the Arctic, would be to use xmitgcm to transform the data on-the-fly to a rectangular pseudo-lat-lon grid (basically just keep the LL part of LLC). That’s described a bit here:
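
The kind of thing I mean, if I’m remembering the llcreader interface correctly (worth double-checking the keyword arguments), is roughly:

```python
import xmitgcm.llcreader as llcreader

# Sketch only: this pulls from the ECCO data portal rather than a local copy
# on Pleiades, and I may be misremembering the exact keyword arguments.
model = llcreader.ECCOPortalLLC4320Model()
ds = model.get_dataset(varnames=["U", "V"], k_levels=[0], type="latlon")
print(ds)  # quasi-lat-lon layout; the Arctic cap is dropped
```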

It’s embarrassingly parallel from a computational perspective, but as @angus-g correctly notes, managing the data is another matter. Especially with datasets like LLC4320, it’s likely that I/O is the rate-limiting factor. There are lots of things we could experiment with here. One idea is to use a distributed in-memory database like redis to cache the chunks. That way, the expensive I/O operation (read from disk / object storage) only has to be done once.
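
As a sketch of the redis idea (the key scheme, chunk shape, and loader below are all hypothetical):

```python
import numpy as np
import redis

r = redis.Redis()  # assumes a redis server reachable at localhost:6379

def get_chunk(key, loader):
    """Fetch a velocity chunk from the shared cache, falling back to the
    expensive read (disk / object storage) only on the first miss."""
    cached = r.get(key)
    if cached is not None:
        return np.frombuffer(cached, dtype="f4").reshape(480, 480)
    chunk = loader(key)                                 # expensive I/O, done once per chunk
    r.set(key, chunk.astype("f4").tobytes(), ex=3600)   # expire after an hour
    return chunk

# Hypothetical loader standing in for the actual read from the LLC4320 store.
chunk = get_chunk("U/face03/t0000", lambda k: np.zeros((480, 480), dtype="f4"))
```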

Another option would be to look at other workflow frameworks, like Prefect, that might be better suited than Dask to this problem.

I’m going to look through the filtering code you shared more closely and perhaps I’ll have some ideas.

Yes, that is correct. This method overcomes a problem that you will soon strike. In the 2D velocity field at a given level, there is divergence and convergence. So, over time, irrespective of how many particles you start with, the particle density thins out in some regions, and you have no Lagrangian trajectories to work with. So, the time over which you integrate Lagrangian trajectories has to be relatively short (I would say weeks, certainly not a year). For this reason, having a smaller number of particles but ensuring their density is even at the mid-point of the trajectory is more efficient…

I will let @angus-g comment on whether his implementation can handle the topology of the LLC grid…


@AndyHoggANU @rabernat Just to follow up on Andy’s comment. It is possible to seed the particles in such a way as to counteract the divergence of the 2D flow, with satisfactory results. An example is in our recently published paper here: https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2020JC016106. But it takes a lot of trial and error to get the seeding right, which costs a lot of human time, and the resolution is not guaranteed. Also, you need more particles in total so you can “over-seed” in regions of divergence. Therefore, we ultimately jettisoned that method in favour of the method Andy describes, which guarantees no loss of resolution and gives the ability to obtain wave fields at any point in space and time independently. If you have 10 mins spare, I talk about the new method here: https://www.youtube.com/watch?v=VVqCDeL8AcQ&feature=youtu.be