HLS time series using xarray best practices

Hi there - this is my first question here, and I am really excited to find a smaller discussion site for this type of work.

I have been working on a workflow using STAC and xarray within Google Colab's environment for the past few months. I am pulling 6 months of imagery (~7 GB), processing it (cloud removal, resampling), calculating vegetation indices, and pulling data for polygons across my study area. The problem is, I keep 1) getting disconnected from Colab, and 2) running out of RAM (even with high RAM).
I'm trying to accelerate the workflow so that idle time is shorter and the processing isn't as computationally expensive, but I'm not sure I understand how xarray handles stored functions.
For example:
full = full.groupby("time").where(full != 0).resample(time='1W').median()
runs in just a few seconds, but this:
full = full.groupby("time").where(full != 0).resample(time='1W').median().compute()
takes 10-15 minutes.

I have multiple instances of the +/- compute option inc, and I understand that without the compute function xarray "holds" that computation until it is called, but I can't seem to understand how to really optimize the time it takes to compute and how to use the xarray functionality to its full extent.
The same type of question also applies to the veg indices. If I do not call the compute function and then calculate my vegetation indices, will those calculations be included when I eventually call compute?

I feel like the processing time is really high for not even starting my analysis, so I’m concerned that I’m doing it wrong. Any tips you might have would be great.

2 Likes

Welcome Rebekah!

Can you share your full code and/or a link to a public Colab notebook?

Hey Ryan, here’s a link: Google Colab

Some is edited out because of data sharing agreements.

That notebook requires requesting access FYI

1 Like

Welcome Rebekah!

A few things that I've noticed; some of them won't affect performance very much, but they will make the overall code nicer and maybe save you some operations :slight_smile:

  1. If you intend to use the sentinel and the landsat data together, you can just search for both collections at the same time (which would work unless you're trying to pull some bands unique to only one of the two satellites, like the S2 red edge bands):
# actual search in the CMR for HLS objects, with our 4326 bbox, according to our start and end date
stac_items = catalog.search(
    collections=[s, l],
    bbox=bbox_4326,
    query={"eo:cloud_cover": {"lt": 30}},
    datetime=date_range,
).item_collection()

However, there currently seems to be a bug with the LPCLOUD HLS collection where only the first page of the search results is returned (see this earthdata forum post), so it may be best to stick with searching them separately for now! Another potentially good option would be to combine the STAC items before reading them in (sketched below).
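A minimal sketch of that combining approach, reusing the same search variables (s, l, bbox_4326, date_range) as above, could look like this:

# Run the two searches separately and combine the resulting items into one
# list before loading them with stackstac or odc.stac.
items_s = catalog.search(
    collections=[s],
    bbox=bbox_4326,
    query={"eo:cloud_cover": {"lt": 30}},
    datetime=date_range,
).item_collection()

items_l = catalog.search(
    collections=[l],
    bbox=bbox_4326,
    query={"eo:cloud_cover": {"lt": 30}},
    datetime=date_range,
).item_collection()

# item_collection() returns an iterable pystac ItemCollection, so the two
# result sets can simply be concatenated.
stac_items = list(items_s) + list(items_l)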

  2. Something that may affect computation time - the time dimension. stackstac doesn't do any handling of the time dimension for you, so it generates a unique time entry for each item you read; for just the landsat part in your notebook, for example, stackstac gave me one time point per item.

But odc.stac (using its functionality to group by solar day) gives me less than half the time points. Basically, it has 'combined' acquisitions in the same landsat path that are collected on the same day, instead of creating a unique entry in the time dimension for each of them. This avoids having a lot of nans everywhere in your array.

ds = odc.stac.load(stac_items2, bands=BANDS,
                   groupby='solar_day',
                   chunks={'x': 4096, 'y': 4096},
                   crs=EPSG,
                   resolution=30,
                   bbox=bbox_4326)

To double down on that point, the difference is this: with stackstac, acquisitions T1…T6 all have unique datetimes and each creates an 'entry' in the time dimension. With odc.stac as called above, you would instead have two days, one for T1…T3 and the other for T4…T6.

This has some performance implications (although I haven't tested this rigorously!). It also helps keep the dask graph under control, which starts to matter when you try to do very large areas on distributed systems.

I think the performance aspect of this really shines as we add more operations; e.g. comparing the time to get a single weekly mosaic from your combined weekly data vs. doing a similar thing with odc, the odc version is about 2x faster.
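For concreteness, a rough sketch of that weekly-mosaic step on the odc-loaded Dataset ds from above (assuming 0 is the nodata value, as in your original snippet):

# Lazy weekly medians over the masked data.
weekly = ds.where(ds != 0).resample(time='1W').median()

# Pull out and compute a single week for a quick timing comparison.
mosaic = weekly.isel(time=0).compute()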

  3. Perhaps you can simplify some of the index creation by defining a dataset (either through odc-stac or by doing ds = hls_stack.to_dataset(dim='band')); then you could easily add indices like:
ds['ndvi'] = (ds['nir'] - ds['red']) / (ds['nir'] + ds['red'])
ds['evi'] = ...

This probably doesn't affect performance at all, but it's much easier to manage. And if you compute this, it will also do all of your indices at once. Before computing, it's always good to check whether the result actually fits into memory.
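For example (just a sketch using the variable names above), selecting only the index variables keeps the computed result small while still evaluating all of them in one go:

# Both indices are evaluated in this single compute() call.
indices = ds[['ndvi', 'evi']].compute()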

In terms of computation - I'm not familiar with how colab is set up now and how much more you get with a paid subscription, but currently I'm getting 2 CPUs and about 13 GB of RAM. That's just not that much, and there's a fair amount of data to be processed. The ideal way to speed this up is to have access to a dask cluster, where you can parallelize your chunked workload across several machines. There's an art to optimizing this further (in terms of chunk sizes and so on).

Something that can potentially work well is to figure out where you need to persist some computed results to avoid redoing them several times. For instance, you could try exporting your weekly medians to zarr once they are computed (if the resulting array is small enough that you can feasibly put it somewhere!), and then just read from there.
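A minimal sketch of that round trip, assuming weekly holds the computed weekly medians as a Dataset and the path is just a placeholder:

import xarray as xr

# Write the weekly medians out once they have been computed...
weekly.to_zarr('weekly_medians.zarr', mode='w')

# ...and in later sessions read the stored result instead of recomputing it.
weekly = xr.open_zarr('weekly_medians.zarr')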


What do you mean by this?:

I have multiple instances of the +/- compute option inc


I have some more thoughts, but this is long enough already, and I'm very keen to see other people's ideas and the ways they come up with for doing this!

I hope this helps!

3 Likes

Kristian, thank you!
I did discover the LPCLOUD HLS bug a few weeks ago, and that method was my shortest workaround - I still have the original search query with the combined collections in the hope that it gets fixed soon.
I will pull in odc.stac; the time dimension has been giving me trouble with indexing in some of the computation! In that case, the flatten function isn't needed anymore.
I keep switching back and forth between a data array and a dataset because I've been seeing benefits to both. The issue I also keep encountering is that STAC brings the bands in as coordinates rather than data variables, which takes some minor restructuring, but enough of a restructure to take up some time. But the data array seemed to be a little easier to chunk?

My lab pays for the colab subscription, so I have access to a little more processing power, a TPU, and 50 GB of RAM. I would like to pull it out and use some of the cluster computing my university has access to, but I want to make sure it's more functional first.

I have multiple instances of the +/- compute option inc

Sorry - confusing sentence. I mean that between calculating median values, outliers, etc., I have multiple instances where the compute function could be applied, and I have been trying to decide where I should apply it vs. storing (deferring) the operation.

I think the one advantage of having it as a DataArray is if you wanted chunks with the bands in them. I've worked quite a lot with HLS data and done some experiments to see whether this can be faster (effectively avoiding data transfers between workers in the distributed case), and I think in practice I haven't seen any improvements. In terms of chunking, they should both be similarly easy to chunk (e.g. the code snippet above loading a Dataset with odc-stac is chunked in space).
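As a small illustration (variable names borrowed from earlier in the thread, so treat them as assumptions), rechunking looks the same either way:

# Rechunking the odc-stac Dataset in space...
ds = ds.chunk({'x': 2048, 'y': 2048})

# ...works the same as rechunking the stackstac DataArray.
hls_stack = hls_stack.chunk({'x': 2048, 'y': 2048})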

Unfortunately, I don't think the TPU would be any help here, but it's lovely to have more RAM and a bit more processing power.

I think the important thing here is making sure you're not computing the same thing several times. A neat way could be to define a dask.delayed function that, given a chunk of your data, computes all the statistics you want at once (roughly sketched below). If it fits, you can also persist the data just before you do all the statistics, and that should avoid computing things twice (e.g. all the cloud masking and weekly aggregation).
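As a rough, self-contained sketch of the dask.delayed idea (the random array and the two statistics are just stand-ins for your real stack and metrics):

import dask
import dask.array
import numpy as np
import xarray as xr

# Tiny stand-in for the real (time, y, x) stack, chunked in space.
da = xr.DataArray(
    dask.array.random.random((10, 100, 100), chunks=(10, 50, 50)),
    dims=('time', 'y', 'x'),
)

@dask.delayed
def all_stats(block):
    # Compute every statistic you need from the block in a single pass,
    # so each block only has to be loaded and preprocessed once.
    return np.nanmedian(block, axis=0), np.nanstd(block, axis=0)

# One delayed task per chunk of the underlying dask array, computed together.
tasks = [all_stats(b) for b in da.data.to_delayed().ravel()]
results = dask.compute(*tasks)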

Coming back to the persist option: for instance, if you did something like this (highly summarized!):

# preprocessing bit
ds = cloud_mask(ds)
ndvi  = compute_ndvi(ds)
ndwi = compute_ndwi(ds)

ndvi = ndvi.median(dim='time').compute()
ndwi = ndwi.median(dim='time').compute()

you would load the data and mask out the clouds twice, once for each index. You can get around that by persisting the dataset in between (provided, of course, that it fits in your memory, or in your cluster's memory):

# preprocessing bit
ds = cloud_mask(ds)
ds = ds.persist()
ndvi  = compute_ndvi(ds)
ndwi = compute_ndwi(ds)

ndvi = ndvi.median(dim='time').compute()
ndwi = ndwi.median(dim='time').compute()

This way, all the loading and the cloud masking is performed once and stored in (distributed) memory, so when we compute the ndvi and the time aggregation, we don't need to perform those operations again.

1 Like

Hey Kristian,

I’ve worked quite a lot with HLS data and done some experiments to see whether this can be faster (effectively avoiding data transfers between workers in the distributed case), and I think in practice I haven’t seen any improvements.

That's very helpful to know. I was having a hard time parsing out the pros and cons of datasets vs. data arrays. The odc method is definitely the way to go for me.

If it fits, you can also persist the data just before you do all the statistics, and that should avoid computing things twice (e.g. all the cloud masking and weekly aggregation).

Perfect!! That's exactly what I was looking for. I had a feeling, processing-wise, that it was repeating some of the cloud masking and weekly aggregation multiple times throughout. I knew there was a way around that. Thank you!

2 Likes

Glad that it helped! Did you manage to get it to behave in the end? I'd love to see what you came up with!

I just thought of another great bit of documentation: Dask Best Practices — Dask documentation. It may not be a game-changer if you're not running things in a very parallelized way, but if you do end up running it on a cluster, there are some great tips in there.

It's also potentially very useful for people stumbling upon this thread in the future.