Welcome Rebekah!
A few things that I’ve noticed. Some of them won’t affect performance very much, but they will make the overall code nicer and maybe save you some operations:
- If you intend to use the Sentinel and the Landsat data together, you can just search for both collections at the same time (which would work unless you’re trying to pull bands unique to only one of the two satellites, like the S2 red edge bands):
stac_items = catalog.search(               # search the CMR STAC for HLS items
    collections=[s, l],                    # both HLS collection ids at once
    bbox=bbox_4326,                        # bounding box in EPSG:4326
    query={"eo:cloud_cover": {"lt": 30}},  # keep scenes with <30% cloud cover
    datetime=date_range,                   # start and end date
).item_collection()
However, there currently seems to be a bug with the LPCLOUD HLS collections where the search only returns the first page of results (see this Earthdata forum post), so it may be best to stick with searching them separately for now! Another good option would be to combine the STAC items before reading them in.
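A minimal sketch of that second option, assuming the pystac-client catalog from your notebook and the LPCLOUD collection ids HLSS30.v2.0 / HLSL30.v2.0 (adjust if yours differ):
import pystac

hlss_items = catalog.search(collections=["HLSS30.v2.0"], bbox=bbox_4326,
                            query={"eo:cloud_cover": {"lt": 30}},
                            datetime=date_range).item_collection()
hlsl_items = catalog.search(collections=["HLSL30.v2.0"], bbox=bbox_4326,
                            query={"eo:cloud_cover": {"lt": 30}},
                            datetime=date_range).item_collection()

# Merge the two result sets into a single ItemCollection before loading
stac_items = pystac.ItemCollection(list(hlss_items) + list(hlsl_items))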
- Something that may affect computation time: the time dimension. stackstac doesn’t do any handling of the time dimension for you, so it generates a unique time entry for each item you read. E.g. for just the Landsat part in your notebook, stackstac gives me this:
But odc.stac (using the functionality to group by solar day) gives me less than half the time points. Basically, it ‘combines’ acquisitions from the same Landsat path that are collected on the same day, instead of creating a unique entry in the time dimension for each of them. This avoids having a lot of NaNs everywhere in your array.
ds = odc.stac.load(
    stac_items2,
    bands=BANDS,
    groupby='solar_day',             # merge acquisitions from the same solar day
    chunks={'x': 4096, 'y': 4096},
    crs=EPSG,
    resolution=30,
    bbox=bbox_4326,
)
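A quick way to see the difference is to compare the length of the time dimension of the two results (using hls_stack for the stackstac array from your notebook and ds for the odc-stac load above):
print(hls_stack.sizes['time'])   # one entry per STAC item with stackstac
print(ds.sizes['time'])          # roughly half after groupby='solar_day'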
To double down on that point, the difference here seems to be this: for stackstac, in my example image below, T1…6 all have unique datetimes and each creates an ‘entry’ in the time dimension. For odc.stac as called above, you would have two days, one for T1…3 and the other for T4…6.
This has some performance implications, as seen in the screenshot below (although not rigorously tested!). It also helps keep the dask graph under control, which starts to matter when you try to process very large areas on distributed systems.
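One rough way to gauge that is to count the tasks behind each lazy array. It’s not a perfect apples-to-apples comparison (stackstac returns one DataArray with a band dimension, odc-stac one variable per band), but it gives a sense of graph size:
print(len(hls_stack.data.dask))                         # tasks in the stackstac graph
print(sum(len(ds[b].data.dask) for b in ds.data_vars))  # tasks across the odc-stac variables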
I think the performance aspect of this really shines as we add more operations. E.g. below you can see the timing difference of getting a single weekly mosaic from your combined weekly data, versus doing a similar thing with odc-stac, which is about 2x faster here:
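For reference, the odc-stac version of that weekly composite is roughly this (just a sketch; dask may need to rechunk along time for the median, and you’d only call .compute() if the result fits in memory):
weekly = ds.resample(time='1W').median()   # lazy weekly median composite
weekly = weekly.compute()                  # trigger the computation once, at the end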
- Perhaps you can simplify some of the index creation by defining a dataset (either through odc-stac, or by doing ds = hls_stack.to_dataset(dim='band')); then you could easily add indices like:
ds['ndvi'] = (ds['nir'] - ds['red']) / (ds['nir'] + ds['red'])
ds['evi'] = ...
This probably doesn’t affect performance at all, but it’s much easier to manage. And if you compute the dataset, it will compute all your indices at once. Before computing, it’s always good to check whether the result actually fits into memory.
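For example, a quick sanity check before calling .compute() (this is the uncompressed in-memory size, and doesn’t trigger any computation):
print(f"{ds.nbytes / 1e9:.1f} GB")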
In terms of computation - I’m not familiar with how Colab is set up now and how much more you get with a paid subscription, but currently I’m getting 2 CPUs and about 13 GB of RAM. That’s just not that much, and there’s a fair deal of data to be processed. The ideal way to speed this up is having access to a dask cluster, where you can parallelize your chunked workload across several machines. There’s an art to optimizing this further (in terms of chunk sizes and so on).
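Even without multiple machines, spinning up a local distributed client gives you the dask dashboard and a bit more control over memory. The worker and memory numbers below are just placeholders for the Colab VM:
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit='6GB')
client = Client(cluster)
print(client.dashboard_link)   # live view of the tasks as they run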
Something that can potentially work well is to figure out where you need to persist some computed results to avoid redoing them several times. For instance, you could try exporting your weekly medians to zarr once they are computed (if the resulting array is small enough that you can feasibly put it somewhere!), and then just read from there.
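Something like this (the path is just a placeholder; on Colab it could be a mounted Drive folder or a cloud bucket):
import xarray as xr

weekly.to_zarr('weekly_medians.zarr', mode='w')   # write the computed medians once
weekly = xr.open_zarr('weekly_medians.zarr')      # later steps read lazily from here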
What do you mean by this?:
I have multiple instances of the +/- compute option inc
I have some more thoughts, but this is long enough already, and I’m very keen to see other people’s thoughts and the different ways they come up with for doing this!
I hope this helps!