Simple high-level requirement: create a tabular dataset for machine learning by intersecting EO raster data with ground truth in shapefile geometries. The result can usually be held in core (it's not necessarily a massive dataset once you've pulled out just the data you want).
I’ve got something working. I’m comfortable creating an xarray dataset of stacked EO arrays (e.g. each one named _). I can also load the shapefile as a GeoDataFrame in geopandas and pass it to make_geocube to create a rasterized xarray dataset of the features from the shapefile (which become the ground truth labels). I then add this rasterized ground truth dataset to the EO dataset with xarray merge, cast it all to a dask DataFrame, dropna (on the output label), and save to parquet. This gives a nice tabular dataset with each feature and the ground truth label as columns and the pixels as rows, keeping only the valid data (where there was a ground truth polygon).
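For concreteness, here's a minimal sketch of that pipeline (the file paths, the band variables "red"/"nir", and the "class_id" attribute column are just placeholders for my actual data):

```python
import geopandas as gpd
import rioxarray  # registers the .rio accessor
import xarray as xr
from geocube.api.core import make_geocube

# Stack the EO rasters into one Dataset, one variable per band/product
# (single-band GeoTIFFs assumed here, hence the squeeze).
chunks = {"x": 2048, "y": 2048}
eo = xr.Dataset(
    {
        "red": rioxarray.open_rasterio("red.tif", chunks=chunks).squeeze("band", drop=True),
        "nir": rioxarray.open_rasterio("nir.tif", chunks=chunks).squeeze("band", drop=True),
    }
)

# Rasterize the ground-truth polygons onto the same grid as the EO data.
gdf = gpd.read_file("ground_truth.shp")
labels = make_geocube(vector_data=gdf, measurements=["class_id"], like=eo)

# Merge, flatten to a dask DataFrame, keep only labelled pixels, write out.
ds = xr.merge([eo, labels])
ddf = ds.to_dask_dataframe()
ddf = ddf.dropna(subset=["class_id"])
ddf.to_parquet("training_data.parquet")
```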
But when you have a large area with sparse polygons (widely separated on the ground), you waste a lot of time processing chunks that you know are empty of useful data. It feels like the best approach would be to iterate over the polygons (rows in the shapefile), but this is where I could use some advice. What would be a good design pattern here?

It feels like I could create a Python list of xarray datasets, one per rasterized polygon, merge each with the EO rasters (keeping just the extent of the rasterized polygon), then iterate over that list of small datasets (each focused on one area of interest) and write them out to parquet (or whatever). And then I come up against how to do that efficiently with dask. I first, naively, thought I could write a function that casts a dataset to a dask DataFrame, drops NAs, and writes to parquet, and dispatch it to workers with dask's client.map. That gave me an error about serializing async objects (or somesuch, sorry for the hand-wavy error summary). I can instead simply iterate over the small xarray datasets in the list and pass each to my helper function. I've done this, and each dataset goes like greased lightning once it's serialized and sent to the workers. And there's the problem: the serialization overhead for each small dataset is horrendous, and the whole process took twice as long as it did before (when crunching through and filtering out vast swaths of NA).
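To make the per-polygon idea concrete, here's roughly what I mean, written as a plain serial loop for clarity (it reuses the hypothetical `eo` and `gdf` from the sketch above; `polygon_id` is just a column I'd add to track provenance). My actual attempts wrapped something like the loop body in a helper and tried to hand it to dask workers:

```python
import pandas as pd
import xarray as xr
from geocube.api.core import make_geocube

tables = []
for idx, row in gdf.iterrows():
    # Clip the EO stack to just the bounding box of this polygon
    # (assumes gdf is in the same CRS as eo; reproject first if not).
    window = eo.rio.clip_box(*row.geometry.bounds)
    # Rasterize only this polygon onto the clipped grid.
    labels = make_geocube(vector_data=gdf.loc[[idx]], measurements=["class_id"], like=window)
    small = xr.merge([window, labels])
    # Each window is tiny, so flattening in-core is cheap.
    df = small.to_dataframe().dropna(subset=["class_id"])
    df["polygon_id"] = idx
    tables.append(df)

pd.concat(tables).to_parquet("training_data_by_polygon.parquet")
```

It's parallelising this kind of per-polygon work with dask, without the per-dataset serialization cost eating the gains, that I can't figure out.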
So what’s a suggested design pattern here? Some sort of dataset structure whereby I can efficiently process just the bounding boxes of the known polygons? Put them all in a single dataset with an extra dimension that's an ID for each one? I'm fairly new to all of geospatial, xarray, and dask, so there seem to be a million moving pieces at the moment. Thanks for reading, and I look forward to getting to know this community better!