Main task:
I have ~100,000 points with xy coordinates distributed over almost the entire African continent (bounds = (-20, -35, 52, 30)). As predictor variables I am using quarterly band values calculated from Sentinel-2 images for 2022. I want to extract the pixel values at the points for species distribution modelling.
Step 1: Set up a Dask cluster for parallel computing.
import dask_gateway

cluster = dask_gateway.GatewayCluster()
client = cluster.get_client()
cluster.adapt(minimum=8, maximum=100)  # scale adaptively between 8 and 100 workers
Step 2: Data access in the Planetary Computer
import geopandas as gp
import planetary_computer
import pystac_client
from shapely.geometry import box

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
# Region of interest as (minx, miny, maxx, maxy) in lon/lat
bbox = gp.GeoDataFrame(
    geometry=gp.GeoSeries([box(13.00, -21.50, 35.20, -20.00)]),
    crs="epsg:4326",
)
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=bbox.total_bounds.tolist(),
    datetime="2022-01-01/2022-12-31",
    query={"eo:cloud_cover": {"lt": 10}},  # scenes with less than 10% cloud cover
)
items = search.item_collection()
Step 3: Lazy load of the data
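from odc.stac import stac_load

ds = stac_load(
    items,
    crs=epsg,  # `epsg` (the target CRS) is defined elsewhere, not shown in this snippet
    resolution=10,
    bands=["red", "green", "blue", "nir"],
    chunks={"x": 2048, "y": 2048},  # lazy Dask-backed arrays
    bbox=bbox.total_bounds.tolist(),
)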
Step 4: Extracting the pixel values to the points
import xarray as xr

bgPoints = gp.read_file("~/ndvi_sdm/files/bg_points.gpkg")
x = xr.DataArray(bgPoints["geometry"].x)  # for data in .gpkg format
y = xr.DataArray(bgPoints["geometry"].y)
# Pointwise nearest-pixel lookup: x and y share one dimension, so one pixel per point
data_extract = ds.sel(x=x, y=y, method="nearest", drop=True)
bgData = data_extract.compute()
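To make the behaviour of the `sel` call in Step 4 concrete, here is a minimal sketch on a small synthetic raster. The raster values, coordinates, and the dimension name `points` are made up for illustration and are not part of the workflow above. It shows that when both indexers share a dimension, xarray returns one pixel per point rather than a full x-by-y grid. It also assumes the point coordinates are in the same CRS as the raster, which nearest-neighbour lookup on coordinate values requires.

```python
import numpy as np
import xarray as xr

# Synthetic 4 x 5 raster standing in for one band of `ds` (values are made up).
raster = xr.DataArray(
    np.arange(20).reshape(4, 5),
    dims=("y", "x"),
    coords={"y": [10.0, 20.0, 30.0, 40.0], "x": [0.0, 10.0, 20.0, 30.0, 40.0]},
    name="red",
)

# Three hypothetical point locations, in the same coordinate system as the raster.
px = xr.DataArray([1.0, 19.0, 38.0], dims="points")
py = xr.DataArray([39.0, 21.0, 12.0], dims="points")

# Both indexers share the "points" dimension, so the result has one value per point.
sampled = raster.sel(x=px, y=py, method="nearest")

print(sampled.dims)    # ('points',)
print(sampled.values)  # [15  7  4]
```

If the two indexers had different dimension names, the same call would instead return every combination of the x and y values, which for ~100,000 points would be far larger than intended.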
This is where I am having trouble. If the region of interest (bbox) is small, there is no issue. However, if the bbox is large, as in this case, the computation takes a long time and I get the error shown below, which says it failed to reconnect to the scheduler and then closes the client.
I am relatively new to Dask and xarray. How can I efficiently extract pixel values for many points distributed over a large area? Is there a way to use a larger bounding box while reducing the computation time? My current approach takes more than a month just to get the data for my area of interest, which is not sustainable in the long run because I also plan to build a species distribution model to monitor the change in species distribution each year.
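One possible direction, offered as a sketch under assumptions and not tested against this workload, is to split the points into spatial tiles and run Steps 2 to 4 once per tile, so that each search and task graph covers a small bounding box. The helper name `group_points_by_tile`, the 5-degree tile size, and the example coordinates below are all hypothetical.

```python
import math
from collections import defaultdict


def group_points_by_tile(points, tile_size_deg=5.0):
    """Group (lon, lat) pairs by the square tile that contains them.

    Returns a dict mapping (minx, miny, maxx, maxy) tile bounds to point lists.
    """
    tiles = defaultdict(list)
    for lon, lat in points:
        # Tile index along each axis; floor also handles negative coordinates.
        col = math.floor(lon / tile_size_deg)
        row = math.floor(lat / tile_size_deg)
        bounds = (
            col * tile_size_deg,
            row * tile_size_deg,
            (col + 1) * tile_size_deg,
            (row + 1) * tile_size_deg,
        )
        tiles[bounds].append((lon, lat))
    return dict(tiles)


# Made-up coordinates inside the stated continental bounds (-20, -35, 52, 30).
example_points = [(13.2, -21.1), (14.9, -20.3), (34.8, -20.9), (-17.5, 14.6)]
tiles = group_points_by_tile(example_points)

for bounds, pts in sorted(tiles.items()):
    print(bounds, len(pts))
```

Each key could then be passed as the `bbox` for a separate search and load, with only that tile's points used in the `sel` call. Tiles that contain no points never appear in the dict, so no imagery would be requested for them.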