Blog post: Cloud-native pipelines for scientific data processing with Prefect and Dask

Hey everyone!

I hope you don’t mind the self-promo too much, but I recently published an extended entry-level tutorial on building an end-to-end pipeline with Prefect and Dask. It uses xarray and Zarr on AWS S3 to convert raw hydroacoustic data from the NOAA NCEI archive.

The data we used is a small subset of the 2019 Fall Bottom Trawl Survey, conducted by the Northeast Fisheries Science Center – a fishery-independent, multi-species survey that provides the primary scientific data for fisheries assessments in the U.S. mid-Atlantic and New England regions.

The raw data was recorded with an EK60 scientific echosounder, a very common narrow-band split-beam sonar. The survey was conducted on board the Henry B. Bigelow.

To process the raw acoustic data, we used echopype, the primary Python-based open-source tool for fisheries acoustics data analysis.

More info on the dataset: NOAA Northeast Fisheries Science Center. 2019. ‘EK60 Water Column Sonar Data Collected During HB1906’. NOAA National Centers for Environmental Information. https://doi.org/10.25921/vt45-sa66

Thanks for reading!

Thanks for sharing @andreirusu! Reminds me a lot of the early architecture of Pangeo-Forge, when we used Prefect.

PS: I think you missed a chance to call the output data sonzarr :wink:

That’s a good point @jhamman!

Pangeo-Forge does look quite similar. I guess Prefect (presumably v2) was used in the bakery?
I’m not a strong advocate of Prefect, but the folks at UW who develop echopype are also using it (v3), and I was curious to hear why it was dropped, if you don’t mind sharing.

For us, a big bonus of using Prefect is its slick UI, which we can fork and extend. The prefect-dask task runner also seems to be good enough.

Pangeo-Forge switched to Apache Beam for the execution engine. I’m not sure it was worth it in the end, but it’s water under the bridge!