Serverless Pangeo

We’ve discussed serverless compute here and there over the years, but I thought it might be nice to start a dedicated topic. Here’s a timeline of some relevant activities I know of:

November 2017: AWS Workshop on Massive parallel data processing using pywren and AWS Lambda, featuring a Landsat 8 NDVI processing example notebook! This used pywren.default_executor(). (It requires Python 2.7, and pywren has since been supplanted by Lithops.) (Thanks to @RichardScottOZ for discovering this!)

April 2018: Post by @jacobtomlinson on Exploring Dask and Distributed on AWS Lambda, where he discusses the 250 MB limit for Lambda and other problems. He concludes:

… the future of distributed computation will make use of highly abstracted function execution services like Lambda. However with the tools we have today there are too many issues to get this working without having major limitations.

May 2020: @tomwhite creates Dask Executor Scheduler, a Dask scheduler that uses a concurrent.futures.Executor to run tasks. He then used it for a Zarr rechunking workflow with pywren on GCP (also using rechunker).

November 2020: The GitHub project pywren-ibm-cloud becomes Lithops

January 2021: Blog post by IBM discussing integration of Lithops with IBM Cloud: Using Serverless to Run Your Python Code on 1000 Cores by Changing Two Lines of Code.
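The Dask Executor Scheduler idea from the May 2020 entry is worth unpacking: a Dask-style task graph can be evaluated on top of any concurrent.futures.Executor, and that interface is exactly the seam a serverless backend plugs into. Here is a toy sketch of the pattern; the graph format and the run_graph helper are illustrative stand-ins, not the real package's API:

```python
from concurrent.futures import ThreadPoolExecutor

# A toy version of the idea behind Dask Executor Scheduler: evaluate
# a Dask-style {key: (func, *args)} task graph on top of any
# concurrent.futures.Executor.  Swapping ThreadPoolExecutor for a
# serverless executor (pywren/Lithops expose a similar submit-style
# interface) is what makes the scheduler "serverless".

def run_graph(graph, key, executor):
    """Recursively evaluate `key`, submitting each call to the executor."""
    task = graph[key]
    if isinstance(task, tuple):
        func, *args = task
        # Arguments that are themselves graph keys get evaluated first.
        resolved = [run_graph(graph, a, executor) if a in graph else a
                    for a in args]
        return executor.submit(func, *resolved).result()
    return task  # a literal value

graph = {
    "x": 3,
    "y": 4,
    "double_x": (lambda v: 2 * v, "x"),
    "total": (lambda a, b: a + b, "double_x", "y"),
}

with ThreadPoolExecutor(max_workers=4) as pool:
    print(run_graph(graph, "total", pool))  # 10
```

A real scheduler also handles shared intermediates, batching, and failure retries, but the executor boundary is the same.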

It would be cool to have an updated workshop showing how this stuff can work for Pangeo workflows. Perhaps it would just take updating the old AWS 2017 workshop to use Python 3 and lithops, using the Dask Executor Scheduler?
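The core computation in that old workshop is small enough to port easily. A minimal numpy sketch of the NDVI kernel, assuming Landsat 8 band ordering (red = band 4, NIR = band 5) and leaving the serverless fan-out to Lithops or Dask:

```python
import numpy as np

def ndvi(red, nir):
    """Normalized Difference Vegetation Index: (NIR - red) / (NIR + red).
    Works on whole scenes (2-D arrays) or chunks of them, so it maps
    cleanly over many Landsat 8 tiles with a serverless executor."""
    red = red.astype("float32")
    nir = nir.astype("float32")
    denom = nir + red
    # Guard against division by zero over water/nodata pixels.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(denom != 0, (nir - red) / denom, np.nan)

red = np.array([[0.1, 0.2], [0.3, 0.0]], dtype="float32")
nir = np.array([[0.5, 0.6], [0.3, 0.0]], dtype="float32")
print(ndvi(red, nir))
```

Mapping this function over a list of scene URLs is exactly the shape of job the pywren/Lithops executors are built for.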

I look forward to seeing what people have to say who actually understand this stuff. :slight_smile:


There are also things like titiler (github.com/developmentseed/titiler). For example, see COG Talk — Part 2: Mosaics by Vincent Sarago (Development Seed, on Medium).

It's always great to learn about more use cases where Lithops can fit. I would love to hear more about your use cases and see whether I can help you leverage Lithops for your workloads. That would be very interesting to me.

One of our recent projects is with EMBL (the European Molecular Biology Laboratory), where Lithops is used to run their workloads against a serverless backend (GitHub - metaspace2020/Lithops-METASPACE: Lithops-based Serverless implementation of the METASPACE spatial metabolomics annotation pipeline). Before Lithops, their main challenge was that the amount of compute resources needed is only known at runtime, so it was almost impossible to size a compute cluster up front: once the workload ran, the cluster could turn out to be too small or too large. Because Lithops is truly serverless, it requests exactly the resources it needs dynamically at runtime, which resolved their problem.
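The pattern described here is a runtime-sized fan-out: you map over the data and let the backend allocate workers on demand instead of pre-sizing a cluster. A stdlib sketch of that shape; Lithops' FunctionExecutor exposes a similar map()/get_result() pair, and ThreadPoolExecutor merely stands in for it here:

```python
from concurrent.futures import ThreadPoolExecutor

def annotate(chunk):
    # Stand-in for one annotation task over a data chunk.
    return sum(chunk)

def run(chunks):
    # With Lithops this would be roughly:
    #   fexec = lithops.FunctionExecutor()
    #   fexec.map(annotate, chunks); return fexec.get_result()
    with ThreadPoolExecutor() as pool:
        return list(pool.map(annotate, chunks))

# The number of chunks is discovered at runtime, not when a cluster
# was provisioned.
chunks = [[1, 2], [3, 4, 5], [6]]
print(run(chunks))  # [3, 12, 6]
```

The serverless backend makes the fan-out width a non-decision: ten chunks or ten thousand, the calling code is the same.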

Next week I am giving a talk about Lithops at CNCF (CNCF Live Webinar: Toward Hybrid Cloud Serverless Transparency with Lithops Framework | Cloud Native Computing Foundation), and I will give plenty of demos with real usage examples. If you can't attend, the talks are recorded and will appear on YouTube at some point. This might give you more background on the project and how it can be used.

I have some experience mixing xarray + dask with Google Cloud Dataflow for coarse-graining atmospheric model output. It was amazing to see hundreds of workers spin up, but there is some setup cost. Because the workers take so long to spin up, it was not possible to iterate on these pipelines quickly in the typical Pangeo style. It was more of a serious software-engineering task requiring unit tests, clean code, and an extremely careful eye. I am not sure whether this is intrinsic to all "serverless" platforms or specific to the APIs provided by Apache Beam.
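For context, the coarse-graining operation itself is just a block average that reduces resolution; the pipeline machinery exists to apply it at scale. A minimal numpy sketch (xarray's coarsen() or a Beam pipeline would wrap the same arithmetic):

```python
import numpy as np

def coarse_grain(field, factor):
    """Block-average a 2-D field by `factor` in each dimension,
    e.g. 0.25-degree model output down to 1-degree."""
    ny, nx = field.shape
    assert ny % factor == 0 and nx % factor == 0, "shape must divide evenly"
    # Reshape into (coarse_y, factor, coarse_x, factor) blocks,
    # then average within each block.
    return field.reshape(ny // factor, factor,
                         nx // factor, factor).mean(axis=(1, 3))

field = np.arange(16, dtype="float64").reshape(4, 4)
print(coarse_grain(field, 2))  # [[ 2.5  4.5] [10.5 12.5]]
```

The engineering burden in the Dataflow version comes from distributing this over terabytes of chunks, not from the kernel itself.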

For some reason, Dataflow reminds me of elaborate mail-sorting machines like this: High Speed Letter Sorter - The NPI Maxim for High Speed Letter Sorting - YouTube. It can process a high volume of items, but I don't need one at my house.

A good path forward might be writing general-purpose transformation scripts backed by Dataflow. For example, the "rechunker" tool recently developed by @rabernat et al. also works with Apache Beam thanks to @shoyer: rechunker/beam.py at master · pangeo-data/rechunker · GitHub.

@Gil_Vernik, thanks so much for the update on Lithops. I wasn't able to attend your webinar earlier this week but would be very interested in watching the recording when it's available. Should I just keep checking CNCF [Cloud Native Computing Foundation] - YouTube every few days?

The 250 MB limit for Lambda is no longer applicable: as of December 2020 the memory limit is 10 GB (up from 3 GB), with up to 6 vCPU cores. Surely that is enough for most tasks? Serverless seems ideally suited for notebook-style workflows that sit mostly idle until someone runs a cell.


PS. Link: AWS Lambda now supports up to 10 GB of memory and 6 vCPU cores for Lambda Functions

Here is the link: https://youtu.be/-uS-wi8CxBo

By the way, other cloud providers also offer serverless without these "limits": more run time, more memory, etc.

This is another cool new “serverless” thing:

I just talked to Google about Apache Beam, and there is apparently a Dataflow extension for geospatial data called geobeam:

GoogleCloudPlatform/dataflow-geobeam (github.com)

That is all I know until I do some reading, but it sounds interesting at least.

https://medium.com/weareservian/gcp-dataflow-with-python-for-satellite-image-analysis-5b9b8a81ff74

Out of curiosity, can you please share your use case? What is the problem or software that you are trying to move to serverless? I am looking for nice use cases where serverless can fit, so if you have some ideas and the code is open source, please share them here.