Pangeo Bacalhau Backend

My name is Wes Floyd and I’m a Product Manager at Protocol Labs (creators of IPFS and Filecoin), working on a new open source project, Bacalhau, to build a “lowest cost, open, and reproducible” compute platform for researchers. Think of it as a simple, collaborative way to do the “big data processing” part of your pipeline at much lower cost.

For example, we’re focused on helping migrate workloads like the EUREC4A project to IPFS/Filecoin for lowest-cost storage and to Bacalhau for lowest-cost, open data pipeline processing. Here is a video overview of our mission.

Can I get your feedback on whether this type of open compute platform would be helpful for the Pangeo (and Pangeo Forge) community? If so, we could start orienting our development efforts to serve as a backend, e.g. a Pangeo Forge Bakery (similar to AWS, Azure, and Google Cloud).

Cheers

Hi @wesfloyd, welcome to this forum!

I had heard of IPFS before, but never had a chance to dig deeper into this technology, which sounds really interesting for sharing scientific content.

I’ve watched the first part of the video you linked above (I think the second half is not focused on Protocol Labs software), and I think Protocol Labs’ goals and vision are shared with those of the Pangeo community. But I’m still not sure what Bacalhau enables, what to do with it, or how to link it with the Pangeo ecosystem or Pangeo Forge.

If we make the analogy with Pangeo reference infrastructures, I would say that:

  • IPFS would be seen as a cloud object storage, or as an HPC filesystem.
  • Bacalhau would be the computing infrastructure part, so something like Kubernetes. But is it closer to Kubernetes, plain VM management like AWS EC2, a job queuing system, or even Dask? Or maybe a bit of all that?

Sticking with what we often call Pangeo platforms, built for example on Google GKE and GCS, with Jupyterhub and Dask on top of it, this is currently our

Okay, I’m not sure about the cost, but Pangeo is clearly there at least partially to simplify and collaboratively do “big data processing”.

So do you propose to test deploying Dask Clusters and Notebooks (or in the case of Pangeo Forge: Prefect) on top of IPFS and Bacalhau? This would certainly be interesting to try if possible.

Or do you propose to replace Dask/Prefect with Bacalhau when using your solutions?

But I think this raises plenty of questions: how is IPFS accessed (which interfaces), how does it perform when accessed from outside Bacalhau, what does all of that cost, is it performant enough for big scientific data processing even within Bacalhau, etc.?

@geynard I appreciate your feedback and questions. Here are a few points to add context:

Data: IPFS is being used more broadly in life sciences as an easy-to-use medium for sharing large datasets across institutions. Here’s an example with the EUREC4A project via Max Planck and other universities: Data on IPFS — How to EUREC⁴A . IPFS is more of a read/write (POSIX) style filesystem; however, there are many tools written on top of it, such as Filebase, that have implemented S3-style object storage on top of IPFS.

Compute: We envision Bacalhau as a lowest-cost cloud infrastructure (compute / big data processing) platform for researchers (similar to how Filecoin is a lowest-cost long-term storage layer). Bacalhau does not implement Kubernetes; however, it does allow for running any number of Docker containers. After a high-level review of the Pangeo Forge architecture, I expect Bacalhau would function as another Bakery option for researchers (Core Concepts — Pangeo Forge documentation).

No Cost: Because Bacalhau is in its testnet phase, we are offering compute at “no cost” to researchers who can help provide feedback on the platform.

Orchestration: Prefect (or a similar DAG/workflow engine) would be a reasonable fit for Bacalhau; however, I’d like to find a user first that could take advantage of the free compute before we invest more time in the integration effort.

Let me know if you’re interested in discussing together further - we would love to support the Pangeo community with free/lowest cost compute if we can find early users to partner and provide feedback.

(Had to break up the post into two since I’m a new user and limited to two hyperlinks per post :wink: )

Please see a simple hello world demo for Bacalhau here: Bacalhau Demo July 1st - YouTube

Please reach out to us via our Slack channel if you are interested in participating: Slack

The page on Data on IPFS is really interesting, I see that EUREC4A uses Intake and Zarr! It also gives a good summary of what IPFS is about. If we want to use IPFS, do we need to add our own IPFS nodes to it?

From a quick glance at the Bacalhau video, it looks like a batch submission system based on Docker which takes its input from and writes its output to IPFS. I’ve not been involved in Pangeo Forge, but it looks like it relies heavily on Prefect for orchestrating things, and I’m under the impression that it also uses Dask under the hood. Maybe @charlesbluca or @cisaacstern can confirm?
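To make the "batch submission based on Docker" idea concrete, here is a minimal sketch of how such a job could be assembled from Python. It assumes the `bacalhau docker run <image> <command...>` form used in the project's hello-world example; the helper name `bacalhau_docker_command` is hypothetical, and flags for mounting IPFS inputs are deliberately left out since I am not sure of their exact form.

```python
import shlex


def bacalhau_docker_command(image, command):
    """Assemble the argument list for a Bacalhau Docker job.

    Assumption: the CLI accepts `bacalhau docker run <image> <command...>`,
    as in the hello-world example. Nothing is executed here.
    """
    return ["bacalhau", "docker", "run", image, *command]


# Build (but do not run) a hello-world submission.
cmd = bacalhau_docker_command("ubuntu", ["echo", "Hello", "World"])
print(shlex.join(cmd))
```

With the Bacalhau client installed, the resulting list could be passed to `subprocess.run(cmd, check=True)` to actually submit the job.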

I think what has the greatest value here is IPFS, but if we could also plug in a Pangeo Forge bakery, that would probably show that your software is mature enough to host the whole Pangeo ecosystem. There are probably smaller steps to take first:

  • Try IPFS and Zarr access from compute resources other than Bacalhau.
  • Try to use Bacalhau as a compute platform with Dask, by developing something like dask-bacalhau to deploy a Dask cluster (you can have a look at Deploy Dask Clusters — Dask documentation to understand what I mean). I’m not sure what the network performance between two Bacalhau jobs would be.
  • Try to use Prefect with Bacalhau.
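For the first step, one low-effort way to try it would be to read a Zarr store through an IPFS HTTP gateway, without running a node. The sketch below only builds the URLs, using the path-style gateway layout `<gateway>/ipfs/<CID>/<key>`. The CID is a hypothetical placeholder, the helper names are mine, and `https://ipfs.io` is just one public gateway (a local node's gateway would work the same way).

```python
from urllib.parse import quote

# Hypothetical placeholder: substitute the root CID of a real Zarr store.
EXAMPLE_CID = "<root-CID-of-zarr-store>"


def gateway_url(cid, key="", gateway="https://ipfs.io"):
    """Build a path-style gateway URL: <gateway>/ipfs/<cid>/<key>."""
    base = f"{gateway.rstrip('/')}/ipfs/{cid}"
    key = key.strip("/")
    return f"{base}/{quote(key)}" if key else base


def zarr_metadata_urls(cid, gateway="https://ipfs.io"):
    """URLs of the usual Zarr v2 group-level metadata keys."""
    names = (".zgroup", ".zattrs", ".zmetadata")
    return {name: gateway_url(cid, name, gateway) for name in names}


for name, url in zarr_metadata_urls(EXAMPLE_CID).items():
    print(name, url)
```

Fetching the `.zmetadata` URL with any HTTP client and timing it from a laptop or an HPC node would already give a first idea of access performance outside Bacalhau.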

Anyway, even if all that looks interesting, the problem is what you really need:

I’d like to find a user first that could take advantage of the free compute

That is the complicated part: I’m not sure anyone here has the time or sufficient interest to do this. Personally, I would be more interested in testing IPFS than Bacalhau for now.

But we should maybe let someone from the Pangeo steering council chime in if interested.