Wednesday February 22nd 2023: Explore cloud datasets from your notebooks

Pangeo Showcase Talk by Ramon Ramirez-Linan at Navteca

Bio
Ramon Ramirez-Linan (@rezuma) holds a computer engineering degree from the University of Sevilla (Spain). He has been working on cloud infrastructure projects for NASA and NOAA for the last 10+ years, on projects such as the NOAA CLASS Cloud Pilot Project, NASA Climate in a Box, the NASA HQ Managed Cloud Environment (HQMCE), the NASA Next Generation Application Platform (NGAP), and the NASA Science Managed Cloud Environment (SMCE).

Abstract
Navteca has been helping NASA and other organizations deploy JupyterHub connected to HPC clusters in the cloud; we call this platform the Open Science Studio. Through collaboration with NASA scientists working on the Earth Information System pilot project and other projects that use the Open Science Studio, we have been collecting requirements and feedback from the scientists on making the platform more useful for their purposes. Some of this feedback has been translated into a new JupyterLab extension that provides several new capabilities, one of which we will present to the Pangeo group. The new JupyterLab extension allows Jupyter Notebook users to interact with datasets stored in different cloud object storage systems, including AWS S3, Azure Blob Storage, and GCP Cloud Storage. The extension allows users to download and upload files to these cloud storage systems, access publicly available datasets from AWS, Azure, and Google, add specific buckets to a list of favorites, and even make cross-account buckets available in the user interface.

Sorry I missed this one. I have thoughts… Is this a good place to discuss?

That’s what the forum is for! Discuss away!

On the file explorer extension.

I would like to point out a couple of repos of previous work on this topic:

Both of these allow direct opening of notebooks/files in jlab, and writing too.

The ideas of having favourites and of splitting out known interesting buckets of data are great. How were the buckets curated? Just AWS’s “public data” listing?

I would suggest, though, that focussing on “download” of files is a problem. Users will generally not need whole files, and a typical jhub deployment in particular does not have much available disc space. In the pangeo context, it would be better, I think, to give quick views of files and to copy the full URL into code cells for full analysis in the notebook. Anyone want to make an xarray view panel?
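
For example, a notebook cell can open the data lazily from its URL instead of downloading it first. A minimal sketch, assuming a public Zarr store (the bucket path below is hypothetical) and that s3fs is installed:

    import xarray as xr

    # Hypothetical public Zarr store; any s3://, gs:// or abfs:// URL
    # reachable through fsspec works the same way.
    ds = xr.open_zarr(
        "s3://some-public-bucket/dataset.zarr",
        storage_options={"anon": True},  # anonymous access for public data
    )
    print(ds)  # lazy: only the metadata has been read at this point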

Could I have links to the repos mentioned, please?

Navteca JupyterLab Extensions: BExplorer and Pasarela

Ramon has said BExplorer was motivated by limitations of the IBM S3 browser extension. A key feature of Navteca’s extension is handling datasets that may require multiple credentials.

I agree that focussing on “download” would be limiting. My understanding is that, by releasing this as open source, Navteca is inviting the community to extend and adapt the extension, for example to include autogeneration of code snippets that would allow access through fsspec.
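
A generated snippet could be as simple as the following sketch (bucket and key are hypothetical):

    import fsspec

    # Stream the object over the network; no local copy is made.
    with fsspec.open("s3://some-bucket/some/key.nc", mode="rb", anon=True) as f:
        header = f.read(8)  # e.g. peek at the file's magic bytes
    print(header)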

Thanks - I see this is pretty early stage for the project, so if all those boxes in the README get filled, it should be very promising.

For pangeo, I would LOVE if you could get a preview/HTML repr of known xarray filetypes (.zarr, .nc, .hdf5, .tiff …).
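
Even just rendering xarray’s existing rich HTML summary would go a long way. A minimal sketch of what a preview pane could reuse, assuming a readable Zarr store (URL hypothetical):

    import xarray as xr
    from IPython.display import HTML

    ds = xr.open_zarr(
        "s3://some-public-bucket/dataset.zarr",
        storage_options={"anon": True},
    )
    # The same rich summary a notebook renders for a Dataset
    HTML(ds._repr_html_())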

This is awesome feedback, @martindurant. I was wondering whether a preview of these file types would be beneficial; we will definitely add it to the roadmap. We will also add the FITS format, since we are also getting a wishlist from the astrophysics community.

So, the main issues with the IBM S3 extension that compelled us to work on this were:

  • The IBM S3 extension shows all of the buckets in the AWS accounts connected to the credentials provided (via keys or an IAM role), even if you don’t have read access to them.
    We usually deploy 3 buckets for the scientists on our Jhub (or Open Science Studio) deployments, but some accounts have dozens of buckets, so it was confusing to navigate through all of them.
  • Additionally, the IBM extension doesn’t allow adding cross-account buckets, which was a critical requirement for us, since our scientists need to see buckets that are part of the NASA Earth Data holdings (see the sketch after this list).
  • Finally, we did build a version based on the IBM S3 extension (it was called S3+), but decided to build a new extension to make the code more portable to other object storage systems down the road.
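
The cross-account case is the tricky one: boto3’s list_buckets only returns buckets owned by the caller’s account, so a cross-account bucket has to be registered by name and accessed directly. A minimal sketch (bucket name hypothetical):

    import boto3

    s3 = boto3.client("s3")

    # A bucket shared from another account never shows up in list_buckets,
    # but it can still be listed directly if its policy grants us access.
    resp = s3.list_objects_v2(Bucket="partner-account-bucket", MaxKeys=10)
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])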

PS: Open Science Studio is what we call our deployment of Jhub + HPC using Universal Control Plane

The view panel is a great idea.
We use an AWS GitHub repo that lists all the public datasets and the metadata associated with them.
In relation to downloading the data: I agree that most users are not going to need to download files with the BExplorer, but we support many “flavors” of users, and some of them do like that capability.

How do you determine if you have access to a bucket?

Hello @martindurant! Thank you very much for the great ideas and feedback.

We are using boto3’s list_buckets and list_objects_v2 methods to determine which buckets a user has access to, based on the credentials provided. This is for what we call private buckets. For the public ones we don’t do that check: with around a thousand buckets, the small per-bucket check time compounds, and even with a cache, running those checks while loading the extension made it noticeably slower. Perhaps there is a better approach to this.
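
For reference, a minimal sketch of that check with boto3 (the helper name is ours, not the extension’s):

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def readable_buckets():
        """Return the buckets the current credentials can actually read."""
        readable = []
        for bucket in s3.list_buckets()["Buckets"]:
            name = bucket["Name"]
            try:
                # A one-key listing is a cheap probe for read access.
                s3.list_objects_v2(Bucket=name, MaxKeys=1)
                readable.append(name)
            except ClientError:
                pass  # AccessDenied and the like: skip this bucket
        return readable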

I hope this answers your questions appropriately

Please feel free to reach out to us and/or check our repo here

Thank you very much

Thank you @martindurant for the feedback and ideas.

I just created the feature request here

Feel free to comment there if you want to provide more details or thoughts about it.

Thank you very much again
Have a great day