Hi everyone,
Following up on the discussion about Universities and HPC and the outcome of the European Pangeo meeting last year, I’ve spent a few days over the last few weeks with French computing center users and admins. @fbriol presented the benefits of Pangeo in the context of ocean science, and I gave talks on Dask on HPC and on the Pangeo community more broadly. All were very well received. Below are some points that I noted and that may interest the community.
First, there is a lot of discussion about interactive HPC, or at least about how to provide easier and more agile access to top-tier national computing resources. Historically, getting computing time there is a very painful process, and you had better have a well-designed code to run. But GENCI, the organization that designs the three most powerful French public HPC systems, is trying to adapt to what they call the AI community, which will benefit all scientists doing data analysis in general. So basically, they’re trying to simplify processes, but they’re also working on Jupyter-like interfaces. I’ve engaged them officially to discuss whether our community could help by providing feedback or tooling. They’re currently working with a French university which has developed an interesting interface on top of Jupyter for providing resources (see this); I don’t know if they’re planning to provide standard JupyterHub access. One of the main problems they have is securing access from a web interface to resources in a restricted-access area. Do we have some experience here, maybe with the NCAR JupyterHub?
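For reference, here is a minimal sketch of how a JupyterHub deployment can keep the user-facing web entry point separate from the restricted compute nodes, assuming a Slurm-based cluster and the batchspawner package; the hostnames, partition name and resource requests are placeholders, and I don’t know whether the NCAR setup actually looks like this.

```python
# jupyterhub_config.py -- minimal sketch, assuming a Slurm cluster and batchspawner.
# Hostnames, partition and resource values below are placeholders.
c = get_config()  # noqa: F821  (injected by JupyterHub when loading this file)

# Spawn each user's notebook server as a Slurm job on the compute nodes,
# so only the Hub and its proxy need to be reachable from outside.
c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'
c.SlurmSpawner.req_partition = 'interactive'   # hypothetical partition name
c.SlurmSpawner.req_runtime = '02:00:00'
c.SlurmSpawner.req_memory = '8G'

# The Hub listens on the gateway node; single-user servers talk back to it
# over the internal network only.
c.JupyterHub.bind_url = 'https://0.0.0.0:443'
c.JupyterHub.hub_connect_ip = 'hub.internal.example'  # hypothetical internal address
```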
A lot of interest has been shown in the Zarr file format and in the performance differences we observed using NetCDF or HDF files at scale. Various communities came to me asking for more information on what Zarr is, why it is more performant, and whether it could fit their purpose. It’s always surprising to me how computing centres are still discovering data-driven analysis workloads with non-standard HPC patterns (lots of jobs, lots of created files, big files with random I/O…).
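To give an idea of what this looks like in practice, here is a small sketch of converting a NetCDF file to Zarr with xarray and reading it back lazily; the file name and chunk sizes are made up.

```python
import xarray as xr

# Open a NetCDF file with Dask-backed chunks; file name and chunking are placeholders.
ds = xr.open_dataset('dataset.nc', chunks={'time': 100})

# Each chunk of the Zarr store is a separate small object/file, so many Dask
# workers can read or write concurrently instead of contending on one big HDF5 file.
ds.to_zarr('dataset.zarr', mode='w')

# Reading back is lazy: only the chunks touched by a computation get loaded.
ds2 = xr.open_zarr('dataset.zarr')
print(ds2)
```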
GPUs are obviously an important subject, mostly in the sense of how to use them for traditional HPC computation. The KeOps library was presented with its Python bindings; strangely, nothing on CuPy or RAPIDS. It still looks difficult to use GPUs for anything other than deep learning.
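For what it’s worth, here is a tiny sketch of the kind of drop-in, non-deep-learning GPU usage CuPy allows; the array sizes are arbitrary.

```python
import numpy as np
import cupy as cp

# The same array expressions run on the GPU simply by swapping the module.
x_cpu = np.random.random((4096, 4096))
x_gpu = cp.asarray(x_cpu)            # copy to device memory

y_gpu = cp.fft.fft2(x_gpu) * 2.0     # FFT and scaling computed on the GPU
y_cpu = cp.asnumpy(y_gpu)            # copy back to host when needed
```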
Many discussions on data handling too: Data Management Plans, FAIR data, how to provide access to the data, centralised access vs. a federation of centers, data discoverability… Maybe someone here has something to say, regarding datasets such as the incoming CMIP6? STAC and Intake are definitely important here as well.
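As an illustration of why Intake matters for discoverability, here is a sketch of browsing a catalog and lazily opening a dataset; the catalog URL and entry name are hypothetical placeholders, and a real CMIP6 catalog would be published by the data provider (e.g. via intake-esm).

```python
import intake

# Catalog URL and entry name are hypothetical placeholders.
cat = intake.open_catalog('https://example.org/catalog.yaml')
print(list(cat))                      # discover which datasets are available

# With an intake-xarray driver, this returns a lazy, Dask-backed xarray.Dataset.
ds = cat['sea_surface_temperature'].to_dask()
print(ds)
```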
There is also a great deal of interest from the research community in what they call Virtual Research Environments, or Science Platforms. The Pangeo platform qualifies as one of those, but the other examples I’ve seen often provide more components or services, assuming their communities don’t always know how to code.
On the infrastructure side, computing centers are essentially built with HPC platforms and/or OpenStack clusters. But that’s probably an obvious statement by now.