I am receiving some cryptic error messages on ocean.pangeo.io this morning. When I try to open a file, I receive the error “File Load Error: Unhandled Error”. This error appears to be related to a lack of disk space in my directory. Here is the terminal output for a “df -h .” command, reformatted for convenience:
Thanks for your message Dan! (And welcome to Pangeo!)
It looks like our shared 1TB disk for home directories is full. This is not something we have planned for or have a systematic way to resolve. (Another reason why ocean.pangeo.io remains more of a demonstration than a stable long-term platform from which to conduct research.)
It would be nice if we could figure out how to enforce quotas.
@jhamman just pointed me here, and I did a bit of research on how you can enforce quotas on NFS file systems.
Traditionally, you would back your NFS with a filesystem that supports quotas (often XFS) and use that filesystem to enforce them. XFS is particularly popular since it supports quotas per directory (which is what we want), not just per user.
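For reference, XFS directory ("project") quotas are driven by a mount option plus two small config files. A minimal sketch, assuming the export lives at `/export` and a hypothetical user directory `/export/home/dan` (paths and names here are placeholders):

```
# Mount the XFS volume with project quotas enabled, e.g. in /etc/fstab:
#   /dev/sdb1  /export  xfs  defaults,prjquota  0 0

# /etc/projid -- maps a project name to a numeric project id (both our choice)
dan:1001

# /etc/projects -- maps that project id to the directory tree it covers
1001:/export/home/dan

# Then, as root, initialize the project and set a hard block limit:
#   xfs_quota -x -c 'project -s dan' /export
#   xfs_quota -x -c 'limit -p bhard=10g dan' /export
```

Writes under `/export/home/dan` then fail once the directory tree hits 10 GB, regardless of which user id did the writing.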
Most managed NFS stores (EFS, Google Filestore, etc.) do not let us set these options, however, so we would have to run our own NFS server. That isn't hard, but it's also not something you really want to do. If we go that route, I'd prefer we run it in our kubernetes cluster itself.
Back it with an EBS volume / Google Cloud Persistent Disk formatted as XFS (specifying fsType: xfs in that PersistentVolumeClaim)
Then turn on xfs quotas (this functionality needs to be exposed in the helm chart, even though it exists in the project)
Use dynamic provisioning for the user pods, with StorageClass set to NFS
This would provision a PVC for each user and set the disk-size quota we ask for. I think the quota can be changed later, but that would need to be checked. The NFS server would then run on a core node; if that node goes down you'll have downtime. This needs to be kept in mind when you are upgrading nodes / moving nodepools.
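To make the first two steps concrete, here is a sketch of what the XFS-backed disk for that NFS server could look like on GKE. This assumes the in-tree GCE PD provisioner; the names (`xfs-pd`, `nfs-server-data`) and the 1Ti size are illustrative placeholders, not tested config:

```yaml
# StorageClass that formats the dynamically provisioned persistent disk as XFS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: xfs-pd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
  fsType: xfs
---
# PVC for the NFS server pod's data disk
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-server-data
spec:
  storageClassName: xfs-pd
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Ti
```

The user-facing side would then be a second StorageClass pointing at an NFS provisioner, so each user pod's PVC carves a quota-limited directory out of this disk.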
It would be awesome to find someone with time to try it out and report things back.
Thanks so much for the suggestions @yuvipanda ! The plan you laid out looks reasonable.
An alternative idea we have been tossing around is to find a way for users to basically “bring your own home directory” via existing cloud storage services like google drive, dropbox, etc. The problem I see with the NFS server approach is that your files are still tied to one specific hub. But we anticipate users bouncing around between many different hubs, depending on what data they want to work with / who is paying for the compute. This, coupled with the ability to bring your own environment (like on binder), would create a very lightweight, totally generic hub with a minimum maintenance burden for the admins.
@jhamman found some proof of concept for this here:
My general approach to diagnosing things like this in the past was to get root access to the NFS system (usually by piggy-backing on an existing VM in one of our clusters) and then run some variant of df -h, listing directory volumes by user id. I've only done this two or three times, so I don't have a more established pattern than that.
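For what it's worth, the per-directory listing step can be as simple as `du` piped through `sort`. A self-contained demo on a throwaway directory standing in for the real /home mount (directory names `alice`/`bob` are made up):

```shell
# Build a fake home tree so the command below has something to report on
demo=$(mktemp -d)
mkdir -p "$demo/alice" "$demo/bob"
dd if=/dev/urandom of="$demo/alice/big.dat" bs=1024 count=200 2>/dev/null
dd if=/dev/urandom of="$demo/bob/small.dat" bs=1024 count=10 2>/dev/null

# Per-directory usage in KiB, largest first; on the real system this
# would be something like:  du -sk /home/* | sort -rn
usage=$(du -sk "$demo"/* | sort -rn)
echo "$usage"
```

The heaviest home directories come out on top, which is usually enough to find who filled the disk.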