I/O errors: out of disk space?

Hello,

I am receiving some cryptic error messages on ocean.pangeo.io this morning. When I try to open a file, I receive the error “File Load Error: Unhandled Error”. This error appears to be related to a lack of disk space in my directory. Here is the terminal output for a “df -h .” command, reformatted for convenience:

(notebook) jovyan@jupyter-0000-2d0002-2d8701-2d4506:~$ df -h .
Filesystem: 10.171.161.186:/test/home/ocean.pangeo.io/0000-2d0002-2d8701-2d4506
Size: 1007G
Used: 956G
Avail: 0
Use%: 100%
Mounted on: /home/jovyan

I am only using ~60 MB on disk. Is it possible for folks to free up some disk space, please? Thanks very much!

Thanks for your message, Dan! (And welcome to Pangeo!)

It looks like our shared 1TB disk for home directories is full. This is not something we planned for, nor do we have a systematic way to resolve it. (Another reason why ocean.pangeo.io remains more of a demonstration than a stable long-term platform from which to conduct research.)

It would be nice if we could figure out how to enforce quotas.


Thanks, Ryan! Lots of space has been freed up, and the server is working for me now.

@jhamman just pointed me here, and I did a bit of research on how you can enforce quotas on NFS file systems.

Traditionally, you would back your NFS server with a filesystem that supports quotas (often XFS) and use that to enforce them. XFS is particularly popular since it supports quotas per directory (which is what we want) rather than just per user.
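
For context, here is a rough sketch of how per-directory (“project”) quotas work on XFS. The device, mount point, directory, and project name below are placeholders for illustration:

    # XFS enforces per-directory limits via "project" quotas; the filesystem has to be
    # mounted with the prjquota option for them to take effect.
    mount -o prjquota /dev/sdb1 /export

    # Define a project: a numeric ID mapped to a directory, plus a human-readable name.
    echo "42:/export/home/some-user" >> /etc/projects
    echo "some-user:42" >> /etc/projid

    # Initialize the project and put a 10 GiB hard limit on its block usage.
    xfs_quota -x -c 'project -s some-user' /export
    xfs_quota -x -c 'limit -p bhard=10g some-user' /export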

Most managed NFS stores (EFS, Google Filestore, etc.) do not let us set these options, however, so we would have to run our own NFS server. That isn’t hard, but it’s not something you really want to do. If we go that route, I’d prefer we run it in our kubernetes cluster itself.

The NFS Server Provisioner seems to have all the things we need to get this to work.

We can 🙂

  1. Install it (with this helm chart)
  2. Back it with an EBS volume / Google Cloud persistent disk formatted as XFS (specifying fsType: xfs in that PersistentVolumeClaim)
  3. Turn on XFS quotas (this functionality needs to be exposed in the helm chart, even though it exists in the project)
  4. Use dynamic provisioning for the user pods, with their StorageClass set to the NFS provisioner’s storage class

This would provision a PVC for each user and set whatever disk size quota we ask for. I think we can change the quota later, but that would need to be checked. The NFS server would then run on a core node, so if that node goes down you’ll have downtime. This needs to be kept in mind when upgrading nodes / moving nodepools.
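
To make the moving parts concrete, here is a rough sketch of steps 1, 2, and 4. The chart and value names are from the stable/nfs-server-provisioner chart as I remember them, and the release / storage class names are placeholders, so double-check everything against the chart’s values.yaml before relying on it:

    # StorageClass for the XFS-formatted backing disk (Google Cloud persistent disk here;
    # swap in provisioner kubernetes.io/aws-ebs for an EBS volume).
    kubectl apply -f - <<'EOF'
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: xfs-backing
    provisioner: kubernetes.io/gce-pd
    parameters:
      type: pd-standard
      fsType: xfs
    EOF

    # Install the NFS Server Provisioner, backed by a PVC on that XFS StorageClass,
    # and have it expose its own storage class (called "nfs" here) for user volumes.
    helm install nfs-provisioner stable/nfs-server-provisioner \
      --set persistence.enabled=true \
      --set persistence.storageClass=xfs-backing \
      --set persistence.size=1Ti \
      --set storageClass.name=nfs

    # User pods then request dynamically provisioned PVCs with storageClassName: nfs;
    # each one becomes a directory on the XFS volume that a project quota could apply to.

Step 3 (the quotas) is the part the chart doesn’t expose yet, so that piece would still need to be plumbed through upstream.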

It would be awesome to find someone with time to try it out and report things back.


Thanks so much for the suggestions, @yuvipanda! The plan you laid out looks reasonable.

An alternative idea we have been tossing around is to find a way for users to basically “bring your own home directory” via existing cloud storage services like Google Drive, Dropbox, etc. The problem I see with the NFS server approach is that your files are still tied to one specific hub. But we anticipate users bouncing around between many different hubs, depending on what data they want to work with / who is paying for the compute. This, coupled with the ability to bring your own environment (like on Binder), would create a very lightweight, totally generic hub with a minimal maintenance burden for the admins.

@jhamman found a proof of concept for this here:

We (@orianac on ocean.pangeo.io) ran into this issue again today. I’ve increased the size of the NFS server for now, but we really need to come up with a plan of action to address the root cause here.
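
For reference, assuming the server’s data lives on a Google Cloud persistent disk (which may not match our actual setup), the resize looks roughly like this, with placeholder disk name and zone:

    # Grow the disk itself.
    gcloud compute disks resize nfs-home-disk --size=2200GB --zone=us-central1-b

    # Then, on the VM that mounts it, grow the filesystem into the new space:
    sudo xfs_growfs /export      # if the filesystem is XFS
    # sudo resize2fs /dev/sdb    # if it is ext4 instead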

We hit the same issue again.

I’ve now bumped us to 2.2 TB 😱. This is bad.

How do you diagnose usage user-by-user? I can’t figure out how to mount / browse the filesystem at its root directory.

My general approach to diagnosing things like this in the past has been to get root access to the NFS system (usually by piggy-backing on an existing VM in one of our clusters) and then run some variant of du -sh, listing directory usage by user id. I’ve only done this two or three times, so I don’t have a more established pattern than that.
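
For example, something along these lines (the server address and export path are the ones from the df output at the top of this thread, and may well have changed since):

    # Mount the home-directory export read-only on a VM that can reach the NFS server.
    sudo mkdir -p /mnt/pangeo-home
    sudo mount -t nfs -o ro 10.171.161.186:/test/home/ocean.pangeo.io /mnt/pangeo-home

    # Per-user usage, largest first.
    sudo du -sh /mnt/pangeo-home/* | sort -rh | head -n 20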