us-central1 Pangeo hub down?

This does not solve our blocked data access, but being able to see the data paths gives me confidence that the data is still there, which is what I was most worried about.

  • Hopefully there is a way to regain access to folks' data quickly!
  • Getting the same hub back online, even just for a week or two, would save our research project from a setback, and probably other folks' projects too. We just need to run one computation and store its output in order to finish. I am hoping there is some way to re-fund it. Crossing fingers!

Does someone on the Steering Council know whether this is possible?
And if not, we would appreciate ideas on how we can move forward, preferably without having to download and move data.

As of this morning, there is a multi-partner email thread making progress on getting access restored.

The current objective is to restore access to the hub for at least a week, so that those impacted have a chance to migrate their critical data.

Besides @ofk123 and @AndMei, if there are others who have been impacted by the loss of the Pangeo hub, please chime in on this thread.

Hello, and thank you for all your efforts on all of this! I wanted to add that I (and several others in my group) have current work on Pangeo without any backup, and we would be grateful for the chance to recover the data there!

Hi @jmunroe, our project group is discussing how to complete our computations. Would it be possible to restore the hub with the same capacity it had last week, long enough for our computations to complete? If so, we would want to contribute to funding it.

Our parallelization is more or less tailor-made for the resources on Pangeo's US Central hub, and our ~300 GB of data is stored in an adjacent bucket, so a restoration would help us avoid migrating our data and finding a different compute resource.

For the heaviest part of our remaining computation, we scale to ~1500 workers, each with 1 CPU and 7 GB RAM (the original US Central configuration). I have not calculated exactly how long it will take, but likely ~5-10 hours in total, so a week should be sufficient for us to complete our work.
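For context, the scaling step is roughly the sketch below, using dask-gateway as provided on the Pangeo hubs. The cluster options are left at their defaults here; the exact option names for worker CPU and memory depend on the hub's gateway configuration, so treat this as an outline rather than the exact code we run.

    from dask_gateway import Gateway

    # Connect to the hub's Dask Gateway
    gateway = Gateway()

    # Create a cluster with the hub's default worker options
    cluster = gateway.new_cluster()

    # Scale out to ~1500 workers (each ~1 CPU / 7 GB RAM in the original US Central setup)
    cluster.scale(1500)

    # Attach a client so subsequent Dask computations run on the cluster
    client = cluster.get_client()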

Thanks massively for this. I'll be ready to migrate my data when you let us know. Thanks again to everyone for their efforts.

Here's an update: we have identified an interim funding source, and CUIT is in the process of reactivating the accounts.

This is terrific news. Thanks for being so proactive with this, Ryan!

The hub is now back up!

Almost. I can now connect, but upon starting the smallest available server, it eventually times out with this error:

"2024-10-16T09:30:36Z [Warning] MountVolume.SetUp failed for volume "prod-home-nfs" : mount failed: exit status 1 Mounting command: /home/kubernetes/containerized_mounter/mounter Mounting arguments: mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/56cb0143-d1d4-4038-a4ac-57046a02c03d/volumes/kubernetes.io~nfs/prod-home-nfs Output: Mount failed: mount failed: exit status 32 Mounting command: chroot Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/56cb0143-d1d4-4038-a4ac-57046a02c03d/volumes/kubernetes.io~nfs/prod-home-nfs] Output: mount.nfs: Connection timed out"

I’ll keep trying, but I’m not sure it is fully working yet!

I am just deploying a change to the homepage to alert people to the timeline, and then I can look into this. I was going off our monitoring system, which reported the URL resolving again.

Full error log here:

" Event log

Server requested

2024-10-16T09:26:19Z [Warning] 0/3 nodes are available: 1 node(s) had untolerated taint {ToBeDeletedByClusterAutoscaler: 1729070778}, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.

2024-10-16T09:26:27Z [Normal] pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroups/gke-pangeo-hubs-cluster-nb-small-c97e04c1-grp 0->1 (max: 100)}]

2024-10-16T09:27:34Z [Normal] Successfully assigned prod/jupyter-recalculate to gke-pangeo-hubs-cluster-nb-small-c97e04c1-vhww

2024-10-16T09:30:36Z [Warning] MountVolume.SetUp failed for volume "prod-home-nfs" : mount failed: exit status 1 Mounting command: /home/kubernetes/containerized_mounter/mounter Mounting arguments: mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/56cb0143-d1d4-4038-a4ac-57046a02c03d/volumes/kubernetes.io~nfs/prod-home-nfs Output: Mount failed: mount failed: exit status 32 Mounting command: chroot Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/56cb0143-d1d4-4038-a4ac-57046a02c03d/volumes/kubernetes.io~nfs/prod-home-nfs] Output: mount.nfs: Connection timed out

Spawn failed: Timeout

Thanks for being proactive on this!

gulp

Does this mean the server is up but the data doesn’t exist or is not linked?

I think it's a networking issue that I don't know how to debug. I tried the usual browser debugging, and it didn't help, so I suspect something has changed on the network side.

ETA: I’ve emailed Columbia IT again.
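If it helps anyone with access to the cluster narrow this down, a minimal connectivity check against the NFS server from the event log above could look like the sketch below (plain Python, run from somewhere inside the cluster's network, e.g. a debug pod; the IP and port come from the mount error, everything else is illustrative):

    import socket

    # NFS server IP from the MountVolume.SetUp error; 2049 is the standard NFS port
    HOST, PORT = "10.229.44.234", 2049

    try:
        # Attempt a plain TCP connection with a short timeout
        with socket.create_connection((HOST, PORT), timeout=5):
            print(f"TCP connection to {HOST}:{PORT} succeeded")
    except OSError as exc:
        print(f"Could not reach {HOST}:{PORT}: {exc}")

If this fails from inside the cluster while the server itself is up, that would point to a firewall or VPC networking change rather than the NFS export.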

Just adding that the connection timeout also occurs on my end.

Event log
Server requested
2024-10-16T19:37:44Z [Warning] 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
2024-10-16T19:37:51Z [Normal] pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroups/gke-pangeo-hubs-cluster-nb-medium-d51fa3b8-grp 0->1 (max: 100)}]
2024-10-16T19:38:39Z [Normal] Successfully assigned prod/jupyter-ofk123 to gke-pangeo-hubs-cluster-nb-medium-d51fa3b8-vlq6
2024-10-16T19:38:41Z [Normal] Cancelling deletion of Pod prod/jupyter-ofk123
2024-10-16T19:44:45Z [Warning] MountVolume.SetUp failed for volume "prod-home-nfs" : mount failed: exit status 1 Mounting command: /home/kubernetes/containerized_mounter/mounter Mounting arguments: mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/6cf22ac1-1cda-4927-aabf-2350da6fc999/volumes/kubernetes.io~nfs/prod-home-nfs Output: Mount failed: mount failed: exit status 32 Mounting command: chroot Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/6cf22ac1-1cda-4927-aabf-2350da6fc999/volumes/kubernetes.io~nfs/prod-home-nfs] Output: mount.nfs: Connection timed out
Spawn failed: pod prod/jupyter-ofk123 did not start in 600 seconds!

Does the previous method you tried for accessing the buckets work now that the billing account has been reactivated? (I realise this is not a solution for accessing anything that was in your home directory.)
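(If useful, a quick sanity check could look something like the sketch below; the bucket name is a placeholder and the credential choice may differ on your setup.)

    import gcsfs

    # Authenticate with whatever default credentials are available;
    # "google_default" is one of gcsfs's standard token options
    fs = gcsfs.GCSFileSystem(token="google_default")

    # "my-project-bucket" is a placeholder; listing it should confirm bucket access works again
    print(fs.ls("my-project-bucket"))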

Great, thanks for doing so. Fingers crossed for a quick response!

Thanks, yes, now I am able to access data from EOSC. There is no longer an OSError.

But the same timeout occurs on the US Central hub today.

Event log
Server requested
2024-10-17T14:08:19Z [Warning] 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
2024-10-17T14:08:21Z [Normal] pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroups/gke-pangeo-hubs-cluster-nb-medium-d51fa3b8-grp 0->1 (max: 100)}]
2024-10-17T14:09:09Z [Normal] Successfully assigned prod/jupyter-ofk123 to gke-pangeo-hubs-cluster-nb-medium-d51fa3b8-87gb
2024-10-17T14:09:12Z [Normal] Cancelling deletion of Pod prod/jupyter-ofk123
2024-10-17T14:12:14Z [Warning] MountVolume.SetUp failed for volume "prod-home-nfs" : mount failed: exit status 1 Mounting command: /home/kubernetes/containerized_mounter/mounter Mounting arguments: mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/bdf5296f-2a93-410e-98bc-9d06b2865333/volumes/kubernetes.io~nfs/prod-home-nfs Output: Mount failed: mount failed: exit status 32 Mounting command: chroot Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/bdf5296f-2a93-410e-98bc-9d06b2865333/volumes/kubernetes.io~nfs/prod-home-nfs] Output: mount.nfs: Connection timed out
Spawn failed: Timeout

Hopefully you get some answers from the IT department soon. Thanks again.

Yes, I'm still waiting on information regarding the NFS server for home directories.