us-central1 Pangeo hub down?

This does not solve our blocked data access, but being able to see the data paths gives me confidence that the data is still there, which is what I was most worried about.

  • Hopefully there is a way to regain access to folks' data quickly!
  • Getting the same hub back online, even just for a week or two, would save our research project from a setback, and probably other folks' projects too. We just need to run one computation and store its output in order to finish. I am hoping there is some way to re-fund it. Crossing fingers!

Does someone on the Steering Council know whether this is possible?
And if not, we would appreciate ideas on how we can move forward, preferably without having to download and move data.

As of this morning, there is a multi-partner email thread making progress on getting access restored.

The current objective is to restore access to the hub for at least a week, so that those impacted have a chance to migrate their critical data.

Besides @ofk123 and @AndMei, if there are others who have been impacted by the loss of the Pangeo hub, please chime in on this thread.

Hello, and thank you for all your efforts on all of this! I wanted to add that I (and several others in my group) have current work on Pangeo without any backup, and we would be grateful for the chance to recover the data there!

Hi @jmunroe, our project group is discussing how to complete our computations. Would it be possible to restore the hub with the same capacity it had last week, long enough for our computations to complete? If so, we would want to contribute to funding it.

Our parallelization is more or less tailor-made for the resources on Pangeo's US Central hub, and our ~300 GB of data is stored in an adjacent bucket, so a restoration would help us avoid migrating our data and finding a different compute resource.

For the heaviest part of our remaining computation, we scale to ~1500 workers, each with 1 CPU and 7 GB RAM (the original US Central configuration). I have not calculated exactly how long it will take, but likely ~5-10 hours in total, so a week should be sufficient for us to complete our work.
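For context, the scaling step is roughly the sketch below, using dask-gateway as provided on the Pangeo hubs. The cluster options are left at their defaults here; the exact option names for worker CPU and memory depend on the hub's gateway configuration, so treat this as an outline rather than the exact code we run.

    from dask_gateway import Gateway

    # Connect to the hub's Dask Gateway
    gateway = Gateway()

    # Create a cluster with the hub's default worker options
    cluster = gateway.new_cluster()

    # Scale out to ~1500 workers (each ~1 CPU / 7 GB RAM in the original US Central setup)
    cluster.scale(1500)

    # Attach a client so subsequent Dask computations run on the cluster
    client = cluster.get_client()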

Thanks massively for this. I'll be ready to migrate my data when you let us know. Thanks again to everyone for their efforts.

Here's an update: we have identified an interim funding source, and CUIT is in the process of reactivating the accounts.

This is terrific news. Thanks for being so proactive with this, Ryan!

The hub is now back up!

Almost. I can now connect, but upon starting the smallest available server, it eventually times out with this error:

"2024-10-16T09:30:36Z [Warning] MountVolume.SetUp failed for volume "prod-home-nfs" : mount failed: exit status 1 Mounting command: /home/kubernetes/containerized_mounter/mounter Mounting arguments: mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/56cb0143-d1d4-4038-a4ac-57046a02c03d/volumes/kubernetes.io~nfs/prod-home-nfs Output: Mount failed: mount failed: exit status 32 Mounting command: chroot Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/56cb0143-d1d4-4038-a4ac-57046a02c03d/volumes/kubernetes.io~nfs/prod-home-nfs] Output: mount.nfs: Connection timed out"

I’ll keep trying, but I’m not sure it is fully working yet!

I am just deploying a change to the homepage to alert people to the timeline, and then I can look into this. I was going off our monitoring system, which reported the URL resolving again.

Full error log here:

" Event log

Server requested

2024-10-16T09:26:19Z [Warning] 0/3 nodes are available: 1 node(s) had untolerated taint {ToBeDeletedByClusterAutoscaler: 1729070778}, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.

2024-10-16T09:26:27Z [Normal] pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroups/gke-pangeo-hubs-cluster-nb-small-c97e04c1-grp 0->1 (max: 100)}]

2024-10-16T09:27:34Z [Normal] Successfully assigned prod/jupyter-recalculate to gke-pangeo-hubs-cluster-nb-small-c97e04c1-vhww

2024-10-16T09:30:36Z [Warning] MountVolume.SetUp failed for volume "prod-home-nfs" : mount failed: exit status 1 Mounting command: /home/kubernetes/containerized_mounter/mounter Mounting arguments: mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/56cb0143-d1d4-4038-a4ac-57046a02c03d/volumes/kubernetes.io~nfs/prod-home-nfs Output: Mount failed: mount failed: exit status 32 Mounting command: chroot Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/56cb0143-d1d4-4038-a4ac-57046a02c03d/volumes/kubernetes.io~nfs/prod-home-nfs] Output: mount.nfs: Connection timed out

Spawn failed: Timeout

Thanks for being proactive on this!

gulp

Does this mean the server is up but the data doesn’t exist or is not linked?

I think it's a networking issue that I don't know how to debug. I tried the usual browser debugging, and it didn't help, so I suspect something has changed on the network side.

ETA: I’ve emailed Columbia IT again.
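If it helps anyone with access to the cluster narrow this down, a minimal connectivity check against the NFS server from the event log above could look like the sketch below (plain Python, run from somewhere inside the cluster's network, e.g. a debug pod; the IP and port come from the mount error, everything else is illustrative):

    import socket

    # NFS server IP from the MountVolume.SetUp error; 2049 is the standard NFS port
    HOST, PORT = "10.229.44.234", 2049

    try:
        # Attempt a plain TCP connection with a short timeout
        with socket.create_connection((HOST, PORT), timeout=5):
            print(f"TCP connection to {HOST}:{PORT} succeeded")
    except OSError as exc:
        print(f"Could not reach {HOST}:{PORT}: {exc}")

If this fails from inside the cluster while the server itself is up, that would point to a firewall or VPC networking change rather than the NFS export.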

Just adding that the connection timeout also occurs on my end.

Event log
Server requested
2024-10-16T19:37:44Z [Warning] 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
2024-10-16T19:37:51Z [Normal] pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroups/gke-pangeo-hubs-cluster-nb-medium-d51fa3b8-grp 0->1 (max: 100)}]
2024-10-16T19:38:39Z [Normal] Successfully assigned prod/jupyter-ofk123 to gke-pangeo-hubs-cluster-nb-medium-d51fa3b8-vlq6
2024-10-16T19:38:41Z [Normal] Cancelling deletion of Pod prod/jupyter-ofk123
2024-10-16T19:44:45Z [Warning] MountVolume.SetUp failed for volume "prod-home-nfs" : mount failed: exit status 1 Mounting command: /home/kubernetes/containerized_mounter/mounter Mounting arguments: mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/6cf22ac1-1cda-4927-aabf-2350da6fc999/volumes/kubernetes.io~nfs/prod-home-nfs Output: Mount failed: mount failed: exit status 32 Mounting command: chroot Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/6cf22ac1-1cda-4927-aabf-2350da6fc999/volumes/kubernetes.io~nfs/prod-home-nfs] Output: mount.nfs: Connection timed out
Spawn failed: pod prod/jupyter-ofk123 did not start in 600 seconds!

Does the previous method you tried for accessing the buckets work now that the billing account has been reactivated? (I realise this is not a solution for accessing anything that was in your home directory.)
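(If useful, a quick sanity check could look something like the sketch below; the bucket name is a placeholder and the credential choice may differ on your setup.)

    import gcsfs

    # Authenticate with whatever default credentials are available;
    # "google_default" is one of gcsfs's standard token options
    fs = gcsfs.GCSFileSystem(token="google_default")

    # "my-project-bucket" is a placeholder; listing it should confirm bucket access works again
    print(fs.ls("my-project-bucket"))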

Great, thanks for doing so. Fingers crossed for a quick response!

Thanks, yes, now I am able to access data from EOSC. There is no longer an OSError.

But the same timeout occurs on the US Central hub today.

Event log
Server requested
2024-10-17T14:08:19Z [Warning] 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
2024-10-17T14:08:21Z [Normal] pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroups/gke-pangeo-hubs-cluster-nb-medium-d51fa3b8-grp 0->1 (max: 100)}]
2024-10-17T14:09:09Z [Normal] Successfully assigned prod/jupyter-ofk123 to gke-pangeo-hubs-cluster-nb-medium-d51fa3b8-87gb
2024-10-17T14:09:12Z [Normal] Cancelling deletion of Pod prod/jupyter-ofk123
2024-10-17T14:12:14Z [Warning] MountVolume.SetUp failed for volume "prod-home-nfs" : mount failed: exit status 1 Mounting command: /home/kubernetes/containerized_mounter/mounter Mounting arguments: mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/bdf5296f-2a93-410e-98bc-9d06b2865333/volumes/kubernetes.io~nfs/prod-home-nfs Output: Mount failed: mount failed: exit status 32 Mounting command: chroot Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/bdf5296f-2a93-410e-98bc-9d06b2865333/volumes/kubernetes.io~nfs/prod-home-nfs] Output: mount.nfs: Connection timed out
Spawn failed: Timeout

Hopefully you get some answers from the IT department soon. Thanks again.

Yes, I'm still waiting on information regarding the NFS server for home directories.