It does not solve our blocked data access, but being able to see the data paths gives me confidence that the data is still there, which is what I was most worried about.
Hopefully there is a way to regain access to folks' data quickly!
Getting the same hub back online, even just for a week or two, would save our research project from a setback. Probably other folks' projects too? We just need to run one computation and store its output in order to finish. I am hoping there is some way to re-fund it. Crossing fingers!
Does someone in the Steering Council know if this is possible or not?
And if not, we would appreciate ideas on how to move forward, preferably without having to download and move the data.
Hello, and thank you for all your efforts on this! I wanted to add that I (and several others in my group) have current work on Pangeo without any backup, and we would be grateful for the chance to recover the data there!
Hi @jmunroe, our project group is discussing how to complete our computations. Would it be possible to restore the hub with the same capacity it had last week, long enough for our computations to finish? If so, we would want to contribute to funding it.
Our parallelization is more or less tailor-made for the resources on Pangeo's US Central, and our ~300 GB of data is stored in an adjacent bucket, so a restoration would save us from migrating our data and finding a different compute resource.
For the heaviest part of our remaining computation, we scale to ~1500 workers, each with 1 CPU and 7 GB RAM (the initial US Central configuration). I have not calculated exactly how long it will take, but likely ~5-10 hours in total, so a week should be sufficient for us to complete our work.
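For concreteness, this is roughly how we drive that scaling via Dask Gateway. Treat it as a sketch: the cluster-option names below follow the usual Pangeo Cloud setup and may not match the hub's exact configuration.

```python
# Rough sketch only: option names (worker_cores, worker_memory) follow the
# typical Pangeo Cloud Dask Gateway setup, not necessarily this hub's exact one.
from dask_gateway import Gateway

gateway = Gateway()                  # uses the hub's default gateway address
options = gateway.cluster_options()
options.worker_cores = 1             # 1 CPU per worker
options.worker_memory = 7            # ~7 GB RAM per worker

cluster = gateway.new_cluster(options)
cluster.scale(1500)                  # heaviest step: ~1500 workers
client = cluster.get_client()

# ... run the remaining computation and write its output to our bucket ...

client.close()
cluster.shutdown()
```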
I am just deploying a change to the homepage to alert people to the timeline, and then I can look into this. I was going off our monitoring system, which reported the URL resolving again.
2024-10-16T09:26:19Z [Warning] 0/3 nodes are available: 1 node(s) had untolerated taint {ToBeDeletedByClusterAutoscaler: 1729070778}, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
I think it’s a networking issue that I don’t know how to debug. I tried the browser debugging stuff, and it didn’t help. So I suspect something has changed to affect the network.
Just adding that the connection timeout also occurs on my end.
Event log
Server requested
2024-10-16T19:37:44Z [Warning] 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
2024-10-16T19:37:51Z [Normal] pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/pangeo-integration-te-3eea/zones/us-central1-b/instanceGroups/gke-pangeo-hubs-cluster-nb-medium-d51fa3b8-grp 0->1 (max: 100)}]
2024-10-16T19:38:39Z [Normal] Successfully assigned prod/jupyter-ofk123 to gke-pangeo-hubs-cluster-nb-medium-d51fa3b8-vlq6
2024-10-16T19:38:41Z [Normal] Cancelling deletion of Pod prod/jupyter-ofk123
2024-10-16T19:44:45Z [Warning] MountVolume.SetUp failed for volume "prod-home-nfs" : mount failed: exit status 1 Mounting command: /home/kubernetes/containerized_mounter/mounter Mounting arguments: mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/6cf22ac1-1cda-4927-aabf-2350da6fc999/volumes/kubernetes.io~nfs/prod-home-nfs Output: Mount failed: mount failed: exit status 32 Mounting command: chroot Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs -o noatime,soft 10.229.44.234:/homes/prod /var/lib/kubelet/pods/6cf22ac1-1cda-4927-aabf-2350da6fc999/volumes/kubernetes.io~nfs/prod-home-nfs] Output: mount.nfs: Connection timed out
Spawn failed: pod prod/jupyter-ofk123 did not start in 600 seconds!
Does the previous method you tried for accessing the buckets work now that the billing account has been reactivated? (I realise this is not a solution for accessing anything that was in your home directory.)
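For example, a quick check with gcsfs might look like the following; the bucket name is a placeholder and the auth method is just a guess at what you were using before.

```python
# Placeholder bucket name and auth method: substitute whatever you used previously.
import gcsfs

fs = gcsfs.GCSFileSystem(token="google_default")   # or however you authenticated before
print(fs.ls("gs://your-project-bucket")[:10])       # should list objects again if billing is active
```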