Trying to open Gateway Cluster

Hi! I’m trying to open a dask GatewayCluster in a Jupyter notebook that I opened through the Pangeo JupyterHub but I’ve been encountering an error:

from dask_gateway import GatewayCluster
cluster = GatewayCluster()
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py:21: FutureWarning: format_bytes is deprecated and will be removed in a future release. Please use dask.utils.format_bytes instead.
  from distributed.utils import LoopRunner, format_bytes
...
...
ClientConnectorError: Cannot connect to host proxy-public:80 ssl:default [Connect call failed ('10.12.0.239', 80)]

I realize that the JupyterHub has been transferred to 2i2c, so I wonder if there are any new settings I should be including in my code when using it? Thanks!

Hi @jdldeauna, thanks for opening this up. You shouldn’t have to provide any extra settings - the 2i2c engineering team are investigating the cause of this.

We have identified this as an issue with our network policies incorrectly refusing access to dask pods when a cluster is requested. We have temporarily suspended our policies while we find a long-term fix for this issue. You should find that you are now able to create a cluster.
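
For anyone who wants to check from a notebook whether a given internal service is reachable at all, here is a minimal sketch using only the Python standard library. The helper name `can_connect` is hypothetical, not part of dask-gateway. A successful TCP connection only shows that the network path is open; it does not prove the gateway itself is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical diagnostic helper: DNS failures, refused connections and
    timeouts are all subclasses of OSError and are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("proxy-public", 80)` from the notebook pod, using the host and port from the error message above, would be one way to see whether the connection is still being refused.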

Sarah, thanks for the quick response! We really appreciate that this was resolved quickly!

It’s working now, thank you so much!

Hi all, I’m also having issues opening a gateway cluster as of this morning. My error message is the following:

OSError: Timed out trying to connect to gateway://traefik-prod-dask-gateway.prod:80/prod.66465ac7268248a49b4808a93f0d55b0 after 120 s

I’m not sure if I’m having the same issue as Sarah or something different is happening for me.

Kind regards
Ulla

Hi @ubbu36, I just logged into us-central1-b.gcp.pangeo.io and got a cluster up fine.

I can see other dask schedulers and workers in the Kubernetes cluster too. Can you provide more detail about the error you’re seeing please?

Ah, now I see something when trying to run cluster.get_client that looks similar to your issue.

This is the code I’m executing:

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
import zarr
import gcsfs

from dask_gateway import Gateway
from dask.distributed import Client

gateway = Gateway()
cluster = gateway.new_cluster()

cluster.scale(10)
client = Client(cluster,timeout="120s")
cluster

My whole error looks like this:

TimeoutError Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
282 try:
→ 283 comm = await asyncio.wait_for(
284 connector.connect(loc, deserialize=deserialize, **connection_args),

/srv/conda/envs/notebook/lib/python3.8/asyncio/tasks.py in wait_for(fut, timeout, loop)
500 await _cancel_and_wait(fut, loop=loop)
→ 501 raise exceptions.TimeoutError()
502 finally:

TimeoutError:

The above exception was the direct cause of the following exception:

OSError Traceback (most recent call last)
/tmp/ipykernel_576/2661577020.py in <module>
13
14 cluster.scale(10)
→ 15 client = Client(cluster,timeout="120s")
16 cluster

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in __init__(self, address, loop, timeout, set_as_default, scheduler_file, security, asynchronous, name, heartbeat_interval, serializers, deserializers, extensions, direct_to_workers, connection_limit, **kwargs)
756 ext(self)
757
→ 758 self.start(timeout=timeout)
759 Client._instances.add(self)
760

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in start(self, **kwargs)
938 self._started = asyncio.ensure_future(self._start(**kwargs))
939 else:
→ 940 sync(self.loop, self._start, **kwargs)
941
942 def __await__(self):

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
324 if error[0]:
325 typ, exc, tb = error[0]
→ 326 raise exc.with_traceback(tb)
327 else:
328 return result[0]

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py in f()
307 if callback_timeout is not None:
308 future = asyncio.wait_for(future, callback_timeout)
→ 309 result[0] = yield future
310 except Exception:
311 error[0] = sys.exc_info()

/srv/conda/envs/notebook/lib/python3.8/site-packages/tornado/gen.py in run(self)
760
761 try:
→ 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in _start(self, timeout, **kwargs)
1028
1029 try:
→ 1030 await self._ensure_connected(timeout=timeout)
1031 except (OSError, ImportError):
1032 await self._close()

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in _ensure_connected(self, timeout)
1088
1089 try:
→ 1090 comm = await connect(
1091 self.scheduler.address, timeout=timeout, **self.connection_args
1092 )

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, handshake_overrides, **connection_args)
305 await asyncio.sleep(backoff)
306 else:
→ 307 raise OSError(
308 f"Timed out trying to connect to {addr} after {timeout} s"
309 ) from active_exception

OSError: Timed out trying to connect to gateway://traefik-prod-dask-gateway.prod:80/prod.66465ac7268248a49b4808a93f0d55b0 after 120 s

I have once again temporarily suspended our network policy until we can add another rule for this case. So your cluster should work now.

Apologies for this. Each new error message, while seemingly the same, tells us which part of our internal network the user pods need to reach in order to work with dask, and we have to add a new rule for each component.
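
To illustrate how the error text points at a specific component, here is a rough sketch that pulls the host and port out of the two messages posted in this thread. The function name `unreachable_endpoint` and the regular expressions are assumptions written for these two message shapes only; they are not part of dask, distributed or dask-gateway, and other error formats would need their own patterns.

```python
import re
from typing import Optional, Tuple

# Patterns written against the two error messages quoted in this thread.
# They are illustrative assumptions, not an official message format.
_PATTERNS = [
    # aiohttp-style: "Cannot connect to host proxy-public:80 ssl:default [...]"
    re.compile(r"Cannot connect to host (?P<host>[^\s:]+):(?P<port>\d+)"),
    # distributed-style: "Timed out trying to connect to gateway://host:80/... after 120 s"
    re.compile(r"Timed out trying to connect to \w+://(?P<host>[^\s:/]+):(?P<port>\d+)"),
]


def unreachable_endpoint(message: str) -> Optional[Tuple[str, int]]:
    """Return (host, port) named in a connection error message, or None if no pattern matches."""
    for pattern in _PATTERNS:
        match = pattern.search(message)
        if match:
            return match.group("host"), int(match.group("port"))
    return None
```

Applied to the messages above, the first error names `proxy-public` and the second names `traefik-prod-dask-gateway.prod`, which is why the two reports needed different rules even though both look like a failed connection.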

Hi sgibson91

Yup, it is working now.

Thank you so much for your quick and efficient resolution of the issue!

Ulla
