How to run code using GPU on Pangeo? Error: "libdevice not found at ./libdevice.10.bc"

I am learning to run a CNN model on a GPU on Pangeo. The code runs fine on CPU, but when I switch to GPU, I get an error:

2023-09-10 19:25:40.201197: I tensorflow/compiler/xla/service/service.cc:169] XLA service 0x55d38463fc80 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-09-10 19:25:40.201221: I tensorflow/compiler/xla/service/service.cc:177] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2023-09-10 19:25:40.207278: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
2023-09-10 19:25:40.234569: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:530] Can’t find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
./cuda_sdk_lib
/usr/local/cuda-11.2
/usr/local/cuda
.
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-09-10 19:25:40.235392: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:274] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-09-10 19:25:40.235697: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:362 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-09-10 19:25:40.235726: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INTERNAL: libdevice not found at ./libdevice.10.bc
[[{{node StatefulPartitionedCall_17}}]]
2023-09-10 19:25:40.260111: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:274] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-09-10 19:25:40.260398: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:362 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-09-10 19:25:40.283884: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:274] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-09-10 19:25:40.284216: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:362 : INTERNAL: libdevice not found at ./libdevice.10.bc

Based on the error message, TensorFlow cannot find the libdevice directory ${CUDA_DIR}/nvvm/libdevice, and in particular it reports “libdevice not found at ./libdevice.10.bc”.

So, is there any suggestion about how to solve this problem, or is there any documentation on how to run GPU on Pangeo? Thank you!

I found that a possible reason for this problem is that tensorflow/keras > 2.10 requires the cuda-compiler package. The Docker image we are using has tensorflow 2.12.1 and CUDA 11.6.

A solution I found online recommends installing nvcc (the CUDA compiler):

Install NVCC

conda install -c nvidia cuda-nvcc=11.3.58

Configure the XLA cuda directory

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
printf 'export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib/\n' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

Copy libdevice file to the required path

mkdir -p $CONDA_PREFIX/lib/nvvm/libdevice
cp $CONDA_PREFIX/lib/libdevice.10.bc $CONDA_PREFIX/lib/nvvm/libdevice/
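
For notebook users who cannot edit the conda activation scripts, the same flag can in principle be set from Python before TensorFlow is imported. The sketch below is only an illustration of the steps above, not a documented method of the image: the helper name `add_xla_cuda_data_dir` is hypothetical, and it assumes the directory passed in contains `nvvm/libdevice/libdevice.10.bc`.

```python
import os


def add_xla_cuda_data_dir(cuda_dir, env=None):
    """Append --xla_gpu_cuda_data_dir=<cuda_dir> to XLA_FLAGS in `env`.

    Hypothetical helper: `env` defaults to os.environ, and any other
    flags that are already set are kept.
    """
    env = os.environ if env is None else env
    # Drop any earlier value of this flag so the new one is unambiguous.
    kept = [
        flag
        for flag in env.get("XLA_FLAGS", "").split()
        if not flag.startswith("--xla_gpu_cuda_data_dir=")
    ]
    kept.append(f"--xla_gpu_cuda_data_dir={cuda_dir}")
    env["XLA_FLAGS"] = " ".join(kept)
    return env["XLA_FLAGS"]


# Demonstration on a scratch dict; omit `env` to change the real
# environment, and do so before `import tensorflow`.
conda_lib = os.path.join(os.environ.get("CONDA_PREFIX", "."), "lib")
print(add_xla_cuda_data_dir(conda_lib, env={}))
```

Setting the variable inside the notebook only affects that kernel process, which can make it easier to test whether the flag is the missing piece before changing the environment permanently.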

But in our case, I failed to install nvcc because it requires CUDA>=12.0, our version is 11.6, and I don’t have permission to upgrade CUDA. Could an administrator add nvcc to our environment? Thank you!

Hi @leiyan90, you’re right that cuda-nvcc needs to be manually installed, see the last bullet point under ‘Other Notes’ at GitHub - pangeo-data/pangeo-docker-images at 2023.08.29. Solution is as you said, to do something like conda install -c nvidia cuda-nvcc==11.6.*.

We decided in Oct 2022 not to add cuda-nvcc because of potential CUDA driver and CUDA library (e.g. cuda-nvcc) version incompatibilities, especially with older Kepler generation GPUs like K80s; see Document ML-image tag/GPU type/CUDA compatibility table · Issue #390 · pangeo-data/pangeo-docker-images · GitHub and remove cuda-nvcc and document ptxas by ngam · Pull Request #398 · pangeo-data/pangeo-docker-images · GitHub (and the links within that thread) for more information.

Since it’s been almost a year already, if there aren’t people using K80 GPUs anymore, we could probably drop support for K80s that have a maximum supported CUDA driver version of 470.57, and move to installing cuda-nvcc by default to prevent surprising errors like what you detailed above (those on K80s can still pin their docker images to old versions). But we’ll need to check with @yuvipanda or someone at 2i2c who is managing these JupyterHubs to see what GPUs are being used across all nodes right now.

Hi @weiji14, thanks a lot for your reply. I tried your suggestion and successfully installed cuda-nvcc, but the problem is not solved: TensorFlow still reports “libdevice not found at ./libdevice.10.bc”.

Actually, I can find libdevice.10.bc in several places:
(notebook) jovyan@jupyter-leiyan90:~$ find / -name "libdevice.10.bc" 2>/dev/null
/srv/conda/envs/notebook/lib/python3.10/site-packages/jaxlib/cuda/nvvm/libdevice/libdevice.10.bc
/srv/conda/envs/notebook/lib/libdevice.10.bc
/srv/conda/envs/notebook/lib/nvvm/libdevice/libdevice.10.bc
/srv/conda/envs/notebook/nvvm/libdevice/libdevice.10.bc
/srv/conda/pkgs/cuda-nvcc-11.6.124-hbba6d2d_0/nvvm/libdevice/libdevice.10.bc

After I installed nvcc, it seems that a libdevice folder was created in the CUDA directory. Do you have any suggestions? Thank you!
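
The warning in the original log says XLA looks for `${CUDA_DIR}/nvvm/libdevice`, so only some of the paths that `find` returned can serve as a CUDA data directory. The following sketch (the function name `cuda_data_dir_for` is made up for illustration) maps each path to the directory that would have to be passed as `--xla_gpu_cuda_data_dir`, or `None` when the file is not in that layout.

```python
from pathlib import PurePosixPath


def cuda_data_dir_for(libdevice_path):
    """Return the directory D such that the file is D/nvvm/libdevice/<name>.

    Returns None if the file does not sit in an nvvm/libdevice directory.
    """
    path = PurePosixPath(libdevice_path)
    if path.parent.name == "libdevice" and path.parent.parent.name == "nvvm":
        return str(path.parent.parent.parent)
    return None


# Paths reported by the `find` command above.
found = [
    "/srv/conda/envs/notebook/lib/python3.10/site-packages/jaxlib/cuda/nvvm/libdevice/libdevice.10.bc",
    "/srv/conda/envs/notebook/lib/libdevice.10.bc",
    "/srv/conda/envs/notebook/lib/nvvm/libdevice/libdevice.10.bc",
    "/srv/conda/envs/notebook/nvvm/libdevice/libdevice.10.bc",
    "/srv/conda/pkgs/cuda-nvcc-11.6.124-hbba6d2d_0/nvvm/libdevice/libdevice.10.bc",
]

for p in found:
    print(cuda_data_dir_for(p), "<-", p)
```

Under this reading, `/srv/conda/envs/notebook/lib` and `/srv/conda/envs/notebook` both have the expected layout, so the open question is whether `XLA_FLAGS` actually points at one of them in the environment of the running kernel.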

Could you post the output of nvidia-smi? It might be that you need to match the version of cuda-nvcc with the cudatoolkit version. Have you tried conda install -c nvidia cuda-nvcc==12.* also?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This is the output of nvidia-smi. I have also tried “conda install -c nvidia cuda-nvcc==12.*”, but it does not work.

Hi @weiji14, following your last suggestion, I checked the compatibility of TensorFlow and CUDA, and found that according to the official website, TensorFlow 2.12.1 is compatible with CUDA 11.8 and cuDNN 8.6.

But on our platform, running

import tensorflow as tf

print("Is GPU available:", tf.config.list_physical_devices('GPU'))
print("TensorFlow version:", tf.__version__)
print("CUDA version:", tf.sysconfig.get_build_info()["cuda_version"])
print("cuDNN version:", tf.sysconfig.get_build_info()["cudnn_version"])

we got:

Is GPU available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
TensorFlow version: 2.12.1
CUDA version: 11.2
cuDNN version: 8

So I tried both upgrading and downgrading CUDA and cuDNN, but neither worked.
Do you have any suggestions? thank you!

@weiji14 Everyone is on T4s only now, no more K80s. I think it’s ok to drop K80 support.

@leiyan90, forgot to ask, are you running the ml-notebook docker image locally, on a university HPC, or a 2i2c managed JupyterHub (and if so, which one)? I’ve tried running the following locally on my laptop which has a GPU, following instructions at How to launch a notebook using these images — Pangeo Docker Stacks documentation (note the --gpus flag):

docker pull quay.io/pangeo/ml-notebook:2023.09.11
docker run -it --rm --gpus all -p 8888:8888 pangeo/ml-notebook:2023.09.11 jupyter lab --ip 0.0.0.0

Then in JupyterLab, I opened a new notebook and ran:

import tensorflow as tf

tf.ones([10, 5]) * 2

which gave me some info level debug messages, but otherwise ran fine:

2023-09-13 05:27:36.285612: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-13 05:27:36.285944: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-13 05:27:36.286082: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-13 05:27:36.325505: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-13 05:27:36.325686: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-13 05:27:36.325816: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-13 05:27:36.325931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6351 MB memory:  -> device: 0, name: NVIDIA RTX A2000 8GB Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6

<tf.Tensor: shape=(10, 5), dtype=float32, numpy=
array([[2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.]], dtype=float32)>

Notice that the ‘device’ was created properly, as far as I can tell. I haven’t installed cuda-nvcc manually in this case.

Could you post a short TensorFlow/JAX code sample that reproduces the libdevice not found at ./libdevice.10.bc error? The CUDA versions shouldn’t matter as long as you’re using CUDA 11.x (which has forward and backward compatibility) and a newer GPU like the Tesla T4. I’m also wondering if there’s some misconfiguration that needs to be handled by your sysadmin (if running on an HPC or JupyterHub).
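
As a rough illustration of the point about CUDA 11.x, the check below compares only the major version of the CUDA release TensorFlow was built against (11.2, from `tf.sysconfig.get_build_info()`) with the version reported by `nvidia-smi` (11.6). It is a simplification that assumes a matching major version is the relevant condition; the function name is hypothetical, and the check does not replace NVIDIA's compatibility documentation.

```python
def same_cuda_major(build_version, driver_version):
    """Return True when both version strings share the same major number."""
    build_major = int(build_version.split(".")[0])
    driver_major = int(driver_version.split(".")[0])
    return build_major == driver_major


# Values reported earlier in this thread: build info vs. nvidia-smi.
print(same_cuda_major("11.2", "11.6"))  # True
# A CUDA 12 package against the same CUDA 11.2 build.
print(same_cuda_major("11.2", "12.0"))  # False
```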

Hi @weiji14, I am running the Docker image on a 2i2c managed JupyterHub at Columbia University.

First I simply ran this code:

import tensorflow as tf
tf.ones([10, 5]) * 2

and I got

2023-09-13 15:31:31.167786: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-13 15:31:33.875247: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-13 15:31:33.917185: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-13 15:31:33.917541: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-13 15:31:33.918801: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-13 15:31:33.919063: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-13 15:31:33.919305: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-13 15:31:34.740501: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-13 15:31:34.740964: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-13 15:31:34.741357: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-13 15:31:34.741621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13793 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
<tf.Tensor: shape=(10, 5), dtype=float32, numpy=
array([[2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.]], dtype=float32)>

It seems the device was created and the simple code ran. Next, here is a sample that builds a simple CNN model (a much simplified version of my original code):

import numpy as np
import tensorflow as tf
from tensorflow.keras import models, layers, optimizers

def create_model():
    model = models.Sequential(name="ENSO_pretrain")

    model.add(layers.Conv2D(5, (4, 8), activation="tanh", padding="same", input_shape=(32, 32, 3))) 
    model.add(layers.MaxPool2D((2, 2)))

    model.add(layers.Flatten())
    model.add(layers.Dense(5, activation="tanh"))
    model.add(layers.Dense(5))

    return model

model = create_model()
model.summary()

# create random dataset
num_samples = 100
cmip_all_train = np.random.random((num_samples, 32, 32, 3))
nino34_train = np.random.random((num_samples, 5))
cmip_all_test = np.random.random((num_samples, 32, 32, 3))
nino34_test = np.random.random((num_samples, 5))

# train model
adam_optimizer = optimizers.Adam(learning_rate=1e-3)
model.compile(optimizer=adam_optimizer, loss='mse')
model.fit(cmip_all_train, nino34_train, validation_data=(cmip_all_test, nino34_test), epochs=1, batch_size=400)

This does not give the libdevice not found at ./libdevice.10.bc error, but gives the errors below:

2023-09-13 16:01:28.123639: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:429] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2023-09-13 16:01:28.123688: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at conv_ops.cc:1068 : UNIMPLEMENTED: DNN library is not found.
2023-09-13 16:01:28.123712: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): UNIMPLEMENTED: DNN library is not found.
	 [[{{node ENSO_pretrain/conv2d_4/Conv2D}}]]
---------------------------------------------------------------------------
UnimplementedError                        Traceback (most recent call last)
Cell In[5], line 30
     28 adam_optimizer = optimizers.Adam(learning_rate=1e-3)
     29 model.compile(optimizer=adam_optimizer, loss='mse')
---> 30 model.fit(cmip_all_train, nino34_train, validation_data=(cmip_all_test, nino34_test), epochs=1, batch_size=400)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/keras/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
     67     filtered_tb = _process_traceback_frames(e.__traceback__)
     68     # To get the full stack trace, call:
     69     # `tf.debugging.disable_traceback_filtering()`
---> 70     raise e.with_traceback(filtered_tb) from None
     71 finally:
     72     del filtered_tb

File /srv/conda/envs/notebook/lib/python3.10/site-packages/tensorflow/python/eager/execute.py:52, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     50 try:
     51   ctx.ensure_initialized()
---> 52   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     53                                       inputs, attrs, num_outputs)
     54 except core._NotOkStatusException as e:
     55   if name is not None:

UnimplementedError: Graph execution error:

Detected at node 'ENSO_pretrain/conv2d_4/Conv2D' defined at (most recent call last):
    File "/srv/conda/envs/notebook/lib/python3.10/runpy.py", line 196, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/srv/conda/envs/notebook/lib/python3.10/runpy.py", line 86, in _run_code
      exec(code, run_globals)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel_launcher.py", line 17, in <module>
      app.launch_new_instance()
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/traitlets/config/application.py", line 1043, in launch_instance
      app.start()
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 736, in start
      self.io_loop.start()
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 195, in start
      self.asyncio_loop.run_forever()
    File "/srv/conda/envs/notebook/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
      self._run_once()
    File "/srv/conda/envs/notebook/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
      handle._run()
    File "/srv/conda/envs/notebook/lib/python3.10/asyncio/events.py", line 80, in _run
      self._context.run(self._callback, *self._args)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 516, in dispatch_queue
      await self.process_one()
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 505, in process_one
      await dispatch(*args)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 412, in dispatch_shell
      await result
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 740, in execute_request
      reply_content = await reply_content
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 422, in do_execute
      res = shell.run_cell(
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 546, in run_cell
      return super().run_cell(*args, **kwargs)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3009, in run_cell
      result = self._run_cell(
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3064, in _run_cell
      result = runner(coro)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
      coro.send(None)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3269, in run_cell_async
      has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3448, in run_ast_nodes
      if await self.run_code(code, result, async_=asy):
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
      exec(code_obj, self.user_global_ns, self.user_ns)
    File "/tmp/ipykernel_1838/854972197.py", line 30, in <module>
      model.fit(cmip_all_train, nino34_train, validation_data=(cmip_all_test, nino34_test), epochs=1, batch_size=400)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/training.py", line 1685, in fit
      tmp_logs = self.train_function(iterator)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/training.py", line 1284, in train_function
      return step_function(self, iterator)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/training.py", line 1268, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/training.py", line 1249, in run_step
      outputs = model.train_step(data)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/training.py", line 1050, in train_step
      y_pred = self(x, training=True)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/training.py", line 558, in __call__
      return super().__call__(*args, **kwargs)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1145, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/sequential.py", line 412, in call
      return super().call(inputs, training=training, mask=mask)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/functional.py", line 512, in call
      return self._run_internal_graph(inputs, training=training, mask=mask)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/functional.py", line 669, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1145, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/layers/convolutional/base_conv.py", line 290, in call
      outputs = self.convolution_op(inputs, self.kernel)
    File "/srv/conda/envs/notebook/lib/python3.10/site-packages/keras/layers/convolutional/base_conv.py", line 262, in convolution_op
      return tf.nn.convolution(
Node: 'ENSO_pretrain/conv2d_4/Conv2D'
DNN library is not found.
	 [[{{node ENSO_pretrain/conv2d_4/Conv2D}}]] [Op:__inference_train_function_4793]

I tried several times and could not reproduce the libdevice not found at ./libdevice.10.bc error, but there seems to be an incompatibility between TensorFlow and cuDNN, and cuDNN is not found:

tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:429] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

OP_REQUIRES failed at conv_ops.cc:1068 : UNIMPLEMENTED: DNN library is not found.

@leiyan90 please report the specific Hub URL (like https://us-central1-b.gcp.pangeo.io) and the Docker image (from a terminal, run echo $JUPYTER_IMAGE and you should see something like quay.io/pangeo/ml-notebook:2023.02.27).

Hi @scottyhq, thanks for the reminder. The Hub URL I am using is https://leap.2i2c.cloud
and the Docker image is pangeo/ml-notebook:2023.08.29

Thanks @leiyan90 for providing the example and all the version details. I’ve tried running ml-notebook:2023.08.29 locally and am getting the libdevice not found at ./libdevice.10.bc error on my end when running your model code :sweat_smile:

But, if I follow New optimizers fail to load CUDA installed through conda · Issue #17422 · keras-team/keras · GitHub and use the legacy optimizer like so:

from tensorflow.keras import optimizers
...
adam_optimizer = optimizers.legacy.Adam(learning_rate=1e-3)
...

the code runs without any issues. So I am wondering if this is an issue in keras=2.12.0?

@dhruvbalwada, have you seen errors like this recently, and is there something you did as a workaround? I know you had some issues with cuda-nvcc earlier this year at cuda-nvcc missing again · Issue #438 · pangeo-data/pangeo-docker-images · GitHub.

I have recently seen issues trying to use the newer optimizers on tensorflow past 2.10, so we basically did what you did there to get around it.
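
The workaround described in the replies above, falling back to the legacy Adam optimizer, can be written so the same code works whether or not `optimizers.legacy` exists. This is a sketch only: `make_adam` is a hypothetical helper, and the stand-in objects at the bottom exist purely so the example runs without TensorFlow installed. In a notebook you would pass `tensorflow.keras.optimizers` instead.

```python
from types import SimpleNamespace


def make_adam(optimizers, learning_rate=1e-3):
    """Build Adam from `optimizers.legacy` when present, else `optimizers.Adam`."""
    legacy = getattr(optimizers, "legacy", None)
    adam_cls = getattr(legacy, "Adam", None) or optimizers.Adam
    return adam_cls(learning_rate=learning_rate)


# Real usage (assumes TensorFlow is installed):
#   from tensorflow.keras import optimizers
#   model.compile(optimizer=make_adam(optimizers), loss="mse")

# Stand-ins that mimic the two module layouts, for demonstration only.
def _fake(kind):
    return lambda learning_rate: (kind, learning_rate)


with_legacy = SimpleNamespace(
    Adam=_fake("new"), legacy=SimpleNamespace(Adam=_fake("legacy"))
)
without_legacy = SimpleNamespace(Adam=_fake("new"))

print(make_adam(with_legacy))     # ('legacy', 0.001)
print(make_adam(without_legacy))  # ('new', 0.001)
```

Keeping the choice in one helper means the training script does not need to change if a later image makes the default optimizer work again.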