
TensorFlow does not see all available GPUs in my system #252

Open
lu4 opened this issue Jul 21, 2018 · 19 comments
lu4 commented Jul 21, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): N/A
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04 LTS x86_64
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): ComputeCpp-v0.6.0-4212-gb29ac8a 1.8.0-rc1
  • Python version: 3.5.2
  • Bazel version (if compiling from source): 0.15.0, build timestamp - 1530015019
  • GCC/Compiler version (if compiling from source): 5.4.0 20160609
  • CUDA/cuDNN version: N/A
  • GPU model and memory: Sapphire Radeon RX470, 8Gbytes
  • Exact command to reproduce: see below
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

Here's info from environment capture script:

== cat /etc/issue ===============================================
Linux custom 4.16.0-rc6-smos+ #1 SMP Wed Mar 21 13:23:56 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.3 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


== uname -a =====================================================
Linux custom 4.16.0-rc6-smos+ #1 SMP Wed Mar 21 13:23:56 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy               1.14.5
protobuf            3.6.0
tensorflow          1.8.0rc1

== check for virtualenv =========================================
False

== tensorflow import ============================================
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named tensorflow

== env ==========================================================
LD_LIBRARY_PATH /usr/local/lib:/usr/local/computecpp/lib:/usr/local/lib:/usr/local/computecpp/lib:
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


== cuda libs  ===================================================
/usr/local/lib/libcudart.so.9.0.103

== cat /etc/issue ===============================================
Linux custom 4.16.0-rc6-smos+ #1 SMP Wed Mar 21 13:23:56 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.3 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


== uname -a =====================================================
Linux custom 4.16.0-rc6-smos+ #1 SMP Wed Mar 21 13:23:56 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy               1.14.5
protobuf            3.6.0
tensorflow          1.8.0rc1

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.8.0-rc1
tf.GIT_VERSION = b'ComputeCpp-v0.6.0-4212-gb29ac8a'
tf.COMPILER_VERSION = b'ComputeCpp-v0.6.0-4212-gb29ac8a'
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH /usr/local/lib:/usr/local/computecpp/lib:/usr/local/lib:/usr/local/computecpp/lib:
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


== cuda libs  ===================================================
/usr/local/lib/libcudart.so.9.0.103

Describe the problem

TensorFlow built on top of SYCL refuses to list and use all the GPUs available in the system. I'm using the following commands to get the list of devices:

(please note that TensorFlow's inline log reports 8 devices, but the resulting variable contains just two entries: one CPU and one GPU, available under the "/device:SYCL:0" name)

>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
2018-07-21 14:21:08.328612: I ./tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2
2018-07-21 14:21:09.308907: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-07-21 14:21:09.308981: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309001: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 1, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309019: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 2, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309034: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 3, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309052: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 4, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309068: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 5, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309085: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 6, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309101: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 7, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 911408516298923653
, name: "/device:SYCL:0"
device_type: "SYCL"
memory_limit: 268435456
locality {
}
incarnation: 161138719697210983
physical_device_desc: "id: 0, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE"
]
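The mismatch between the enumeration log and the returned device list can be made concrete by tallying the log lines themselves; a minimal, TensorFlow-free sketch (the log text is pasted in as a string, abbreviated to two of the eight entries):

```python
import re

# Abbreviated copy of the sycl_device.h enumeration lines above;
# the real log lists ids 0-7, all of type GPU.
log = """\
id: 0, type: GPU, name: Ellesmere
id: 1, type: GPU, name: Ellesmere
"""

# Tally enumerated devices by their reported type.
counts = {}
for match in re.finditer(r"type: (\w+)", log):
    counts[match.group(1)] = counts.get(match.group(1), 0) + 1
print(counts)  # {'GPU': 2}
```

With the full log this tallies eight GPUs, while device_lib.list_local_devices() returns only one SYCL entry.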

I confirm that all devices are functional and available to OpenCL (visible to clinfo), and are operable through another third-party package (ArrayFire). I also confirm that SYCL itself sees all available devices; to test that, I've updated SYCL's 'custom-device-selector' example to the following code:

/***************************************************************************
 *
 *  Copyright (C) 2016 Codeplay Software Limited
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 *  For your convenience, a copy of the License has been included in this
 *  repository.
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 *
 *  Codeplay's ComputeCpp SDK
 *
 *  custom-device-selector.cpp
 *
 *  Description:
 *    Sample code that shows how to write a custom device selector in SYCL.
 *
 **************************************************************************/

#include <CL/sycl.hpp>
#include <iostream>

using namespace cl::sycl;
using namespace std;

/* Classes can inherit from the device_selector class to allow users
 * to dictate the criteria for choosing a device from those that might be
 * present on a system. This example looks for a device with SPIR support
 * and prefers GPUs over CPUs. */
class custom_selector : public device_selector {
 public:
  custom_selector() : device_selector() {}

  /* The selection is performed via the () operator in the base
   * selector class. This method will be called once per device in each
   * platform. Note that all platforms are evaluated whenever there is
   * a device selection. */
  int operator()(const device& device) const override {
    cout << device.get_info<cl::sycl::info::device::vendor>() << ": " << device.get_info<cl::sycl::info::device::name>() << std::endl; // << "(" << device.get_info<cl::sycl::info::device::device_type>() << ")"
    cout << '\t' << "max_work_group_size : " << device.get_info<cl::sycl::info::device::max_work_group_size>() << std::endl;
//    cout << '\t' << "max_work_item_sizes : " << device.get_info<cl::sycl::info::device::max_work_item_sizes>() << std::endl;
    cout << '\t' << "max_compute_units   : " << device.get_info<cl::sycl::info::device::max_compute_units>() << std::endl;
    cout << '\t' << "local_mem_size      : " << device.get_info<cl::sycl::info::device::local_mem_size>() << std::endl;
    cout << '\t' << "max_mem_alloc_size  : " << device.get_info<cl::sycl::info::device::max_mem_alloc_size>() << std::endl;
    cout << '\t' << "profile             : " << device.get_info<cl::sycl::info::device::profile>() << std::endl;
    cout << "----------------------------------------------------------------------------------------------" <<  std::endl << std::endl << std::endl;

    /* We only give a valid score to devices that support SPIR. */
    if (device.has_extension(cl::sycl::string_class("cl_khr_spir"))) {
      if (device.get_info<info::device::device_type>() ==
          info::device_type::cpu) {
        return 50;
      }

      if (device.get_info<info::device::device_type>() ==
          info::device_type::gpu) {
        return 100;
      }
    }
    /* Devices with a negative score will never be chosen. */
    return -1;
  }
};

int main() {
  const int dataSize = 64;
  int ret = -1;
  float data[dataSize] = {0.f};

  range<1> dataRange(dataSize);
  buffer<float, 1> buf(data, dataRange);

  /* We create an object of custom_selector type and use it
   * like any other selector. */
  custom_selector selector;
  queue myQueue(selector);

  myQueue.submit([&](handler& cgh) {
    auto ptr = buf.get_access<access::mode::read_write>(cgh);

    cgh.parallel_for<class example_kernel>(dataRange, [=](item<1> item) {
      size_t idx = item.get_linear_id();
      ptr[item.get_linear_id()] = static_cast<float>(idx);
    });
  });

  /* A host accessor can be used to force an update from the device to the
   * host, allowing the data to be checked. */
  accessor<float, 1, access::mode::read_write, access::target::host_buffer>
      hostPtr(buf);

  if (hostPtr[10] == 10.0f) {
    ret = 0;
  }

  return ret;
}
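The selector's scoring policy can be sanity-checked in isolation. Here is a pure-Python model of the same logic; the Device record below is a hypothetical stand-in for cl::sycl::device, not a real SYCL binding:

```python
from dataclasses import dataclass

@dataclass
class Device:
    device_type: str   # "cpu" or "gpu" (stand-in for info::device_type)
    extensions: tuple  # stand-in for has_extension()

def score(dev):
    # Only devices advertising SPIR support get a valid score;
    # GPUs are preferred over CPUs, mirroring the C++ selector.
    if "cl_khr_spir" in dev.extensions:
        if dev.device_type == "cpu":
            return 50
        if dev.device_type == "gpu":
            return 100
    return -1  # negative scores are never chosen

devices = [
    Device("cpu", ("cl_khr_spir",)),
    Device("gpu", ("cl_khr_spir",)),
    Device("gpu", ()),  # no SPIR support: excluded
]
best = max(devices, key=score)
print(best.device_type, score(best))  # gpu 100
```

Note that, like the C++ selector, this picks a single best device; it never returns more than one, which is consistent with only one SYCL device surfacing.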
@lu4 lu4 changed the title TensorFlow does not see all available GPU's in my system TensorFlow does not see all available GPUs in my system Jul 21, 2018

mirh commented Jul 22, 2018

Try computecpp 0.9.0 for starters?


lu4 commented Jul 22, 2018

Sorry, I didn't understand the suggestion... I was using ComputeCpp-v0.6.0-4212-gb29ac8a, but ComputeCpp-v0.6.0-4212-gb29ac8a itself is working fine; it looks as though TF is buggy...


Rbiessy commented Jul 23, 2018

@lu4 as @mirh suggested, compiling with our latest ComputeCpp version will let you use a more recent version of TF. Could you try downloading ComputeCpp CE 0.9.1? To compile, you will need to use the latest commit of the eigen_sycl branch here: https://github.com/codeplaysoftware/tensorflow/tree/eigen_sycl


lu4 commented Jul 23, 2018

Oh, I see, thanks, trying...


lu4 commented Jul 23, 2018

vagrant@ubuntu-xenial:~/Project/tensorflow_eigen$ bazel build -c opt --config=sycl //tensorflow/tools/pip_package:build_pip_package
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
..................
INFO: SHA256 (https://github.com/KhronosGroup/OpenCL-Headers/archive/f039db6764d52388658ef15c30b2237bbda49803.tar.gz) = a29e3e67beef1ad0ea6b0afd44b4b2c0e6054d1f9d68fdbd0c4ce434e59533e0
ERROR: /home/vagrant/.cache/bazel/_bazel_vagrant/e647697a348b187726950a371af92dd1/external/jpeg/BUILD:126:12: Illegal ambiguous match on configurable attribute "deps" in @jpeg//:jpeg:
@jpeg//:k8
@jpeg//:armeabi-v7a
Multiple matches are not allowed unless one is unambiguously more specialized.
ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package' failed; build aborted:

/home/vagrant/.cache/bazel/_bazel_vagrant/e647697a348b187726950a371af92dd1/external/jpeg/BUILD:126:12: Illegal ambiguous match on configurable attribute "deps" in @jpeg//:jpeg:
@jpeg//:k8
@jpeg//:armeabi-v7a
Multiple matches are not allowed unless one is unambiguously more specialized.
INFO: Elapsed time: 16.227s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (132 packages loaded)
currently loading: tensorflow/core/kernels


lu4 commented Jul 23, 2018

It looks as though the build system is trying to build for the ARM architecture; I have no clue why...


Rbiessy commented Jul 23, 2018

Ah, this is a known issue with TF 1.6 and recent versions of Bazel. You have to use Bazel 0.11.1 for our current version of TF. Make sure to manually remove the cache before compiling again.


lu4 commented Jul 23, 2018

Thanks, trying...


lu4 commented Jul 24, 2018

@Rbiessy @mirh OK guys, I've compiled TF as mentioned above, for both the eigen and lukeiwanski repos, i.e. using ComputeCpp CE 0.9.1, but the resulting TF build still reports b'ComputeCpp-v0.6.0-4212-gb29ac8a' 1.8.0-rc1. In addition to that, it sees just one card.


mirh commented Jul 25, 2018

On a night in Europe? Hardly, I think.

Anyway, for the love of me, your dev environment just seems so weird.
Can't you clean it up or try on another system?

And you are trying to build this, right? https://github.com/lukeiwanski/tensorflow/archive/dev/amd_gpu.zip

@rodburns

Can you post the output of the "computecpp_info" tool located in the "bin" folder of the ComputeCpp release you are using?


lu4 commented Jul 30, 2018

Hi, here is the output:

$ /usr/local/computecpp/bin/computecpp_info
********************************************************************************

ComputeCpp Info (CE 0.9.1)

SYCL 1.2.1 revision 3

********************************************************************************

Toolchain information:

GLIBC version: 2.23
GLIBCXX: 20160609
This version of libstdc++ is supported.

********************************************************************************


Device Info:

Discovered 8 devices matching:
  platform    : <any>
  device type : <any>

--------------------------------------------------------------------------------
Device 0:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 1:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 2:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 3:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 4:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 5:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 6:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 7:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU

If you encounter problems when using any of these OpenCL devices, please consult
this website for known issues:
https://computecpp.codeplay.com/releases/v0.9.1/platform-support-notes

********************************************************************************


lu4 commented Jul 30, 2018

@mirh I was able to compile TF using the provided archive, but it still shows just one GPU in TF.


lu4 commented Jul 30, 2018

Guys, I was wondering if you provide paid support; I need to get TF working with all the devices in my machine. The issue is highly critical for me and I'm willing to pay a couple of hundred bucks to get the ball rolling. Is that possible somehow?

@lukeiwanski (Owner)

@lu4 thanks for the report. That is an interesting rig you have there.

So far our focus has been on supporting systems with only one device - like one GPU - or combinations of devices, like a CPU with one GPU and one other accelerator.

It is quite complex to add support for multiple GPUs - nevertheless, I believe we should do this.

This task will most likely take some time - have you tried HIP?

As for the paid support, can you email me directly regarding that?

@jwlawson (Collaborator)

@lu4 I have absolutely no idea if this will work, but when you create a TensorFlow session, try setting the SYCL device count in the session config options:

import tensorflow as tf
with tf.Session(config=tf.ConfigProto(device_count={'SYCL': 8})) as sess:
  print(sess.list_devices())

Even if this does allow TF to see all your devices, I don't know if it will automatically schedule compute across all of them. It would be very interesting to hear the results of this.
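If the workaround does expose all devices, ops would presumably still need to be placed on each SYCL device by hand. A hypothetical helper (pure Python, no TensorFlow required) that round-robins op names across the "/device:SYCL:N" device strings seen in the output earlier in the thread:

```python
# Hypothetical helper: spread op names across the visible SYCL devices.
# Device strings follow the "/device:SYCL:N" naming used by this build.
def assign_round_robin(op_names, n_devices=8):
    placement = {}
    for i, op in enumerate(op_names):
        placement[op] = "/device:SYCL:%d" % (i % n_devices)
    return placement

print(assign_round_robin(["matmul_0", "matmul_1", "matmul_2"], n_devices=2))
# {'matmul_0': '/device:SYCL:0', 'matmul_1': '/device:SYCL:1', 'matmul_2': '/device:SYCL:0'}
```

The resulting strings could then be passed to tf.device() context managers; whether the SYCL runtime balances such placements well is untested here.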


lu4 commented Aug 2, 2018

@jwlawson your trick worked: I was able to access all the GPUs in my system. It turns out that not everything works smoothly, though; for example, eager execution is not able to take advantage of all the cards (it may also be due to misconfiguration): for some reason it just binds to gpu:0 and does not want to use anything else. I'm continuing to investigate and will report back if I find anything useful.


lu4 commented Aug 2, 2018

@lukeiwanski I've sent an email to you (used your github email [email protected]), JFYI

@lukeiwanski (Owner)

@lu4 yes, the email is correct... however, I cannot find any email from you :(
