
TensorFlow does not see all available GPUs in my system #252

Open
lu4 opened this issue Jul 21, 2018 · 19 comments
lu4 commented Jul 21, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): N/A
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04 LTS x86_64
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): ComputeCpp-v0.6.0-4212-gb29ac8a 1.8.0-rc1
  • Python version: 3.5.2
  • Bazel version (if compiling from source): 0.15.0, build timestamp - 1530015019
  • GCC/Compiler version (if compiling from source): 5.4.0 20160609
  • CUDA/cuDNN version: N/A
  • GPU model and memory: Sapphire Radeon RX470, 8Gbytes
  • Exact command to reproduce: see below
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

Here's info from environment capture script:

== cat /etc/issue ===============================================
Linux custom 4.16.0-rc6-smos+ #1 SMP Wed Mar 21 13:23:56 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.3 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


== uname -a =====================================================
Linux custom 4.16.0-rc6-smos+ #1 SMP Wed Mar 21 13:23:56 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy               1.14.5
protobuf            3.6.0
tensorflow          1.8.0rc1

== check for virtualenv =========================================
False

== tensorflow import ============================================
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named tensorflow

== env ==========================================================
LD_LIBRARY_PATH /usr/local/lib:/usr/local/computecpp/lib:/usr/local/lib:/usr/local/computecpp/lib:
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


== cuda libs  ===================================================
/usr/local/lib/libcudart.so.9.0.103

== cat /etc/issue ===============================================
Linux custom 4.16.0-rc6-smos+ #1 SMP Wed Mar 21 13:23:56 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.3 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


== uname -a =====================================================
Linux custom 4.16.0-rc6-smos+ #1 SMP Wed Mar 21 13:23:56 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy               1.14.5
protobuf            3.6.0
tensorflow          1.8.0rc1

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.8.0-rc1
tf.GIT_VERSION = b'ComputeCpp-v0.6.0-4212-gb29ac8a'
tf.COMPILER_VERSION = b'ComputeCpp-v0.6.0-4212-gb29ac8a'
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH /usr/local/lib:/usr/local/computecpp/lib:/usr/local/lib:/usr/local/computecpp/lib:
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


== cuda libs  ===================================================
/usr/local/lib/libcudart.so.9.0.103

Describe the problem

TensorFlow built on top of SYCL refuses to list and use all the GPUs available in the system. I'm using the following commands to get the list of devices:

(please note that TensorFlow's inline log reports 8 devices, but the resulting variable contains just two entries: one CPU and one GPU, available under the "/device:SYCL:0" name)

>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
2018-07-21 14:21:08.328612: I ./tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2
2018-07-21 14:21:09.308907: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-07-21 14:21:09.308981: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309001: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 1, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309019: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 2, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309034: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 3, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309052: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 4, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309068: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 5, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309085: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 6, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
2018-07-21 14:21:09.309101: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 7, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 911408516298923653
, name: "/device:SYCL:0"
device_type: "SYCL"
memory_limit: 268435456
locality {
}
incarnation: 161138719697210983
physical_device_desc: "id: 0, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE"
]
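The mismatch between the enumeration log and the returned device list can be made concrete by tallying the log lines themselves; a minimal, TensorFlow-free sketch (the log text is pasted in as a string, abbreviated to two of the eight entries):

```python
import re

# Abbreviated copy of the sycl_device.h enumeration lines above;
# the real log lists ids 0-7, all of type GPU.
log = """\
id: 0, type: GPU, name: Ellesmere
id: 1, type: GPU, name: Ellesmere
"""

# Tally enumerated devices by their reported type.
counts = {}
for match in re.finditer(r"type: (\w+)", log):
    counts[match.group(1)] = counts.get(match.group(1), 0) + 1
print(counts)  # {'GPU': 2}
```

With the full log this tallies eight GPUs, while device_lib.list_local_devices() returns only one SYCL entry.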

I confirm that all devices are functional and available to OpenCL (visible to clinfo), and are operable through another third-party package (ArrayFire). I also confirm that SYCL itself sees all available devices; to test that, I've updated SYCL's 'custom-device-selector' example to the following code:

/***************************************************************************
 *
 *  Copyright (C) 2016 Codeplay Software Limited
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 *  For your convenience, a copy of the License has been included in this
 *  repository.
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 *
 *  Codeplay's ComputeCpp SDK
 *
 *  custom-device-selector.cpp
 *
 *  Description:
 *    Sample code that shows how to write a custom device selector in SYCL.
 *
 **************************************************************************/

#include <CL/sycl.hpp>
#include <iostream>

using namespace cl::sycl;
using namespace std;

/* Classes can inherit from the device_selector class to allow users
 * to dictate the criteria for choosing a device from those that might be
 * present on a system. This example looks for a device with SPIR support
 * and prefers GPUs over CPUs. */
class custom_selector : public device_selector {
 public:
  custom_selector() : device_selector() {}

  /* The selection is performed via the () operator in the base
   * selector class. This method will be called once per device in each
   * platform. Note that all platforms are evaluated whenever there is
   * a device selection. */
  int operator()(const device& device) const override {
    cout << device.get_info<cl::sycl::info::device::vendor>() << ": " << device.get_info<cl::sycl::info::device::name>() << std::endl; // << "(" << device.get_info<cl::sycl::info::device::device_type>() << ")"
    cout << '\t' << "max_work_group_size : " << device.get_info<cl::sycl::info::device::max_work_group_size>() << std::endl;
//    cout << '\t' << "max_work_item_sizes : " << device.get_info<cl::sycl::info::device::max_work_item_sizes>() << std::endl;
    cout << '\t' << "max_compute_units   : " << device.get_info<cl::sycl::info::device::max_compute_units>() << std::endl;
    cout << '\t' << "local_mem_size      : " << device.get_info<cl::sycl::info::device::local_mem_size>() << std::endl;
    cout << '\t' << "max_mem_alloc_size  : " << device.get_info<cl::sycl::info::device::max_mem_alloc_size>() << std::endl;
    cout << '\t' << "profile             : " << device.get_info<cl::sycl::info::device::profile>() << std::endl;
    cout << "----------------------------------------------------------------------------------------------" <<  std::endl << std::endl << std::endl;

    /* We only give a valid score to devices that support SPIR. */
    if (device.has_extension(cl::sycl::string_class("cl_khr_spir"))) {
      if (device.get_info<info::device::device_type>() ==
          info::device_type::cpu) {
        return 50;
      }

      if (device.get_info<info::device::device_type>() ==
          info::device_type::gpu) {
        return 100;
      }
    }
    /* Devices with a negative score will never be chosen. */
    return -1;
  }
};

int main() {
  const int dataSize = 64;
  int ret = -1;
  float data[dataSize] = {0.f};

  range<1> dataRange(dataSize);
  buffer<float, 1> buf(data, dataRange);

  /* We create an object of custom_selector type and use it
   * like any other selector. */
  custom_selector selector;
  queue myQueue(selector);

  myQueue.submit([&](handler& cgh) {
    auto ptr = buf.get_access<access::mode::read_write>(cgh);

    cgh.parallel_for<class example_kernel>(dataRange, [=](item<1> item) {
      size_t idx = item.get_linear_id();
      ptr[item.get_linear_id()] = static_cast<float>(idx);
    });
  });

  /* A host accessor can be used to force an update from the device to the
   * host, allowing the data to be checked. */
  accessor<float, 1, access::mode::read_write, access::target::host_buffer>
      hostPtr(buf);

  if (hostPtr[10] == 10.0f) {
    ret = 0;
  }

  return ret;
}
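The selector's scoring policy can be sanity-checked in isolation. Here is a pure-Python model of the same logic; the Device record below is a hypothetical stand-in for cl::sycl::device, not a real SYCL binding:

```python
from dataclasses import dataclass

@dataclass
class Device:
    device_type: str   # "cpu" or "gpu" (stand-in for info::device_type)
    extensions: tuple  # stand-in for has_extension()

def score(dev):
    # Only devices advertising SPIR support get a valid score;
    # GPUs are preferred over CPUs, mirroring the C++ selector.
    if "cl_khr_spir" in dev.extensions:
        if dev.device_type == "cpu":
            return 50
        if dev.device_type == "gpu":
            return 100
    return -1  # negative scores are never chosen

devices = [
    Device("cpu", ("cl_khr_spir",)),
    Device("gpu", ("cl_khr_spir",)),
    Device("gpu", ()),  # no SPIR support: excluded
]
best = max(devices, key=score)
print(best.device_type, score(best))  # gpu 100
```

Note that, like the C++ selector, this picks a single best device; it never returns more than one, which is consistent with only one SYCL device surfacing.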
@lu4 lu4 changed the title TensorFlow does not see all available GPU's in my system TensorFlow does not see all available GPUs in my system Jul 21, 2018

mirh commented Jul 22, 2018

Try computecpp 0.9.0 for starters?


lu4 commented Jul 22, 2018

Sorry, I didn't understand the suggestion... I was using ComputeCpp-v0.6.0-4212-gb29ac8a, but ComputeCpp-v0.6.0-4212-gb29ac8a itself is working fine; it looks as though TF is buggy...


Rbiessy commented Jul 23, 2018

@lu4 as @mirh suggested, compiling with our latest ComputeCpp version will let you use a more recent version of TF. Could you try downloading ComputeCpp CE 0.9.1? To compile, you will need to use the latest commit of the eigen_sycl branch here: https://github.com/codeplaysoftware/tensorflow/tree/eigen_sycl


lu4 commented Jul 23, 2018

Oh, I see, thanks, trying...


lu4 commented Jul 23, 2018

vagrant@ubuntu-xenial:~/Project/tensorflow_eigen$ bazel build -c opt --config=sycl //tensorflow/tools/pip_package:build_pip_package
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
..................
INFO: SHA256 (https://github.com/KhronosGroup/OpenCL-Headers/archive/f039db6764d52388658ef15c30b2237bbda49803.tar.gz) = a29e3e67beef1ad0ea6b0afd44b4b2c0e6054d1f9d68fdbd0c4ce434e59533e0
ERROR: /home/vagrant/.cache/bazel/_bazel_vagrant/e647697a348b187726950a371af92dd1/external/jpeg/BUILD:126:12: Illegal ambiguous match on configurable attribute "deps" in @jpeg//:jpeg:
@jpeg//:k8
@jpeg//:armeabi-v7a
Multiple matches are not allowed unless one is unambiguously more specialized.
ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package' failed; build aborted:

/home/vagrant/.cache/bazel/_bazel_vagrant/e647697a348b187726950a371af92dd1/external/jpeg/BUILD:126:12: Illegal ambiguous match on configurable attribute "deps" in @jpeg//:jpeg:
@jpeg//:k8
@jpeg//:armeabi-v7a
Multiple matches are not allowed unless one is unambiguously more specialized.
INFO: Elapsed time: 16.227s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (132 packages loaded)
currently loading: tensorflow/core/kernels


lu4 commented Jul 23, 2018

It looks as though the build system is trying to build for the ARM architecture; I have no clue why...


Rbiessy commented Jul 23, 2018

Ah, this is a known issue with TF 1.6 and recent versions of Bazel. You have to use Bazel 0.11.1 for our current version of TF. Make sure to manually remove the cache before compiling again.


lu4 commented Jul 23, 2018

Thanks, trying...


lu4 commented Jul 24, 2018

@Rbiessy @mirh OK guys, I've compiled TF as mentioned above, for both the eigen and lukeiwanski repos, i.e. using ComputeCpp CE 0.9.1, but the resulting TF build still reports b'ComputeCpp-v0.6.0-4212-gb29ac8a' 1.8.0-rc1. In addition to that, it sees just one card.


mirh commented Jul 25, 2018

On a night in Europe? Hardly, I think.

Anyway, for the love of me, your dev environment just seems so weird.
Can't you clean it up or try on another system?

And you are trying to build this, right? https://github.com/lukeiwanski/tensorflow/archive/dev/amd_gpu.zip

@rodburns

Can you post the output of the "computecpp_info" tool located in the "bin" folder of the ComputeCpp release you are using?


lu4 commented Jul 30, 2018

Hi, here is the output:

$ /usr/local/computecpp/bin/computecpp_info
********************************************************************************

ComputeCpp Info (CE 0.9.1)

SYCL 1.2.1 revision 3

********************************************************************************

Toolchain information:

GLIBC version: 2.23
GLIBCXX: 20160609
This version of libstdc++ is supported.

********************************************************************************


Device Info:

Discovered 8 devices matching:
  platform    : <any>
  device type : <any>

--------------------------------------------------------------------------------
Device 0:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 1:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 2:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 3:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 4:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 5:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 6:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 7:

  Device is supported                     : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME                          : Ellesmere
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 2482.3
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU

If you encounter problems when using any of these OpenCL devices, please consult
this website for known issues:
https://computecpp.codeplay.com/releases/v0.9.1/platform-support-notes

********************************************************************************


lu4 commented Jul 30, 2018

@mirh I was able to compile TF using the provided archive, but it still shows just one GPU in TF.


lu4 commented Jul 30, 2018

Guys, I was wondering if you provide paid support; I need to get TF working with all the devices in my machine. The issue is highly critical for me and I'm willing to pay a couple of hundred bucks to get the ball rolling. Is that possible somehow?

@lukeiwanski (Owner)

@lu4 thanks for the report. That is an interesting rig you have there.

So far our focus has been on supporting systems with only one device - like one GPU - or combinations of devices, like a CPU with one GPU and one other accelerator.

It is quite complex to add support for multiple GPUs - nevertheless, I believe we should do this.

This task will most likely take some time - have you tried HIP?

As for the paid support, can you email me directly regarding that?

@jwlawson (Collaborator)

@lu4 I have absolutely no idea if this will work, but when you create a TensorFlow session, try setting the SYCL device count in the session config options:

import tensorflow as tf
with tf.Session(config=tf.ConfigProto(device_count={'SYCL': 8})) as sess:
  print(sess.list_devices())

Even if this does allow TF to see all your devices, I don't know if it will automatically schedule compute across all of them. It would be very interesting to hear the results of this.
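If the workaround does expose all devices, ops would presumably still need to be placed on each SYCL device by hand. A hypothetical helper (pure Python, no TensorFlow required) that round-robins op names across the "/device:SYCL:N" device strings seen in the output earlier in the thread:

```python
# Hypothetical helper: spread op names across the visible SYCL devices.
# Device strings follow the "/device:SYCL:N" naming used by this build.
def assign_round_robin(op_names, n_devices=8):
    placement = {}
    for i, op in enumerate(op_names):
        placement[op] = "/device:SYCL:%d" % (i % n_devices)
    return placement

print(assign_round_robin(["matmul_0", "matmul_1", "matmul_2"], n_devices=2))
# {'matmul_0': '/device:SYCL:0', 'matmul_1': '/device:SYCL:1', 'matmul_2': '/device:SYCL:0'}
```

The resulting strings could then be passed to tf.device() context managers; whether the SYCL runtime balances such placements well is untested here.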


lu4 commented Aug 2, 2018

@jwlawson your trick worked: I was able to access all the GPUs in my system. It turns out that not everything works smoothly, though; for example, eager execution is not able to take advantage of all the cards (it may also be due to misconfiguration): for some reason it just binds to gpu:0 and does not want to use anything else. I'm continuing to investigate and will report back if I find anything useful.


lu4 commented Aug 2, 2018

@lukeiwanski I've sent an email to you (used your github email [email protected]), JFYI

@lukeiwanski (Owner)

@lu4 yes, the email is correct... however, I cannot find any email from you :(
