H.T.U Tensorflow

Li Jiang edited this page Jun 23, 2017 · 2 revisions

Tensorflow

About

Using TensorFlow on the Tianhe HPC system

-- LiJiang 20161215

The latest TensorFlow is now deployed on the GPU partition (LN41).

-- LiJiang 20170623

Quick Run

[nscc-gz_jiangli@ln2%tianhe2-C ~]$ module avail tensorflow

---------------------------- /WORK/app/modulefiles -----------------------------
tensorflow/0.10.0rc0 tensorflow/0.11.0
[nscc-gz_jiangli@ln2%tianhe2-C ~]$ 

Two versions are currently deployed: tensorflow/0.10.0rc0 and tensorflow/0.11.0.

They are used slightly differently.

A simple test script is available at /WORK/app/tensorflow/test/hello.py:

[nscc-gz_jiangli@ln2%tianhe2-C ~]$ cat /WORK/app/tensorflow/test/hello.py 
#!/usr/bin/env python
import tensorflow as tf
hello = tf.constant("Hello, TensorFLow")
sess = tf.Session()
print ( sess.run(hello) )
a = tf.constant(10)
b = tf.constant(32)
print ( sess.run(a+b) )

use tensorflow/0.10.0rc0

[nscc-gz_jiangli@ln2%tianhe2-C ~]$ module load tensorflow/0.10.0rc0 
[nscc-gz_jiangli@ln2%tianhe2-C ~]$ which python
/WORK/app/TensorFlow/anaconda2/bin/python
[nscc-gz_jiangli@ln2%tianhe2-C ~]$ python --version
Python 2.7.12 :: Anaconda 4.2.0 (64-bit)
[nscc-gz_jiangli@ln2%tianhe2-C ~]$ yhrun -n 1 python /WORK/app/tensorflow/test/hello.py 
Hello, TensorFLow
42

As shown above, for this version a plain module load is all that is needed.
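Before submitting a longer job, it can help to confirm which interpreter is active and whether it can actually import tensorflow. The following is a minimal sketch rather than part of the deployed tooling; the function name `check_import` and the idea of saving it as a separate script are assumptions made for illustration.

```python
import sys


def check_import(module_name):
    """Try to import module_name and return (ok, detail).

    detail is the module's file path on success, or the ImportError
    message on failure.
    """
    try:
        module = __import__(module_name)
    except ImportError as exc:
        return False, str(exc)
    return True, getattr(module, "__file__", "<built-in>")


if __name__ == "__main__":
    # sys.executable shows which python the loaded module put on PATH.
    print("interpreter: %s" % sys.executable)
    ok, detail = check_import("tensorflow")
    print("tensorflow importable: %s" % ok)
    print("detail: %s" % detail)
```

Saved under a name of your choice, it can be run the same way as hello.py, for example with `yhrun -n 1 python` followed by the script path.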

use tensorflow/0.11.0

[nscc-gz_jiangli@ln2%tianhe2-C ~]$ module load tensorflow/0.11.0
####################################################################
 To use TensorFlow you have to activate the conda environment by executing these two commands:
 1. settf
 2. source activate tensorflow_0_11_0
 
 (tensorflow)$  # Your prompt should change.
 # Run Python programs that use TensorFlow.
 # ...
 # When you are done using TensorFlow, deactivate the environment.
 (tensorflow)$ source deactivate tensorflow_0_11_0
 ####################################################################
[nscc-gz_jiangli@ln2%tianhe2-C ~]$ 
[nscc-gz_jiangli@ln2%tianhe2-C ~]$ yhrun -n 1 python /WORK/app/tensorflow/test/hello.py 
Traceback (most recent call last):
  File "/WORK/app/tensorflow/test/hello.py", line 2, in <module>
    import tensorflow as tf
ImportError: No module named tensorflow
yhrun: error: cn11642: task 0: Exited with exit code 1

As shown above, module load alone is not enough for this version; the conda environment also has to be activated:

[nscc-gz_jiangli@ln2%tianhe2-C ~]$ settf
[nscc-gz_jiangli@ln2%tianhe2-C ~]$  source activate tensorflow_0_11_0
(tensorflow_0_11_0) [nscc-gz_jiangli@ln2%tianhe2-C ~]$ 
(tensorflow_0_11_0) [nscc-gz_jiangli@ln2%tianhe2-C ~]$ yhrun  -n 1 python /WORK/app/tensorflow/test/hello.py
Traceback (most recent call last):
  File "/WORK/app/tensorflow/test/hello.py", line 2, in <module>
    import tensorflow as tf
  File "/WORK/app/anaconda/4.2.0/envs/tensorflow_0_11_0/lib/python2.7/site-packages/tensorflow/__init__.py", line 23, in <module>
    from tensorflow.python import *
  File "/WORK/app/anaconda/4.2.0/envs/tensorflow_0_11_0/lib/python2.7/site-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/WORK/app/anaconda/4.2.0/envs/tensorflow_0_11_0/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 28, in <module>
    _pywrap_tensorflow = swig_import_helper()
  File "/WORK/app/anaconda/4.2.0/envs/tensorflow_0_11_0/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow', fp, pathname, description)
ImportError: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /WORK/app/anaconda/4.2.0/envs/tensorflow_0_11_0/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow.so)
yhrun: error: cn5987: task 0: Exited with exit code 1
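The second traceback points at the C library rather than at TensorFlow itself: the extension module asks for the symbol version GLIBC_2.14. One way to see whether a given node satisfies that is to print its glibc version from Python. The following is a minimal sketch under stated assumptions, not a site-provided tool: the function names are hypothetical, the 2.14 threshold is taken from the error message, and it assumes a Linux node where `os.confstr` recognizes `CS_GNU_LIBC_VERSION`.

```python
import os
import platform

REQUIRED_GLIBC = (2, 14)  # from the "GLIBC_2.14 not found" message


def parse_version(text):
    """Turn a string such as '2.12' into a tuple such as (2, 12)."""
    parts = []
    for piece in text.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if digits:
            parts.append(int(digits))
    return tuple(parts)


def system_glibc_version():
    """Return the running system's glibc version string, or None."""
    try:
        text = os.confstr("CS_GNU_LIBC_VERSION")  # e.g. "glibc 2.17"
    except (AttributeError, ValueError, OSError):
        return None
    if not text:
        return None
    fields = text.split()
    if len(fields) == 2 and fields[0] == "glibc":
        return fields[1]
    return None


def glibc_is_new_enough(version_text, required=REQUIRED_GLIBC):
    """Compare numerically, so that 2.5 counts as older than 2.14."""
    return parse_version(version_text) >= required


if __name__ == "__main__":
    version = system_glibc_version()
    print("hostname: %s" % platform.node())
    print("glibc version: %s" % (version or "unknown"))
    if version:
        print("meets GLIBC_2.14: %s" % glibc_is_new_enough(version))
```

Running it once on a login node and once through `yhrun -n 1 python` would show whether the nodes differ in glibc version.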

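Submitting a job directly with yhrun appears to be problematic for this version of TensorFlow. Testing shows, however, that it works after allocating a compute node and logging in to it: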

[nscc-gz_jiangli@ln2%tianhe2-C ~]$ yhalloc 
yhalloc: Granted job allocation 3907141
[nscc-gz_jiangli@ln2 ~]$ yhq
             JOBID PARTITION     NAME         USER ST       TIME  NODES NODELIST(REASON)
           3907141      work     bash nscc-gz_jian  R       0:02      1 cn5060
[nscc-gz_jiangli@ln2 ~]$ ssh cn5060
Warning: Permanently added 'cn5060' (RSA) to the list of known hosts.
[nscc-gz_jiangli@cn5060%tianhe2-C ~]$ module load  tensorflow/0.11.0
####################################################################
 To use TensorFlow you have to activate the conda environment by executing these two commands:
 1. settf
 2. source activate tensorflow_0_11_0
 
 (tensorflow)$  # Your prompt should change.
 # Run Python programs that use TensorFlow.
 # ...
 # When you are done using TensorFlow, deactivate the environment.
 (tensorflow)$ source deactivate tensorflow_0_11_0
####################################################################
[nscc-gz_jiangli@cn5060%tianhe2-C ~]$ settf
[nscc-gz_jiangli@cn5060%tianhe2-C ~]$ source activate tensorflow_0_11_0
(tensorflow_0_11_0) [nscc-gz_jiangli@cn5060%tianhe2-C ~]$ python /WORK/app/tensorflow/test/hello.py 
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH: /WORK/app/CUDA/8.0/libnvvp:/WORK/app/CUDA/8.0/libnsight:/WORK/app/CUDA/8.0/lib64:/WORK/app/CUDA/8.0/lib:/WORK/app/cudnn/5.1-CUDA8.0/lib64:/WORK/app/gcc/4.9.2/lib64:/WORK/app/gcc/4.9.2/lib:/WORK/app/gcc/4.9.2/libexec:/WORK/app/mpc/0.8.1/lib:/WORK/app/MPFR/2.4.2/lib:/WORK/app/gmp/4.3.2/lib:/opt/intel/mic/coi/host-linux-release/lib
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: cn5060
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1080] LD_LIBRARY_PATH: /WORK/app/CUDA/8.0/libnvvp:/WORK/app/CUDA/8.0/libnsight:/WORK/app/CUDA/8.0/lib64:/WORK/app/CUDA/8.0/lib:/WORK/app/cudnn/5.1-CUDA8.0/lib64:/WORK/app/gcc/4.9.2/lib64:/WORK/app/gcc/4.9.2/lib:/WORK/app/gcc/4.9.2/libexec:/WORK/app/mpc/0.8.1/lib:/WORK/app/MPFR/2.4.2/lib:/WORK/app/gmp/4.3.2/lib:/opt/intel/mic/coi/host-linux-release/lib
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1081] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:140] kernel driver does not appear to be running on this host (cn5060): /proc/driver/nvidia/version does not exist
Hello, TensorFLow
42
(tensorflow_0_11_0) [nscc-gz_jiangli@cn5060%tianhe2-C ~]$ 

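The output is verbose, but the computation still completed correctly. The CUDA messages indicate that no NVIDIA driver was found on cn5060 (libcuda.so.1 could not be opened and cuInit returned CUDA_ERROR_NO_DEVICE), so this run did not use a GPU.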
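The two things the log complains about, the driver version file and the libcuda.so.1 library, can be checked directly before starting TensorFlow. The following is a rough sketch using only the standard library; the function names are hypothetical, the path and library name are copied from the log output above, and a positive result only shows that the driver pieces are present, not that TensorFlow will use the GPU.

```python
import ctypes
import os

DRIVER_VERSION_PATH = "/proc/driver/nvidia/version"  # path named in the log
DRIVER_LIBRARY = "libcuda.so.1"                      # library named in the log


def driver_version_file_exists(path=DRIVER_VERSION_PATH):
    """True if the kernel driver's version file is visible on this node."""
    return os.path.exists(path)


def can_load_library(name=DRIVER_LIBRARY):
    """True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        return False
    return True


def gpu_driver_report():
    """Collect both checks into one dictionary."""
    return {
        "driver_version_file": driver_version_file_exists(),
        "libcuda_loadable": can_load_library(),
    }


if __name__ == "__main__":
    for key, value in sorted(gpu_driver_report().items()):
        print("%s: %s" % (key, value))
```

On a node like cn5060 in the session above, both values would be expected to come out False, matching the log messages.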

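Additional notes

  • To use a GPU, submit the job to the corresponding GPU partition.
  • Contributions of performance benchmark data or test cases from users are welcome.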
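For anyone who wants to contribute timing data, a small harness along the following lines may be a starting point. It is only a sketch: the function names are hypothetical, the workload is a pure-Python stand-in so that the script runs anywhere, and on the cluster the callable would instead wrap whatever TensorFlow call is being measured (for example a `sess.run(...)` on an existing session).

```python
import timeit


def time_callable(func, repeats=5):
    """Call func() `repeats` times; return wall-clock seconds per call."""
    timings = []
    for _ in range(repeats):
        start = timeit.default_timer()
        func()
        timings.append(timeit.default_timer() - start)
    return timings


def summarize(timings):
    """Return run count, minimum, median and maximum of the timings.

    For an even number of runs the upper of the two middle values is
    reported as the median.
    """
    ordered = sorted(timings)
    return {
        "runs": len(ordered),
        "min_s": ordered[0],
        "median_s": ordered[len(ordered) // 2],
        "max_s": ordered[-1],
    }


def stand_in_workload():
    # Placeholder computation; replace with the operation to be measured.
    return sum(i * i for i in range(100000))


if __name__ == "__main__":
    summary = summarize(time_callable(stand_in_workload))
    for key in ("runs", "min_s", "median_s", "max_s"):
        print("%s: %s" % (key, summary[key]))
```

Reporting the node name, partition and TensorFlow version alongside the numbers makes results easier to compare.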