Fun with liquid-cooled GPU: February 2017

Friday, February 10, 2017

Installing tensorflow

This took two days. The problem was that I started with a plain pip install, which could affect the existing Python 2.7 programs on the GPU box. I eventually switched to a virtualenv install.

Install the required (and some optional) packages:

$ sudo apt-get install openjdk-8-jdk git python-dev python3-dev python-numpy python3-numpy build-essential python-pip python3-pip python-virtualenv swig python-wheel libcurl3-dev

Create a Virtualenv environment in the directory ~/tensorflow:

$ virtualenv --system-site-packages ~/tensorflow

Activate the environment:

$ source ~/tensorflow/bin/activate
(tensorflow)$
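
Besides the changed prompt, activation can be confirmed from the environment. The sketch below relies on the VIRTUAL_ENV variable that the activate script exports; venv_status is a made-up helper name, not part of virtualenv.

```shell
# venv_status is a hypothetical helper name, not part of virtualenv.
# The activate script exports VIRTUAL_ENV; the variable is unset otherwise.
venv_status() {
    if [ -n "${VIRTUAL_ENV:-}" ]; then
        echo "active virtualenv: $VIRTUAL_ENV"
    else
        echo "no virtualenv active"
    fi
}

venv_status
```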

Pick the right tensorflow binary package (Ubuntu/Linux 64-bit, GPU enabled, Python 3.5)

(tensorflow)$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-0.12.1-cp35-cp35m-linux_x86_64.whl

Install tensorflow

(tensorflow)$ pip3 install --upgrade $TF_BINARY_URL

Add commands to ~/.bash_profile

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"
export CUDA_HOME=/usr/local/cuda

Done!
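
A quick sanity check after the install is to see whether Python 3 can import the package at all. This is only a rough sketch: check_py_module is a made-up helper name, and it assumes python3 inside the activated environment is the interpreter the wheel was installed for.

```shell
# check_py_module is a hypothetical helper name.
# It reports whether the given module can be imported by python3.
check_py_module() {
    if python3 -c "import $1" >/dev/null 2>&1; then
        echo "$1: importable"
    else
        echo "$1: NOT importable"
    fi
}

check_py_module tensorflow
```

If the import fails, running the same python3 command without the redirection shows the actual error, which is often a missing CUDA or cuDNN library.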

Updating bash file

The next time I logged in, my GPU box couldn't find nvcc. I panicked: did I need to install CUDA again? I frantically searched the web for an answer and concluded that I had never updated my bash startup file.

$ gedit ~/.bashrc

Add the following lines at the bottom of the bash file:

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"
export CUDA_HOME=/usr/local/cuda

Save and close the text file.
Type the following command to reload the .bashrc file:

$ source ~/.bashrc
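
To confirm that the reload helped, it is enough to ask the shell where it finds nvcc. The helper below is a minimal sketch (check_tool is a made-up name). Note that the shell locates nvcc through PATH, not LD_LIBRARY_PATH, so if it is still reported missing, the export PATH line from the CUDA install post probably needs to go into ~/.bashrc as well.

```shell
# check_tool is a hypothetical helper name.
# It reports where a command is found on PATH, or that it is missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found on PATH"
    fi
}

check_tool nvcc
```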

Installing cuDNN

This was relatively painless.

Go to the NVIDIA website, log in as a developer, and download cuDNN Library v5.1 for Linux (cudnn-8.0-linux-x64-v5.1.tgz).

$ cd /usr/local/cuda
$ tar xvzf cudnn-8.0-linux-x64-v5.1.tgz
$ sudo cp -P cuda/include/cudnn.h /usr/local/cuda/include
$ sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
$ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
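
To check which cuDNN version actually ended up under /usr/local/cuda, the version macros in the copied header can be read back. This is a sketch under two assumptions: cudnn_version is a made-up helper name, and the header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL on separate #define lines.

```shell
# cudnn_version is a hypothetical helper name.
# It reads the CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL macros
# from a cudnn.h file (default: the location used in the steps above).
cudnn_version() {
    header="${1:-/usr/local/cuda/include/cudnn.h}"
    if [ ! -r "$header" ]; then
        echo "cannot read $header" >&2
        return 1
    fi
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { print major "." minor "." patch }
    ' "$header"
}

cudnn_version || echo "cuDNN header not found"
```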

Monday, February 6, 2017

Installing CUDA

It took me about a week, but I finally got it to work. I won't belabor the details, but here is the summary:

1. Download CUDA toolkit 8.0
I used Ubuntu 16.04 LTS version of runfile (local)

2. Compute the MD5 checksum:
$ md5sum cuda_8.0.44_linux.run

3. Remove CUDA toolkit 7.5
$ sudo apt-get purge nvidia-cuda*
$ sudo apt-get purge nvidia-*
(redundant but I did it to make sure)

4. Switch to a text console
(Ctrl+Alt+F2)

5. Stop lightdm
$ sudo service lightdm stop

6. Install CUDA runfile
$ sudo sh cuda_8.0.44_linux.run --override

7. Start lightdm again
$ sudo service lightdm start

8. Modify PATH
$ export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
$ export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
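
The checksum from step 2 is only useful if it is compared against the value published on the download page. A small helper can do the comparison instead of eyeballing 32 hex digits. This is a minimal sketch: verify_md5 is a made-up name, and EXPECTED is a placeholder for the published checksum, which is not reproduced here.

```shell
# verify_md5 is a hypothetical helper name.
# Usage: verify_md5 FILE EXPECTED_MD5
# Prints OK or MISMATCH and returns non-zero on a mismatch.
verify_md5() {
    actual=$(md5sum "$1" | awk '{print $1}')
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (got $actual, expected $2)"
        return 1
    fi
}

# Example call; replace EXPECTED with the checksum from the download page:
# verify_md5 cuda_8.0.44_linux.run EXPECTED
```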

I basically followed the Installation Guide. Easy, right?

Here is where I made my mistake. At this point I realized that the CUDA driver version was still 367, the default.

9. Install the latest CUDA driver for Tesla K80 (= 375 at the time of this writing)
$ sudo apt-get install nvidia-375

Somehow driver 375 doesn't work well on my system. When I run

$ nvidia-smi

I get the following error.

Failed to initialize NVML: Driver/library version mismatch

After repeating steps 1-9 several times, I gave up on installing driver 375, and everything seems to be working well. Now I have

CUDA Toolkit 8.0
CUDA driver 367
and... MATLAB does recognize Tesla K80!
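
For anyone else who hits "Driver/library version mismatch": a frequent cause is that the NVIDIA kernel module still loaded in memory comes from a different driver version than the freshly installed libraries, in which case a reboot may be all that is needed. The sketch below prints the version of the loaded module so it can be compared with the installed package. loaded_driver_version is a made-up helper name, and the format of /proc/driver/nvidia/version (a line containing "Kernel Module" followed by the version number) is an assumption.

```shell
# loaded_driver_version is a hypothetical helper name.
# It extracts the version of the NVIDIA kernel module that is currently loaded.
# Assumes a line such as "NVRM version: ... Kernel Module  367.57  ..." in the file.
loaded_driver_version() {
    src="${1:-/proc/driver/nvidia/version}"
    if [ ! -r "$src" ]; then
        echo "cannot read $src" >&2
        return 1
    fi
    sed -n 's/.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p' "$src"
}

loaded_driver_version || echo "NVIDIA kernel module does not appear to be loaded"
```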

Initial condition

Here is the spec of the liquid-cooled GPU box that I recently purchased to meet the high-performance computing demands in my laboratory:

Motherboard: Xeon E5-2600/1600 v3 C612 Chipset
CPU:         Intel® Xeon® Processor E5-2680 v4 (14-core, 35M Cache, 2.40 GHz)
Memory:      DDR4 ECC Reg SO-DIMM 128GB (= 4x32GB)
Storage:     2.5" SATA 6Gb/s Internal SSD 1TB
GPU:         2 x NVIDIA Tesla K80 24GB Passive Cooling PCI-E 3.0 x16 GPU
Cooling:     2-Phase Liquid Cooling Kit for GPU and CPU by Ebullient
OS:          Ubuntu 16.04 LTS

Here are the problems that I found as soon as it arrived:

1. Can't log in to Ubuntu using Unity
The vendor kindly installed Openbox, which allows logging in to Ubuntu without issues.

2. An older version of the CUDA toolkit was installed
The current version of the CUDA toolkit at the time of this writing is 8.0. However, toolkit 7.5 was installed.

3. An older version of the CUDA driver was installed
The current version of the CUDA driver for Tesla K80 at the time of this writing is 375. However, version 367 was installed.

4. MATLAB doesn't recognize NVIDIA Tesla K80
In MATLAB R2016b with Parallel Computing Toolbox:
>> gpuDeviceCount

ans =

     0

Oh, noooo!

The purpose of this blog is to document the solutions (and the struggles) so that no one else has to waste time solving the same problems I had.