PowerAI release 4 provides software packages for several Deep Learning frameworks, supporting libraries, and tools:
Release 4 also includes a Technology Preview of IBM PowerAI Distributed Deep Learning (DDL). Distributed Deep Learning provides support for distributed (multi-host) model training. DDL is integrated into IBM Caffe. TensorFlow support is provided by a separate package included in the PowerAI distribution.
All the packages are intended for use with Ubuntu 16.04 on POWER with NVIDIA CUDA 8.0 and cuDNN v6.0 packages.
More information about PowerAI is available at https://ibm.biz/powerai. Developer resources can be found at http://ibm.biz/poweraideveloper.
The Deep Learning packages require Ubuntu 16.04 for IBM POWER8. Ubuntu installation images can be downloaded from:
http://www.ubuntu.com/download/server/power8
NOTE: PowerAI Release 4 requires the version 4.4 linux kernel. Ubuntu 16.04 supports two different kernel versions: the base kernel (version 4.4), and the Hardware Enablement kernel (currently version 4.8; see https://wiki.ubuntu.com/Kernel/RollingLTSEnablementStack). Be sure to install the base 4.4 kernel for PowerAI.
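As a quick sanity check, the running kernel series can be classified from `uname -r`. This is a minimal sketch; the `kernel_series` helper name and version patterns are illustrative, not part of PowerAI:

```shell
# Classify a kernel version string: PowerAI requires the 4.4 base
# series, not the 4.8 Hardware Enablement series.
kernel_series() {
  case "$1" in
    4.4.*) echo "base-4.4" ;;
    *)     echo "other" ;;
  esac
}
kernel_series "$(uname -r)"
```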
The Deep Learning packages require CUDA, cuDNN, and GPU driver packages from NVIDIA.
The required and recommended versions of these components are:
| Component | Required | Recommended |
|--------------|----------|-------------|
| CUDA Toolkit | 8.0 | 8.0.61 |
| cuDNN | 6.0 | 6.0.20 |
| GPU Driver | 384.66 | 384.66 |
These components can be installed by:
Install the CUDA 8.0 Toolkit, following NVIDIA's instructions, including the steps to set PATH and LD_LIBRARY_PATH.
Install the cuDNN v6.0 packages:
$ sudo dpkg -i libcudnn6*deb
Upgrade to the 384-series NVIDIA driver and reboot
The CUDA 8.0 installation step above should have installed a 361-series driver. Upgrade to the 384-series driver by:
$ sudo dpkg -i nvidia-driver-local-repo-ubuntu1604-384*.deb
$ sudo apt-get update
$ sudo apt-get upgrade cuda-drivers
$ sudo shutdown -r now
NOTE: Version 361 and 375 GPU drivers are available for download from NVIDIA but are not supported for this PowerAI release.
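After the reboot, the active driver version can be checked. This is a hedged sketch: the `is_384_series` helper is introduced here purely for illustration, and the `nvidia-smi` query assumes the NVIDIA driver is already installed:

```shell
# Report whether the active NVIDIA driver is in the required 384 series.
is_384_series() {
  case "$1" in
    384.*) echo "yes" ;;
    *)     echo "no" ;;
  esac
}
ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n1)
echo "driver: ${ver:-unknown} (384 series: $(is_384_series "$ver"))"
```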
The PowerAI Deep Learning packages are provided via two different installation methods:
The local repository package (mldl-repo-local) creates an installation repository on the local machine. This method is best for systems with limited internet access or where strong control of upgrades is desired.
The network repository package (mldl-repo-network) creates a reference on the local machine to the PowerAI network repository. This method is best for internet-connected systems that will be readily updated to new versions of PowerAI.
These packages are mutually exclusive. Choose one or the other for your systems.
Software repository setup is similar for either method:
Download the desired repository package (.deb file) from https://public.dhe.ibm.com/software/server/POWER/Linux/mldl/ubuntu/
Install the repository package:
$ sudo dpkg -i mldl-repo-*.deb
Update the package cache
$ sudo apt-get update
All the Deep Learning frameworks can be installed at once using the power-mldl meta-package:
$ sudo apt-get install power-mldl
The Deep Learning frameworks can be installed individually if preferred. The framework packages are:
caffe-bvlc - Berkeley Vision and Learning Center (BVLC) upstream Caffe, v1.0.0
caffe-ibm - IBM optimized version of BVLC Caffe, v1.0.0
caffe-nv - NVIDIA fork of Caffe, v0.15.14
chainer - Chainer, v1.23.0
digits - DIGITS, v5.0.0
tensorflow - Google TensorFlow, v1.1.0
ddl-tensorflow - Distributed Deep Learning custom operator for TensorFlow
theano - Theano, v0.9.0
torch - Torch, v7
Each can be installed with:
$ sudo apt-get install <framework>
The caffe-ibm and ddl-tensorflow packages require the PowerAI OpenMPI package, which is built with NVIDIA CUDA support. That OpenMPI package conflicts with Ubuntu's non-CUDA-enabled OpenMPI packages.
Please uninstall any openmpi or libopenmpi packages before installing IBM Caffe or the DDL custom operator for TensorFlow. Purge any configuration files to avoid interference:
$ dpkg -l | grep openmpi
$ sudo apt-get purge ...
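A hedged way to script the check-then-purge sequence above (the `list_openmpi_pkgs` helper and the dry-run echo are illustrative; review the list before actually purging):

```shell
# List installed packages whose names mention openmpi, then show what
# would be purged. Replace the echo with `sudo apt-get purge $pkgs`
# once the list looks right.
list_openmpi_pkgs() {
  awk '$1 == "ii" && $2 ~ /openmpi/ {print $2}'
}
pkgs=$(dpkg -l 2>/dev/null | list_openmpi_pkgs)
if [ -n "$pkgs" ]; then
  echo "would purge: $pkgs"
else
  echo "no openmpi packages installed"
fi
```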
The digits and python-socketio-server packages conflict with Ubuntu's older python-socketio package. Please uninstall the python-socketio package before installing DIGITS.
NOTE: PowerAI Release 4 requires new versions of the NVIDIA GPU driver (384) and cuDNN (6.0). The recommended upgrade process is to uninstall the older version of PowerAI, update the NVIDIA components, then install the new version of PowerAI.
Remove the previous version of PowerAI, including the repo package
$ dpkg -l | egrep 'mldl|3ibm'
$ sudo apt-get purge ...
$ sudo apt-get update
Update the NVIDIA components
A. Remove cuDNN v5
$ dpkg -l | grep cudnn
$ sudo apt-get purge ...
B. Remove the 361 series driver
$ dpkg -l | grep 361
$ sudo apt-get purge ...
$ sudo apt-get update
C. Install cuDNN v6.0 as described above
D. Install the 384-series NVIDIA GPU driver as described above
E. Reboot to activate the new driver
Install the new PowerAI package as described above
Recommended settings for optimal Deep Learning performance on the S822LC for High Performance Computing are:
Enable Performance Governor
$ sudo apt-get install linux-tools-common linux-tools-generic cpufrequtils lsb-release
$ sudo cpupower -c all frequency-set -g performance
Enable GPU persistence mode
Use nvidia-persistenced
(http://docs.nvidia.com/deploy/driver-persistence/index.html) or
$ sudo nvidia-smi -pm ENABLED
Set GPU memory and graphics clocks (P100 GPU only)
$ sudo nvidia-smi -ac 715,1480
For TensorFlow, set the SMT mode
$ sudo apt-get install powerpc-ibm-utils
$ sudo ppc64_cpu --smt=2
Most of the PowerAI packages install outside the normal system search paths (to /opt/DL/...), so each framework package provides a shell script to simplify environment setup (e.g. PATH, LD_LIBRARY_PATH, PYTHONPATH).
We recommend that users update their shell rc file (e.g. .bashrc) to source the desired setup scripts. For example:
source /opt/DL/<framework>/bin/<framework>-activate
Each framework also provides a test script to verify basic function:
$ <framework>-test
The Python easy_install utility may interfere with the proper function of some of the PowerAI framework packages, including TensorFlow and Chainer.
The PowerAI packages include local copies of Python modules such as protobuf (TensorFlow) and pillow (Chainer) because they require versions newer than those provided by Canonical/Ubuntu. The <framework>-activate scripts set up the pathing needed to make that work (they set PYTHONPATH to give the local copies priority over the system default versions).
easy_install adds a script that may cause the system's default paths to be searched ahead of PYTHONPATH entries. This may result in protobuf- or pillow-related failures in TensorFlow and Chainer.
If easy_install was run as root, the problematic script may be found in:
/usr/local/lib/python2.7/dist-packages/easy-install.pth
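The normal PYTHONPATH precedence that easy-install.pth can subvert is easy to demonstrate in isolation. The mymod module below is made up purely for this demo:

```shell
# A module directory placed on PYTHONPATH normally shadows any
# same-named module later on sys.path; easy-install.pth entries can
# jump ahead of it, which is the failure mode described above.
demo=$(mktemp -d)
mkdir -p "$demo/mymod"
echo "VERSION = 'local-copy'" > "$demo/mymod/__init__.py"
PYTHONPATH="$demo" python3 -c "import mymod; print(mymod.VERSION)"
```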
Packages are provided for upstream BVLC Caffe (/opt/DL/caffe-bvlc), IBM's optimized Caffe (/opt/DL/caffe-ibm), and NVIDIA's Caffe (/opt/DL/caffe-nv). The system default Caffe (/opt/DL/caffe) can be selected using Ubuntu's alternatives system:
$ sudo update-alternatives --config caffe
There are 3 choices for the alternative caffe (providing /opt/DL/caffe).
Selection Path Priority Status
------------------------------------------------------------
* 0 /opt/DL/caffe-ibm 100 auto mode
1 /opt/DL/caffe-bvlc 50 manual mode
2 /opt/DL/caffe-ibm 100 manual mode
3 /opt/DL/caffe-nv 75 manual mode
Press <enter> to keep the current choice[*], or type selection number:
Users can activate the system default caffe:
source /opt/DL/caffe/bin/caffe-activate
Or they can activate a specific variant. For example:
source /opt/DL/caffe-bvlc/bin/caffe-activate
Attempting to activate multiple Caffe packages in a single login session will cause unpredictable behavior.
Each Caffe package includes example scripts, sample models, and other content. A script is provided to copy the sample content into a specified directory:
$ caffe-install-samples <somedir>
Visit Caffe's website (http://caffe.berkeleyvision.org/) for tutorials and example programs that you can run to get started.
Here are links to a couple of the example programs:
The IBM Caffe package (caffe-ibm) in PowerAI is based on BVLC Caffe and includes optimizations and enhancements from IBM:
IBM Caffe supports all of BVLC Caffe's options and adds a few new ones to control the enhancements:
-bvlc: Disable CPU/GPU layer-wise reduction
-threshold: Tune CPU/GPU layer-wise reduction. If the number of parameters for one layer is greater than or equal to the threshold, their accumulation on the CPU is done in parallel; otherwise the accumulation is done using one thread. The default is 2,000,000.
-ddl ["-option1 param -option2 param"]: Enable Distributed Deep Learning, with an optional space-delimited parameter string. Supported parameters are:
mode <mode>
dump_iter <N>
dev_sync <0, 1, or 2>
rebind_iter <N>
dbg_level <0, 1, or 2>
-ddl_update: Instructs Caffe to use a new custom version of the ApplyUpdate function that is optimized for DDL. It is faster but does not support gradient clipping, so it is off by default. It can be used in networks that do not need clipping (common).
-ddl_align: Ensures that the gradient buffers have a length that is a multiple of 256 bytes and start addresses that are multiples of 256. This ensures cache-line alignment on multiple platforms, as well as alignment with NCCL slices. Off by default.
-ddl_database_restart: Ensures every learner always looks at the same data set during an epoch. This allows a system to cache only the pages that are touched by the learners contained within it. It can help size the number of learners needed for a given data set size by establishing a known database footprint per system. Off by default.
-lms <size>: Enable Large Model Support with a threshold of <size>. See below.
-lms_frac <fraction>: Tune Large Model Support memory usage between CPU and GPU. See below.
Use the command line options as follows:
| Feature | -bvlc | -ddl | -lms | -gpu |
|---------------------------------|-------|------|-------|---------------|
| CPU/GPU layer-wise reduction | N | X | X | multiple GPUs |
| Distributed Deep Learning (DDL) | X | Y | X | N |
| Large model support | X | X | Y | X |
Y: do specify
N: don't specify
X: doesn't matter
LMS takes effect regardless of the other options as long as -lms is specified. For example, you can use DDL and LMS together.
CPU/GPU layer-wise reduction is enabled only if multiple GPUs are specified and layer_wise_reduce: false is set.
Use of multiple GPUs with DDL is specified via the MPI rank file, so the -gpu flag may not be used to specify multiple GPUs for DDL.
This optimization aims to reduce the running time of a multiple-GPU training by utilizing CPUs. In particular, gradient accumulation is offloaded to CPUs and done in parallel with the training. To gain the best performance with IBM Caffe, please close unnecessary applications that consume a high percentage of CPU.
If using a single GPU, IBM Caffe and BVLC Caffe will have similar performance.
The optimizations in IBM Caffe do not change the convergence of a neural network during training. IBM Caffe and BVLC Caffe should produce the same convergence results.
CPU/GPU layer-wise reduction is enabled unless the -bvlc command-line flag is used.
See /opt/DL/ddl/doc/README.md for more information about using IBM PowerAI Distributed Deep Learning.
You can enable Large Model Support by adding -lms <size in KB>, for example -lms 1000. Any memory chunk larger than 1000 KB will then be kept in CPU memory and fetched to GPU memory only when needed for computation. Passing a very large value like -lms 10000000000 effectively disables the feature, while a small value makes LMS more aggressive. The value controls the performance trade-off.
As a secondary option, there is -lms_frac <0~1.0>. For example, with -lms_frac 0.4, LMS does not kick in until at least 40% of GPU memory is expected to be consumed. This is useful for disabling LMS for small networks.
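To make the fraction concrete, a small arithmetic sketch (the 16 GB GPU memory figure is an assumption for illustration; P100 cards ship in 12 GB and 16 GB variants):

```shell
# With a hypothetical 16 GB (16384 MB) GPU and -lms_frac 0.4, LMS only
# activates once expected GPU memory usage exceeds this threshold (MB):
gpu_mem_mb=16384
frac_pct=40
threshold_mb=$(( gpu_mem_mb * frac_pct / 100 ))
echo "$threshold_mb"
```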
Large Model Support and Distributed Deep Learning can be combined. For example:
$ mpirun -x PATH -x LD_LIBRARY_PATH -rf 4x4x2.rf -n 8 caffe train -solver alexnet_solver.prototxt -ddl "-mode n:4x2" -lms 1000
The PowerAI Chainer package includes some optimizations from IBM:
Workspace auto-tuning finds the fastest algorithm for CNN forward and backward propagation
CPU/GPU layer-wise reduction (gradient overlap) aims to provide an efficient data parallel training over multiple GPUs
A guide with information about these optimizations can be found in /opt/DL/chainer/doc/TRL_Chainer_1.23.0_Guide.pdf.
It is not necessary to pip install the cython, pillow, or numexpr packages when using the PowerAI Chainer package.
The train_imagenet_ibm.py script mentioned in the guide is included as an example in the PowerAI Chainer package:
$ source /opt/DL/chainer/bin/chainer-activate
$ chainer-install-samples $HOME/chainer
Creating directory /home/ubuntu/chainer
Copying examples/ into /home/ubuntu/chainer...
Success
$ ls $HOME/chainer/examples/imagenet/train_imagenet_ibm.py
/home/ubuntu/chainer/examples/imagenet/train_imagenet_ibm.py
The Chainer home page at http://chainer.org/ includes documentation for the Chainer project, including a Quick Start example.
The TensorFlow homepage (https://www.tensorflow.org/) has a variety of information, including Tutorials, How Tos, and a Getting Started guide.
Additional tutorials and examples are available from the community, for example:
This release of PowerAI includes a Technology Preview of the IBM PowerAI Distributed Deep Learning (DDL) custom operator for TensorFlow. The DDL custom operator uses CUDA-aware OpenMPI and NCCL to provide high-speed communications for distributed TensorFlow.
The DDL custom operator can be found in the ddl-tensorflow package. For more information about DDL and about the TensorFlow operator, see:
/opt/DL/ddl/doc/README.md
/opt/DL/ddl-tensorflow/doc/README.md
/opt/DL/ddl-tensorflow/doc/README-API.md
The DDL TensorFlow operator makes it easy to enable Slim-style models for distribution. The package includes examples of Slim models enabled with DDL:
$ source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
$ ddl-tensorflow-install-samples <somedir>
Those examples are based on a specific commit of the TensorFlow models repo with a small adjustment. If you prefer to work from an upstream clone, rather than the packaged examples:
$ git clone https://github.com/tensorflow/models.git
$ cd models
$ git checkout 11883ec6461afe961def44221486053a59f90a1b
$ git revert fc7342bf047ec5fc7a707202adaf108661bd373d
$ cp /opt/DL/ddl-tensorflow/examples/slim/train_image_classifier.py slim/
The PowerAI TensorFlow packages include TensorBoard. See: https://www.tensorflow.org/get_started/summaries_and_tensorboard
The TensorFlow 1.1.0 package includes support for additional features:
The Torch Cheatsheet contains lots of info for people new to Torch, including tutorials and examples.
The Torch project has a demos repository at https://github.com/torch/demos
Tutorials can be found at https://github.com/torch/tutorials
Visit Torch's website for the latest from Torch.
The Torch package includes example scripts and sample models. A script is provided to copy the sample content into a specified directory:
$ torch-install-samples <somedir>
Among these are the Imagenet examples from https://github.com/soumith/imagenet-multiGPU.torch with a few modifications.
The Torch package includes several Lua rocks useful for creating Deep Learning applications. Additional Lua rocks can be installed locally to extend functionality. For example a rock providing NCCL bindings can be installed by:
$ source /opt/DL/torch/bin/torch-activate
$ source /opt/DL/nccl/bin/nccl-activate
$ luarocks install --local --deps-mode=all "https://raw.githubusercontent.com/ngimel/nccl.torch/master/nccl-scm-1.rockspec"
...
nccl scm-1 is now built and installed in /home/user/.luarocks/ (license: BSD)
$ luajit
LuaJIT 2.1.0-beta1 -- Copyright (C) 2005-2015 Mike Pall. http://luajit.org/
JIT: OFF
> require 'torch'
> require 'nccl'
>
torchIO is an IBM Research project designed to provide optimized I/O access to images for Torch deep learning projects. torchIO is used in two steps:
Creating and loading an LMDB database with training/validation data using the provided binary.
Loading the torchio rock in Lua code and pointing it to the LMDB database you wish to load.
torchIO example
PowerAI comes with a sample for torchIO in the /opt/DL/torch/examples/torchIO directory.
This example is for loading batches of the Imagenet data set. You can get the images at http://image-net.org/download.
As with imagenet-multiGPU (https://github.com/soumith/imagenet-multiGPU.torch), we recommend resizing the images so that 256 is the smaller dimension:
$ find . -name "*.JPEG" | xargs -I {} convert {} -resize "256^>" {}
In the examples directory there are 5 main files:
create_lmdb_wrapper.sh - The main script for creating an LMDB database from Imagenet files. The script takes as input the location of the Imagenet dataset and the output location for your LMDB database.
tr_o.txt, va_o.txt - These two files contain the names of the training and validation images, and their labels (in numerical format).
create_lmdb - Compiled C binary, invoked by create_lmdb_wrapper.sh to create and load the target LMDB database with the Imagenet images.
test.lua - Lua script which takes the location of the training or validation LMDB database to load into memory. The number of threads, jobs, batch size, etc. can be changed at the top of test.lua.
Sample run
Here's a set of commands to build the LMDB database and execute the test.lua script. We assume the resized Imagenet images are under /home/ubuntu/imagenet/256/.
Build lmdb database:
$ cd /opt/DL/torch/examples/torchIO/
$ ./create_lmdb_wrapper.sh /home/ubuntu/imagenet/256 /home/ubuntu/imagenet/lmdb
Loading Images 1%
Loading Images 2%
Loading Images 3%
.
.
Loading Images 100%
Run sample test to load a batch of imagenet data:
$ th test.lua -data /home/ubuntu/imagenet/lmdb/train
Batch Size 1024
Number of threads 2
Number of jobs 20
.
.
.
No. of entries detected: 1281167
Starting process 1
Starting process 2
.
.
.
Starting process 19
Starting process 20
Elapsed time 0.15296839475632
This release does not support the CudaHalfTensor data type. Programs using that type may suffer failures or inconsistent results.
Here are some links to help you get started with Theano:
Visit Theano's website for the latest from Theano.
Theano 0.9.0 deprecates support for the old GPU backend (e.g. THEANO_FLAGS=device=gpu) and adds support for the gpuarray backend (e.g. THEANO_FLAGS=device=cuda0). The old GPU backend will likely be removed in a future Theano update.
The first time it is run, digits-activate will create a .digits subdirectory containing the DIGITS jobs directory as well as the digits.log file.
Multiple instances of the DIGITS server can be run at once, including by different users, but users may need to set the network port number to avoid conflicts.
To start the DIGITS server with the default port (5000):
$ digits-devserver
To start the DIGITS server with a specific port:
$ digits-devserver -p <port_num>
NVIDIA's DIGITS site has more information about DIGITS.
The DIGITS Getting Started guide describes how to train a network model to classify the MNIST hand-written digits dataset.
Additional DIGITS examples are available at https://github.com/NVIDIA/DIGITS/tree/master/examples
The PowerAI Torch package is updated to work with DIGITS. Manual installation of individual lua rocks is no longer required.
The DIGITS package supports the use of installed Python plugins to provide additional features on the DIGITS server page. These plugins are included in the PowerAI distribution and can be installed in one of two ways.
Install Plugins under the current user
$ pip install /opt/DL/digits/plugins/data/imageGradients
$ pip install /opt/DL/digits/plugins/view/imageGradients
Install Plugins for all users
$ sudo pip install /opt/DL/digits/plugins/data/imageGradients
$ sudo pip install /opt/DL/digits/plugins/view/imageGradients
Examples using DIGITS plugins can be found in the DIGITS examples folder.
© Copyright IBM Corporation 2017
IBM, the IBM logo, ibm.com, POWER, Power, POWER8, and Power systems are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
The TensorFlow package includes code from the BoringSSL project. The following notices may apply:
This product includes software developed by the OpenSSL Project for
use in the OpenSSL Toolkit. (http://www.openssl.org/)
This product includes cryptographic software written by Eric Young
(eay@cryptsoft.com)
This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.
THE INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS" WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.