git clone https://github.com/malt2/malt2.git --recursive
On some machines, you may need something like the following (MKL is optional):
source [torch-dir]/install/bin/torch-activate
source /opt/intel/mkl/bin/intel64/mklvars.sh intel64
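To confirm the environment was picked up (an optional check; torch-activate puts th on the PATH, and mklvars.sh sets MKLROOT):
which th        # should print the path to Torch's interpreter
echo $MKLROOT   # should be non-empty if MKL was sourced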
If using modules, you can try:
module load icc cuda80 luajit
make
This command builds the distributed shared memory component (dstorm), the shared memory transport hook (orm), and the luarocks for the Torch hooks and distributed optimization.
To build component-wise (not required if using make above):
cd dstorm
./mkit.sh GPU test
You should get SUCCESS as the output. Check the log files to ensure the build completed successfully.
The general format is:
./mkit.sh <type>
where <type> is MPI (liborm + mpi) or GPU (liborm + mpi + gpu). A side effect is the creation of the ../dstorm-env.{mk|cmake} environment files, so the Lua capabilities can match the libdstorm compile options.
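For example, the two build types listed above correspond to:
cd dstorm
./mkit.sh MPI   # liborm + mpi
./mkit.sh GPU   # liborm + mpi + gpu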
cd orm
./mkorm.sh GPU
cd dstorm/src/torch
rm -rf build && VERBOSE=7 luarocks make malt-2-scm-1.rockspec >& mk.log && echo YAY #build and install the malt-2 package
cd dstoptim
rm -rf build && VERBOSE=7 luarocks make dstoptim-scm-1.rockspec >&mk.log && echo YAY # build the dstoptim package
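If both builds printed YAY, the rocks should now show up as installed (an optional sanity check; if not, look for errors in the respective mk.log):
luarocks list | grep -E 'malt-2|dstoptim'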
require "malt2"
mpirun -np 2 `which th` `pwd -P`/test.lua mpi 2>&1 | tee test-mpi.log
mpirun -np 2 `which th` `pwd -P`/test.lua gpu 2>&1 | tee test-GPU-gpu.log
A WITH_GPU compile can also run with the MPI transport:
mpirun -np 2 `which th` `pwd -P`/test.lua mpi 2>&1 | tee test-GPU-mpi.log
The default transport is set to the “highest” one built into libdstorm2: GPU > MPI > SHM.
mpirun -np 2 `which th` `pwd -P`/test.lua 2>&1 | tee test-best.log
MPI only sees the hostname. By default, on every host, MPI jobs enumerate the GPUs and start running the processes. The only way to change this and run on other GPUs in a round-robin fashion is to change this enumeration for every rank using CUDA_VISIBLE_DEVICES. An example script, redirect.sh, is provided in the top-level directory.
To run:
mpirun -np 2 ./redirect.sh `which th` `pwd`/test.lua
This script assigns available GPUs in a round-robin fashion. Since MPI requires visibility of all other GPUs to correctly access shared memory, this script only changes the enumeration order and does not restrict visibility.
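For reference, the following is a minimal sketch of such a round-robin wrapper, not the actual redirect.sh; it assumes Open MPI (which exports OMPI_COMM_WORLD_LOCAL_RANK) and nvidia-smi on the PATH, and it only rotates the enumeration order while keeping every GPU visible:
#!/bin/bash
# Rotate the GPU enumeration by the local MPI rank, then exec the real command.
NGPU=$(nvidia-smi -L | wc -l)              # number of GPUs on this host
RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}      # local rank on this host (Open MPI)
LIST=""
for ((i = 0; i < NGPU; i++)); do
  LIST+="$(( (RANK + i) % NGPU )),"
done
export CUDA_VISIBLE_DEVICES=${LIST%,}      # e.g. rank 1 with 4 GPUs -> 1,2,3,0
exec "$@"                                  # e.g. th .../test.lua gpu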