git clone https://github.com/malt2/malt2.git --recursive
On some machines, you may need something like the following (MKL is optional):
source [torch-dir]/install/bin/torch-activate
source /opt/intel/mkl/bin/intel64/mklvars.sh intel64
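To confirm the environment was picked up (an optional check; torch-activate puts th on the PATH, and mklvars.sh sets MKLROOT):
which th        # should print the path to Torch's interpreter
echo $MKLROOT   # should be non-empty if MKL was sourced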
If using modules, you can try:
module load icc cuda80 luajit
make
This command builds the distributed shared memory component (dstorm), the shared memory transport hook (orm), and the luarocks for the Torch hooks and distributed optimization.
To build component-wise (not required if using make above):
cd dstorm
./mkit.sh GPU test
You should get SUCCESS as the output. Check the log files to ensure the build completed successfully.
The general format is:
./mkit.sh <type>
where <type> is MPI (liborm + mpi) or GPU (liborm + mpi + gpu). A side effect is the creation of the ../dstorm-env.{mk|cmake} environment files, so the Lua capabilities can match the libdstorm compile options.
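For example, the two build types listed above correspond to:
cd dstorm
./mkit.sh MPI   # liborm + mpi
./mkit.sh GPU   # liborm + mpi + gpu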
cd orm
./mkorm.sh GPU
cd dstorm/src/torch
rm -rf build && VERBOSE=7 luarocks make malt-2-scm-1.rockspec >& mk.log && echo YAY #build and install the malt-2 package
cd dstoptim
rm -rf build && VERBOSE=7 luarocks make dstoptim-scm-1.rockspec >&mk.log && echo YAY # build the dstoptim package
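If both builds printed YAY, the rocks should now show up as installed (an optional sanity check; if not, look for errors in the respective mk.log):
luarocks list | grep -E 'malt-2|dstoptim'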
require "malt2"
mpirun -np 2 `which th` `pwd -P`/test.lua mpi 2>&1 | tee test-mpi.log
mpirun -np 2 `which th` `pwd -P`/test.lua gpu 2>&1 | tee test-GPU-gpu.log
A WITH_GPU compile can also run with the MPI transport:
mpirun -np 2 `which th` `pwd -P`/test.lua mpi 2>&1 | tee test-GPU-mpi.log
The default transport is set to the “highest” one built into libdstorm2: GPU > MPI > SHM.
mpirun -np 2 `which th` `pwd -P`/test.lua 2>&1 | tee test-best.log
MPI only sees the hostname. By default, on every host, MPI jobs enumerate the GPUs and start running the processes. The only way to change this and run on other GPUs in a round-robin fashion is to change this enumeration for every rank using CUDA_VISIBLE_DEVICES. An example script, redirect.sh, is provided in the top-level directory.
To run:
mpirun -np 2 ./redirect.sh `which th` `pwd`/test.lua
This script assigns available GPUs in a round-robin fashion. Since MPI requires visibility of all other GPUs to correctly access shared memory, this script only changes the enumeration order and does not restrict visibility.
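For reference, the following is a minimal sketch of such a round-robin wrapper, not the actual redirect.sh; it assumes Open MPI (which exports OMPI_COMM_WORLD_LOCAL_RANK) and nvidia-smi on the PATH, and it only rotates the enumeration order while keeping every GPU visible:
#!/bin/bash
# Rotate the GPU enumeration by the local MPI rank, then exec the real command.
NGPU=$(nvidia-smi -L | wc -l)              # number of GPUs on this host
RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}      # local rank on this host (Open MPI)
LIST=""
for ((i = 0; i < NGPU; i++)); do
  LIST+="$(( (RANK + i) % NGPU )),"
done
export CUDA_VISIBLE_DEVICES=${LIST%,}      # e.g. rank 1 with 4 GPUs -> 1,2,3,0
exec "$@"                                  # e.g. th .../test.lua gpu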