
# UCX

UCX is needed to build Open MPI with InfiniBand support.

Using UCX, in short: after Open MPI is built against UCX, select the UCX PML when launching with `mpirun`:

```shell
mpirun -mca btl self -mca pml ucx ....
```

To control which device and which transports are used, add the following environment variables:

```shell
mpirun -mca btl self -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,shm ....
```

Experiment with different `UCX_TLS` transport lists; see the list of main transports and aliases at the end of this page for more info.
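To compare transport lists side by side, a small sweep helps. This is a minimal sketch, not part of UCX or Open MPI: `ucx_tls_sweep` is a hypothetical helper, and `APP` (default `./a.out`) is a placeholder for your own program. It only prints the commands; drop the `echo` to run each variant.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one mpirun command per UCX_TLS setting.
# Usage: ucx_tls_sweep DEVICE TLS_LIST [TLS_LIST ...]
# APP (hypothetical, default ./a.out) stands in for the MPI program.
ucx_tls_sweep() {
  local dev=$1; shift
  local app=${APP:-./a.out} tls
  for tls in "$@"; do
    # remove "echo" to actually launch each variant
    echo "mpirun -mca btl self -mca pml ucx -x UCX_NET_DEVICES=$dev -x UCX_TLS=$tls $app"
  done
}
```

For example, `ucx_tls_sweep mlx5_0:1 rc,shm ud,shm dc,shm` prints three commands that differ only in `UCX_TLS`.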

???+ tip "See also"

    1. https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX
    2. https://github.com/openucx/ucx/wiki

???+ note

    - Open MPI 4.0.3 supports UCX 1.7 or older.
    - UCX >= 1.12.0 requires rdma-core >= 28.0 or MLNX_OFED >= 5.0 for InfiniBand and RoCE transport support. On old systems this may cause the error `address not mapped`.
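
The rdma-core requirement above can be checked before building. This is a hedged sketch: `version_ge` and `ucx_rdma_core_ok` are hypothetical helpers, they rely on GNU `sort -V`, and they do not check the MLNX_OFED alternative. How you obtain the rdma-core version depends on the system (on RPM-based distributions, `rpm -q --qf '%{VERSION}\n' rdma-core` is one option).

```shell
#!/usr/bin/env bash
# version_ge A B: succeed if version A >= version B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Hypothetical helper encoding the note above: UCX >= 1.12 needs
# rdma-core >= 28 for InfiniBand/RoCE. Older UCX versions always pass.
ucx_rdma_core_ok() {
  local ucx_ver=$1 rdma_ver=$2
  if version_ge "$ucx_ver" 1.12; then
    version_ge "$rdma_ver" 28
  fi
}
```

For example, `ucx_rdma_core_ok 1.15.0 27 || echo "use UCX 1.11 or update rdma-core"`.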

## Compile from Source vs. from pre-configured Release

Compiling from source code requires [some tools](https://thangckt.github.io/cluster/compiling/Libtool/).

### 1. Install from source

???+ note

    - This works now, but it is better avoided because it may cause runtime errors.
    - Requirements: `autoconf`, `libtool`, and `automake`.

```shell
cd /home1/p001cao/local/wSourceCode/tooldev
git clone --branch master https://github.com/openucx/ucx.git  ucx-master
cd ucx-master
module load tooldev/autoconf-2.72c
module load tooldev/automake-1.16.5
module load tooldev/libtool-2.4.7
export ACLOCAL_PATH=/home1/p001cao/app/tooldev/libtool-2.4.7/share/aclocal

./autogen.sh
mkdir build  &&  cd build

module load tooldev/binutils-2.37              # provides the gold linker
module load compiler/gcc-10.3

# prepend, so this GCC takes precedence over the system compiler
export PATH=/home1/p001cao/app/compiler/gcc-10.3/bin:$PATH
export CC=gcc CXX=g++ FC=gfortran
export LDFLAGS="-fuse-ld=gold -lrt"

../configure --enable-mt  \
--prefix=/home1/p001cao/app/tooldev/ucx-master

make -j 16 && make install
```

### 2. Install from a pre-configured UCX release

- No need to run `./autogen.sh` this way.
- Version 1.12.1 fails with the error that `auvx.h` is not found, so 1.12.0 is used here.

```shell
wget https://github.com/openucx/ucx/releases/download/v1.12.0/ucx-1.12.0.tar.gz
tar xvf ucx-1.12.0.tar.gz
cd ucx-1.12.0
mkdir build && cd build
```
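Both release downloads on this page follow the same URL pattern (`v<version>/ucx-<version>.tar.gz`), so fetching a release can be wrapped in a function. A minimal sketch; `ucx_release_url` and `ucx_fetch_release` are hypothetical names, and the pattern is inferred from the two download commands here rather than from UCX documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: GitHub release tarball URL for a UCX version,
# following the pattern of the download commands on this page.
ucx_release_url() {
  local ver=$1
  echo "https://github.com/openucx/ucx/releases/download/v${ver}/ucx-${ver}.tar.gz"
}

# Hypothetical helper: download and unpack a release, then create build/
# for an out-of-tree ../contrib/configure-release run.
ucx_fetch_release() {
  local ver=$1
  wget "$(ucx_release_url "$ver")" &&
    tar xf "ucx-${ver}.tar.gz" &&
    mkdir -p "ucx-${ver}/build"
}
```

Usage: `ucx_fetch_release 1.12.0 && cd ucx-1.12.0/build`.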

## Tachyon

### UCX 1.15 - GCC 11

- GCC 11 emits DWARF 5 debug info by default, which can fail with `Dwarf Error: found dwarf version '5'` (typically when the binutils in use is too old to read DWARF 5). Either avoid GCC 11 or force an older DWARF version in `CFLAGS`, as done below.

```shell
cd /home1/p001cao/0SourceCode/tooldev
# git clone --branch v1.15.x https://github.com/openucx/ucx.git  ucx-1.15.x
cd ucx-1.15.x
git pull origin v1.15.x

module load tooldev/autoconf-2.72c
module load tooldev/automake-1.16.5
module load tooldev/libtool-2.4.7
export ACLOCAL_PATH=/home1/p001cao/app/tooldev/libtool-2.4.7/share/aclocal

./autogen.sh
# tar xvf ucx-1.13.1.tar.gz      # alternative: release tarball (no autogen.sh needed)
rm -rf build && mkdir build && cd build

module load compiler/gcc-11
myGCC=/home1/p001cao/app/compiler/gcc-11
export PATH=$myGCC/bin:$PATH
# set all flags in one export: a second `export CFLAGS=...` would replace the first
export CFLAGS="-gdwarf-4 -gstrict-dwarf -Wno-shadow"
export myPREFIX=/home1/p001cao/app/tooldev/ucx1.15-gcc11

../contrib/configure-release --enable-mt --prefix=${myPREFIX}

make -j 16 && make install
```
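When several workarounds each need their own compiler flags (DWARF version, `-Wno-shadow`), a second `export CFLAGS=...` silently discards the first. A minimal sketch of appending instead; `append_flags` is a hypothetical helper, not part of any build tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append flags to a variable (e.g. CFLAGS) instead of
# overwriting it. Flags already present are skipped.
append_flags() {
  local var=$1; shift
  local cur=${!var-}          # current value of the named variable (bash indirection)
  local f
  for f in "$@"; do
    case " $cur " in
      *" $f "*) ;;            # already present: skip duplicate
      *) cur=${cur:+$cur }$f ;;
    esac
  done
  export "$var=$cur"
}
```

Usage: `append_flags CFLAGS -gdwarf-4 -gstrict-dwarf`, then later `append_flags CFLAGS -Wno-shadow` keeps all three flags.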

#### Test

```shell
module load tooldev/ucx-1.15-gcc
ucx_info -d | grep Transport
```
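The `grep` above lists every transport; to fail fast when an expected one is missing, the output can be checked against a list. A hedged sketch: `require_transports` is a hypothetical helper, and it assumes `ucx_info -d` prints lines of the form `#      Transport: <name>`, so verify that against your UCX version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ucx_info -d` output on stdin and report whether
# each requested transport appears on a "Transport: <name>" line.
# Returns non-zero if any transport is missing.
require_transports() {
  local listing missing=0 t
  listing=$(cat)
  for t in "$@"; do
    if printf '%s\n' "$listing" | grep -Eq "Transport:[[:space:]]*${t}[[:space:]]*$"; then
      echo "found:   $t"
    else
      echo "missing: $t"
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: `ucx_info -d | require_transports rc_mlx5 ud_mlx5 posix`.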

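Optional flags (KNEM, a custom `numactl`) go on the same configure command. Combined, it could look like:

```shell
myKNEM=/home1/p001cao/app/tooldev/knem-1.1.4
myNUMA=/home1/p001cao/app/tooldev/numactl-2.0.13

# a CFLAGS=... argument on the configure line overrides an exported CFLAGS,
# so the DWARF flags are repeated here
../contrib/configure-release --enable-optimizations --enable-mt \
  --with-knem=$myKNEM \
  LDFLAGS="-fuse-ld=gold -lrt -L$myNUMA/lib -Wl,-rpath,$myNUMA/lib" \
  CFLAGS="-gdwarf-4 -gstrict-dwarf -I$myNUMA/include" \
  --prefix=${myPREFIX}
```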

### UCX 1.15 - LLVM

#### From source code

???+ note

    - Consider updating `autoconf`, `libtool`, and `automake`.
    - To fix the error `libuct_ib.la: command not found`, use `contrib/configure-release` instead of `configure`.
    - It does not work with Clang 16 (not used now).
    - It may fail with a DWARF version error; the `-gdwarf-2 -gstrict-dwarf` flags in `CFLAGS` below are set for this.

```shell
cd /home1/p001cao/0SourceCode/tooldev
# git clone --branch v1.15.x https://github.com/openucx/ucx.git  ucx-1.15.x
cd ucx-1.15.x
git pull origin v1.15.x

module load tooldev/autoconf-2.72c
module load tooldev/automake-1.16.5
module load tooldev/libtool-2.4.7
export ACLOCAL_PATH=/home1/p001cao/app/tooldev/libtool-2.4.7/share/aclocal

./autogen.sh
```

#### Building

```shell
rm -rf build && mkdir build  &&  cd build

module load compiler/llvm-17          # clang + lld

myLLVM=/home1/p001cao/app/compiler/llvm-17
export PATH=$myLLVM/bin:$PATH
export CC=clang CXX=clang++
export LDFLAGS="-fuse-ld=lld -lrt"
export CFLAGS="-gdwarf-2 -gstrict-dwarf -Wno-unused-but-set-variable"
RDMA=/home1/p001cao/0SourceCode/tooldev/rdma-core/build    # in-place rdma-core build (see rdma-core below)
myPREFIX=/home1/p001cao/app/tooldev/ucx1.15-clang17

../contrib/configure-release --enable-mt --with-rdmacm=$RDMA --prefix=${myPREFIX}

make -j 16 && make install
```

### UCC

```shell
cd /home1/p001cao/0SourceCode/tooldev
# git clone --branch master https://github.com/openucx/ucc.git  ucc
cd ucc
git pull origin master

module load tooldev/autoconf-2.72c
module load tooldev/automake-1.16.5
module load tooldev/libtool-2.4.7
export ACLOCAL_PATH=/home1/p001cao/app/tooldev/libtool-2.4.7/share/aclocal

./autogen.sh
rm -rf build && mkdir build  &&  cd build

module load compiler/llvm-17          # clang + lld

myLLVM=/home1/p001cao/app/compiler/llvm-17
export PATH=$myLLVM/bin:$PATH
export CC=clang CXX=clang++
export LDFLAGS="-fuse-ld=lld -lrt"
myUCX=/home1/p001cao/app/tooldev/ucx1.15-clang17
myPREFIX=/home1/p001cao/app/tooldev/ucc1.2

../configure --with-ucx=${myUCX} --prefix=${myPREFIX}

make -j 16 && make install
```

### UCX 1.11 - LLVM

- UCX >= 1.12.0 requires rdma-core >= 28.0 or MLNX_OFED >= 5.0 for InfiniBand and RoCE transport support, which may cause the error `address not mapped` on old systems; UCX 1.11 is not affected by this requirement.
- Don't use `lld` with UCX 1.11 (`gold` is used below).

```shell
cd /home1/p001cao/0SourceCode/tooldev
wget https://github.com/openucx/ucx/releases/download/v1.11.2/ucx-1.11.2.tar.gz
tar xvf ucx-1.11.2.tar.gz
cd ucx-1.11.2
rm -rf build && mkdir build && cd build

module load compiler/llvm-17
myLLVM=/home1/p001cao/app/compiler/llvm-17
export PATH=$myLLVM/bin:$PATH
export CC=clang CXX=clang++
export LDFLAGS="-fuse-ld=gold -lrt"
export CFLAGS="-Wno-unused-but-set-variable"
export myPREFIX=/home1/p001cao/app/tooldev/ucx1.11-clang17

../contrib/configure-release --enable-mt --prefix=${myPREFIX}

make -j 16 && make install
```

### Make module file

In the directory `/uhome/p001cao/local/share/lmodfiles/GCC`, create the file `gcc-11.2`:

```tcl
#%Module1.0
# for Tcl script use only (Tcl modulefiles start with the #%Module line)
set     topdir          /home1/p001cao/app/tooldev/ucx-1.15

prepend-path    PATH                    $topdir/bin
prepend-path    INCLUDE                 $topdir/include
prepend-path    LD_LIBRARY_PATH         $topdir/lib
prepend-path    PKG_CONFIG_PATH         $topdir/lib/pkgconfig
```
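Since every UCX build on this page gets its own prefix, the module file can be generated instead of edited by hand. A minimal sketch; `write_ucx_modulefile` is a hypothetical helper that writes the same four `prepend-path` lines as the file above, and the output path in the usage line is only an example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a Tcl modulefile for a UCX install prefix.
# Usage: write_ucx_modulefile PREFIX OUTFILE
write_ucx_modulefile() {
  local prefix=$1 out=$2
  mkdir -p "$(dirname "$out")"
  # \$topdir stays literal so Tcl expands it when the module is loaded
  cat >"$out" <<EOF
#%Module1.0
set     topdir          $prefix

prepend-path    PATH                    \$topdir/bin
prepend-path    INCLUDE                 \$topdir/include
prepend-path    LD_LIBRARY_PATH         \$topdir/lib
prepend-path    PKG_CONFIG_PATH         \$topdir/lib/pkgconfig
EOF
}
```

Usage (example output path): `write_ucx_modulefile /home1/p001cao/app/tooldev/ucx1.15-gcc11 ~/modulefiles/tooldev/ucx-1.15-gcc`.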

## II. UCX optional libs

UCX detects the existing libraries on the build machine and enables or disables support for various features accordingly. If some of the modules UCX was built with are not found at runtime, they are silently disabled.

- Basic shared memory and TCP support: always enabled.
- Optimized shared memory: requires the knem or xpmem drivers. On modern kernels, the CMA (cross-memory attach) mechanism is also used.
- RDMA support: requires rdma-core or the libibverbs library.
- NVIDIA GPU support: requires CUDA drivers.
- AMD GPU support: requires ROCm drivers.
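Because missing modules are disabled silently, a quick look at the compute node's device nodes can explain why a transport is absent. A hedged sketch: `check_ucx_runtime` is a hypothetical helper, and the device paths (`/dev/knem`, `/dev/xpmem`, `/sys/class/infiniband`, `/dev/nvidiactl`, `/dev/kfd`) are the usual Linux locations, assumed here rather than taken from UCX. `ucx_info -d` remains the authoritative check. The optional `ROOT` argument only exists so the function can be tested against a fake tree.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which drivers behind UCX's optional features
# are visible at runtime. Leave ROOT empty on a real node.
check_ucx_runtime() {
  local root=${1-} ib devs
  [ -e "$root/dev/knem" ]     && echo "knem: yes"  || echo "knem: no"
  [ -e "$root/dev/xpmem" ]    && echo "xpmem: yes" || echo "xpmem: no"
  [ -e "$root/dev/nvidiactl" ] && echo "cuda: yes" || echo "cuda: no"
  [ -e "$root/dev/kfd" ]      && echo "rocm: yes"  || echo "rocm: no"
  ib="$root/sys/class/infiniband"
  devs=$(ls "$ib" 2>/dev/null | paste -sd' ' -)   # e.g. "mlx5_0 mlx5_1"
  if [ -n "$devs" ]; then echo "rdma: yes ($devs)"; else echo "rdma: no"; fi
}
```

Usage on a compute node: `check_ucx_runtime`.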

### 1. rdma-core (works)

After `./build.sh`, `build/bin` contains the sample programs and `build/lib` contains the shared libraries. The build is configured to run all the programs in place and cannot be installed, so use rdma-core directly from the build folder.

```shell
cd /home1/p001cao/0SourceCode/tooldev
git clone https://github.com/linux-rdma/rdma-core  rdma-core
cd rdma-core
# tar xvf rdma-core-30.0.tar.gz
# cd rdma-core-30.0

module load tooldev/cmake-3.27
module load tooldev/libnl-3.2
export LDFLAGS="-lrt"

./build.sh
```
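Using the in-place build means putting its `bin/` and `lib/` on the search paths yourself. A minimal sketch; `use_rdma_core_build` is a hypothetical helper. It prepends rather than overwrites, so paths set earlier (compilers, other libraries) are kept.

```shell
#!/usr/bin/env bash
# Hypothetical helper: make an in-place rdma-core build visible to the
# current shell by prepending its bin/ and lib/ directories.
use_rdma_core_build() {
  local dir=$1
  if [ ! -d "$dir/lib" ]; then
    echo "no lib/ under $dir; run ./build.sh first" >&2
    return 1
  fi
  export PATH="$dir/bin${PATH:+:$PATH}"
  export LD_LIBRARY_PATH="$dir/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

Usage: `use_rdma_core_build /home1/p001cao/0SourceCode/tooldev/rdma-core/build`.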
#### libnl

Reference: https://topic.alibabacloud.com/a/how-to-compile-libnl-3225-in-centos-6_1_18_20033603.html

```shell
cd /home1/p001cao/0SourceCode/tooldev
wget --no-check-certificate https://www.infradead.org/~tgr/libnl/files/libnl-3.2.25.tar.gz
tar vxf libnl-3.2.25.tar.gz
cd libnl-3.2.25
export myPREFIX=/home1/p001cao/app/tooldev/libnl-3.2.25

./configure --prefix=${myPREFIX}
make -j 16 && make install
```

## USC1 (eagle)

```shell
module load tooldev/binutils-2.36              # provides the gold linker
module load compiler/gcc-11.2

# prepend, so this GCC takes precedence over the system compiler
export PATH=/uhome/p001cao/app/compiler/gcc-11.2/bin:$PATH
export CC=gcc CXX=g++ FC=gfortran

../configure --enable-mt --prefix=/uhome/p001cao/app/tooldev/ucx-1.11
```

Options:

```shell
# optional additions to the configure command above
myKNEM=/uhome/p001cao/app/tooldev/knem-1.1.4
myNUMA=/uhome/p001cao/app/tooldev/numactl-2.0.13

--with-knem=$myKNEM \
LDFLAGS="-fuse-ld=gold -lrt  -L$myNUMA/lib -Wl,-rpath,$myNUMA/lib" \
CFLAGS="-I$myNUMA/include" \
```

Other options:

```text
--disable-numa
--with-rc --with-ud --with-dc --with-ib-hw-tm --with-dm --with-cm \

## options to consider
--with-verbs(=DIR)      Build OpenFabrics support, adding DIR/include,
                        DIR/lib, and DIR/lib64 to the search path for
                        headers and libraries
--with-rc               Compile with IB Reliable Connection support
--with-ud               Compile with IB Unreliable Datagram support
--with-dc               Compile with IB Dynamic Connection support
--with-mlx5-dv          Compile with mlx5 Direct Verbs support. Direct Verbs
                        (DV) support provides additional acceleration
                        capabilities that are not available in a regular
                        mode.
--with-ib-hw-tm         Compile with IB Tag Matching support
--with-dm               Compile with Device Memory support

--with-cm               Compile with IB Connection Manager support

##-- Consider
myNUMA=/home1/p001cao/app/tooldev/numactl-2.0.13
LDFLAGS="-fuse-ld=gold -lrt  -L$myNUMA/lib -Wl,-rpath,$myNUMA/lib" \
CFLAGS="-I$myNUMA/include" \
##--
export myKNEM=/home1/p001cao/app/tooldev/knem1.1.3
export myOFI=/home1/p001cao/app/tooldev/libfabric-1.10.1
--with-verbs=${myOFI} --with-knem=${myKNEM} \
```

Reference: https://developer.arm.com/tools-and-software/server-and-hpc/help/porting-and-tuning/building-open-mpi-with-openucx/running-openmpi-with-openucx

### Compile with Intel

```shell
module load intel/compiler-xe19u5
export PATH=/home1/p001cao/app/intel/xe19u5/compilers_and_libraries_2019.5.281/linux/bin/intel64:$PATH
export CC=icc CXX=icpc FC=ifort
export LD_LIBRARY_PATH=/home1/p001cao/app/intel/xe19u5/compilers_and_libraries_2019.5.281/linux/compiler/lib/intel64_lin:$LD_LIBRARY_PATH

export LD_LIBRARY_PATH=/home1/p001cao/app/tooldev/glibc-2.18/lib:$LD_LIBRARY_PATH
# append, so the paths set above are kept
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

export myKNEM=/home1/p001cao/app/tooldev/knem1.1.3
export myOFI=/home1/p001cao/app/tooldev/libfabric-1.10.1

../contrib/configure-release --disable-numa --enable-mt LDFLAGS="-fuse-ld=lld -lrt" \
--with-verbs=${myOFI} --with-knem=${myKNEM} \
--prefix=/home1/p001cao/app/tooldev/ucx-1.8-intel
```

### List of main transports and aliases

See https://github.com/openucx/ucx/wiki/UCX-environment-parameters.

| Transport | Description |
|---|---|
| `all` | Use all the available transports. |
| `sm` | All shared memory transports. |
| `shm` | Same as "sm". |
| `ugni` | `ugni_rdma` and `ugni_udt`. |
| `rc` | RC (=reliable connection), and UD (=unreliable datagram) for connection bootstrap. "Accelerated" transports are used if possible. |
| `ud` | UD transport; "accelerated" is used if possible. |
| `dc` | DC: Mellanox scalable offloaded dynamic connection transport |
| `rc_x` | Same as "rc", but using accelerated transports only |
| `rc_v` | Same as "rc", but using Verbs-based transports only |
| `ud_x` | Same as "ud", but using accelerated transports only |
| `ud_v` | Same as "ud", but using Verbs-based transports only |
| `tcp` | TCP over SOCK_STREAM sockets |
| `rdmacm` | Use RDMACM connection management for client-server API |
| `sockcm` | Use sockets-based connection management for client-server API |
| `cuda_copy` | Use `cu*Memcpy` for host<->cuda device self transfers but also to detect cuda memory |
| `gdr_copy` | Use GDRcopy library for host<->cuda device self transfers |
| `cuda_ipc` | Use CUDA-IPC for cuda device<->device transfers over PCIe/NVLINK |
| `rocm_copy` | Use for host<->rocm device transfers |
| `rocm_ipc` | Use IPC for rocm device<->device transfers |
| `self` | Loopback transport to communicate within the same process |
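A typo in `UCX_TLS` is easy to miss, so a quick check against the aliases listed above can save a failed job. A hedged sketch: `check_ucx_tls` is a hypothetical helper that knows only the names on this page. Real UCX also accepts more specific transport names (for example `rc_mlx5` or `posix`), so a warning means "not in this list", not necessarily "invalid". A leading `^` (as in `UCX_TLS=^tcp`, which excludes transports) is stripped before checking.

```shell
#!/usr/bin/env bash
# Hypothetical helper: warn about UCX_TLS entries not among the main
# transports/aliases listed on this page. Returns 1 if any are unknown.
check_ucx_tls() {
  local known=" all sm shm ugni rc ud dc rc_x rc_v ud_x ud_v tcp rdmacm sockcm cuda_copy gdr_copy cuda_ipc rocm_copy rocm_ipc self "
  local list=${1#^}        # a leading ^ negates the whole list
  local unknown=0 t
  local IFS=,              # split the list on commas
  for t in $list; do
    case "$known" in
      *" $t "*) ;;
      *) echo "warning: '$t' is not in this page's transport list" >&2; unknown=1 ;;
    esac
  done
  return "$unknown"
}
```

Usage: `check_ucx_tls "$UCX_TLS"` before calling `mpirun`.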