Shark LUMC Wiki


Shark CentOS Slurm - User guide

Version: November 2020

! Please note: this is a copy of the official wiki. For the most current version, log in to researchlumc and go to: https://git.lumc.nl/shark/shark-centos-slurm-user-guide/-/wikis/home


Contact

For accounts, storage, requests, and anything else related to the cluster, use the Topdesk Self-Service Portal:

Topdesk Self-Service Portal

HPC-Linux team

General email: ITenDI_Infra-Linux@lumc.nl

Name Location Email
John Berbee D-01-133 J.A.M.Berbee@lumc.nl
Tom Brusche D-01-128 T.A.W.Brusche@lumc.nl
Michel Villerius D-01-133 M.P.Villerius@lumc.nl
Pieter van Vliet D-01-133 P.Y.B.van_Vliet@lumc.nl

Cluster overview:

Hardware overview

Hostname IP address CPU Cores Memory GPUs Purpose Machine type
res-hpc-lo01 145.88.76.243 Intel E5-2660 32 128GB 0 Login node Dell PowerEdge M620
res-hpc-lo02 145.88.76.217 Intel Xe 6248 80 128GB 1 Login node + Rem Vis* Dell PowerEdge R740
res-hpc-exe007 145.88.76.220 Intel E5-2697 24 384GB 0 Execution node Dell PowerEdge M620
res-hpc-exe008 145.88.76.224 Intel E5-2697 24 384GB 0 Execution node Dell PowerEdge M620
res-hpc-exe009 145.88.76.222 Intel E5-2697 24 384GB 0 Execution node Dell PowerEdge M620
res-hpc-exe010 145.88.76.221 Intel E5-2690 24 384GB 0 Execution node Dell PowerEdge M630
res-hpc-exe011 145.88.76.223 Intel E5-2697 24 384GB 0 Execution node Dell PowerEdge M620
res-hpc-exe012 145.88.76.233 Intel E5-2690 24 384GB 0 Execution node Dell PowerEdge M630
res-hpc-exe013 145.88.76.247 Intel E5-2670 16 128GB 0 Execution node Dell PowerEdge M620
res-hpc-exe014 145.88.76.242 Intel E5-2697 24 384GB 0 Execution node Dell PowerEdge M620
res-hpc-exe015 145.88.76.235 Intel E5-2660 16 96GB 0 Execution node Dell PowerEdge M620
res-hpc-exe016 145.88.76.236 Intel E5-2660 16 128GB 0 Execution node Dell PowerEdge M620
res-hpc-exe019 145.88.76.239 Intel E5-2660 16 128GB 0 Execution node Dell PowerEdge M620
res-hpc-exe020 145.88.76.229 Intel E5-2697 24 192GB 0 Execution node Dell PowerEdge M620
res-hpc-exe021 145.88.76.228 Intel E5-2660 32 96GB 0 Execution node Dell PowerEdge M620
res-hpc-exe022 145.88.76.227 Intel E5-2697 24 192GB 0 Execution node Dell PowerEdge M620
res-hpc-exe023 145.88.76.225 Intel E5-2697 24 192GB 0 Execution node Dell PowerEdge M620
res-hpc-exe024 145.88.76.232 Intel E5-2690 48 128GB 0 Execution node Dell PowerEdge M630
res-hpc-exe025 145.88.76.213 Intel E5-2670 32 32GB 0 Execution node Dell PowerEdge M620
res-hpc-exe027 145.88.76.209 Intel E5-2690 56 128GB 0 Execution node Dell PowerEdge M630
res-hpc-exe028 145.88.76.212 Intel E5-2670 32 32GB 0 Execution node Dell PowerEdge M620
res-hpc-exe029 145.88.76.210 Intel E5-2690 56 384GB 0 Execution node Dell PowerEdge M630
res-hpc-exe030 145.88.76.215 Intel E5-2697 48 384GB 0 Execution node Dell PowerEdge M620
res-hpc-exe031 145.88.76.214 Intel E5-2697 48 512GB 0 Execution node Dell PowerEdge M620
res-hpc-gpu01 145.88.76.237 Intel 8160 48 512GB 3 GPU node Dell PowerEdge R740
res-hpc-gpu02 145.88.76.234 Intel 8160 48 512GB 3 GPU node Dell PowerEdge R740
res-hpc-lkeb01 145.88.70.197 Intel E5-1620 8 64GB 1 GPU node HP Z440 Workstation
res-hpc-lkeb02 145.88.70.196 Intel E5-1650 12 16GB 1 GPU node HP Z420 Workstation
res-hpc-lkeb03 145.88.76.226 Intel E5-2698 40 256GB 4 GPU node NVIDIA DGX Station
res-hpc-lkeb04 145.88.76.248 Intel E5-1650 12 256GB 4 GPU node Asus X99-E-10G WS
res-hpc-lkeb05 145.88.76.244 Intel Xe 6134 16 256GB 3 GPU node Dell Precision 7920
res-hpc-mem01 145.88.76.230 Intel E7-4890 60 3TB 0 High mem Dell PowerEdge R920
res-hpc-mem02 145.88.76.218 Intel E5-4657L 96 1TB 0 High mem Dell PowerEdge M820
res-hpc-mem03 145.88.76.216 Intel E5-4657L 96 1TB 0 High mem Dell PowerEdge M820
res-hpc-ma01 145.88.76.246 Intel E5-2697 2 4GB 0 Controller node 1 VM: VMware
res-hpc-ma02 145.88.76.249 Intel E5-4650 1 4GB 0 Controller node 2 VM: VMware
res-hpc-db01 145.88.76.245 Intel E5-2697 1 4GB 0 Slurm DB node VM: VMware
res-hpc-ood01 145.88.76.231 Intel E5-4650 2 4GB 0 OpenOnDemand portal VM: VMware
  • Rem Vis = Remote Visualization

GPU overview

Hostname GPU 0 (cores/mem) GPU 1 (cores/mem) GPU 2 (cores/mem) GPU 3 (cores/mem)
res-hpc-lo02 Tesla T4 (2560/16GB)
res-hpc-gpu01 TITAN Xp (3840/12GB) TITAN Xp (3840/12GB) TITAN Xp (3840/12GB)
res-hpc-gpu02 TITAN Xp (3840/12GB) TITAN Xp (3840/12GB) TITAN Xp (3840/12GB)
res-hpc-lkeb01 Tesla K40c (2880/12GB)
res-hpc-lkeb02 TITAN X (Pascal) (3584/12GB)
res-hpc-lkeb03 Tesla V100-DGXS (5120/16GB) Tesla V100-DGXS (5120/16GB) Tesla V100-DGXS (5120/16GB) Tesla V100-DGXS (5120/16GB)
res-hpc-lkeb04 GeForce GTX 1080 Ti (3584/11GB) GeForce GTX 1080 Ti (3584/11GB) GeForce GTX 1080 Ti (3584/11GB) GeForce GTX 1080 Ti (3584/11GB)
res-hpc-lkeb05 Quadro RTX 6000 (4608/24GB) Quadro RTX 6000 (4608/24GB) Quadro RTX 6000 (4608/24GB)

Rules

  • Always log in/connect to the login node res-hpc-lo01 or res-hpc-lo02
  • Always use the workload manager (Slurm) to run/submit jobs, or use it interactively
  • Never run a job outside the workload manager (Slurm), neither on the login node nor on the execution nodes
  • Never run (heavy) calculations on the login node; do this on a compute node

How to get a Shark cluster account

  • To get a Shark account, you first need some basic Linux knowledge.

Without basic Linux knowledge you cannot work on the Shark cluster.

  • A default cluster account will be created, and you will receive an email with your RESEARCHLUMC username and password.

Shark cluster introduction course

  • A Shark introduction course is available, where you will receive basic information on how to start using an HPC cluster.
      1. Schedule 2020
Date Time Room Room size Seats left
2020 Canceled until further notice due to coronavirus (SARS-CoV-2)

Presentations

How to connect to the login node / hpc cluster

From a Linux workstation

You are free to choose your Linux distribution, but we recommend the following distributions:

  • Ubuntu Desktop 18.04, 19.10 or 20.04
  • Fedora 32 Workstation

Other distributions:

  • CentOS 8
  • Debian 10.6.0
  • Red Hat 8 (commercial license required)
  • Arch Linux (rolling distribution)

From the command line (ssh)

If your login user name on your workstation is the same as your username on the HPC cluster, you can use:

  • ssh res-hpc-lo01
  • ssh res-hpc-lo02

Otherwise:

  • ssh username@res-hpc-lo01
  • ssh username@res-hpc-lo02

You can make your life easier by editing the file ~/.ssh/config:

vi ~/.ssh/config

Host res-hpc-lo01
    Hostname 145.88.76.243
    User user-name
    ServerAliveInterval 60
Host res-hpc-lo02
    Hostname 145.88.76.217
    User user-name
    ServerAliveInterval 60

Replace user-name with your own username.
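
With this configuration in place, the following two commands are equivalent:

ssh res-hpc-lo01
ssh user-name@145.88.76.243

The ServerAliveInterval 60 line makes your SSH client send a keepalive message every 60 seconds, which prevents idle sessions from being dropped.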

X11 forwarding

You can show graphical output by enabling X11 forwarding:

  • ssh -X res-hpc-lo01
  • ssh -X res-hpc-lo02

or, with trusted X11 forwarding:

  • ssh -Y res-hpc-lo01
  • ssh -Y res-hpc-lo02

Once you are logged in, you should be able to run a graphical program, for example:

  • xterm
  • xclock
  • xeyes

[Screenshots: xterm and other X11 example programs]

A remote desktop

Install the X2Go client:

  • CentOS/Fedora/Red Hat: yum install x2goclient
  • Ubuntu/Debian: apt-get install x2goclient
  • Arch Linux: pacman -S x2goclient

X2Go gives you access to a graphical desktop of a computer over the network. The protocol is tunneled through the Secure Shell protocol, so it is encrypted.

Start the x2goclient.

Go to Session and create a new session.

For the Host:

  • res-hpc-lo01
  • res-hpc-lo02

For the Session type:

  • XFCE
  • ICEWM
  • MATE [only for res-hpc-lo02]

After you have created the new session, start it.

You can ignore the error dialog that may appear when the session starts.

[The original wiki shows screenshots of the XFCE, ICEWM and MATE desktops and how to log out from each.]

SSH proxy server

If you are working from home or from outside the LUMC (network), you can use the SSH proxy server to connect to the cluster.

  • IP address: 145.88.35.10
  • Hostname: res-ssh-alg01.researchlumc.nl
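
If you use plain command-line SSH from home, the same proxy can serve as a jump host. A minimal sketch for ~/.ssh/config, extending the example given earlier (this assumes OpenSSH 7.3 or newer for ProxyJump, and the same username on the proxy and the login node):

Host res-ssh-alg01
    Hostname res-ssh-alg01.researchlumc.nl
    User user-name
Host res-hpc-lo01
    Hostname 145.88.76.243
    User user-name
    ProxyJump res-ssh-alg01

On older OpenSSH versions, a one-off equivalent is: ssh -J user-name@res-ssh-alg01.researchlumc.nl user-name@145.88.76.243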

With the X2Go client, you have to enable the following options:

  • Use Proxy server for SSH connection
  • Type: SSH
  • Same login as on X2Go Server
  • Same password as on X2Go Server
  • Host: 145.88.35.10
  • Port: 22


From a Windows workstation

From the command line (ssh)

A simple SSH terminal client is PuTTY. Download the client from the PuTTY homepage.

Direct client download: Windows 64-bit MSI installer

Once you have started the PuTTY program, you will see:

[Screenshot: putty-01]

At the Host Name (or IP address), fill in: res-hpc-lo01 or res-hpc-lo02

[Screenshot: putty-02]

At the Connection settings, fill in 60 for Seconds between keepalives. If needed, tick Enable TCP keepalives.

[Screenshot: putty-03]

At the X11 settings, you can tick Enable X11 forwarding. You want this if you need graphical output, but to get it working you need to install a separate X11 server for Windows. It is better and easier to install MobaXterm, which is explained below.

Now press Open to connect to the login node.

[Screenshot: putty-04]

If you connect to the login node for the first time, you will see a warning. Press Yes to continue.

[Screenshot: putty-05]

Log in with your user name and password.

[Screenshot: putty-06]

Now you are logged in and you can start using the cluster.

[Screenshot: putty-07]

If you click the PuTTY icon in the top-left corner of the terminal window, you get multiple options, such as:

  • New Session
  • Duplicate Session
  • Change Settings
  • Copy All to Clipboard
  • Clear Scrollback
  • Reset Terminal
[Screenshot: putty-08]

You can give your session a name so that you can save it and reuse it later.

Give your session a name, for example "res-hpc-lo01", and press Save.

[Screenshot: putty-09]

Later you can load your saved session: select it and press Load.

MobaXterm

With MobaXterm, you have a built-in SSH shell/terminal and an embedded X11 server, so you don't have to worry about setting up a separate X11 server to show graphical output on your Windows workstation.

Go to the MobaXterm website: MobaXterm

Here is the direct download link: Download

Choose the MobaXterm Home Edition v20.3 and install it on your Windows workstation. Once you have installed it, start it and create an SSH session:

[Screenshot: mobaXterm-01]

Choose SSH as the session type and press OK.

[Screenshot: mobaXterm-02]

Fill in the Remote host and Specify username

[Screenshot: mobaXterm-03]

All the default settings should be OK, but here you can check your Advanced SSH settings.

[Screenshot: mobaXterm-04]

The Terminal settings

[Screenshot: mobaXterm-05]

Network settings.

Press OK to start the connection.

[Screenshot: mobaXterm-06]

In the pane on the left you can save multiple sessions.

[Screenshot: mobaXterm-07]

You can also add, for example, an "SFTP" session, so that you can easily transfer files to and from your workstation.

[Screenshot: mobaXterm-08]

By right-clicking on one of your sessions you can, for example, edit its settings.

[Screenshot: mobaXterm-09]

Here you can also configure the SSH jump host, if needed.

A remote desktop

For Windows, you can use the same remote desktop client as on Linux: X2Go.

Install the X2Go client.

The setup is already described for the Linux client.

Module environment

When working on the cluster, you will probably need to load the correct module, which sets up the environment for your library, compiler or program.

Below are some useful commands and examples:

  • List all available modules
module av

----------------------------------------------------------------------------------------------- /share/modulefiles ------------------------------------------------------------------------------------------------
   container/singularity/3.5.3/gcc.8.3.1             library/cudnn/9.2/cudnn                                     neuroImaging/fsl/5.0.11
   cryogenicEM/chimera/1.14/gcc-8.3.1                library/fltk/1.3.5/gcc-8.3.1                                neuroImaging/fsl/5.0.9
   cryogenicEM/ctffind/4.1.13/gcc-8.3.1              library/ftgl/2.1.3/gcc-8.3.1                                neuroImaging/fsl/6.0.0
   cryogenicEM/eman2/2.31                            library/gdal/2.4.4/gcc-8.3.1                                neuroImaging/fsl/6.0.1
   cryogenicEM/gctf/1.06                             library/gdal/3.0.4/gcc-8.3.1                                neuroImaging/fsl/6.0.2
   cryogenicEM/imod/4.9.12                           library/htslib/1.10.2/gcc-8.3.1                             neuroImaging/fsl/6.0.3
   cryogenicEM/motioncor2/1.31                       library/java/OpenJDK-11.0.2                                 neuroImaging/fsl/fix/1.06.12
   cryogenicEM/relion/3.0.8/gcc-8.3.1                library/java/OpenJDK-12.0.2                                 neuroImaging/mrtrix/3.0.0/gcc-8.3.1
   cryogenicEM/relion/3.1-beta/gcc-8.3.1             library/java/OpenJDK-13.0.2                                 pharmaceutical/PsN/4.9.0
   cryogenicEM/resmap/1.1.4                          library/java/OpenJDK-14.0.1                          (D)    pharmaceutical/nonmem/7.4.4/gcc-8.3.1
   genomics/hmmer/openmpi-3.1.5/3.3/gcc-8.3.1        library/lapack/3.9.0/gcc-8.3.1                              pharmaceutical/pirana/2.9.7
   genomics/ngs/bcftools/1.10.2/gcc-8.3.1            library/mpi/mpich/3.3.2/gcc-8.3.1                           pharmaceutical/pirana/2.9.8            (D)
   genomics/ngs/bcl2fastq/2.20.0                     library/mpi/openmpi/3.1.5/gcc-8.3.1                         statistical/MATLAB/R2016b
   genomics/ngs/bedtools2/2.29.1/gcc-8.3.1           library/mpi/openmpi/4.0.2/gcc-8.3.1                         statistical/MATLAB/R2018b
   genomics/ngs/bwa/0.7.17/gcc-8.3.1                 library/mpi/openmpi/4.0.3/gcc-8.3.1                  (L)    statistical/MATLAB/R2019b
   genomics/ngs/samtools/1.10/gcc-8.3.1              library/pmi/openpmix/2.2.3/gcc-8.3.1                        statistical/MATLAB/v93/MCR2017b
   genomics/ngs/shapeit4/4.1.3/gcc-8.3.1             library/pmi/openpmix/3.1.4/gcc-8.3.1                        statistical/MATLAB/v97/MCR2019b
   genomics/ngs/vcftools/0.1.16/gcc-8.3.1            library/sparsehash/2.0.3/gcc-8.3.1                          statistical/R/3.4.4/gcc.8.3.1
   graphics/gnuplot/5.2.8/gcc-8.3.1                  library/wxwidgets/3.1.3/gcc-8.3.1                           statistical/R/3.5.3/gcc.8.3.1
   graphics/graphicsmagick/1.3.35/gcc-8.3.1          mathematical/octave/5.2.0/gcc-8.3.1                         statistical/R/3.6.2/gcc.8.3.1
   gwas/depict/1.rel194                              mathematical/octave/libs/SuiteSparse/5.7.2/gcc-8.3.1        statistical/R/4.0.2/gcc.8.3.1
   gwas/plink/1.07                                   mathematical/octave/libs/arpack/3.7.0/gcc-8.3.1             statistical/RStudio/1.2.5033/gcc-8.3.1
   gwas/plink/1.90b6.17                              mathematical/octave/libs/gl2ps/1.4.2/gcc-8.3.1              system/go/1.13.7
   gwas/plink/1.90p                                  mathematical/octave/libs/glpk/4.65/gcc-8.3.1                system/hwloc/1.11.13/gcc-8.3.1
   gwas/plink/2.00a3LM                        (D)    mathematical/octave/libs/qhull/8.0.0/gcc-8.3.1              system/hwloc/2.1.0/gcc-8.3.1
   library/blas/0.3.10/gcc-8.3.1                     mathematical/octave/libs/qrupdate/1.1.2/gcc-8.3.1           system/knem/1.1.3/gcc-8.3.1
   library/boost/1.72.0/gcc-8.3.1                    mathematical/octave/libs/sundials/5.3.0/gcc-8.3.1           system/python/2.7.17
   library/cuda/10.0/gcc.8.3.1                       medicalImaging/minc-stuffs/0.1.25/gcc-8.3.1                 system/python/3.7.6
   library/cuda/10.1/gcc.8.3.1                       medicalImaging/minc-toolkit-v2/1.9.17/gcc-8.3.1             system/python/3.8.1                    (D)
   library/cuda/10.2/gcc.8.3.1                       medicalImaging/minc2-simple/2.2.30/gcc-8.3.1                system/qt/5.14.2/gcc-8.3.1
   library/cuda/7.5/gcc.8.3.1                        medicalImaging/pydpiper/2.0.9                               system/swi-prolog/8.2.0
   library/cuda/8.0/gcc.8.3.1                        medicalImaging/pydpiper/2.0.14                       (D)    tools/biomake/0.1.5
   library/cuda/9.0/gcc.8.3.1                        medicalImaging/pyminc/0.52                                  tools/cmake/3.11.4
   library/cuda/9.1/gcc.8.3.1                        neuroImaging/Elastix/5.0.0/gcc-7.4.0                        tools/gitlab-runner/12.8.0
   library/cuda/9.2/gcc.8.3.1                        neuroImaging/FSLeyes/0.32.3                                 tools/jupyterlab/4.3.1
   library/cudnn/10.0/cudnn                          neuroImaging/SimpleElastix/0.10.0/python3.6.8               tools/luarocks/3.3.1/gcc-8.3.1
   library/cudnn/10.1/cudnn                          neuroImaging/freesurfer/stable-pub-v6.0.0.patched           tools/miniconda/python2.7/4.7.12
   library/cudnn/10.2/cudnn                          neuroImaging/freesurfer/7.1.0                        (D)    tools/miniconda/python3.7/4.7.12
   library/cudnn/9.0/cudnn                           neuroImaging/fsl/5.0.10                                     tools/websockify/0.9.0

-------------------------------------------------------------------------------------- /usr/share/lmod/lmod/modulefiles/Core --------------------------------------------------------------------------------------
   lmod    settarg

  Where:
   D:  Default Module

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
  • load a module
module load library/mpi/openmpi/4.0.2/gcc-8.3.1
  • show loaded modules
module li

Currently Loaded Modules:
  1) library/mpi/openmpi/4.0.2/gcc-8.3.1
  • delete one module
module del library/mpi/openmpi/4.0.2/gcc-8.3.1
  • purge all modules
module purge
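
A typical workflow, combining the commands above (a sketch; the module name is taken from the module av listing, and we assume the module puts a python executable on your PATH):

module purge
module load system/python/3.8.1
python --version    # should now report the Python from the module
module purge        # clean up when done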

Compiling programs

We are going to compile a very simple MPI (Message Passing Interface) program; such programs are quite common on a cluster. Create the file hello.c:

vi hello.c

#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
  int id, np, i;
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int processor_name_len;

  MPI_Init (&argc, &argv);

  MPI_Comm_size (MPI_COMM_WORLD, &np);
  MPI_Comm_rank (MPI_COMM_WORLD, &id);
  MPI_Get_processor_name (processor_name, &processor_name_len);

  for (i=1; i<2; i++)
    printf ("Hello world from process %03d out of %03d, processor name %s \n", id, np, processor_name);

  MPI_Finalize ();
  return 0;
}

If you try to compile this program without the correct module(s) loaded, you will see something like this:

$ module li
No modules loaded
$ gcc hello.c -o hello
hello.c:2:10: fatal error: mpi.h: No such file or directory
 #include <mpi.h>
          ^~~~~~~
compilation terminated.

So we need to load the correct module and use the correct compiler:

$ module add library/mpi/openmpi/4.0.2/gcc-8.3.1
$ mpicc hello.c -o hello

Handy reference:

Language   C      C++    Fortran77  Fortran90  Fortran95
Command    mpicc  mpiCC  mpif77     mpif90     mpif95

$ ./hello
Hello world from process 000 out of 001, processor name res-hpc-lo01.researchlumc.nl

Here you can see that we ran the program on only 1 CPU core (the same as running mpirun -np 1 ./hello, where -np is the number of processes to launch).

To make use of the MPI capabilities of the program, we have to run it with mpirun, which comes with the loaded module library/mpi/openmpi/4.0.2/gcc-8.3.1:

$ mpirun ./hello
Hello world from process 003 out of 016, processor name res-hpc-lo01.researchlumc.nl 
Hello world from process 006 out of 016, processor name res-hpc-lo01.researchlumc.nl 
Hello world from process 013 out of 016, processor name res-hpc-lo01.researchlumc.nl 
Hello world from process 015 out of 016, processor name res-hpc-lo01.researchlumc.nl 
Hello world from process 000 out of 016, processor name res-hpc-lo01.researchlumc.nl 
Hello world from process 005 out of 016, processor name res-hpc-lo01.researchlumc.nl 
Hello world from process 010 out of 016, processor name res-hpc-lo01.researchlumc.nl 
Hello world from process 011 out of 016, processor name res-hpc-lo01.researchlumc.nl 
Hello world from process 012 out of 016, processor name res-hpc-lo01.researchlumc.nl 
Hello world from process 002 out of 016, processor name res-hpc-lo01.researchlumc.nl 
Hello world from process 004 out of 016, processor name res-hpc-lo01.researchlumc.nl 
Hello world from process 007 out of 016, processor name res-hpc-lo01.researchlumc.nl 
Hello world from process 001 out of 016, processor name res-hpc-lo01.researchlumc.nl 
Hello world from process 008 out of 016, processor name res-hpc-lo01.researchlumc.nl 
Hello world from process 009 out of 016, processor name res-hpc-lo01.researchlumc.nl 
Hello world from process 014 out of 016, processor name res-hpc-lo01.researchlumc.nl

Now the program is using all the cores of the local machine (the same as running mpirun -np 16 ./hello).

Workload manager: Slurm

The workload manager Slurm is installed on the cluster. It is a resource manager and a scheduler in one:

  • resource manager: are there resources free (memory, CPUs, GPUs, etc.) on the nodes for the job?
  • scheduler: when to run the job?

Slurm commands

User commands

Command Info
salloc Obtain a Slurm job allocation (a set of nodes), execute a command, and then release the allocation when the command is finished
sbatch Submit a batch script to Slurm
scancel Used to signal jobs or job steps that are under the control of Slurm
scontrol Used to view and modify Slurm configuration and state
sinfo View information about Slurm nodes and partitions
squeue View information about jobs located in the Slurm scheduling queue
srun Run parallel jobs

Accounting info

Command Info
sacct Displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database
sstat Display various status information of a running job/step
sacctmgr Used to view and modify Slurm account information
sreport Generate reports from the slurm accounting data

Scheduling info

Command Info
sprio View the factors that comprise a job’s scheduling priority
sshare Tool for listing the shares of associations to a cluster

Partitions (info) (called queues in SGE)

Available partitions

Partition name Nodes Default MemPerCpu (MB) DefaultTime Remark
all res-hpc-exe[007-031] Yes 2048 1:00:00 Default partition
gpu res-hpc-gpu[01-02] No 2048 1:00:00 Only for GPU/CUDA calculations
highmem res-hpc-mem[01-03] No 2048 1:00:00 For memory intensive applications
LKEBgpu res-hpc-lkeb[01-05] No 2048 - Only for GPU/CUDA calculations
short res-hpc-gpu[01-02] No 2048 - max 60 cores, 1 hour walltime, for non GPU calculations

A job must always be submitted to a partition (queue). By default, jobs are submitted to the "all" partition unless you specify a different one.
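
For example, to submit a job to the highmem partition instead of the default (myjob.slurm is a placeholder for your own batch script):

sbatch --partition=highmem myjob.slurm

or, inside the job script itself:

#SBATCH --partition=highmem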

The following commands are useful:

  • sinfo
  • sinfo -a
  • sinfo -l
  • sinfo -N -l

[user@res-hpc-lo01 ~]$ sinfo 
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
all*         up   infinite      2    mix res-hpc-exe[014,019] 
all*         up   infinite      1  alloc res-hpc-exe018 
all*         up   infinite      4   idle res-hpc-exe[013,015-017] 
gpu          up   infinite      2    mix res-hpc-gpu[01-02] 
highmem      up   infinite      1    mix res-hpc-mem01 

[user@res-hpc-lo01 ~]$ sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
all*         up   infinite      2    mix res-hpc-exe[014,019] 
all*         up   infinite      1  alloc res-hpc-exe018 
all*         up   infinite      4   idle res-hpc-exe[013,015-017] 
gpu          up   infinite      2    mix res-hpc-gpu[01-02] 
highmem      up   infinite      1    mix res-hpc-mem01 
LKEBgpu      up   infinite      1  comp* res-hpc-lkeb02 
LKEBgpu      up   infinite      1  down* res-hpc-lkeb03 
LKEBgpu      up   infinite      2    mix res-hpc-lkeb[04-05] 
LKEBgpu      up   infinite      1   idle res-hpc-lkeb01 

[user@res-hpc-lo01 ~]$ sinfo -l
Mon Mar 23 09:21:27 2020
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST 
all*         up   infinite 1-infinite   no       NO        all      2       mixed res-hpc-exe[014,019] 
all*         up   infinite 1-infinite   no       NO        all      1   allocated res-hpc-exe018 
all*         up   infinite 1-infinite   no       NO        all      4        idle res-hpc-exe[013,015-017] 
gpu          up   infinite 1-infinite   no       NO        all      2       mixed res-hpc-gpu[01-02] 
highmem      up   infinite 1-infinite   no       NO        all      1       mixed res-hpc-mem01 
[user@res-hpc-lo01 ~]$ sinfo -l -N -a
Mon Mar 23 09:34:14 2020
NODELIST        NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON               
res-hpc-exe013      1      all*       mixed 16      2:8:1 128800        0      1   (null) none                 
res-hpc-exe014      1      all*       mixed 24     2:12:1 386800        0      1   (null) none                 
res-hpc-exe015      1      all*        idle 16      2:8:1  96000        0      1   (null) none                 
res-hpc-exe016      1      all*        idle 16      2:8:1 128800        0      1   (null) none                 
res-hpc-exe017      1      all*        idle 16      2:8:1  64000        0      1   (null) none                 
res-hpc-exe018      1      all*   allocated 16      2:8:1  64000        0      1   (null) none                 
res-hpc-exe019      1      all*       mixed 16      2:8:1 128800        0      1   (null) none                 
res-hpc-gpu01       1       gpu       mixed 48     2:24:1 515000        0      1   (null) none                 
res-hpc-gpu02       1       gpu       mixed 48     2:24:1 515000        0      1   (null) none                 
res-hpc-lkeb01      1   LKEBgpu        idle 8       1:4:2  63000        0      1   (null) none                 
res-hpc-lkeb02      1   LKEBgpu completing* 12      1:6:2  15000        0      1   (null) none                 
res-hpc-lkeb03      1   LKEBgpu       down* 40     1:20:2 250000        0      1   (null) Not responding       
res-hpc-lkeb04      1   LKEBgpu       mixed 12      1:6:2 257000        0      1   (null) none                 
res-hpc-lkeb05      1   LKEBgpu       mixed 16      2:8:1 256000        0      1   (null) none                 
res-hpc-mem01       1   highmem       mixed 60     4:15:1 300000        0      1   (null) none
  • idle: the node has no jobs running on it
  • alloc(ated): the whole node is allocated to one or more jobs
  • mix(ed): one or more jobs are running on the node, but there are still free cores

Jobs info

With the following commands, you can get information about your running jobs and jobs from other users:

  • squeue
  • squeue -a
  • squeue -l

[user@res-hpc-lo01 mpi-benchmarks]$ squeue 
             JOBID  PARTITION      USER  ST        TIME   NODES NODELIST(REASON) 
               258        all      user   R        0:03       2 res-hpc-exe[013-014] 

[user@res-hpc-lo01 mpi-benchmarks]$ squeue -a
             JOBID  PARTITION      USER  ST        TIME   NODES NODELIST(REASON) 
               258        all      user   R        0:06       2 res-hpc-exe[013-014] 

[user@res-hpc-lo01 mpi-benchmarks]$ squeue -l
Thu Jan 23 09:14:22 2020
             JOBID  PARTITION      USER     STATE        TIME TIME_LIMIT   NODES NODELIST(REASON) 
               258        all      user   RUNNING        0:12      30:00       2 res-hpc-exe[013-014] 

Jobs typically pass through several states in the course of their execution.
The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED. An explanation of some states follows:

State State (full) Explanation
CA CANCELLED Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
CG COMPLETING Job is in the process of completing. Some processes on some nodes may still be active.
F FAILED Job terminated with non-zero exit code or other failure condition.
PD PENDING Job is awaiting resource allocation.
R RUNNING Job currently has an allocation.
S SUSPENDED Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.

scontrol

With the Slurm command scontrol you can get a more detailed overview of your running jobs, node hardware and partitions:

[user@res-hpc-lo01 ~]$ scontrol show job 260
JobId=260 JobName=IMB
   UserId=user(225812) GroupId=Domain Users(513) MCS_label=N/A
   Priority=35603 Nice=0 Account=dnst-ict QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:13 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2020-01-23T10:27:45 EligibleTime=2020-01-23T10:27:45
   AccrueTime=2020-01-23T10:27:45
   StartTime=2020-01-23T10:27:45 EndTime=2020-01-23T10:57:45 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-01-23T10:27:45
   Partition=all AllocNode:Sid=res-hpc-ma01:46428
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=res-hpc-exe[013-014]
   BatchHost=res-hpc-exe013
   NumNodes=2 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,mem=64G,node=2,billing=32
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/user/Software/imb/mpi-benchmarks/imb.slurm
   WorkDir=/home/user/Software/imb/mpi-benchmarks
   StdErr=/home/user/Software/imb/mpi-benchmarks/job.%J.err
   StdIn=/dev/null
   StdOut=/home/user/Software/imb/mpi-benchmarks/job.%J.out
   Power=
   MailUser=user@gmail.com MailType=BEGIN,END,FAIL

[user@res-hpc-lo01 ~]$ scontrol show node res-hpc-exe014
NodeName=res-hpc-exe014 Arch=x86_64 CoresPerSocket=12 
   CPUAlloc=16 CPUTot=24 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=res-hpc-exe014 NodeHostName=res-hpc-exe014 Version=20.02.0-0pre1
   OS=Linux 4.18.0-80.11.2.el8_0.x86_64 #1 SMP Tue Sep 24 11:32:19 UTC 2019 
   RealMemory=386800 AllocMem=32768 FreeMem=380208 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=all 
   BootTime=2019-12-11T11:51:40 SlurmdStartTime=2020-01-14T15:36:20
   CfgTRES=cpu=24,mem=386800M,billing=24
   AllocTRES=cpu=16,mem=32G
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[user@res-hpc-lo01 ~]$ scontrol show partition all
PartitionName=all
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=res-hpc-exe[013-014]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=40 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=2048 MaxMemPerNode=UNLIMITED

Interactive jobs

  • salloc

You can open an interactive session with the salloc command:

[user@res-hpc-lo01 ~]$ salloc -N1
salloc: Granted job allocation 267
salloc: Waiting for resource configuration
salloc: Nodes res-hpc-exe013 are ready for job

[user@res-hpc-exe013 ~]$ squeue 
             JOBID  PARTITION      USER  ST        TIME   NODES NODELIST(REASON) 
               267        all      user   R        0:04       1 res-hpc-exe013 
[user@res-hpc-exe013 ~]$ exit
exit
salloc: Relinquishing job allocation 267

[user@res-hpc-lo01 ~]$ 

In the example above, we didn't run a command, so we ended up in a bash shell on the allocated node. With exit we leave the shell and release the node.

[user@res-hpc-lo01 ~]$ salloc -N1 mpirun ./hello1
salloc: Granted job allocation 268
salloc: Waiting for resource configuration
salloc: Nodes res-hpc-exe013 are ready for job
Hello world from process 000 out of 001, processor name res-hpc-exe013.researchlumc.nl 
salloc: Relinquishing job allocation 268
salloc: Job allocation 268 has been revoked.

Here we allocated 1 node with one core and ran the OpenMPI-compiled "hello1" program.

Now the same with 2 nodes, 16 cores on each machine:

[user@res-hpc-lo01 ~]$ salloc -N2 --ntasks-per-node=16 mpirun ./hello1
salloc: Granted job allocation 270
salloc: Waiting for resource configuration
salloc: Nodes res-hpc-exe[013-014] are ready for job
Hello world from process 003 out of 032, processor name res-hpc-exe013.researchlumc.nl 
Hello world from process 021 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 004 out of 032, processor name res-hpc-exe013.researchlumc.nl 
Hello world from process 005 out of 032, processor name res-hpc-exe013.researchlumc.nl 
Hello world from process 027 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 000 out of 032, processor name res-hpc-exe013.researchlumc.nl 
Hello world from process 029 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 006 out of 032, processor name res-hpc-exe013.researchlumc.nl 
Hello world from process 031 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 007 out of 032, processor name res-hpc-exe013.researchlumc.nl 
Hello world from process 016 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 010 out of 032, processor name res-hpc-exe013.researchlumc.nl 
Hello world from process 019 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 011 out of 032, processor name res-hpc-exe013.researchlumc.nl 
Hello world from process 030 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 012 out of 032, processor name res-hpc-exe013.researchlumc.nl 
Hello world from process 017 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 013 out of 032, processor name res-hpc-exe013.researchlumc.nl 
Hello world from process 018 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 014 out of 032, processor name res-hpc-exe013.researchlumc.nl 
Hello world from process 020 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 015 out of 032, processor name res-hpc-exe013.researchlumc.nl 
Hello world from process 022 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 001 out of 032, processor name res-hpc-exe013.researchlumc.nl 
Hello world from process 023 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 024 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 002 out of 032, processor name res-hpc-exe013.researchlumc.nl 
Hello world from process 025 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 008 out of 032, processor name res-hpc-exe013.researchlumc.nl 
Hello world from process 026 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 028 out of 032, processor name res-hpc-exe014.researchlumc.nl 
Hello world from process 009 out of 032, processor name res-hpc-exe013.researchlumc.nl 
salloc: Relinquishing job allocation 270
  • srun

With the srun command you can also open an interactive session or you can run a program through the scheduler.

Interactive:

[user@res-hpc-lo01 ~]$ srun --pty bash
[user@res-hpc-exe013 ~]$ exit
exit

Running a program:

[user@res-hpc-lo01 ~]$ cat hello.sh
#!/bin/bash
#

echo "Hello from $(hostname)"
echo "It is currently $(date)"
echo ""
echo "SLURM_JOB_NAME: $SLURM_JOB_NAME"
echo "SLURM_JOBID: " $SLURM_JOBID

[user@res-hpc-lo01 ~]$ chmod +x hello.sh 

[user@res-hpc-lo01 ~]$ srun -N1 hello.sh
Hello from res-hpc-exe013.researchlumc.nl
It is currently Thu Jan 23 12:35:18 CET 2020

SLURM_JOB_NAME: hello.sh
SLURM_JOBID:  282

sbatch

The normal and correct way to submit a job is with a Slurm batch file. This is a normal bash script with special directives for Slurm.

In a job script, "#SBATCH" lines are used to pass options to Slurm. The various meanings of lines starting with "#" are:

Line starts with Treated as
# Comment in shell and Slurm
#SBATCH Comment in shell, option in Slurm
# SBATCH Comment in shell and Slurm

Options, sometimes called “directives”, can be set in the job script file using this line format for each option:

#SBATCH {option} {parameter}
Directive (description)                                      Specified as #SBATCH option
Name the job <jobname>                                       -J <jobname>
Request at least <minnodes> nodes                            -N <minnodes>
Request <minnodes> to <maxnodes> nodes                       -N <minnodes>-<maxnodes>
Request at least <MB> of temporary disk space                --tmp <MB>
Run the job for a time of <walltime>                         -t <walltime>
Run the job at <time>                                        --begin <time>
Set the working directory to <directorypath>                 -D <directorypath>
Set error log name to <jobname.err>                          -e <jobname.err>
Set output log name to <jobname.log>                         -o <jobname.log>
Mail <user@address>                                          --mail-user <user@address>
Mail on any event                                            --mail-type=ALL
Mail on job end                                              --mail-type=END
Run job in partition <destination>                           -p <destination>
Run job using GPU with ID <number>                           --gres=gpu:<number>
Request <size> memory per node [units K|M|G|T]; better use --mem-per-cpu   --mem=<size[units]>
Request <size> memory per CPU [units K|M|G|T]                --mem-per-cpu=<size[units]>

Node-Core reservation:

Short option   Long option          Description
-N             --nodes=             Request this many nodes on the cluster. Use 1 core on each node by default
-n             --ntasks=            Request this many tasks on the cluster. Defaults to 1 task per node
(none)         --ntasks-per-node=   Request this number of tasks per node

For example:

Options                     Description
-N2                         use 2 nodes, 1 core on each node, so 2 cores in total
-N2 --ntasks-per-node=16    use 2 nodes, 16 cores on each node, so 32 cores in total
-n32                        use 32 cores in total, let Slurm decide where to run (one or multiple nodes)
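
As a small illustration, the second example from the table looks like this at the top of a batch script (a sketch; hello1 is the MPI program compiled earlier):

#!/bin/bash
#SBATCH -N 2                    # 2 nodes
#SBATCH --ntasks-per-node=16    # 16 tasks per node, 32 in total

mpirun ./hello1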

Submitting jobs

module purge
module add library/mpi/openmpi/4.0.2/gcc-8.3.1

git clone https://github.com/intel/mpi-benchmarks
cd mpi-benchmarks
make clean
cd src_c
make clean
make -f Makefile TARGET=MPI1

Please notice in the Makefile: CC=mpicc

ldd ./IMB-MPI1 
    linux-vdso.so.1 (0x00007fff6e9f6000)
    libmpi.so.40 => /share/software/library/mpi/openmpi/4.0.2/gcc-8.3.1/lib/libmpi.so.40 (0x00007f7c6acb7000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f7c6aa97000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f7c6a6d4000)
    libopen-rte.so.40 => /share/software/library/mpi/openmpi/4.0.2/gcc-8.3.1/lib/libopen-rte.so.40 (0x00007f7c6a41e000)
    libopen-pal.so.40 => /share/software/library/mpi/openmpi/4.0.2/gcc-8.3.1/lib/libopen-pal.so.40 (0x00007f7c6a12e000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f7c69f2a000)
    libudev.so.1 => /lib64/libudev.so.1 (0x00007f7c69d03000)
    libpciaccess.so.0 => /lib64/libpciaccess.so.0 (0x00007f7c69af9000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f7c698f0000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f7c6956e000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00007f7c6936a000)
    libz.so.1 => /lib64/libz.so.1 (0x00007f7c69153000)
    libevent-2.1.so.6 => /lib64/libevent-2.1.so.6 (0x00007f7c68efa000)
    libevent_pthreads-2.1.so.6 => /lib64/libevent_pthreads-2.1.so.6 (0x00007f7c68cf7000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f7c6afda000)
    libmount.so.1 => /lib64/libmount.so.1 (0x00007f7c68a9d000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f7c68885000)
    libcrypto.so.1.1 => /lib64/libcrypto.so.1.1 (0x00007f7c683a7000)
    libblkid.so.1 => /lib64/libblkid.so.1 (0x00007f7c68155000)
    libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f7c67f4d000)
    libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f7c67d22000)
    libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f7c67a9e000)

Create a Slurm batch script for the benchmark, for example imb.slurm:

#!/bin/bash
#SBATCH -J IMB
#SBATCH -N 2
# SBATCH --ntasks-per-node=16
# SBATCH --ntasks-per-node=6
# SBATCH -n 32
# SBATCH --exclusive
#SBATCH --time=00:30:00 
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=user@lumc.nl

# Clear the environment from any previously loaded modules
module purge > /dev/null 2>&1

# Load the module environment suitable for the job
# module load library/mpi/openmpi/3.1.5/gcc-8.3.1
module load library/mpi/openmpi/4.0.2/gcc-8.3.1

echo "Starting at `date`"

echo "Running on hosts: $SLURM_JOB_NODELIST"
echo "Running on $SLURM_JOB_NUM_NODES nodes."
echo "Running $SLURM_NTASKS tasks."
echo "Account: $SLURM_JOB_ACCOUNT"
echo "Job ID: $SLURM_JOB_ID"
echo "Job name: $SLURM_JOB_NAME"
echo "Node running script: $SLURMD_NODENAME"
echo "Submit host: $SLURM_SUBMIT_HOST"

echo "Current working directory is `pwd`"

mpirun ./IMB-MPI1 
echo "Program finished with exit code $? at: `date`"

scancel

With the scancel command you can cancel your running job or your scheduled job:

  • scancel jobid

where jobid is your job identifier.
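
For example:

scancel 260         # cancel job 260
scancel -u $USER    # cancel all of your own jobs (running and pending)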

scontrol update

While scontrol show is a powerful command to display information about your job, with the scontrol update command you can change certain settings as long as your job is on hold or pending. First put your job on hold, update the settings, and then release your job:

  • scontrol hold jobid
  • scontrol update job jobid NumNodes=2-2 NumTasks=2 Features=intel16
  • scontrol release jobid

See the man page for the scontrol command.

X11 forwarding

You can enable X11 forwarding with the --x11 parameter, for example:

  • srun -n1 --pty --x11 xclock

Using GPUs

You can use a GPU with the --gres parameter, for example:

--partition=gpu
--gres=gpu:1
--ntasks=1
--cpus-per-task=1

Syntax:

  • --gres=gpu:[type of gpu]:[number of gpus]

Normally you don't have to specify the type of GPU. But if there are different kinds of GPUs in a single machine, or you want to run on a certain type of GPU, you have to specify which GPU type you want to run on, for example:

--partition=LKEBgpu
--gres=gpu:1080Ti:1
Hostname Partition GPU type and number
res-hpc-lkeb01 LKEBgpu Gres=gpu:K40C:1
res-hpc-lkeb02 LKEBgpu Gres=gpu:TitanX:1
res-hpc-lkeb03 LKEBgpu Gres=gpu:V100:4
res-hpc-lkeb04 LKEBgpu Gres=gpu:1080Ti:4
res-hpc-lkeb05 LKEBgpu Gres=gpu:RTX6000:3
res-hpc-gpu[01-02] gpu Gres=gpu:TitanXp:3
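
For example, to request two V100 GPUs on the LKEB partition, using the Gres type names from the table above (a sketch; adjust type and count to your needs):

#SBATCH --partition=LKEBgpu
#SBATCH --gres=gpu:V100:2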

Simple GPU example

cat test-gpu.slurm

#!/bin/bash
#
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=3:00

module purge
module add library/cuda/10.1/gcc.8.3.1

hostname
echo "Cuda devices: $CUDA_VISIBLE_DEVICES"
nvidia-smi
sleep 10

Output:

cat slurm-206044.out

res-hpc-gpu01.researchlumc.nl
Cuda devices: 0
Tue Apr 14 14:24:59 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:3B:00.0 Off |                  N/A |
| 17%   30C    P0    60W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

You can see from the output that we have 1 GPU: Cuda devices: 0

The same, but now we make a reservation for 3 GPUs:

#!/bin/bash
#
#SBATCH --partition=gpu
#SBATCH --gres=gpu:3
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=3:00

module purge
module add library/cuda/10.1/gcc.8.3.1

hostname
echo "Cuda devices: $CUDA_VISIBLE_DEVICES"
nvidia-smi
sleep 10

cat slurm-206045.out

res-hpc-gpu01.researchlumc.nl
Cuda devices: 0,1,2
Tue Apr 14 14:26:22 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:3B:00.0 Off |                  N/A |
| 17%   31C    P0    61W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:AF:00.0 Off |                  N/A |
| 18%   29C    P0    60W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:D8:00.0 Off |                  N/A |
| 18%   30C    P0    61W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

You can see from the output that we have 3 GPUs: Cuda devices: 0,1,2

Compiling and running GPU programs

First download and compile the samples from NVIDIA: CUDA samples

module purge
module add library/cuda/10.1/gcc.8.3.1

cd
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/UnifiedMemoryPerf/
make

Create a slurm batch script:

cat gpu-test.slurm
#!/bin/bash
#
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=3:00

module purge
module add library/cuda/10.2/gcc.8.3.1

hostname
echo "Cuda devices: $CUDA_VISIBLE_DEVICES"
$HOME/cuda-samples/Samples/UnifiedMemoryPerf/UnifiedMemoryPerf
  • sbatch gpu-test.slurm

While the job is running, ssh to the node (in this case res-hpc-gpu01) and run the command nvidia-smi. This will show that the "UnifiedMemoryPerf" program is running on a GPU.

[user@res-hpc-gpu01 GPU]$ nvidia-smi 
Tue Apr 14 16:06:06 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:3B:00.0 Off |                  N/A |
| 17%   31C    P0    61W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:AF:00.0 Off |                  N/A |
| 23%   34C    P2    69W / 250W |    259MiB / 12196MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:D8:00.0 Off |                  N/A |
| 18%   31C    P0    61W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1     29726      C   ...les/UnifiedMemoryPerf/UnifiedMemoryPerf   145MiB |
+-----------------------------------------------------------------------------+

Output:

cat slurm-206625.out 
res-hpc-gpu01.researchlumc.nl
Cuda devices: 0
GPU Device 0: "Pascal" with compute capability 6.1

Running ........................................................

Overall Time For matrixMultiplyPerf 

Printing Average of 100 measurements in (ms)
Size_KB  UMhint UMhntAs  UMeasy   0Copy MemCopy CpAsync CpHpglk CpPglAs
4    10.879  23.178   0.222   0.014   0.031   0.026   0.035   0.026
16   10.657  25.849   0.580   0.030   0.051   0.046   0.052   0.039
64   21.117  37.351   0.852   0.103   0.124   0.116   0.095   0.081
256  21.184  38.074   1.387   0.587   0.450   0.415   0.313   0.302
1024     24.174  33.124   3.032   3.650   1.741   1.649   1.211   1.199
4096     21.668  35.167  11.067  25.803   7.119   7.104   5.329   5.333
16384    51.674  62.263  49.300 191.051  34.179  34.632  28.582  28.054

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Slurm Environment Variables

Available environment variables include:

Variable Meaning
SLURM_CPUS_ON_NODE processors available to the job on this node
SLURM_JOB_ID job ID of executing job
SLURM_LAUNCH_NODE_IPADDR IP address of node where job launched
SLURM_NNODES total number of nodes
SLURM_NODEID relative node ID of current node
SLURM_NODELIST list of nodes allocated to job
SLURM_NTASKS total number of processes in current job
SLURM_PROCID MPI rank (or relative process ID) of the current process
SLURM_SUBMIT_DIR directory from which the job was launched
SLURM_TASK_PID process ID of task started
SLURM_TASKS_PER_NODE number of tasks to be run on each node
CUDA_VISIBLE_DEVICES which GPUs are available for use

Job arrays

For job arrays, see the Slurm web page.

Batch option:

  • --array=0-31

You can cancel your job array with the command:

  • scancel jobid_[0-31]

Environment variables

Environment variable Comment
SLURM_ARRAY_JOB_ID set to the first job ID of the array
SLURM_ARRAY_TASK_ID set to the job array index value
SLURM_ARRAY_TASK_COUNT set to the number of tasks in the job array
SLURM_ARRAY_TASK_MAX set to the highest job array index value
SLURM_ARRAY_TASK_MIN set to the lowest job array index value

Limitations - Restrictions

For now we have set the maximum job array size you can submit at once to 121:

scontrol show config | grep MaxArraySize
MaxArraySize            = 121

Without this limit, some users would submit very large job arrays, occupying the cluster so that other users cannot run their jobs.

Limit the number of simultaneously running jobs

You can limit the number of simultaneously running tasks of a job array with a "%" separator, for example:

  • --array=0-100%4 limits the number of simultaneously running tasks from this job array to 4

We recommend using this option.
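
A minimal job array sketch that combines these options (myprogram and its input files are hypothetical placeholders):

#!/bin/bash
#SBATCH -J array-test
#SBATCH --array=0-100%4     # 101 tasks, at most 4 running simultaneously
#SBATCH --time=10:00

echo "Task $SLURM_ARRAY_TASK_ID of job $SLURM_ARRAY_JOB_ID"
./myprogram input_${SLURM_ARRAY_TASK_ID}.dat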

Remote GPU-accelerated visualization on res-hpc-lo02

If you want to run a graphical program that shows 3D animation, movies or any other kind of simulation/visualization, we have the second login node res-hpc-lo02 for this. This server has a powerful Tesla T4 GPU card (16GB memory).

Steps for setting up a remote GPU-accelerated visualization:

  • connect your remote desktop (X2Go) to the second login node res-hpc-lo02
  • start your visualization program (with “vglrun” in front of it if needed)

Once you are in your remote desktop, open a terminal.

For GPU acceleration, you have to put the VirtualGL command vglrun in front of the actual program you want to run.

Examples:

  • /opt/VirtualGL/bin/vglrun /opt/VirtualGL/bin/glxinfo
  • /opt/VirtualGL/bin/vglrun /opt/VirtualGL/bin/glxspheres64
  • /opt/VirtualGL/bin/vglrun glxgears

With the glxinfo program, you should check for the strings:

  • direct rendering: Yes
  • OpenGL renderer string: Tesla T4/PCIe/SSE2

vglrun glxinfo | egrep "rendering|OpenGL"
direct rendering: Yes
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: Tesla T4/PCIe/SSE2

If you see llvmpipe, then you are using software rendering instead of hardware acceleration.

glxinfo | egrep "rendering|OpenGL"
direct rendering: Yes
OpenGL vendor string: VMware, Inc.
OpenGL renderer string: llvmpipe (LLVM 9.0.0, 256 bits)

Programs that can run with “vglrun”

  • fsleyes

How to run:

  • module add neuroImaging/FSLeyes/0.32.3
  • /opt/VirtualGL/bin/vglrun fsleyes

[Screenshot: fsleyes-01]

Check for: OpenGL renderer: Tesla T4/PCIe/SSE2

If you see llvmpipe, then you are using software rendering.

With the nvidia-smi command, you can also check whether your program is running on the GPU. Below you can see two programs running on the GPU: the Xorg server and the fsleyes program:

nvidia-smi 
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     25565      G   /usr/libexec/Xorg                  25MiB |
|    0   N/A  N/A     28059      G   ...ng/fsl/FSLeyes/bin/python        2MiB |
+-----------------------------------------------------------------------------+

Create a vncserver setup and start a vncserver session

If you are setting up a vncserver session for the first time, it will ask you a few questions, and after that you have to adapt the configuration and startup files:

vncserver

You will require a password to access your desktops.

Password: 
Verify:   
Would you like to enter a view-only password (y/n)? n

Desktop 'TurboVNC: res-hpc-lo02.researchlumc.nl:1 (username)' started on display res-hpc-lo02.researchlumc.nl:1

Creating default startup script /home/username/.vnc/xstartup.turbovnc
Starting applications specified in /home/username/.vnc/xstartup.turbovnc
Log file is /home/username/.vnc/res-hpc-lo02.researchlumc.nl:1.log

Choose a strong password; it should not be the same as your user login password.

Kill the vncserver connection:

  • vncserver -kill :1

Now adapt the xstartup.turbovnc file:

  • vi $HOME/.vnc/xstartup.turbovnc

#!/bin/sh

unset SESSION_MANAGER
unset DBUS_SESSION_BUS_ADDRESS
XDG_SESSION_TYPE=x11;  export XDG_SESSION_TYPE

exec icewm-session

Adapt/create a turbovncserver.conf file for the vncserver with some useful settings:

  • vi $HOME/.vnc/turbovncserver.conf

$geometry="1280x1024"
$depth=24

Now start the vncserver:

vncserver 

Desktop 'TurboVNC: res-hpc-lo02.researchlumc.nl:1 (username)' started on display res-hpc-lo02.researchlumc.nl:1

Starting applications specified in /home/username/.vnc/xstartup.turbovnc
Log file is /home/username/.vnc/res-hpc-lo02.researchlumc.nl:1.log

You can list your vncserver sessions with the following command:

  • vncserver -list

vncserver -list
TurboVNC sessions:

X DISPLAY #     PROCESS ID
:1              33915

You can and should kill your vncserver session when you are done running your application:

  • vncserver -kill :1

vncserver -kill :1
Killing Xvnc process ID 47947

vncserver and port numbers

Every time someone starts a vncsession while another vncsession is already running, the display number increases. The first connection is on display :1, which is port 5900 + 1 = 5901 (5900 is the standard VNC port). For example:

Desktop 'TurboVNC: res-hpc-lo02.researchlumc.nl:3 (username)' started on display res-hpc-lo02.researchlumc.nl:3

In this case the display number is 3 (TCP port 5903), so you connect to "res-hpc-lo02.researchlumc.nl:3".

You can always list your own open VNC sessions with the vncserver -list command.

Remember to kill your VNC session when you are done running your application.

Remote visualization with a “reverse SSH tunnel” and a vncserver/client

With a reverse SSH tunnel you can make a quick connection to a remote desktop. We assume that you have already set up your vncserver correctly.

We are using the SSH proxy server for this:

  • IP address: 145.88.35.10
  • Hostname: res-ssh-alg01.researchlumc.nl

Setting up the reverse SSH tunnel

Follow these steps:

  • [on res-hpc-lo02:] vncserver

Desktop 'TurboVNC: res-hpc-lo02.researchlumc.nl:1 (username)' started on display res-hpc-lo02.researchlumc.nl:1

Starting applications specified in /home/username/.vnc/xstartup.turbovnc
Log file is /home/username/.vnc/res-hpc-lo02.researchlumc.nl:1.log
  • [on res-hpc-lo02:] ssh -R 8899:localhost:5901 -l username 145.88.35.10

You are now logged in on “username@res-ssh-alg01” [keep this window open]

At home (terminal 1):

  • ssh -L 5901:localhost:8899 -l username 145.88.35.10 [keep this window open]

At home (terminal 2):

  • first install a VNC client for your OS, for example:
    • TurboVNC: https://sourceforge.net/projects/turbovnc/files/
    • TigerVNC: https://bintray.com/tigervnc/stable/tigervnc
  • vncviewer localhost:1

[Screenshots: vncviewer-01.gif, vncviewer-02.gif]

This will open a desktop (icewm).

From here:

  • terminal
  • module add neuroImaging/FSLeyes/0.32.3
  • /opt/VirtualGL/bin/vglrun fsleyes

Remember: the first VNC connection starts at port 5901 (:1), the second at port 5902 (:2), and so on. The same holds for port 8899: only one user can connect to it. If that port is occupied you cannot get a connection, so try another port, for example 8900 or 8898.
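
As an illustration, a sketch of the same tunnel for a session on display :2 using the alternative proxy port 8900 (both values hypothetical):

# On res-hpc-lo02: forward proxy port 8900 to your VNC display :2 (port 5902)
ssh -R 8900:localhost:5902 -l username 145.88.35.10

# At home (terminal 1): forward local port 5902 to the same proxy port
ssh -L 5902:localhost:8900 -l username 145.88.35.10

# At home (terminal 2): connect to the forwarded display
vncviewer localhost:2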

Closing the reverse SSH tunnel

  • vncserver -kill :1
  • close both connections/terminals where you are logged in [to the SSH proxy server] with “exit”

More Slurm info

Sview

sview is a graphical frontend for Slurm that can be handy at times, though it offers only minimal functionality.

Don’t forget to enable X11 forwarding.
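
For example, a minimal sketch, assuming you connect from a machine with a local X server:

# Log in with X11 forwarding enabled (-X), then start the GUI
ssh -X username@res-hpc-lo01.researchlumc.nl
sview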

Comparison between SGE and Slurm

User commands

User command SGE Slurm
Interactive login qlogin srun --pty bash or srun -p [partition] --time=4:0:0 --pty bash
Job submission qsub [script_file] sbatch [script_file]
Job deletion qdel [job_id] scancel [job_id]
Job status by job qstat -u "*" [-j job_id] squeue -j [job_id]
Job status by user qstat [-u user_name] squeue -u [user_name]
Job hold qhold [job_id] scontrol hold [job_id]
Job release qrls [job_id] scontrol release [job_id]
Queue list qconf -sql squeue
List nodes qhost sinfo -N or scontrol show nodes
Cluster status qhost -q sinfo
GUI qmon sview

Environmental

Environmental SGE SLURM
Job ID $JOB_ID $SLURM_JOB_ID
Submit directory $SGE_O_WORKDIR $SLURM_SUBMIT_DIR
Submit host $SGE_O_HOST $SLURM_SUBMIT_HOST
Node list $PE_HOSTFILE $SLURM_NODELIST
Job Array Index $SGE_TASK_ID $SLURM_ARRAY_TASK_ID
Number of CPUs $NSLOTS $SLURM_NPROCS

More:

Slurm Comment
$SLURM_CPUS_ON_NODE processors available to the job on this node
$SLURM_LAUNCH_NODE_IPADDR IP address of node where job launched
$SLURM_NNODES total number of nodes
$SLURM_NODEID relative node ID of current node
$SLURM_NTASKS total number of processes in current job
$SLURM_PROCID MPI rank (or relative process ID) of the current process
$SLURM_TASK_PID process ID of task started
$SLURM_TASKS_PER_NODE number of tasks to be run on each node
$CUDA_VISIBLE_DEVICES which GPUs are available for use
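
To see a few of these variables in action, you can submit a minimal batch script along the following lines (job name, output file and partition are only examples):

#!/bin/bash

#SBATCH --job-name=env-demo             # Job name
#SBATCH --output=slurm-env.out          # Output file name
#SBATCH --partition=short               # Partition
#SBATCH --time=00:01:00                 # Time limit
#SBATCH --nodes=1                       # Number of nodes
#SBATCH --ntasks-per-node=1             # Processes per node

echo "Job ID:            $SLURM_JOB_ID"
echo "Submit directory:  $SLURM_SUBMIT_DIR"
echo "Node list:         $SLURM_NODELIST"
echo "Total tasks:       $SLURM_NTASKS"
echo "CPUs on this node: $SLURM_CPUS_ON_NODE"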

Job Specification

Job Specification SGE SLURM
Script directive #$ #SBATCH
Queue -q [queue] -p [partition]
Count of nodes N/A -N [min[-max]]
CPU count -pe [PE] [count] -n [count]
Wall clock limit -l h_rt=[seconds] -t [min] OR -t [days-hh:mm:ss]
Standard out file -o [file_name] -o [file_name]
Standard error file -e [file_name] -e [file_name]
Combine STDOUT & STDERR files -j yes (use -o without -e)
Copy environment -V --export=[ALL | NONE | variables]
Event notification -m abe --mail-type=[events] (--mail-type=ALL for any event, --mail-type=END for job end)
Send notification email -M [address] --mail-user=[address]
Job name -N [name] --job-name=[name] [-J]
Restart job -r [yes|no] --requeue OR --no-requeue (NOTE: configurable default)
Set working directory -wd [directory] --workdir=[dir_name] [-D]
Resource sharing -l exclusive --exclusive OR --shared
Memory size -l mem_free=[memory][K|M|G] --mem=[mem][M|G|T] OR --mem-per-cpu=[mem][M|G|T]
Charge to an account -A [account] --account=[account]
Tasks per node (Fixed allocation_rule in PE) --tasks-per-node=[count] --cpus-per-task=[count]
Job dependency -hold_jid [job_id | job_name] --depend=[state:job_id]
Job project -P [name] --wckey=[name]
Job host preference -q [queue]@[node] OR -q [queue]@@[hostgroup] --nodelist=[nodes] AND/OR --exclude=[nodes]
Quality of service N/A --qos=[name]
Job arrays -t [array_spec] --array=[array_spec]
Generic Resources -l [resource]=[value] --gres=[resource_spec]
Licenses -l [license]=[count] --licenses=[license_spec]
Begin Time -a [YYMMDDhhmm] --begin=YYYY-MM-DD[HH:MM[:SS]]
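
As an illustration of the table above, a typical SGE job header and a sketch of its Slurm translation (all values hypothetical):

#!/bin/bash
# SGE original:
#   #$ -N myjob
#   #$ -q short
#   #$ -l h_rt=3600
#   #$ -M username@lumc.nl
#   #$ -m abe
# Slurm translation:
#SBATCH --job-name=myjob
#SBATCH --partition=short
#SBATCH --time=01:00:00
#SBATCH --mail-user=username@lumc.nl
#SBATCH --mail-type=ALL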

Applications

R

R is a programming language for statistical computing and graphics.

You can load R with one of the modules:

  • statistical/R/3.4.4/gcc.8.3.1
  • statistical/R/3.5.3/gcc.8.3.1
  • statistical/R/3.6.2/gcc.8.3.1
  • statistical/R/4.0.0/gcc.8.3.1
  • statistical/RStudio/1.2.5033/gcc-8.3.1

Running R interactively

You can run R interactively, for example as a quick exercise or test, but the recommended way is to run R in batch mode.

[username@res-hpc-lo01 ~]$ salloc -N1 -n1
salloc: Pending job allocation 386499
salloc: job 386499 queued and waiting for resources
salloc: job 386499 has been allocated resources
salloc: Granted job allocation 386499
salloc: Waiting for resource configuration
salloc: Nodes res-hpc-exe017 are ready for job

[username@res-hpc-exe017 ~]$ module add statistical/R/4.0.0/gcc.8.3.1
[username@res-hpc-exe017 ~]$ R

R version 4.0.0 (2020-04-24) -- "Arbor Day"
...
Type 'q()' to quit R.

> q()
Save workspace image? [y/n/c]: n

[username@res-hpc-exe017 ~]$ exit
exit
salloc: Relinquishing job allocation 386499
salloc: Job allocation 386499 has been revoked.

Running a R script in batch mode

First example

HelloWorld.R

print ("Hello world!")

myscript.sh

#!/bin/bash

#SBATCH --job-name=HelloWorld           # Job name
#SBATCH --output=slurm.out              # Output file name
#SBATCH --error=slurm.err               # Error file name
#SBATCH --partition=short               # Partition
#SBATCH --time=00:05:00                 # Time limit 
#SBATCH --nodes=1                       # Number of nodes
#SBATCH --ntasks-per-node=1             # MPI processes per node

module purge
module add statistical/R/4.0.0/gcc.8.3.1

Rscript --vanilla HelloWorld.R
  • sbatch myscript.sh

Submitted batch job 386860

[username@res-hpc-lo01 R]$ cat slurm.out

[1] "Hello world!"

Second example

driver.R

x <- rnorm(50)
cat("My sample from N(0,1) is:\n")
print(x)

run.slurm

#!/bin/bash

#SBATCH --job-name=serialR              # Job name
#SBATCH --output=slurm.out              # Output file name
#SBATCH --error=slurm.err               # Error file name
#SBATCH --partition=short               # Partition
#SBATCH --time=00:05:00                 # Time limit 
#SBATCH --nodes=1                       # Number of nodes
#SBATCH --ntasks-per-node=1             # MPI processes per node

module purge
module add statistical/R/4.0.0/gcc.8.3.1

Rscript driver.R
[username@res-hpc-lo01 R]$ sbatch run.slurm 
Submitted batch job 386568

[username@res-hpc-lo01 R]$ ls -l
total 78
-rw-r--r-- 1 username Domain Users  59 Jun  5 11:42 driver.R
-rw-r--r-- 1 username Domain Users 483 Jun  5 11:42 run.slurm
-rw-r--r-- 1 username Domain Users   0 Jun  5 11:43 slurm.err
-rw-r--r-- 1 username Domain Users 671 Jun  5 11:43 slurm.out

[username@res-hpc-lo01 R]$ cat slurm.out 
My sample from N(0,1) is:
 [1]  0.32241013 -0.78250675 -0.28872991  0.12559634 -0.29176358  0.57962942
 [7] -0.38277807 -0.21266343  0.86537064  1.06636737  0.96487417  0.31699518
[13]  0.38003556  0.78275327 -0.85745177 -1.47682958 -0.16192662  0.09207091
[19] -0.64508782  1.01504976 -0.07736039 -1.08819811  1.17762738 -0.22819258
[25]  0.79564029  1.36863520 -0.63137494 -0.58452239 -0.96832479 -1.56506037
[31]  1.68344229  1.03967058 -0.20854621  1.39479829 -0.95509839  0.80826154
[37] -0.89781029  0.99954821 -1.25047597 -1.11034908 -1.10759254  1.32150663
[43] -0.04589279 -0.62886137  0.63947415  0.18295622  0.63929410  0.16774740
[49]  0.92311091 -0.13370228
[username@res-hpc-lo01 R]$ scontrol show job 386568
JobId=386568 JobName=serialR
   UserId=username(225812) GroupId=Domain Users(513) MCS_label=N/A
   Priority=449759 Nice=0 Account=dnst-ict QOS=normal
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2020-06-05T11:43:02 EligibleTime=2020-06-05T11:43:02
   AccrueTime=2020-06-05T11:43:02
   StartTime=2020-06-05T11:43:02 EndTime=2020-06-05T11:43:04 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-06-05T11:43:02
   Partition=short AllocNode:Sid=res-hpc-ma01:27472
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=res-hpc-gpu01
   BatchHost=res-hpc-gpu01
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=2G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/username/R/run.slurm
   WorkDir=/home/username/R
   StdErr=/home/username/R/slurm.err
   StdIn=/dev/null
   StdOut=/home/username/R/slurm.out
   Power=
   MailUser=(null) MailType=NONE

Some useful links:

SLURM Job Submission with R, Python, Bash

R (Programming Language)

How to run R programs on taki

How to use R on the Bioinformatics cluster

Running R parallel

Unfortunately, R is not very efficient when running on an HPC cluster: by default, every R instance runs on only one core. To make your R programs run in parallel and more efficiently, we have installed the following libraries:

  • Rmpi
  • snow
  • snowfall
  • parallel

Loading one of these libraries does not by itself make your program run in parallel; for that, you have to adapt your R program.

Example

hello.R

library(Rmpi)

id <- mpi.comm.rank(comm = 0)
np <- mpi.comm.size(comm = 0)
hostname <- mpi.get.processor.name()

msg <- sprintf("Hello world from process %03d of %03d, on host %s\n", id, np, hostname)
cat(msg)

mpi.barrier(comm = 0)
mpi.finalize()

run-rmpi.slurm

#!/bin/bash

#SBATCH --job-name=hello_parallel       # Job name
#SBATCH --output=slurm-rmpi.out         # Output file name
#SBATCH --error=slurm-rmpi.err          # Error file name
#SBATCH --partition=short               # Partition
#SBATCH --time=00:05:00                 # Time limit
#SBATCH --nodes=2                       # Number of nodes
#SBATCH --ntasks-per-node=4             # MPI processes per node

module purge
module add statistical/R/4.0.0/gcc.8.3.1
module add library/mpi/openmpi/4.0.3/gcc-8.3.1

mpirun Rscript hello.R
[username@res-hpc-lo01 R]$ cat slurm-rmpi.out 
Hello world from process 000 of 008, on host res-hpc-gpu01
Hello world from process 001 of 008, on host res-hpc-gpu01
Hello world from process 002 of 008, on host res-hpc-gpu01
Hello world from process 003 of 008, on host res-hpc-gpu01
Hello world from process 004 of 008, on host res-hpc-gpu02
Hello world from process 005 of 008, on host res-hpc-gpu02
Hello world from process 006 of 008, on host res-hpc-gpu02
Hello world from process 007 of 008, on host res-hpc-gpu02

We recommend having a look at the following web pages:

High-Performance and Parallel Computing with R

Quick Intro to Parallel Computing in R

Parallel Processing in R

How-to go parallel in R - basics + tips

Rmpi

Parallel Computing: Introduction to MPI

MPI Tutorial for R (Rmpi)

Rmpi Tutorial 2: Sending Data

RStudio

RStudio is an integrated development environment for R.

You can run RStudio on the login node if you want (with X11 forwarding enabled, or connected via X2Go or MobaXterm):

module purge
module add statistical/RStudio/1.3.959/gcc-8.3.1
rstudio

RStudio on a compute node

You can also start RStudio on a compute node:

[username@res-hpc-lo01 ~]$ srun --x11 --pty bash
[username@res-hpc-exe014 ~]$ module purge
[username@res-hpc-exe014 ~]$ module add statistical/RStudio/1.3.959/gcc-8.3.1
[username@res-hpc-exe014 ~]$ rstudio

RStudio on the OOD (OpenOnDemand portal)

You can also start an RStudio server on the OOD portal:

OOD portal

FSLeyes

See: Programs that can run with “vglrun”

Python

As a researcher, student, scientist or health care worker, there is a big chance you will have to work with the programming language Python; it has become the de facto programming language of the research world.

By default, Python version 3.6.8 is installed as part of the operating system (CentOS 8.X).

python --version
Python 3.6.8

If you need another version of Python (older or newer), you can load it with the module command (module add).

Python versions

We have the following extra Python versions installed on the cluster:

  • 2.7.17
  • 3.7.6
  • 3.8.1

You can load one of these with:

module add system/python/2.7.17
module add system/python/3.7.6
module add system/python/3.8.1
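
After loading a module you can verify which interpreter is active, for example:

module add system/python/3.8.1
python3 --version    # should now report Python 3.8.1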

Installing Python packages

Some Python packages are already installed on the cluster; you can load and use them with the “module load” command. If you need an extra Python package, you can install it with the pip3 command.

pip

For Python version 3 you should use the pip3 command. For Python version 2 you should use the pip2 command.

First edit your pip config file:

$HOME/.config/pip/pip.conf
[list]
format=columns

Useful commands:

pip install packageName
pip uninstall packageName
pip search packageName
pip help
pip install --help

As a normal user (when you are not running in a virtual Python environment), you should always install Python packages with:

pip install --user

Do not use pip install --user some_pkg inside a virtual environment; otherwise, the virtual environment’s pip will get confused.
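
Inside an activated virtual environment (see the Python virtual environments section below), plain pip is the right tool; a minimal sketch, using the placeholder path from that section:

source /path/to/new/virtual/environmentname/bin/activate
pip install nibabel      # installs into the virtual environment; no --user needed
deactivate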

Example

pip3 install nibabel --user
Collecting nibabel
  Downloading https://files.pythonhosted.org/packages/8b/8c/cf676b9b3cf69164ba0703a9dcb86ed895ab172e09bece4480db4f03fcce/nibabel-3.1.1-py3-none-any.whl (3.3MB)
    100% |████████████████████████████████| 3.3MB 200kB/s 
Collecting packaging>=14.3 (from nibabel)
  Downloading https://files.pythonhosted.org/packages/46/19/c5ab91b1b05cfe63cccd5cfc971db9214c6dd6ced54e33c30d5af1d2bc43/packaging-20.4-py2.py3-none-any.whl
Requirement already satisfied: numpy>=1.13 in /usr/local/lib64/python3.6/site-packages (from nibabel)
Requirement already satisfied: six in /usr/lib/python3.6/site-packages (from packaging>=14.3->nibabel)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/lib/python3.6/site-packages (from packaging>=14.3->nibabel)
Installing collected packages: packaging, nibabel
Successfully installed nibabel-3.1.1 packaging-20.4


pip3 show nibabel 
Name: nibabel
Version: 3.1.1
Summary: Access a multitude of neuroimaging data formats
Home-page: https://nipy.org/nibabel
Author: nibabel developers
Author-email: neuroimaging@python.org
License: MIT License
Location: /home/username/.local/lib/python3.6/site-packages
Requires: numpy, packaging


pip3 list
Package                  Version     
------------------------ ------------
...
nibabel                  3.1.1       
...

As you can see from the example above, the Python package(s) will be installed in your local user environment:

Location: /home/username/.local/lib/python3.6/site-packages

Python virtual environments

If you are working on different projects and you need different Python packages for each project, it is better to work in a special virtual environment.

Activating a virtual environment gives you an isolated Python setup; inside it you can use the pip command (without the --user option) and other commands.

You create a new virtual environment with one of the following commands:

  • $ virtualenv /path/to/new/virtual/environmentname
  • $ python3 -m venv /path/to/new/virtual/environmentname

You activate a virtual environment with the command:

  • $ source /path/to/new/virtual/environmentname/bin/activate

You deactivate a virtual environment with the following command (the environment will not be destroyed):

  • (envname) $ deactivate

Example:

virtualenv /exports/example/projects/Project-A
Using base prefix '/usr'
New python executable in /exports/example/projects/Project-A/bin/python3.6
Also creating executable in /exports/example/projects/Project-A/bin/python
Installing setuptools, pip, wheel...done.


[username@res-hpc-lo01 ~]$ source /exports/example/projects/Project-A/bin/activate
(Project-A) [username@res-hpc-lo01 ~]$ 


(Project-A) [username@res-hpc-lo01 python3.6]$ pip3 list
Package    Version
---------- -------
pip        20.1.1
setuptools 49.1.0
wheel      0.34.2


(Project-A) [username@res-hpc-lo01 ~]$ deactivate 
[username@res-hpc-lo01 ~]$ 

To remove your Python virtual environment, delete the virtual environment directory:

  • $ rm -Rf /path/to/virtual/environmentname

Conda, Anaconda, Miniconda and Bioconda

If you have to install, set up and work with a complex program/project, you should make use of the conda tool. Conda itself is a package management system, while Anaconda, Miniconda and Bioconda provide a virtual Python environment and a large collection of optimized Python packages, especially for researchers and scientists. These packages are easy to install within such an environment.

  • Anaconda - collection with the most packages (> 7,500 data science and machine learning packages)
  • Miniconda - lightweight Anaconda version (you should start with this version)
  • Bioconda - specializing in bio-informatics software

Conda

Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.

Conda website

Conda documentation

Anaconda

Anaconda is a package manager, an environment manager, a Python/R data science distribution, and a collection of over 7,500 open-source packages.

Anaconda website

Anaconda product

Miniconda

Miniconda is a free minimal installer for conda. It is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip, zlib and a few others. Use the conda install command to install 720+ additional conda packages from the Anaconda repository.

Miniconda website

Bioconda

Bioconda is a channel for the conda package manager specializing in bioinformatics software.

Bioconda website

Bioconda documentation

Overview useful commands

Description Command
Verify Conda is installed, check version number conda info
Create a new environment named ENVNAME conda create --name ENVNAME
Activate a named Conda environment conda activate ENVNAME
Deactivate current environment conda deactivate
List all packages and versions in the active environment conda list
Delete an entire environment conda remove --name ENVNAME --all
Search for a package in currently configured channels conda search PKGNAME
Install a package conda install PKGNAME
Detailed information about package versions conda search PKGNAME --info
Remove a package from an environment conda uninstall PKGNAME --name ENVNAME
Add a channel to your Conda configuration conda config --add channels CHANNELNAME
Example

module purge
module add tools/miniconda/python3.7/4.7.12

conda info

     active environment : None
            shell level : 0
       user config file : /home/username/.condarc
 populated config files : /home/username/.condarc
          conda version : 4.7.12
    conda-build version : not installed
         python version : 3.7.4.final.0
       virtual packages : 
       base environment : /share/software/tools/miniconda/3.7/4.7.12  (read only)
           channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /share/software/tools/miniconda/3.7/4.7.12/pkgs
                          /home/username/.conda/pkgs
       envs directories : /home/username/.conda/envs
                          /share/software/tools/miniconda/3.7/4.7.12/envs
               platform : linux-64
             user-agent : conda/4.7.12 requests/2.22.0 CPython/3.7.4 Linux/4.18.0-147.8.1.el8_1.x86_64 centos/8.1.1911 glibc/2.28
                UID:GID : 225812:513
             netrc file : None
           offline mode : False
conda create --name Project-B
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/username/.conda/envs/Project-B

Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate Project-B
#
# To deactivate an active environment, use
#
#     $ conda deactivate
conda init bash
no change     /share/software/tools/miniconda/3.7/4.7.12/condabin/conda
no change     /share/software/tools/miniconda/3.7/4.7.12/bin/conda
no change     /share/software/tools/miniconda/3.7/4.7.12/bin/conda-env
no change     /share/software/tools/miniconda/3.7/4.7.12/bin/activate
no change     /share/software/tools/miniconda/3.7/4.7.12/bin/deactivate
no change     /share/software/tools/miniconda/3.7/4.7.12/etc/profile.d/conda.sh
no change     /share/software/tools/miniconda/3.7/4.7.12/etc/fish/conf.d/conda.fish
no change     /share/software/tools/miniconda/3.7/4.7.12/shell/condabin/Conda.psm1
no change     /share/software/tools/miniconda/3.7/4.7.12/shell/condabin/conda-hook.ps1
no change     /share/software/tools/miniconda/3.7/4.7.12/lib/python3.7/site-packages/xontrib/conda.xsh
no change     /share/software/tools/miniconda/3.7/4.7.12/etc/profile.d/conda.csh
modified      /home/username/.bashrc

==> For changes to take effect, close and re-open your current shell. <==
[username@res-hpc-lo01 ~]$ conda activate Project-B
(Project-B) [username@res-hpc-lo01 ~]$ 
conda search beautifulsoup4
Loading channels: done
# Name                       Version           Build  Channel             
beautifulsoup4                 4.6.0          py27_1  pkgs/main           
...
beautifulsoup4                 4.9.1          py38_0  pkgs/main
conda install beautifulsoup4
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/username/.conda/envs/Project-B

  added / updated specs:
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    beautifulsoup4-4.9.1       |           py38_0         171 KB
    ca-certificates-2020.6.24  |                0         125 KB
    certifi-2020.6.20          |           py38_0         156 KB
    libedit-3.1.20191231       |       h7b6447c_0         167 KB
    libffi-3.3                 |       he6710b0_2          50 KB
    ncurses-6.2                |       he6710b0_1         817 KB
    openssl-1.1.1g             |       h7b6447c_0         2.5 MB
    pip-20.1.1                 |           py38_1         1.7 MB
    python-3.8.3               |       hcff3b4d_2        49.1 MB
    readline-8.0               |       h7b6447c_0         356 KB
    setuptools-47.3.1          |           py38_0         515 KB
    soupsieve-2.0.1            |             py_0          33 KB
    sqlite-3.32.3              |       h62c20be_0         1.1 MB
    tk-8.6.10                  |       hbc83047_0         3.0 MB
    xz-5.2.5                   |       h7b6447c_0         341 KB
    ------------------------------------------------------------
                                           Total:        60.1 MB

The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
  beautifulsoup4     pkgs/main/linux-64::beautifulsoup4-4.9.1-py38_0
  ca-certificates    pkgs/main/linux-64::ca-certificates-2020.6.24-0
  certifi            pkgs/main/linux-64::certifi-2020.6.20-py38_0
  ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.33.1-h53a641e_7
  libedit            pkgs/main/linux-64::libedit-3.1.20191231-h7b6447c_0
  libffi             pkgs/main/linux-64::libffi-3.3-he6710b0_2
  libgcc-ng          pkgs/main/linux-64::libgcc-ng-9.1.0-hdf63c60_0
  libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-9.1.0-hdf63c60_0
  ncurses            pkgs/main/linux-64::ncurses-6.2-he6710b0_1
  openssl            pkgs/main/linux-64::openssl-1.1.1g-h7b6447c_0
  pip                pkgs/main/linux-64::pip-20.1.1-py38_1
  python             pkgs/main/linux-64::python-3.8.3-hcff3b4d_2
  readline           pkgs/main/linux-64::readline-8.0-h7b6447c_0
  setuptools         pkgs/main/linux-64::setuptools-47.3.1-py38_0
  soupsieve          pkgs/main/noarch::soupsieve-2.0.1-py_0
  sqlite             pkgs/main/linux-64::sqlite-3.32.3-h62c20be_0
  tk                 pkgs/main/linux-64::tk-8.6.10-hbc83047_0
  wheel              pkgs/main/linux-64::wheel-0.34.2-py38_0
  xz                 pkgs/main/linux-64::xz-5.2.5-h7b6447c_0
  zlib               pkgs/main/linux-64::zlib-1.2.11-h7b6447c_3


Proceed ([y]/n)? y


Downloading and Extracting Packages
libedit-3.1.20191231 | 167 KB    | #################################### | 100% 
sqlite-3.32.3        | 1.1 MB    | #################################### | 100% 
readline-8.0         | 356 KB    | #################################### | 100% 
pip-20.1.1           | 1.7 MB    | #################################### | 100% 
python-3.8.3         | 49.1 MB   | #################################### | 100% 
certifi-2020.6.20    | 156 KB    | #################################### | 100% 
ncurses-6.2          | 817 KB    | #################################### | 100% 
ca-certificates-2020 | 125 KB    | #################################### | 100% 
setuptools-47.3.1    | 515 KB    | #################################### | 100% 
xz-5.2.5             | 341 KB    | #################################### | 100% 
openssl-1.1.1g       | 2.5 MB    | #################################### | 100% 
libffi-3.3           | 50 KB     | #################################### | 100% 
soupsieve-2.0.1      | 33 KB     | #################################### | 100% 
beautifulsoup4-4.9.1 | 171 KB    | #################################### | 100% 
tk-8.6.10            | 3.0 MB    | #################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
conda list
# packages in environment at /home/username/.conda/envs/Project-B:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
beautifulsoup4            4.9.1                    py38_0  
ca-certificates           2020.6.24                     0  
certifi                   2020.6.20                py38_0  
ld_impl_linux-64          2.33.1               h53a641e_7  
libedit                   3.1.20191231         h7b6447c_0  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.1.0                hdf63c60_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
ncurses                   6.2                  he6710b0_1  
openssl                   1.1.1g               h7b6447c_0  
pip                       20.1.1                   py38_1  
python                    3.8.3                hcff3b4d_2  
readline                  8.0                  h7b6447c_0  
setuptools                47.3.1                   py38_0  
soupsieve                 2.0.1                      py_0  
sqlite                    3.32.3               h62c20be_0  
tk                        8.6.10               hbc83047_0  
wheel                     0.34.2                   py38_0  
xz                        5.2.5                h7b6447c_0  
zlib                      1.2.11               h7b6447c_3  
conda uninstall -y beautifulsoup4
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/username/.conda/envs/Project-B

  removed specs:
    - beautifulsoup4


The following packages will be REMOVED:

  beautifulsoup4-4.9.1-py38_0
  soupsieve-2.0.1-py_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(Project-B) [username@res-hpc-lo01 ~]$ conda deactivate
[username@res-hpc-lo01 ~]$ 
conda remove --name Project-B --all -y

Remove all packages in environment /home/username/.conda/envs/Project-B:


## Package Plan ##

  environment location: /home/username/.conda/envs/Project-B


The following packages will be REMOVED:

  _libgcc_mutex-0.1-main
  ca-certificates-2020.6.24-0
  certifi-2020.6.20-py38_0
  ld_impl_linux-64-2.33.1-h53a641e_7
  libedit-3.1.20191231-h7b6447c_0
  libffi-3.3-he6710b0_2
  libgcc-ng-9.1.0-hdf63c60_0
  libstdcxx-ng-9.1.0-hdf63c60_0
  ncurses-6.2-he6710b0_1
  openssl-1.1.1g-h7b6447c_0
  pip-20.1.1-py38_1
  python-3.8.3-hcff3b4d_2
  readline-8.0-h7b6447c_0
  setuptools-47.3.1-py38_0
  sqlite-3.32.3-h62c20be_0
  tk-8.6.10-hbc83047_0
  wheel-0.34.2-py38_0
  xz-5.2.5-h7b6447c_0
  zlib-1.2.11-h7b6447c_3


Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Open OnDemand [OOD]

Open OnDemand [OOD] provides an integrated, single access point for all of your HPC resources.

OOD

This will give you a web interface with the following options:

  • Files
    • Home directory/File Explorer
  • Jobs
    • Active/Completed Jobs
    • Job Composer
  • Clusters
    • Cluster Shell Access
  • Interactive Apps
    • Shark cluster Desktop
    • RStudio Server
    • Jupyter Notebook (with GPU support)