Overview
ICARE provides computational resources to registered and accredited ICARE users. Registered users can run their own codes in a Linux environment that is very similar to the ICARE production environment, with online access to the entire ICARE archive. This PaaS (Platform as a Service) offering is especially useful for users running codes on long time-series data sets who cannot afford to download huge amounts of data to their own facility. It is also useful for maturing and testing codes that are intended to run in operational mode in the ICARE production environment. The service is suitable for both interactive use and massive batch processing dispatched to the back-end computing nodes of the cluster.
Registration
A specific registration is required to access ICARE computing resources. Because ICARE resources are limited, access is restricted to partners working with ICARE on collaborative projects. Register for ICARE data services first (see here), then fill out this additional registration form to request an SSH account. You will be asked to provide additional information, including the framework of your request and an ICARE project referent.
If you only want to access ICARE data services (i.e. SFTP or web access), please use the data access registration form.
Description of the cluster
The ICARE computing cluster is composed of one front-end server and 192 allocated cores spread
over 4 back-end computing nodes (see table):
- 1 front-end server (access.icare.univ-lille.fr)
- 4 computing nodes
Servers | Number of cores allocated to cluster | Hyperthreading | Processor | RAM
---|---|---|---|---
Front-end (access.icare.univ-lille.fr) | 40 | Yes | Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz | 384 GB
Nodes 006-009 | 24 physical cores (48 logical cores) each | Yes | Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz | 384 GB
The front-end server is the primary access point to the cluster. No intensive processing is to be run on the front-end server; it is dedicated to interactive use only. All intensive processing jobs must be run on the computing nodes and must be submitted through the SLURM job scheduler (see below).
Disk Space
- Home Directory (51 TB total)
This space should be used for storing files you want to keep in the long term such as source codes, scripts, etc. The home directory is backed up nightly.
Note: home directories are shared by all nodes of the cluster, so be aware that any modification made in your home directory on one node is also visible on all the other nodes.
- Main Storage Space /work_users (75 TB total)
This is the main storage space for large amounts of data. This work space is backed up nightly.
- Scratch Space /scratch (50 TB total)
The scratch filesystem is intended for temporary storage and should be considered volatile. Older
files are subject to being automatically purged. No backup of any kind is performed for this work
space.
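As an illustration, a job might stage its temporary files on the scratch space and move only the final results to the main storage space. This is a minimal sketch in which the directory layout under /scratch and /work_users and the output file names are hypothetical:

# create a per-job temporary directory on the scratch space
export TMPDIR=/scratch/$USER/temp
mkdir -p $TMPDIR

# ... run your processing, writing intermediate files to $TMPDIR ...

# keep only the final results on the main storage space, then clean up
mkdir -p /work_users/$USER/results
mv $TMPDIR/output_* /work_users/$USER/results/   # output_* is a hypothetical name pattern
rm -rf $TMPDIR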
Logging in
To use the computer cluster, you have to log in to the front-end server access.icare.univ-lille.fr using
your ICARE username and password:
ssh -X username@access.icare.univ-lille.fr
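If you log in often, you can optionally add an entry like the following to the ~/.ssh/config file on your own machine (the alias name icare is just an example):

Host icare
    HostName access.icare.univ-lille.fr
    User username
    ForwardX11 yes

With this entry, ssh icare is equivalent to the full command above.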
Cluster Software and Environment Modules
We use the Environment Modules package to provide dynamic modification of a user's environment.
The Environment Modules package is a tool that simplifies shell initialization and lets users easily
modify their environment during the session with modulefiles. Each modulefile contains the
information needed to configure the shell for an application.
The main module commands are:
module avail # to list all available modules you can load
module list # to list your currently loaded modules
module load moduleName # to load moduleName into your environment
module unload moduleName # to unload moduleName from your environment
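For example, a typical session might look like this (the grep filter and the chosen Python version are only illustrative; check module avail for what is actually installed):

module avail 2>&1 | grep -i python   # module avail writes to stderr, hence the redirection
module load rhel7/Python/3.10.5      # load one of the listed Python modules
which python3                        # the module's python3 is now first in your PATH
module list                          # confirm what is currently loaded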
When you log in to the ICARE cluster, some modules are automatically loaded for your convenience. Initially, your module environment is not empty!
- Display your default module environment
To see the default environment you get at login, issue the "module list" command.
[ops@access ~]$ module list
Currently Loaded Modulefiles:
1) scilab/6.1.0 6) cmake/3.15.3 11) proj/6.2.1 16) netcdf-c/4.6.3 21) swig/4.0.2
2) jdk/13.0.1 7) git/2.21.0 12) HDFView/3.1.0 17) netcdf-fortran/4.4.5 22) Python/2.7.16
3) gcc/9.2.0 8) cvs2svn/2.5.0 13) hdf4/4.2.14 18) netcdf-cxx4/4.3.0 23) coda/2.21
4) openssl/1.1.1d 9) eccodes/2.20.0 14) HDF-EOS2/20v1.00 19) nccmp/1.9.1.0 24) icare-env/3.0.0/python2
5) curl/7.65.3 10) geos/3.7.3 15) hdf5/1.10.5 20) oracle_instantclient/18.5.0.0 25) perl/5.32.0
- Display all available software installed on the cluster
[ops@access ~]$ module avail
--------------------------- modulefiles ----------------------------
rhel7/Anaconda/3/2020.11 rhel7/hdf4/4.2.14-without-netcdf rhel7/netcdf-c/4.6.3
rhel7/cmake/3.15.3 rhel7/hdf5/1.10.5 rhel7/netcdf-c/4.9.0
rhel7/coda/2.21 rhel7/HDF-EOS2/20v1.00 rhel7/netcdf-cxx4/4.3.0
rhel7/coda/2.24 rhel7/HDFView/3.1.0 rhel7/netcdf-fortran/4.4.5
rhel7/coda/2.24.1 rhel7/icare-env/3.0.0/python2 rhel7/openssl/1.1.1d
rhel7/conda_envs/dataviz rhel7/icare-env/3.0.0/python3 rhel7/oracle_instantclient/18.5.0.0
rhel7/conda_envs/dataviz_v3 rhel7/icare-env/3.1.0/python3 rhel7/perl/5.32.0
rhel7/conda_envs/dataviz_v4 rhel7/idl/8.2 rhel7/proj/6.2.1
rhel7/curl/7.65.3 rhel7/idl/8.7 rhel7/proj/9.0.1
rhel7/cvs2svn/2.5.0 rhel7/idl/8.8 rhel7/Python/2.7.16
rhel7/eccodes/2.13.1 rhel7/intel/2021.1.1 rhel7/Python/3.10.5
rhel7/eccodes/2.20.0 rhel7/jdk/13.0.1 rhel7/Python/3.8.0
rhel7/gcc/9.2.0 rhel7/matlab/R2012a rhel7/scilab/6.0.2
rhel7/gdal/3.5.1 rhel7/matlab/R2018b rhel7/scilab/6.1.0
rhel7/gdal/3.6.0 rhel7/matlab/R2020a rhel7/sqlite/3.31.0
rhel7/geos/3.7.3 rhel7/matlab_runtime/R2012a rhel7/swig/4.0.2
rhel7/git/2.21.0 rhel7/matlab_runtime/R2018b
rhel7/hdf4/4.2.14 rhel7/nccmp/1.9.1.0
- Show what a module sets for your shell environment
module show rhel7/Python/3.10.5
-------------------------------------------------------------------
/usr/local/modulefiles/rhel7/Python/3.10.5:
prepend-path PATH /usr/local//modules/rhel7/Python/3.10.5/bin
prepend-path LD_LIBRARY_PATH /usr/local//modules/rhel7/Python/3.10.5/lib
prepend-path PKG_CONFIG_PATH /usr/local//modules/rhel7/Python/3.10.5/lib/pkgconfig
prepend-path PYTHONPATH /usr/local//modules/rhel7/Python/3.10.5/lib/python3.10/site-packages/osgeo
prepend-path CARTOPY_DATA_DIR /usr/local//modules/rhel7/Python/3.10.5/lib/python3.10/site-packages/cartopy/data
-------------------------------------------------------------------
- Get help information about a module
module help rhel7/Python/3.10.5
----------- Module Specific Help for 'rhel7/Python/3.10.5' --------
This modulefile defines the pathes and variables for the package
Python-3.10.5
.............................................
- Loading/unloading modules
Modules can be loaded and unloaded dynamically.
[ops@access ~]$ module load rhel7/matlab/R2018b
[ops@access ~]$ which matlab
/usr/local/modules/rhel7/matlab/R2018b/bin/matlab
[ops@access ~]$ module unload rhel7/matlab/R2018b
- Unload ALL software modules
The module purge command will remove all currently loaded modules. This is particularly useful if you have to run incompatible software (e.g. Python 2.x vs. Python 3.x). The module unload command removes a specific module.
[ops@access ~]$ module purge
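For instance, to switch cleanly to a Python 3 environment, you might do something like:

[ops@access ~]$ module purge                      # start from an empty environment
[ops@access ~]$ module load rhel7/Python/3.10.5   # then load only what you need
[ops@access ~]$ python3 --version
Python 3.10.5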
Running your jobs
No intensive processing is to be run on the front-end node. Processing jobs must be submitted through the SLURM job scheduler to run on the computing nodes. SLURM (Simple Linux Utility for Resource Management) is a workload manager and job scheduling system for Linux clusters.
In the current configuration, all the computing nodes belong to a single partition named "COMPUTE" (i.e. all jobs end up in the same queue). The maximum RAM allowed is 4 GB per job and the maximum execution time is 24 hours by default (i.e. jobs are automatically killed when this limit is reached). See the --time option below to modify that limit.
The job priority is automatically adjusted based on the resources requested by the user when scheduling the job: the lower the requested resources, the higher the priority.
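If your job needs more than the defaults, request the resources explicitly in your submission script; for example (the values below are illustrative and remain subject to the limits granted to your account):

#SBATCH --time=2-00:00:00   # request a 2-day wallclock limit instead of the 24-hour default
#SBATCH --mem=4000          # request 4000 MB of memory for the job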
SLURM commands
Jobs can be submitted to the scheduler using sbatch or srun.
• sbatch: to submit a job to the queue
The job is submitted via the sbatch command. SLURM then assigns a number to the job and places
it in the queue. It will execute when the resources are available.
ops@access:~ $ sbatch submit.sh
Submitted batch job 17
Example (submit.sh) for bash users
#!/bin/bash
#===============================================================================
# SBATCH options:
#SBATCH --job-name=TestJob             # Defines a name for the batch job
#SBATCH --time=10:00                   # Time limit for the job (format: m:s or h:m:s or d-h:m:s)
#SBATCH -o OUTPUT_FILE                 # Specifies the file containing the stdout
#SBATCH -e ERROR_FILE                  # Specifies the file containing the stderr
#SBATCH --mem=2000                     # Memory limit per compute node for the job (in MB)
#SBATCH --partition=COMPUTE            # Partition is a queue for jobs (default is COMPUTE)
#SBATCH --mail-type=ALL                # When email is sent to the user (all notifications)
#SBATCH --mail-user=user@univ-lille.fr # User's email address

### Set the TMPDIR environment variable to a directory that is accessible to the user ID
export TMPDIR=/scratch/$USER/temp
mkdir -p $TMPDIR

### Purge any previously loaded modules
module purge

### Load the application
module load rhel7/Python/3.8.0         # load the Python 3.8.0 module

### Run program
./executable_name
#===============================================================================
Example (submit.sh) for tcsh users
#!/bin/tcsh
#===============================================================================
# SBATCH options:
#SBATCH --job-name=TestJob             # Defines a name for the batch job
#SBATCH --time=10:00                   # Time limit for the job (format: m:s or h:m:s or d-h:m:s)
#SBATCH -o OUTPUT_FILE                 # Specifies the file containing the stdout
#SBATCH -e ERROR_FILE                  # Specifies the file containing the stderr
#SBATCH --mem=2000                     # Memory limit per compute node for the job (in MB)
#SBATCH --partition=COMPUTE            # Partition is a queue for jobs (default is COMPUTE)
#SBATCH --mail-type=ALL                # When email is sent to the user (all notifications)
#SBATCH --mail-user=user@univ-lille.fr # User's email address

### Set the TMPDIR environment variable to a directory that is accessible to the user ID
setenv TMPDIR /scratch/$USER/temp
mkdir -p $TMPDIR

### Purge any previously loaded modules
module purge

### Load the application
module load rhel7/Python/3.10.5        # load the Python 3.10.5 module

### Run program
./executable_name
#===============================================================================
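If you need the job ID in a script, for instance to monitor the job later or to chain several submissions, sbatch can print just the ID. A minimal sketch (submit.sh is the script above):

jobid=$(sbatch --parsable submit.sh)   # --parsable makes sbatch print only the job ID
echo "Submitted job $jobid"
squeue -j $jobid                       # check its state in the queue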
• srun: to submit a job for interactive execution (as you would execute any command line),
i.e. you lose the prompt until the execution is complete.
Example of a run in the partition COMPUTE with a 30-minute time limit:
ops@access:~ $ srun --partition=COMPUTE --time=30:00 job.sh
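srun can also be used to open an interactive shell directly on a computing node, which is convenient for testing a code before submitting a long batch job (the resource values below are only examples):

ops@access:~ $ srun --partition=COMPUTE --time=60 --mem=2000 --pty bash

When you exit the shell, the allocation is released.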
• squeue: to view information about jobs
Usage:
ops@access:~ $ squeue
ops@access:~ $ squeue -u <myusername>
• scancel: to remove a job from the queue, or cancel it if it is running
ops@access:~ $ scancel <jobid>
ops@access:~ $ scancel -u <myusername> --state=pending (cancels all pending jobs by <myusername>)
ops@access:~ $ scancel -u <myusername> --state=running (cancels all running jobs by <myusername>)
• sinfo: provides information about nodes and partitions
sinfo -N -l
NODELIST   NODES  PARTITION  STATE  CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FEATURES  REASON
node006    1      COMPUTE*   idle   48    2:12:2  385563  0         1000    (null)          none
node007    1      COMPUTE*   idle   48    2:12:2  385563  0         1000    (null)          none
node008    1      COMPUTE*   idle   48    2:12:2  385563  0         1000    (null)          none
node009    1      COMPUTE*   idle   48    2:12:2  385563  0         1000    (null)          none
• scontrol: to see the configuration and state of a job
ops@access:~ $ scontrol show job <jobid>
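scontrol can also display the configuration of a node or of the partition itself, for example:

ops@access:~ $ scontrol show partition COMPUTE
ops@access:~ $ scontrol show node node006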
• sview: a graphical user interface to view and modify the SLURM state
The following table translates some of the more commonly used options for qsub to their sbatch
equivalents:
qsub to sbatch translation

To specify the: | qsub option | sbatch option | Comments
---|---|---|---
Queue/partition | -q QUEUENAME | -p QUEUENAME | Torque "queues" are called "partitions" in SLURM. Note: the partition/queue structure has been simplified, see below.
Number of nodes/cores requested | -l nodes=NUMBERCORES | -n NUMBERCORES | See below
 | -l nodes=NUMBERNODES:CORESPERNODE | -N NUMBERNODES -n NUMBERCORES | 
Wallclock limit | -l walltime=TIMELIMIT | -t TIMELIMIT | TIMELIMIT should have the form HOURS:MINUTES:SECONDS. SLURM supports some other time formats as well.
Memory requirements | -l mem=MEMORYmb | --mem=MEMORY | Torque/Maui: this is the total memory used by the job. SLURM: this is memory per node.
 | -l pmem=MEMORYmb | --mem-per-cpu=MEMORY | This is per CPU/core. MEMORY in MB.
Stdout file | -o FILENAME | -o FILENAME | This will combine stdout/stderr on SLURM if -e is not also given.
Stderr file | -e FILENAME | -e FILENAME | This will combine stderr/stdout on SLURM if -o is not also given.
Combining stdout/stderr | -j oe | -o OUTFILE and no -e option | stdout and stderr merged to stdout/OUTFILE
 | -j eo | -e ERRFILE and no -o option | stdout and stderr merged to stderr/ERRFILE
Email address | -M EMAILADDR | --mail-user=EMAILADDR | 
Email options | -m b | --mail-type=BEGIN | Send email when job starts
 | -m e | --mail-type=END | Send email when job ends
 | -m be | --mail-type=BEGIN --mail-type=END | Send email when job starts and ends
Job name | -N NAME | --job-name=NAME | 
Working directory | -d DIR | --workdir=DIR | 
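As an illustration of this translation, here is a hypothetical Torque/PBS header and its SLURM equivalent (the resource values are arbitrary):

# Torque/PBS header (hypothetical example)
#PBS -q COMPUTE
#PBS -l nodes=1:ppn=4
#PBS -l walltime=01:00:00
#PBS -l mem=2000mb
#PBS -N TestJob
#PBS -M user@univ-lille.fr
#PBS -m be

# Equivalent SLURM header
#SBATCH -p COMPUTE
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -t 01:00:00
#SBATCH --mem=2000
#SBATCH --job-name=TestJob
#SBATCH --mail-user=user@univ-lille.fr
#SBATCH --mail-type=BEGIN,END          # comma-separated form of the BEGIN and END notifications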
See also
Documentation of SLURM and its commands is available online:
http://slurm.schedmd.com
http://slurm.schedmd.com/man_index.html