Job Scheduling & Resource Allocation

An HPC cluster needs a way for users to access its computational capacity in a fair and efficient manner. It does this using a scheduler. The scheduler takes user requests in the form of jobs and allocates resources to these jobs based on availability and cluster policy.

The Easley cluster uses the Slurm scheduler. Slurm is a proven job scheduler used by many of the top universities and research institutions in the world. It is open source, fault-tolerant, and highly scalable. Slurm is not the scheduler used on the Hopper cluster, which uses Moab/Torque. Both schedulers serve the same purpose, but implement it differently using their own distinct commands and terminology.

Job Submission

Job submission is the process of requesting resources from the scheduler. It is the gateway to all the computational horsepower in the cluster. Users submit jobs to tell the scheduler what resources are needed and for how long. The scheduler then evaluates the request according to resource availability and cluster policy to determine when the job will run and which resources to use.

How to Submit a Job

Job submission uses the Slurm ‘sbatch’ command. This command accepts numerous options that specify resource requirements and other job attributes. These options can be given in a job script as header lines (#SBATCH), as command-line options to the sbatch command, or as a combination of both. If the same option is given in both places, the command-line value takes precedence.

The general form of the sbatch command:

sbatch [OPTIONS(0)...] [ : [OPTIONS(N)...]] script(0) [args(0)...]

Ex.

sbatch -N1 -t 4:00:00 myScript.sh
cat myScript.sh

#!/bin/bash
#SBATCH --job-name=myJob             # job name
#SBATCH --ntasks=10                  # number of tasks across all nodes
#SBATCH --partition=general          # name of partition to submit job
#SBATCH --time=01:00:00              # Run time (D-HH:MM:SS)
#SBATCH --output=job-%j.out          # Output file. %j is replaced with job ID
#SBATCH --error=job-%j.err           # Error file. %j is replaced with job ID
#SBATCH --mail-type=ALL              # will send email for begin,end,fail
#SBATCH --mail-user=user@auburn.edu
...

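This job submission requests one node (-N1) and a walltime of 4 hours (-t 4:00:00) as command-line options. Other options are specified in the job script as #SBATCH directives. Because command-line options take precedence, the 4-hour walltime overrides the --time=01:00:00 directive in the script.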
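The precedence rule can be checked before submitting by comparing the directives embedded in a script with the options you plan to pass on the command line. The following is a minimal sketch and not part of Slurm: `script_directive` is a hypothetical helper, and the job script written to a temporary file is a cut-down copy of the example above.

```shell
# Hypothetical helper (not a Slurm command): print the value of one
# long-form #SBATCH directive from a job script, or nothing if absent.
script_directive() {
    local script=$1 option=$2
    sed -n "s/^#SBATCH[[:space:]]*--${option}=\([^[:space:]]*\).*/\1/p" "$script" | head -n 1
}

# Cut-down copy of the example job script, written to a temporary file.
job_script=$(mktemp)
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=myJob
#SBATCH --ntasks=10
#SBATCH --time=01:00:00
EOF

cli_time="4:00:00"                                   # value passed with -t
script_time=$(script_directive "$job_script" time)   # value in the script

# A command-line option, when given, takes precedence over the directive.
effective_time=${cli_time:-$script_time}
echo "script asks for $script_time, job is submitted with $effective_time"
rm -f "$job_script"
```

Leaving `cli_time` empty in this sketch makes the script's own value the effective one, which mirrors what happens when `-t` is omitted from the sbatch command line.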

Common Slurm Job Submission Options

Description	Long Option	Short Option	Default	Moab/Torque
Job Name	--job-name	-J	name of job script	-N
Time limit for job in D-HH:MM:SS	--time	-t	2-00:00:00	-l walltime
Number of Nodes requested	--nodes	-N	1	-l nodes
Number of processors	--ntasks	-n	1
Partition	--partition	-p	general	-q
Job Array	--array	-a		-t
Output File	--output	-o		-o
Error File	--error	-e		-e
Memory	--mem=[M,G]			-l mem=[MB,GB]
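To make the D-HH:MM:SS time format concrete, the sketch below converts a time limit to seconds. `walltime_seconds` is a hypothetical helper, not a Slurm command, and it handles only the D-HH:MM:SS and HH:MM:SS forms used in this section.

```shell
# Hypothetical helper: convert a [D-]HH:MM:SS time limit to seconds.
# Only the D-HH:MM:SS and HH:MM:SS forms used in this section are handled.
walltime_seconds() {
    local spec=$1 days=0 h m s
    if [[ $spec == *-* ]]; then
        days=${spec%%-*}     # text before the dash is the day count
        spec=${spec#*-}      # the remainder is HH:MM:SS
    fi
    IFS=: read -r h m s <<< "$spec"
    # 10# forces base 10 so that fields such as "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

walltime_seconds 2-00:00:00   # the default time limit from the table
walltime_seconds 4:00:00      # the -t value from the earlier example
```

The two calls print 172800 and 14400, that is, two days and four hours expressed in seconds.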

Default Memory per Partition

If you do not specify the amount of memory for a job, the job will receive the default memory provided by the scheduler. The default memory for each partition is listed below.

Partition	Default Memory	Total
General	3 GB	192 GB
Bigmem2	7 GB	384 GB
Bigmem4	15 GB	768 GB
Amd	1 GB	256 GB
Gpu2	7 GB	384 GB
Gpu4	15 GB	768 GB
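When the default is not enough, memory is requested with --mem and a unit suffix, as listed in the options table. As a rough sketch, the hypothetical helper below expresses such a value in megabytes so that requests can be compared. It assumes an integer with an optional M or G suffix and Slurm's convention of 1024 MB per GB.

```shell
# Hypothetical helper: express a --mem value in megabytes.
# Assumes an integer followed by an optional M or G suffix (no suffix = MB).
mem_to_mb() {
    local spec=$1
    case $spec in
        *G) echo $(( ${spec%G} * 1024 )) ;;   # gigabytes to megabytes
        *M) echo "${spec%M}" ;;               # already in megabytes
        *)  echo "$spec" ;;                   # bare number, taken as MB
    esac
}

mem_to_mb 200M
mem_to_mb 3G
```

The calls print 200 and 3072 respectively.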

Job-Related Commands

User Command	Slurm	Description	Moab/Torque
Job Submission	sbatch [jobscript]	Submit a job.	qsub
Job Listing	squeue	Job info for all jobs.	showq
Job Status (job)	squeue -j [jobid]	Job info for a specific job.	qstat [jobid]
Job Status (user)	squeue -u [userid]	Job info for a specific user.	qstat -u
Partitions	sinfo	Partition info.
Delete Job	scancel [jobid]	Cancel a job.	qdel
Hold Job	scontrol hold [jobid]	Prevent a job from starting.	qhold [jobid]
Release Job	scontrol release [jobid]	Release a held job.	qrls [jobid]

Running ‘squeue’ lists each job with its state abbreviated in the ST column:

JOBID  PARTITION     NAME     USER     ST     TIME  NODES    NODELIST(REASON)
 135    general     job.sh    user     R       0:01   1       node010

Running ‘squeue -l’ gives a longer format that spells out each job state in the STATE column:

JOBID  PARTITION     NAME     USER    STATE       TIME   TIME_LIMI   NODES   NODELIST(REASON)
 135    general      job.sh   user    RUNNING     1:26     3:00       1         node010

Slurm Job States

Jobs pass through several states during the course of their submission and execution. The most common job states are listed below with their codes and descriptions.

Job State	Code	Description
Pending	PD	Job is awaiting resource allocation
Running	R	Job currently has an allocation and is running
Completing	CG	Job is in the process of completing. Some processes may still be active
Cancelled	CA	Job was cancelled by user or admin
Failed	F	Job terminated with failure
Stopped	ST	Job has an allocation but execution has been stopped
Configuring	CF	Job has been allocated resources but is waiting for them to become ready for use
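The codes in this table can be expanded in a script. The sketch below uses a hypothetical helper, `describe_state`, that covers only the states listed above, and applies it to the sample squeue line shown earlier.

```shell
# Hypothetical helper: expand a squeue state code using the table above.
describe_state() {
    case $1 in
        PD) echo "Pending" ;;
        R)  echo "Running" ;;
        CG) echo "Completing" ;;
        CA) echo "Cancelled" ;;
        F)  echo "Failed" ;;
        ST) echo "Stopped" ;;
        CF) echo "Configuring" ;;
        *)  echo "Unknown ($1)" ;;
    esac
}

# Sample line copied from the squeue output shown earlier; field 5 is ST.
sample=' 135    general     job.sh    user     R       0:01   1       node010'
code=$(awk '{print $5}' <<< "$sample")
echo "job 135 is $(describe_state "$code")"
```

For the sample line this prints "job 135 is Running".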

Partitions

In Slurm, the concept of partitions is important in job submission. A partition is used to logically group different types of capacity and provide them with special functionality. In Easley, there are high-level partitions based on the node type: general, bigmem2, bigmem4, amd, gpu2 and gpu4. The general partition consists of 126 standard nodes, the bigmem2 partition consists of 21 bigmem2 nodes, and so on as defined in the Locations and Resources section. All users can use these high-level partitions on a first-come, first-served basis. However, there is no priority access to these partitions.

There are also partitions based on a PI’s purchased capacity. Only the PI and their sponsored accounts can use these partitions. Not only do they have exclusive access to them, but they also have priority access. Jobs submitted with a PI partition will preempt, if needed, any job running on the same capacity not using that PI partition. Note that the capacity in the PI partitions overlaps the capacity in the high-level partitions in a one-to-one fashion.

To illustrate, let’s say that nodeX is in the general partition. For the sake of example, let’s say this same nodeX is also in the PI partition ‘mylab_std’. So both partitions contain, or overlap, nodeX. A user who does not have access to the ‘mylab_std’ partition submits job A using the general partition and it runs on nodeX. Later, a user who does have access to the ‘mylab_std’ partition submits job B using the ‘mylab_std’ partition. Since job B uses the ‘mylab_std’ partition, which has priority access, it preempts job A and runs on nodeX. Job A is requeued and waits for available resources in order to run.

Partition Types

Type	Priority	Availability	Preemption	Example
Dedicated	1	Lab group	Cannot be preempted.	hpcadmin_std
Department	2	Department members	Can be preempted.	chen_bg2
Investor	3	Investors in special capacity	Can be preempted.	investor_amd
Community	4	All Easley users	Can be preempted.	general

Partition Commands

To view all available partitions on the cluster, use the sinfo command. This command can also be used to find out information such as the number of available nodes, CPUs per node, and walltime.

User Command	Slurm Command
Show partition information	sinfo
Show nodes (idle)	sinfo -t idle
Show nodes (allocated)	sinfo -t alloc
Show nodes (by partition)	sinfo -p <partition>
Show max cpus per node	sinfo -o "%c" -p <partition>

Monitor Jobs

To display information about active, eligible and blocked jobs, use the squeue command:

squeue

Option	Description
squeue -l	Displays all jobs in long format
squeue -t R	Displays running jobs
sinfo -t alloc	Displays nodes allocated to jobs
squeue --start -j <jobid>	Displays the estimated time a job will begin
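For a quick summary of a queue, job states can be tallied with standard text tools. The following is a minimal sketch: `count_states` is a hypothetical helper that reads one state name per line, and the sample input stands in for real scheduler output so the snippet runs anywhere.

```shell
# Hypothetical helper: count jobs per state from one-state-per-line input.
count_states() {
    sort | uniq -c | awk '{print $2 ": " $1}'
}

# Sample input standing in for real scheduler output.
printf 'RUNNING\nPENDING\nRUNNING\n' | count_states
```

On the cluster, the input could come from squeue itself, for example `squeue -h -u "$USER" -o %T | count_states`, where -h suppresses the header and %T prints each job's state in its long form.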

To display detailed job state information and diagnostic output for a specified job, use the scontrol show job <job id> command:

scontrol show job <job id>

To cancel a job:

scancel <job id>

To prevent a pending job from starting:

scontrol hold <job id>

To release a previously held job:

scontrol release <job id>

Monitor Resources

All jobs require resources to run. This includes memory and cores on compute nodes as well as resources like file system space for output files. These commands help determine what resources are available for your jobs.

To check the status of your dedicated capacity.

my_capacity

To display idle capacity by partition.

sinfo -t idle

To display pending jobs on a specific partition.

squeue -t PD -p <partition>

To check your disk space usage.

checkquota

To see if you have files that are scheduled to expire soon.

expiredfiles

Testing

Interactive Job Submission

Interactive jobs may assist with troubleshooting and testing performance. Typing in the following will log you into a shell on a compute node:

srun --pty /bin/bash

You can also specify the resources needed:

srun -N1 -n1 --time=01:00 --pty bash

Here we are requesting one core on a single node to run our job interactively. Note that Slurm reads --time=01:00 as minutes and seconds, so this session is limited to one minute; use --time=01:00:00 to request one hour. Next, check that the modules needed for the job are loaded:

module list

Load any additional modules needed before running the program:

module load samtools

You can exit the interactive session by typing the following:

exit

Job Submission Examples

Command-Line Examples

Example 1:

This job submission requests 40 processors across two nodes and 20 hours of walltime for the job ‘test.sh’. It will also email ‘nouser’ when the job begins, ends, or fails. Since no partition is specified, the default general partition is used.

sbatch -N2 -n40 -t20:00:00 --mail-type=begin,end,fail --mail-user=nouser@auburn.edu test.sh

Example 2:

This job requests a node with 200 MB of available memory in the general partition. Since no walltime is indicated, the job will get the default walltime.

sbatch -pgeneral --mem=200M <job script>

SBATCH Examples

Serial Job Submission

For jobs that require only one CPU-core…

#!/bin/bash
#SBATCH --job-name=testJob            # job name
#SBATCH --nodes=1                     # node(s) required for job
#SBATCH --ntasks=1                    # number of tasks across all nodes
#SBATCH --partition=general           # name of partition
#SBATCH --time=01:00:00               # Run time (D-HH:MM:SS)
#SBATCH --output=test-%j.out          # Output file. %j is replaced with job ID
#SBATCH --error=test_error-%j.err     # Error file. %j is replaced with job ID
#SBATCH --mail-type=ALL               # will send email for begin,end,fail
#SBATCH --mail-user=user@auburn.edu

Multithread Job Submission

For jobs that require the use of multiple cores

Ex.

#!/bin/bash
#SBATCH --job-name=testJob           # job name
#SBATCH --nodes=1                    # node(s) required for job
#SBATCH --ntasks=10                  # number of tasks across all nodes
#SBATCH --partition=general          # name of partition to submit job
#SBATCH --time=01:00:00              # Run time (D-HH:MM:SS)
#SBATCH --output=test-%j.out         # Output file. %j is replaced with job ID
#SBATCH --error=test_error-%j.err    # Error file. %j is replaced with job ID
#SBATCH --mail-type=ALL              # will send email for begin,end,fail
#SBATCH --mail-user=user@auburn.edu

In this case, 10 cores will be allocated on one node. Note: if you do not set the node count to 1, the cores may be allocated across multiple nodes, especially if the request exceeds the number of cores available on one node.

Multinode Job Submission

For jobs that require the use of multiple nodes and multiple cores.

#!/bin/bash
#SBATCH --job-name=testJob          # job name
#SBATCH --nodes=2                   # node(s) required for job
#SBATCH --ntasks-per-node=10        # number of tasks per node
#SBATCH --partition=general         # name of partition to submit job
#SBATCH --output=test-%j.out        # Output file. %j is replaced with job ID
#SBATCH --error=test_error-%j.err   # Error file. %j is replaced with job ID
#SBATCH --time=01:00:00             # Run time (D-HH:MM:SS)
#SBATCH --mail-type=ALL             # will send email for begin,end,fail
#SBATCH --mail-user=user@auburn.edu

In this case, 20 cores will be allocated, with 10 tasks on each of the 2 nodes.
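The total can be worked out inside the job script from environment variables that Slurm sets for the job. This is a sketch under stated assumptions: SLURM_JOB_NUM_NODES and SLURM_NTASKS_PER_NODE are standard Slurm variables (the latter is set only when --ntasks-per-node is given), and the fallback values simply mirror the example above so the snippet also runs outside a job.

```shell
# Total tasks = nodes x tasks per node. The fallbacks mirror the example
# above and apply only when the script runs outside a Slurm job.
nodes=${SLURM_JOB_NUM_NODES:-2}
tasks_per_node=${SLURM_NTASKS_PER_NODE:-10}
total_tasks=$(( nodes * tasks_per_node ))
echo "$nodes nodes x $tasks_per_node tasks per node = $total_tasks tasks"
```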

GPU Job Submission

#!/bin/bash
#SBATCH --job-name=testJob          # job name
#SBATCH --nodes=1                   # node(s) required for job
#SBATCH --ntasks=1                  # number of tasks across all nodes
#SBATCH --partition=gpu2            # name of partition to submit job(gpu2 or gpu4)
#SBATCH --gres=gpu:tesla:1          # specifies the number of gpu devices needed
#SBATCH --output=test-%j.out        # Output file. %j is replaced with job ID
#SBATCH --error=test_error-%j.err   # Error file. %j is replaced with job ID
#SBATCH --time=01:00:00             # Run time (D-HH:MM:SS)
#SBATCH --mail-type=ALL             # will send email for begin,end,fail
#SBATCH --mail-user=user@auburn.edu

Note: For more information about GPUs, consult the GPU Quick Start section of the documentation.

Job Arrays

Job arrays are useful for submitting and managing a large number of similar jobs. As an example, job arrays are convenient if a user wishes to run the same analysis on 100 different files. Slurm provides job array environment variables that allow multiple versions of input files to be easily referenced.

A job array can be submitted by adding the --array option to an sbatch submission:

sbatch --array=0-4 job_script.sh

Here 0-4 specifies the task indices, giving five array tasks numbered 0 through 4. You can also set the array range within your script:

#!/bin/bash
#SBATCH --job-name=Array
#SBATCH --output=array-%A.txt
#SBATCH --error=array-%A.txt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --array=0-4

Then submit the script:

sbatch job_script.sh
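Inside the job script, each array task can use the SLURM_ARRAY_TASK_ID environment variable to choose its own input. The lines below are a minimal sketch of what the body of job_script.sh might contain; the input_N.txt naming scheme and the `input_for_task` helper are hypothetical, and the fallback of 0 only lets the snippet run outside a job.

```shell
# Hypothetical helper: map an array task ID to an input file name.
input_for_task() {
    echo "input_${1}.txt"
}

# SLURM_ARRAY_TASK_ID is set by Slurm for each array task.
# The fallback of 0 applies only when running outside a job.
task_id=${SLURM_ARRAY_TASK_ID:-0}
echo "Task ${task_id} would process $(input_for_task "$task_id")"
```

With --array=0-4, the five tasks would pick input_0.txt through input_4.txt under this naming scheme.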

Naming output and error files

In order to produce output and error files for each array task, you will need to specify both the job ID and task ID. Slurm uses %A for the master job ID and %a for the task ID.

#SBATCH --output=Array-%A_%a.out
#SBATCH --error=Array-%A_%a.error

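The result will be one output file and one error file per array task. For --array=0-4, the output files are:

Array-JOBID_0.out
Array-JOBID_1.out
Array-JOBID_2.out
Array-JOBID_3.out
Array-JOBID_4.out

The error files follow the same pattern, from Array-JOBID_0.error through Array-JOBID_4.error.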

Note: If you only use %A, all array tasks will write to a single file.
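To preview the file names a pattern will produce, the substitution can be imitated with sed. This is only an illustration of the naming rule, not how Slurm itself performs it: `expand_pattern` is a hypothetical helper and 1234 is a made-up job ID.

```shell
# Hypothetical helper: show the file name a pattern would yield once
# %A (master job ID) and %a (array task ID) are substituted.
expand_pattern() {
    local pattern=$1 job_id=$2 task_id=$3
    printf '%s\n' "$pattern" | sed -e "s/%A/${job_id}/g" -e "s/%a/${task_id}/g"
}

# 1234 is a made-up job ID; tasks 0 to 4 match --array=0-4.
for task in 0 1 2 3 4; do
    expand_pattern 'Array-%A_%a.out' 1234 "$task"
done
```

Running the same helper on a pattern that contains only %A returns an identical name for every task, which is why such a pattern sends all tasks to a single file.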

Deleting job arrays and tasks

To delete all array tasks, use scancel with the job ID:

scancel JOBID

To delete a single array task, specify the task ID:

scancel JOBID_1