Job Scheduling & Resource Allocation

An HPC cluster needs a way for users to access its computational capacity in a fair and efficient manner. It does this using a scheduler. The scheduler takes user requests in the form of jobs and allocates resources to these jobs based on availability and cluster policy.

The Easley cluster uses the Slurm scheduler. Slurm is a proven job scheduler used by many of the top universities and research institutions in the world. It is open source, fault-tolerant, and highly scalable. Slurm is not the scheduler used on the Hopper cluster, which uses Moab/Torque. Both schedulers serve the same purpose, but each has its own distinct commands and terminology.

Job Submission

Job submission is the process of requesting resources from the scheduler. It is the gateway to all the computational horsepower in the cluster. Users submit jobs to tell the scheduler what resources are needed and for how long. The scheduler then evaluates the request according to resource availability and cluster policy to determine when the job will run and which resources to use.

How to Submit a Job

Job submission uses the Slurm ‘sbatch’ command. This command accepts numerous directives that specify resource requirements and other job attributes. Slurm directives can appear in a job script as header lines (#SBATCH), as command-line options to the sbatch command, or as a combination of both. If the same option is given in both places, the command-line value takes precedence.

The general form of the sbatch command:

sbatch [OPTIONS(0)...] [ : [OPTIONS(N)...]] script(0) [args(0)...]

Ex.

sbatch -N1 -t 4:00:00 myScript.sh
cat myScript.sh
#!/bin/bash
#SBATCH --job-name=myJob             # job name
#SBATCH --ntasks=10                  # number of tasks across all nodes
#SBATCH --partition=general          # name of partition to submit job
#SBATCH --time=01:00:00              # Run time (D-HH:MM:SS)
#SBATCH --output=job-%j.out          # Output file. %j is replaced with job ID
#SBATCH --error=job-%j.err           # Error file. %j is replaced with job ID
#SBATCH --mail-type=ALL              # will send email for begin,end,fail
#SBATCH --mail-user=user@auburn.edu
...

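This job submission requests one node (-N1) and a walltime of 4 hr (-t 4:00:00) as command-line options. Other options are specified in the job script as #SBATCH directives. Because command-line options take precedence, the 4 hr walltime overrides the --time=01:00:00 directive in the script.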
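When options come from both the script header and the command line, it can help to see what the script itself requests. The following is a minimal sketch, not part of Slurm: the helper name list_directives is hypothetical, and it simply prints the options found on #SBATCH lines.

```shell
#!/bin/bash
# list_directives FILE: print the options set by #SBATCH header lines in FILE.
# (Hypothetical helper, not a Slurm command.)
# Simplification: this scans the whole file, whereas sbatch stops reading
# directives at the first executable command in the script.
list_directives() {
    grep -E '^#SBATCH[[:space:]]' "$1" |
        sed -E -e 's/^#SBATCH[[:space:]]+//' -e 's/[[:space:]]+#.*$//'
}

# Usage, with the example script above:
#   list_directives myScript.sh
```

Any option printed here can still be overridden by the same option on the sbatch command line.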

Common Slurm Job Submission Options

| Description | Long Option | Short Option | Default | Moab/Torque |
|---|---|---|---|---|
| Job Name | --job-name | -J | name of job script | -N |
| Time limit for job in D-HH:MM:SS | --time | -t | 2-00:00:00 | -l walltime |
| Number of Nodes requested | --nodes | -N | 1 | -l nodes |
| Number of processors | --ntasks | -n | 1 | |
| Partition | --partition | -p | general | -q |
| Job Array | --array | -a | | -t |
| Output File | --output | -o | | -o |
| Error File | --error | -e | | -e |
| Memory | --mem=[M,G] | | | -l mem=[MB,GB] |
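To check how long a time limit written in the D-HH:MM:SS form really is, the value can be converted to seconds. This is a small sketch with a hypothetical helper name, walltime_seconds. It handles only the D-HH:MM:SS and HH:MM:SS forms, although Slurm accepts other time formats as well.

```shell
#!/bin/bash
# walltime_seconds SPEC: convert D-HH:MM:SS or HH:MM:SS to seconds.
# (Hypothetical helper; other Slurm time formats are rejected.)
walltime_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        /-/  && NF == 4 { print $1*86400 + $2*3600 + $3*60 + $4; next }
        !/-/ && NF == 3 { print $1*3600 + $2*60 + $3; next }
        { print "unsupported time format: " $0 > "/dev/stderr"; exit 1 }'
}

walltime_seconds 2-00:00:00   # the default time limit from the table
walltime_seconds 4:00:00      # the 4 hr walltime from the earlier example
```

The two calls print 172800 and 14400, which are two days and four hours expressed in seconds.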

Default Memory per Partition

If you do not specify the amount of memory for a job, the job will receive the default memory provided by the scheduler. The default memory for each partition is listed below.

| Partition | Default Memory | Total |
|---|---|---|
| general | 3 GB | 192 GB |
| bigmem2 | 7 GB | 384 GB |
| bigmem4 | 15 GB | 768 GB |
| amd | 1 GB | 256 GB |
| gpu2 | 7 GB | 384 GB |
| gpu4 | 15 GB | 768 GB |

Slurm Job States

Jobs pass through several states during submission and execution. The most common job states are listed below with their codes and descriptions.

| Job State | Code | Description |
|---|---|---|
| Pending | PD | Job is awaiting resource allocation |
| Running | R | Job currently has an allocation and is running |
| Completing | CG | Job is in the process of completing. Some processes may still be active |
| Cancelled | CA | Job was cancelled by user or admin |
| Failed | F | Job terminated with failure |
| Stopped | ST | Job has an allocation but execution has been stopped |
| Configuring | CF | Job has been allocated resources but is waiting for them to become available |
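The state codes are easy to forget when reading scheduler output. As a convenience, the lookup below restates the states listed in this section as a shell function. The name describe_state is hypothetical and the function covers only these seven codes.

```shell
#!/bin/bash
# describe_state CODE: print a short description of a Slurm job state code.
# (Hypothetical helper covering only the codes listed in this section.)
describe_state() {
    case "$1" in
        PD) echo "Pending: awaiting resource allocation" ;;
        R)  echo "Running: has an allocation and is running" ;;
        CG) echo "Completing: finishing; some processes may still be active" ;;
        CA) echo "Cancelled: cancelled by user or admin" ;;
        F)  echo "Failed: terminated with failure" ;;
        ST) echo "Stopped: has an allocation but execution has been stopped" ;;
        CF) echo "Configuring: allocated resources are not yet available" ;;
        *)  echo "Unknown state code: $1" ;;
    esac
}

describe_state PD
```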

Partitions

In Slurm, partitions play an important role in job submission. A partition logically groups a type of capacity and gives it special functionality. On Easley, there are high-level partitions based on node type: general, bigmem2, bigmem4, amd, gpu2 and gpu4. The general partition consists of 126 standard nodes, the bigmem2 partition consists of 21 bigmem2 nodes, and so on as defined in the Locations and Resources section. All users can use these high-level partitions on a first-come, first-served basis, but there is no priority access to these partitions.

There are also partitions based on a PI’s purchased capacity. Only the PI and their sponsored accounts can use these partitions. Not only do they have exclusive access to them, but they also have priority access. Jobs submitted with a PI partition will preempt, if needed, any job running on the same capacity not using that PI partition. Note that the capacity in the PI partitions overlaps the capacity in the high-level partitions in a one-to-one fashion.

To illustrate, let’s say that nodeX is in the general partition. For sake of example, let’s say this same nodeX is also in the PI partition ‘mylab_std’. So both partitions contain, or overlap, nodeX. A user who does not have access to the ‘mylab_std’ partition submits job A using the general partition and it runs on nodeX. Later a user who does have access to the ‘mylab_std’ partition submits job B using the ‘mylab_std’ partition. Since job B uses the ‘mylab_std’ partition that has priority access, it preempts job A and runs on nodeX. Job A is requeued and waits for available resources in order to run.
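Because a preempted job is requeued and later starts again from the top of its script, it helps if the script can tell a first run from a restart. The sketch below relies on the SLURM_RESTART_COUNT environment variable, which Slurm sets once a job has been restarted or requeued. The program name and its resume flag in the comments are hypothetical placeholders.

```shell
#!/bin/bash
# restart_count: how many times Slurm has restarted (requeued) this job.
# SLURM_RESTART_COUNT is only set after a restart, so default to 0.
restart_count() {
    echo "${SLURM_RESTART_COUNT:-0}"
}

if [ "$(restart_count)" -gt 0 ]; then
    echo "Job was requeued $(restart_count) time(s); resuming from checkpoint"
    # hypothetical: ./my_program --resume checkpoint.dat
else
    echo "First run; starting from scratch"
    # hypothetical: ./my_program
fi
```

Whether resuming is possible depends on the application writing its own checkpoint files; Slurm only reruns the script.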

Partition Types

| Type | Priority | Availability | Preemption | Example |
|---|---|---|---|---|
| Dedicated | 1 | Lab group | Cannot be preempted. | hpcadmin_std |
| Department | 2 | Department members | Can be preempted. | chen_bg2 |
| Investor | 3 | Investors in special capacity | Can be preempted. | investor_amd |
| Community | 4 | All Easley users | Can be preempted. | general |

Partition Commands

To view all available partitions on the cluster, use the sinfo command. This command can also be used to find information such as the number of available nodes, CPUs per node, and walltime limits.

| User Command | Slurm Command |
|---|---|
| Show partition information | sinfo |
| Show nodes (idle) | sinfo -t idle |
| Show nodes (allocated) | sinfo -t alloc |
| Show nodes (by partition) | sinfo -p <partition name> |
| Show max cpus per node | sinfo -o %c -p <partition name> |
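The sinfo commands above can be combined into a short summary of idle nodes per partition. This sketch uses the sinfo options -h (no header), -t (state filter) and -o (output format) with the %P (partition) and %D (node count) fields. The function name idle_by_partition is hypothetical, and the sinfo call is skipped on machines where the command is not installed.

```shell
#!/bin/bash
# idle_by_partition: read "PARTITION NODES" lines on stdin and print the
# total node count per partition, sorted by partition name.
# (Hypothetical helper; sinfo marks the default partition with a trailing
# "*", which is stripped here.)
idle_by_partition() {
    awk '{ sub(/\*$/, "", $1); n[$1] += $2 }
         END { for (p in n) print p, n[p] }' | sort
}

# Only query the scheduler when sinfo is actually available.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -h -t idle -o "%P %D" | idle_by_partition
fi
```

Since PI partitions overlap the high-level partitions, the same idle node can be counted under more than one partition name.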

Monitor Jobs

To display information about active, eligible and blocked jobs, use the squeue command:

squeue

| Option | Description |
|---|---|
| squeue -l | Displays all jobs in long format |
| squeue -t R | Displays running jobs |
| sinfo -t alloc | Displays nodes allocated to jobs |
| squeue --start -j <jobid> | Displays the estimated time a job will begin |

To display detailed job state information and diagnostic output for a specified job:

scontrol show job <job id>

To cancel a job:

scancel <job id>

To prevent a pending job from starting:

scontrol hold <job id>

To release a previously held job:

scontrol release <job id>
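For a quick overview of your own jobs, the squeue output can be reduced to a count per state. This is a sketch only: the function name count_by_state is hypothetical, and it relies on the squeue options -h (no header), -u (user filter) and the %t format field (compact state code).

```shell
#!/bin/bash
# count_by_state: read one job state code per line on stdin and print
# "CODE COUNT" pairs sorted by code. (Hypothetical helper.)
count_by_state() {
    awk '{ c[$1]++ } END { for (s in c) print s, c[s] }' | sort
}

# Only query the scheduler when squeue is actually available.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "${USER:-$(id -un)}" -o "%t" | count_by_state
fi
```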

Monitor Resources

All jobs require resources to run. This includes memory and cores on compute nodes as well as resources like file system space for output files. These commands help determine what resources are available for your jobs.

To check the status of your dedicated capacity.

my_capacity

To display idle capacity by partition.

sinfo -t idle

To display pending jobs on a specific partition.

squeue -t PD -p <partition>

To check your disk space usage.

checkquota

To see if you have files that are scheduled to expire soon.

expiredfiles

Testing

Interactive Job Submission

Interactive jobs may assist with troubleshooting and testing performance. Typing in the following will log you into a shell on a compute node:

srun --pty /bin/bash

You can also specify the resources needed:

srun -N1 -n1 --time=01:00:00 --pty bash

Here we are requesting one core on a single node for one hour to run our job interactively. Note that Slurm reads a two-field time such as 01:00 as minutes:seconds, so the hours field is needed to request one hour. Next, check that the modules needed for the job are loaded:

module list

Load any additional modules needed before running the program

module load samtools

You can exit the interactive session by typing in the following

exit
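Before loading modules or starting a long test, it can be useful to confirm that the shell is running inside a Slurm allocation rather than on a login node. A minimal check, assuming only that Slurm sets SLURM_JOB_ID inside a job (the function name in_slurm_job is hypothetical):

```shell
#!/bin/bash
# in_slurm_job: succeed if this shell is running inside a Slurm job.
# (Hypothetical helper based on the SLURM_JOB_ID environment variable.)
in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_slurm_job; then
    echo "Inside job $SLURM_JOB_ID on $(hostname)"
else
    echo "Not inside a Slurm job (probably a login node)"
fi
```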

Job Submission Examples

Command-Line Examples

Example 1:

This job submission requests 40 processors on two nodes for the job ‘test.sh’ and 20 hr of walltime. It will also email ‘nouser’ when the job begins and ends or if the job fails. Since no partition is specified, the general partition is used as it is the default.

sbatch -N2 -n40 -t20:00:00 --mail-type=begin,end,fail --mail-user=nouser@auburn.edu test.sh

Example 2:

This job requests a node with 200MB of available memory in the general partition. Since no walltime is indicated, the job will get the default walltime.

sbatch -pgeneral --mem=200M  <job script>

SBATCH Examples

Serial Job Submission

For jobs that require only one CPU-core…

#!/bin/bash
#SBATCH --job-name=testJob            # job name
#SBATCH --nodes=1                     # node(s) required for job
#SBATCH --ntasks=1                    # number of tasks across all nodes
#SBATCH --partition=general           # name of partition
#SBATCH --time=01:00:00               # Run time (D-HH:MM:SS)
#SBATCH --output=test-%j.out          # Output file. %j is replaced with job ID
#SBATCH --error=test_error-%j.err     # Error file. %j is replaced with job ID
#SBATCH --mail-type=ALL               # will send email for begin,end,fail
#SBATCH --mail-user=user@auburn.edu

Multithread Job Submission

For jobs that require the use of multiple cores

Ex.

#!/bin/bash
#SBATCH --job-name=testJob           # job name
#SBATCH --nodes=1                    # node(s) required for job
#SBATCH --ntasks=10                  # number of tasks across all nodes
#SBATCH --partition=general          # name of partition to submit job
#SBATCH --time=01:00:00              # Run time (D-HH:MM:SS)
#SBATCH --output=test-%j.out         # Output file. %j is replaced with job ID
#SBATCH --error=test_error-%j.err    # Error file. %j is replaced with job ID
#SBATCH --mail-type=ALL              # will send email for begin,end,fail
#SBATCH --mail-user=user@auburn.edu

In this case, 10 cores will be allocated on one node. Note: if you do not set the node count to 1, the cores may be allocated across multiple nodes, especially if the request exceeds the number of cores available on one node.
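The header above only reserves the cores; the program still has to be told how many threads to start. One possible script body is sketched below, assuming an OpenMP program (which reads OMP_NUM_THREADS). The function name thread_count and the program name are hypothetical. It prefers SLURM_CPUS_PER_TASK when that is set and otherwise falls back to SLURM_NTASKS, since the example requests its cores with --ntasks.

```shell
#!/bin/bash
# thread_count: number of threads to use, taken from the Slurm environment.
# Falls back to 1 when run outside a job. (Hypothetical helper.)
thread_count() {
    echo "${SLURM_CPUS_PER_TASK:-${SLURM_NTASKS:-1}}"
}

# OMP_NUM_THREADS is honored by OpenMP programs; other threading libraries
# may need a different setting.
export OMP_NUM_THREADS="$(thread_count)"
echo "Running with $OMP_NUM_THREADS thread(s)"
# hypothetical: ./my_threaded_program
```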

Multinode Job Submission

For jobs that require the use of multiple nodes and multiple cores.

#!/bin/bash
#SBATCH --job-name=testJob          # job name
#SBATCH --nodes=2                   # node(s) required for job
#SBATCH --ntasks-per-node=10        # number of tasks per node
#SBATCH --partition=general         # name of partition to submit job
#SBATCH --output=test-%j.out        # Output file. %j is replaced with job ID
#SBATCH --error=test_error-%j.err   # Error file. %j is replaced with job ID
#SBATCH --time=01:00:00             # Run time (D-HH:MM:SS)
#SBATCH --mail-type=ALL             # will send email for begin,end,fail
#SBATCH --mail-user=user@auburn.edu

In this case, 20 cores will be allocated, with 10 tasks per node.
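The task arithmetic can be checked from inside the job. The sketch below multiplies the node count by the tasks per node using the SLURM_JOB_NUM_NODES and SLURM_NTASKS_PER_NODE environment variables (the latter is set only when --ntasks-per-node is given). The function name total_tasks is hypothetical, and the srun line runs only inside a Slurm job.

```shell
#!/bin/bash
# total_tasks: nodes multiplied by tasks per node, from the Slurm
# environment. Defaults to 1 x 1 outside a job. (Hypothetical helper.)
total_tasks() {
    echo $(( ${SLURM_JOB_NUM_NODES:-1} * ${SLURM_NTASKS_PER_NODE:-1} ))
}

echo "Launching $(total_tasks) task(s)"

# Inside a job, srun starts one copy of the command per task; counting
# hostnames shows how the tasks are spread over the nodes.
if command -v srun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
    srun hostname | sort | uniq -c
fi
```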

GPU Job Submission

#!/bin/bash
#SBATCH --job-name=testJob          # job name
#SBATCH --nodes=1                   # node(s) required for job
#SBATCH --ntasks=1                  # number of tasks across all nodes
#SBATCH --partition=gpu2            # name of partition to submit job(gpu2 or gpu4)
#SBATCH --gres=gpu:tesla:1          # specifies the number of gpu devices needed
#SBATCH --output=test-%j.out        # Output file. %j is replaced with job ID
#SBATCH --error=test_error-%j.err   # Error file. %j is replaced with job ID
#SBATCH --time=01:00:00             # Run time (D-HH:MM:SS)
#SBATCH --mail-type=ALL             # will send email for begin,end,fail
#SBATCH --mail-user=user@auburn.edu

Note: For more information about GPU jobs, consult the GPU Quick Start section of the documentation.
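A GPU job script can confirm what it was given before starting the real work. The sketch below assumes that the scheduler exports CUDA_VISIBLE_DEVICES as a comma-separated list of device indices for jobs that request GPUs with --gres. This is common Slurm behavior but depends on the cluster configuration. The function name gpu_count is hypothetical.

```shell
#!/bin/bash
# gpu_count: number of GPUs listed in CUDA_VISIBLE_DEVICES (0 if unset).
# Assumes a comma-separated list of device indices. (Hypothetical helper.)
gpu_count() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
    else
        printf '%s\n' "$CUDA_VISIBLE_DEVICES" | awk -F, '{ print NF }'
    fi
}

echo "GPUs visible to this job: $(gpu_count)"
```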

Job Arrays

Job arrays are useful for submitting and managing a large number of similar jobs. As an example, job arrays are convenient if a user wishes to run the same analysis on 100 different files. Slurm provides job array environment variables that allow multiple versions of input files to be easily referenced.

A job array can be submitted by adding the --array option to an sbatch submission:

sbatch --array=0-4 job_script.sh

Here 0-4 specifies the range of array task IDs, which creates five tasks numbered 0 through 4. You can also set the array range within your script:

#!/bin/bash
#SBATCH --job-name=Array
#SBATCH --output=array-%A.txt
#SBATCH --error=array-%A.txt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --array=0-4

Then submit the following

sbatch job_script.sh
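Inside each array task, Slurm sets SLURM_ARRAY_TASK_ID to that task's index, which is what lets one script process a different input per task. The sketch below maps the index to a line of a file list. The file name inputs.txt, the function name nth_input and the analysis command are hypothetical.

```shell
#!/bin/bash
# nth_input LIST INDEX: print line INDEX (0-based) of the file LIST.
# (Hypothetical helper; task IDs in a 0-4 array start at 0.)
nth_input() {
    sed -n "$(( $2 + 1 ))p" "$1"
}

# Runs only inside an array task and only if the file list exists.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ] && [ -f inputs.txt ]; then
    input=$(nth_input inputs.txt "$SLURM_ARRAY_TASK_ID")
    echo "Task $SLURM_ARRAY_TASK_ID processing $input"
    # hypothetical: ./analyze "$input"
fi
```

With --array=0-4, inputs.txt would need at least five lines, one input file name per line.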

Naming output and error files

In order to produce output and error files for each array task, you will need to specify both the job ID and task ID. Slurm uses %A for the master job ID and %a for the task ID.

#SBATCH --output=Array-%A_%a.out
#SBATCH --error=Array-%A_%a.error

For an array with task IDs 0-4, the result will be one output file per task, each with a matching .error file:

Array-JOBID_0.out
Array-JOBID_1.out
Array-JOBID_2.out
Array-JOBID_3.out
Array-JOBID_4.out

Note: If you only use %A, all array tasks will write to a single file.
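To preview which file names a pattern will produce, the two placeholders can be substituted by hand. The sketch below imitates only the %A and %a replacement, not Slurm's full file name pattern syntax. The function name expand_pattern is hypothetical and the job ID 1234 is made up for illustration.

```shell
#!/bin/bash
# expand_pattern PATTERN JOBID TASKID: replace %A with the array job ID
# and %a with the task ID. (Hypothetical helper; covers only %A and %a.)
expand_pattern() {
    printf '%s\n' "$1" | sed -e "s/%A/$2/g" -e "s/%a/$3/g"
}

# Preview the output files of a 0-4 array with the made-up job ID 1234.
for task in 0 1 2 3 4; do
    expand_pattern "Array-%A_%a.out" 1234 "$task"
done
```

Running the same loop with a pattern that contains only %A prints the same name five times, which is why all tasks would then write to a single file.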

Deleting job arrays and tasks

To delete all array tasks, use scancel with the job ID:

scancel JOBID

To delete a single array task, specify the task ID:

scancel JOBID_1