5. Compute on Spider
This is a quickstart for the platform. On this page you will learn:
- how to prepare and run your workloads
- about job types, partitions and Slurm constraints
5.1. Prepare your workloads
When you submit jobs to the batch system, you create a job script where you specify the resources that your programs need from the system to execute successfully.
Before submitting your jobs, it is good practice to run a few tests of your programs locally (on the login node or another system) and observe:
- the time that your programs take to execute
- the number of cores that your software needs to execute these tasks
- the maximum memory used by the programs during execution
We suggest that, where possible, you first debug your job template on the login node. In doing so, please take into account that the login node is a shared resource, so any job testing should consume as few resources as possible. If you have high resource demands, please contact our helpdesk for support in testing your jobs.
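A rough local measurement of the run time can be sketched as below. The helper name `measure_wall_seconds` is hypothetical and not part of Spider or Slurm; it only wraps `date +%s` around a test run. For peak memory, GNU time (where it is installed) reports a "Maximum resident set size" line when called as `/usr/bin/time -v <command>`.

```shell
# Hypothetical helper (not part of Spider or Slurm): run a command and
# print how many whole seconds of wall-clock time it took.
measure_wall_seconds() (
    start=$(date +%s)
    "$@" >/dev/null 2>&1      # discard the program's own output
    end=$(date +%s)
    echo $((end - start))
)

# Example with a short test run of your own program:
# measure_wall_seconds python hello_world.py
```

Keep such test runs short, since they execute on the shared login node.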
Once you have a rough estimate of the resources above, you are set to go. Create your job script to request the estimated resources from the scheduler.
In the current setup of Slurm on Spider, we ask you to specify at least the following attributes:
|SBATCH directive||Functionality||Usage example|
|-N||the number of nodes||#SBATCH -N 1|
|-c||the number of cores||#SBATCH -c 1|
|-t||the wall-clock time||#SBATCH -t 1:00|
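The wall-clock time is passed as a time string; one of the formats Slurm accepts is hours:minutes:seconds. As a minimal sketch, a run-time estimate in minutes can be turned into such a string with a small helper. The name `to_slurm_time` is an assumption made for this example.

```shell
# Hypothetical helper: convert a run-time estimate in whole minutes
# into an HH:MM:SS string for the wall-clock time directive.
to_slurm_time() (
    minutes=$1
    printf '%02d:%02d:00\n' $((minutes / 60)) $((minutes % 60))
)

to_slurm_time 90     # prints 01:30:00
```

It is usually wise to add some margin to the measured time before converting it.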
The specifics of each partition can be found with
scontrol show partitions. The information per machine can be found with
scontrol show node NAME, where NAME is the name of the worker node and for a simple overview use
5.2. Run your jobs
5.2.1. Running a local Job with srun
The srun command creates an allocation and executes an application on a cluster managed by Slurm.
It comes with a great deal of options for which help is available by typing
srun --help on
the login node. Alternatively, you can also get
help at the Slurm documentation page.
When used on the command line, the srun command is executed locally by Slurm.
An example of this is given below. A python script,
hello_world.py, has the following content:
#!/usr/bin/env python
print("Hello World")
This python script can be locally executed as:
srun python hello_world.py
#Hello World
Apart from quick tests like this one, srun should only be used within a job script that is submitted with
sbatch to the Slurm managed job queue.
5.2.2. Running an interactive Job with srun
You can start an interactive session on a worker node. This helps when you want to debug your pipeline or compile some software directly on the node. You will have direct access to your home and project space files from within your interactive session.
The interactive jobs will also be ‘scheduled’ along with batch jobs for resources so they may not always start immediately.
The example below shows how to start an interactive session on a normal partition worker node with a maximum time of one hour, one core and one task per node:
srun --partition=normal --time=01:00:00 -c 1 --ntasks-per-node=1 --pty bash -i -l
To stop your session and return to the login node, type
The example below shows how to start an interactive session on a single core of a specific worker node:
srun -c 1 --time=01:00:00 --nodelist=wn-db-02 --x11 --pty bash -i -l
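If you start interactive sessions often, the options above can be collected in a small wrapper. The sketch below only prints the command instead of running it, so it can be inspected first; the function name `interactive_cmd` and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper: print (not run) the srun command for an
# interactive session. Arguments: partition, wall time, number of cores.
interactive_cmd() (
    partition=$1
    walltime=$2
    cores=$3
    echo "srun --partition=$partition --time=$walltime -c $cores --pty bash -i -l"
)

interactive_cmd normal 01:00:00 1
# prints: srun --partition=normal --time=01:00:00 -c 1 --pty bash -i -l
```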
5.2.3. Submitting a Job Script with sbatch
The sbatch command submits a batch script or job description script with 1 or more
srun commands to the batch queue. This script is written in bash and requires SBATCH header lines that define
all of your job's global parameters. Slurm then manages this queue and schedules the
srun jobs for execution on the available worker nodes. Slurm takes
into account the global options specified with
#SBATCH <options> in the job
description script as well as any local options specified for individual
srun <options> jobs.
Below we provide an example of sbatch job submission with options. Here we
submit and execute the above-mentioned hello_world.py script with
sbatch and provide the options
-N 1 to request only 1 node,
-c 1 to request 1 core and 8000 MB memory (coupled) and
-t 1:00 to request a maximum run time of 1 minute. The job script,
hello_world.sh, is an executable bash script with the following code:
#!/bin/bash
#SBATCH -N 1
#SBATCH -c 1
#SBATCH -t 1:00
srun python /home/[USERNAME]/[path-to-script]/hello_world.py
You can submit this job script to the Slurm managed job queue as:
sbatch hello_world.sh
#Submitted batch job 808
The job is scheduled in the queue with
jobid 808 and the stdout output of
the job is saved in the ascii file slurm-808.out.
more slurm-808.out
#Hello World
More information on
sbatch can be found at the Slurm documentation page.
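In scripts it is often handy to capture the job ID from the submission message so that the matching output file can be located. A minimal sketch, assuming the message has the form shown above; `parse_job_id` is a hypothetical helper name.

```shell
# Hypothetical helper: extract the job ID from the sbatch submission
# message by keeping only its last whitespace-separated field.
parse_job_id() (
    echo "$1" | awk '{ print $NF }'
)

jobid=$(parse_job_id "Submitted batch job 808")
echo "slurm-$jobid.out"    # prints slurm-808.out
```

In a real session the message would come from the submission itself, for example `parse_job_id "$(sbatch hello_world.sh)"`. sbatch also offers a `--parsable` option that prints the job ID without the surrounding text.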
5.2.4. Using local scratch
If you run jobs that require intensive IO processes, we advise you to use
scratch because it is the local SSD on every compute node of
Spider. This is temporary storage that can be used only during the
execution of your job and may be removed at any time once your
job has finished running.
In order to access the
scratch filesystem within your jobs, you should use the
$TMPDIR variable in your job script. We advise the following steps:
- At the start of your job, copy the necessary input files to $TMPDIR
- Run your analysis and produce your intermediate/output files on $TMPDIR
- Copy the output files at the end of the job from $TMPDIR to your home directory
$TMPDIR is /tmp, which is a ‘bind mount’ from
/scratch/slurm.<JOBID>, so you will only see your own job files in
/tmp and all files will be removed after the job finishes.
The $TMPDIR variable can only be used within Slurm jobs. It cannot be used or tested on the UI because there is no scratch space there.
Here is a job script template for local scratch usage:
#!/bin/bash
#SBATCH -N 1      #request 1 node
#SBATCH -c 1      #request 1 core and 8000 MB RAM
#SBATCH -t 5:00   #request 5 minutes jobs slot

mkdir "$TMPDIR"/myanalysis
cp -r "$HOME"/mydata "$TMPDIR"/myanalysis
cd "$TMPDIR"/myanalysis

# = Run your analysis here =

#when done, copy the output to your /home storage
tar cf output.tar output/
cp "$TMPDIR"/myanalysis/output.tar "$HOME"/
echo "SUCCESS"
exit 0
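Since the job's scratch space is only available inside a job, the staging logic itself can be rehearsed elsewhere by passing the scratch location as an argument. The sketch below mirrors the three steps (stage in, analyse, copy back) under that assumption; `run_on_scratch` and the placeholder analysis (counting the staged files) are hypothetical and stand in for your own workload.

```shell
# Hypothetical helper mirroring the stage-in / analyse / copy-back steps.
#   $1 scratch directory (inside a job this would be "$TMPDIR")
#   $2 input directory to stage in
#   $3 destination directory for the packed results
run_on_scratch() (
    scratch=$1
    input=$2
    dest=$3
    work="$scratch/myanalysis"
    mkdir -p "$work" || exit 1
    cp -r "$input" "$work"/ || exit 1        # 1. stage input onto scratch
    cd "$work" || exit 1
    mkdir -p output || exit 1
    # 2. placeholder analysis: count the files that were staged
    find "$(basename "$input")" -type f | wc -l > output/result.txt || exit 1
    tar cf output.tar output/ || exit 1      # 3. pack the results ...
    cp output.tar "$dest"/ || exit 1         #    ... and copy them back
)

# Inside a job script this could be called as:
# run_on_scratch "$TMPDIR" "$HOME/mydata" "$HOME"
```

Each step aborts the function on failure, so a failed stage-in does not lead to an analysis of incomplete data.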
5.3. Job types
5.3.1. CPU jobs
- For regular jobs we advise always using only 1 node per job script, i.e., -N 1. If you need multi-node job execution, consider using an HPC facility instead.
- On Spider we provide 8000 MB RAM per core.
  - This means that your memory requirements can be specified via the number of cores without an extra directive for memory
  - For example, by specifying -c 4 you request 4 cores and 32000 MB RAM
- On Spider we provide 80 GB scratch disk per core.
  - This means that your scratch disk requirements can be specified via the number of cores without an extra directive for storage
  - For example, by specifying -c 2 you request 2 cores and 160 GB scratch disk
  - When you specifically target our fat nodes with 12TB available scratch, the provided scratch disk per requested core is 200 GB
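Because memory and scratch are coupled to the core count, the number of cores to request follows from whichever requirement is larger. A minimal sketch of that arithmetic, using the 8000 MB and 80 GB per-core figures from the list above; the helper names `ceil_div` and `cores_needed` are assumptions, and the fat-node figure of 200 GB is not covered.

```shell
# ceil(a / b) for positive integers
ceil_div() ( echo $(( ($1 + $2 - 1) / $2 )) )

# Hypothetical helper: cores to request for a memory need (MB) and a
# scratch need (GB) on regular nodes.
cores_needed() (
    mem_mb=$1
    scratch_gb=$2
    by_mem=$(ceil_div "$mem_mb" 8000)          # 8000 MB RAM per core
    by_scratch=$(ceil_div "$scratch_gb" 80)    # 80 GB scratch per core
    if [ "$by_mem" -ge "$by_scratch" ]; then
        echo "$by_mem"
    else
        echo "$by_scratch"
    fi
)

cores_needed 20000 100    # prints 3
```

In this example 20000 MB needs 3 cores and 100 GB of scratch needs 2, so the job would request `-c 3`.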
5.4. Slurm partitions
We have configured four CPU and two GPU partitions on Spider as shown in the table above:
- If no partition is specified, the jobs will be scheduled on the normal partition which has a maximum walltime of 120 hours and can run on any worker nodes.
- Infinite partition jobs have a maximum walltime of 720 hours. Please note that you run on this partition at your own risk. Jobs running on this partition can be killed without warning for system maintenance and we will not be responsible for data loss or loss of compute hours.
- Short partition is meant for testing jobs. It allows for 2 jobs per user with 8 cores max per job and 12 hours max walltime.
- Interactive partition is meant for testing jobs and has 12 hours maximum walltime.
- GPU V100 contains 1 Nvidia V100 (32GB) card per node.
- GPU A100 contains 2 Nvidia A100 (40GB) cards per node.
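The walltime limits above can be checked before submission with a small lookup. This is a sketch only: the partition names `infinite`, `short` and `interactive` are assumed from the descriptions in the list (only `normal` appears in a command on this page), and the GPU partitions are left out because no walltime limit is stated for them.

```shell
# Maximum walltime in hours per CPU partition, as listed above.
# Partition names other than "normal" are assumptions.
max_walltime_hours() (
    case $1 in
        normal)      echo 120 ;;
        infinite)    echo 720 ;;
        short)       echo 12 ;;
        interactive) echo 12 ;;
        *)           exit 1 ;;      # unknown partition
    esac
)

# Succeeds when the requested number of hours fits the partition limit
fits_partition() (
    limit=$(max_walltime_hours "$1") || exit 1
    [ "$2" -le "$limit" ]
)

fits_partition normal 100 && echo "fits"        # prints fits
fits_partition short 24 || echo "too long"      # prints too long
```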
5.5. Slurm constraints
5.5.1. Regular constraints
The Slurm scheduler will schedule your job on any compute node that can fulfil
the constraints that you provide with your
sbatch command upon job submission.
The minimum constraints that we ask you to provide with your job are given in the example above.
Many other constraints can also be provided with your job submission. However, by adding more constraints it may become more difficult to schedule and execute your job. See the Slurm manual (https://slurm.schedmd.com) for more information and please note that not all constraint options are implemented on Spider. In case you are in doubt then please contact our helpdesk.
5.5.2. Spider-specific constraints
In addition to the regular
sbatch constraints, we have also introduced a
number of Spider-specific constraints that are tailored to the hardware of the
compute nodes of the Spider platform.
These specific constraints need to be specified via constraint labels to
sbatch on job submission via the option --constraint.
Here a comma-separated list implies that all constraints in the list must be fulfilled before the job can be executed.
In terms of Spider-specific constraints, we support the following constraints to select specific hardware:
|SBATCH directive||Functionality||Worker Node|
As an example we provide below a bash shell script
hello_world.sh that executes a compiled C program called ‘hello’. In this script the #SBATCH line requests 2 CPU cores and specifies that the script may only be executed on a node with a skylake CPU architecture and SSD (solid state drive) local scratch disk space.
#!/bin/bash
#SBATCH -c 2 --constraint=skylake,ssd
echo "start hello script"
/home/[USERNAME]/[path-to-script]/hello
echo "end hello script"
From the command line interface the above script may be submitted to Slurm via:
sbatch hello_world.sh
Please note that not all combinations will be supported. In case you submit a combination that is not available you will receive the following error message:
‘sbatch: error: Batch job submission failed: Requested node configuration is not available’
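When the set of labels varies between jobs, the comma-separated list can be assembled rather than typed by hand. A minimal sketch, using the `skylake` and `ssd` labels from the example script; the helper name `constraint_option` is hypothetical.

```shell
# Hypothetical helper: join constraint labels with commas, which
# requires all of them to be fulfilled.
constraint_option() (
    IFS=,                      # "$*" joins arguments with the first IFS character
    echo "--constraint=$*"
)

constraint_option skylake ssd    # prints --constraint=skylake,ssd
```

The result can be passed on the command line, for example `sbatch -c 2 $(constraint_option skylake ssd) hello_world.sh`. Combinations that no node satisfies still fail with the error shown above.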
5.6. Querying compute usage
sacct and sreport are Slurm tools that allow users to query their usage from the Slurm database. The accounting tools
sacct and sreport are both documented on the Slurm documentation page.
These Slurm queries report a user's total usage. The sum of the raw CPU times (in seconds) divided by 3600 gives the total core hours for the defined period. -d produces delimited results for easier exporting / reporting.
# look into the details of your usage by job
# (-X reports on the job allocation itself, without the separate job steps)
sacct \
  -X \
  -S2020-07-01 -E2020-07-30 \
  --format=jobid,jobname,cputimeraw,user,alloccpus,state,partition,account,exitcode
#view the spexone project usage and your user's usage
sreport \
  -t second \
  -T cpu cluster \
  AccountUtilizationByUser \
  Start="2020-07-01" \
  End="2020-07-30"
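The conversion from raw CPU seconds to core hours can be sketched as a small filter that sums one value per line and divides by 3600. The function name `sum_core_hours` is an assumption made for this example.

```shell
# Hypothetical helper: read raw CPU seconds (one value per line) from
# stdin and print the total in core hours with two decimals.
sum_core_hours() {
    awk '{ total += $1 } END { printf "%.2f\n", total / 3600 }'
}

printf '3600\n7200\n1800\n' | sum_core_hours    # prints 3.50
```

With real accounting data the input could come from something like `sacct -X -n -P -S2020-07-01 -E2020-07-30 --format=cputimeraw | sum_core_hours`, where `-n` suppresses the header line and `-P` produces plain delimited output.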
Still need help? Contact our helpdesk