Cedar cluster
2020-04-22
Cedar (GP2) is a heterogeneous HPC cluster located at Simon Fraser University and operated by WestGrid. It is one of several new Compute Canada national systems.
Nodes
There are seven kinds of CPU nodes:
| Type | Count | Memory | Cores | Total cores | Memory/core | 
|---|---|---|---|---|---|
| basecompute | 575 | 128 GB | 32 | 18,400 | 4 GB | 
| basecomputev2 | 640 | 192 GB | 48 | 30,720 | 4 GB | 
| basecomputev3 | 768 | 192 GB | 48 | 36,864 | 4 GB | 
| largecompute | 94 | 256 GB | 32 | 3,008 | 8 GB | 
| bigmem512 | 24 | 512 GB | 32 | 768 | 16 GB | 
| bigmem1500 | 24 | 1.5 TB | 32 | 768 | 48 GB | 
| bigmem3000 | 4 | 3 TB | 32 | 128 | 96 GB | 
| Total | 2,129 | | | 90,656 | |
The first four types are in the base partitions, and the last three are in the large partitions.
The memory amounts are nominal; the memory actually usable by jobs may be slightly less.
There are also 338 GPU nodes of varying specifications, equipped with P100 or V100 cards.
All of the processors in the cluster are Xeons, clocked anywhere from 2.1 GHz to 2.4 GHz, and spanning several architectures.
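As a sanity check on the CPU node table above, the totals can be recomputed from the per-type counts. This is only a sketch: the `cpu_nodes` variable and the `node_totals` function are names made up here, and the rows are copied by hand from the table.

```shell
# One row per CPU node type: name, node count, cores per node (from the table above).
cpu_nodes='basecompute 575 32
basecomputev2 640 48
basecomputev3 768 48
largecompute 94 32
bigmem512 24 32
bigmem1500 24 32
bigmem3000 4 32'

# Sum nodes and cores; rows 1-4 are the base types, rows 5-7 the large ones.
node_totals() {
    printf '%s\n' "$cpu_nodes" | awk '
        { nodes += $2; cores += $2 * $3 }
        NR <= 4 { base += $2 * $3 }
        NR >  4 { large += $2 * $3 }
        END { print "nodes=" nodes, "cores=" cores, "base=" base, "large=" large }'
}

node_totals
# prints: nodes=2129 cores=90656 base=88992 large=1664
```

The base and large figures agree with the core counts quoted for the two memory buckets in the Partitions section.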
Partitions
There are five special CPU partitions: cpubase_interac, cpularge_interac, cpubackfill, c12hbackfill, and cpupreempt.
The first two are for interactive jobs (maximum time: 3 hours).
The next two are automatically added to the list of partitions for jobs not exceeding 24 or 12 hours, respectively, and are used for backfilling lower priority short jobs into otherwise wasted slots.
The cpubackfill partition includes some of every kind of CPU node, so even jobs requiring huge amounts of memory can get backfilled, and c12hbackfill even includes contributed nodes.
The last of the special partitions has a time limit of 122 days (4 months), but must be requested manually and has PreemptMode=REQUEUE, so jobs from other partitions allocated to the same hardware will cause jobs on this partition to be restarted.
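Since cpupreempt has to be requested by hand and its jobs can be requeued at any time, a job script for it would probably look something like the sketch below. The resource requests are hypothetical, the `run_label` function is a made-up helper, and the checkpointing it alludes to is assumed to be handled by the workload itself. `SLURM_RESTART_COUNT` is the environment variable Slurm sets once a job has been restarted.

```shell
#!/bin/bash
#SBATCH --partition=cpupreempt    # not assigned automatically; must be named explicitly
#SBATCH --time=30-00:00:00        # hypothetical request, well under the 122-day limit
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2000

# SLURM_RESTART_COUNT is unset on the first run, so default it to 0.
run_label() {
    local restarts="${SLURM_RESTART_COUNT:-0}"
    if [ "$restarts" -gt 0 ]; then
        echo "restart $restarts: resume from the last checkpoint"
    else
        echo "first run: start from scratch"
    fi
}

run_label
# The real workload would go here, saving checkpoints as it runs (hypothetical).
```

The point of the check is that a requeued job starts its script over from the top, so anything not saved to disk before the preemption is lost.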
The other 24 CPU partitions are of the form cpuX_Y_bZ, where X is either base or large, Y is either bycore or bynode, and Z is a digit between 1 and 6; for example, cpularge_bycore_b4.
The difference between bycore and bynode is whether the entire node is allocated to the job; this happens when the job is submitted as exclusive, or otherwise depending on the core and memory requirements (gory details in /etc/slurm/job_submit.lua).
The base memory bucket is used by default (88,992 cores).
If the memory requirements call for it, the large bucket is used instead (1,664 cores).
The time buckets are much more straightforward:
| Bucket | Time (hours) | Time (days) | 
|---|---|---|
| b1 | 3 | |
| b2 | 12 | |
| b3 | 24 | 1 | 
| b4 | 72 | 3 | 
| b5 | 168 | 7 | 
| b6 | 672 | 28 | 
Every job goes into the smallest time bucket that is large enough to contain it.
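The bucket rule can be sketched as a small shell function. This is an illustration of the rule as stated, not the scheduler's actual code; `time_bucket` is a hypothetical name, and it assumes the requested time has already been rounded up to whole hours.

```shell
# Print the smallest time bucket whose limit (in hours) covers the request.
time_bucket() {
    local hours="$1" i=1 limit
    for limit in 3 12 24 72 168 672; do
        if [ "$hours" -le "$limit" ]; then
            echo "b$i"
            return 0
        fi
        i=$((i + 1))
    done
    echo "no bucket holds $hours hours" >&2
    return 1
}

time_bucket 18    # prints: b3
time_bucket 72    # prints: b4
```

A request of exactly 72 hours still fits in b4, while 73 hours would spill over into b5.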
The total number of cores in each partition is as follows:
| Bucket | cpubase_bycore | cpubase_bynode | cpularge_bycore | cpularge_bynode |
|---|---|---|---|---|
| b1 | 68,928 | 88,352 | 1,088 | 1,600 | 
| b2 | 68,928 | 88,352 | 1,088 | 1,600 | 
| b3 | 59,840 | 76,192 | 1,088 | 1,600 | 
| b4 | 48,608 | 61,888 | 704 | 1,120 | 
| b5 | 34,784 | 44,992 | 288 | 544 | 
| b6 | 20,960 | 27,616 | 160 | 224 | 
Here's a shell one-liner to see the number of free (idle) cores in each partition, alongside the partition's total:
```shell
for x in base large; do for y in core node; do for z in {1..6}; do partition="cpu${x}_by${y}_b${z}"; echo -n "$partition "; sinfo --noheader -o '%C' -p "$partition" | tr / ' ' | awk '{print $2, $4}'; done; done; done | column -t
```
Don't run this too frequently: each run adds further stress to the scheduler, which is (seemingly permanently) under heavy load.
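One way to keep the query rate down is to cache the output and replay it while it is still fresh. The sketch below is a generic wrapper, not something provided on Cedar: `cached` is a hypothetical function, and the cache file layout (a timestamp on the first line, the command's output after it) is an arbitrary choice made for this example.

```shell
# Run a command at most once per TTL seconds; otherwise replay its cached output.
# Usage: cached TTL_SECONDS CACHE_FILE COMMAND [ARGS...]
cached() {
    local ttl="$1" cache="$2" now stamp
    shift 2
    now=$(date +%s)
    # The first line of the cache file holds the time it was written.
    if [ -f "$cache" ] && read -r stamp < "$cache" && [ $((now - stamp)) -lt "$ttl" ]; then
        tail -n +2 "$cache"
        return 0
    fi
    # Stale or missing: rerun the command and replace the cache only on success.
    { echo "$now"; "$@"; } > "$cache.tmp" && mv "$cache.tmp" "$cache" && tail -n +2 "$cache"
}
```

With the one-liner wrapped in a function (say `free_cores`, also a made-up name), `cached 300 "$HOME/.free_cores_cache" free_cores` would hit the scheduler at most once every five minutes, no matter how often it is called.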
Examples
Some concrete examples of where jobs end up based on their parameters:
| sbatch arguments | Partition | 
|---|---|
| -c 1 --mem-per-cpu 2000 -t 18:00:00 | cpubase_bycore_b3,cpubackfill | 
| -c 48 --mem 2000 -t 28-0 | cpubase_bynode_b6 | 
| -c 1 --mem 500G -t 3-0 | cpularge_bycore_b4 | 
| -c 32 --mem-per-cpu 9000 -t 3:00:00 | cpularge_bynode_b1,cpubackfill,c12hbackfill |
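The partition lists in this table can be reproduced by combining the naming scheme, the time buckets, and the two backfill rules. The sketch below takes the memory class (base or large) and the allocation mode (bycore or bynode) as inputs rather than deriving them, since that logic lives in job_submit.lua and is not reproduced here. `predict_partitions` is a hypothetical name, and the time is assumed to be given in whole hours.

```shell
# Usage: predict_partitions base|large bycore|bynode HOURS
predict_partitions() {
    local mem="$1" alloc="$2" hours="$3" i=1 limit bucket=""
    # Smallest time bucket that is large enough for the request.
    for limit in 3 12 24 72 168 672; do
        if [ "$hours" -le "$limit" ]; then
            bucket="b$i"
            break
        fi
        i=$((i + 1))
    done
    if [ -z "$bucket" ]; then
        echo "no bucket holds $hours hours" >&2
        return 1
    fi
    local out="cpu${mem}_${alloc}_${bucket}"
    # Backfill partitions are appended for jobs of at most 24 and 12 hours.
    if [ "$hours" -le 24 ]; then out="$out,cpubackfill"; fi
    if [ "$hours" -le 12 ]; then out="$out,c12hbackfill"; fi
    echo "$out"
}

predict_partitions base bycore 18    # prints: cpubase_bycore_b3,cpubackfill
predict_partitions large bynode 3    # prints: cpularge_bynode_b1,cpubackfill,c12hbackfill
```

The remaining two rows of the table (28 days and 3 days) come out as a single partition each, because both exceed the 24-hour backfill cutoff.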