Cedar cluster

Cedar (GP2) is a heterogeneous HPC cluster located at Simon Fraser University and operated by WestGrid. It is one of several new Compute Canada national systems.

Nodes

There are seven kinds of CPU nodes:

Type           Count  Memory  Cores  Total cores  Memory/core
basecompute      575  128 GB     32       18,400         4 GB
basecomputev2    640  192 GB     48       30,720         4 GB
basecomputev3    768  192 GB     48       36,864         4 GB
largecompute      94  256 GB     32        3,008         8 GB
bigmem512         24  512 GB     32          768        16 GB
bigmem1500        24  1.5 TB     32          768        48 GB
bigmem3000         4    3 TB     32          128        96 GB
Total          2,129                      90,656

The first four types are in the base partitions, and the last three are in the large partitions. The memory amounts are nominal; the memory actually usable on each node is slightly less.
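
The same information can be pulled from Slurm itself. As a rough sketch (sinfo reports per partition, so node configurations repeat across partitions), this lists node count, CPUs per node, and memory per node in MB for each distinct configuration:

sinfo --exact --noheader -o '%D %c %m %P' | column -t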

There are also 338 GPU nodes with a variety of specifications, equipped with NVIDIA P100 or V100 cards.

All of the processors in the cluster are Xeons, clocked anywhere from 2.1 GHz to 2.4 GHz, and spanning several architectures.

Partitions

There are five special CPU partitions: cpubase_interac, cpularge_interac, cpubackfill, c12hbackfill, and cpupreempt. The first two are for interactive jobs (maximum time: 3 hours). The next two are automatically added to the list of partitions for jobs not exceeding 24 or 12 hours, respectively, and are used for backfilling lower priority short jobs into otherwise wasted slots. The cpubackfill partition includes some of every kind of CPU node, so even jobs requiring huge amounts of memory can get backfilled, and c12hbackfill even includes contributed nodes. The last of the special partitions has a time limit of 122 days (4 months), but must be requested manually and has PreemptMode=REQUEUE, so jobs from other partitions allocated to the same hardware will cause jobs on this partition to be restarted.
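
As a rough sketch (the script name and resource amounts are placeholders), an explicitly requested preemptible job might look like the following; since it can be requeued whenever another job is allocated the same hardware, the job script should be able to restart from a checkpoint:

sbatch --partition=cpupreempt --time=60-00:00:00 \
       --cpus-per-task=4 --mem-per-cpu=4000M \
       --requeue my_restartable_job.sh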

The other 24 CPU partitions are of the form cpuX_Y_bZ, where X is either base or large, Y is either bycore or bynode, and Z is a time bucket from 1 to 6 (described below); for example, cpularge_bycore_b4. The difference between bycore and bynode is whether the entire node is allocated to the job: a job lands in a bynode partition when it is submitted with --exclusive, or otherwise depending on its core and memory requirements (gory details in /etc/slurm/job_submit.lua).
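
To see these partitions and their time limits directly, something along these lines works (just sinfo output filtered to the naming pattern):

sinfo --noheader -o '%P %l %D' | grep -E '^cpu(base|large)_by(core|node)_b[1-6]' | sort | column -t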

The base memory bucket is used by default (88,992 cores). If the memory requirements call for it, the large bucket is used instead (1,664 cores).

The time buckets are much more straightforward:

Bucket  Time (hours)  Time (days)
b1                 3
b2                12
b3                24            1
b4                72            3
b5               168            7
b6               672           28

Every job goes into the smallest time bucket that is large enough to contain it.
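
For example, an 18-hour job does not fit in b1 or b2, so it lands in b3. The real routing happens in job_submit.lua; the following is only a minimal sketch of the bucket selection logic:

# Minimal sketch of the time bucket logic; the actual decision lives in job_submit.lua.
bucket_for_hours() {
  local h=$1
  if   [ "$h" -le 3 ];   then echo b1
  elif [ "$h" -le 12 ];  then echo b2
  elif [ "$h" -le 24 ];  then echo b3
  elif [ "$h" -le 72 ];  then echo b4
  elif [ "$h" -le 168 ]; then echo b5
  else                        echo b6
  fi
}
bucket_for_hours 18   # prints b3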

The total number of cores in each partition is as follows:

Bucket  cpubase_bycore  cpubase_bynode  cpularge_bycore  cpularge_bynode
b1              68,928          88,352            1,088            1,600
b2              68,928          88,352            1,088            1,600
b3              59,840          76,192            1,088            1,600
b4              48,608          61,888              704            1,120
b5              34,784          44,992              288              544
b6              20,960          27,616              160              224

Here's a shell one-liner to see the number of free and total cores in each partition:

# sinfo's %C prints allocated/idle/other/total core counts; fields 2 and 4 are idle (free) and total
for x in base large; do for y in core node; do for z in {1..6}; do partition="cpu${x}_by${y}_b${z}"; echo -n "$partition "; sinfo --noheader -o '%C' -p "$partition" | tr / ' ' | awk '{print $2, $4}'; done; done; done | column -t

It's important not to run this too frequently to avoid further stressing the scheduler, which is (seemingly permanently) under heavy load.

Examples

Some concrete examples of where jobs end up based on their parameters:

sbatch arguments                       Partition
-c 1 --mem-per-cpu 2000 -t 18:00:00    cpubase_bycore_b3,cpubackfill
-c 48 --mem 2000 -t 28-0               cpubase_bynode_b6
-c 1 --mem 500G -t 3-0                 cpularge_bycore_b4
-c 32 --mem-per-cpu 9000 -t 3:00:00    cpularge_bynode_b1,cpubackfill,c12hbackfill
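
To check where a given submission lands, submit it (the script name below is a placeholder) and look at the partition column in squeue:

sbatch -c 1 --mem-per-cpu 2000 -t 18:00:00 my_job.sh
squeue -u "$USER" -o '%i %P %j %T'   # %P shows the partition(s) the job was routed to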

See also