Andromeda Linux Cluster
The Andromeda Linux Cluster runs the SLURM scheduler to execute jobs.
Note
If necessary, contact Professor Wei to request a BC ID. Additional information can be found on the Boston College website.
Connecting to the cluster
Note
If you are off campus, it’s necessary to first connect to Eagle VPN, which is available on Windows, Mac, and Linux.
Users can log into the cluster via Secure Shell. Below are some helpful commands:
Connect to the Andromeda login node from your local machine:
ssh {user}@andromeda.bc.edu
Run an interactive bash shell from the login node (CPU only):
interactive
Run an interactive bash shell from the login node (with GPU):
srun -t 12:00:00 -N 1 -n 1 --mem=32gb --partition=gpua100 --gpus-per-node=1 --pty bash
Note
Resources on the login node are limited and shared among all users. Users who plan an extended session on the cluster are asked to run interactive as a courtesy to other users.
Tip
It is recommended to set up passwordless SSH login for both convenience and security.
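One common way to set this up from a local machine with OpenSSH (a sketch; the key type is a reasonable default, not a cluster requirement):
# On your local machine: generate a key pair if you don't have one already
ssh-keygen -t ed25519
# Copy the public key to Andromeda; you will be prompted for your password once
ssh-copy-id {user}@andromeda.bc.edu
# Subsequent logins should no longer prompt for a password
ssh {user}@andromeda.bc.edu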
Filesystem
The BC-CV’s main directory is located at /mmfs1/data/projects/weilab. It is commonly used to share large files like datasets.
The user home directory is located at /mmfs1/data/{user}. Daily backups are made automatically; these backups are found at /mmfs1/data/{user}/.snapshots.
Each user has a directory /scratch/{user} that can store large temporary files. Unlike the home directory, it is not backed up.
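For example, an accidentally deleted file can usually be recovered by copying it back from a snapshot (the snapshot and file names below are placeholders; list the directory to see what is actually available):
# See which snapshots exist
ls /mmfs1/data/{user}/.snapshots
# Copy a file from a snapshot back into the home directory
cp /mmfs1/data/{user}/.snapshots/{snapshot}/path/to/file ~/path/to/file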
Note
Users should contact Wei Qiu and ask to be added to Professor Wei’s group in order to get access to /mmfs1/data/projects/weilab.
Modules
Andromeda uses the Modules package to manage software packages that modify the shell environment.
List available modules:
module avail
List loaded modules:
module list
Load a module:
module load {module}
Unload all modules:
module purge
Tip
To avoid having to load modules every time you SSH in, you can append the module load commands to the end of your ~/.tcshrc file.
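For example, to load the modules used elsewhere on this page on every login (assuming your login shell is tcsh):
# Append module load commands to ~/.tcshrc (run once)
echo "module load slurm" >> ~/.tcshrc
echo "module load anaconda" >> ~/.tcshrc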
Conda
It is recommended to use Conda (a Python package manager) to minimize conflicts between projects’ dependencies. For a primer on Conda, see the following cheatsheet. To use Conda, load the anaconda module.
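A typical first-time setup might look like the following (the environment name and Python version are placeholders):
module load anaconda
# Create and activate a per-project environment
conda create -n {my_env} python=3.10
conda activate {my_env}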
Tip
Using Mamba (a drop-in replacement for Conda) can significantly speed up package installation.
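If the cluster’s anaconda module does not already ship Mamba, one option (a sketch; assumes you can create your own environments and reach the conda-forge channel) is to install Mamba into a personal environment and use it from there:
# Create a personal environment containing mamba itself
conda create -n mamba -c conda-forge mamba
conda activate mamba
# Then use mamba in place of conda, e.g.:
mamba create -n {my_env} python=3.10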
SLURM
Although long-running tasks can technically be run on l001 (using screen or tmux), computationally intensive jobs should be run through the SLURM scheduler. To use SLURM, load the slurm module.
To view statuses of all nodes:
sinfo
To view detailed info of all nodes:
sinfo --Node --long
To view all queued jobs:
squeue
To view your queued jobs:
squeue -u {user}
To submit a job:
sbatch {job_script}
A basic SLURM job script is provided below; more details can be found here.
#!/bin/tcsh -e
#SBATCH --job-name=example-job # job name
#SBATCH --nodes=1 # how many nodes to use for this job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task 8 # how many CPU-cores (per node) to use for this job
#SBATCH --gpus-per-task 1 # how many GPUs (per node) to use for this job
#SBATCH --mem=16GB # how much RAM (per node) to allocate
#SBATCH --time=120:00:00 # job execution time limit formatted hrs:min:sec
#SBATCH --mail-type=BEGIN,END,FAIL # mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user={user}@bc.edu # where to send mail
#SBATCH --partition=gpuv100,gpua100 # see sinfo for available partitions
#SBATCH --output=main_%j.out # stdout/stderr file; %j expands to the job ID
hostname # print the node which the job is running on
module purge # clear all modules
module load slurm # to allow sub-scripts to use SLURM commands
module load anaconda
conda activate {my_env}
{more commands...}
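After saving the script (say as job.sh; the filename and job ID below are placeholders), a typical submit-and-monitor workflow looks like:
sbatch job.sh # submit; SLURM prints the assigned job ID
squeue -u {user} # check the job's status in the queue
tail -f main_{job_id}.out # follow the job's output (%j in --output becomes the job ID)
scancel {job_id} # cancel the job if something goes wrong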
Advanced
Port Forwarding
Port forwarding is useful for accessing services running inside a job (e.g. Jupyter notebooks, TensorBoard). Suppose a user is running a service on node {node} that listens on port {port_2}. The command below chains two forwards: your local {local_port} is forwarded to port {port_1} on the login node, which is in turn forwarded to port {port_2} on the compute node. To receive the node’s service on your local machine:
ssh {user}@andromeda.bc.edu -L {local_port}:localhost:{port_1} ssh -T -N {node} -L {port_1}:localhost:{port_2}
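As a concrete (hypothetical) instance, suppose a Jupyter notebook is running on node g001 and listening on port 8888, with 8889 chosen as the intermediate port on the login node:
ssh {user}@andromeda.bc.edu -L 8888:localhost:8889 ssh -T -N g001 -L 8889:localhost:8888
# Then open http://localhost:8888 in a local browser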
FAQ
Q: My SLURM job running Python raises ImportError despite having module load anaconda; conda activate {my_env} in the script.
A: Try adding which python to the beginning of the script to see which Python binary is being used. If it is not the binary of your conda environment, hardcode the path to the Python binary.
Q: My PyTorch model returns incorrect numerical results when running on A100 nodes but works fine on V100 nodes.
A: Add the following lines to your Python script (source):
import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False