Andromeda Linux Cluster
The Andromeda Linux Cluster runs the SLURM scheduler to execute jobs.
Note
If necessary, contact Professor Wei to request a BC ID. Additional information can be found on the Boston College website.
Connecting to the cluster
Note
If you are off campus, you must first connect to Eagle VPN, which is available on Windows, Mac, and Linux.
Users can log into the cluster via Secure Shell. Below are some helpful commands:
Connect to login node of Andromeda from local machine:
ssh {user}@andromeda.bc.edu
Run an interactive bash shell from login node (CPU only):
interactive
Run an interactive bash shell from login node (with GPU):
srun -t 12:00:00 -N 1 -n 1 --mem=32gb --partition=gpua100 --gpus-per-node=1 --pty bash
Note
Resources on the login node are limited and shared among all users. Users who spend extended sessions on the cluster are asked to run interactive as a courtesy to other users.
Tip
It is recommended to set up passwordless SSH login for both convenience and security.
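A minimal sketch of the standard key-based setup (the key type and paths are OpenSSH defaults, not cluster-specific requirements; run these on your local machine):
# Generate a key pair, accepting the default file location
ssh-keygen -t ed25519
# Append the public key to ~/.ssh/authorized_keys on the cluster
ssh-copy-id {user}@andromeda.bc.edu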
Filesystem
The BC-CV’s main directory is located at /mmfs1/data/projects/weilab. It’s commonly used to share large files like datasets. The user home directory is located at /mmfs1/data/{user}. Daily backups are made automatically; these backups are found at /mmfs1/data/{user}/.snapshots.
Each user has a directory /scratch/{user} that can store large temporary files. Unlike the home directory, it isn’t backed up.
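For example (the archive name is illustrative), a shared dataset can be staged on scratch for the duration of a job:
# Unpack a copy of a shared dataset into scratch; this copy is temporary and not backed up
tar -xf /mmfs1/data/projects/weilab/{dataset}.tar -C /scratch/{user}/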
Note
Users should contact Wei Qiu and ask to be added to Professor Wei’s group in order to get access to /mmfs1/data/projects/weilab.
Modules
Andromeda uses the Modules package to manage packages that influence the shell environment.
List available modules:
module avail
List loaded modules:
module list
Load a module:
module load {module}
Unload all modules:
module purge
Tip
To avoid having to load modules every time you SSH, you can append the module load commands to the end of your ~/.tcshrc file.
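For example (which modules to auto-load is up to you), appending these lines to ~/.tcshrc loads them at every login:
module load slurm
module load anaconda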
Conda
It is recommended to use Conda (a Python package manager) to minimize conflicts between projects’ dependencies. For a primer on Conda, see the following cheatsheet. To use Conda, load the anaconda module.
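A minimal sketch of a per-project workflow (the environment name and Python version are illustrative):
module load anaconda
# Create an isolated environment for the project, then activate it
conda create -n {my_env} python=3.10
conda activate {my_env}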
Tip
Using Mamba (a drop-in replacement for Conda) can significantly speed up package installation.
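A sketch of one way to set it up, assuming it isn’t already provided by the anaconda module (installing into the base environment from conda-forge is the commonly documented route):
# Install mamba into the base environment
conda install -n base -c conda-forge mamba
# mamba accepts the same subcommands as conda, e.g.:
mamba install -n {my_env} pytorch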
SLURM
Although long-running tasks can technically be run on l001 (using screen or tmux), computationally intensive jobs should be submitted through the SLURM scheduler. To use SLURM, load the slurm module.
To view statuses of all nodes:
sinfo
To view detailed info of all nodes:
sinfo --Node --long
To view all queued jobs:
squeue
To view your queued jobs:
squeue -u {user}
To submit a job:
sbatch {job_script}
A basic SLURM job script is provided below; more details can be found here.
#!/bin/tcsh -e
#SBATCH --job-name=example-job # job name
#SBATCH --nodes=1 # how many nodes to use for this job
#SBATCH --ntasks=1                   # how many tasks (processes) to launch
#SBATCH --cpus-per-task=8            # how many CPU-cores (per node) to use for this job
#SBATCH --gpus-per-task=1            # how many GPUs (per node) to use for this job
#SBATCH --mem=16GB                   # how much RAM (per node) to allocate
#SBATCH --time=120:00:00             # job execution time limit, formatted hrs:min:sec
#SBATCH --mail-type=BEGIN,END,FAIL   # mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user={user}@bc.edu    # where to send mail
#SBATCH --partition=gpuv100,gpua100  # see sinfo for available partitions
#SBATCH --output=main_%j.out         # stdout file name; %j expands to the job ID
hostname # print the node which the job is running on
module purge # clear all modules
module load slurm # to allow sub-scripts to use SLURM commands
module load anaconda
conda activate {my_env}
{more commands...}
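To submit this script and follow its output (the file name matches the --output directive above, with %j replaced by the job ID):
sbatch {job_script}
tail -f main_{job_id}.out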
Advanced
Port Forwarding
Port forwarding is useful for accessing services from a job (e.g. Jupyter notebooks, TensorBoard). Suppose that a user is running a service on node {node} that is exposed on port {port_2}. To receive the node’s service on your local machine:
ssh {user}@andromeda.bc.edu -L {local_port}:localhost:{port_1} ssh -T -N {node} -L {port_1}:localhost:{port_2}
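This chains two tunnels: one from your local machine to the login node, and one from the login node to the compute node. For example (the ports are illustrative), to reach a Jupyter server on port 8888 of node {node} at localhost:8888 on your machine:
ssh {user}@andromeda.bc.edu -L 8888:localhost:9000 ssh -T -N {node} -L 9000:localhost:8888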
FAQ
Q: My SLURM job running Python raises an ImportError despite having module load anaconda; conda activate {my_env} in the script
A: Try adding which python to the beginning of the script to see which Python binary is being used. If it is not the binary of your conda environment, hardcode the path to the Python binary.
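For example (the environment path is an assumption based on Conda’s default layout, and the script name is illustrative; use whatever which python reports while the environment is active):
# Call the environment's interpreter directly instead of relying on PATH
/mmfs1/data/{user}/.conda/envs/{my_env}/bin/python {my_script}.py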
Q: My PyTorch model returns incorrect numerical results when running on A100 nodes but works fine on V100 nodes
A: A100 (Ampere) GPUs use lower-precision TF32 math for matrix multiplications by default, while V100 GPUs do not support TF32. Add the following lines to your Python script to disable it (source):
import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False