Andromeda Linux Cluster
=======================

The `Andromeda Linux Cluster `_ runs the SLURM scheduler to execute jobs.

.. note::
   If necessary, contact `Professor Wei `_ to request a BC ID. Additional information can be found on the `Boston College website `_.

Connecting to the cluster
-------------------------

.. note::
   If you are off campus, you must first connect to `Eagle VPN `_, which is available on Windows, Mac, and Linux.

Users can log into the cluster via Secure Shell (SSH). Below are some helpful commands:

- Connect to the login node of Andromeda from your local machine: ``ssh {user}@andromeda.bc.edu``
- Run an interactive bash shell from the login node (CPU only): ``interactive``
- Run an interactive bash shell from the login node (with GPU): ``srun -t 12:00:00 -N 1 -n 1 --mem=32gb --partition=gpua100 --gpus-per-node=1 --pty bash``

.. note::
   Resources on the login node are limited and shared among all users. Users spending a protracted session on the cluster are asked to run ``interactive`` as a courtesy to other users.

.. tip::
   It is recommended to set up `passwordless SSH login `_ for both convenience and security.

Filesystem
----------

- The BC-CV group's main directory is located at ``/mmfs1/data/projects/weilab``. It is commonly used to share large files such as datasets.
- Each user's home directory is located at ``/mmfs1/data/{user}``. Daily backups are made automatically; they can be found at ``/mmfs1/data/{user}/.snapshots``.
- Each user also has a directory ``/scratch/{user}`` that can store large temporary files. Unlike the home directory, it is not backed up.

.. note::
   Users should contact `Wei Qiu `_ and ask to be added to Professor Wei's group in order to get access to ``/mmfs1/data/projects/weilab``.

Modules
-------

Andromeda uses the `Modules package `_ to manage software packages that modify the shell environment.

- List available modules: ``module avail``
- List loaded modules: ``module list``
- Load a module: ``module load {module}``
- Unload all modules: ``module purge``

.. tip::
   To avoid having to load modules every time you SSH in, append the ``module load`` commands to the end of your ``~/.tcshrc`` file.

Conda
-----

It is recommended to use Conda (a package and environment manager for Python) to minimize conflicts between projects' dependencies. For a primer on Conda, see the following `cheatsheet `_. To use Conda, load the ``anaconda`` module.

.. tip::
   Using `Mamba `_ (a drop-in replacement for Conda) can significantly speed up package installation.
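As a minimal sketch, creating a fresh environment for a project looks like the following (the environment name ``{my_env}``, the Python version, and the installed packages are placeholders; adjust them to your project):

.. code-block:: bash

   module load anaconda

   # Create an isolated environment with a pinned Python version.
   conda create --name {my_env} python=3.10

   # Activate the environment and install dependencies into it.
   conda activate {my_env}
   conda install numpy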
SLURM
-----

Although long-running tasks can technically be run on ``l001`` (by using ``screen`` or ``tmux``), computationally intensive jobs should be run through the SLURM scheduler. To use SLURM, load the ``slurm`` module.

- View the status of all nodes: ``sinfo``
- View detailed info on all nodes: ``sinfo --Node --long``
- View all queued jobs: ``squeue``
- View your queued jobs: ``squeue -u {user}``
- Submit a job: ``sbatch {job_script}``

A basic SLURM job script is provided below; more details can be found `here `_.

.. code-block:: bash

   #!/bin/tcsh -e
   #SBATCH --job-name=example-job        # job name
   #SBATCH --nodes=1                     # how many nodes to use for this job
   #SBATCH --ntasks=1
   #SBATCH --cpus-per-task=8             # how many CPU cores (per task) to use for this job
   #SBATCH --gpus-per-task=1             # how many GPUs (per task) to use for this job
   #SBATCH --mem=16GB                    # how much RAM (per node) to allocate
   #SBATCH --time=120:00:00              # job execution time limit, formatted hrs:min:sec
   #SBATCH --mail-type=BEGIN,END,FAIL    # mail events (NONE, BEGIN, END, FAIL, ALL)
   #SBATCH --mail-user={user}@bc.edu     # where to send mail
   #SBATCH --partition=gpuv100,gpua100   # see sinfo for available partitions
   #SBATCH --output=main_%j.out          # output file name (%j expands to the job ID)

   hostname                              # print the node the job is running on

   module purge                          # clear all modules
   module load slurm                     # allow sub-scripts to use SLURM commands
   module load anaconda
   conda activate {my_env}

   {more commands...}

Advanced
--------

Port Forwarding
###############

Port forwarding is useful for accessing services from a job (e.g. Jupyter notebooks, TensorBoard). Suppose that a user is running a service on node ``{node}`` that is exposed on port ``{port_2}``. To receive the node's service on your local machine, run the first command below from your local machine and the second from the login node session it opens:

.. code-block:: bash

   ssh {user}@andromeda.bc.edu -L {local_port}:localhost:{port_1}
   ssh -T -N {node} -L {port_1}:localhost:{port_2}

FAQ
---

- **Q:** My SLURM job running Python raises an ``ImportError`` despite having ``module load anaconda; conda activate {my_env}`` in the script.

  **A:** Try adding ``which python`` to the beginning of the script to see which Python binary is being used. If it is not the binary of your Conda environment, hardcode the path to the Python binary.

- **Q:** My PyTorch model returns incorrect numerical results when running on A100 nodes but works fine on V100 nodes.

  **A:** Add the following lines to your Python script (`source `_):

  .. code-block:: python

     import torch
     torch.backends.cuda.matmul.allow_tf32 = False
     torch.backends.cudnn.allow_tf32 = False
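  To verify that disabling TF32 actually takes effect on the allocated GPU, a quick sanity check along the following lines can be run inside the job. This is a minimal sketch: the matrix size is arbitrary, and it assumes PyTorch with CUDA support is installed in the active Conda environment.

  .. code-block:: python

     import torch

     # Disable TF32 so FP32 matrix multiplications are not rounded to TF32 inputs.
     torch.backends.cuda.matmul.allow_tf32 = False
     torch.backends.cudnn.allow_tf32 = False

     a = torch.randn(1024, 1024, device="cuda")
     b = torch.randn(1024, 1024, device="cuda")

     # Compare an FP32 matmul against a float64 reference; with TF32 disabled,
     # the maximum absolute error should be noticeably smaller than with it enabled.
     result = a @ b
     reference = (a.double() @ b.double()).float()
     print("max abs error:", (result - reference).abs().max().item())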