DROID-SLAM HPC Port

Overview

DROID-SLAM is a deep-learning-based visual SLAM system developed at Princeton. During my research internship at the University of Cape Town (UCT), I was tasked with making it run reliably on UCT's HPC cluster and building tooling to manage the whole workflow from a local machine.

The Challenges

Running deep-learning pipelines on an HPC cluster is rarely plug-and-play:

No X server: the cluster has no display server, so DROID-SLAM's built-in visualiser cannot run and --disable_vis is mandatory. Reconstruction export was tied to the visualiser, so disabling it also broke the standard export path
No outbound internet on compute nodes: all dependencies and container images must be pre-staged
SLURM job queues: GPU jobs are submitted as batch scripts with no interactive debugging
Dependency isolation: DROID-SLAM requires specific PyTorch and CUDA extension versions that are incompatible with the cluster's global modules

DROIDSLAMCLI — The Go CLI

The core deliverable was a full Go CLI application built with the Cobra and Viper frameworks. It wraps the entire DROID-SLAM workflow — validating inputs, templating SLURM scripts, uploading data to the cluster, submitting jobs, monitoring progress, and pulling results back — all over SSH/SFTP from a local machine.

Commands

# Submit an inference job
droidslamcli infer --config config.yaml

# Submit a training job
droidslamcli train --config config.yaml

# Check running jobs
droidslamcli status

# Stream stdout from a specific job
droidslamcli status --jobID=<jobID>

# Extract results for a specific job
droidslamcli extract infer --jobId=<jobID> --location=./results --config config.yaml

# Extract results for all jobs
droidslamcli extract infer -a --location=./results --config config.yaml
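
How the status command gets its data is not covered here. As a minimal sketch, assume it runs squeue -h -o "%i %T" on the cluster (SLURM job ID and state, no header line) and parses the text that comes back. The Job type and parseSqueue helper are hypothetical names, and the IDs shown are SLURM's own numeric IDs rather than the CLI's job IDs.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Job is a hypothetical record for one line of squeue output.
type Job struct {
	ID    string // SLURM job ID, e.g. "12345"
	State string // extended state name, e.g. "RUNNING" or "PENDING"
}

// parseSqueue parses the output of `squeue -h -o "%i %T"`:
// one job per line, job ID and state separated by whitespace.
// Blank or malformed lines are skipped.
func parseSqueue(out string) []Job {
	var jobs []Job
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		jobs = append(jobs, Job{ID: fields[0], State: fields[1]})
	}
	return jobs
}

func main() {
	// Sample text standing in for what the remote command would return.
	sample := "12345 RUNNING\n12346 PENDING\n"
	for _, j := range parseSqueue(sample) {
		fmt.Printf("%s\t%s\n", j.ID, j.State)
	}
}
```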

SSH / SFTP

The CLI communicates with the HPC cluster entirely over SSH. Commands such as sbatch and squeue, along with remote directory setup, are executed as remote shell commands, while input files (images, calibration data, model weights) are uploaded and results retrieved via SFTP. The entire workflow, from job submission to results extraction, can therefore be driven from a local Windows or macOS machine without manually running scp or logging in to the cluster.
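
One detail worth illustrating: an SSH session hands the remote side a single command string, which the remote shell then parses, so any path containing spaces or quotes has to be escaped before it is sent. The sketch below shows one way to build such strings. The shellQuote and remoteCommand helpers and the paths are hypothetical, and whether the CLI quotes its commands this way is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote wraps s in single quotes for a POSIX shell, escaping any
// embedded single quote as '\'' so the remote shell sees s literally.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// remoteCommand joins a program name and its arguments into one command
// line, quoting every argument.
func remoteCommand(name string, args ...string) string {
	parts := []string{name}
	for _, a := range args {
		parts = append(parts, shellQuote(a))
	}
	return strings.Join(parts, " ")
}

func main() {
	// Placeholder paths, not the CLI's real remote layout.
	fmt.Println(remoteCommand("mkdir", "-p", "/scratch/user/droid jobs/abc"))
	fmt.Println(remoteCommand("sbatch", "/scratch/user/droid jobs/abc/run.sh"))
}
```

The resulting string is what would be passed to the SSH session's command execution call.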

Singularity instead of Docker

Most HPC clusters don't allow Docker: its daemon runs as root, and any user allowed to talk to it effectively gains root access, which is a security risk in a shared multi-user environment. Instead, the cluster supports Singularity, which runs containers as the invoking user with no elevated privileges. The CLI and SLURM runner scripts were built around Singularity: they pull the DROID-SLAM Docker image from Docker Hub, convert it to a Singularity Image File (.sif) on first use, and reuse the cached image for subsequent jobs.
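
The pull-once-then-reuse behaviour can be sketched as two small helpers that emit the shell lines a runner script would contain. This is an illustration under assumptions: the helper names, the .sif path, and the image reference are hypothetical. The singularity pull and singularity exec --nv commands are standard Singularity usage, with --nv enabling NVIDIA GPU support inside the container.

```go
package main

import "fmt"

// pullOnce returns a shell line that converts a Docker image to a .sif file
// only if that file does not exist yet, so later jobs reuse the cached copy.
// It assumes sifPath and dockerRef contain no single quotes.
func pullOnce(sifPath, dockerRef string) string {
	return fmt.Sprintf("[ -f '%s' ] || singularity pull '%s' 'docker://%s'",
		sifPath, sifPath, dockerRef)
}

// execInContainer returns a shell line that runs command inside the image
// with GPU access. command is inserted as-is and must already be shell-safe.
func execInContainer(sifPath, command string) string {
	return fmt.Sprintf("singularity exec --nv '%s' %s", sifPath, command)
}

func main() {
	// Hypothetical image name and path, used only for illustration.
	fmt.Println(pullOnce("droid.sif", "example/droid-slam:latest"))
	fmt.Println(execInContainer("droid.sif", "python --version"))
}
```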

Inference Pipeline

When droidslamcli infer is run, the CLI:

Validates local input files — PNG images directory, .txt calibration file, .pth model weights
Generates a UUID as the job ID for isolation
Creates the remote directory structure and uploads calibration file, model weights, and all images via SFTP
Templates the SLURM header and runner scripts with config values (account, time limit, GPU type, partition) using Go's text/template
Submits the job via sbatch
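
The templating step above can be sketched with text/template as follows. The SlurmConfig struct, its field names, and the exact set of #SBATCH directives are assumptions made for illustration, not the CLI's actual schema or script.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// SlurmConfig holds the values named in the text. The field names are
// illustrative, not the CLI's real config schema.
type SlurmConfig struct {
	JobID     string
	Account   string
	Partition string
	TimeLimit string // SLURM time format, e.g. "02:00:00"
	GPUType   string
}

// headerTmpl is a hypothetical SLURM header using standard sbatch options.
const headerTmpl = `#!/bin/bash
#SBATCH --job-name=droid-{{.JobID}}
#SBATCH --account={{.Account}}
#SBATCH --partition={{.Partition}}
#SBATCH --time={{.TimeLimit}}
#SBATCH --gres=gpu:{{.GPUType}}:1
`

// renderHeader fills the template with cfg and returns the script header.
func renderHeader(cfg SlurmConfig) (string, error) {
	t, err := template.New("header").Parse(headerTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, cfg); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Placeholder values; a real run would read these from config.yaml.
	out, err := renderHeader(SlurmConfig{
		JobID:     "3f2a",
		Account:   "myaccount",
		Partition: "gpu",
		TimeLimit: "02:00:00",
		GPUType:   "a100",
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Print(out)
}
```

Keeping the header as a template means the same runner script can target different accounts, partitions, and GPU types by changing config values only.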

The SLURM runner script on the compute node then downloads the Singularity image, clones the headless_changes fork, compiles it inside the container, and executes inference with --save_headless --disable_vis. Results are pulled back to the local machine via droidslamcli extract.
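
The first pipeline step, validating local inputs before anything is uploaded, might look like the sketch below. The function names and the exact checks are assumptions. The text only states that the CLI expects a directory of PNG images, a .txt calibration file, and .pth model weights.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// checkExt reports an error unless path ends in ext (case-insensitive).
func checkExt(path, ext string) error {
	if !strings.EqualFold(filepath.Ext(path), ext) {
		return fmt.Errorf("%s: expected a %s file", path, ext)
	}
	return nil
}

// validateInputs checks the three local inputs: a directory containing at
// least one *.png file (the glob is case-sensitive on Linux), a .txt
// calibration file, and .pth model weights.
func validateInputs(imageDir, calibFile, weightsFile string) error {
	pngs, err := filepath.Glob(filepath.Join(imageDir, "*.png"))
	if err != nil {
		return err
	}
	if len(pngs) == 0 {
		return fmt.Errorf("%s: no PNG images found", imageDir)
	}
	for _, f := range []struct{ path, ext string }{
		{calibFile, ".txt"},
		{weightsFile, ".pth"},
	} {
		if err := checkExt(f.path, f.ext); err != nil {
			return err
		}
		info, err := os.Stat(f.path)
		if err != nil {
			return err
		}
		if info.IsDir() {
			return fmt.Errorf("%s: is a directory, expected a file", f.path)
		}
	}
	return nil
}

func main() {
	// Placeholder paths; a real run would take these from config.yaml.
	if err := validateInputs("./images", "./calib.txt", "./droid.pth"); err != nil {
		fmt.Println("invalid input:", err)
		return
	}
	fmt.Println("inputs look valid")
}
```

Failing fast on the local machine avoids spending an upload and a queue slot on a job that cannot succeed.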

Forking DROID-SLAM — Making it Headless

DROID-SLAM was never designed to run without a screen. Its visualiser — the part that shows you the 3D reconstruction being built in real time — and the code that actually saves the output were tangled together. On an HPC cluster there's no screen, so the visualiser crashes immediately, and with it, any chance of getting results out.

The fix was to fork the project and separate these two concerns. I wrote a new headless export module that does everything the visualiser does internally — reading the 3D data DROID-SLAM builds up during processing — but instead of drawing it on screen, it just saves the output straight to files: a point cloud of the reconstructed scene and a record of where the camera was at each moment. A new --save_headless flag was added so you could opt into this behaviour from the command line.

A few other papercuts were fixed along the way: the output path logic had a bug that was nesting results in an unintended subfolder, the trajectory data wasn't being saved at all, and the training pipeline was hardcoded to only work with one specific dataset format. That last fix meant the system could now be trained on custom-captured footage rather than being locked to a single public dataset.

Results

Successfully ran DROID-SLAM inference on the TUM RGB-D dataset on the HPC cluster's A100 GPU nodes
Reproduced trajectory results matching the paper's benchmarks
The fork, CLI, and containerised workflow were documented and handed off for continued use by the UCT robotics research group

Learnings

This project touched a wide stack: SLURM job scheduling, Singularity containers, SSH automation in Go, and the internals of a deep-learning visual SLAM system. The headless export problem was the most interesting challenge. To solve it, I had to understand how DROID-SLAM builds up its 3D reconstruction internally and find a way to get that data out without any of the display infrastructure it was designed around.