Basic Matrix Multiplication with GPUs
This example demonstrates how to run GPU-accelerated matrix multiplication using PyTorch with CUDA.
Create a Python script, cuda_mm.py, with the following content:
cuda_mm.py:
#!/usr/bin/env python
import time
import torch

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Matrix size
N = 8192

# Create random matrices directly on the selected device
A = torch.randn((N, N), device=device)
B = torch.randn((N, N), device=device)

# Warm-up (helps get accurate timing)
_ = torch.mm(A, B)

# Measure performance (synchronize only when CUDA is in use,
# so the script still runs on CPU-only machines)
if device.type == "cuda":
    torch.cuda.synchronize()
start = time.time()

# Matrix multiplication
C = torch.mm(A, B)
if device.type == "cuda":
    torch.cuda.synchronize()
end = time.time()

print(f"Matrix size: {N}x{N}")

# CPU timing for speedup calculation
A_cpu = A.cpu()
B_cpu = B.cpu()

# CPU warm-up
_ = torch.mm(A_cpu, B_cpu)

# Measure CPU performance
start_cpu = time.time()
C_cpu = torch.mm(A_cpu, B_cpu)
end_cpu = time.time()

cpu_time = end_cpu - start_cpu
gpu_time = end - start
print(f"CPU time: {cpu_time:.4f} seconds")
print(f"GPU time: {gpu_time:.4f} seconds")
print(f"Speedup: {cpu_time / gpu_time:.2f}x")
Run the script through a Jupyter Notebook session, an interactive job, or a Slurm batch job script that requests GPU resources.
For example, from the login node, request an interactive GPU session:
srun -p gpu --gres=gpu:h200:1 -t 1:00:00 --mem=20G --pty bash -i
Then execute the script:
chmod +x cuda_mm.py
./cuda_mm.py
The script prints the CPU and GPU timings along with the GPU speedup:
Using device: cuda
Matrix size: 8192x8192
CPU time: 6.3201 seconds
GPU time: 0.2054 seconds
Speedup: 30.77x
Notes
It is important to understand that GPU operations are asynchronous: when you call torch.mm(), the operation is queued on the GPU and Python execution continues immediately. Without torch.cuda.synchronize(), the timing would capture when the operation was queued rather than when it actually completed, producing misleadingly fast results. The synchronization calls act as barriers, forcing the CPU to wait until all queued GPU work has finished before recording timestamps, so the measurement reflects true computation time rather than queue latency.
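As a sketch of an alternative approach (not used in the script above), CUDA events can time the work on the GPU's own timeline and need only a single synchronization at the end; this assumes a CUDA-capable device is present:
import torch

N = 8192
A = torch.randn((N, N), device="cuda")
B = torch.randn((N, N), device="cuda")
_ = torch.mm(A, B)  # warm-up

# CUDA events record timestamps on the GPU itself
start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

start_evt.record()
C = torch.mm(A, B)
end_evt.record()

# Wait for the GPU to finish before reading the elapsed time (milliseconds)
torch.cuda.synchronize()
print(f"GPU time: {start_evt.elapsed_time(end_evt) / 1000:.4f} seconds")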
Memory efficiency plays a crucial role in GPU performance. Creating the matrices directly on the GPU eliminates the data transfer bottleneck between CPU and GPU memory: a typical PCIe connection transfers data at roughly 10 GB/s, while modern GPU memory operates at well over 1000 GB/s. Keeping the data in GPU memory from the start avoids this transfer overhead entirely, which matters most for large matrices, where the transfer time can rival or exceed the computation time itself.
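To make the transfer cost visible, a small sketch like the following (assuming a CUDA device is available) compares copying a matrix over PCIe with creating it directly in GPU memory:
import time
import torch

N = 8192  # an 8192 x 8192 float32 matrix is roughly 256 MB

# Create on the host first, then pay for the host-to-device copy
A_host = torch.randn((N, N))
torch.cuda.synchronize()
t0 = time.time()
A_dev = A_host.to("cuda")
torch.cuda.synchronize()
print(f"Host-to-device transfer: {time.time() - t0:.4f} s")

# Creating the tensor directly in GPU memory skips the copy entirely
t0 = time.time()
B_dev = torch.randn((N, N), device="cuda")
torch.cuda.synchronize()
print(f"Direct on-device creation: {time.time() - t0:.4f} s")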
In practice, GPU-accelerated matrix multiplication may achieve speedups of up to ~500x over CPU implementations, depending on the GPU hardware and the matrix size. Modern NVIDIA GPUs with Tensor Cores can process thousands of matrix operations simultaneously, making them ideal for the inherently parallel nature of linear algebra. Performance scales non-linearly with matrix size: very small matrices typically show little speedup because of GPU launch and initialization overhead, while larger matrices can achieve dramatic improvements.
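A simple way to observe this scaling behaviour is to repeat the measurement over a range of matrix sizes; a minimal sketch (sizes chosen arbitrarily, with the same CPU fallback as the main script):
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for N in (256, 1024, 4096, 8192):
    A = torch.randn((N, N), device=device)
    B = torch.randn((N, N), device=device)
    _ = torch.mm(A, B)  # warm-up
    if device.type == "cuda":
        torch.cuda.synchronize()
    t0 = time.time()
    C = torch.mm(A, B)
    if device.type == "cuda":
        torch.cuda.synchronize()
    print(f"N = {N:5d}: {time.time() - t0:.4f} s")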
GPU-accelerated matrix multiplication finds applications across numerous domains. In deep learning, matrix multiplication forms the backbone of neural network forward and backward passes. Scientific computing relies on it for simulations, finite element analysis, and numerical methods. Data scientists use it for large-scale dimensionality reduction, clustering algorithms, and feature transformations. The script's graceful fallback to CPU when CUDA isn't available makes it usable in development environments without GPU access, while still delivering full performance in production HPC environments.