Advice
- https://horace.io/brrr_intro.html
Hardware architectures
- Nvidia
  - Reference
  - In succession, from oldest to newest
    - Maxwell (2014): e.g., M60
    - Pascal (2016)
      - CUDA 8+
    - In parallel (2017-2018)
      - Turing (2018): mainly for consumers, e.g. RTX 2080; also the T4 inference card
      - Volta (2017): for datacenters, e.g. V100
      - No support for bfloat16
      - CUDA 9+ (Volta), CUDA 10+ (Turing)
    - Ampere (2020): e.g. A100, RTX 30*, RTX A4000, RTX A4500, RTX A5000, RTX A6000
      - Supports bfloat16
      - CUDA 11.1+
    - In parallel (2022)
      - Ada aka SM89: for consumers and workstations, e.g. RTX 4090 (24G), RTX 6000 Ada (48G)
        - CUDA 11.8+
      - Hopper aka SM90: for datacenters, e.g. H100 (80G)
        - CUDA 12+
    - Blackwell (2024), e.g. B200, GB200
      - CUDA 12+
      - B200: up to 20 PFLOPS (NVIDIA's headline figure, at FP4 precision)
      - GB200 = 2x B200 + Grace CPU
      - Training a 1.8T-param model: 8k Hopper GPUs at 15 MW, versus 2k Blackwell GPUs at 4 MW.
      - On GPT-3 (175B params): 7x the performance of H100 and 4x the training speed.
      - If you were to train GPT-4 (a 1.8T-param model):
        - On A100: 25k GPUs, 3-5 months.
        - On H100: 8k GPUs, ~3 months.
        - On B100: 2k GPUs, ~3 months.
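The architecture list above can be condensed into a small lookup table. This is a minimal sketch, not an official API: the compute-capability numbers are added here for illustration (only SM89 and SM90 appear in the notes), and the helper name supports_bf16 is hypothetical. It encodes the rule from the list that bfloat16 arrives with Ampere, i.e. compute capability 8.0 and later.

```python
# Baseline compute capability (major, minor) per architecture.
# Assumption: these are the first SM versions of each generation; individual
# cards may report a higher minor version (e.g. Ampere RTX cards are 8.6).
COMPUTE_CAPABILITY = {
    "Maxwell": (5, 0),
    "Pascal": (6, 0),
    "Volta": (7, 0),
    "Turing": (7, 5),
    "Ampere": (8, 0),
    "Ada": (8, 9),
    "Hopper": (9, 0),
    "Blackwell": (10, 0),
}

# bfloat16 is supported from Ampere (compute capability 8.0) onwards.
BF16_MIN_CAPABILITY = (8, 0)


def supports_bf16(architecture: str) -> bool:
    """Return True if the named architecture supports bfloat16."""
    return COMPUTE_CAPABILITY[architecture] >= BF16_MIN_CAPABILITY


if __name__ == "__main__":
    for name in COMPUTE_CAPABILITY:
        print(f"{name:10s} bf16={supports_bf16(name)}")
```

Tuples compare element by element, so (8, 9) >= (8, 0) holds for Ada while (7, 5) >= (8, 0) does not for Turing.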
CUDA model
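- Grid > blocks > threads: a grid contains blocks, and a block contains threads. Blocks and threads are indexed in up to three dimensions, x/y/z (only x is required).
- Shared memory is a per-block scratchpad that the programmer manages explicitly.
- Warps (groups of 32 threads that execute together) are a hardware detail.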
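To make the grid > blocks > threads hierarchy concrete, here is a rough pure-Python emulation of a 1-D launch. It runs sequentially on the CPU and only mirrors the index arithmetic of the usual CUDA idiom blockIdx.x * blockDim.x + threadIdx.x; the function names are hypothetical and nothing here touches a GPU.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def global_thread_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Flat index of a thread in a 1-D grid (x dimension only)."""
    return block_idx * block_dim + thread_idx


def warp_of(thread_idx: int) -> int:
    """Which warp inside its block a thread belongs to."""
    return thread_idx // WARP_SIZE


def emulate_vector_add(a, b, block_dim=4):
    """Emulate out[i] = a[i] + b[i] with one 'thread' per element."""
    n = len(a)
    out = [0] * n
    # Enough blocks to cover n elements (ceiling division).
    grid_dim = (n + block_dim - 1) // block_dim
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = global_thread_index(block_idx, block_dim, thread_idx)
            if i < n:  # guard: the last block may have spare threads
                out[i] = a[i] + b[i]
    return out


if __name__ == "__main__":
    print(emulate_vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
```

The bounds check matters because the grid is rounded up to a whole number of blocks: 5 elements with 4 threads per block launch 8 threads, 3 of which must do nothing.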
Tools
- https://openai.com/research/triton : a higher-level Pythonic CUDA
- Mojo: a more generalized triton
- https://github.com/NVIDIA/nccl : multi-GPU and multi-host communication
Usage tips
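- CUDA_VISIBLE_DEVICES=0,3,5 restricts a process to physical GPUs 0, 3, and 5; inside the process they are renumbered as devices 0, 1, 2.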
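A small sketch of how the variable remaps device numbers. The helper visible_device_map is hypothetical and only parses the environment variable; it does not query the driver. The variable has to be set before the process initializes CUDA (in practice, before importing the framework that uses the GPU).

```python
import os


def visible_device_map(env=None):
    """Map logical device index -> physical device id from CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (all devices visible).
    Physical ids are kept as strings because the variable may also hold UUIDs.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    ids = [token.strip() for token in raw.split(",") if token.strip()]
    return dict(enumerate(ids))


if __name__ == "__main__":
    # Set the variable for this process only, before any CUDA library is loaded.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,3,5"
    print(visible_device_map())
```

With the value 0,3,5, logical device 1 inside the process refers to physical GPU 3.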
Sparse matrices
- CSR (compressed sparse row), CSC (compressed sparse column)
- Block sparse: BSR, BSC. Same as CSR/CSC, but each stored entry is an entire dense, uniformly sized block
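The formats above can be sketched in plain Python. This is an illustrative, unoptimized conversion (function names are hypothetical): CSR stores the nonzero values, their column indices, and a row-pointer array; BSR uses the same three arrays but each stored entry is a dense block.

```python
def dense_to_csr(matrix):
    """Return (data, indices, indptr) for a dense list-of-lists matrix."""
    data, indices, indptr = [], [], [0]
    for row in matrix:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))  # where the next row starts in data
    return data, indices, indptr


def dense_to_bsr(matrix, block):
    """Same layout as CSR, but each entry is a block x block dense tile."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    assert n_rows % block == 0 and n_cols % block == 0
    data, indices, indptr = [], [], [0]
    for br in range(n_rows // block):
        for bc in range(n_cols // block):
            tile = [row[bc * block:(bc + 1) * block]
                    for row in matrix[br * block:(br + 1) * block]]
            if any(v != 0 for tile_row in tile for v in tile_row):
                data.append(tile)
                indices.append(bc)  # block-column index
        indptr.append(len(data))
    return data, indices, indptr


def csr_matvec(csr, x):
    """Multiply a CSR matrix by a dense vector."""
    data, indices, indptr = csr
    return [sum(data[k] * x[indices[k]] for k in range(indptr[r], indptr[r + 1]))
            for r in range(len(indptr) - 1)]


if __name__ == "__main__":
    m = [[1, 0, 0, 0],
         [0, 2, 0, 0],
         [0, 0, 0, 3],
         [0, 0, 4, 0]]
    print(dense_to_csr(m))
    print(dense_to_bsr(m, 2))
```

CSC is the same idea with the roles of rows and columns swapped, which is equivalent to building CSR for the transposed matrix.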

Interconnects

Typically 1600Gbps (source)
Ethernet: latest 400 Gbit/s, with rates up to 1.6 Tbit/s under development
InfiniBand: 10 to 400 Gbit/s
A single NVIDIA Blackwell Tensor Core GPU supports up to 18 NVLink 100 gigabyte-per-second (GB/s) connections for a total bandwidth of 1.8 terabytes per second (TB/s)—2X more bandwidth than the previous generation and over 14X the bandwidth of PCIe Gen5.

| | Second Generation | Third Generation | Fourth Generation | Fifth Generation |
|---|---|---|---|---|
| NVLink bandwidth per GPU | 300GB/s | 600GB/s | 900GB/s | 1,800GB/s |
| Maximum Number of Links per GPU | 6 | 12 | 18 | 18 |
| Supported NVIDIA Architectures | NVIDIA Volta™ architecture | NVIDIA Ampere architecture | NVIDIA Hopper™ architecture | NVIDIA Blackwell architecture |

The NVIDIA NVLink Switch features 144 NVLink ports with a non-blocking switching capacity of 14.4TB/s
PCIe 6.0: 64 GT/s (gigatransfers per second) per lane, for a bandwidth of 256 GB/s (both directions combined) across 16 lanes
More alternatives to PCIe: NVLink, CAPI, Gen-Z, CCIX, and CXL
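The bandwidth figures above mix gigabits and gigabytes, so a quick arithmetic check helps. This is a back-of-the-envelope sketch with hypothetical helper names. Two assumptions are not from the notes: PCIe Gen5 runs at 32 GT/s per lane, and encoding overhead is ignored (one transfer is counted as one bit per lane).

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert Gbit/s to GB/s."""
    return gbit_per_s / 8


def nvlink_total_gbytes(links: int, gbytes_per_link: float = 100) -> float:
    """Aggregate NVLink bandwidth per GPU in GB/s."""
    return links * gbytes_per_link


def pcie_bidirectional_gbytes(gt_per_s: float, lanes: int = 16) -> float:
    """Nominal PCIe bandwidth in GB/s, both directions combined.

    Assumption: one transfer moves one bit per lane; encoding overhead ignored.
    """
    per_direction = gt_per_s * lanes / 8
    return 2 * per_direction


if __name__ == "__main__":
    nvlink5 = nvlink_total_gbytes(18)        # 18 links x 100 GB/s
    pcie5 = pcie_bidirectional_gbytes(32)    # assumed Gen5 rate: 32 GT/s
    pcie6 = pcie_bidirectional_gbytes(64)
    print(f"NVLink (Blackwell): {nvlink5} GB/s")
    print(f"PCIe Gen5 x16: {pcie5} GB/s, ratio {nvlink5 / pcie5:.2f}x")
    print(f"PCIe Gen6 x16: {pcie6} GB/s")
    print(f"400 Gbit/s Ethernet: {gbit_to_gbyte(400)} GB/s")
```

Under these assumptions the numbers line up with the quoted claims: 18 x 100 GB/s gives 1.8 TB/s, which is a little over 14 times the nominal 128 GB/s of PCIe Gen5 x16.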

Scripts

pip install cuda-python

conda install pytorch==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install nvidia/label/cuda-12.1.0::cuda-toolkit
conda install nvidia/label/cuda-12.1.0::cuda-cudart
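
After installing, a short check along these lines can confirm what PyTorch actually sees. This is a sketch, assuming the pytorch package from the conda command above; the function name describe_torch_cuda is hypothetical, and the code falls back gracefully when torch or a GPU is missing.

```python
import importlib.util


def describe_torch_cuda():
    """Summarize the PyTorch/CUDA setup without failing on CPU-only machines."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # CUDA version torch was built with
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_count"] = torch.cuda.device_count()
        info["device_0"] = torch.cuda.get_device_name(0)
        info["bf16_supported"] = torch.cuda.is_bf16_supported()
    return info


if __name__ == "__main__":
    for key, value in describe_torch_cuda().items():
        print(f"{key}: {value}")
```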

Tools
- nvidia-smi
- nsight
- tcpx/tcpd (Google Cloud) for interconnect
- EFA (AWS) for interconnect

Nondeterminism

There are three main interrelated sources of non-determinism in GPU programs:

Unpredictable Thread Scheduling:
- GPU hardware schedulers make dynamic decisions about which threads/warps to execute based on various factors like cache hits/misses
- For example, if Thread A experiences a cache miss while Thread B gets a cache hit, Thread B might execute first even though it came later in program order
- This means the exact ordering of thread execution can vary between runs, even with identical inputs
Memory Access Patterns:
- The state of the memory hierarchy (caches, memory controllers, etc.) is unpredictable across different runs
- This affects both:
  - When memory operations complete
  - The order in which concurrent memory accesses from different threads are processed
- For example, two threads doing atomic additions to the same memory location may have their updates applied in different orders on different runs
Floating Point Non-Associativity:
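- Consider three threads adding values a=1.00, b=0.555, c=-0.555, with every intermediate result rounded to 3 significant digits (halves rounded up)
- If executed in order (a+b)+c: a+b = 1.555, which rounds to 1.56; then 1.56+(-0.555) = 1.005, which rounds to 1.01
- If executed in order (b+c)+a: 0+1.00 = 1.00
- The results differ because floating point arithmetic rounds after every operation, so it is not associative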
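
The three-value example can be reproduced with Python's decimal module. This is a sketch under one stated assumption: arithmetic carries 3 significant digits and rounds halves up, which is what the numbers in the example imply. The last lines show the same effect with ordinary binary floats.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

# Assumption: 3 significant digits, round-half-up after every operation.
ctx = Context(prec=3, rounding=ROUND_HALF_UP)

a, b, c = Decimal("1.00"), Decimal("0.555"), Decimal("-0.555")

# (a + b) + c: 1.555 rounds to 1.56, then 1.005 rounds to 1.01
left = ctx.add(ctx.add(a, b), c)

# (b + c) + a: b and c cancel exactly, so nothing is lost
right = ctx.add(ctx.add(b, c), a)

print(left, right)  # 1.01 1.00

# The same effect with IEEE 754 double precision:
float_left = (0.1 + 0.2) + 0.3
float_right = 0.1 + (0.2 + 0.3)
print(float_left == float_right)  # False
```

On a GPU the grouping is decided by whichever thread gets to the accumulator first, so the two orderings above correspond to two possible runs of the same program.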