GPU Mem Arch

Published 2024-08-09 | Updated 2026-04-07


Resources

Data

Comparison of memory, bandwidth, and compute across TPUs and GPUs:

Chip-to-chip comparison of different DL hardware

Memory model

Mem Arch with SM

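Thread Block vs SM:
- Each thread block is executed by exactly one SM and cannot span multiple SMs;
- One SM can schedule multiple thread blocks concurrently;
- A kernel executes on a single GPU, and a GPU can execute multiple kernels at the same time;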

Thread block vs SM

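Shared memory: memory shared by the threads within one block.
A large computation is decomposed into many small sub-problems that can run in parallel, and each independent sub-problem runs in its own CUDA block. The CUDA runtime decides how and when to schedule these blocks onto SMs, so a CUDA program can scale across any number of SMs.
For example, a CUDA program with 8 blocks is scheduled differently on GPUs with different SM counts: on a GPU with 4 SMs, each SM is assigned 2 blocks, while on a GPU with 8 SMs, each SM is assigned 1 block.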
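The 8-block example can be sketched in a few lines of Python. This is only an illustration: the real assignment is made by the GPU at run time and is not guaranteed to be round-robin. The function name `assign_blocks` and the round-robin policy are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, List


def assign_blocks(num_blocks: int, num_sms: int) -> Dict[int, List[int]]:
    """Spread block IDs over SMs round-robin (hypothetical policy, for illustration)."""
    schedule = defaultdict(list)
    for block_id in range(num_blocks):
        # A block lands on exactly one SM and never spans two.
        schedule[block_id % num_sms].append(block_id)
    return dict(schedule)


if __name__ == "__main__":
    for num_sms in (4, 8):
        print(num_sms, "SMs:", assign_blocks(8, num_sms))
```

The same 8 blocks run unchanged on either GPU; only the number of blocks per SM differs, which is why the program scales with the SM count.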

warps

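While a thread block runs, its threads are grouped into warps that execute in SIMT fashion. As the literal meaning of "warp" suggests, each warp contains multiple parallel threads (typically 32), and these threads execute the same instruction in parallel on the SM.
Threads in a warp have consecutive, increasing thread IDs; the first warp contains the thread with threadId=0.
The number of warps in a block is defined as `ceil(threads per block / warp size, 1)`, where `ceil(x, y)` is x rounded up to the nearest multiple of y.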
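The warp arithmetic above can be checked with a short Python sketch. It assumes a warp size of 32 and a one-dimensional block, so that the thread ID is a single integer; the helper names `warps_per_block` and `warp_of` are hypothetical.

```python
WARP_SIZE = 32  # assumed; typical for NVIDIA GPUs


def warps_per_block(threads_per_block: int, warp_size: int = WARP_SIZE) -> int:
    """Number of warps needed for a block: the thread count divided by the warp size, rounded up."""
    return (threads_per_block + warp_size - 1) // warp_size


def warp_of(thread_id: int, warp_size: int = WARP_SIZE):
    """Return (warp index, lane within the warp) for a linear thread ID."""
    return thread_id // warp_size, thread_id % warp_size


if __name__ == "__main__":
    for threads in (32, 100, 256):
        warps = warps_per_block(threads)
        idle = warps * WARP_SIZE - threads  # lanes in the last warp with no thread
        print(threads, "threads:", warps, "warps,", idle, "idle lanes")
```

A block of 100 threads still occupies 4 full warps, so 28 lanes of the last warp do no work; this is one reason block sizes are usually chosen as multiples of the warp size.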

warps

thread block size vs warps:

Memory

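per-thread registers (L0): the fastest memory to access;
per-thread local memory: resides in device DRAM and holds data private to the thread; very slow, as it has the same latency as global memory;
per-block memory (L1):
- Visible to all threads in the block and used to exchange data among them;
- Slower than registers;
Global memory:
- Visible to all threads in the grid;
- The least efficient level to access;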
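To make the scopes concrete, here is a pure-Python emulation of a per-block sum reduction. It is not GPU code and says nothing about real performance; it only mirrors which data lives at which level. The function name `block_sum` is hypothetical, and the sketch assumes `block_dim` is a power of two.

```python
from typing import List


def block_sum(global_in: List[float], block_dim: int) -> List[float]:
    """Emulate a per-block sum reduction; returns one partial sum per block."""
    num_blocks = (len(global_in) + block_dim - 1) // block_dim
    global_out = [0.0] * num_blocks          # "global memory": visible to every block
    for block_id in range(num_blocks):       # blocks are independent of each other
        shared = [0.0] * block_dim           # "shared memory": one buffer per block
        for thread_id in range(block_dim):   # on a GPU these threads run in parallel
            idx = block_id * block_dim + thread_id
            # "register": a value private to this one thread
            value = global_in[idx] if idx < len(global_in) else 0.0
            shared[thread_id] = value
        # A real CUDA kernel would call __syncthreads() before each reduction step.
        stride = block_dim // 2
        while stride > 0:
            for thread_id in range(stride):
                shared[thread_id] += shared[thread_id + stride]
            stride //= 2
        global_out[block_id] = shared[0]     # one write back to global memory per block
    return global_out


if __name__ == "__main__":
    print(block_sum([1, 2, 3, 4, 5, 6, 7, 8], block_dim=4))
```

Each input element is read from global memory once, the repeated accesses happen in the per-block buffer, and only one result per block is written back, which is the usual motivation for staging data in shared memory.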

Memory Hierarchy

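Author: so2bin

Article link: https://so2bin.github.io/2024/08/09/AI-Infer/gpu-mem-arch/index/

Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit so2bin as the source when reposting.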

