LLM Chunk Context

Published 2024-07-25 | Updated 2026-06-09


References

How Chunk Context works
Quantitative analysis of compute and GPU memory: https://blog.csdn.net/taoqick/article/details/132009733

prefill vs decode

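Prefill processes the whole prompt in parallel as one long sequence; decode generates output token by token.
Prefill computes Q, K, and V directly from the input and does not need to read the KV cache; decode must read the KV cache, append the new token's K and V to it, and only then compute attention.
Requests have different context lengths, so their prefill compute cost differs.
For decode, different requests run a different number of iterations, and the mask matrix used when computing attention differs as well.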
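
The contrast above can be illustrated with a small sketch. It is a toy, not the implementation used by any particular inference engine: it assumes single-head attention with identity projections (so Q = K = V = the input vector), and the names `attend`, `prefill`, `decode_step`, and `decode_mask` are hypothetical helpers introduced only for this example. It shows that prefill builds K and V for the whole prompt without reading a cache, that a decode step reads the cache and appends one entry before attending, and that requests with different context lengths need different masks when batched.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attend(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def prefill(xs):
    """Process the whole prompt in one call; no KV cache is read.

    Assumption: identity projections, so Q = K = V = xs.
    The causal mask is expressed by slicing: position i sees 0..i.
    """
    keys, values = list(xs), list(xs)
    outputs = [attend(xs[i], keys[: i + 1], values[: i + 1])
               for i in range(len(xs))]
    return outputs, (keys, values)  # the cache is produced, not consumed

def decode_step(x, cache):
    """Process one new token: read the cache, append, then attend."""
    keys, values = cache
    keys.append(x)
    values.append(x)
    return attend(x, keys, values)

def decode_mask(context_lens):
    """Per-request mask for a batched decode step.

    Row r is True where request r has a real cached position and
    False where the row is only padding up to the longest request.
    """
    max_len = max(context_lens)
    return [[j < n for j in range(max_len)] for n in context_lens]

prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

# Prefill over all four tokens at once.
full_out, _ = prefill(prompt)

# Prefill over the first three tokens, then decode the fourth.
_, cache = prefill(prompt[:3])
step_out = decode_step(prompt[3], cache)

# Both paths give the same result for the last position.
print(all(math.isclose(a, b) for a, b in zip(full_out[3], step_out)))

# Two requests with context lengths 2 and 4 need different mask rows.
print(decode_mask([2, 4]))
```

In this sketch the prefill cost grows with the prompt length because every position attends over its prefix, while each decode step handles a single query but has to touch the entire cache.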

Author: so2bin

Article link: https://so2bin.github.io/2024/07/25/AI-Infer/llm-chunk-context/index/

Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit so2bin when reposting!

