Triton-Lang
References
- https://openai.com/index/triton/
- https://github.com/triton-lang/triton
- Paper: Triton: an intermediate language and compiler for tiled neural network computations
- https://triton-lang.org/main/getting-started/tutorials/index.html
Author: so2bin
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY-NC-SA 4.0. Please credit so2bin as the source when reposting!