LLM Quant

Published 2024-09-19 | Updated 2026-04-07

Resources

  • SmoothQuant (Juejin post): https://juejin.cn/post/7330079146515611687
  • SmoothQuant (arXiv paper 2211.10438): https://arxiv.org/pdf/2211.10438
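Author: so2bin
Article link: https://so2bin.github.io/2024/09/19/AI-Infer/llm-quant/index/
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit so2bin as the source when reposting.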