so2bin

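Posted 2026-04-07|Updated 2026-06-09|claudecode|Claude Code•OpenTelemetry•Source Code Analysis•Observability

Claude Code OpenTelemetry Observability: An In-Depth Analysis. 1. Design Philosophy Overview: As an AI-agent-grade tool, Claude Code's OpenTelemetry (OTel) observability design philosophy can be summed up as "layered, opt-in, multi-backend, privacy-first". Layered Telemetry: observability is split into three independent layers, Metrics, Logs/Events, and Traces, each configured and exported independently so that none interferes with the others. Opt-in Activation: each level of observability is controlled by its own environment variables and Feature Gates, forming a tiered activation strategy that runs from "basic metrics enabled by default" to "detailed tracing that must be explicitly enabled". Multi-Backend: the same collected data can be exported to several backends at once (OTLP, BigQuery, Prometheus, Perfetto, and an internal 1P analytics system), each with its own dedicated Exporter implementation. Privacy-First: user Prompts are redacted by default (<REDACT...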
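The tiered, per-signal activation described above can be sketched as a small resolver. This is a minimal illustration under stated assumptions, not Claude Code's implementation: the `OTEL_*_EXPORTER` variables are standard OpenTelemetry names, while `ENABLE_EVENT_LOGS`, `ENABLE_TRACES`, the `traces_beta` feature gate, `LOG_USER_PROMPTS`, and the redaction placeholder are hypothetical names chosen for the example.

```python
import os
from typing import Dict, List, Mapping, Optional, Set

# Standard OpenTelemetry environment variables that select exporters per signal.
EXPORTER_VARS: Dict[str, str] = {
    "metrics": "OTEL_METRICS_EXPORTER",
    "logs": "OTEL_LOGS_EXPORTER",
    "traces": "OTEL_TRACES_EXPORTER",
}

# Hypothetical switch names, for illustration only (not taken from Claude Code).
LOGS_SWITCH = "ENABLE_EVENT_LOGS"
TRACES_SWITCH = "ENABLE_TRACES"
TRACES_GATE = "traces_beta"  # hypothetical feature-gate name
PROMPT_SWITCH = "LOG_USER_PROMPTS"


def _truthy(value: Optional[str]) -> bool:
    return (value or "").strip().lower() in {"1", "true", "yes"}


def _exporters(env: Mapping[str, str], var: str) -> List[str]:
    # Accept a comma-separated list so one signal can feed several backends.
    names = [n.strip() for n in env.get(var, "").split(",")]
    return [n for n in names if n and n != "none"]


def resolve_signals(env: Mapping[str, str], gates: Set[str]) -> Dict[str, List[str]]:
    """Map each signal to its exporters; an empty list means nothing is exported."""
    enabled = {
        "metrics": True,                                                     # tier 1: on by default
        "logs": _truthy(env.get(LOGS_SWITCH)),                               # tier 2: env-var opt-in
        "traces": _truthy(env.get(TRACES_SWITCH)) and TRACES_GATE in gates,  # tier 3: env var + gate
    }
    # Each signal is resolved independently, so one layer never affects another.
    return {
        signal: _exporters(env, var) if enabled[signal] else []
        for signal, var in EXPORTER_VARS.items()
    }


def prompt_attribute(prompt: str, env: Mapping[str, str]) -> str:
    """Privacy-first: export only the prompt length unless raw text is explicitly allowed."""
    if _truthy(env.get(PROMPT_SWITCH)):
        return prompt
    return f"<redacted, {len(prompt)} chars>"


if __name__ == "__main__":
    print(resolve_signals(os.environ, gates=set()))
```

Keeping one resolver per signal mirrors the "layered" idea: turning on tracing changes nothing about how metrics or logs are exported.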

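Hexo Tag Plugins Syntax Cheat Sheet

Posted 2026-04-07|Updated 2026-06-09|resources|Tools•Hexo•Writing

Hexo Tag Plugins Syntax Cheat Sheet. This post collects the Tag Plugin syntax commonly used with the Butterfly / NexT themes, for quick reference while writing. 1. Tabs (shared by Butterfly and NexT). Demo tabs Go / Python / Bash: func main() { fmt.Println("Hello Go") } | def main(): print("Hello Python") | echo "Hello Bash". Syntax: {% raw %}{% tabs unique_name %}content 1 content 2{% endtabs %}{% endraw %} Icons are optional (Butterfly); the format is <!-- tab Title@fas fa-xxx ...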

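nano banana Technical Style

Posted 2026-01-05|Updated 2026-06-09|AI•gemini

Style: "Retro Engineering Schematic" (retro engineering blueprint / schematic style). Characteristics: blends "Da Vinci manuscripts", "old patent drawings", and "modern UI icons". 🎨 Style Definition: the core elements of this style are: Background: Material: yellowish paper, vintage parchment, aged paper. Texture: subtle paper grain, occasionally with grid or measurement lines to add an engineering feel. Line Work: Ink Lines: crisp, fine black outlines, as if hand-drawn with a technical pen. Style: clean, not a rough sketch, with the rigor of professional drafting. Color Palette: Main tones: warm background with black lines. Accents: low-saturation light pastels or vintage ink colors (e.g. Teal, Brick Red, Amber) to distinguish functional blocks. Avoid us...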

Architecture Governance

Posted 2025-10-22|Updated 2026-06-09|arch|Architecture•Microservices

Arch Govern References: https://medium.com/leadercircle/stop-giving-the-excuse-of-i-am-too-busy-and-tackle-those-hard-things-f438bf71ff45 Microservice architecture governance: https://dzone.com/articles/ending-microservices-chaos-architecture-governance

OPA

Posted 2025-04-25|Updated 2026-06-09|cloud-native|OPA•APISIX

OPA References: https://cloud.tencent.com/developer/article/1755148 Alternatives: https://apisix.apache.org/zh/blog/2023/03/30/what-is-wasm-and-how-does-apache-apisix-support-it/ APISIX Go plugin: https://zhuanlan.zhihu.com/p/613540331 APISIX Go plugin + OPA Go SDK: https://apisix.apache.org/zh/blog/2021/08/19/go-makes-apache-apisix-better/ tiny go apisix plugin: https://navendu.me/posts/tiny-apisix-plugin/

Diagramming Tools

Posted 2025-04-25|Updated 2026-06-09|resources|Tools

Diagramming Tools. Icons: SVGs for all kinds of components: https://techicons.dev/icons/

MADR

Posted 2025-04-07|Updated 2026-06-09|ADR|Architecture•ADR

MADR References: https://adr.github.io/adr-templates/ https://www.ozimmer.ch/practices/2022/11/22/MADRTemplatePrimer.html Template: https://github.com/adr/madr/blob/4.0.0/template/adr-template.md?plain=1 AWS: https://docs.aws.amazon.com/zh_cn/prescriptive-guidance/latest/architectural-decision-records/adr-process.html AWS demo: https://docs.aws.amazon.com/zh_cn/prescriptive-guidance/latest/architectural-decision-records/appendix.html Nygard ADR: https://cognitect.com/blog/2011/11/15/documenting-architecture-d...

LLM Quant

Posted 2024-09-19|Updated 2026-06-09|GPU•LLM•Quant

References: SmoothQuant: https://juejin.cn/post/7330079146515611687 SmoothQuant: https://arxiv.org/pdf/2211.10438

Tritonserver Source Code Reading

Posted 2024-09-13|Updated 2026-06-09|AI-Infer|tritonserver•AI Inference

tritonserver inference API entry point: the HTTPAPIServer::HandleInfer function in server/src/http_server.cc, https://github.com/triton-inference-server/server/blob/363bcdcd03cddcd00979c7fd3315557328221c6d/src/http_server.cc#L3578

GPU Sharing Techniques

Posted 2024-08-24|Updated 2026-06-09|GPU•k8s

References: https://developer.nvidia.com/zh-cn/blog/improving-gpu-utilization-in-kubernetes/

karmada-scheduler

Posted 2024-08-22|Updated 2026-06-09|k8s•karmada

Flow Overview: the following is an overview of the scheduler flow:

Triton-Lang

Posted 2024-08-14|Updated 2026-06-09|GPU•CUDA•Triton

References: https://openai.com/index/triton/ https://github.com/triton-lang/triton Triton: an intermediate language and compiler for tiled neural network computations https://triton-lang.org/main/getting-started/tutorials/index.html

Posted 2024-08-12|Updated 2026-06-09|GPU•LLM•FlashAttention

References: v1: https://arxiv.org/pdf/2205.14135 v2: https://arxiv.org/pdf/2307.08691 GPT with pytorch: https://medium.com/@akriti.upadhyay/building-custom-gpt-with-pytorch-59e5ba8102d4 https://www.analyticsvidhya.com/blog/2024/04/mastering-decoder-only-transformer-a-comprehensive-guide/ Principles: Transformer decoder computation, GPT transformer block: in_shape = [B, S] # shape. # After embedding, MHA requires H to be an integer multiple of head_num, so that H can be split evenly across the heads. # embedding...
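The shape bookkeeping above, where H must divide evenly by head_num, can be traced with a small pure-Python sketch. It only tracks shapes and splits a single token vector; the function names are hypothetical helpers for illustration, not code from the referenced tutorials. For example, H=768 with 12 heads (the GPT-2 small configuration) gives head_dim=64.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def mha_shapes(batch: int, seq: int, hidden: int, num_heads: int) -> Dict[str, Shape]:
    """Track tensor shapes through one multi-head attention block of a GPT decoder."""
    if hidden % num_heads != 0:
        # H must divide evenly so every head gets an equal head_dim slice.
        raise ValueError(f"hidden={hidden} is not a multiple of num_heads={num_heads}")
    head_dim = hidden // num_heads
    return {
        "input_ids": (batch, seq),                              # [B, S]
        "embedded": (batch, seq, hidden),                       # [B, S, H]
        "q_k_v_per_head": (batch, num_heads, seq, head_dim),    # [B, nh, S, H/nh]
        "attn_scores": (batch, num_heads, seq, seq),            # Q @ K^T per head
        "merged_heads": (batch, seq, hidden),                   # heads concatenated back
    }


def split_heads(token_vec: List[float], num_heads: int) -> List[List[float]]:
    """Split one token's H-dim vector into num_heads contiguous head_dim chunks."""
    if len(token_vec) % num_heads != 0:
        raise ValueError("vector length must be a multiple of num_heads")
    d = len(token_vec) // num_heads
    return [token_vec[h * d:(h + 1) * d] for h in range(num_heads)]


def merge_heads(heads: List[List[float]]) -> List[float]:
    """Inverse of split_heads: concatenate head outputs back into H dims."""
    return [x for head in heads for x in head]
```

Because each head works on a contiguous slice of H, `merge_heads(split_heads(v, n))` returns `v` unchanged, which is why the divisibility check is the only constraint on head_num.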

GPU Mem Arch

Posted 2024-08-09|Updated 2026-06-09|GPU•LLM•Mem Arch

References: https://khairy2011.medium.com/tpu-vs-gpu-vs-cerebras-vs-graphcore-a-fair-comparison-between-ml-hardware-3f5a19d89e38 https://flashinfer.ai/2024/02/02/cascade-inference https://developer.nvidia.com/blog/cuda-refresher-reviewing-the-origins-of-gpu-computing/ CUDA optimization: https://www.nvidia.com/en-us/on-demand/session/gtc24-s62191/ Data: TPU vs GPU memory, bandwidth, and compute comparison chart. Memory Model: Thread Block vs SM: each thread block is executed by a single SM and cannot span multiple SMs; one SM can schedule multiple thread blocks concurrently; a kernel executes on one GPU, and one GPU can execute multiple kernels concurrently; ...
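The rule that a thread block stays on one SM while an SM can host several blocks leads to the idea of "waves": the grid runs in rounds of at most (SM count × resident blocks per SM). The toy model below assumes a fixed `blocks_per_sm`; in reality occupancy depends on registers, shared memory, and threads per block, and the hardware scheduler dispatches blocks dynamically rather than in strict rounds.

```python
import math


def _resident_slots(num_sms: int, blocks_per_sm: int) -> int:
    if num_sms <= 0 or blocks_per_sm <= 0:
        raise ValueError("num_sms and blocks_per_sm must be positive")
    # Blocks that can be resident on the whole GPU at the same time.
    return num_sms * blocks_per_sm


def block_waves(num_blocks: int, num_sms: int, blocks_per_sm: int) -> int:
    """Toy model: each block runs on exactly one SM (never split across SMs),
    and each SM holds at most blocks_per_sm resident blocks at once."""
    return math.ceil(num_blocks / _resident_slots(num_sms, blocks_per_sm))


def last_wave_fill(num_blocks: int, num_sms: int, blocks_per_sm: int) -> float:
    """Fraction of resident-block slots busy during the final wave (tail effect)."""
    slots = _resident_slots(num_sms, blocks_per_sm)
    rem = num_blocks % slots
    return 1.0 if rem == 0 else rem / slots
```

For example, with 108 SMs (A100) and an assumed 2 resident blocks per SM, a 300-block grid needs 2 waves, and the second wave keeps only 84 of 216 slots busy, roughly 39%.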

cutlass

Posted 2024-08-08|Updated 2026-06-09

References: https://github.com/NVIDIA/cutlass?tab=readme-ov-file https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md

flashinfer

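Posted 2024-08-08|Updated 2026-06-09|GPU•LLM•flashinfer

References: https://flashinfer.ai/ https://flashinfer.ai/2024/02/02/introduce-flashinfer.html https://flashinfer.ai/2024/02/02/cascade-inference Flash-Decode: https://crfm.stanford.edu/2023/10/12/flashdecoding.html Introduction: the project focuses on the computational efficiency of self-attention and integrates the most advanced current optimization techniques. It divides self-attention into three stages, prefill, decode, and append, and analyzes the performance bottlenecks for both single-request and batched-request scenarios. Open-source repository: https://github.com/flashinfer-ai/flashinfer/ Advantages: Comprehensive Attention Kernels: the attention kernels integrate cutting-edge high-performance optimizations and cover prefill, decode, and append kernels for both single and batch requests, incl...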
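The three stages differ mainly in how many new query tokens run against how many KV positions. The plain-Python sketch below (not FlashInfer's API) builds the causal mask for each case, assuming the new queries are the last `q_len` tokens of the sequence, i.e. a bottom-right-aligned causal mask.

```python
from typing import List


def attention_mask(q_len: int, kv_len: int) -> List[List[bool]]:
    """Causal mask where the q_len new queries are the LAST q_len tokens of a
    kv_len-token sequence. mask[i][j] is True when query i may attend to KV position j."""
    if not 1 <= q_len <= kv_len:
        raise ValueError("need 1 <= q_len <= kv_len")
    offset = kv_len - q_len  # tokens already in the KV cache before this step
    return [[j <= offset + i for j in range(kv_len)] for i in range(q_len)]


def stage(q_len: int, kv_len: int) -> str:
    """Name the stage by how many new queries run against how much KV."""
    if q_len == kv_len:
        return "prefill"  # no cached KV: plain lower-triangular causal mask
    if q_len == 1:
        return "decode"   # one new token attends to the whole cache
    return "append"       # several new tokens on top of an existing cache
```

With this layout, decode is simply append with `q_len == 1`, and prefill is append with an empty cache, which is why a single kernel family can cover all three.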

ampere

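Posted 2024-08-05|Updated 2026-06-09|GPU•LLM•Ampere

Ampere References: https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/ Ada Architecture References: https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf https://flashinfer.ai/2024/02/02/introduce-flashinfer.html H100/A100 use HBM3 and HBM2e respectively, so their memory bandwidth is far higher than that of the RTX Ada series; RTX Ada has higher non-Tensor Core peak performance (4090: 80 TFLOPS, A100: 20 TFLOPS, H100: 67 TFLOPS); H100's Tensor Core peak performance is far higher than that of A100 and Ada 4090; on Ada 4090, FP16 performance is 2x FP32, whereas on the other cards FP32 and FP16 peak performance are the same; SM architecture diagram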

nvidia Hopper

Posted 2024-08-05|Updated 2026-06-09|GPU•LLM•Hopper•H100

Grace Hopper References: https://developer.nvidia.com/zh-cn/blog/nvidia-grace-hopper-superchip-architecture-in-depth/ https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ H100 Architecture Overview: https://resources.nvidia.com/en-us-tensor-core?ncid=no-ncid Key Features: compute architecture sm90. H100 + InfiniBand delivers 30x the performance of A100; H100 + NVLink delivers 3x the performance of H100 + InfiniBand. Compared with A100, H100's fourth-generation Tensor Cores are a major improvement: 6x chip-to-chip speed, faster SMs, more SMs, and higher clocks; for the same data type and data volume, an H100 SM computes at 2x the rate of an A100 SM. New thread block cluster: supports more than a single SM's single thread ...

LLM Chunk Context

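Posted 2024-07-25|Updated 2026-06-09|LLM•Chunk Context

References: Chunk Context principles, with a quantitative analysis of compute and memory: https://blog.csdn.net/taoqick/article/details/132009733 prefill vs decode: prefill computes a long sequence in parallel, while decode proceeds token by token. Prefill computes QKV directly without reading the KV cache, whereas decode must read the KV cache and concatenate it before computing. Requests have different context lengths, so their prefill compute differs; for decode, different requests run different numbers of iterations, and the mask matrices used when computing attention also differ.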
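Chunked context sits between the two cases above: a long prompt's prefill is split into pieces, and every piece after the first reads the KV cache written by earlier pieces, much like decode, while still computing several queries at once, like prefill. A minimal planning sketch, where `chunk_size` is a hypothetical tuning parameter rather than a setting of any specific framework:

```python
from typing import List, NamedTuple


class ChunkStep(NamedTuple):
    start: int         # first prompt position covered by this chunk
    q_len: int         # new tokens whose Q/K/V are computed in this step
    kv_len: int        # KV positions visible: cached prefix + this chunk
    reads_cache: bool  # True once earlier chunks' K/V already sit in the KV cache


def plan_chunked_prefill(prompt_len: int, chunk_size: int) -> List[ChunkStep]:
    """Split one prompt's prefill into fixed-size chunks (the last may be shorter)."""
    if prompt_len <= 0 or chunk_size <= 0:
        raise ValueError("prompt_len and chunk_size must be positive")
    steps = []
    for start in range(0, prompt_len, chunk_size):
        q_len = min(chunk_size, prompt_len - start)
        steps.append(ChunkStep(start, q_len, start + q_len, start > 0))
    return steps
```

Each step's attention mask is the causal mask offset by `start` cached tokens, so per-request masks differ exactly as the note above describes, while the per-step token count stays bounded by `chunk_size`.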

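istio Multi-Cluster

Posted 2024-05-20|Updated 2026-06-09|istio•Multi-Cluster Mesh

Troubleshooting Abnormal Cross-Cluster Traffic Distribution in Production. Background: in production, our platform runs an istio multi-cluster service mesh made up of 3 k8s clusters: A, E, and F. Over the past few days, one service's traffic was not split across clusters according to the weights configured in its DR (DestinationRule). As the figure below shows, all of the application's traffic went to zone F, even though the replicas were distributed A:F = 69:31, which is clearly wrong. Analysis: first we exec into an istio-ingress gateway pod and run the commands below to inspect the /clusters configuration, filtering for this service's cluster weight entries to check whether they look normal:
# run inside the istio-ingress pod
curl -s http://localhost:15000/clusters | grep qingqiu-72b-triton-prod-v1 | grep weight
# output:
outbound|80||image-auditing-test-v1.ai-ppt-beautify-test.svc.cluster.local::<F>:15443::weight::10
outbound|80||image-auditing-test-v1.ai-ppt-bea...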
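Checking the weights by eye gets tedious with many endpoints, so the `::weight::` lines can be aggregated into per-endpoint traffic shares and compared against the replica split. A minimal parsing sketch, assuming Envoy's text output keeps the `cluster::host:port::stat::value` layout seen above; the helper names and the endpoint addresses in the test are hypothetical (the post shows its address only as `<F>`).

```python
from collections import defaultdict
from typing import Dict, Iterable


def weights_by_host(lines: Iterable[str], cluster: str) -> Dict[str, int]:
    """Collect the `weight` stat per endpoint for one cluster from Envoy
    /clusters text output (lines shaped like cluster::host:port::stat::value)."""
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        line = line.strip()
        if "::" not in line:
            continue
        name, rest = line.split("::", 1)
        parts = rest.rsplit("::", 2)  # the host itself may contain "::" (IPv6)
        if name != cluster or len(parts) != 3:
            continue
        host, stat, value = parts
        if stat == "weight":
            totals[host] += int(value)
    return dict(totals)


def weight_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Convert raw endpoint weights into the fraction of traffic each should receive."""
    total = sum(weights.values())
    return {host: w / total for host, w in weights.items()} if total else {}
```

Lines for other stats (such as `cx_active`) and other clusters are skipped, so the full `/clusters` dump can be fed in without pre-filtering with `grep`.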