LLM Inference Knowledge Outline
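Fundamentals
- Communication basics: collective communication operations and their implementations
- LLM fundamentals:
Distributed parallelism strategies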
- Data Parallelism (DP)
- Tensor Parallelism (TP)
- Pipeline Parallelism (PP)
- Sequence Parallelism (SP) & Context Parallelism (CP)
- Expert Parallelism (EP)
- Combined (hybrid) parallelism
Attention optimization
- FlashAttention v1/v2/v3/v4
- Flash Decoding
- Sparse / Sliding Window / Linear Attention
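The FlashAttention family listed above builds on an online (streaming) softmax, which lets attention be computed block by block without materializing the full score row. The following is a minimal sketch of that idea for a single query and scalar values, not the actual kernel; the function name and `block_size` parameter are illustrative assumptions.

```python
import math

def online_softmax_weighted_sum(scores, values, block_size=2):
    """Compute sum_i softmax(scores)_i * values_i one block at a time.

    Illustrative only: a real kernel works on tiles of Q, K, V in on-chip
    memory; here each "value" is a single float and scores is non-empty.
    """
    running_max = -math.inf   # largest score seen so far
    denom = 0.0               # running softmax denominator
    acc = 0.0                 # running numerator (weighted sum of values)
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial results to the new maximum.
        scale = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in s_blk]
        denom = denom * scale + sum(probs)
        acc = acc * scale + sum(p * v for p, v in zip(probs, v_blk))
        running_max = new_max
    return acc / denom
```

The result matches a standard two-pass softmax up to floating-point error, while only one block of scores is live at any time.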
KV Cache management
- Paged Attention: efficient KV Cache management
- Prefix Cache: caching the KV Cache of shared prefixes
- Cache eviction / reuse
- Hybrid KV Cache Manager
- KV transfer
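As a rough illustration of the paged KV Cache idea above, the sketch below maps each sequence's logical token positions to fixed-size physical blocks through a block table. All names (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) are hypothetical and do not correspond to any specific framework's API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

BLOCK_SIZE = 16  # tokens per KV block (assumed value for illustration)

@dataclass
class BlockAllocator:
    """Hands out physical block ids from a fixed-size pool."""
    num_blocks: int
    free: List[int] = field(init=False)

    def __post_init__(self) -> None:
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("out of KV cache blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

@dataclass
class Sequence:
    block_table: List[int] = field(default_factory=list)  # logical -> physical
    num_tokens: int = 0

def append_token(seq: Sequence, alloc: BlockAllocator) -> None:
    # A new physical block is needed only when the last one is full.
    if seq.num_tokens % BLOCK_SIZE == 0:
        seq.block_table.append(alloc.allocate())
    seq.num_tokens += 1

def locate(seq: Sequence, token_idx: int) -> Tuple[int, int]:
    """Return (physical block id, offset within block) for a token."""
    return seq.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE
```

Because blocks are allocated on demand, memory waste per sequence is bounded by one partially filled block rather than by the maximum sequence length.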
Scheduling optimization
- Static batching
- Continuous Batching
- Chunked Prefill
- Prefill-Decode co-scheduling
- Co-locating long and short requests
- Admission control / priority
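The difference between static and continuous batching above can be seen in a toy simulation that counts decode iterations. This is a simplified sketch under strong assumptions: every request needs a known, positive number of decode steps, prefill cost is ignored, and both function names are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, max_batch):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps

def continuous_batching_steps(lengths, max_batch):
    """Finished requests leave and waiting ones join at every iteration."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill free slots before each decode iteration.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode iteration: every running request emits one token.
        running = [remaining - 1 for remaining in running]
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps
```

For request lengths `[2, 8, 2, 8]` and a batch size of 2, the static schedule takes 16 iterations and the continuous one takes 12, because short requests no longer hold a slot until the longest request in their batch completes.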
Kernel optimization
Serving architecture
- Offline vs Online inference
- Single-instance / multi-instance
- Router / Scheduler / Worker
- PD (Prefill-Decode) disaggregation
- Disaggregated Prefill
- Multi-model / Multi-LoRA serving
MoE system optimization
- Router
- Top-k gating
- Expert Parallelism
- all-to-all cost
- grouped GEMM / fused MoE kernels
- load balance / expert placement / replication
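To make the Router and Top-k gating items above concrete, here is a minimal sketch of routing one token to its top-k experts. It assumes one common convention, softmax over only the selected logits; other models normalize over all experts before selecting. The function name is hypothetical.

```python
import math

def top_k_gating(logits, k):
    """Return [(expert_index, weight)] for the k largest router logits.

    Assumption: weights are a softmax over the selected logits only.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)          # subtract max for stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]
```

With expert parallelism, the expert indices returned here determine which rank each token is sent to, which is where the all-to-all cost and load-balance concerns listed above come from.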
Model compression and deployment cost optimization
- Quantization
- KV cache quantization
- Distillation
- LoRA / Multi-LoRA serving
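As one simple instance of the quantization items above, the sketch below shows symmetric per-tensor absmax int8 quantization. It is only an illustration: production schemes typically use per-channel or per-group scales and calibration, and the function names here are hypothetical.

```python
def quantize_int8(xs):
    """Symmetric absmax quantization of a list of floats to int8 range."""
    scale = max(abs(x) for x in xs) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return [qi * scale for qi in q]
```

The round-trip error per element is at most half the scale, so a tensor with a few large outliers loses precision on its small values, which is one motivation for finer-grained scales.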
Communication and system fundamentals
- Collective communication operations and their implementations
- 通信-计算 overlap
- Bandwidth model and performance analysis
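A common starting point for the bandwidth model above is the latency-bandwidth (alpha-beta) estimate for ring all-reduce: 2(p-1) steps, each moving one chunk of size n/p. The sketch below encodes that estimate; the function and parameter names are illustrative, and real performance also depends on topology, protocol, and overlap with compute.

```python
def ring_allreduce_time(num_bytes, num_ranks, bandwidth_bytes_per_s, latency_s=0.0):
    """Estimated ring all-reduce time under the alpha-beta model.

    Reduce-scatter and all-gather each take (p - 1) steps, and every step
    sends one chunk of num_bytes / p over a link.
    """
    if num_ranks < 2:
        return 0.0
    steps = 2 * (num_ranks - 1)
    chunk = num_bytes / num_ranks
    return steps * (latency_s + chunk / bandwidth_bytes_per_s)
```

For example, reducing 1 GB across 8 ranks over 100 GB/s links with negligible latency gives roughly 17.5 ms. As p grows, the bandwidth term approaches 2n/B, which is why ring all-reduce is considered bandwidth-optimal for large messages.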