DeepSeek-v3 Infra Notes

Prefilling

4 nodes, 32 GPUs in total.

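Attention: TP4 (with SP) combined with DP8. The TP degree of 4 is small, so it does not add much communication.

MoE: EP32. Only EP is used here, not DP, so that each expert on each rank sees a large enough batch; this keeps the compute intensity high enough to make good use of the GPU. Communication is the MoE all-to-all.

Dense MLP: TP1, i.e. no parallelism.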
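The prefilling layout above can be sketched as a rank-to-group mapping. This is only an illustration: the notes give the group sizes (TP4, DP8, EP32 on 32 GPUs), while the assumption that consecutive ranks form a TP group, and the names `RankLayout` and `layout_for`, are hypothetical.

```python
from dataclasses import dataclass

NUM_GPUS = 32                   # 4 nodes, 32 GPUs in total (from the notes)
TP_SIZE = 4                     # attention tensor-parallel degree
DP_SIZE = NUM_GPUS // TP_SIZE   # 8 attention data-parallel replicas
EP_SIZE = 32                    # MoE expert-parallel degree, one EP rank per GPU


@dataclass(frozen=True)
class RankLayout:
    rank: int
    attn_dp_index: int  # which attention replica this GPU belongs to
    attn_tp_index: int  # position inside its TP4 group
    ep_index: int       # expert-parallel rank used by the MoE layers


def layout_for(rank: int) -> RankLayout:
    """Map a global rank to its attention and MoE groups.

    Assumption (not stated in the notes): consecutive ranks share a TP group.
    """
    if not 0 <= rank < NUM_GPUS:
        raise ValueError(f"rank must be in [0, {NUM_GPUS}), got {rank}")
    return RankLayout(
        rank=rank,
        attn_dp_index=rank // TP_SIZE,
        attn_tp_index=rank % TP_SIZE,
        ep_index=rank % EP_SIZE,
    )
```

With this mapping, every GPU takes part in exactly one TP4 attention group and is also one of the 32 EP ranks, which matches the point that the MoE layers run EP across all GPUs rather than being replicated with DP.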

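Redundant experts: heavily loaded experts are deployed redundantly (duplicated). Expert load is measured online, and the set of redundant experts is adjusted every 10 minutes based on the measured load. Each GPU holds 8+1 experts: 8 original experts plus 1 redundant expert.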
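A minimal sketch of how the 8+1 placement could be derived from measured load. The notes do not describe the actual selection or placement algorithm; the greedy rule below, the even traffic split between the two copies of a duplicated expert, and the function name `place_redundant_experts` are all assumptions made for illustration.

```python
def place_redundant_experts(load, num_gpus=32, experts_per_gpu=8):
    """Return {gpu: [expert ids]} with 8 original + 1 redundant expert per GPU.

    load[e] is the measured load of expert e (for example, tokens routed to it
    during the last measurement window).
    """
    num_experts = num_gpus * experts_per_gpu
    if len(load) != num_experts:
        raise ValueError(f"expected {num_experts} load values, got {len(load)}")

    # Original placement (assumed): contiguous blocks of experts per GPU.
    placement = {
        g: list(range(g * experts_per_gpu, (g + 1) * experts_per_gpu))
        for g in range(num_gpus)
    }

    # Duplicate the most heavily loaded experts, one replica per GPU.
    hot = sorted(range(num_experts), key=lambda e: -load[e])[:num_gpus]
    hot_set = set(hot)

    def share(e):
        # Assumption: a duplicated expert splits its traffic evenly.
        return load[e] / 2 if e in hot_set else load[e]

    gpu_load = [sum(share(e) for e in placement[g]) for g in range(num_gpus)]
    free = set(range(num_gpus))  # GPUs that have not received a replica yet

    for e in hot:  # heaviest replica first
        home = e // experts_per_gpu
        # Prefer a GPU other than the one already hosting the original copy.
        candidates = [g for g in free if g != home] or list(free)
        target = min(candidates, key=lambda g: (gpu_load[g], g))
        placement[target].append(e)
        gpu_load[target] += share(e)
        free.remove(target)

    return placement
```

In a deployment following the notes, something like this would be re-run every 10 minutes on fresh load statistics.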

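Communication overlap: two micro-batches with roughly equal compute cost are processed together, so that the MoE all-to-all dispatch + combine of one is hidden behind the computation of the other.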
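To see why two similarly sized micro-batches help, here is a toy timing model. It assumes one compute stream and one communication stream, and treats each stage as a compute block followed by an all-to-all block; the function name `total_time` and the durations in the usage note are made up for illustration, not measurements.

```python
def total_time(num_stages, compute, comm, overlap):
    """Time for two micro-batches to run num_stages of (compute -> all-to-all)."""
    if not overlap:
        # Everything is serialized: 2 micro-batches x stages x (compute + comm).
        return 2 * num_stages * (compute + comm)

    compute_free = 0.0   # time at which the compute stream is next idle
    comm_free = 0.0      # time at which the communication stream is next idle
    ready = [0.0, 0.0]   # earliest time each micro-batch may start computing

    for _ in range(num_stages):
        for mb in (0, 1):
            # Compute waits for the stream and for this micro-batch's data.
            start = max(compute_free, ready[mb])
            compute_free = start + compute
            # The all-to-all starts once compute is done and the link is free.
            comm_start = max(comm_free, compute_free)
            comm_free = comm_start + comm
            ready[mb] = comm_free

    return max(ready)
```

For example, with 10 stages and `compute = comm = 1.0`, the serialized schedule takes 40.0 time units while the overlapped one takes 21.0: apart from the final all-to-all, communication is hidden. The hiding is only this complete when the two micro-batches have similar compute cost, which is why the notes stress that point.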

Dynamic redundancy strategy: each GPU hosts 16 experts, but only 9 of them are activated in a given step. The optimal routing scheme is computed before the MoE all-to-all communication.

Decoding

40 nodes, 320 GPUs.

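Attention: TP4 (with SP), DP80.

MoE: EP320, with 256 original experts and 64 shared/redundant experts. Dispatch and combine use direct point-to-point transfers.

Redundant experts: similar to prefilling.

Communication overlap: still two micro-batches, but attention takes up more of the time in the decoding stage, so the attention of one micro-batch is overlapped with the dispatch + MoE + combine of the other micro-batch.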

Because each GPU only needs to load the parameters of one expert for the MoE, the memory-access overhead is small, and a small fraction of the SMs is enough to handle dispatch + MoE + combine.

Deepseek EPLB Note

DeepSeek-v3 Model Notes