A Small Issue in the Code Implementation of the Auxiliary-Loss-Free Load Balancing Expert Bias
In the implementation of the expert bias for Auxiliary-Loss-Free Load Balancing, the code defines the bias as: self.e_score_correction_bias = nn.Parameter(torch.empty((self.n_routed_experts))).
I believe there may be a subtle issue with this. The expert bias is meant to be adjusted dynamically based on each expert's historical load, not updated through backpropagation. I would therefore suggest explicitly disabling gradient computation for this parameter by setting requires_grad=False, so that gradients are stopped at this tensor. This ensures the optimizer never updates it and keeps its update mechanism consistent with the intended design.
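For illustration, here is a minimal sketch of what I have in mind. The class, the parameter names (top_k, bias_update_speed), and the simple sign-based update rule are my own illustrative assumptions rather than the repository's actual code; only e_score_correction_bias and n_routed_experts are taken from the snippet above.

```python
import torch
import torch.nn as nn


class TopkRouterSketch(nn.Module):
    """Illustrative router whose expert bias is excluded from backprop."""

    def __init__(self, hidden_size: int, n_routed_experts: int, top_k: int,
                 bias_update_speed: float = 1e-3):
        super().__init__()
        self.n_routed_experts = n_routed_experts
        self.top_k = top_k
        self.bias_update_speed = bias_update_speed
        self.weight = nn.Parameter(torch.empty(n_routed_experts, hidden_size))
        # Option A: keep nn.Parameter but disable gradients, so the optimizer
        # never updates it even if it is passed in via model.parameters().
        self.e_score_correction_bias = nn.Parameter(
            torch.zeros(n_routed_experts), requires_grad=False)
        # Option B (alternative): register it as a buffer instead, so it is
        # kept in the state dict but never shows up in model.parameters():
        # self.register_buffer("e_score_correction_bias",
        #                      torch.zeros(n_routed_experts))

    def forward(self, hidden_states: torch.Tensor):
        scores = torch.sigmoid(hidden_states @ self.weight.t())
        # The bias only influences expert *selection*; the gating weights that
        # scale the expert outputs are taken from the unbiased scores.
        biased_scores = scores + self.e_score_correction_bias
        topk_idx = biased_scores.topk(self.top_k, dim=-1).indices
        topk_weight = scores.gather(-1, topk_idx)
        return topk_idx, topk_weight

    @torch.no_grad()
    def update_bias(self, topk_idx: torch.Tensor):
        # Load-based update (illustrative): raise the bias of underloaded
        # experts and lower it for overloaded ones, in the spirit of the
        # auxiliary-loss-free load-balancing strategy.
        load = torch.bincount(topk_idx.flatten(),
                              minlength=self.n_routed_experts).float()
        error = load.mean() - load  # positive => expert is underloaded
        self.e_score_correction_bias += self.bias_update_speed * torch.sign(error)
```

With either option the bias is only ever changed inside the no_grad update step, never by the optimizer.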
Is my understanding correct, or am I missing something?