Exact computations for multi-head latent attention

#9
by mseeger - opened

Hello,
I read the DeepSeek-V2 paper. They do not explain all the details of how to do the computations efficiently with MLA. In particular, they say that one does not have to explicitly compute the q, k, v vectors, because the weight matrices labeled with "U" can be combined, but they do not spell out how.

This does not seem so easy to me, because q, k, v have to be split into heads. One does not compute k^T q for the full vectors, but (k_h)^T (q_h) for every head h. Following their paper, it would be easy to compute k^T q = (c_k)^T W c_q for a single (d_c, d_c) matrix W, but to do the same for the per-head inner products, I need n_h such matrices. That is still a reduction, but less of one than the paper states.
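
To make this concrete, here is a minimal NumPy sketch of what I mean (the names W_uq/W_uk, the toy dimensions, and using a single latent dimension d_c for both the query and the key latents are my own assumptions, not necessarily the paper's exact setup). The per-head scores can be computed from the latents alone, but only via one absorbed (d_c, d_c) matrix per head:

```python
import numpy as np

# Toy dimensions (assumptions for illustration, not the paper's exact sizes)
d, n_h = 512, 8          # model dim, number of heads
d_h = d // n_h           # head dim
d_c = 4 * d_h            # latent (compressed) dim, as discussed above

rng = np.random.default_rng(0)

# Hypothetical per-head up-projections: q_h = W_uq[h] @ c_q, k_h = W_uk[h] @ c_k
W_uq = rng.standard_normal((n_h, d_h, d_c))
W_uk = rng.standard_normal((n_h, d_h, d_c))

c_q = rng.standard_normal(d_c)   # compressed query latent
c_k = rng.standard_normal(d_c)   # compressed key latent

# Naive: materialize q_h, k_h and take the per-head inner products
scores_naive = np.array([(W_uk[h] @ c_k) @ (W_uq[h] @ c_q) for h in range(n_h)])

# Absorbed: one (d_c, d_c) matrix per head, W[h] = W_uk[h]^T @ W_uq[h]
W_absorbed = np.einsum('hdc,hde->hce', W_uk, W_uq)    # shape (n_h, d_c, d_c)
scores_absorbed = np.einsum('c,hce,e->h', c_k, W_absorbed, c_q)

assert np.allclose(scores_naive, scores_absorbed)
```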

The same story holds for the v vectors. One certainly cannot fold everything into the final W_o matrix, for the same reason one cannot fold W_o with W_v in the original MHA: the per-head attention weights sit in between.
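
To see the same per-head structure on the value side, here is a sketch under the same toy assumptions (W_uv, the per-head attention weights alpha, and the head-wise slicing of W_o are my own naming). Each head's slice of W_o can be combined with that head's value up-projection, but only per head, because the attention weights sit between them, so this path also ends up with n_h folded matrices rather than a single one:

```python
import numpy as np

# Toy dimensions (assumptions for illustration)
d, n_h = 512, 8
d_h = d // n_h
d_c = 4 * d_h

rng = np.random.default_rng(1)

W_uv = rng.standard_normal((n_h, d_h, d_c))   # hypothetical per-head value up-projection
W_o = rng.standard_normal((d, n_h * d_h))     # output projection over concatenated heads
c_kv = rng.standard_normal(d_c)               # compressed KV latent (single position)
alpha = rng.random(n_h)                       # per-head attention weight for this position

# Naive: materialize v_h, weight per head, concatenate, then apply W_o
v = np.stack([W_uv[h] @ c_kv for h in range(n_h)])          # (n_h, d_h)
out_naive = W_o @ (alpha[:, None] * v).reshape(-1)

# Folded: combine each head's slice of W_o with W_uv[h]; still one (d, d_c)
# matrix per head, because alpha_h sits between W_uv and W_o
W_o_heads = W_o.reshape(d, n_h, d_h).transpose(1, 0, 2)     # (n_h, d, d_h)
W_folded = np.einsum('hde,hec->hdc', W_o_heads, W_uv)       # (n_h, d, d_c)
out_folded = np.einsum('h,hdc,c->d', alpha, W_folded, c_kv)

assert np.allclose(out_naive, out_folded)
```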

I've seen some articles here and there, but they all miss the (crucial!) point that q, k, v do not have to be computed explicitly. Can somebody help?

Is there a Hugging Face model implementation of this? I don't see anything like "deepseek" under src/transformers/models/.

When I work this out properly, using d_c = 4*d_h as stated in their paper, I get d^2 * (28 / n_h) = 4 * d^2 * (7 / n_h) scalars for the weights. That is less than the 4 * d^2 of MHA only if n_h > 7. It is a reduction, but less impressive than the paper makes it sound.
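
A trivial sanity check of that crossover point, taking the 28/n_h count above at face value (d is arbitrary and cancels out of the ratio):

```python
# Compare the absorbed-MLA weight count claimed above (28 * d^2 / n_h)
# against the 4 * d^2 of standard MHA.
d = 4096  # model dim; arbitrary, cancels out of the ratio

for n_h in (4, 7, 8, 16, 32, 128):
    mla = 28 * d**2 / n_h
    mha = 4 * d**2
    print(f"n_h={n_h:4d}  MLA/MHA parameter ratio = {mla / mha:.2f}")
```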
