SmolLM and mergekit-moe: is lm_head missing?

#14

by sylvain471 - opened Aug 31, 2024

Aug 31, 2024

Hello,

I am trying to experiment with mergekit-moe and SmolLM, but I am facing a problem that I can't solve. I am not sure whether the problem is related to SmolLM or to mergekit-moe.

Using a dummy merging config.yaml such as

base_model: HuggingFaceTB/SmolLM-135M
gate_mode: random
dtype: bfloat16
experts:
  - source_model: HuggingFaceTB/SmolLM-135M
  - source_model: HuggingFaceTB/SmolLM-135M

and running the command

mergekit-moe config.yaml merge --copy-tokenizer

I am getting the following error

Fetching 7 files: 100%|████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 70407.98it/s]
Fetching 7 files: 100%|█████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 8063.75it/s]
Fetching 7 files: 100%|████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 18213.48it/s]
Warm up loaders: 100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  4.13it/s]
Weights: 100%|█████████████████████████████████████████████████████████████████████████████████████▋| 272/273 [00:00<00:00, 2120.41it/s]
Traceback (most recent call last):
  File "/home/ubuntu/merging/.venv/bin/mergekit-moe", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/merging/.venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/merging/.venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/merging/.venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/merging/.venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/merging/mergekit/mergekit/options.py", line 82, in wrapper
    f(*args, **kwargs)
  File "/home/ubuntu/merging/mergekit/mergekit/scripts/moe.py", line 211, in main
    build(
  File "/home/ubuntu/merging/mergekit/mergekit/scripts/moe.py", line 82, in build
    out_arch.write_model(
  File "/home/ubuntu/merging/mergekit/mergekit/moe/mixtral.py", line 160, in write_model
    tensor = base_loader.get_tensor(
  File "/home/ubuntu/merging/mergekit/mergekit/io/lazy_tensor_loader.py", line 127, in get_tensor
    raise KeyError(key)
KeyError: 'lm_head.weight'

Somehow lm_head.weight seems to be missing. But when I load SmolLM and inspect the layers, I get

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((576,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=576, out_features=49152, bias=False)
)

indicating that lm_head is right where it should be.

However, when I inspect the tensors in the "Files and versions" tab on HF, lm_head does not appear, as the following screenshot shows.

Somehow lm_head seems to be missing...

Any thoughts?

eliebak

Hugging Face TB Research org Sep 3, 2024

edited Sep 3, 2024

Hey, this is due to tie_word_embeddings=true in the config: the lm_head shares its weight tensor with the embed_tokens layer (the linear layer simply applies it transposed), so no separate lm_head.weight is stored in the checkpoint. You probably have to replace AutoModel with AutoModelForCausalLM somewhere in mergekit-moe to make it work.
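A minimal sketch of how one might confirm this diagnosis without loading the model, assuming you have the contents of config.json and the list of tensor names stored in the checkpoint. The function name lm_head_is_tied and the example data are hypothetical stand-ins, not the real SmolLM files or any mergekit API:

```python
import json


def lm_head_is_tied(config_json: str, tensor_names: set[str]) -> bool:
    """Return True when lm_head.weight is absent because it is tied to the embeddings."""
    config = json.loads(config_json)
    tied = bool(config.get("tie_word_embeddings", False))
    return (
        tied
        and "lm_head.weight" not in tensor_names
        and "model.embed_tokens.weight" in tensor_names
    )


# Illustrative stand-ins for config.json and the checkpoint's tensor names.
example_config = '{"model_type": "llama", "tie_word_embeddings": true}'
example_tensor_names = {
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.norm.weight",
}

print(lm_head_is_tied(example_config, example_tensor_names))
```

If this check is true, the missing key is expected: the loaded model still shows an lm_head module because the library reuses the embedding matrix for it at load time.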

sylvain471

Sep 3, 2024

edited Sep 3, 2024

Hello, yes that was it!

Pointing toward model.embed_tokens.weight when asked for lm_head.weight solves the merging problem.
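The idea can be sketched independently of mergekit as a lookup with a fallback. This is only an illustration under stated assumptions: the checkpoint is modelled as a plain dict, the tensors are plain lists, and get_tensor_with_tied_fallback is a hypothetical helper, not a mergekit function:

```python
from typing import Dict, List

Tensor = List[float]  # stand-in for a real tensor type


def get_tensor_with_tied_fallback(
    tensors: Dict[str, Tensor],
    name: str,
    tie_word_embeddings: bool,
) -> Tensor:
    """Look up a tensor, falling back to the embeddings for a tied lm_head."""
    if name in tensors:
        return tensors[name]
    if name == "lm_head.weight" and tie_word_embeddings:
        # Tied checkpoints store only the embedding matrix.
        return tensors["model.embed_tokens.weight"]
    raise KeyError(name)


# Toy checkpoint without a separate lm_head.weight entry.
checkpoint = {"model.embed_tokens.weight": [0.1, 0.2, 0.3]}

head = get_tensor_with_tied_fallback(checkpoint, "lm_head.weight", True)
print(head is checkpoint["model.embed_tokens.weight"])
```

Without the tie_word_embeddings flag the same lookup raises KeyError('lm_head.weight'), which matches the error in the traceback above.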

mergekit has JSON files that define the architecture of the models it supports; I will add one for SmolLM to avoid modifying the source code.

Thanks for the help!

sylvain471 changed discussion status to closed Sep 3, 2024

chnaaam

19 days ago

I had the same issue. Could you explain in more detail how to fix it?

sylvain471

19 days ago

Hello, it's been a while since I looked at this. As I recall, I only changed mergekit/moe/mixtral.py, adding lines 145-147 and 161-173 as in the following code. It is a quick and dirty fix, but I hope it helps!

# Copyright (C) 2024 Charles O. Goddard
#
# This software is free software: you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# This software is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program. If not, see http://www.gnu.org/licenses/.

import logging
from typing import List, Optional

import torch
import tqdm
import transformers

from mergekit.architecture import MISTRAL_INFO, WeightInfo
from mergekit.moe.arch import MoEOutputArchitecture
from mergekit.moe.common import initialize_io, noise_and_scale, select_dtype
from mergekit.moe.config import MoEMergeConfig
from mergekit.options import MergeOptions


class MixtralMoE(MoEOutputArchitecture):
    def name(self) -> str:
        return "Mixtral"

    def supports_config(
        self,
        config: MoEMergeConfig,
        explain: bool = False,
        trust_remote_code: bool = False,
    ) -> bool:
        if config.shared_experts:
            if explain:
                logging.warning("Mixtral does not support shared experts")
            return False

        model_types = []
        for model_ref in [config.base_model] + [e.source_model for e in config.experts]:
            model_cfg = model_ref.config(trust_remote_code=trust_remote_code)
            model_types.append(model_cfg.model_type)

        if len(set(model_types)) != 1:
            if explain:
                logging.warning(
                    "Mixtral requires all input models to have the same architecture"
                )
            return False
        if model_types[0] not in ("llama", "mistral"):
            if explain:
                logging.warning(
                    "Mixtral requires all input models to be Llama or Mistral models"
                )
            return False
        return True

    def _generate_config(
        self,
        base_config: transformers.PretrainedConfig,
        num_experts: int,
        shared_experts: Optional[int] = None,
        experts_per_token: Optional[int] = None,
    ) -> transformers.PretrainedConfig:
        if shared_experts:
            raise NotImplementedError("Shared experts not supported for Mixtral output")

        if not isinstance(base_config, transformers.MistralConfig):
            base_cfg_mistral = transformers.MistralConfig(**base_config.to_dict())
            base_cfg_mistral.sliding_window = None
            base_cfg_mistral.max_position_embeddings = (
                base_config.max_position_embeddings
            )
            base_config = base_cfg_mistral

        out_cfg = transformers.MixtralConfig(**base_config.to_dict())
        out_cfg.architectures = ["MixtralForCausalLM"]
        out_cfg.num_local_experts = num_experts
        out_cfg.num_experts_per_tok = experts_per_token or 2
        out_cfg.sliding_window = None

        if (out_cfg.num_local_experts & (out_cfg.num_local_experts - 1)) != 0:
            logging.warning(
                f"Your model has {out_cfg.num_local_experts} experts, which is "
                "not a power of two. The model will not be usable in llama.cpp."
            )
        return out_cfg

    def _remap_weight_name(self, weight: WeightInfo) -> str:
        if ".mlp." not in weight.name:
            # Everything but MLP is identical to base Mistral
            return weight.name

        res = weight.name
        for needle, replacement in [
            (".mlp.gate_proj", ".block_sparse_moe.experts.{expert_idx}.w1"),
            (".mlp.down_proj", ".block_sparse_moe.experts.{expert_idx}.w2"),
            (".mlp.up_proj", ".block_sparse_moe.experts.{expert_idx}.w3"),
        ]:
            res = res.replace(needle, replacement)
        return res

    def _router_weight_name(self, layer_idx: int) -> str:
        return f"model.layers.{layer_idx}.block_sparse_moe.gate.weight"

    def write_model(
        self,
        out_path: str,
        config: MoEMergeConfig,
        merge_options: MergeOptions,
        router_weights: List[torch.Tensor],
        shared_router_weights: Optional[List[torch.Tensor]] = None,
    ):
        base_model = config.base_model
        base_cfg = base_model.config(trust_remote_code=merge_options.trust_remote_code)

        assert len(router_weights) == base_cfg.num_hidden_layers, (
            f"Expected {base_cfg.num_hidden_layers} router weights, "
            f"got {len(router_weights)}"
        )

        out_dtype = select_dtype(config, base_cfg)
        out_cfg = self._generate_config(
            base_cfg,
            len(config.experts),
            len(config.shared_experts or []),
            config.experts_per_token,
        )
        out_cfg.torch_dtype = out_dtype
        out_cfg.save_pretrained(out_path)

        loaders, base_loader, writer = initialize_io(config, out_path, merge_options)
        for weight_info in tqdm.tqdm(
            MISTRAL_INFO.all_weights(base_cfg),
            desc="Weights",
        ):
            tensor_name = self._remap_weight_name(weight_info)
            if "{expert_idx}" in tensor_name:
                # Only remapped MLP weights contain "{expert_idx}", so
                # lm_head.weight never reaches this branch; the tied-embedding
                # fallback is handled in the else branch below.
                for expert_index, expert in enumerate(config.experts):
                    expert_name = tensor_name.replace("{expert_idx}", str(expert_index))
                    expert_loader = loaders.get(expert.source_model)
                    tensor = expert_loader.get_tensor(
                        weight_info.name, aliases=weight_info.aliases
                    )
                    tensor = noise_and_scale(
                        tensor, expert, is_residual="down_proj" in tensor_name
                    )
                    writer.save_tensor(
                        expert_name,
                        tensor.to(dtype=out_dtype),
                        clone=merge_options.clone_tensors,
                    )
            else:
                if tensor_name == "lm_head.weight" and base_cfg.tie_word_embeddings:
                    # Tied embeddings: the checkpoint stores no separate
                    # lm_head.weight, so read the shared embedding matrix
                    # instead; it is written out under the lm_head name below.
                    tensor = base_loader.get_tensor(
                        "model.embed_tokens.weight", aliases=weight_info.aliases
                    )
                else:
                    tensor = base_loader.get_tensor(
                        tensor_name, aliases=weight_info.aliases
                    )
                writer.save_tensor(
                    tensor_name,
                    tensor.to(dtype=out_dtype),
                    clone=merge_options.clone_tensors,
                )

        for layer_idx, weight in enumerate(
            tqdm.tqdm(router_weights, desc="Router weights")
        ):
            writer.save_tensor(
                self._router_weight_name(layer_idx),
                weight.to(dtype=out_dtype).contiguous(),
                clone=merge_options.clone_tensors,
            )

        writer.finalize()

sylvain471 changed discussion status to open 19 days ago

chnaaam

15 days ago

@sylvain471 Thanks :)

sylvain471

13 days ago

Great, I am glad it helped. I will close the discussion then.

sylvain471 changed discussion status to closed 13 days ago
