Code used to convert this / could you do v3 base?

#3
by deltanym - opened

I've been trying for a while to get V3 base running on 8xH100 on Modal, which has been frustrating to use, and my best setup right now takes >10 min per response. I'll try quantizing it with AutoAWQ since DeepSeek V3 support was recently added, but if you could do it, I would appreciate that a lot.

Cognitive Computations org

AutoAWQ is very slow; quantizing would take about half a day. But here's the code:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

MODEL_PATH = "in"
QUANT_PATH = "out"
# 4-bit AWQ, GEMM kernel, group size 128; the MLA kv_a_proj_with_mqa
# projection is left unquantized via modules_to_not_convert.
QUANT_CONFIG = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM", "modules_to_not_convert": ["self_attn.kv_a_proj_with_mqa"]}

def main():
    # Load the full-precision model and tokenizer.
    model = AutoAWQForCausalLM.from_pretrained(MODEL_PATH, low_cpu_mem_usage=True)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    # Run AWQ calibration and quantize the weights.
    model.quantize(
        tokenizer,
        quant_config=QUANT_CONFIG,
    )
    # Save the quantized checkpoint in 10GB shards, plus the tokenizer.
    model.save_quantized(QUANT_PATH, shard_size="10GB")
    tokenizer.save_pretrained(QUANT_PATH)
    print(f"Model is quantized and saved at \"{QUANT_PATH}\".")

if __name__ == "__main__":
    main()
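
For serving on the 8xH100 setup, loading the quantized output with vLLM should be much faster than going through plain Transformers. A minimal sketch, assuming the "out" path from the script above, tensor parallelism across the 8 GPUs, and a vLLM build with AWQ support for DeepSeek V3:

from vllm import LLM, SamplingParams

# "out" is the quantized checkpoint produced by the quantization script above.
llm = LLM(
    model="out",
    quantization="awq",
    tensor_parallel_size=8,  # split across the 8 H100s
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Write a haiku about quantization."],
    SamplingParams(temperature=0.6, max_tokens=128),
)
print(outputs[0].outputs[0].text)

The tensor_parallel_size and sampling settings are just placeholders for that setup; adjust them to whatever your Modal deployment actually uses.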
v2ray changed discussion status to closed
