4k context by default?

#1
by mclassHF2023 - opened

Thanks for the GGUFs as always! Just a question:
The original model has a 160k context length. I'm not familiar with all the rope_freq_base and compress_pos_emb settings, but something seems off.
I do believe it's possible to get more than 4k context out of this, but I don't really know how.

Sorry for the bump, but I'm still confused by this one...

oh sorry about that, if you look at the original model's config.json file it'll give you some guidance:

https://huggingface.co./deepseek-ai/DeepSeek-V2.5-1210/blob/main/config.json#L39

  "rope_scaling": {
    "beta_fast": 32,
    "beta_slow": 1,
    "factor": 40,
    "mscale": 1.0,
    "mscale_all_dim": 1.0,
    "original_max_position_embeddings": 4096,
    "type": "yarn"
  },

so you'll need to set the RoPE scaling type to yarn

you should be able to see the options if you use --help, I think, but you can also find them documented in the server README here:

https://github.com/ggerganov/llama.cpp/tree/53ff6b9b9fb25ed0ec0a213e05534fe7c3d0040f/examples/server

should work for llama-cli as well I'm pretty sure, just CTRL+F for 'yarn'
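
e.g. instead of scrolling, something like this should pull out the relevant flags directly (assuming you're in a llama.cpp build directory and that the binary names match your build):

./llama-server --help 2>&1 | grep -i yarn
./llama-cli --help 2>&1 | grep -i rope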

my best guess is you'll want:

--rope-scaling yarn --yarn-orig-ctx 4096

plus a -c / --ctx-size for however much context you actually want. Note that the "factor": 40 in the config is the context extension factor (4096 × 40 = 163,840 tokens max), not --yarn-attn-factor; --yarn-attn-factor corresponds to mscale, which already defaults to 1.0.

beta_slow defaults to 1 and beta_fast defaults to 32, so those already match the config
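
putting that together, a full invocation would look roughly like this (just a sketch: the .gguf filename and the -c value are placeholders, swap in whichever quant file and context size you actually want):

./llama-server -m DeepSeek-V2.5-1210-Q4_K_M.gguf -c 32768 \
  --rope-scaling yarn --yarn-orig-ctx 4096 \
  --yarn-beta-fast 32 --yarn-beta-slow 1

raising -c is what actually buys you the longer context; the yarn flags just control how the RoPE gets stretched to cover it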

you might also be able to get away with JUST --rope-scaling yarn; there's an implication that it can pull the rest from the model's metadata, but I'm not positive
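
if that works, the minimal form would just be something like this (filename is a placeholder again, and you'd still pass -c for however much context you want):

./llama-cli -m DeepSeek-V2.5-1210-Q4_K_M.gguf -c 16384 --rope-scaling yarn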
