Maximum context length (512)
I thought Llama 2's maximum context length was 4,096 tokens, but when I ran inference with this model I saw that the maximum context length is 512. What is the reason for this change?
Thank you
You have to set the n_ctx parameter; 512 is the default.
For example:
from llama_cpp import Llama
llm = Llama(model_path="wizardlm-1.0-uncensored-llama2-13b.Q5_K_S.gguf", n_ctx=4096, n_gpu_layers=-1)
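For reference, a minimal usage sketch once the model is loaded as above (the prompt and generation settings are just illustrative, not from the original post):
output = llm(
    "Summarize the following article: ...",  # illustrative prompt; anything up to ~4,096 tokens now fits
    max_tokens=256,  # number of tokens to generate
)
print(output["choices"][0]["text"])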
Thanks a lot. And what about the ctransformers implementation? I don't see the parameter in AutoModelForCausalLM.from_pretrained or in the generation method.
I had to move from ctransformers to llama-cpp-python. https://github.com/abetlen/llama-cpp-python
There's currently a context_length parameter available in ctransformers: https://github.com/marella/ctransformers#config. So you can set something like this:
from ctransformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7b-Chat-GGUF",
    # ...
    context_length=4096,
)
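If it helps, a quick sketch of generating text with a model loaded this way (the prompt is illustrative; max_new_tokens is one of the generation options listed in the same config docs):
print(model("AI is going to", max_new_tokens=64))  # returns the generated text as a string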
Is there a way to set this when using the Inference Endpoints / API?
@karmiq
I did set context_length = 4096 but somehow it still says "Token indices sequence length is longer than the specified maximum sequence length for this model (2093 > 2048)."
I am using AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGUF", hf=True, context_length=4096). Can you tell me what the issue is? Thanks
Update: I guess it was a versioning issue. I reinstalled and it works. Thanks
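For anyone hitting the same thing, a quick sketch to check which ctransformers version is installed (assuming, as suggested above, that an outdated package is what ignores context_length):
import importlib.metadata
print(importlib.metadata.version("ctransformers"))  # if it's old, upgrade with: pip install -U ctransformers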
I am using the LlamaCpp class from LangChain, where you can increase the context length to the maximum. The argument is n_ctx; I am attaching the link to the docs here.
https://api.python.langchain.com/en/latest/_modules/langchain_community/llms/llamacpp.html#LlamaCpp
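For completeness, a minimal sketch of what that looks like (the model path reuses the GGUF file mentioned earlier in this thread and is just a placeholder for whatever local file you have; .invoke assumes a recent LangChain version):
from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="wizardlm-1.0-uncensored-llama2-13b.Q5_K_S.gguf",  # placeholder: any local GGUF file
    n_ctx=4096,  # raise the context window from the 512 default
)
print(llm.invoke("Hello"))  # simple smoke test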