Is it possible to make the model only return the response, without the prompt?
#61 · opened by mduran159
With the example code you posted I can only make it return the entire prompt with the model's response appended at the end, as is usual with these models. But when you use a pipeline you can avoid all of that, so is there a way to make this work like an LLM with a pipeline and return only the model's response/answer?
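For example, with a plain text-generation pipeline something like this returns only the new text (a rough sketch of what I mean; the model name here is just a placeholder):

from transformers import pipeline

# Placeholder model; any instruct model works the same way
pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
result = pipe("What is the capital of France?", max_new_tokens=50, return_full_text=False)
print(result[0]["generated_text"])  # only the generated answer, no prompt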
I'm not sure what's conventional, as this is the most I've used Transformers, but you can always strip it based on the special tokens, right?
start_token = "<|start_header_id|>assistant<|end_header_id|>"
end_token = "<|eot_id|>"

# Find the assistant header and the final EOT marker in the decoded output
start_pos = processor_output.find(start_token)
end_pos = processor_output.rfind(end_token)
if start_pos != -1 and end_pos != -1 and start_pos + len(start_token) < end_pos:
    content = processor_output[start_pos + len(start_token):end_pos].strip()
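(Here processor_output is assumed to be the full decoded string, e.g. processor.decode(output[0]) without skip_special_tokens, so the assistant header and <|eot_id|> markers are still present to search on.)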
I know skip_special_tokens can be set while decoding, but the special tokens seem to provide good structure to split on.
By nature, a transformer model's output contains the input prompt first, followed by the newly generated tokens. You can easily strip out the input tokens and the last end-of-text (EOT) token from the output.
# Generate tokens
output = model.generate(**inputs, max_new_tokens=250, temperature=0.1)

# Strip out the input tokens and the trailing end-of-text (EOT) token
num_input_tokens = inputs["input_ids"].shape[1]
cleaned_output = output[0, num_input_tokens:-1]
print(processor.decode(cleaned_output))
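If you also want to drop any remaining special tokens from the text, processor.decode accepts skip_special_tokens=True as well.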