How to run inference in real time

#23
by thunder-007 - opened

Is there a way to get the text out of the model in real time, as it is generated, like hosted LLMs do?

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))

How do I solve the following error? ValueError: Tokenizer class GemmaTokenizer does not exist or is not currently imported.

@Jamison98 You probably need to update your transformers package: pip install -U transformers

@Jamison98 For me it helped to update both transformers (as @alxdk mentioned) and torch: pip install -U "torch>=2.1.1"

Is there a way to get the text out of the model in real time, as it is generated, like hosted LLMs do?

@thunder-007: you can use TextStreamer to print the output while the tokens are still being generated.

Example:

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
streamer = TextStreamer(tokenizer)
...
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt")

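# The streamer prints the decoded text to stdout as tokens are generated.
outputs = model.generate(**input_ids, max_new_tokens=500, streamer=streamer)
# generate() still returns the full sequence, so this prints the text once more in one piece.
print(tokenizer.decode(outputs[0]))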

Happy inferencing!

Reference: https://huggingface.co/docs/transformers/generation_strategies#streaming
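
Since the question asks about yielding the text rather than only printing it, here is a minimal sketch of one possible variation using TextIteratorStreamer from transformers, which can be iterated over while generate() runs in a background thread. The helper name stream_text and its parameters are hypothetical, chosen only for illustration; it assumes a model and tokenizer loaded as in the example above.

```python
from threading import Thread


def stream_text(model, tokenizer, prompt, max_new_tokens=500):
    """Yield decoded text chunks while model.generate() is still running."""
    # Imported inside the function so this sketch can be loaded
    # even where transformers is not installed.
    from transformers import TextIteratorStreamer

    inputs = tokenizer(prompt, return_tensors="pt")
    # skip_prompt=True keeps the echoed prompt out of the stream.
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
    generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens)

    # generate() blocks until it finishes, so it runs in a background thread
    # while this generator consumes the streamer.
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    for chunk in streamer:
        yield chunk
    thread.join()
```

With the model and tokenizer already loaded, a loop such as: for chunk in stream_text(model, tokenizer, input_text): print(chunk, end="", flush=True) should print the text piece by piece, and the same generator could feed a web response or a UI instead of stdout.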

osanseviero changed discussion status to closed
