HuggingFace Newbie - What input does this model expect?

#1
by joe-muller - opened

I am fairly new to HuggingFace and deploying models for inference. I am using beam.cloud to deploy this model but I'm not sure how to actually use it. When I send a list of messages, the response contains an output field with what looks like gibberish text completion.

What type of input does this model expect? Is there somewhere to see that on HuggingFace?

How is this model intended to be used? Should I be constantly sending partial transcripts until it tells me the turn is over?
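
For example, is something like this loop the intended usage pattern? (predict_eou here is just a hypothetical placeholder for however the model is actually supposed to be called, and 0.85 is an arbitrary threshold I made up):

def predict_eou(transcript: str) -> float:
    # Hypothetical helper: in practice this would run the turn-detector model
    # on the conversation so far and return an end-of-utterance probability.
    raise NotImplementedError

def wait_for_end_of_turn(partial_transcripts, threshold: float = 0.85) -> str:
    # Keep appending partial transcripts until the model says the turn is over.
    text = ""
    for chunk in partial_transcripts:
        text = (text + " " + chunk).strip()
        if predict_eou(text) >= threshold:
            break
    return text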

Thanks!

Here's a quick example of trying to use the model with the transformers library. What task should I use? The example code uses text-generation, but that doesn't give me understandable results:

from transformers import pipeline, Pipeline

messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am me."},
    {"role": "user", "content": "But who are you really?"},
    {"role": "assistant", "content": "I am me."},
    {"role": "user", "content": "But who does"}
]
pipe: Pipeline = pipeline("text-generation", model="livekit/turn-detector")
result = pipe(messages)
print(result)

Outputs:

[{'generated_text': [{'role': 'user', 'content': 'Who are you?'}, {'role': 'assistant', 'content': 'I am me.'}, {'role': 'user', 'content': 'But who are you really?'}, {'role': 'assistant', 'content': 'I am me.'}, {'role': 'user', 'content': 'But who does'}, {'role': 'assistant', 'content': 'youwhatwhatwhatwhat'}]}]

This seems to work better:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("livekit/turn-detector")

messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am John."},
    {"role": "user", "content": "What is your last name?"},
    {"role": "assistant", "content": "Smith."},
    {"role": "user", "content": "How do you spell the first"}
]

# Format messages using the chat template
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=False,
    add_special_tokens=False,
    tokenize=False
)

# Remove the EOU token from current utterance
ix = text.rfind("<|im_end|>")
text = text[:ix]

# Tokenize
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "livekit/turn-detector")

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    print("probabilities", probabilities)
    # Use index 1 for the positive class probability
    eou_probability = probabilities[0, 1].item()

print(f"End of utterance probability: {eou_probability}")

It outputs something like this:

probabilities tensor([[0.9695, 0.0305]])
End of utterance probability: 0.030476752668619156

I imagine the first value in the tensor is the probability that the speech will continue and the second value is the probability that the speech is finished.
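
If that interpretation is right, I'd use it roughly like this, continuing straight from the snippet above (the 0.85 threshold is just an arbitrary number I picked for illustration, not something I found documented):

# Assuming index 0 = "speech will continue" and index 1 = "end of utterance".
EOU_THRESHOLD = 0.85  # arbitrary cut-off, for illustration only

continue_probability = probabilities[0, 0].item()
eou_probability = probabilities[0, 1].item()

if eou_probability >= EOU_THRESHOLD:
    print("Turn looks finished")
else:
    print("Speaker probably isn't done yet")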

However, I found that with this code the probability varies between runs for the same input, sometimes swinging all the way from 0 to 1.
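
I assume re-running the whole script is effectively the same as reloading the model each time, so this is roughly how the variation shows up for me (continuing from the snippet above, with the exact same inputs each run):

# Reload the classification model and repeat the identical forward pass
# a few times to see how much the probability moves around.
for run in range(3):
    model = AutoModelForSequenceClassification.from_pretrained(
        "livekit/turn-detector")
    with torch.no_grad():
        outputs = model(**inputs)
        probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    print(f"run {run}: eou probability = {probabilities[0, 1].item():.4f}")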
