Trying to get bounding box confidence values for object detection
I am currently trying to produce the bounding boxes, confidence levels, and labels for a prediction on an image.
The code I am using is below.
import numpy as np
import torch
from PIL import Image

# Run generation with beam search and keep the per-step scores.
image = Image.open(image_path)
inputs = self.processor(text=self.prompt, images=image, return_tensors="pt")
generated_ids = self.model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=128,
    num_beams=2,
    do_sample=False,
    return_dict_in_generate=True,
    output_scores=True,
)

# Decode the best sequence and parse it into boxes and labels.
generated_text = self.processor.batch_decode(
    generated_ids.sequences, skip_special_tokens=False
)[0]
parsed_answer = self.processor.post_process_generation(
    generated_text, task=self.prompt, image_size=(image.width, image.height)
)

# Per-token scores along the selected beam.
transition_scores = self.model.compute_transition_scores(
    generated_ids.sequences,
    generated_ids.scores,
    generated_ids.beam_indices,
    normalize_logits=False,
)

# Trim the special tokens at both ends. The token slice starts one position
# later than the score slice, on the assumption that `sequences` carries one
# leading token that has no score.
bounding_box_tokens = generated_ids.sequences[0][4:-1].numpy()
bounding_box_scores = transition_scores[0][3:-1].numpy()

# Keep only positions whose token ID falls in the assumed location-token range.
bounding_box_indices = np.where(
    np.logical_and(bounding_box_tokens >= 50269, bounding_box_tokens <= 51268)
)
bounding_box_scores = bounding_box_scores[bounding_box_indices]

# Four location tokens per box: average the log scores, then exponentiate.
score_split_arrays = np.exp(
    np.mean(
        np.array_split(bounding_box_scores, len(bounding_box_scores) // 4),
        axis=1,
    )
)
return (
    torch.Tensor(parsed_answer[self.prompt]["bboxes"]),
    torch.Tensor(score_split_arrays),
    parsed_answer[self.prompt]["labels"],
)
As you can see, this is very similar to the example implementation. The key question is whether my approach to isolating tokens is correct. I trim the ends of the token and score arrays because those positions seem to belong to the tokens that mark the ends of the sequence. I then find the indices where the token ID falls in the range that I believe corresponds to location tokens, and use those same indices to select the scores. This assumes that each score sits at the same index as its token. Is that the case? More generally, does this approach actually do what I think it does?
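The alignment assumption in the question can be checked on toy data before trusting it on real model output. The sketch below is not the library's method: `box_confidences` and `prefix_len` are hypothetical names, the location-token range is copied from the snippet above, and it assumes the sequence carries `prefix_len` leading tokens with no score, so that the remaining tokens and the scores have the same length.

```python
import numpy as np

# Location-token ID range copied from the snippet above; treat it as an
# assumption about the tokenizer, not a documented constant.
LOC_TOKEN_MIN = 50269
LOC_TOKEN_MAX = 51268


def box_confidences(sequence, token_log_probs, prefix_len=1):
    """Return one confidence per box from per-token log-probabilities.

    sequence: 1-D token IDs, including `prefix_len` leading tokens without a score.
    token_log_probs: 1-D log-probabilities, one per remaining token.
    """
    sequence = np.asarray(sequence)
    token_log_probs = np.asarray(token_log_probs, dtype=float)

    # Drop the unscored prefix so tokens and scores share the same indices.
    generated = sequence[prefix_len:]
    if generated.shape != token_log_probs.shape:
        raise ValueError("tokens and scores are not aligned")

    # Select scores at the positions that hold location tokens.
    is_loc = (generated >= LOC_TOKEN_MIN) & (generated <= LOC_TOKEN_MAX)
    loc_scores = token_log_probs[is_loc]
    if loc_scores.size % 4 != 0:
        raise ValueError("expected four location tokens per box")

    # Mean log-probability per box, exponentiated (a geometric mean).
    return np.exp(loc_scores.reshape(-1, 4).mean(axis=1))
```

If the shape check fails on real output, the slices are off by at least one position, which is the failure mode the question is worried about.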
Hi @leoxiaobin, do you have an intuition here? My guess is that you can only get a confidence level per token, which would not directly be a confidence score for a specific bounding box. From your experience modeling Florence-2, would you have any suggestion for a good confidence score here?
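To make the per-token versus per-box distinction concrete, here is a small sketch of three generic ways to collapse four per-token probabilities into one box-level number. These are ordinary aggregation choices, not something Florence-2 is known to use, and `aggregate_box_scores` is a hypothetical helper name. It assumes every probability is strictly positive.

```python
import numpy as np


def aggregate_box_scores(token_probs):
    """Collapse per-token probabilities into per-box scores, four tokens per box."""
    probs = np.asarray(token_probs, dtype=float).reshape(-1, 4)
    return {
        # Joint probability of all four coordinate tokens.
        "product": probs.prod(axis=1),
        # Exponentiated mean log-probability, as in the original snippet.
        "geometric_mean": np.exp(np.log(probs).mean(axis=1)),
        # Pessimistic choice: the least certain coordinate decides.
        "minimum": probs.min(axis=1),
    }
```

The product and the minimum drop sharply when a single coordinate is uncertain, while the geometric mean softens that effect, so the choice depends on how strongly one weak coordinate should count against the box.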
Try the latest commit for confidence scores.
Hi @haipingwu, thanks for this update! Quick question: can the scores also be extracted from description_with_bboxes_or_polygons and phrase_grounding?
Is it possible to get a score for the CAPTION_TO_PHRASE_GROUNDING task?