Error when running the sample code
Hi! Thank you for your great work!
When I run the sample code, I get the following error:
Some weights of BeitModel were not initialized from the model checkpoint at cmarkea/dit-base-layout-detection and are newly initialized: ['beit.pooler.layernorm.bias', 'beit.pooler.layernorm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Do you have any idea about this? Thank you in advance.
Hi WYYexperiments, thank you.
This is not an error but a warning. To improve the model's performance, we trained it with the original cross-entropy loss combined with a cost function aimed at predicting the bounding boxes. The warning does not affect inference performance.
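Schematically, the combined objective looks something like the sketch below; the function name, the L1 box term, and the weighting factor are illustrative placeholders, not our exact training code:

import torch.nn.functional as F

def combined_loss(seg_logits, target_map, pred_boxes, target_boxes, bbox_weight=1.0):
    # Standard per-pixel cross-entropy on the segmentation logits:
    # seg_logits is (N, C, H, W), target_map is (N, H, W) with class ids.
    ce = F.cross_entropy(seg_logits, target_map)
    # Illustrative bounding-box cost, here a simple L1 regression term
    # between predicted and ground-truth box coordinates.
    bbox = F.l1_loss(pred_boxes, target_boxes)
    return ce + bbox_weight * bbox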
Hi Cyrile, thank you for the reply.
I get another error:
logits = outputs.logits
^^^^^^^^^^^^^^
AttributeError: 'BeitModelOutputWithPooling' object has no attribute 'logits'
The output has two tensors, 'pooler_output' and 'last_hidden_state', but no 'logits' attribute.
Can I get your help with this?
Yes, you are right, sorry.
We should not use AutoModel, but BeitForSemanticSegmentation.
I have modified the example accordingly, so it will now work (and the warning will no longer appear).
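For anyone hitting the same AttributeError, the corrected usage looks roughly like this (a minimal sketch; the image path is a placeholder):

from PIL import Image
from transformers import AutoImageProcessor, BeitForSemanticSegmentation

img_proc = AutoImageProcessor.from_pretrained("cmarkea/dit-base-layout-detection")
model = BeitForSemanticSegmentation.from_pretrained("cmarkea/dit-base-layout-detection")

img = Image.open("page.png").convert("RGB")  # placeholder path
output = model(**img_proc(img, return_tensors="pt"))
logits = output.logits  # the semantic-segmentation output does have logits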
I get this when I run your code to convert masks to bounding boxes:
bbx, lab = detect_bboxes(mm.numpy())
^^^^^^^^
ValueError: too many values to unpack (expected 2)
It seems the detected_blocks value does not include label information.
May I have the code for the visualization you put on the model card?
Hi, yes of course, here is an untested and non-debugged code snippet. Feel free to adapt it according to your needs.
from collections import OrderedDict

import matplotlib.pyplot as plt
import torch
from einops import rearrange
from PIL import Image
from torchvision.transforms.functional import pil_to_tensor
from torchvision.utils import draw_segmentation_masks
from transformers import AutoImageProcessor, BeitForSemanticSegmentation

# Same setup as in the corrected sample above.
img_proc = AutoImageProcessor.from_pretrained("cmarkea/dit-base-layout-detection")
model = BeitForSemanticSegmentation.from_pretrained("cmarkea/dit-base-layout-detection")
img = Image.open("page.png").convert("RGB")  # placeholder path
with torch.inference_mode():
    output = model(**img_proc(img, return_tensors="pt"))

# Colors for the 11 classes; class id 0 is the background.
map_color = OrderedDict(
    [("Caption", "red"),
     ("Footnote", "yellowgreen"),
     ("Formula", "skyblue"),
     ("List-item", "magenta"),
     ("Page-footer", "red"),
     ("Page-header", "darkorange"),
     ("Picture", "gold"),
     ("Section-header", "indigo"),
     ("Table", "sienna"),
     ("Text", "slategray"),
     ("Title", "teal")]
)

# Resize the predicted segmentation map back to the original image size.
segmentation = img_proc.post_process_semantic_segmentation(
    output, target_sizes=[img.size[::-1]]
)

img_tensor = pil_to_tensor(img)
colors, masks, labels = [], [], []
for ii, (label, color) in enumerate(map_color.items()):
    mask = segmentation[0] == (ii + 1)  # class ids start at 1
    if mask.sum() > 0:
        masks.append(mask)
        labels.append(label)
        colors.append(color)

# Overlay each detected class mask on the page with its color.
masks = torch.stack(masks)
drawn_seg = draw_segmentation_masks(img_tensor, masks, alpha=0.5, colors=colors)
im_seg = Image.fromarray(rearrange(drawn_seg, 'C H W -> H W C').numpy())
plt.imshow(im_seg)
plt.axis("off")
plt.show()
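Regarding the earlier unpacking error with detect_bboxes: a sketch of a variant that returns labels alongside the boxes could look like the following. detect_bboxes_with_labels is a hypothetical helper (not the exact code from the card) and assumes OpenCV is installed:

import cv2
import numpy as np

def detect_bboxes_with_labels(seg_map: np.ndarray):
    # seg_map: (H, W) integer array of class ids, as produced by
    # post_process_semantic_segmentation, with 0 as background.
    bboxes, labels = [], []
    for class_id in np.unique(seg_map):
        if class_id == 0:  # skip background
            continue
        # Binarize the current class and turn each outer contour into a box.
        contours, _ = cv2.findContours(
            (seg_map == class_id).astype(np.uint8),
            cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE,
        )
        for contour in contours:
            x, y, w, h = cv2.boundingRect(contour)
            bboxes.append([x, y, x + w, y + h])
            labels.append(int(class_id))
    return bboxes, labels

# Usage: bbx, lab = detect_bboxes_with_labels(segmentation[0].numpy())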
Hi Cyrile, thank you for your kind help. Your model's performance is impressive.
Can you share more information about the training you did?
I could only find the DiT base checkpoints; there are no object detection checkpoints from Microsoft available online.
It seems we need to build the model ourselves.
You will find information about the training here: https://huggingface.co./cmarkea/dit-base-layout-detection/discussions/1#66bf025c43a701a8376ebd1c
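If you do want to start from the base DiT checkpoint yourself, a sketch of the initialization could look like this (the label list follows the DocLayNet classes from the card; everything else about the fine-tuning setup is up to you):

from transformers import BeitForSemanticSegmentation

labels = ["Background", "Caption", "Footnote", "Formula", "List-item",
          "Page-footer", "Page-header", "Picture", "Section-header",
          "Table", "Text", "Title"]
model = BeitForSemanticSegmentation.from_pretrained(
    "microsoft/dit-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={name: i for i, name in enumerate(labels)},
)
# The segmentation head is newly initialized here and must be
# fine-tuned (e.g. on DocLayNet) before it gives useful predictions.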