Spaces:

jadechoghari
/

OmniParser


Having the coordinates be returned

by TotoB12 - opened Nov 5, 2024

Discussion

TotoB12

Nov 5, 2024

Would it be possible to additionally have the box coordinates returned with the Text Output? Thanks.

TotoB12

Nov 5, 2024

•

edited Nov 5, 2024

I apologize, I cannot figure out how to push to the branch I made, since this Space is in Dev mode.
Here is what I wanted to add:

Modified line 81

    return image, str(parsed_content_list), str(label_coordinates)

Added line 108

        with gr.Column():
            image_output_component = gr.Image(type='pil', label='Image Output')
            text_output_component = gr.Textbox(label='Parsed screen elements', placeholder='Text Output')
            coordinates_output_component = gr.Textbox(label='Coordinates', placeholder='Coordinates Output')  # <-- this one

Modified line 125 (previously 124)

        outputs=[image_output_component, text_output_component, coordinates_output_component]
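
For anyone who wants to consume the third output from code rather than read it in the textbox, `str(label_coordinates)` is awkward to parse back. Below is a minimal sketch of a JSON alternative. It assumes `label_coordinates` is a mapping from box label to four numbers, possibly stored as NumPy arrays or tensors; `coordinates_to_json` is a hypothetical helper name and is not part of app.py.

```python
import json


def coordinates_to_json(label_coordinates):
    """Serialize a {label: box} mapping to a JSON string.

    Hypothetical helper. Assumes each box is a sequence of four numbers.
    Values exposing .tolist() (e.g. NumPy arrays) are converted first,
    and every entry is cast to a plain Python float.
    """
    def to_plain(value):
        items = value.tolist() if hasattr(value, "tolist") else value
        return [float(x) for x in items]

    return json.dumps({str(k): to_plain(v) for k, v in label_coordinates.items()})


# Made-up example data, only to show the shape of the output.
example = {"0": (0.1, 0.2, 0.3, 0.4), "1": [0.5, 0.5, 0.75, 0.9]}
print(coordinates_to_json(example))
```

The return statement could then pass `coordinates_to_json(label_coordinates)` instead of `str(label_coordinates)`, so that a client can call `json.loads` on the textbox value.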

Many thanks

jadechoghari

Owner Nov 6, 2024

hello @TotoB12 just read this issue - thanks for taking time investigating!

Will this output the coordinates as well?

TotoB12

Nov 6, 2024

•

edited Nov 6, 2024

Hey!
Yes, this will display the usual text output (with the text and icon box numbers), with the coordinates in a separate output box.
I only got to test this on a modified CPU-only Space, but I am pretty sure this is all that is needed.
Here is what it would look like:

jadechoghari

Owner Nov 6, 2024

awesome! is that something the community wants?

TotoB12

Nov 6, 2024

The coordinates are one of the core features of this model. As per the current app.py at line 77:

    dino_labeled_img, label_coordinates, parsed_content_list = get_som_labeled_img(
        image_save_path,
        yolo_model,
        BOX_TRESHOLD=box_threshold,
        output_coord_in_ratio=True,
        ocr_bbox=ocr_bbox,
        draw_bbox_config=draw_bbox_config,
        caption_model_processor=caption_model_processor,
        ocr_text=text,
        iou_threshold=iou_threshold
    )

The coordinates are already being generated when the model is prompted, they are just not being shown.
On this Space, seeing the Text Output and labeled image is nice, but is useless for actual use in projects without the full data.
In the issues tab of the microsoft/OmniParser GitHub repository, we can see that the coordinates are definitely an indispensable asset when using the model.

zorba111

Nov 8, 2024

Would really appreciate it if you could return the coordinates like this (as seen in the screenshot: the center point of the x1,y1,x2,y2 coordinates of the bounding boxes). Combining this with the actual screen width and height, we can get the actual screen coordinates (the x,y center point of the UI element). This could save us a lot of time locating the actual UI element.
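
A rough sketch of that conversion, in case it helps. It assumes each box arrives as (x1, y1, x2, y2) in ratio form (values between 0 and 1, as `output_coord_in_ratio=True` suggests); `box_center_in_pixels` is a hypothetical helper name, and the screen size is just an example.

```python
def box_center_in_pixels(box, screen_width, screen_height):
    """Return the (x, y) pixel center of a ratio-based bounding box.

    Assumption: box is (x1, y1, x2, y2) with each value in [0, 1].
    """
    x1, y1, x2, y2 = box
    center_x = (x1 + x2) / 2 * screen_width
    center_y = (y1 + y2) / 2 * screen_height
    return round(center_x), round(center_y)


# Example with a made-up box on a 1920x1080 screen.
print(box_center_in_pixels((0.25, 0.5, 0.75, 1.0), 1920, 1080))  # (960, 810)
```

If the boxes turn out to be (x, y, width, height) instead, the center would be (x + width / 2, y + height / 2) before scaling, so it is worth checking the format first.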

My use case is to build a Chrome extension that can sort of control the browser, execute tasks, etc. (like Anthropic's computer use API). Right now I have to use CSS selectors to locate the right element; this process is error-prone and needs some additional steps too.

If you could give these screen coordinates in the API response, then I could just move the cursor to that position. It would save us lots of time and let me cache these coordinates too, so things could be cheaper and faster than Anthropic's computer use (again, I'm just ranting and could be wrong, please feel free to correct me).
If anyone is building something similar, please ping, I'll be happy to get involved.

jadechoghari

Owner Nov 8, 2024

@zorba111 awesome! would it be best to add @TotoB12's modified lines?

TotoB12

Nov 11, 2024

•

edited Nov 11, 2024

@zorba111 This is a great idea and would reduce the amount of work needed in our projects. I am actually building a very similar app to yours, at the computer level. @jadechoghari I think this would be a valuable addition to this Space. I already got Microsoft's demo to output the coordinates.
