SAM 2 released: New SOTA on segmentation, by combining synthetic data with human feedback
It's a model for object segmentation, for both images and videos:
- input = a text prompt, or a click on a specific object
- output = the model draws a mask around the object. In video segmentation, the mask should follow the object's movements (it is then called a masklet)
SAM 2 is 6x faster than the previous version, it now also works on video, and it beats the previous SOTA by a wide margin on both image and video segmentation tasks.
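To make the click-prompt workflow concrete, here is a minimal sketch of single-click image segmentation. It assumes the SAM 2 image predictor keeps the original SAM's set_image / predict interface; the module paths, config and checkpoint names below are my assumptions, so check them against the official repo.

```python
import numpy as np
from PIL import Image

# Assumed import paths from the facebookresearch/sam2 repo -- verify against its README.
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load the model (config/checkpoint names are assumptions; pick the variant you downloaded).
sam2_model = build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")
predictor = SAM2ImagePredictor(sam2_model)

# Embed the image once, then prompt it as many times as you like.
image = np.array(Image.open("street.jpg").convert("RGB"))
predictor.set_image(image)

# One positive click (x, y) on the object we want segmented.
point = np.array([[450, 320]])
label = np.array([1])  # 1 = foreground click, 0 = background click

masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,  # return several candidate masks with confidence scores
)
best_mask = masks[scores.argmax()]  # boolean array: the mask of the clicked object
```

For video, the same kind of click prompt is given on one frame and the model propagates the resulting mask through time, which is exactly the masklet idea above.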
How did they pull that off?
The main blocker for video segmentation was that data is really hard to collect: to build your training dataset, should you manually draw masks on every frame? That would be way too costly! As a result, existing video segmentation datasets have poor coverage: few examples, few masklets drawn.
Key idea: the researchers decided to use a segmentation model to help them collect the dataset.
But then it's a chicken-and-egg problem: you need the model to create the dataset, and you need the dataset to train the model.
To solve this, they build a data generation system that they scale up progressively over 3 successive manual annotation phases (a rough sketch of this loop follows below):
Step 1: Annotators use only SAM + manual editing tools on each frame → create 16k masklets across 1.4k videos
Step 2: Then train a first SAM 2, add it in the loop to temporally propagate masks across frames, and correct by re-drawing a mask manually when an error occurs → this gives a 5.1x speedup over data collection in phase 1! Collect 60k masklets
Step 3: Now SAM 2 is more powerful: it has the 'single click' prompting option, so annotators can use it with simple clicks to re-annotate data.
They even add a completely automatic step to generate 350k more masklets!
And in turn, the model's performance gradually increases.
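Here is how I picture that data engine as a loop, purely as an illustration: every helper in this sketch (human_draw_mask, human_click, human_verify_and_fix, sample_videos, retrain_sam2) is hypothetical and stands in for the paper's actual annotation tooling.

```python
# Hypothetical sketch of the SAM 2 data engine loop (not the authors' actual code).
# All helpers below are made up to illustrate the model-in-the-loop idea.

def run_annotation_phase(model, videos, clicks_only=False):
    """One phase: the current model proposes masklets, humans verify and fix them."""
    masklets = []
    for video in videos:
        if model is None:
            # Phase 1: fully manual -- draw/edit a mask on every frame.
            masklet = [human_draw_mask(frame) for frame in video]
        else:
            # Phases 2-3: prompt on the first frame, let the model propagate through time,
            # and only correct the frames where the propagated mask is wrong.
            prompt = human_click(video[0]) if clicks_only else human_draw_mask(video[0])
            masklet = [human_verify_and_fix(mask) for mask in model.propagate(video, prompt)]
        masklets.append(masklet)
    return masklets


dataset, model = [], None
for phase in range(3):
    dataset += run_annotation_phase(model, sample_videos(phase), clicks_only=(phase == 2))
    model = retrain_sam2(dataset)  # each phase yields a stronger model and faster annotation
```

Each pass through the loop makes annotation cheaper, which lets them collect more masklets, which makes the next model better.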
I find this a great example of combining synthetic data generation with human annotation.