SAM 2 released: New SOTA on segmentation, by combining synthetic data with human feedback
It's a model for object segmentation, for both images and videos:
- input = a text prompt, or a click on a specific object
- output = the model draws a mask around the object. In video segmentation, the mask should follow the object's movements (it is then called a masklet)
SAM 2 is 6x faster than the previous version, it now also works on video, and it beats the previous SOTA by a wide margin on both image and video segmentation tasks.
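To make the click-prompt workflow concrete, here is a minimal sketch of single-click image segmentation. It assumes the SAM 2 image predictor keeps the original SAM's set_image / predict interface; the module paths, config and checkpoint names below are my assumptions, so check them against the official repo.

```python
import numpy as np
from PIL import Image

# Assumed import paths from the facebookresearch/sam2 repo -- verify against its README.
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load the model (config/checkpoint names are assumptions; pick the variant you downloaded).
sam2_model = build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")
predictor = SAM2ImagePredictor(sam2_model)

# Embed the image once, then prompt it as many times as you like.
image = np.array(Image.open("street.jpg").convert("RGB"))
predictor.set_image(image)

# One positive click (x, y) on the object we want segmented.
point = np.array([[450, 320]])
label = np.array([1])  # 1 = foreground click, 0 = background click

masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,  # return several candidate masks with confidence scores
)
best_mask = masks[scores.argmax()]  # boolean array: the mask of the clicked object
```

For video, the same kind of click prompt is given on one frame and the model propagates the resulting mask through time, which is exactly the masklet idea above.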
How did they pull that off?
The main blocker for video segmentation was that data is really hard to collect: to build your training dataset, should you manually draw masks on every frame? That would be way too costly! As a result, existing video segmentation datasets have poor coverage: few examples, few masklets drawn.
Key idea: the researchers decided to use a segmentation model to help them collect the dataset.
But then it's a chicken-and-egg problem: you need the model to create the dataset, and you need the dataset to train the model.
To solve this, they build a data generation system that they scale up progressively over 3 successive manual annotation phases (a rough sketch of this loop follows below):
Step 1: Annotators use only SAM + manual editing tools on each frame → create 16k masklets across 1.4k videos
Step 2: Then train a first SAM 2, add it in the loop to temporally propagate masks across frames, and correct by re-drawing a mask manually when an error occurs → this gives a 5.1x speedup over data collection in phase 1! Collect 60k masklets
Step 3: Now SAM 2 is more powerful: it has the 'single click' prompting option, so annotators can use it with simple clicks to re-annotate data.
They even add a completely automatic step to generate 350k more masklets!
And in turn, the model's performance gradually increases.
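Here is how I picture that data engine as a loop, purely as an illustration: every helper in this sketch (human_draw_mask, human_click, human_verify_and_fix, sample_videos, retrain_sam2) is hypothetical and stands in for the paper's actual annotation tooling.

```python
# Hypothetical sketch of the SAM 2 data engine loop (not the authors' actual code).
# All helpers below are made up to illustrate the model-in-the-loop idea.

def run_annotation_phase(model, videos, clicks_only=False):
    """One phase: the current model proposes masklets, humans verify and fix them."""
    masklets = []
    for video in videos:
        if model is None:
            # Phase 1: fully manual -- draw/edit a mask on every frame.
            masklet = [human_draw_mask(frame) for frame in video]
        else:
            # Phases 2-3: prompt on the first frame, let the model propagate through time,
            # and only correct the frames where the propagated mask is wrong.
            prompt = human_click(video[0]) if clicks_only else human_draw_mask(video[0])
            masklet = [human_verify_and_fix(mask) for mask in model.propagate(video, prompt)]
        masklets.append(masklet)
    return masklets


dataset, model = [], None
for phase in range(3):
    dataset += run_annotation_phase(model, sample_videos(phase), clicks_only=(phase == 2))
    model = retrain_sam2(dataset)  # each phase yields a stronger model and faster annotation
```

Each pass through the loop makes annotation cheaper, which lets them collect more masklets, which makes the next model better.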
I find this a great example of combining synthetic data generation with human annotation.