For Fake's Sake: a set of models for detecting generated and synthetic images

Many people on the internet have recently been tricked by fake images, such as Pope Francis wearing a coat or Donald Trump being arrested. To help combat this issue, we provide detectors for images generated by popular tools such as Midjourney and Stable Diffusion.

Model Details

Model Description

Developed by: Sumsub AI team
Model type: Image classification
License: CC-BY-SA-3.0
Types:
Finetuned from model: convnext_large_mlp.clip_laion2b_soup_ft_in12k_in1k_384

Demo

The demo page can be found here.

How to Get Started with the Model & Model Sources

Use the code below to get started with the model:

git lfs install
git clone https://huggingface.co/Sumsub/Sumsub-ffs-synthetic-2.0 sumsub-ffs-synthetic-v2

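import sys

from PIL import Image

# The cloned directory name contains hyphens, so it cannot be imported as a
# Python package. Add it to the module search path and import pipeline.py directly.
sys.path.insert(0, "sumsub-ffs-synthetic-v2")
from pipeline import PreTrainedPipeline

# Load the model files from the cloned repository.
pipe = PreTrainedPipeline("sumsub-ffs-synthetic-v2/")

# Classify one of the sample images shipped with the repository.
img = Image.open("sumsub-ffs-synthetic-v2/images/2.jpg")

result = pipe(img)
print(result)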

You may need these prerequisites installed:

pip install -r requirements.txt
pip install "git+https://github.com/rwightman/pytorch-image-models"
pip install "git+https://github.com/huggingface/huggingface_hub"
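
The structure of the printed `result` is not specified in this card. The following is a minimal sketch for reading it, assuming it follows the common Hugging Face image-classification convention of a list of `{"label": ..., "score": ...}` dicts. The helper names `to_score_map` and `top_label`, and the label strings in the example, are hypothetical; check the actual output of `pipe(img)` before relying on them.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def to_score_map(result: List[Prediction]) -> Dict[str, float]:
    """Turn [{"label": ..., "score": ...}, ...] into {label: score}.

    ASSUMPTION: the pipeline returns a list of label/score dicts.
    Adapt this function if the printed output has a different shape.
    """
    return {str(item["label"]): float(item["score"]) for item in result}


def top_label(result: List[Prediction]) -> str:
    """Return the label with the highest score."""
    scores = to_score_map(result)
    return max(scores, key=scores.get)


# Hypothetical output; the real label names may differ.
example = [
    {"label": "synthetic", "score": 0.91},
    {"label": "real", "score": 0.09},
]
print(top_label(example))
```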

Training Details

Training Data

The models were trained on the following datasets:

Real photos: MS COCO, VizWiz.
AI photos: Midjourney, Midjourney AI Art, Midjourney - Community Showcase, Midjourney, MIDJOURNEY, Midjourney, aiornot HuggingFace contest data, Stable Diffusion Wordnet Dataset.

Training Procedure

To improve performance, we applied data augmentations including rotation, cropping, Mixup, and CutMix. Each model was trained for 30 epochs with early stopping and a batch size of 32.
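
For readers unfamiliar with the two mixing augmentations named above, here is a minimal NumPy sketch of the general techniques. It is not the authors' training code: the `alpha` values, the `(batch, height, width, channels)` layout, and the function names are assumptions chosen for illustration.

```python
import numpy as np


def mixup(images, labels, alpha=0.2, rng=None):
    """Mixup: blend each image and its one-hot label with a shuffled partner."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)           # mixing weight in [0, 1]
    perm = rng.permutation(len(images))    # partner index for each sample
    mixed_x = lam * images + (1.0 - lam) * images[perm]
    mixed_y = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_x, mixed_y


def cutmix(images, labels, alpha=1.0, rng=None):
    """CutMix: paste a random rectangle from a shuffled partner into each image."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(images))
    _, h, w, _ = images.shape              # assumed layout: (batch, H, W, C)
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))
    cy, cx = int(rng.integers(h)), int(rng.integers(w))  # rectangle centre
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mixed_x = images.copy()
    mixed_x[:, y0:y1, x0:x1, :] = images[perm][:, y0:y1, x0:x1, :]
    # Recompute the label weight from the area actually pasted after clipping.
    lam = 1.0 - (y1 - y0) * (x1 - x0) / (h * w)
    mixed_y = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_x, mixed_y


# Toy batch: 4 random 8x8 RGB "images" with two classes (0 = real, 1 = generated).
rng = np.random.default_rng(0)
batch_x = rng.random((4, 8, 8, 3))
batch_y = np.eye(2)[[0, 1, 0, 1]]
mix_x, mix_y = mixup(batch_x, batch_y, rng=rng)
cut_x, cut_y = cutmix(batch_x, batch_y, rng=rng)
print(mix_x.shape, cut_x.shape)
```

In both cases the label becomes a soft target whose entries still sum to 1, which is why these augmentations are usually paired with a loss that accepts soft labels.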

Evaluation

For evaluation we used the following datasets:

AI photos:

DiffusionDB: a set of 2 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users.
Kaggle SD Faces: a set of 4k human face images generated using Stable Diffusion 1.4.
Stable Diffusion Wordnet Dataset: a set of 200k images generated by Stable Diffusion.
Kaggle Midjourney 2022-250k: a set of 250k images generated by Midjourney.
Kaggle Midjourney v5.1: a set of 400k images generated by Midjourney version 5.1.

Real photos:

MS COCO: a set of 120k real-world images.
VizWiz Visual Question Answering dataset, validation part: a set of 20k photos of the kind people typically keep on their mobile phones in daily life.

Metrics

Dataset	Accuracy
Kaggle SD Faces	0.984
DiffusionDB	0.920
Stable Diffusion Wordnet Dataset	0.950
MS COCO	0.953
Kaggle Midjourney 2022-250k	0.938
Kaggle Midjourney v5.1	0.971
VizWiz Visual Question Answering dataset validation part	0.998
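
Each evaluation set appears to contain a single class (only generated or only real images). Under that assumption, accuracy on a generated set equals the detection rate (recall), and accuracy on a real set equals the true-negative rate. The sketch below averages the table per class; the unweighted averaging is an illustrative choice, not a metric reported by the authors.

```python
from statistics import mean

# Accuracy values copied from the table above.
generated = {
    "Kaggle SD Faces": 0.984,
    "DiffusionDB": 0.920,
    "Stable Diffusion Wordnet Dataset": 0.950,
    "Kaggle Midjourney 2022-250k": 0.938,
    "Kaggle Midjourney v5.1": 0.971,
}
real = {
    "MS COCO": 0.953,
    "VizWiz VQA (validation)": 0.998,
}

# ASSUMPTION: every set is single-class, so per-set accuracy is a per-class rate.
detection_rate = mean(generated.values())    # share of generated images flagged
true_negative_rate = mean(real.values())     # share of real images passed
balanced_accuracy = (detection_rate + true_negative_rate) / 2

print(f"mean detection rate:     {detection_rate:.4f}")
print(f"mean true-negative rate: {true_negative_rate:.4f}")
print(f"balanced accuracy:       {balanced_accuracy:.4f}")
```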

Limitations

No detector can achieve 100% accuracy, so the model output should be treated only as an indication that an image may have been artificially generated, not as proof.
The models may misclassify real photos that are extremely vibrant and of exceptionally high quality. In such cases, rich colors and fine detail can lead the model to focus on visual features that do not indicate the true class.
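
One way to act on these limitations is to avoid a hard yes/no decision and route borderline scores to a person. This is a minimal sketch, assuming the model yields a score in [0, 1] for the "generated" class; the function name `triage` and the thresholds 0.2 and 0.8 are hypothetical and should be tuned on your own data.

```python
def triage(synthetic_score: float, low: float = 0.2, high: float = 0.8) -> str:
    """Map a 'generated' score to a cautious three-way verdict.

    ASSUMPTION: synthetic_score is the model's score for the generated class,
    in [0, 1]. The default thresholds are placeholders, not recommended values.
    """
    if not 0.0 <= synthetic_score <= 1.0:
        raise ValueError("synthetic_score must be in [0, 1]")
    if not low < high:
        raise ValueError("low must be smaller than high")
    if synthetic_score >= high:
        return "likely generated"
    if synthetic_score <= low:
        return "likely real"
    return "uncertain: needs human review"


for score in (0.05, 0.50, 0.93):
    print(score, "->", triage(score))
```

Widening the gap between `low` and `high` sends more images to review and reduces confident mistakes on inputs such as very vivid, high-quality real photos.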

Citation

If you find this useful, please cite as:

@misc{sumsubaiornot, 
    publisher = {Sumsub},
    url       = {https://huggingface.co/Sumsub/Sumsub-ffs-synthetic-2.0},
    year      = {2023},
    author    = {Savelyev, Alexander and Toropov, Alexey and Goldman-Kalaydin, Pavel and Samarin, Alexey},
    title     = {For Fake's Sake: a set of models for detecting deepfakes, generated images and synthetic images}
}

References

Stöckl, Andreas. (2022). Evaluating a Synthetic Image Dataset Generated with Stable Diffusion. 10.48550/arXiv.2211.01777.
Lin, Tsung-Yi & Maire, Michael & Belongie, Serge & Hays, James & Perona, Pietro & Ramanan, Deva & Dollár, Piotr & Zitnick, C. Lawrence. (2014). Microsoft COCO: Common Objects in Context.
Howard, Andrew & Zhu, Menglong & Chen, Bo & Kalenichenko, Dmitry & Wang, Weijun & Weyand, Tobias & Andreetto, Marco & Adam, Hartwig. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.
Liu, Zhuang & Mao, Hanzi & Wu, Chao-Yuan & Feichtenhofer, Christoph & Darrell, Trevor & Xie, Saining. (2022). A ConvNet for the 2020s.
Wang, Zijie & Montoya, Evan & Munechika, David & Yang, Haoyang & Hoover, Benjamin & Chau, Polo. (2022). DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models. 10.48550/arXiv.2210.14896.
Gurari, Danna & Li, Qing & Stangl, Abigale J. & Guo, Anhong & Lin, Chi & Grauman, Kristen & Luo, Jiebo & Bigham, Jeffrey P. (2018). VizWiz Grand Challenge: Answering Visual Questions from Blind People. CVPR 2018.
