---
library_name: transformers
tags: ['vision-text', 'CLIP', 'fine-tuning', 'RoBERTa', 'image-text-model', 'cosine-learning-rate']
---

# Model Card for CLIP-RoBERTa Fine-Tuned Model

This model is a fine-tuned vision-text dual encoder that pairs the CLIP vision model (`openai/clip-vit-base-patch32`) with a RoBERTa text model (`roberta-base`). It is fine-tuned for image-text matching, using mixed precision training, gradient accumulation, and a cosine learning rate schedule with restarts.

The model was fine-tuned for the [DermAi-Viz](https://github.com/parthasarathydNU/derm-ai-viz) project.

## Model Details

### Model Description

This model card describes a Vision-Text Dual Encoder model fine-tuned from the original `openai/clip-vit-base-patch32` and `roberta-base` models. The model is adapted for tasks that involve joint processing of images and textual descriptions, combining the image encoding capabilities of the CLIP model with the language understanding of RoBERTa.

- **Developed by:** Dhruv Parthasarathy
- **Model type:** Vision-Text Dual Encoder
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) and [roberta-base](https://huggingface.co/roberta-base)

### Model Sources

- **Repository:** [GitHub](https://github.com/parthasarathydNU/derm-ai-viz/tree/main)
- **Demo:** [More Information Needed]

## Uses

### Direct Use

This model can be used directly for image-text matching tasks, such as retrieving relevant images for a textual query or ranking candidate captions for an image.

### Downstream Use

The model can be further fine-tuned for specific tasks such as image classification with text guidance, visual question answering, or any other task that benefits from multi-modal inputs.

### Out-of-Scope Use

This model is not suitable for tasks unrelated to image and text processing.
It may also perform poorly on non-English text or on images outside the scope of the fine-tuning data.

## Bias, Risks, and Limitations

As with most large models, this model may inherit biases present in the pretraining datasets. Users should be cautious when deploying this model in sensitive applications, particularly where fairness and bias are of concern. Additionally, the model's performance might degrade on data that differs significantly from the training data, especially in terms of cultural, linguistic, or visual context.

### Recommendations

Users should validate the model's performance on their specific use case, looking in particular for biases in the model's predictions. Additional fine-tuning with a carefully curated dataset may be necessary to mitigate biases. It is also recommended to monitor the model's outputs regularly to ensure its predictions remain relevant and unbiased.

## How to Get Started with the Model

```python
from PIL import Image
from transformers import VisionTextDualEncoderModel, VisionTextDualEncoderProcessor

# Load the model and processor (replace the placeholder repository id)
model = VisionTextDualEncoderModel.from_pretrained("your-username/clip-roberta-finetuned")
processor = VisionTextDualEncoderProcessor.from_pretrained("your-username/clip-roberta-finetuned")

# The processor expects image objects (e.g. PIL images), not file paths
image = Image.open("path_to_image.jpg").convert("RGB")

# Example usage: score one caption against one image
inputs = processor(text=["a photo of a cat"], images=[image], return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
```

## Training Details

### Training Data

The model was fine-tuned on a custom dataset of paired image-text data, focusing on diverse skin tones and various diseases. The dataset includes high-resolution images of dermatological conditions along with descriptive captions. The training data was preprocessed before training: images were resized and normalized, and text inputs were tokenized.

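
As a rough illustration of the image normalization step mentioned above, the sketch below applies per-channel normalization to a single RGB pixel using the mean and standard deviation that CLIP image processors use by default. The helper name `normalize_pixel` is hypothetical. In practice the model's processor handles resizing, rescaling, and normalization, so this only shows the arithmetic and is not the project's actual preprocessing code.

```python
# Per-channel statistics used by default in CLIP image processors (RGB order).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb, mean=CLIP_MEAN, std=CLIP_STD):
    """Rescale an 8-bit RGB pixel to [0, 1], then standardize each channel.

    Hypothetical helper for illustration only.
    """
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std))


if __name__ == "__main__":
    # Normalize a mid-gray pixel and a black pixel.
    print(normalize_pixel((128, 128, 128)))
    print(normalize_pixel((0, 0, 0)))
```

The same formula is applied to every pixel of the resized image, so the vision encoder sees inputs on the scale it was pretrained with.
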
### Training Procedure

Fine-tuning was conducted in a GPU-accelerated environment with the following setup and hyperparameters:

#### Preprocessing

- **Images:** Resized to the appropriate dimensions and normalized with the standard mean and standard deviation used in CLIP models.
- **Text:** Tokenized using the `roberta-base` tokenizer.

#### Training Hyperparameters

- **Training regime:** Mixed precision (`fp16`) training to reduce memory use and computation.
- **Batch size:** 32 for both training and evaluation.
- **Learning rate:** 3e-5, with a cosine learning rate schedule with restarts.
- **Weight decay:** 0.01 to prevent overfitting.
- **Warmup steps:** 1000 steps to stabilize the learning process.
- **Number of epochs:** 1000.
- **Gradient accumulation steps:** 4, which simulates a larger effective training batch size (32 × 4 = 128 per device).
- **Evaluation strategy:** Performed at the end of each epoch.
- **Logging strategy:** Metrics logged at the end of each epoch.
- **Checkpointing:** The best model was saved based on evaluation loss, with a maximum of 3 checkpoints retained.

#### Speeds, Sizes, Times

- **Training time:** [More Information Needed]
- **Total epochs:** 1000
- **Checkpoint size:** [More Information Needed]

### Compute Resources

Fine-tuning was conducted on a system with the following specifications:

- **GPU:** NVIDIA V100
- **Memory Usage:** Monitored with Weights & Biases, optimized through gradient accumulation and mixed precision.
- **Training duration:** [More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The evaluation was conducted on a held-out test set, which included images and captions not seen during training. The test set was designed to be diverse, representing various skin tones and dermatological conditions.

#### Metrics

The primary evaluation metric was `eval_loss`.
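
For context on what `eval_loss` measures: dual encoders of this kind are usually trained with a symmetric contrastive loss, in which each image in a batch should score highest against its own caption and vice versa. The sketch below computes that loss in plain Python from a small similarity matrix. It assumes the reported loss is this CLIP-style objective, which the card does not state explicitly, and the function names are hypothetical.

```python
import math


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        # Numerically stable log-sum-exp over the row.
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(logits_per_text):
    """Average of the text-to-image and image-to-text cross-entropy losses.

    logits_per_text[i][j] is the scaled similarity of caption i and image j;
    matching pairs are assumed to sit on the diagonal.
    """
    logits_per_image = [list(col) for col in zip(*logits_per_text)]  # transpose
    text_loss = cross_entropy_diagonal(logits_per_text)
    image_loss = cross_entropy_diagonal(logits_per_image)
    return (text_loss + image_loss) / 2.0


if __name__ == "__main__":
    # Toy 3x3 batch: the diagonal (matched pairs) has the highest scores.
    well_aligned = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
    uninformative = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(symmetric_contrastive_loss(well_aligned))
    print(symmetric_contrastive_loss(uninformative))  # equals ln(3) for a 3x3 batch
```

Lower values mean matched pairs are scored above mismatched ones. A batch of N pairs with uninformative scores gives a loss of ln N, so the value depends on the evaluation batch size as well as on model quality.
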
Additionally, accuracy and recall metrics were used to assess the model's ability to correctly match images with their corresponding textual descriptions.

### Results

- **Best evaluation loss:** [More Information Needed]
- **Top-1 accuracy:** [More Information Needed]
- **Recall@k:** [More Information Needed]

#### Summary

The model showed robust performance on the test set, particularly when both the image and text inputs were well aligned with the training data. However, performance may vary on data that differs significantly from the training set.

## Environmental Impact

Carbon emissions were monitored using Weights & Biases, with the following estimates:

- **Hardware Type:** NVIDIA V100 GPU
- **Training Hours:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications

### Model Architecture and Objective

The model integrates CLIP's vision encoder (`clip-vit-base-patch32`) with RoBERTa's language model (`roberta-base`) in a dual encoder setup. The objective was to fine-tune this model to improve its performance on image-text matching tasks within the medical domain, focusing on dermatological images.

### Compute Infrastructure

#### Hardware

- **GPUs:** NVIDIA V100
- **Memory:** 16GB GPU memory, used efficiently with mixed precision training.

#### Software

- **Transformers version:** [More Information Needed]
- **PyTorch version:** [More Information Needed]

## Citation

**BibTeX:**

```bibtex
@misc{parthasarathy2024cliproberta,
  title={Fine-Tuned Vision-Text Dual Encoder Model for Image-Text Matching},
  author={Parthasarathy, Dhruv},
  year={2024},
  howpublished={\url{https://huggingface.co/your-username/clip-roberta-finetuned}},
}
```

## Model Card Authors

This model card was prepared by Dhruv Parthasarathy.

## Model Card Contact

For questions or issues with the model, please contact [parthasarathy.d@northeastern.edu](mailto:parthasarathy.d@northeastern.edu) or [linkedin.com/in/parthadhruv/](https://www.linkedin.com/in/parthadhruv/).