penfever committed · Commit 4d7dacd · verified · 1 Parent(s): 3df980c

Add files using upload-large-folder tool

README.md ADDED
@@ -0,0 +1,169 @@
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ tags:
6
+ - zero-shot-image-classification
7
+ - OpenCLIP
8
+ - clip
9
+ - biology
10
+ - biodiversity
11
+ - agronomy
12
+ - CV
13
+ - images
14
+ - animals
15
+ - species
16
+ - taxonomy
17
+ - rare species
18
+ - endangered species
19
+ - evolutionary biology
20
+ - multimodal
21
+ - knowledge-guided
22
+ datasets:
23
+ - ChihHsuan-Yang/Arboretum
24
+ - EOL
25
+ base_model:
26
+ - openai/clip-vit-base-patch16
27
+ - openai/clip-vit-large-patch14
28
+ pipeline_tag: zero-shot-image-classification
29
+ ---
30
+
31
+
32
+ # Model Card for ArborCLIP
33
+
34
+ <!-- Banner links -->
35
+ <div style="text-align:center;">
36
+ <a href="https://baskargroup.github.io/Arboretum/" target="_blank">
37
+ <img src="https://img.shields.io/badge/Project%20Page-Visit-blue" alt="Project Page" style="margin-right:10px;">
38
+ </a>
39
+ <a href="https://github.com/baskargroup/Arboretum" target="_blank">
40
+ <img src="https://img.shields.io/badge/GitHub-Visit-lightgrey" alt="GitHub" style="margin-right:10px;">
41
+ </a>
42
+ <a href="https://pypi.org/project/arbor-process/" target="_blank">
43
+ <img src="https://img.shields.io/badge/PyPI-arbor--process%200.1.0-orange" alt="PyPI arbor-process 0.1.0">
44
+ </a>
45
+ </div>
46
+
47
+
48
+ ARBORCLIP is a new suite of vision-language foundation models for biodiversity. These CLIP-style models were trained on [ARBORETUM-40M](https://baskargroup.github.io/Arboretum/), a large-scale dataset of 40 million images spanning 33K species of plants and animals, and are evaluated on zero-shot image classification tasks.
49
+
50
+ - **Model type:** Vision Transformer (ViT-B/16, ViT-L/14)
51
+ - **License:** MIT
52
+ - **Fine-tuned from model:** [OpenAI CLIP](https://github.com/mlfoundations/open_clip), [MetaCLIP](https://github.com/facebookresearch/MetaCLIP), [BioCLIP](https://github.com/Imageomics/BioCLIP)
53
+
54
+ These models were developed for the benefit of the AI community as an open-source product. Thus, we request that any derivative products also be open-source.
55
+
56
+
57
+ ### Model Description
58
+
59
+ ArborCLIP is based on OpenAI's [CLIP](https://openai.com/research/clip) model.
60
+ The models were trained on [ARBORETUM-40M](https://baskargroup.github.io/Arboretum/) for the following configurations:
61
+
62
+ - **ARBORCLIP-O:** A ViT-B/16 backbone initialized from the [OpenCLIP](https://github.com/mlfoundations/open_clip) checkpoint and trained for 40 epochs.
63
+ - **ARBORCLIP-B:** A ViT-B/16 backbone initialized from the [BioCLIP](https://github.com/Imageomics/BioCLIP) checkpoint and trained for 8 epochs.
64
+ - **ARBORCLIP-M:** A ViT-L/14 backbone initialized from the [MetaCLIP](https://github.com/facebookresearch/MetaCLIP) checkpoint and trained for 12 epochs.
65
+
66
+
67
+ To access the checkpoints of the above models, go to the `Files and versions` tab and download the weights. These weights can be used directly for zero-shot classification and fine-tuning (see the sketch after this list). The filenames correspond to the specific model weights:
68
+ - **ARBORCLIP-O:** `arborclip-vit-b-16-from-openai-epoch-40.pt`
69
+ - **ARBORCLIP-B:** `arborclip-vit-b-16-from-bioclip-epoch-8.pt`
70
+ - **ARBORCLIP-M:** `arborclip-vit-l-14-from-metaclip-epoch-12.pt`
71
+
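+ Below is a minimal zero-shot classification sketch using the `open_clip` library. The checkpoint path, image path, and candidate labels are placeholders, and it assumes the ViT-B/16 weights load into open_clip's `ViT-B-16` model definition (use `ViT-L-14` for ARBORCLIP-M).
+
+ ```python
+ import torch
+ import open_clip
+ from PIL import Image
+
+ # Load a downloaded ArborCLIP checkpoint into open_clip's ViT-B-16 definition.
+ # The checkpoint and image paths below are placeholders.
+ model, _, preprocess = open_clip.create_model_and_transforms(
+     "ViT-B-16", pretrained="arborclip-vit-b-16-from-openai-epoch-40.pt"
+ )
+ tokenizer = open_clip.get_tokenizer("ViT-B-16")
+ model.eval()
+
+ # Candidate labels are illustrative; scientific names tend to work best
+ # with models trained on specialist datasets (see the limitations section).
+ labels = ["Danaus plexippus", "Apis mellifera", "Quercus alba"]
+ text = tokenizer([f"a photo of {name}" for name in labels])
+ image = preprocess(Image.open("example.jpg")).unsqueeze(0)
+
+ with torch.no_grad():
+     image_features = model.encode_image(image)
+     text_features = model.encode_text(text)
+     image_features /= image_features.norm(dim=-1, keepdim=True)
+     text_features /= text_features.norm(dim=-1, keepdim=True)
+     probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
+
+ print(dict(zip(labels, probs.squeeze(0).tolist())))
+ ```
+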
72
+ ### Model Training
73
+ **See the [Model Training](https://github.com/baskargroup/Arboretum?tab=readme-ov-file#model-training) section of the [GitHub](https://github.com/baskargroup/Arboretum) repository for examples of how to use ArborCLIP models in zero-shot image classification tasks.**
74
+
75
+ We train three models using a modified version of the [BioCLIP / OpenCLIP](https://github.com/Imageomics/bioclip/tree/main/src/training) codebase. Each model is trained on Arboretum-40M on 2 nodes with 8xH100 GPUs on NYU's [Greene](https://sites.google.com/nyu.edu/nyu-hpc/hpc-systems/greene) high-performance computing cluster. We publicly release all of the code needed to reproduce our results on the [GitHub](https://github.com/baskargroup/Arboretum) page.
76
+
77
+ We optimize our hyperparameters with [Ray](https://docs.ray.io/en/latest/index.html) prior to training. Our standard training parameters are as follows:
78
+
79
+ ```
80
+ --dataset-type webdataset
81
+ --pretrained openai
82
+ --text_type random
83
+ --dataset-resampled
84
+ --warmup 5000
85
+ --batch-size 4096
86
+ --accum-freq 1
87
+ --epochs 40
88
+ --workers 8
89
+ --model ViT-B-16
90
+ --lr 0.0005
91
+ --wd 0.0004
92
+ --precision bf16
93
+ --beta1 0.98
94
+ --beta2 0.99
95
+ --eps 1.0e-6
96
+ --local-loss
97
+ --gather-with-grad
98
+ --ddp-static-graph
99
+ --grad-checkpointing
100
+ ```
101
+
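+ For reference, the following is a minimal sketch of how the optimizer-related flags above translate into PyTorch. It is illustrative only: the actual optimizer is constructed inside the OpenCLIP/BioCLIP training code, which additionally keeps norm and bias parameters out of weight decay and pairs this with the 5000-step warmup and bf16 precision listed above.
+
+ ```python
+ import torch
+ import open_clip
+
+ # Build the architecture named by --model (weights are not needed just to
+ # illustrate the optimizer settings).
+ model, _, _ = open_clip.create_model_and_transforms("ViT-B-16")
+
+ # AdamW configuration implied by --lr / --wd / --beta1 / --beta2 / --eps.
+ optimizer = torch.optim.AdamW(
+     model.parameters(),
+     lr=5e-4,             # --lr 0.0005
+     weight_decay=4e-4,   # --wd 0.0004
+     betas=(0.98, 0.99),  # --beta1 0.98 / --beta2 0.99
+     eps=1e-6,            # --eps 1.0e-6
+ )
+ ```
+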
102
+ For more extensive documentation of the training process and the significance of each hyperparameter, we recommend referencing the [OpenCLIP](https://github.com/mlfoundations/open_clip) and [BioCLIP](https://github.com/Imageomics/BioCLIP) documentation, respectively.
103
+
104
+ ### Model Validation
105
+
106
+ For validating the zero-shot accuracy of our trained models and comparing to other benchmarks, we use the [VLHub](https://github.com/penfever/vlhub) repository with some slight modifications.
107
+
108
+ #### Pre-Run
109
+
110
+ After cloning the [GitHub](https://github.com/baskargroup/Arboretum) repository and navigating to the `Arboretum/model_validation` directory, we recommend installing all the project requirements into a conda environment with `pip install -r requirements.txt`. Also, before executing a command in VLHub, please add `Arboretum/model_validation/src` to your `PYTHONPATH`:
111
+
112
+ ```bash
113
+ export PYTHONPATH="$PYTHONPATH:$PWD/src";
114
+ ```
115
+
116
+ #### Base Command
117
+
118
+ A basic Arboretum model evaluation command can be launched as follows. This example evaluates a CLIP-ResNet50 checkpoint, whose weights reside at the path passed via the `--resume` flag, on the ImageNet validation set and reports the results to Weights and Biases.
119
+
120
+ ```bash
121
+ python src/training/main.py --batch-size=32 --workers=8 --imagenet-val "/imagenet/val/" --model="resnet50" --zeroshot-frequency=1 --image-size=224 --resume "/PATH/TO/WEIGHTS.pth" --report-to wandb
122
+ ```
123
+
124
+ ### Training Dataset
125
+ - **Dataset Repository:** [Arboretum](https://github.com/baskargroup/Arboretum)
126
+ - **Dataset Paper:** Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity ([arXiv](https://arxiv.org/abs/2406.17720))
127
+ - **HF Dataset card:** [Arboretum](https://huggingface.co/datasets/ChihHsuan-Yang/Arboretum)
128
+
129
+
130
+ ### Model Limitations
131
+ All the `ArborCLIP` models were evaluated on the challenging [CONFOUNDING-SPECIES](https://arxiv.org/abs/2306.02507) benchmark; however, all of them performed at or below random chance. This could be an interesting avenue for follow-up work to further expand the models' capabilities.
132
+
133
+ In general, we found that models trained on web-scraped data performed better with common
134
+ names, whereas models trained on specialist datasets performed better when using scientific names.
135
+ Additionally, models trained on web-scraped data excel at classifying at the highest taxonomic
136
+ level (kingdom), while models begin to benefit from specialist datasets like [ARBORETUM-40M](https://baskargroup.github.io/Arboretum/) and
137
+ [Tree-of-Life-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) at the lower taxonomic levels (order and species). From a practical standpoint, `ArborCLIP` is highly accurate at the species level, and higher-level taxa can be deterministically derived from lower ones.
138
+
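+ As a small illustration of that last point, higher taxonomic ranks follow deterministically from a species-level prediction given a lineage lookup. The fragment below is hypothetical; in practice the mapping would come from a taxonomy database.
+
+ ```python
+ # Hypothetical lineage fragment used only for illustration.
+ SPECIES_TO_LINEAGE = {
+     "Danaus plexippus": {
+         "genus": "Danaus",
+         "family": "Nymphalidae",
+         "order": "Lepidoptera",
+         "kingdom": "Animalia",
+     },
+ }
+
+ def higher_taxa(predicted_species: str) -> dict:
+     """Derive higher-level taxa from a species-level prediction."""
+     return SPECIES_TO_LINEAGE[predicted_species]
+
+ print(higher_taxa("Danaus plexippus")["order"])  # Lepidoptera
+ ```
+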
139
+ Addressing these limitations will further enhance the applicability of models like `ArborCLIP` in real-world biodiversity monitoring tasks.
140
+
141
+ ### Acknowledgements
142
+ This work was supported by the AI Research Institutes program of the NSF and USDA-NIFA under the [AI Institute for Resilient Agriculture](https://aiira.iastate.edu/), Award No. 2021-67021-35329. It was also
143
+ partly supported by the NSF under CPS Frontier grant CNS-1954556. We also gratefully
144
+ acknowledge the support of NYU IT [High Performance Computing](https://www.nyu.edu/life/information-technology/research-computing-services/high-performance-computing.html) resources, services, and staff
145
+ expertise.
146
+
147
+ <!--BibTex citation -->
148
+ <section class="section" id="BibTeX">
149
+ <div class="container is-max-widescreen content">
150
+ <h2 class="title">Citation</h2>
151
+ If you find the models and datasets useful in your research, please consider citing our paper:
152
+ <pre><code>@misc{yang2024arboretumlargemultimodaldataset,
153
+ title={Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity},
154
+ author={Chih-Hsuan Yang and Benjamin Feuer and Zaki Jubery and Zi K. Deng and Andre Nakkab and
155
+ Md Zahid Hasan and Shivani Chiranjeevi and Kelly Marshall and Nirmal Baishnab and Asheesh K Singh and
156
+ Arti Singh and Soumik Sarkar and Nirav Merchant and Chinmay Hegde and Baskar Ganapathysubramanian},
157
+ year={2024},
158
+ eprint={2406.17720},
159
+ archivePrefix={arXiv},
160
+ primaryClass={cs.CV},
161
+ url={https://arxiv.org/abs/2406.17720},
162
+ }</code></pre>
163
+ </div>
164
+ </section>
165
+ <!--End BibTex citation -->
166
+
167
+ ---
168
+
169
+ For more details and access to the Arboretum dataset, please visit the [Project Page](https://baskargroup.github.io/Arboretum/).
open_clip_config.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "model_cfg": {
3
+ "embed_dim": 512,
4
+ "vision_cfg": {
5
+ "image_size": 224,
6
+ "layers": 12,
7
+ "width": 768,
8
+ "patch_size": 16
9
+ },
10
+ "text_cfg": {
11
+ "context_length": 77,
12
+ "vocab_size": 49408,
13
+ "width": 512,
14
+ "heads": 8,
15
+ "layers": 12
16
+ }
17
+ },
18
+ "preprocess_cfg": {
19
+ "mean": [
20
+ 0.48145466,
21
+ 0.4578275,
22
+ 0.40821073
23
+ ],
24
+ "std": [
25
+ 0.26862954,
26
+ 0.26130258,
27
+ 0.27577711
28
+ ]
29
+ }
30
+ }
open_clip_pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b38aaaba419b1d3a6e507ef61181e3b786c9678eaf1c87b52b34cdb48c6b9b87
3
+ size 1051423822
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|startoftext|>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|endoftext|>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": "<|endoftext|>",
17
+ "unk_token": {
18
+ "content": "<|endoftext|>",
19
+ "lstrip": false,
20
+ "normalized": true,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ }
24
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,33 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "bos_token": {
4
+ "__type": "AddedToken",
5
+ "content": "<|startoftext|>",
6
+ "lstrip": false,
7
+ "normalized": true,
8
+ "rstrip": false,
9
+ "single_word": false
10
+ },
11
+ "do_lower_case": true,
12
+ "eos_token": {
13
+ "__type": "AddedToken",
14
+ "content": "<|endoftext|>",
15
+ "lstrip": false,
16
+ "normalized": true,
17
+ "rstrip": false,
18
+ "single_word": false
19
+ },
20
+ "errors": "replace",
21
+ "model_max_length": 77,
22
+ "pad_token": "<|endoftext|>",
23
+ "special_tokens_map_file": "./special_tokens_map.json",
24
+ "tokenizer_class": "CLIPTokenizer",
25
+ "unk_token": {
26
+ "__type": "AddedToken",
27
+ "content": "<|endoftext|>",
28
+ "lstrip": false,
29
+ "normalized": true,
30
+ "rstrip": false,
31
+ "single_word": false
32
+ }
33
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff