Commit ef14d7a (1 parent: a24d8a4), committed by roman-bachmann
Init
- LICENSE +10 -0
- README.md +64 -0
- config.json +28 -0
- model.safetensors +3 -0
LICENSE
ADDED
@@ -0,0 +1,10 @@
+Sample Code License
+Version: 1.1
+
+IMPORTANT: This software is supplied to you by École Polytechnique Fédérale de Lausanne (“EPFL”) and Apple Inc. ("Apple") in consideration of your agreement to the following terms, and your use, installation, modification or redistribution of this software constitutes acceptance of these terms. If you do not agree with these terms, please do not use, install, modify or redistribute this software.
+
+In consideration of your agreement to abide by the following terms, and subject to these terms, EPFL and Apple (collectively, “Licensor”) grant you a personal, non-exclusive license, under Licensor’s copyrights in this original software (the " Software"), to use, reproduce, modify and redistribute the Software, with or without modifications, in source and/or binary forms for non-commercial use; provided that if you redistribute the Software in its entirety and without modifications, you must retain this notice and the following text and disclaimers in all such redistributions of the Software. Neither the name, trademarks, service marks or logos of Licensor may be used to endorse or promote products derived from the Software without specific prior written permission from Licensor. Except as expressly stated in this notice, no other rights or licenses, express or implied, are granted by Licensor herein, including but not limited to any patent rights that may be infringed by your derivative works or by other works in which the Software may be incorporated.
+
+The Software is provided by Licensor on an "AS IS" basis. LICENSOR MAKES NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, REGARDING THE SOFTWARE OR ITS USE AND OPERATION ALONE OR IN COMBINATION WITH YOUR PRODUCTS. IN NO EVENT SHALL LICENSOR BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) ARISING IN ANY WAY OUT OF THE USE, REPRODUCTION, MODIFICATION AND/OR DISTRIBUTION OF THE SOFTWARE, HOWEVER CAUSED AND WHETHER UNDER THEORY OF CONTRACT, TORT (INCLUDING NEGLIGENCE), STRICT LIABILITY OR OTHERWISE, EVEN IF LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+Copyright (C) 2024. All Rights Reserved.
README.md
ADDED
@@ -0,0 +1,64 @@
+---
+license: other
+license_name: sample-code-license
+license_link: LICENSE
+library_name: ml-4m
+---
+
+# 4M: Massively Multimodal Masked Modeling
+
+*A framework for training any-to-any multimodal foundation models. <br>Scalable. Open-sourced. Across tens of modalities and tasks.*
+
+[`Website`](https://4m.epfl.ch) | [`GitHub`](https://github.com/apple/ml-4m) | [`BibTeX`](#citation)
+
+Official implementation and pre-trained models for :
+
+[**4M: Massively Multimodal Masked Modeling**](https://arxiv.org/abs/2312.06647), NeurIPS 2023 (Spotlight) <br>
+*[David Mizrahi](https://dmizrahi.com/)\*, [Roman Bachmann](https://roman-bachmann.github.io/)\*, [Oğuzhan Fatih Kar](https://ofkar.github.io/), [Teresa Yeo](https://aserety.github.io/), [Mingfei Gao](https://fly6464.github.io/), [Afshin Dehghan](https://www.afshindehghan.com/), [Amir Zamir](https://vilab.epfl.ch/zamir/)*
+
+**4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities**, arXiv 2024 <br>
+*[Roman Bachmann](https://roman-bachmann.github.io/)\*, [Oğuzhan Fatih Kar](https://ofkar.github.io/)\*, [David Mizrahi](https://dmizrahi.com/)\*, [Ali Garjani](https://garjania.github.io/), [Mingfei Gao](https://fly6464.github.io/), [David Griffiths](https://www.dgriffiths.uk/), [Jiaming Hu](https://scholar.google.com/citations?user=vm3imKsAAAAJ&hl=en), [Afshin Dehghan](https://www.afshindehghan.com/), [Amir Zamir](https://vilab.epfl.ch/zamir/)*
+
+4M is a framework for training "any-to-any" foundation models, using tokenization and masking to scale to many diverse modalities.
+Models trained using 4M can perform a wide range of vision tasks, transfer well to unseen tasks and modalities, and are flexible and steerable multimodal generative models.
+We are releasing code and models for "4M: Massively Multimodal Masked Modeling" (here denoted 4M-7), as well as "4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities" (here denoted 4M-21).
+
+
+## Installation
+For install instructions, please see https://github.com/apple/ml-4m.
+
+
+## Usage
+
+The Canny and SAM edges tokenizer can be loaded from Hugging Face Hub as follows:
+```python
+from fourm.vq.vqvae import DiVAE
+tok_edge = DiVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_edge_8k_224-512')
+```
+
+Please see https://github.com/apple/ml-4m/blob/main/README_TOKENIZATION.md for more detailed instructions and https://github.com/apple/ml-4m for other tokenizer and 4M model checkpoints.
+
+Safetensors checkpoints are hosted under https://huggingface.co/EPFL-VILAB/4M.
+
+## Citation
+
+If you find this repository helpful, please consider citing our work:
+```
+@inproceedings{4m,
+    title={{4M}: Massively Multimodal Masked Modeling},
+    author={David Mizrahi and Roman Bachmann and O{\u{g}}uzhan Fatih Kar and Teresa Yeo and Mingfei Gao and Afshin Dehghan and Amir Zamir},
+    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
+    year={2023},
+}
+
+@article{4m21,
+    title={{4M-21}: An Any-to-Any Vision Model for Tens of Tasks and Modalities},
+    author={Roman Bachmann and O{\u{g}}uzhan Fatih Kar and David Mizrahi and Ali Garjani and Mingfei Gao and David Griffiths and Jiaming Hu and Afshin Dehghan and Amir Zamir},
+    journal={arXiv 2024},
+    year={2024},
+}
+```
+
+## License
+
+The model weights in this repository are released under the Sample Code license as found in the [LICENSE](LICENSE) file.
config.json
ADDED
@@ -0,0 +1,28 @@
+{
+  "beta_schedule": "shifted_cosine:0.5",
+  "clip_sample": false,
+  "codebook_size": 8192,
+  "conditioning": "concat",
+  "dec_type": "uvit_b_p4_f16",
+  "enc_type": "vit_b_enc",
+  "image_size": 512,
+  "image_size_dec": null,
+  "image_size_enc": 256,
+  "latent_dim": 32,
+  "n_channels": 1,
+  "n_labels": null,
+  "norm_codes": true,
+  "norm_latents": false,
+  "num_codebooks": 1,
+  "num_train_timesteps": 1000,
+  "patch_proj": true,
+  "patch_size": 16,
+  "post_mlp": true,
+  "prediction_type": "sample",
+  "quant_type": "lucid",
+  "scheduler": "ddim",
+  "sync_codebook": false,
+  "thresholding": true,
+  "undo_std": false,
+  "zero_terminal_snr": true
+}
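A few derived quantities may help when reading the `config.json` fields above. The following is a minimal sketch, not part of the commit, that assumes the tokenizer emits one token per `patch_size` × `patch_size` patch and that each token indexes a single codebook of `codebook_size` entries. Both are assumptions about how the fields are used; they are not stated in this diff. The helper names `token_grid` and `summarize` are hypothetical, and the resolutions 224 and 512 are taken from the `224-512` suffix of the repository name in the README.

```python
import json
import math

# Subset of the config.json fields added in this commit.
CONFIG_JSON = """
{
  "codebook_size": 8192,
  "image_size": 512,
  "image_size_enc": 256,
  "latent_dim": 32,
  "n_channels": 1,
  "num_codebooks": 1,
  "patch_size": 16
}
"""


def token_grid(resolution: int, patch_size: int) -> tuple:
    """Hypothetical helper: token grid for a square input.

    Assumes one token per patch_size x patch_size patch.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    side = resolution // patch_size
    return side, side


def summarize(config: dict, resolution: int) -> dict:
    """Hypothetical helper: token count and index width at a resolution."""
    height, width = token_grid(resolution, config["patch_size"])
    # Bits needed to index one codebook entry, times the number of codebooks.
    bits_per_token = math.log2(config["codebook_size"]) * config["num_codebooks"]
    return {
        "grid": (height, width),
        "num_tokens": height * width,
        "bits_per_token": bits_per_token,
    }


config = json.loads(CONFIG_JSON)
for resolution in (224, 512):
    print(resolution, summarize(config, resolution))
```

Under these assumptions, a 224 × 224 edge map maps to a 14 × 14 grid of token indices and a 512 × 512 edge map to a 32 × 32 grid, with each index drawn from 8192 codebook entries.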
model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b02bacf7163723650d72c6167155d05a14726c0c4a064f11b79fc4a35e5ccbe2
+size 872040584
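The three lines added for `model.safetensors` are a Git LFS pointer, not the weights themselves: they record the hash algorithm, the digest, and the byte size of the real file. As a rough sketch, not part of the commit, a downloaded checkpoint could be checked against such a pointer as follows. The function names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the local path in the usage note is a placeholder.

```python
import hashlib
import os

# The pointer text added in this commit.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:b02bacf7163723650d72c6167155d05a14726c0c4a064f11b79fc4a35e5ccbe2
size 872040584
"""


def parse_lfs_pointer(text: str) -> dict:
    """Hypothetical helper: split a Git LFS pointer into its fields."""
    # Each pointer line is "<key> <value>".
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Hypothetical helper: check a local file's size and digest."""
    # Cheap check first: the byte size must match before hashing.
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as handle:
        # Hash in chunks so a large checkpoint is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])
```

With a local copy of the file, `matches_pointer("model.safetensors", pointer)` would return `True` only if both the size (872040584 bytes) and the SHA-256 digest agree with the pointer.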