xiangan committed (verified)
Commit 75b40c3 · 1 parent: dcd6928

Update README.md

Files changed (1)
  1. README.md +37 -0
README.md CHANGED
---
license: mit
---

## MLCD-ViT-bigG Model Card

MLCD-ViT-bigG is a vision transformer enhanced with 2D Rotary Position Embedding (RoPE2D), developed by DeepGlint AI. It delivers strong performance on document understanding and visual question answering tasks (see the benchmark excerpt below).
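The card names RoPE2D but does not show how it is applied inside the model. The sketch below only illustrates the general idea; the function names, the row/column split of the feature dimension, and the interleaved pair layout are assumptions for illustration, not DeepGlint's implementation. Each patch feature is rotated according to its 2D position in the patch grid.

```python
import torch


def rope_2d_angles(height: int, width: int, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Per-patch rotation angles for a (height x width) grid; returns (H*W, dim//2)."""
    assert dim % 4 == 0, "need dim divisible by 4: half for rows, half for columns, in pairs"
    freqs = 1.0 / (base ** (torch.arange(dim // 4, dtype=torch.float32) / (dim // 4)))
    row_angles = torch.outer(torch.arange(height, dtype=torch.float32), freqs)  # (H, dim//4)
    col_angles = torch.outer(torch.arange(width, dtype=torch.float32), freqs)   # (W, dim//4)
    # Broadcast each table over the full grid, then concatenate row- and column-driven halves.
    row_angles = row_angles[:, None, :].expand(height, width, -1)
    col_angles = col_angles[None, :, :].expand(height, width, -1)
    return torch.cat([row_angles, col_angles], dim=-1).reshape(height * width, dim // 2)


def apply_rope_2d(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate features x of shape (num_patches, dim) by the per-patch angles."""
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]            # interleaved (even, odd) pairs
    out = torch.stack([x_even * cos - x_odd * sin,        # standard 2D rotation per pair
                       x_even * sin + x_odd * cos], dim=-1)
    return out.reshape_as(x)


# Example: a 336px image with 14px patches gives a 24x24 grid; head dim 64 is an assumption.
angles = rope_2d_angles(24, 24, 64)
q = torch.randn(24 * 24, 64)
print(apply_rope_2d(q, angles).shape)  # torch.Size([576, 64])
```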

We adopted the official [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) framework and the official training dataset [LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) for evaluating the foundational visual models. The rows below are an excerpt of the resulting benchmark table (a data-loading sketch follows it).

| DFN5B (ViT-H-14-378px) | × | 64.36 | 70.87 | 38.59 | 473.00 | **48.00** |
| **MLCD (ViT-L-14-336px)** | × | 67.84 | 76.46 | 43.48 | 531.00 | 44.30 |
| **MLCD (ViT-bigG-14-336px)** | √ | **71.07** | **79.63** | **44.38** | **572.00** | 46.78 |
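The evaluation data is public. As a minimal sketch, assuming the Hugging Face `datasets` library and whatever split layout the dataset repository defines (and noting the full download is large), the training data can be pulled directly from the Hub:

```python
from datasets import load_dataset

# Download LLaVA-NeXT-Data from the Hugging Face Hub (this fetches a large image dataset).
dataset = load_dataset("lmms-lab/LLaVA-NeXT-Data")

# Inspect what was downloaded: split names, sizes, and the fields of one example.
print(dataset)
first_split = next(iter(dataset.values()))
print(first_split[0].keys())
```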

## Installation

```shell
pip install torch transformers
git clone https://github.com/deepglint/unicom
cd unicom/mlcd
```
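A quick sanity check before running the usage example. This snippet is not from the repository; it only assumes you are still inside `unicom/mlcd`, where `vit_rope2d_hf.py` lives, so the local module resolves:

```python
import importlib

# torch and transformers come from pip; vit_rope2d_hf is the local module in unicom/mlcd.
for name in ("torch", "transformers", "vit_rope2d_hf"):
    module = importlib.import_module(name)
    print(f"{name}: OK ({getattr(module, '__version__', 'local module')})")
```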

## Usage

```python
from vit_rope2d_hf import MLCDVisionModel
from transformers import AutoImageProcessor
from PIL import Image
import torch

# Load model and processor
model = MLCDVisionModel.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-336")
processor = AutoImageProcessor.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-336")

# Process single image
image = Image.open("document.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Get visual features
with torch.no_grad():
    outputs = model(**inputs)
    features = outputs.last_hidden_state

print(f"Extracted features shape: {features.shape}")
```
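A follow-up sketch, not from the model card: batching several images in one forward pass and mean-pooling the patch tokens into a single embedding per image. The file names are placeholders, and whether the first token is a class token that should be excluded from pooling depends on the checkpoint, so treat this as a rough global descriptor rather than an official recipe.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor

from vit_rope2d_hf import MLCDVisionModel

model = MLCDVisionModel.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-336")
processor = AutoImageProcessor.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-336")

# Batch several images in a single forward pass ("page1.jpg"/"page2.jpg" are placeholders).
images = [Image.open(p).convert("RGB") for p in ("page1.jpg", "page2.jpg")]
batch = processor(images=images, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # (batch, tokens, hidden_dim)

# Naive mean pooling over all tokens as a per-image descriptor.
embeddings = hidden.mean(dim=1)
print(embeddings.shape)                         # e.g. torch.Size([2, hidden_dim])
```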