Upload 5 files
- README.md +177 -0
- checkpoint.pth +3 -0
- customized_mini_gpt4.py +149 -0
- rinna.png +0 -0
- sample.jpg +0 -0
README.md
CHANGED
@@ -1,3 +1,180 @@
---
license: mit
datasets:
- conceptual_12m
- HuggingFaceM4/COCO
- visual_genome
language:
- ja
- en
---

# bilingual-gpt-neox-4b-minigpt4

![rinna-icon](./rinna.png)

# Overview
This repository provides an English-Japanese bilingual multimodal conversational model in the style of MiniGPT-4, built by combining a 3.8-billion-parameter GPT-NeoX language model with BLIP-2.

The model is based on [`rinna/bilingual-gpt-neox-4b`](https://huggingface.co/rinna/bilingual-gpt-neox-4b) and [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2).

* **Model architecture**

    Similar to [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2) and [Vision-CAIR/MiniGPT-4](https://huggingface.co/Vision-CAIR/MiniGPT-4), the model consists of an LLM, a vision encoder (ViT plus Q-Former), and a linear layer that connects the vision encoder to the LLM. A shape-level sketch of this connection is given after this list.

    [`rinna/bilingual-gpt-neox-4b`](https://huggingface.co/rinna/bilingual-gpt-neox-4b) (a 36-layer, 2816-hidden-size transformer-based language model) is used as the LLM instead of [Vicuna](https://github.com/lm-sys/FastChat), which is used in the original [Vision-CAIR/MiniGPT-4](https://huggingface.co/Vision-CAIR/MiniGPT-4).

* **Finetuning**

    The finetuning data is a subset of the following datasets.
    * English datasets
        * [Conceptual 12M (CC12M)](https://huggingface.co/datasets/conceptual_12m)
        * [COCO 2014](https://huggingface.co/datasets/HuggingFaceM4/COCO)
        * [Visual Genome](https://huggingface.co/datasets/visual_genome)
    * Japanese datasets
        * [STAIR-captions](https://github.com/STAIR-Lab-CIT/STAIR-captions)
        * [Japanese Visual Genome VQA dataset](https://github.com/yahoojapan/ja-vg-vqa)

    Following the implementation of [Vision-CAIR/MiniGPT-4](https://huggingface.co/Vision-CAIR/MiniGPT-4), only the "first pretraining stage" described in the [MiniGPT-4 paper](https://arxiv.org/abs/2304.10592) was conducted on the above datasets; the "second-stage finetuning" proposed in the paper, which uses an aligned image-text dataset created with ChatGPT, was NOT conducted.

* **Model Series**

    | Variant | Link |
    | :-- | :-- |
    | Bilingual 4B MiniGPT4 | https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4 |
    | Bilingual 4B SFT | https://huggingface.co/rinna/bilingual-gpt-neox-4b-instruction-sft |
    | Bilingual 4B 8K | https://huggingface.co/rinna/bilingual-gpt-neox-4b-8k |
    | Bilingual 4B | https://huggingface.co/rinna/bilingual-gpt-neox-4b |
    | Japanese 3.6B PPO | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-ppo |
    | Japanese 3.6B SFT-v2 | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 |
    | Japanese 3.6B SFT | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft |
    | Japanese 3.6B | https://huggingface.co/rinna/japanese-gpt-neox-3.6b |

* **Authors**

    [Koh Mitsuda](https://huggingface.co/mitsu-koh), [Tianyu Zhao](https://huggingface.co/tianyuz), and [Kei Sawada](https://huggingface.co/keisawada)

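As referenced under **Model architecture** above, the following is a minimal, shape-level sketch of how the linear layer bridges the Q-Former and the LLM. It is an illustration only: the Q-Former width of 768 is an assumption (the actual code reads `self.Qformer.config.hidden_size`), while the 32 query tokens and the 2816 LLM hidden size come from this repository.

~~~python
import torch
import torch.nn as nn

QFORMER_HIDDEN = 768   # assumed Q-Former width; the real model reads it from its config
LLM_HIDDEN = 2816      # hidden size of rinna/bilingual-gpt-neox-4b
NUM_QUERY_TOKENS = 32  # number of Q-Former query tokens

# A single linear projection maps Q-Former query outputs into the LLM embedding space.
proj = nn.Linear(QFORMER_HIDDEN, LLM_HIDDEN)

dummy_query_output = torch.randn(1, NUM_QUERY_TOKENS, QFORMER_HIDDEN)  # dummy Q-Former output
image_emb = proj(dummy_query_output)
print(image_emb.shape)  # torch.Size([1, 32, 2816])
~~~

At inference time, `encode_img` produces a tensor of this shape, which `get_context_emb` splices into the prompt embeddings at the `<ImageHere>` placeholder.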

---

# I/O Format
A special format is used to construct input prompts.
* An input prompt is formatted as a conversation between `ユーザー` and `システム`.
* Each input utterance consists of (1) its speaker (`"ユーザー"` or `"システム"`), (2) a colon (`":"`), (3) a whitespace (`" "`), and (4) utterance text (e.g. `"猫はどんな体勢をしていますか?"`).
* An utterance including an image is formatted as (1) its speaker (`"ユーザー"`), (2) a colon (`":"`), (3) a whitespace (`" "`), (4) a placeholder for the image (`"<Img><ImageHere></Img>"`), (5) another whitespace (`" "`), and (6) utterance text (e.g. `"What can you see?"`).
* The placeholder (`<ImageHere>`) is automatically replaced with the embedding of the input image in the function `get_context_emb`.
* The input prompt should end with `"システム: "` to signal the model to generate a response.
* All utterances in the input prompt should be separated by a newline `\n`.

The following is an example of constructing an input prompt from a conversation.
~~~python
prompt = [
    {
        "speaker": "ユーザー",
        "text": "<Img><ImageHere></Img> What can you see?"
    },
    {
        "speaker": "システム",
        "text": "a cat on a table with a laptop"
    },
    {
        "speaker": "ユーザー",
        "text": "猫はどんな体勢をしていますか?"
    },
]
prompt = [
    f"{uttr['speaker']}: {uttr['text']}"
    for uttr in prompt
]
prompt = "\n".join(prompt)
prompt = (
    prompt
    + "\n"
    + "システム: "
)
print(prompt)
"""
ユーザー: <Img><ImageHere></Img> What can you see?
システム: a cat on a table with a laptop
ユーザー: 猫はどんな体勢をしていますか?
システム: 
"""
~~~
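
For multi-turn use, the formatting rules above can be wrapped in a small helper. This is a minimal sketch, not part of the repository; the function name `build_prompt` is ours.

~~~python
from typing import Dict, List

def build_prompt(conversation: List[Dict[str, str]]) -> str:
    """Format a list of {"speaker", "text"} utterances and append the generation cue."""
    lines = [f"{uttr['speaker']}: {uttr['text']}" for uttr in conversation]
    # End with "システム: " so the model knows to produce the next system response.
    return "\n".join(lines) + "\n" + "システム: "

prompt = build_prompt([
    {"speaker": "ユーザー", "text": "<Img><ImageHere></Img> What can you see?"},
])
~~~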

---

# How to use the model

**1. Download dependencies**

* The BLIP-2 implementation included in MiniGPT-4 is used for inference.
* `customized_mini_gpt4.py` is a script that replaces the LLM, swapping the original LLaMA architecture for GPT-NeoX.
* `checkpoint.pth` contains the finetuned weights of the linear layer (file size: 177 MB).

```bash
git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd ./MiniGPT-4
git checkout 22d8888  # latest version as of July 31, 2023
wget https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/customized_mini_gpt4.py
wget https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/checkpoint.pth
```

**2. Inference**

Please run the following script in the `MiniGPT-4` directory.

~~~python
import torch
import requests
from PIL import Image
from minigpt4.processors.blip_processors import Blip2ImageEvalProcessor
from customized_mini_gpt4 import CustomizedMiniGPT4

ckpt_path = "./checkpoint.pth"

model = CustomizedMiniGPT4(gpt_neox_model="rinna/bilingual-gpt-neox-4b")
tokenizer = model.gpt_neox_tokenizer

if torch.cuda.is_available():
    model = model.to("cuda")

if ckpt_path is not None:
    print("Load BLIP2-LLM Checkpoint: {}".format(ckpt_path))
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt['model'], strict=False)

vis_processor = Blip2ImageEvalProcessor()

image_url = "https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/sample.jpg"
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
image = vis_processor(raw_image).unsqueeze(0).to(model.device)
image_emb = model.encode_img(image)

# `prompt` is the string constructed in the "I/O Format" section above.
embs = model.get_context_emb(prompt, [image_emb])

output_ids = model.gpt_neox_model.generate(
    inputs_embeds=embs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,
    top_p=0.85,
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

output = tokenizer.decode(output_ids.tolist()[0], skip_special_tokens=True)
print(output)
"""横になっています。"""
~~~
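
To continue the conversation for another turn, the generated answer can be appended to the prompt in the same format and the context embeddings rebuilt. This is a sketch following the I/O format above, not an official recipe from the repository.

~~~python
# Append the model's answer and a follow-up question, then end with "システム: " again.
prompt = prompt + output + "\n" + "ユーザー: What else can you see in the picture?" + "\n" + "システム: "
embs = model.get_context_emb(prompt, [image_emb])
output_ids = model.gpt_neox_model.generate(
    inputs_embeds=embs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,
    top_p=0.85,
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(output_ids.tolist()[0], skip_special_tokens=True))
~~~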

---

# Acknowledgement

* [Vision-CAIR/MiniGPT-4](https://huggingface.co/Vision-CAIR/MiniGPT-4)
* [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)
* [LAVIS](https://github.com/salesforce/LAVIS)

# License
[The MIT license](https://opensource.org/licenses/MIT)
checkpoint.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:170633d5fd203d8b5b4d6d5ca3e3ce5bc8bb6cf66671ee96c0a6f4a1e38197e6
size 177115114
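
The LFS pointer above references the ~177 MB finetuned checkpoint. As a quick sanity check, a sketch like the following (not part of the repository, and assuming `ckpt["model"]` is a standard state dict, as the loading code in the README suggests) lists which parameters it contains:

~~~python
import torch

# Load the checkpoint on CPU and list the stored parameter names and shapes.
ckpt = torch.load("checkpoint.pth", map_location="cpu")
for name, tensor in ckpt["model"].items():
    print(name, tuple(tensor.shape))
~~~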
customized_mini_gpt4.py
ADDED
@@ -0,0 +1,149 @@
import torch
import torch.nn as nn

from minigpt4.models.mini_gpt4 import MiniGPT4
from minigpt4.models.blip2 import Blip2Base, disabled_train

from transformers.models.gpt_neox import GPTNeoXForCausalLM
from transformers import AutoTokenizer


# GPT-NeoX causal LM whose generate() can consume `inputs_embeds` on the first step,
# so that image embeddings spliced into the prompt can be fed directly.
class CustomizedGPTNeoXForCausalLM(GPTNeoXForCausalLM):
    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs):
        input_shape = input_ids.shape

        # cut decoder_input_ids if past is used
        if past_key_values and past_key_values[0] is not None:
            input_ids = input_ids[:, -1:]

        position_ids = kwargs.get("position_ids", None)
        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -1].unsqueeze(-1)

        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly
        if attention_mask is None:
            attention_mask = input_ids.new_ones(input_shape)

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "attention_mask": attention_mask,
                "position_ids": position_ids,
                "past_key_values": past_key_values,
            }
        )
        return model_inputs


class CustomizedMiniGPT4(Blip2Base):
    """
    BLIP2 GPT-NeoX model.
    """
    def __init__(
        self,
        gpt_neox_model="rinna/bilingual-gpt-neox-4b",
        vit_model="eva_clip_g",
        q_former_model="https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth",
        img_size=224,
        drop_path_rate=0,
        use_grad_checkpoint=False,
        vit_precision="fp16",
        freeze_vit=True,
        freeze_qformer=True,
        num_query_token=32,
        low_resource=False,  # use 8 bit and put vit in cpu
        device_8bit=0,  # the device of 8bit model should be set when loading and cannot be changed anymore.
    ):
        super().__init__()

        self.tokenizer = self.init_tokenizer()
        self.low_resource = low_resource

        print('Loading VIT', flush=True)
        self.visual_encoder, self.ln_vision = self.init_vision_encoder(
            vit_model, img_size, drop_path_rate, use_grad_checkpoint, vit_precision
        )
        if freeze_vit:
            for name, param in self.visual_encoder.named_parameters():
                param.requires_grad = False
            self.visual_encoder = self.visual_encoder.eval()
            self.visual_encoder.train = disabled_train
            for name, param in self.ln_vision.named_parameters():
                param.requires_grad = False
            self.ln_vision = self.ln_vision.eval()
            self.ln_vision.train = disabled_train
            print("freeze vision encoder")
        print('Loading VIT Done')

        print('Loading Q-Former', flush=True)
        self.Qformer, self.query_tokens = self.init_Qformer(
            num_query_token, self.visual_encoder.num_features
        )
        self.Qformer.cls = None
        self.Qformer.bert.embeddings.word_embeddings = None
        self.Qformer.bert.embeddings.position_embeddings = None
        for layer in self.Qformer.bert.encoder.layer:
            layer.output = None
            layer.intermediate = None
        self.load_from_pretrained(url_or_filename=q_former_model)

        if freeze_qformer:
            for name, param in self.Qformer.named_parameters():
                param.requires_grad = False
            self.Qformer = self.Qformer.eval()
            self.Qformer.train = disabled_train
            self.query_tokens.requires_grad = False
            print("freeze Qformer")
        print('Loading Q-Former Done')

        print('Loading LLM', flush=True)
        self.gpt_neox_tokenizer = AutoTokenizer.from_pretrained(gpt_neox_model, use_fast=False)

        if self.low_resource:
            self.gpt_neox_model = CustomizedGPTNeoXForCausalLM.from_pretrained(
                gpt_neox_model,
                torch_dtype=torch.float16,
                load_in_8bit=True,
                device_map={'': device_8bit}
            )
        else:
            self.gpt_neox_model = CustomizedGPTNeoXForCausalLM.from_pretrained(
                gpt_neox_model,
                torch_dtype=torch.float16,
            )

        for name, param in self.gpt_neox_model.named_parameters():
            param.requires_grad = False
        print('Loading LLM Done')

        # Projection from the Q-Former hidden size to the GPT-NeoX hidden size.
        # The original MiniGPT-4 attribute name (llama_proj) is kept so that
        # MiniGPT4's methods and checkpoints remain compatible.
        self.llama_proj = nn.Linear(
            self.Qformer.config.hidden_size, self.gpt_neox_model.config.hidden_size
        )

    def vit_to_cpu(self):
        MiniGPT4.vit_to_cpu(self)

    def encode_img(self, image):
        inputs_gpt_neox, _ = MiniGPT4.encode_img(self, image)
        return inputs_gpt_neox

    def get_context_emb(self, prompt, img_list):
        # Interleave text-segment embeddings with image embeddings at each <ImageHere> placeholder.
        prompt_segs = prompt.split('<ImageHere>')
        assert len(prompt_segs) == len(img_list) + 1, "Unmatched numbers of image placeholders and images."
        seg_tokens = [
            self.gpt_neox_tokenizer(seg, return_tensors="pt", add_special_tokens=False).to(self.device).input_ids
            for i, seg in enumerate(prompt_segs)
        ]
        seg_embs = [self.gpt_neox_model.gpt_neox.embed_in(seg_t) for seg_t in seg_tokens]
        mixed_embs = [emb for pair in zip(seg_embs[:-1], img_list) for emb in pair] + [seg_embs[-1]]
        mixed_embs = torch.cat(mixed_embs, dim=1)
        return mixed_embs
rinna.png
ADDED
sample.jpg
ADDED