---
license: mit
datasets:
- conceptual_12m
- HuggingFaceM4/COCO
- visual_genome
language:
- ja
- en
pipeline_tag: image-text-to-text
---

# bilingual-gpt-neox-4b-minigpt4

![rinna-icon](./rinna.png)

# Overview

This repository provides an English-Japanese bilingual multimodal conversational model similar to MiniGPT-4, built by combining a 3.8-billion-parameter GPT-NeoX model with BLIP-2.

The model is based on [`rinna/bilingual-gpt-neox-4b`](https://huggingface.co./rinna/bilingual-gpt-neox-4b) and [BLIP-2](https://huggingface.co./docs/transformers/main/model_doc/blip-2).

* **Model architecture**

    Similar to [BLIP-2](https://huggingface.co./docs/transformers/main/model_doc/blip-2) and [Vision-CAIR/MiniGPT-4](https://huggingface.co./Vision-CAIR/MiniGPT-4), the model consists of an LLM, a vision encoder with ViT and Q-Former, and a linear layer that connects the LLM and the vision encoder.

    [`rinna/bilingual-gpt-neox-4b`](https://huggingface.co./rinna/bilingual-gpt-neox-4b) (a 36-layer, 2816-hidden-size transformer-based language model) is used as the LLM instead of [Vicuna](https://github.com/lm-sys/FastChat), which is used in the original [Vision-CAIR/MiniGPT-4](https://huggingface.co./Vision-CAIR/MiniGPT-4).

* **Finetuning**

    The finetuning data is a subset of the following datasets.
    * English datasets
        * [Conceptual 12M (CC12M)](https://huggingface.co./datasets/conceptual_12m)
        * [COCO 2014](https://huggingface.co./datasets/HuggingFaceM4/COCO)
        * [Visual Genome](https://huggingface.co./datasets/visual_genome)
    * Japanese datasets
        * [STAIR-captions](https://github.com/STAIR-Lab-CIT/STAIR-captions)
        * [Japanese Visual Genome VQA dataset](https://github.com/yahoojapan/ja-vg-vqa)

    Based on the implementation of [Vision-CAIR/MiniGPT-4](https://huggingface.co./Vision-CAIR/MiniGPT-4), only the "first pretraining stage" described in the [MiniGPT-4 paper](https://arxiv.org/abs/2304.10592) was conducted, using the above datasets. The "second-stage finetuning" proposed in the paper, which uses an aligned image-text dataset created with ChatGPT, was NOT conducted.

* **Model Series**

    | Variant | Link |
    | :-- | :--|
    | Bilingual 4B MiniGPT4 | https://huggingface.co./rinna/bilingual-gpt-neox-4b-minigpt4 |
    | Bilingual 4B PPO | https://huggingface.co./rinna/bilingual-gpt-neox-4b-instruction-ppo |
    | Bilingual 4B SFT | https://huggingface.co./rinna/bilingual-gpt-neox-4b-instruction-sft |
    | Bilingual 4B 8K | https://huggingface.co./rinna/bilingual-gpt-neox-4b-8k |
    | Bilingual 4B | https://huggingface.co./rinna/bilingual-gpt-neox-4b |
    | Japanese 3.6B PPO | https://huggingface.co./rinna/japanese-gpt-neox-3.6b-instruction-ppo |
    | Japanese 3.6B SFT-v2 | https://huggingface.co./rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 |
    | Japanese 3.6B SFT | https://huggingface.co./rinna/japanese-gpt-neox-3.6b-instruction-sft |
    | Japanese 3.6B | https://huggingface.co./rinna/japanese-gpt-neox-3.6b |

* **Contributors**

    [Koh Mitsuda](https://huggingface.co./mitsu-koh), [Tianyu Zhao](https://huggingface.co./tianyuz), and [Kei Sawada](https://huggingface.co./keisawada)

---

# I/O Format

A special format has been adopted to construct inputs.
* An input prompt is formatted as a conversation between `ユーザー` and `システム`.
* Each input utterance consists of (1) its speaker (`"ユーザー"` or `"システム"`), (2) a colon (`":"`), (3) a space (`" "`), and (4) the utterance text (e.g. `"猫はどんな体勢をしていますか?"`).
* An utterance that includes an image is formatted as (1) its speaker (`"ユーザー"`), (2) a colon (`":"`), (3) a space (`" "`), (4) a placeholder of the image (`""`), (5) another space (`" "`), and (6) the utterance text (e.g. `"What can you see?"`).
    * The placeholder (``) is automatically replaced with the embedding of an input image in the function `get_context_emb`.
* The input prompt should end with `"システム: "` to prompt the model to generate a response.
* All the utterances in the input prompt should be separated by a newline `\n`.

The following example constructs an input from a conversation.
~~~python
prompt = [
    {
        "speaker": "ユーザー",
        "text": " What can you see?"
    },
    {
        "speaker": "システム",
        "text": "a cat on a table with a laptop"
    },
    {
        "speaker": "ユーザー",
        "text": "猫はどんな体勢をしていますか?"
    },
]
# Format each utterance as "<speaker>: <text>".
prompt = [
    f"{uttr['speaker']}: {uttr['text']}"
    for uttr in prompt
]
# Separate utterances with newlines and end with the system cue.
prompt = "\n".join(prompt)
prompt = (
    prompt
    + "\n"
    + "システム: "
)
print(prompt)
"""
ユーザー:  What can you see?
システム: a cat on a table with a laptop
ユーザー: 猫はどんな体勢をしていますか?
システム: 
"""
~~~

---

# How to use the model

**1. Download dependencies**

* The BLIP-2 implementation included in MiniGPT-4 is used for inference.
* `customized_mini_gpt4.py` is a script that replaces the LLaMA-architecture LLM with a GPT-NeoX one.
* `checkpoint.pth` contains the finetuned weights of the linear layer (file size: 177 MB).

```bash
git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd ./MiniGPT-4
git checkout 22d8888 # latest version as of July 31, 2023.
wget https://huggingface.co./rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/customized_mini_gpt4.py
wget https://huggingface.co./rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/checkpoint.pth
```

**2. Inference**

Please run the inference script from the `MiniGPT-4` directory.
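The inference script reads a `prompt` string that must already follow the I/O format described above. As a minimal sketch, the formatting rules can be wrapped in a small helper. The name `build_prompt` and its validation checks are assumptions made for illustration and are not part of this repository or MiniGPT-4. Rejecting embedded newlines is an extra safeguard assumed here, not a documented requirement. Any image placeholder is expected to already be part of the utterance text passed in.

```python
# Speaker labels taken from the I/O format described above.
USER = "ユーザー"
SYSTEM = "システム"


def build_prompt(turns):
    """Format (speaker, text) pairs into a prompt that ends with "システム: ".

    Hypothetical helper: it only applies the formatting rules and does not
    insert an image placeholder on its own.
    """
    lines = []
    for speaker, text in turns:
        if speaker not in (USER, SYSTEM):
            raise ValueError(f"unknown speaker: {speaker!r}")
        # Assumed safeguard: a newline inside an utterance would be
        # indistinguishable from the separator between utterances.
        if "\n" in text:
            raise ValueError("utterance text must not contain a newline")
        lines.append(f"{speaker}: {text}")
    # The trailing system cue asks the model to produce the next response.
    lines.append(f"{SYSTEM}: ")
    return "\n".join(lines)


# Text-only usage; the result has one line per utterance plus the system cue.
example = build_prompt(
    [
        (USER, "猫はどんな体勢をしていますか?"),
    ]
)
print(example)
```

Keeping the checks in one function means a malformed conversation fails early with a clear error instead of producing a prompt the model silently misreads.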
~~~~python
import torch
import requests
from PIL import Image
from minigpt4.processors.blip_processors import Blip2ImageEvalProcessor
from customized_mini_gpt4 import CustomizedMiniGPT4

ckpt_path = "./checkpoint.pth"

# Build the model with the bilingual GPT-NeoX LLM and get its tokenizer.
model = CustomizedMiniGPT4(gpt_neox_model="rinna/bilingual-gpt-neox-4b")
tokenizer = model.gpt_neox_tokenizer

if torch.cuda.is_available():
    model = model.to("cuda")

# Load the finetuned weights on top of the pretrained components.
if ckpt_path is not None:
    print("Load BLIP2-LLM Checkpoint: {}".format(ckpt_path))
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt['model'], strict=False)

vis_processor = Blip2ImageEvalProcessor()

# Download the sample image and encode it into an image embedding.
image_url = "https://huggingface.co./rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/sample.jpg"
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
image = vis_processor(raw_image).unsqueeze(0).to(model.device)
image_emb = model.encode_img(image)

# `prompt` is the string constructed in the "I/O Format" section.
embs = model.get_context_emb(prompt, [image_emb])

output_ids = model.gpt_neox_model.generate(
    inputs_embeds=embs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,
    top_p=0.85,
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

output = tokenizer.decode(output_ids.tolist()[0], skip_special_tokens=True)
print(output)
"""横になっています。"""
~~~~

---

# How to cite
```bibtex
@misc{rinna-bilingual-gpt-neox-4b-minigpt4,
    title = {rinna/bilingual-gpt-neox-4b-minigpt4},
    author = {Mitsuda, Koh and Zhao, Tianyu and Sawada, Kei},
    url = {https://huggingface.co./rinna/bilingual-gpt-neox-4b-minigpt4}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}
```

---

# Acknowledgement

* [Vision-CAIR/MiniGPT-4](https://huggingface.co./Vision-CAIR/MiniGPT-4)
* [BLIP-2](https://huggingface.co./docs/transformers/main/model_doc/blip-2)
* [Lavis](https://github.com/salesforce/LAVIS)

# License

[The MIT license](https://opensource.org/licenses/MIT)