---
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
datasets:
- mc4
- wikipedia
- EleutherAI/pile
- oscar-corpus/colossal-oscar-1.0
- cc100
language:
- ja
- en
tags:
- qwen
inference: false
---
# `rinna/nekomata-14b`
![rinna-icon](./rinna.png)
# Overview
We conduct continual pre-training of [qwen-14b](https://huggingface.co./Qwen/Qwen-14B) on **66B** tokens from a mixture of Japanese and English datasets. The continual pre-training significantly improves the model's performance on Japanese tasks. The model also inherits the following features from the original Qwen model:
* The inclusive Qwen vocabulary (vocab size > 150k) enables the model to process Japanese text much more efficiently than the previously released [youri series](https://huggingface.co./collections/rinna/youri-7b-654053610cb8e9d8e6289efc).
* The model supports a maximum sequence length of 8192.
The name `nekomata` comes from the Japanese word [`猫又/ねこまた/Nekomata`](https://ja.wikipedia.org/wiki/%E7%8C%AB%E5%8F%88), which is a kind of Japanese mythical creature ([`妖怪/ようかい/Youkai`](https://ja.wikipedia.org/wiki/%E5%A6%96%E6%80%AA)).
* **Library**
The model was trained using code based on [aws-neuron/neuronx-nemo-megatron](https://github.com/aws-neuron/neuronx-nemo-megatron/).
* **Model architecture**
A 40-layer, 5120-hidden-size transformer-based language model. Please refer to the [Qwen paper](https://arxiv.org/abs/2309.16609) for architecture details.
* **Continual pre-training**
The model was initialized with the [qwen-14b](https://huggingface.co./Qwen/Qwen-14B) model and continually trained on around **66B** tokens from a mixture of the following corpora:
- [Japanese CC-100](http://data.statmt.org/cc-100/ja.txt.xz)
- [Japanese C4](https://huggingface.co./datasets/mc4)
- [Japanese OSCAR](https://huggingface.co./datasets/oscar-corpus/colossal-oscar-1.0)
- [The Pile](https://huggingface.co./datasets/EleutherAI/pile)
- [Wikipedia](https://dumps.wikimedia.org/other/cirrussearch)
- rinna curated Japanese dataset
* **Training Infrastructure**
`nekomata-14b` was trained on 16 nodes of Amazon EC2 trn1.32xlarge instances, which are powered by AWS Trainium, a purpose-built ML accelerator chip. The pre-training job completed in approximately 7 days.
* **Authors**
- [Tianyu Zhao](https://huggingface.co./tianyuz)
- [Akio Kaga](https://huggingface.co./rakaga)
- [Kei Sawada](https://huggingface.co./keisawada)
---
# Benchmarking
Please refer to [rinna's LM benchmark page](https://rinnakk.github.io/research/benchmarks/lm/index.html).
---
# How to use the model
~~~~python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("rinna/nekomata-14b", trust_remote_code=True)
# Use GPU with bf16
# model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-14b", device_map="auto", trust_remote_code=True, bf16=True)
# Use GPU with fp16
# model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-14b", device_map="auto", trust_remote_code=True, fp16=True)
# Use CPU
# model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-14b", device_map="cpu", trust_remote_code=True)
# Automatically select device and precision
model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-14b", device_map="auto", trust_remote_code=True)
text = "西田幾多郎は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=200,
        min_new_tokens=200,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )
output = tokenizer.decode(output_ids.tolist()[0])
print(output)
~~~~
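The model supports a maximum sequence length of 8192, so the prompt and the generated tokens together should fit within that window. The following is a minimal sketch of one way to enforce this, assuming a simple policy that keeps the most recent prompt tokens. The helper `truncate_prompt` and the constant `MAX_SEQ_LEN` are hypothetical names introduced for illustration and are not part of the model's API. The sketch operates on a flat list of token IDs rather than a batched tensor.

```python
MAX_SEQ_LEN = 8192  # maximum sequence length stated in this model card


def truncate_prompt(token_ids, max_new_tokens, max_seq_len=MAX_SEQ_LEN):
    """Keep the most recent prompt tokens so prompt + generation fits the window.

    Hypothetical helper: left-truncation is an assumed policy, not the authors'.
    """
    budget = max_seq_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for a prompt")
    # Slicing from the end drops the oldest tokens first.
    return token_ids[-budget:]


prompt_ids = list(range(10000))  # stand-in for real token IDs
kept = truncate_prompt(prompt_ids, max_new_tokens=200)
print(len(kept))
```

With the tokenizer shown above, a flat list can be obtained with `tokenizer.encode(text, add_special_tokens=False)` (without `return_tensors`) and converted back with `torch.tensor([kept])` before calling `model.generate`.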
---
# Tokenization
The model uses the original Qwen tokenizer. It augments the [`cl100k` tiktoken tokenizer](https://github.com/openai/tiktoken) and has a vocabulary size of 151,936. The inclusive vocabulary helps the model achieve better tokenization efficiency, especially for Japanese text.
We compared the `Qwen` tokenizer (as used in `nekomata`) and the `llama-2` tokenizer (as used in `youri`) on different text collections and found that the Qwen tokenizer achieves a much better byte2token rate (i.e. the average number of tokens produced from 1 byte of text), as shown in the following table. A lower byte2token rate indicates better tokenization efficiency.
| Tokenizer | Japanese | English | Multilingual |
| --- | --- | --- | --- |
| Qwen | 0.24 | 0.27 | 0.27 |
| llama-2 | 0.40 | 0.29 | 0.36 |
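The byte2token rate above can be sketched as follows. This is an illustrative calculation rather than the evaluation script used for the table: it assumes the rate is the total token count divided by the total UTF-8 byte count over a collection, and `byte2token_rate` and `char_encode` are hypothetical names. The toy encoder emits one token per character so that the example runs without downloading a tokenizer.

```python
def byte2token_rate(texts, encode):
    """Average number of tokens produced per UTF-8 byte over a text collection.

    `encode` is any callable mapping a string to a sequence of tokens.
    Assumption: the rate is computed corpus-wide, not averaged per document.
    """
    total_tokens = sum(len(encode(t)) for t in texts)
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    return total_tokens / total_bytes


def char_encode(text):
    # Toy stand-in for a real tokenizer: one token per character.
    return list(text)


sample = ["猫又は妖怪です。"]  # 8 characters, 3 UTF-8 bytes each
rate = byte2token_rate(sample, char_encode)
print(f"toy rate: {rate:.3f}")

# Relative token count implied by the table for Japanese text.
qwen_ja, llama2_ja = 0.24, 0.40
ratio = qwen_ja / llama2_ja
print(f"Qwen needs about {ratio:.0%} of the tokens llama-2 needs")
```

To measure a real tokenizer, pass something like `lambda t: tokenizer.encode(t, add_special_tokens=False)` as `encode`. Read this way, the Japanese column of the table suggests the Qwen tokenizer produces roughly 40% fewer tokens than the llama-2 tokenizer for the same text.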
---
# How to cite
~~~
@misc{RinnaNekomata14b,
url={https://huggingface.co./rinna/nekomata-14b},
title={rinna/nekomata-14b},
author={Zhao, Tianyu and Kaga, Akio and Sawada, Kei}
}
~~~
---
# License
[Tongyi Qianwen LICENSE AGREEMENT](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT)