---
language:
- ja
license: apache-2.0
tags:
- ja
- japanese
- text-generation
- lm
- jax
- flax
- lm1b
datasets:
- wiki40b
---

# transformer-lm-japanese-1.0b

This is a JAX/Flax-based transformer language model trained on a Japanese dataset. It is based on the official Flax example code ([lm1b](https://github.com/google/flax/tree/main/examples/lm1b)).
|
19 |
+
|
20 |
+
## Source Code
|
21 |
+
|
22 |
+
We've modified Flax's 'lm1b' example to train on Japanese dataset. You can find the code on Github.
|
23 |
+
|
24 |
+
* [transformer-lm-japanese](https://github.com/FookieMonster/transformer-lm-japanese)
|
25 |
+
|
26 |
+
## Our Blog Post
|
27 |
+
|
28 |
+
* [【0.1Bから作るLLM】 JAX/Flaxで作るTransformer言語モデル](https://zenn.dev/fukugawa/articles/4446573ec0f697)
|
29 |
+
|
30 |
+
## Model Details
|
31 |
+
|
32 |
+
| Model | Params | Layers | Dim | Heads | Dataset | Dataset size | Training time | PPL |
|
33 |
+
|-|-|-|-|-|-|-|-|-|
|
34 |
+
| transformer-lm-japanese-1.0b | 1.0B | 18 | 2048 | 16 | wiki40b/ja | 2.19GB | 4 days | 31.47 |
|
35 |
+
|
36 |
+
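As a rough sanity check, the 1.0B figure in the table can be approximated from the layer count and model dimension alone. This is a back-of-the-envelope sketch assuming a standard Transformer decoder with a 4x MLP expansion and a ~30k SentencePiece vocabulary (the vocabulary size is our assumption; it is not listed in the table):

```python
# Rough parameter-count estimate for the model in the table above.
layers, d_model = 18, 2048
vocab_size = 30_000  # assumed, not taken from the table

# Per block: ~4*d^2 (attention projections) + ~8*d^2 (MLP with 4x expansion).
block_params = 12 * d_model ** 2
embed_params = vocab_size * d_model

total = layers * block_params + embed_params
print(f"{total / 1e9:.2f}B")  # ≈ 0.97B, consistent with the table's 1.0B
```
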
## Usage: FlaxAutoModel

#### Requirements:

```
pip install "transformers>=4.39.0"
pip install jax==0.4.31
pip install flax==0.8.3
pip install sentencepiece==0.1.99

# For CPU
pip install -U "jax[cpu]==0.4.31"

# For GPU
pip install -U "jax[cuda12]==0.4.31"
```

Note: Set **trust_remote_code=True** to load our custom model.

~~~~python
from transformers import AutoTokenizer, FlaxAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("fukugawa/transformer-lm-japanese-1.0b", trust_remote_code=True)
model = FlaxAutoModelForCausalLM.from_pretrained("fukugawa/transformer-lm-japanese-1.0b", trust_remote_code=True)

text = "日本の首都は、"
token_ids = tokenizer.encode(text, return_tensors="jax", add_special_tokens=False)

output_ids = model.generate(
    token_ids,
    do_sample=True,
    temperature=0.6,
    top_k=20,
    max_new_tokens=100
)

output = tokenizer.decode(output_ids[0][0], skip_special_tokens=True)
print(output)
~~~~

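For readers unfamiliar with the `generate` options used above, here is an illustrative, self-contained sketch (not the model's actual implementation) of what `top_k=20` combined with `temperature=0.6` does at each decoding step: keep only the 20 highest logits, rescale them by the temperature, and sample from the resulting softmax. The toy logits and function name are ours, for illustration only.

```python
import math
import random

def sample_top_k(logits, k=20, temperature=0.6, rng=None):
    rng = rng or random.Random()
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Numerically stable softmax over the surviving logits.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    # Inverse-CDF sampling from the truncated distribution.
    r, acc = rng.random(), 0.0
    for token_id, e in zip(top, exps):
        acc += e / total
        if r < acc:
            return token_id
    return top[-1]

# Toy 5-token vocabulary; with k=2 only token ids 0 and 1 can ever be sampled.
token = sample_top_k([2.0, 1.5, 0.1, -1.0, -2.0], k=2, rng=random.Random(0))
print(token)  # always 0 or 1
```

Lower temperatures concentrate probability on the top-ranked tokens, while top-k truncation removes the long tail of unlikely tokens entirely.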
We tested text generation in a Python 3.10 environment on GCP as follows:

* GPU Type: NVIDIA L4 (x 1)
* Machine Type: g2-standard-16 (16 CPUs, 64GB Memory)
* Disk: 256GB
* OS: Ubuntu 22.04 LTS x86/64

## Dataset

* [wiki40b/ja](https://www.tensorflow.org/datasets/catalog/wiki40b?hl=ja#wiki40bja) (2.19GB)

## Tokenization

* [sentencepiece](https://github.com/google/sentencepiece)

## Author

[Ryoichi Fukugawa](https://zenn.dev/fukugawa)