Files changed (1)
  1. README.md +3 -29
README.md CHANGED
@@ -1,29 +1,3 @@
- ---
- license: apache-2.0
- ---
-
- Upstage `solar-1-mini` tokenizer
- - Vocab size: 64,000
- - Langauge support: English, Korean, Japanese and more
-
- Please use this tokenizer for tokenizing inputs for the Upstage [solar-1-mini-chat](https://developers.upstage.ai/docs/apis/chat) model.
-
- You can load it with the tokenizer library like this:
-
- ```python
- from tokenizers import Tokenizer
-
- tokenizer = Tokenizer.from_pretrained("upstage/solar-1-mini-tokenizer")
- text = "Hi, how are you?"
- enc = tokenizer.encode(text)
- print("Encoded input:")
- print(enc)
-
- inv_vocab = {v: k for k, v in tokenizer.get_vocab().items()}
- tokens = [inv_vocab[token_id] for token_id in enc.ids]
- print("Tokens:")
- print(tokens)
-
- number_of_tokens = len(enc.ids)
- print("Number of tokens:", number_of_tokens)
- ```
+ ---
+ license: apache-2.0
+ ---