{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6bad311a-c949-4246-9e6b-6d4ec76699b7",
   "metadata": {},
   "source": [
    "# 4.3 基于llama的基因数据词典扩充"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d42860cf-14fc-48f5-ac6c-1fd92a6a92ba",
   "metadata": {},
   "source": [
    "前面介绍了huggingface自带的分词器构建代码,这里介绍下更为通用的sentencepiece,部分huggingface其实就是来自于这个框架。\n",
    "\n",
    "SentencePiece 是一个语言无关的分词框架,由 Google 开发并开源。它不同于传统的基于词汇表(如词典)的分词方法,而是采用一种无监督的学习方式来训练模型,从而将文本分割成“子词”单元(subword units)。这种方法使得 SentencePiece 在处理未知词、罕见词以及多语言文本时表现出色。\n",
    "\n",
    "### 主要特点\n",
    "\n",
    "1. **语言无关**:\n",
    "   - SentencePiece 不依赖于任何特定语言的规则或词典,因此它可以应用于任何语言,甚至是混合语言的文本。\n",
    "\n",
    "2. **子词分词**:\n",
    "   - 它生成的是子词级别的 token,而不是完整的单词。这种方式可以有效地处理 OOV (out-of-vocabulary) 问题,并且有助于减少词汇表的大小。\n",
    "\n",
    "3. **无监督学习**:\n",
    "   - SentencePiece 使用无监督的方法从原始文本中学习分词规则,这意味着你只需要提供未标注的文本数据即可训练分词模型。\n",
    "\n",
    "4. **灵活的分词粒度**:\n",
    "   - 可以通过调整参数控制分词的粒度,即生成的子词单元的平均长度。这允许根据具体应用需求优化性能。\n",
    "\n",
    "5. **支持 BPE 和 Unigram LM**:\n",
    "   - SentencePiece 实现了两种流行的分词算法:Byte Pair Encoding (BPE) 和 Unigram Language Model (Unigram LM)。这两种方法各有优劣,可以根据任务选择合适的一种。\n",
    "\n",
    "6. **易于集成**:\n",
    "   - 提供了多种编程语言的绑定,包括 Python、C++、Go 等,方便在不同环境中使用。\n",
    "\n",
    "### 工作流程\n",
    "\n",
    "1. **准备语料库**:\n",
    "   - 收集用于训练分词模型的未标注文本数据。\n",
    "\n",
    "2. **训练模型**:\n",
    "   - 使用 `sentencepiece_trainer` 工具对收集到的文本进行训练,生成分词模型文件。\n",
    "   ```bash\n",
    "   spm_train --input=your_corpus.txt --model_prefix=myprefix --vocab_size=8000\n",
    "   ```\n",
    "\n",
    "3. **编码和解码**:\n",
    "   - 训练完成后,可以使用生成的模型对新文本进行编码(分词)和解码(还原)。\n",
    "   ```python\n",
    "   import sentencepiece as spm\n",
    "\n",
    "   # 加载训练好的模型\n",
    "   sp = spm.SentencePieceProcessor(model_file='myprefix.model')\n",
    "\n",
    "   # 分词\n",
    "   encoded = sp.encode(\"这是一个测试句子。\", out_type=str)\n",
    "   print(encoded)\n",
    "\n",
    "   # 还原\n",
    "   decoded = sp.decode(encoded)\n",
    "   print(decoded)\n",
    "   ```\n",
    "\n",
    "### 应用场景\n",
    "\n",
    "- **自然语言处理 (NLP)**:广泛应用于各种 NLP 任务,如机器翻译、文本分类、情感分析等。\n",
    "- **多语言支持**:特别适合处理包含多种语言的文本。\n",
    "- **低资源语言**:对于那些缺乏丰富词汇资源的语言尤其有用。\n",
    "- **预训练语言模型**:许多现代预训练语言模型(如 BERT、T5、mBART)都采用了 SentencePiece 作为其分词工具。\n",
    "\n",
    "### 小结\n",
    "\n",
    "SentencePiece 是一个强大而灵活的分词框架,适用于广泛的文本处理任务。它的无监督学习特性、语言无关性和高效的子词分词能力使其成为处理复杂和多样化文本数据的理想选择。希望这个简单的介绍能帮助你理解 SentencePiece 的基本概念和应用场景。如果有更多问题或需要进一步的帮助,请随时提问!"
   ]
  },
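  {
   "cell_type": "markdown",
   "id": "b7e2c1d0-5f3a-4c2e-9d41-0a1b2c3d4e5f",
   "metadata": {},
   "source": [
    "As a minimal sketch of the BPE idea mentioned above (plain Python, no SentencePiece required; the toy DNA corpus and helper names here are ours, not part of any library): repeatedly find the most frequent adjacent symbol pair and merge it into a new token.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "def most_frequent_pair(words):\n",
    "    # words maps a tuple of symbols to its corpus frequency\n",
    "    pairs = Counter()\n",
    "    for symbols, freq in words.items():\n",
    "        for a, b in zip(symbols, symbols[1:]):\n",
    "            pairs[(a, b)] += freq\n",
    "    return pairs.most_common(1)[0][0]\n",
    "\n",
    "def merge_pair(words, pair):\n",
    "    # rebuild each word, replacing occurrences of `pair` with the merged symbol\n",
    "    merged = {}\n",
    "    for symbols, freq in words.items():\n",
    "        out, i = [], 0\n",
    "        while i < len(symbols):\n",
    "            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:\n",
    "                out.append(symbols[i] + symbols[i + 1])\n",
    "                i += 2\n",
    "            else:\n",
    "                out.append(symbols[i])\n",
    "                i += 1\n",
    "        merged[tuple(out)] = freq\n",
    "    return merged\n",
    "\n",
    "corpus = {tuple('ACGT'): 5, tuple('ACGG'): 3, tuple('TTAC'): 2}\n",
    "pair = most_frequent_pair(corpus)  # ('A', 'C'): 5 + 3 + 2 = 10 occurrences\n",
    "corpus = merge_pair(corpus, pair)\n",
    "```\n",
    "\n",
    "A real trainer iterates this merge step until the target vocabulary size is reached; SentencePiece additionally handles the `▁` whitespace marker, character coverage, and piece scoring."
   ]
  },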
  {
   "cell_type": "markdown",
   "id": "a8dedb50-a428-4146-8edf-84e699abf81b",
   "metadata": {},
   "source": [
    "## GENE分词器构建"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39b5bf12-eaf0-432e-a2b0-99ba437daf3e",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install sentencepiece"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b732b8e-53d1-4bfa-891b-2d63b886cc4a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sentencepiece as spm\n",
    "\n",
    "spm.SentencePieceTrainer.train(input='../01-data_env/data/dna_1g.txt,../01-data_env/data/protein_1g.txt',\n",
    "                               model_prefix='gene_bpe_seg', \n",
    "                               vocab_size=60000,\n",
    "                               model_type='bpe', #默认是unigram\n",
    "                               num_threads=10,\n",
    "                              )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "19a06b82-31b8-48cb-9c83-ec016da2da8a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['▁TCG', 'ACGGC', 'ACGCG', 'ACAGC', 'AGCG', 'AGCCCC', 'GCGC', 'ACCCG', 'AGCGCG', 'AKCG', 'FVGP', 'MV', 'HLKV', 'HLE', 'ADV', 'ASSC', 'RS', 'AVI', 'YL', 'TS', 'EEP', 'FEG', 'VLGL', 'RLKE', 'GI', 'AI', 'TGC', 'WPR', 'WP', 'DEM', 'DE', 'RS', 'AVW', 'RV', 'EPY', 'TR', 'HFG', 'RVL', 'YS', 'FGV']\n"
     ]
    }
   ],
   "source": [
    "from sentencepiece import SentencePieceProcessor\n",
    "model_path = \"gene_bpe_seg.model\"\n",
    "sp_model = SentencePieceProcessor(model_file=model_path)\n",
    "mm = sp_model.EncodeAsPieces(\"TCGACGGCACGCGACAGCAGCGAGCCCCGCGCACCCGAGCGCGAKCGFVGPMVHLKVHLEADVASSCRSAVIYLTSEEPFEGVLGLRLKEGIAITGCWPRWPDEMDERSAVWRVEPYTRHFGRVLYSFGV\")\n",
    "print(mm)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "958f7bd6-060f-48f4-8afe-02c3032312eb",
   "metadata": {},
   "source": [
    "## 合并词典到llama\n",
    "\n",
    "我们以基础版本的llama为例,进行合并,请注意llama的使用限制。\n",
    "\n",
    "新版本的llama需要自行认证下载。[链接](https://huggingface.co./meta-llama)\n",
    "\n",
    "```\n",
    "#建议在终端下执行\n",
    "pip install -U huggingface_hub\n",
    "export HF_ENDPOINT=https://hf-mirror.com\n",
    "huggingface-cli download --resume-download yahma/llama-7b-hf --local-dir llama-7b-hf\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "3bafcc33-2923-4026-bc39-c6ec716d2e3c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ[\"PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION\"]=\"python\"\n",
    "from transformers import LlamaTokenizer\n",
    "from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model\n",
    "import sentencepiece as spm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "66cb86ed-3225-4bb0-8aca-6005bc918d03",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "32000 60000\n",
      "['<s>', '</s>', '<unk>']\n",
      "[1, 2, 0]\n",
      "{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}\n"
     ]
    }
   ],
   "source": [
    "llama_tokenizer_dir = \"llama-7b-hf\" \n",
    "dna_sp_model_file = \"gene_bpe_seg.model\"\n",
    "\n",
    "# load\n",
    "llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)\n",
    "dna_sp_model = spm.SentencePieceProcessor()\n",
    "dna_sp_model.Load(dna_sp_model_file)\n",
    "\n",
    "llama_spm = sp_pb2_model.ModelProto()\n",
    "llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())\n",
    "dna_spm = sp_pb2_model.ModelProto()\n",
    "dna_spm.ParseFromString(dna_sp_model.serialized_model_proto())\n",
    "\n",
    "# print number of tokens\n",
    "print(len(llama_tokenizer),len(dna_sp_model))\n",
    "print(llama_tokenizer.all_special_tokens)\n",
    "print(llama_tokenizer.all_special_ids)\n",
    "print(llama_tokenizer.special_tokens_map)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7ba4240e-bc08-4be0-8ca3-c4e7a47fa055",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "32000\n",
      "Before:32000\n",
      "New model pieces: 91643\n"
     ]
    }
   ],
   "source": [
    "## Add dna tokens to LLaMA tokenizer\n",
    "llama_spm_tokens_set=set(p.piece for p in llama_spm.pieces)\n",
    "print(len(llama_spm_tokens_set))\n",
    "print(f\"Before:{len(llama_spm_tokens_set)}\")\n",
    "for p in dna_spm.pieces:\n",
    "    piece = p.piece\n",
    "    score = p.score\n",
    "    if piece not in llama_spm_tokens_set:\n",
    "        new_p = sp_pb2_model.ModelProto().SentencePiece()\n",
    "        new_p.piece = piece\n",
    "        new_p.score = score # 0?\n",
    "        llama_spm.pieces.append(new_p)\n",
    "print(f\"New model pieces: {len(llama_spm.pieces)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "a240a7d8-c1a9-4473-a5c5-157a25f97c16",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gene-LLaMA tokenizer has been saved to merged_gene_eng_tokenizer_hf\n"
     ]
    }
   ],
   "source": [
    "## Save\n",
    "output_sp_dir = 'merged_gene_eng_tokenizer_sp'\n",
    "output_hf_dir = 'merged_gene_eng_tokenizer_hf' # the path to save dna-LLaMA tokenizer\n",
    "os.makedirs(output_sp_dir,exist_ok=True)\n",
    "with open(output_sp_dir+'/gene_eng_llama_tokenizer.model', 'wb') as f:\n",
    "    f.write(llama_spm.SerializeToString())\n",
    "\n",
    "tokenizer = LlamaTokenizer(vocab_file=output_sp_dir+'/gene_eng_llama_tokenizer.model')\n",
    "tokenizer.save_pretrained(output_hf_dir)\n",
    "print(f\"gene-LLaMA tokenizer has been saved to {output_hf_dir}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "cbd1f648-f8a0-4f16-b516-2ce3e7c7cfee",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['<s>', '</s>', '<unk>']\n",
      "[1, 2, 0]\n",
      "{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}\n",
      "Test text:\n",
      " TCGACGGCACGCGACAGCAGCGAGCCCCGCGCACCCGAGCGCGAKCGFVGPMVHLKVHLEADVASSCRSAVIYLTSEEPFEGVLGLRLKEGIAITGCWPRWPDEMDERSAVWRVEPYTRHFGRVLYSFGV,\n",
      "The primary use of LLaMA is research on large language models, including\n",
      "Tokenized by LLaMA tokenizer:['▁T', 'CG', 'AC', 'G', 'GC', 'AC', 'GC', 'G', 'AC', 'AG', 'CA', 'GC', 'G', 'AG', 'CC', 'CC', 'GC', 'GC', 'AC', 'CC', 'GA', 'GC', 'GC', 'GA', 'K', 'CG', 'F', 'V', 'G', 'PM', 'V', 'HL', 'K', 'V', 'H', 'LE', 'AD', 'VA', 'SS', 'CR', 'S', 'AV', 'I', 'Y', 'LT', 'SEE', 'PF', 'EG', 'V', 'L', 'GL', 'RL', 'KE', 'G', 'IA', 'IT', 'GC', 'W', 'PR', 'WP', 'DE', 'MD', 'ERS', 'AV', 'WR', 'VE', 'PY', 'TR', 'H', 'F', 'GR', 'V', 'LY', 'SF', 'GV', ',', '<0x0A>', 'The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including']\n",
      "Tokenized by GENE-LLaMA tokenizer:['▁TCG', 'ACGGC', 'ACGCG', 'ACAG', 'CA', 'GCG', 'AGCCCC', 'GCGC', 'ACCCG', 'AGCGCG', 'AKCG', 'FVGP', 'MVHL', 'KV', 'HLE', 'ADV', 'ASSC', 'RSAV', 'I', 'YL', 'TSEE', 'P', 'FEG', 'VLGL', 'RLK', 'EGI', 'AI', 'TGC', 'W', 'PRW', 'P', 'DEM', 'DER', 'SAV', 'W', 'RVE', 'PY', 'TRH', 'FG', 'RVLY', 'SFGV', ',', '<0x0A>', 'The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including']\n"
     ]
    }
   ],
   "source": [
    "# Test\n",
    "llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)\n",
    "dna_llama_tokenizer = LlamaTokenizer.from_pretrained(output_hf_dir)\n",
    "print(tokenizer.all_special_tokens)\n",
    "print(tokenizer.all_special_ids)\n",
    "print(tokenizer.special_tokens_map)\n",
    "text='''TCGACGGCACGCGACAGCAGCGAGCCCCGCGCACCCGAGCGCGAKCGFVGPMVHLKVHLEADVASSCRSAVIYLTSEEPFEGVLGLRLKEGIAITGCWPRWPDEMDERSAVWRVEPYTRHFGRVLYSFGV,\n",
    "The primary use of LLaMA is research on large language models, including'''\n",
    "print(\"Test text:\\n\",text)\n",
    "print(f\"Tokenized by LLaMA tokenizer:{llama_tokenizer.tokenize(text)}\")\n",
    "print(f\"Tokenized by GENE-LLaMA tokenizer:{dna_llama_tokenizer.tokenize(text)}\")"
   ]
  },
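  {
   "cell_type": "markdown",
   "id": "c9f4d2e1-6a0b-4d3f-8e72-1b2c3d4e5f60",
   "metadata": {},
   "source": [
    "Note that merging only produces a new tokenizer: before further pretraining, the base model's embedding matrix (and tied output head) must be resized to the new vocabulary, roughly as below. This is a sketch, not run in this notebook; the output directory name is our own choice, and the newly added rows are randomly initialized until trained.\n",
    "\n",
    "```python\n",
    "from transformers import LlamaForCausalLM, LlamaTokenizer\n",
    "\n",
    "tokenizer = LlamaTokenizer.from_pretrained('merged_gene_eng_tokenizer_hf')\n",
    "model = LlamaForCausalLM.from_pretrained('llama-7b-hf')\n",
    "model.resize_token_embeddings(len(tokenizer))  # 32000 -> 91643 rows\n",
    "model.save_pretrained('llama-7b-gene-resized')\n",
    "```"
   ]
  },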
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46ae7605-2ef8-4927-bff3-2c0325f8df0d",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}