File size: 27,117 Bytes

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1e6d4978-4f0f-4268-aa23-d864857bd6c8",
   "metadata": {},
   "source": [
    "# 4.6 基于llama的基因大模型持续预训练"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c201732-e736-463c-8446-637bf517479f",
   "metadata": {},
   "source": [
    "LLaMA（**Large Language Model Meta AI**）是由 Meta（Facebook）开发的一系列大型语言模型，专注于提供高性能和高效的大语言模型，面向学术研究和开发社区。LLaMA 系列主要强调训练效率、模型性能和对计算资源的高效利用，是 GPT 系列模型的有力竞争者之一。\n",
    "\n",
    "---\n",
    "\n",
    "### **1. LLaMA 模型概述**\n",
    "\n",
    "#### **1.1 LLaMA 1**\n",
    "- **发布**：2023 年 2 月。\n",
    "- **模型参数规模**：\n",
    "  - 7B（70 亿）\n",
    "  - 13B（130 亿）\n",
    "  - 33B（330 亿）\n",
    "  - 65B（650 亿）\n",
    "- **特点**：\n",
    "  - 专注于效率：与 GPT-3 等模型相比，LLaMA 在相同的训练成本下实现了更高的性能。\n",
    "  - 针对研究开放：提供预训练模型权重供研究使用。\n",
    "  - 使用高质量的数据：模型训练使用大量从网络中筛选的高质量文本数据，包括维基百科、书籍和其他高质量来源。\n",
    "- **性能**：\n",
    "  - 在许多 NLP 任务中，LLaMA 的性能超过 GPT-3 和其他同类模型。\n",
    "  - 参数规模较小的版本（如 LLaMA-13B）性能可与 GPT-3（175B 参数）媲美。\n",
    "\n",
    "#### **1.2 LLaMA 2**\n",
    "- **发布**：2023 年 7 月。\n",
    "- **改进**：\n",
    "  - 增强的训练数据：相比 LLaMA 1，使用了更多的高质量数据。\n",
    "  - 引入微调版本：发布了开箱即用的对话模型（LLaMA 2-Chat）。\n",
    "  - 更好的开源支持：LLaMA 2 在商业用途上比 LLaMA 1 更加开放。\n",
    "- **模型参数规模**：\n",
    "  - 7B（70 亿）\n",
    "  - 13B（130 亿）\n",
    "  - 70B（700 亿）\n",
    "- **性能**：\n",
    "  - LLaMA 2 的性能相比 LLaMA 1 有显著提升。\n",
    "  - LLaMA 2-Chat 在对话任务中的表现优于许多现有开源模型。\n",
    "  - 在多个标准基准（如 MMLU）上超过 GPT-4 和 Claude 的开源实现。\n",
    "\n",
    "---\n",
    "\n",
    "### **2. LLaMA 的关键技术特点**\n",
    "\n",
    "#### **2.1 高效的架构设计**\n",
    "- 基于 Transformer 架构。\n",
    "- 针对训练效率和推理速度进行了优化，适合研究和开发。\n",
    "\n",
    "#### **2.2 模型压缩**\n",
    "- 提供更小的参数规模（如 7B 和 13B），以便在更低的计算资源上运行。\n",
    "- 在性能与参数量之间实现了很好的平衡。\n",
    "\n",
    "#### **2.3 训练数据**\n",
    "- 使用从互联网中提取的高质量数据，注重数据清洗和筛选，避免低质量文本对模型的负面影响。\n",
    "\n",
    "#### **2.4 微调能力**\n",
    "- 支持指令微调（Instruction Tuning）和 RLHF（基于人类反馈的强化学习），特别是在 LLaMA 2-Chat 模型中表现优异。\n",
    "\n",
    "---\n",
    "\n",
    "### **3. LLaMA 的性能对比**\n",
    "\n",
    "#### **与 GPT-3 比较**\n",
    "- LLaMA 1-13B 参数模型在许多任务上的性能接近 GPT-3-175B。\n",
    "- LLaMA 2-70B 在多个任务上超过 GPT-3。\n",
    "\n",
    "#### **与其他开源模型比较**\n",
    "- LLaMA 2 在许多基准测试中优于其他开源模型（如 Falcon 和 MPT）。\n",
    "- LLaMA 2-Chat 提供了与 ChatGPT 类似的对话能力，适用于对话任务。\n",
    "\n",
    "---\n",
    "\n",
    "### **4. 应用场景**\n",
    "\n",
    "1. **研究**：\n",
    "   - 开源权重适合学术研究，推动了对大语言模型的进一步探索。\n",
    "\n",
    "2. **对话系统**：\n",
    "   - LLaMA 2-Chat 专为对话任务设计，适合开发智能客服、聊天机器人等应用。\n",
    "\n",
    "3. **生成任务**：\n",
    "   - 支持文本生成、补全、摘要等任务。\n",
    "\n",
    "4. **微调与定制**：\n",
    "   - 可以基于特定领域数据进行微调，如医学、法律、教育等领域的专用模型。\n",
    "\n",
    "---\n",
    "\n",
    "### **5. 开源与获取方式**\n",
    "\n",
    "#### **1. 开源**\n",
    "- LLaMA 1：需要申请权限才能获得模型权重。\n",
    "- LLaMA 2：更加开放，允许商业用途，模型和权重可以通过 Meta 的合作平台获取（如 Hugging Face 和 AWS）。\n",
    "\n",
    "#### **2. 下载与使用**\n",
    "使用 Hugging Face 加载模型：\n",
    "```python\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "\n",
    "model_name = \"meta-llama/Llama-2-7b-hf\"  # 替换为具体模型\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
    "model = AutoModelForCausalLM.from_pretrained(model_name)\n",
    "\n",
    "# 使用模型生成文本\n",
    "inputs = tokenizer(\"Hello, how are you?\", return_tensors=\"pt\")\n",
    "outputs = model.generate(**inputs, max_length=50)\n",
    "print(tokenizer.decode(outputs[0], skip_special_tokens=True))\n",
    "```\n",
    "\n",
    "---\n",
    "\n",
    "### **6. 总结**\n",
    "\n",
    "#### **优势**\n",
    "- **高性能**：在多个基准任务上表现出色。\n",
    "- **高效训练**：小参数模型能与大模型媲美。\n",
    "- **开放性**：LLaMA 2 提供了较为开放的商用许可。\n",
    "\n",
    "#### **局限**\n",
    "- 模型需要高质量数据和强大算力训练，对推理设备也有一定要求。\n",
    "\n",
    "LLaMA 系列以其高效和开放的特点，为大模型研究和应用带来了强大动力，是当前大语言模型生态的重要组成部分。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7fb0d648-f891-47b9-a644-af5263fa9718",
   "metadata": {},
   "source": [
    "---\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b3c9ebb-213b-4dc4-a712-5a819fea3197",
   "metadata": {},
   "source": [
    "**大模型的持续预训练**（Continual Pretraining of Large Models）是指在基础预训练模型（如 GPT、BERT 等）的基础上，通过引入新的数据或特定领域的数据继续进行预训练的过程。这一过程旨在让模型在特定场景或任务中表现更好，同时保留其通用能力。\n",
    "\n",
    "---\n",
    "\n",
    "### **1. 持续预训练的概念**\n",
    "\n",
    "持续预训练是一种在通用大模型的预训练基础上，进一步优化和适配模型的方法，主要包括以下两种场景：\n",
    "1. **领域适配**：\n",
    "   - 将预训练模型在特定领域的数据上继续训练，使其对该领域的语料理解更深刻，例如法律、医学、金融等领域。\n",
    "2. **性能优化**：\n",
    "   - 通过引入更多的通用数据或多样化的数据类型，扩展模型的通用能力，提高性能。\n",
    "\n",
    "---\n",
    "\n",
    "### **2. 持续预训练的目标**\n",
    "\n",
    "1. **提升领域性能**：\n",
    "   - 在特定领域任务上，模型能够更好地理解特定领域的语言模式和知识。\n",
    "   \n",
    "2. **增强模型鲁棒性**：\n",
    "   - 通过引入新的数据或增强数据多样性，使模型对未见数据表现更稳定。\n",
    "\n",
    "3. **优化资源利用**：\n",
    "   - 通过复用已有的大模型权重，只需训练少量额外步骤，避免从零开始重新训练模型。\n",
    "\n",
    "---\n",
    "\n",
    "### **3. 持续预训练的步骤**\n",
    "\n",
    "#### **（1）数据准备**\n",
    "- **领域数据**：针对特定领域（如医学、法律、科技）收集高质量语料。\n",
    "- **新语料整合**：补充模型未见过的多样化语料。\n",
    "- **数据清洗**：确保数据无噪声、语言风格一致。\n",
    "\n",
    "#### **（2）模型初始化**\n",
    "- 使用现有的预训练模型作为初始权重，例如 Hugging Face 提供的 GPT-2 或 BERT 模型。\n",
    "\n",
    "#### **（3）训练设置**\n",
    "- **超参数调整**：\n",
    "  - 通常使用较小的学习率（例如 `1e-5` 或 `2e-5`）以避免破坏已有的知识。\n",
    "- **训练策略**：\n",
    "  - 冻结部分参数（如嵌入层或前几层）以保留通用能力，仅调整高层或新加入的部分。\n",
    "\n",
    "#### **（4）评估和验证**\n",
    "- 使用领域任务的数据集对模型进行评估，验证其在目标任务中的改进效果。\n",
    "\n",
    "---\n",
    "\n",
    "### **4. 持续预训练的常见方法**\n",
    "\n",
    "#### **（1）全量持续预训练**\n",
    "- 对整个模型的参数进行调整。\n",
    "- **优点**：适合较大规模的新数据训练，能显著提升领域性能。\n",
    "- **缺点**：计算资源需求大，可能导致模型过拟合。\n",
    "\n",
    "#### **（2）冻结部分参数**\n",
    "- 冻结低层参数，仅微调高层。\n",
    "- **优点**：保留通用知识，减少计算开销。\n",
    "- **缺点**：对领域特定知识的适配可能不足。\n",
    "\n",
    "#### **（3）参数高效微调（PEFT）**\n",
    "- 使用 PEFT 方法（如 LoRA、Adapter）进行预训练：\n",
    "  - **LoRA**：通过低秩矩阵分解，微调部分关键模块。\n",
    "  - **Adapter**：在 Transformer 层中插入小型适配模块。\n",
    "- **优点**：显著减少需要更新的参数量。\n",
    "\n",
    "---\n",
    "\n",
    "### **5. 持续预训练的典型应用**\n",
    "\n",
    "1. **领域适配**\n",
    "   - **医学**：将预训练模型在 PubMed 或生物医学数据集上进行持续预训练。\n",
    "   - **法律**：使用法律文档进一步训练基础模型。\n",
    "   - **金融**：通过金融新闻、报告语料提升模型在金融领域的表现。\n",
    "\n",
    "2. **多语言扩展**\n",
    "   - 引入多语言语料，扩展模型的多语言能力。\n",
    "\n",
    "3. **数据更新**\n",
    "   - 持续加入新数据（如时事新闻）以适配最新语言模式。\n",
    "\n",
    "4. **特殊任务优化**\n",
    "   - 针对特定任务（如代码生成、对话）引入专用数据进行训练。\n",
    "\n",
    "---\n",
    "\n",
    "### **6. 实现持续预训练的代码示例**\n",
    "\n",
    "以下示例基于 Hugging Face 实现 GPT-2 的持续预训练：\n",
    "\n",
    "```python\n",
    "from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments\n",
    "from datasets import load_dataset\n",
    "\n",
    "# 1. 加载预训练模型和分词器\n",
    "model_name = \"gpt2\"\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
    "model = AutoModelForCausalLM.from_pretrained(model_name)\n",
    "\n",
    "# 2. 加载新语料数据\n",
    "dataset = load_dataset(\"text\", data_files={\"train\": \"domain_corpus.txt\"})\n",
    "\n",
    "# 3. 数据预处理\n",
    "def tokenize_function(examples):\n",
    "    return tokenizer(examples[\"text\"], truncation=True, max_length=1024, padding=\"max_length\")\n",
    "\n",
    "tokenized_dataset = dataset.map(tokenize_function, batched=True)\n",
    "\n",
    "# 4. 设置训练参数\n",
    "training_args = TrainingArguments(\n",
    "    output_dir=\"./gpt2_domain_adapted\",\n",
    "    overwrite_output_dir=True,\n",
    "    per_device_train_batch_size=4,\n",
    "    num_train_epochs=3,\n",
    "    learning_rate=5e-5,\n",
    "    save_steps=500,\n",
    "    save_total_limit=2,\n",
    "    logging_dir=\"./logs\",\n",
    "    evaluation_strategy=\"no\",  # 评估策略可以根据需要调整\n",
    "    fp16=True,  # 混合精度训练\n",
    ")\n",
    "\n",
    "# 5. 定义 Trainer 并启动训练\n",
    "trainer = Trainer(\n",
    "    model=model,\n",
    "    args=training_args,\n",
    "    train_dataset=tokenized_dataset[\"train\"],\n",
    "    tokenizer=tokenizer,\n",
    ")\n",
    "\n",
    "trainer.train()\n",
    "\n",
    "# 6. 保存模型\n",
    "model.save_pretrained(\"./gpt2_domain_adapted\")\n",
    "tokenizer.save_pretrained(\"./gpt2_domain_adapted\")\n",
    "```\n",
    "\n",
    "---\n",
    "\n",
    "### **7. 持续预训练的挑战**\n",
    "\n",
    "1. **灾难性遗忘**：\n",
    "   - 持续预训练可能导致模型丧失之前学到的知识。\n",
    "   - **解决方法**：使用少量原始数据进行联合训练。\n",
    "\n",
    "2. **计算资源需求**：\n",
    "   - 需要大量显存和算力，特别是对于大规模模型和数据。\n",
    "\n",
    "3. **数据质量和多样性**：\n",
    "   - 新引入的数据可能包含噪声，影响模型性能。\n",
    "\n",
    "---\n",
    "\n",
    "### **8. 持续预训练的优势**\n",
    "\n",
    "- 提高特定领域或任务的性能。\n",
    "- 更高效地利用已有模型权重，避免从头训练。\n",
    "- 保留原始模型的通用能力，同时增强领域适应性。\n",
    "\n",
    "---\n",
    "\n",
    "### **总结**\n",
    "\n",
    "持续预训练是适配领域任务和提升模型性能的重要方法，通过引入新数据或优化模型训练策略，可以让大模型在特定场景中表现更优。配合参数高效微调方法（如 LoRA），还可显著降低计算开销，提升训练效率。这种技术在学术研究、工业应用和前沿领域（如法律、医学等）中均具有广泛价值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca41ad33-18fb-44da-8f79-0380b5c9dcaa",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "3038550c-cc92-45c9-8bb4-46c58688bfc5",
   "metadata": {},
   "source": [
    "## 本节任务\n",
    "本节任务是基于llama。训练一个能够处理dna和protein蛋白质数据的基础预训练大模型，数据为第一章中的预训练数据，包括英文数据。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aec90d65-ac62-4394-a526-ca62d8bdbad4",
   "metadata": {},
   "source": [
    "## 环境设置\n",
    "并行环境对transformer、peft等的版本要求比较高，如果版本不匹配可能会出现各种异常问题\n",
    "之前的课程，都是单GPU运行，一般不存在版本问题，默认安装的都是最新版本。但运行并行环境时，需要确认下版本再运行，本课程运行并行环境如下：\n",
    "\n",
    "* Python 3.12.3\n",
    "* transformers                   4.45.2\n",
    "* peft                           0.3.0.dev0\n",
    "* deepspeed                      0.15.2\n",
    "* accelerate                     1.0.0\n",
    "\n",
    "如果不是，可以重新安装即可：\n",
    "```\n",
    "pip install transformers==4.45.2 deepspeed==0.15.2 accelerate==1.0.0\n",
    "\n",
    "#peft参考使用的是chinese llama的版本，需要git安装\n",
    "\n",
    "git clone https://github.com/huggingface/peft.git\n",
    "\n",
    "cd peft\n",
    "\n",
    "git checkout 13e53fc\n",
    "\n",
    "pip install . \n",
    "```\n",
    "如果有环境问题，可以查看本目录下的pip_list.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1bd33b8-2e05-4b59-9d8f-c48de194cfd6",
   "metadata": {},
   "source": [
    "## 代码运行\n",
    "\n",
    "```\n",
    "# 复制第一章训练数据,包括dna，protein，还有英文数据，添加英文数据是为了避免遗忘问题\n",
    "\n",
    "mkdir train_data\n",
    "cp ../01-data_env/data/*.txt train_data/\n",
    "使用这些数据，6卡4090大概大致需要训练16个小时，autodl也需要近200块钱了。\n",
    "\n",
    "建议学习时，可以使用1/10的数据训练：\n",
    "awk ‘NR%10==1’ dna_1g.txt > dna.txt\n",
    "rm dna_1g.txt\n",
    "其他2类数据依次类推\n",
    "\n",
    "这样大概需要2到3个小时就能训练完成了\n",
    "\n",
    "\n",
    "#持续预训练\n",
    "./run_pt.sh\n",
    "\n",
    "#合并模型\n",
    "./merge_pt_model.sh\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4960a36c-7529-4db8-b91d-df91245f79d9",
   "metadata": {},
   "source": [
    "## 模型验证"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "69b3e97f-a801-4264-a651-a854bcfba9c6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer, AutoConfig,AutoModel\n",
    "from transformers import DataCollatorForLanguageModeling\n",
    "from transformers import Trainer, TrainingArguments\n",
    "from transformers import  AutoConfig, AutoModelForCausalLM,LlamaForCausalLM,LlamaTokenizer\n",
    "from tokenizers import Tokenizer\n",
    "from datasets import load_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "339435d9-9379-4b30-ae8b-50feee1ba714",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LlamaTokenizer(name_or_path='dnahlm-merge-hf', vocab_size=91643, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={\n",
       "\t0: AddedToken(\"<unk>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
       "\t1: AddedToken(\"<s>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
       "\t2: AddedToken(\"</s>\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
       "}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenizer = LlamaTokenizer.from_pretrained(\"dnahlm-merge-hf\")\n",
    "tokenizer.pad_token = tokenizer.eos_token\n",
    "tokenizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "d0f154bb-b1ab-4611-a14c-9b403043fd96",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "342e4ab139b64bb78f0429c2f92c8310",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "LlamaForCausalLM(\n",
       "  (model): LlamaModel(\n",
       "    (embed_tokens): Embedding(91643, 4096, padding_idx=0)\n",
       "    (layers): ModuleList(\n",
       "      (0-31): 32 x LlamaDecoderLayer(\n",
       "        (self_attn): LlamaSdpaAttention(\n",
       "          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)\n",
       "          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)\n",
       "          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)\n",
       "          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)\n",
       "          (rotary_emb): LlamaRotaryEmbedding()\n",
       "        )\n",
       "        (mlp): LlamaMLP(\n",
       "          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)\n",
       "          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)\n",
       "          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)\n",
       "          (act_fn): SiLU()\n",
       "        )\n",
       "        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-06)\n",
       "        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-06)\n",
       "      )\n",
       "    )\n",
       "    (norm): LlamaRMSNorm((4096,), eps=1e-06)\n",
       "    (rotary_emb): LlamaRotaryEmbedding()\n",
       "  )\n",
       "  (lm_head): Linear(in_features=4096, out_features=91643, bias=False)\n",
       ")"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model = LlamaForCausalLM.from_pretrained(\"dnahlm-merge-hf\") #continue pretrain\n",
    "model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "792a9f78-1828-4695-9f6e-479a704ea7e8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LlamaConfig {\n",
       "  \"_name_or_path\": \"dnahlm-merge-hf\",\n",
       "  \"architectures\": [\n",
       "    \"LlamaForCausalLM\"\n",
       "  ],\n",
       "  \"attention_bias\": false,\n",
       "  \"attention_dropout\": 0.0,\n",
       "  \"bos_token_id\": 1,\n",
       "  \"eos_token_id\": 2,\n",
       "  \"head_dim\": 128,\n",
       "  \"hidden_act\": \"silu\",\n",
       "  \"hidden_size\": 4096,\n",
       "  \"initializer_range\": 0.02,\n",
       "  \"intermediate_size\": 11008,\n",
       "  \"max_position_embeddings\": 2048,\n",
       "  \"mlp_bias\": false,\n",
       "  \"model_type\": \"llama\",\n",
       "  \"num_attention_heads\": 32,\n",
       "  \"num_hidden_layers\": 32,\n",
       "  \"num_key_value_heads\": 32,\n",
       "  \"pad_token_id\": 0,\n",
       "  \"pretraining_tp\": 1,\n",
       "  \"rms_norm_eps\": 1e-06,\n",
       "  \"rope_scaling\": null,\n",
       "  \"rope_theta\": 10000.0,\n",
       "  \"tie_word_embeddings\": false,\n",
       "  \"torch_dtype\": \"float16\",\n",
       "  \"transformers_version\": \"4.45.2\",\n",
       "  \"use_cache\": true,\n",
       "  \"vocab_size\": 91643\n",
       "}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from transformers import AutoConfig\n",
    "# 加载配置\n",
    "config = AutoConfig.from_pretrained('dnahlm-merge-hf')\n",
    "config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "49021c65-54bb-4a97-a96d-b030cc3dcd13",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test text:\n",
      " GCTGACTCTGCCAGGATGGAATGAAATTAGGTTGTTTTAATTATAATGTAAAGTCAGTTCTAGTCAGACATAGTCACATAGGCAAGTAAGGGAACCTAAAATTGCTTGGAAT,\n",
      "KCGFVGPMVHLKVHLEADVASSCRSAVIYLTSEEPFEGVLGLRLKEGIAITGCWPRWPDEMDERSAVWRVEPYTRHFGRVLYSFGV,\n",
      "The primary use of LLaMA is research on large language models, including\n",
      "Tokenized by DNA-LLaMA tokenizer:['▁GC', 'TGA', 'CT', 'C', 'TGCC', 'AGGATGG', 'AATG', 'AAATT', 'AGGTTG', 'TTTTAATT', 'ATAATGTAA', 'AGTCAG', 'TTCTAG', 'TCAG', 'ACATAG', 'TC', 'ACATAGG', 'CA', 'AGTAAGGG', 'AAC', 'CT', 'AAAATTGC', 'TTGG', 'AAT', ',', '<0x0A>', 'KCG', 'FVGP', 'MVHL', 'KV', 'HLE', 'ADV', 'ASSC', 'RSAV', 'I', 'YL', 'TSEE', 'P', 'FEG', 'VLGL', 'RLK', 'EGI', 'AI', 'TGC', 'W', 'PRW', 'P', 'DEM', 'DER', 'SAV', 'W', 'RVE', 'PY', 'TRH', 'FG', 'RVLY', 'SFGV', ',', '<0x0A>', 'The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including']\n"
     ]
    }
   ],
   "source": [
    "text='''GCTGACTCTGCCAGGATGGAATGAAATTAGGTTGTTTTAATTATAATGTAAAGTCAGTTCTAGTCAGACATAGTCACATAGGCAAGTAAGGGAACCTAAAATTGCTTGGAAT,\n",
    "KCGFVGPMVHLKVHLEADVASSCRSAVIYLTSEEPFEGVLGLRLKEGIAITGCWPRWPDEMDERSAVWRVEPYTRHFGRVLYSFGV,\n",
    "The primary use of LLaMA is research on large language models, including'''\n",
    "print(\"Test text:\\n\",text)\n",
    "print(f\"Tokenized by DNA-LLaMA tokenizer:{tokenizer.tokenize(text)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "ebf869c8-866d-4770-8f64-79d671f88663",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e497889a1c3c484cb57c4b6fd93b45ab",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some parameters are on the meta device because they were offloaded to the cpu.\n",
      "/root/miniconda3/lib/python3.12/site-packages/transformers/generation/utils.py:1220: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.\n",
      "  warnings.warn(\n",
      "Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'generated_text': 'The key to life is to accept the fact that you are going to die. The key to'}]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import torch\n",
    "from transformers import pipeline\n",
    "\n",
    "model_id = \"dnahlm-merge-hf\"\n",
    "\n",
    "pipe = pipeline(\n",
    "    \"text-generation\", \n",
    "    model=model_id, \n",
    "    #torch_dtype=torch.bfloat16, \n",
    "    device_map=\"auto\",\n",
    ")\n",
    "\n",
    "pipe(\"The key to life is\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "40a22c70-f1c4-4cd5-a118-2f5db40790e6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'generated_text': 'GGAATGAAATTAGGTTGTTTTAATTATAATGTAAAGTCAGTTCTCTCCTCCTCCTCCTC'}]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe(\"GGAATGAAATTAGGTTGTTTTAATTATAATGTAAAGTCAGTTCT\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "aec95d0a-4269-4540-bf14-4ce157b9a194",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'generated_text': 'KCGFVGPMVHLKVHLEADVASSCRSAVIYLTSEEPFEGVLGLRLKETLK'}]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe(\"KCGFVGPMVHLKVHLEADVASSCRSAVIYLTSEEPFEGVLGLRLK\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c1cfab60-2820-4885-8961-0290c49dfbec",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}