{ "cells": [ { "cell_type": "markdown", "id": "1e6d4978-4f0f-4268-aa23-d864857bd6c8", "metadata": {}, "source": [ "# 4.6 基于llama的基因大模型持续预训练" ] }, { "cell_type": "markdown", "id": "2c201732-e736-463c-8446-637bf517479f", "metadata": {}, "source": [ "LLaMA(**Large Language Model Meta AI**)是由 Meta(Facebook)开发的一系列大型语言模型,专注于提供高性能和高效的大语言模型,面向学术研究和开发社区。LLaMA 系列主要强调训练效率、模型性能和对计算资源的高效利用,是 GPT 系列模型的有力竞争者之一。\n", "\n", "---\n", "\n", "### **1. LLaMA 模型概述**\n", "\n", "#### **1.1 LLaMA 1**\n", "- **发布**:2023 年 2 月。\n", "- **模型参数规模**:\n", " - 7B(70 亿)\n", " - 13B(130 亿)\n", " - 33B(330 亿)\n", " - 65B(650 亿)\n", "- **特点**:\n", " - 专注于效率:与 GPT-3 等模型相比,LLaMA 在相同的训练成本下实现了更高的性能。\n", " - 针对研究开放:提供预训练模型权重供研究使用。\n", " - 使用高质量的数据:模型训练使用大量从网络中筛选的高质量文本数据,包括维基百科、书籍和其他高质量来源。\n", "- **性能**:\n", " - 在许多 NLP 任务中,LLaMA 的性能超过 GPT-3 和其他同类模型。\n", " - 参数规模较小的版本(如 LLaMA-13B)性能可与 GPT-3(175B 参数)媲美。\n", "\n", "#### **1.2 LLaMA 2**\n", "- **发布**:2023 年 7 月。\n", "- **改进**:\n", " - 增强的训练数据:相比 LLaMA 1,使用了更多的高质量数据。\n", " - 引入微调版本:发布了开箱即用的对话模型(LLaMA 2-Chat)。\n", " - 更好的开源支持:LLaMA 2 在商业用途上比 LLaMA 1 更加开放。\n", "- **模型参数规模**:\n", " - 7B(70 亿)\n", " - 13B(130 亿)\n", " - 70B(700 亿)\n", "- **性能**:\n", " - LLaMA 2 的性能相比 LLaMA 1 有显著提升。\n", " - LLaMA 2-Chat 在对话任务中的表现优于许多现有开源模型。\n", " - 在多个标准基准(如 MMLU)上超过 GPT-4 和 Claude 的开源实现。\n", "\n", "---\n", "\n", "### **2. LLaMA 的关键技术特点**\n", "\n", "#### **2.1 高效的架构设计**\n", "- 基于 Transformer 架构。\n", "- 针对训练效率和推理速度进行了优化,适合研究和开发。\n", "\n", "#### **2.2 模型压缩**\n", "- 提供更小的参数规模(如 7B 和 13B),以便在更低的计算资源上运行。\n", "- 在性能与参数量之间实现了很好的平衡。\n", "\n", "#### **2.3 训练数据**\n", "- 使用从互联网中提取的高质量数据,注重数据清洗和筛选,避免低质量文本对模型的负面影响。\n", "\n", "#### **2.4 微调能力**\n", "- 支持指令微调(Instruction Tuning)和 RLHF(基于人类反馈的强化学习),特别是在 LLaMA 2-Chat 模型中表现优异。\n", "\n", "---\n", "\n", "### **3. LLaMA 的性能对比**\n", "\n", "#### **与 GPT-3 比较**\n", "- LLaMA 1-13B 参数模型在许多任务上的性能接近 GPT-3-175B。\n", "- LLaMA 2-70B 在多个任务上超过 GPT-3。\n", "\n", "#### **与其他开源模型比较**\n", "- LLaMA 2 在许多基准测试中优于其他开源模型(如 Falcon 和 MPT)。\n", "- LLaMA 2-Chat 提供了与 ChatGPT 类似的对话能力,适用于对话任务。\n", "\n", "---\n", "\n", "### **4. 应用场景**\n", "\n", "1. **研究**:\n", " - 开源权重适合学术研究,推动了对大语言模型的进一步探索。\n", "\n", "2. **对话系统**:\n", " - LLaMA 2-Chat 专为对话任务设计,适合开发智能客服、聊天机器人等应用。\n", "\n", "3. **生成任务**:\n", " - 支持文本生成、补全、摘要等任务。\n", "\n", "4. **微调与定制**:\n", " - 可以基于特定领域数据进行微调,如医学、法律、教育等领域的专用模型。\n", "\n", "---\n", "\n", "### **5. 开源与获取方式**\n", "\n", "#### **1. 开源**\n", "- LLaMA 1:需要申请权限才能获得模型权重。\n", "- LLaMA 2:更加开放,允许商业用途,模型和权重可以通过 Meta 的合作平台获取(如 Hugging Face 和 AWS)。\n", "\n", "#### **2. 下载与使用**\n", "使用 Hugging Face 加载模型:\n", "```python\n", "from transformers import AutoModelForCausalLM, AutoTokenizer\n", "\n", "model_name = \"meta-llama/Llama-2-7b-hf\" # 替换为具体模型\n", "tokenizer = AutoTokenizer.from_pretrained(model_name)\n", "model = AutoModelForCausalLM.from_pretrained(model_name)\n", "\n", "# 使用模型生成文本\n", "inputs = tokenizer(\"Hello, how are you?\", return_tensors=\"pt\")\n", "outputs = model.generate(**inputs, max_length=50)\n", "print(tokenizer.decode(outputs[0], skip_special_tokens=True))\n", "```\n", "\n", "---\n", "\n", "### **6. 总结**\n", "\n", "#### **优势**\n", "- **高性能**:在多个基准任务上表现出色。\n", "- **高效训练**:小参数模型能与大模型媲美。\n", "- **开放性**:LLaMA 2 提供了较为开放的商用许可。\n", "\n", "#### **局限**\n", "- 模型需要高质量数据和强大算力训练,对推理设备也有一定要求。\n", "\n", "LLaMA 系列以其高效和开放的特点,为大模型研究和应用带来了强大动力,是当前大语言模型生态的重要组成部分。" ] }, { "cell_type": "markdown", "id": "7fb0d648-f891-47b9-a644-af5263fa9718", "metadata": {}, "source": [ "---\n", "---" ] }, { "cell_type": "markdown", "id": "8b3c9ebb-213b-4dc4-a712-5a819fea3197", "metadata": {}, "source": [ "**大模型的持续预训练**(Continual Pretraining of Large Models)是指在基础预训练模型(如 GPT、BERT 等)的基础上,通过引入新的数据或特定领域的数据继续进行预训练的过程。这一过程旨在让模型在特定场景或任务中表现更好,同时保留其通用能力。\n", "\n", "---\n", "\n", "### **1. 持续预训练的概念**\n", "\n", "持续预训练是一种在通用大模型的预训练基础上,进一步优化和适配模型的方法,主要包括以下两种场景:\n", "1. **领域适配**:\n", " - 将预训练模型在特定领域的数据上继续训练,使其对该领域的语料理解更深刻,例如法律、医学、金融等领域。\n", "2. **性能优化**:\n", " - 通过引入更多的通用数据或多样化的数据类型,扩展模型的通用能力,提高性能。\n", "\n", "---\n", "\n", "### **2. 持续预训练的目标**\n", "\n", "1. **提升领域性能**:\n", " - 在特定领域任务上,模型能够更好地理解特定领域的语言模式和知识。\n", " \n", "2. **增强模型鲁棒性**:\n", " - 通过引入新的数据或增强数据多样性,使模型对未见数据表现更稳定。\n", "\n", "3. **优化资源利用**:\n", " - 通过复用已有的大模型权重,只需训练少量额外步骤,避免从零开始重新训练模型。\n", "\n", "---\n", "\n", "### **3. 持续预训练的步骤**\n", "\n", "#### **(1)数据准备**\n", "- **领域数据**:针对特定领域(如医学、法律、科技)收集高质量语料。\n", "- **新语料整合**:补充模型未见过的多样化语料。\n", "- **数据清洗**:确保数据无噪声、语言风格一致。\n", "\n", "#### **(2)模型初始化**\n", "- 使用现有的预训练模型作为初始权重,例如 Hugging Face 提供的 GPT-2 或 BERT 模型。\n", "\n", "#### **(3)训练设置**\n", "- **超参数调整**:\n", " - 通常使用较小的学习率(例如 `1e-5` 或 `2e-5`)以避免破坏已有的知识。\n", "- **训练策略**:\n", " - 冻结部分参数(如嵌入层或前几层)以保留通用能力,仅调整高层或新加入的部分。\n", "\n", "#### **(4)评估和验证**\n", "- 使用领域任务的数据集对模型进行评估,验证其在目标任务中的改进效果。\n", "\n", "---\n", "\n", "### **4. 持续预训练的常见方法**\n", "\n", "#### **(1)全量持续预训练**\n", "- 对整个模型的参数进行调整。\n", "- **优点**:适合较大规模的新数据训练,能显著提升领域性能。\n", "- **缺点**:计算资源需求大,可能导致模型过拟合。\n", "\n", "#### **(2)冻结部分参数**\n", "- 冻结低层参数,仅微调高层。\n", "- **优点**:保留通用知识,减少计算开销。\n", "- **缺点**:对领域特定知识的适配可能不足。\n", "\n", "#### **(3)参数高效微调(PEFT)**\n", "- 使用 PEFT 方法(如 LoRA、Adapter)进行预训练:\n", " - **LoRA**:通过低秩矩阵分解,微调部分关键模块。\n", " - **Adapter**:在 Transformer 层中插入小型适配模块。\n", "- **优点**:显著减少需要更新的参数量。\n", "\n", "---\n", "\n", "### **5. 持续预训练的典型应用**\n", "\n", "1. **领域适配**\n", " - **医学**:将预训练模型在 PubMed 或生物医学数据集上进行持续预训练。\n", " - **法律**:使用法律文档进一步训练基础模型。\n", " - **金融**:通过金融新闻、报告语料提升模型在金融领域的表现。\n", "\n", "2. **多语言扩展**\n", " - 引入多语言语料,扩展模型的多语言能力。\n", "\n", "3. **数据更新**\n", " - 持续加入新数据(如时事新闻)以适配最新语言模式。\n", "\n", "4. **特殊任务优化**\n", " - 针对特定任务(如代码生成、对话)引入专用数据进行训练。\n", "\n", "---\n", "\n", "### **6. 实现持续预训练的代码示例**\n", "\n", "以下示例基于 Hugging Face 实现 GPT-2 的持续预训练:\n", "\n", "```python\n", "from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments\n", "from datasets import load_dataset\n", "\n", "# 1. 加载预训练模型和分词器\n", "model_name = \"gpt2\"\n", "tokenizer = AutoTokenizer.from_pretrained(model_name)\n", "model = AutoModelForCausalLM.from_pretrained(model_name)\n", "\n", "# 2. 加载新语料数据\n", "dataset = load_dataset(\"text\", data_files={\"train\": \"domain_corpus.txt\"})\n", "\n", "# 3. 数据预处理\n", "def tokenize_function(examples):\n", " return tokenizer(examples[\"text\"], truncation=True, max_length=1024, padding=\"max_length\")\n", "\n", "tokenized_dataset = dataset.map(tokenize_function, batched=True)\n", "\n", "# 4. 设置训练参数\n", "training_args = TrainingArguments(\n", " output_dir=\"./gpt2_domain_adapted\",\n", " overwrite_output_dir=True,\n", " per_device_train_batch_size=4,\n", " num_train_epochs=3,\n", " learning_rate=5e-5,\n", " save_steps=500,\n", " save_total_limit=2,\n", " logging_dir=\"./logs\",\n", " evaluation_strategy=\"no\", # 评估策略可以根据需要调整\n", " fp16=True, # 混合精度训练\n", ")\n", "\n", "# 5. 定义 Trainer 并启动训练\n", "trainer = Trainer(\n", " model=model,\n", " args=training_args,\n", " train_dataset=tokenized_dataset[\"train\"],\n", " tokenizer=tokenizer,\n", ")\n", "\n", "trainer.train()\n", "\n", "# 6. 保存模型\n", "model.save_pretrained(\"./gpt2_domain_adapted\")\n", "tokenizer.save_pretrained(\"./gpt2_domain_adapted\")\n", "```\n", "\n", "---\n", "\n", "### **7. 持续预训练的挑战**\n", "\n", "1. **灾难性遗忘**:\n", " - 持续预训练可能导致模型丧失之前学到的知识。\n", " - **解决方法**:使用少量原始数据进行联合训练。\n", "\n", "2. **计算资源需求**:\n", " - 需要大量显存和算力,特别是对于大规模模型和数据。\n", "\n", "3. **数据质量和多样性**:\n", " - 新引入的数据可能包含噪声,影响模型性能。\n", "\n", "---\n", "\n", "### **8. 持续预训练的优势**\n", "\n", "- 提高特定领域或任务的性能。\n", "- 更高效地利用已有模型权重,避免从头训练。\n", "- 保留原始模型的通用能力,同时增强领域适应性。\n", "\n", "---\n", "\n", "### **总结**\n", "\n", "持续预训练是适配领域任务和提升模型性能的重要方法,通过引入新数据或优化模型训练策略,可以让大模型在特定场景中表现更优。配合参数高效微调方法(如 LoRA),还可显著降低计算开销,提升训练效率。这种技术在学术研究、工业应用和前沿领域(如法律、医学等)中均具有广泛价值。" ] }, { "cell_type": "code", "execution_count": null, "id": "ca41ad33-18fb-44da-8f79-0380b5c9dcaa", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "3038550c-cc92-45c9-8bb4-46c58688bfc5", "metadata": {}, "source": [ "## 本节任务\n", "本节任务是基于llama。训练一个能够处理dna和protein蛋白质数据的基础预训练大模型,数据为第一章中的预训练数据,包括英文数据。" ] }, { "cell_type": "markdown", "id": "aec90d65-ac62-4394-a526-ca62d8bdbad4", "metadata": {}, "source": [ "## 环境设置\n", "并行环境对transformer、peft等的版本要求比较高,如果版本不匹配可能会出现各种异常问题\n", "之前的课程,都是单GPU运行,一般不存在版本问题,默认安装的都是最新版本。但运行并行环境时,需要确认下版本再运行,本课程运行并行环境如下:\n", "\n", "* Python 3.12.3\n", "* transformers 4.45.2\n", "* peft 0.3.0.dev0\n", "* deepspeed 0.15.2\n", "* accelerate 1.0.0\n", "\n", "如果不是,可以重新安装即可:\n", "```\n", "pip install transformers==4.45.2 deepspeed==0.15.2 accelerate==1.0.0\n", "\n", "#peft参考使用的是chinese llama的版本,需要git安装\n", "\n", "git clone https://github.com/huggingface/peft.git\n", "\n", "cd peft\n", "\n", "git checkout 13e53fc\n", "\n", "pip install . \n", "```\n", "如果有环境问题,可以查看本目录下的pip_list.txt" ] }, { "cell_type": "markdown", "id": "b1bd33b8-2e05-4b59-9d8f-c48de194cfd6", "metadata": {}, "source": [ "## 代码运行\n", "\n", "```\n", "# 复制第一章训练数据,包括dna,protein,还有英文数据,添加英文数据是为了避免遗忘问题\n", "\n", "mkdir train_data\n", "cp ../01-data_env/data/*.txt train_data/\n", "使用这些数据,6卡4090大概大致需要训练16个小时,autodl也需要近200块钱了。\n", "\n", "建议学习时,可以使用1/10的数据训练:\n", "awk ‘NR%10==1’ dna_1g.txt > dna.txt\n", "rm dna_1g.txt\n", "其他2类数据依次类推\n", "\n", "这样大概需要2到3个小时就能训练完成了\n", "\n", "\n", "#持续预训练\n", "./run_pt.sh\n", "\n", "#合并模型\n", "./merge_pt_model.sh\n", "\n", "```" ] }, { "cell_type": "markdown", "id": "4960a36c-7529-4db8-b91d-df91245f79d9", "metadata": {}, "source": [ "## 模型验证" ] }, { "cell_type": "code", "execution_count": 1, "id": "69b3e97f-a801-4264-a651-a854bcfba9c6", "metadata": {}, "outputs": [], "source": [ "from transformers import AutoTokenizer, AutoConfig,AutoModel\n", "from transformers import DataCollatorForLanguageModeling\n", "from transformers import Trainer, TrainingArguments\n", "from transformers import AutoConfig, AutoModelForCausalLM,LlamaForCausalLM,LlamaTokenizer\n", "from tokenizers import Tokenizer\n", "from datasets import load_dataset" ] }, { "cell_type": "code", "execution_count": 2, "id": "339435d9-9379-4b30-ae8b-50feee1ba714", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LlamaTokenizer(name_or_path='dnahlm-merge-hf', vocab_size=91643, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '', 'eos_token': '', 'unk_token': '', 'pad_token': ''}, clean_up_tokenization_spaces=False), added_tokens_decoder={\n", "\t0: AddedToken(\"\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "\t1: AddedToken(\"\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "\t2: AddedToken(\"\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", "}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer = LlamaTokenizer.from_pretrained(\"dnahlm-merge-hf\")\n", "tokenizer.pad_token = tokenizer.eos_token\n", "tokenizer" ] }, { "cell_type": "code", "execution_count": 3, "id": "d0f154bb-b1ab-4611-a14c-9b403043fd96", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "342e4ab139b64bb78f0429c2f92c8310", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading checkpoint shards: 0%| | 0/3 [00:00', 'KCG', 'FVGP', 'MVHL', 'KV', 'HLE', 'ADV', 'ASSC', 'RSAV', 'I', 'YL', 'TSEE', 'P', 'FEG', 'VLGL', 'RLK', 'EGI', 'AI', 'TGC', 'W', 'PRW', 'P', 'DEM', 'DER', 'SAV', 'W', 'RVE', 'PY', 'TRH', 'FG', 'RVLY', 'SFGV', ',', '<0x0A>', 'The', '▁primary', '▁use', '▁of', '▁L', 'La', 'MA', '▁is', '▁research', '▁on', '▁large', '▁language', '▁models', ',', '▁including']\n" ] } ], "source": [ "text='''GCTGACTCTGCCAGGATGGAATGAAATTAGGTTGTTTTAATTATAATGTAAAGTCAGTTCTAGTCAGACATAGTCACATAGGCAAGTAAGGGAACCTAAAATTGCTTGGAAT,\n", "KCGFVGPMVHLKVHLEADVASSCRSAVIYLTSEEPFEGVLGLRLKEGIAITGCWPRWPDEMDERSAVWRVEPYTRHFGRVLYSFGV,\n", "The primary use of LLaMA is research on large language models, including'''\n", "print(\"Test text:\\n\",text)\n", "print(f\"Tokenized by DNA-LLaMA tokenizer:{tokenizer.tokenize(text)}\")" ] }, { "cell_type": "code", "execution_count": 6, "id": "ebf869c8-866d-4770-8f64-79d671f88663", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "e497889a1c3c484cb57c4b6fd93b45ab", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading checkpoint shards: 0%| | 0/3 [00:00