{ "cells": [ { "cell_type": "markdown", "id": "c2e5c9f4-4378-4d39-bc4f-fb4b4a2b2481", "metadata": {}, "source": [ "# 4.4 deepspeed分布式训练简介" ] }, { "cell_type": "markdown", "id": "75b8219d-8069-4b18-96c8-d5024ee049f1", "metadata": {}, "source": [ "## 大模型并行训练简介\n", "\n", "大模型的并行训练旨在克服单个 GPU 显存的限制和加速训练过程,通常适用于参数规模较大的模型(如 GPT-3、T5 等)。并行训练主要包括以下几种方法,每种方法适用于不同的场景和模型特性。\n", "\n", "---\n", "\n", "### **1. 数据并行(Data Parallelism)**\n", "\n", "#### **原理**\n", "- 将数据切分成多个小批次,每个 GPU 处理其中一部分。\n", "- 模型副本被复制到每个 GPU。\n", "- 每个 GPU 独立计算梯度,最终通过梯度同步(如 AllReduce 操作)更新参数。\n", "\n", "#### **特点**\n", "- **优点**:\n", " - 实现简单,是最常用的并行方法。\n", " - 对模型大小没有限制。\n", "- **缺点**:\n", " - 模型副本需要完整加载到每个 GPU,占用显存。\n", " - 在超大规模模型中,显存压力较大。\n", "\n", "#### **适用场景**\n", "- 参数规模适中,显存可以容纳整个模型的场景。\n", "\n", "---\n", "\n", "### **2. 模型并行(Model Parallelism)**\n", "\n", "#### **原理**\n", "- 将模型切分成不同的部分,将不同部分分配到不同的 GPU。\n", "- 前向传播和后向传播时,数据在模型的不同部分之间传递。\n", "\n", "#### **特点**\n", "- **优点**:\n", " - 不需要复制整个模型,可以支持超大规模模型。\n", "- **缺点**:\n", " - GPU 之间通信频繁,可能成为性能瓶颈。\n", " - 实现复杂,切分模型需要精心设计。\n", " \n", "#### **适用场景**\n", "- 单个 GPU 无法容纳完整模型参数的场景。\n", "\n", "#### **具体实现**\n", "- 将 Transformer 的不同层分配到不同的 GPU。\n", "- 常用工具:DeepSpeed 的 Pipeline Parallelism、NVIDIA Megatron-LM。\n", "\n", "---\n", "\n", "### **3. 张量并行(Tensor Parallelism)**\n", "\n", "#### **原理**\n", "- 将模型内部的张量(如权重矩阵)切分为多个子张量,并分配到不同 GPU。\n", "- GPU 之间协作完成矩阵计算。\n", "\n", "#### **特点**\n", "- **优点**:\n", " - 减少了每个 GPU 的显存占用,同时保持模型整体完整性。\n", "- **缺点**:\n", " - 实现较复杂,需要优化通信操作。\n", " - 通信开销较高,适合较大批量的训练。\n", "\n", "#### **适用场景**\n", "- 参数非常大的模型(如 GPT-3)。\n", "- 需要极致优化显存的场景。\n", "\n", "#### **具体实现**\n", "- NVIDIA 的 Megatron-LM 和 Hugging Face Transformers 提供了张量并行的支持。\n", "\n", "---\n", "\n", "### **4. 管道并行(Pipeline Parallelism)**\n", "\n", "#### **原理**\n", "- 将模型分为不同的部分(通常是按层划分),每部分分配到不同的 GPU。\n", "- 数据按照流水线的方式流经每个 GPU。\n", "\n", "#### **特点**\n", "- **优点**:\n", " - 减少每个 GPU 的显存压力。\n", " - 通过流水线增加计算效率。\n", "- **缺点**:\n", " - 引入流水线延迟。\n", " - 实现复杂,需管理数据依赖和同步。\n", "\n", "#### **适用场景**\n", "- 模型非常深,层数较多的场景。\n", "\n", "#### **具体实现**\n", "- DeepSpeed 的 Pipeline Parallelism。\n", "\n", "---\n", "\n", "### **5. 混合并行(Hybrid Parallelism)**\n", "\n", "#### **原理**\n", "- 将数据并行、模型并行、张量并行和管道并行组合使用,充分利用多 GPU 资源。\n", "- 不同的并行方法在不同维度协同工作。\n", "\n", "#### **特点**\n", "- **优点**:\n", " - 灵活且适应性强,适合超大规模模型。\n", "- **缺点**:\n", " - 配置复杂,依赖于框架和训练任务。\n", "\n", "#### **适用场景**\n", "- 超大规模模型(如 GPT-3 或参数量 >1T)。\n", "- 多机多卡的大型训练环境。\n", "\n", "#### **具体实现**\n", "- NVIDIA Megatron-LM 和 DeepSpeed 的混合并行支持。\n", "\n", "---\n", "\n", "### **6. 
 "### **6. ZeRO (Zero Redundancy Optimizer)**\n",
 "\n",
 "#### **Principle**\n",
 "- Model parameters, optimizer states and gradients are sharded across GPUs, which drastically reduces the memory used on each GPU.\n",
 "\n",
 "#### **Characteristics**\n",
 "- **Pros**:\n",
 " - Dramatically lowers memory requirements.\n",
 " - Supports extremely large models.\n",
 "- **Cons**:\n",
 " - Puts high demands on inter-GPU communication.\n",
 " - More complex than plain data parallelism.\n",
 "\n",
 "#### **When to use**\n",
 "- Efficient training of very large models.\n",
 "\n",
 "#### **Implementations**\n",
 "- ZeRO Stage 1/2/3 in DeepSpeed.\n",
 "\n",
 "---\n",
 "\n",
 "### **Comparison of the methods**\n",
 "\n",
 "| Strategy | Main advantages | Main drawbacks | Typical use case |\n",
 "|---------------|-------------------------------|-------------------------------|---------------------------|\n",
 "| Data parallelism | Simple, efficient, easy to implement | Full model replica on every GPU | Moderate model size, sufficient memory |\n",
 "| Model parallelism | Supports very large models | High communication cost, complex partitioning | Very large models, limited memory |\n",
 "| Tensor parallelism | Uses memory efficiently | Complex implementation, frequent communication | Models with extremely large parameter counts |\n",
 "| Pipeline parallelism | Lower memory needs, suits deep models | Pipeline latency, complex synchronization | Deep models with many layers |\n",
 "| Hybrid parallelism | Flexibly scales to the largest models | Complex configuration, framework-dependent | Extremely large models (e.g., GPT-3) |\n",
 "| ZeRO | Very large memory savings | High communication cost | Memory-constrained training of very large models |\n",
 "\n",
 "---\n",
 "\n",
 "### **Summary**\n",
 "- **Medium-sized models**: prefer **data parallelism**.\n",
 "- **Model does not fit on one GPU**: use **model parallelism** or **tensor parallelism**.\n",
 "- **Extremely large models**: use **hybrid parallelism** or DeepSpeed's **ZeRO** optimizations.\n",
 "\n",
 "Modern very large models are usually trained with hybrid parallelism. NVIDIA Megatron-LM and Microsoft DeepSpeed both combine several of these strategies, making effective use of the available compute and accelerating training. The right combination always depends on the specific hardware and model." ] }, { "cell_type": "code", "execution_count": null, "id": "06ddaa4d-e04a-41e0-beb5-f04dfaebcd54", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c0d29667-1e75-46df-8f65-cae27609ee3f", "metadata": {}, "source": [
 "## Introduction to DeepSpeed\n",
 "\n",
 "DeepSpeed is an open-source deep-learning optimization library developed by Microsoft. It is designed for large-scale model training and inference: it significantly speeds up training, reduces GPU memory usage and supports distributed computing. Its key characteristics and features are summarized below.\n",
 "\n",
 "---\n",
 "\n",
 "### **1. Key features**\n",
 "\n",
 "#### **(1) Efficient distributed training**\n",
 "DeepSpeed provides advanced distributed-training techniques (such as the ZeRO optimizer) that support models with tens of billions up to trillions of parameters while lowering the memory required on each device.\n",
 "\n",
 "#### **(2) Memory optimization**\n",
 "Through memory sharding (ZeRO), gradient accumulation and mixed-precision training, DeepSpeed can train large models within limited GPU memory.\n",
 "\n",
 "#### **(3) Performance**\n",
 "DeepSpeed optimizes communication and computation, improving the efficiency of multi-GPU distributed training.\n",
 "\n",
 "#### **(4) Flexibility**\n",
 "It integrates seamlessly with PyTorch and is compatible with Hugging Face `transformers` and other mainstream deep-learning libraries.\n",
 "\n",
 "#### **(5) Inference optimization**\n",
 "It supports efficient inference (e.g., quantization and tensor parallelism), making it suitable for production deployment of large models.\n",
 "\n",
 "---\n",
 "\n",
 "### **2. Core techniques**\n",
 "\n",
 "#### **(1) The ZeRO optimizer**\n",
 "ZeRO (Zero Redundancy Optimizer) is one of DeepSpeed's core technologies and comes in three stages:\n",
 "- **Stage 1**: shard the optimizer states (e.g., momentum and variance).\n",
 "- **Stage 2**: shard the optimizer states and the gradients.\n",
 "- **Stage 3**: shard the optimizer states, the gradients and the model parameters, i.e., fully sharded optimization.\n",
 "\n",
 "Each stage further reduces memory usage; Stage 3 can support extremely large models (such as GPT-3). A rough estimate of the savings is sketched at the end of this overview.\n",
 "\n",
 "#### **(2) Mixed-precision training**\n",
 "Computing in FP16 or BF16 (16-bit floating-point formats) significantly reduces memory usage and increases throughput.\n",
 "\n",
 "#### **(3) Data and model parallelism**\n",
 "- Data parallelism: split the data across devices; each device computes gradients for its shard.\n",
 "- Model parallelism: place different parts of the model on different devices.\n",
 "- Tensor parallelism: split individual tensor operations across multiple GPUs.\n",
 "\n",
 "#### **(4) Gradient accumulation**\n",
 "Gradient accumulation enables a larger effective batch size on memory-constrained devices: the effective batch size is the per-GPU micro-batch size times the number of accumulation steps times the number of GPUs (e.g., 4 x 8 x 2 = 64 on two GPUs).\n",
 "\n",
 "#### **(5) Inference optimization**\n",
 "- Memory optimization and acceleration at inference time.\n",
 "- Quantized inference, which reduces model size and runtime cost.\n",
 "\n",
 "---\n",
 "\n",
 "### **3. Typical use cases**\n",
 "\n",
 "#### **(1) Large-scale model training**\n",
 "Training models with billions to trillions of parameters, such as GPT-3, BERT and T5.\n",
 "\n",
 "#### **(2) Distributed training**\n",
 "Single-node multi-GPU and multi-node multi-GPU training, with efficient use of the available GPUs.\n",
 "\n",
 "#### **(3) Fine-tuning with limited memory**\n",
 "Thanks to its memory optimizations, large models can be fine-tuned on GPUs with relatively little memory (e.g., 16 GB).\n",
 "\n",
 "#### **(4) Efficient inference**\n",
 "Production deployment of large language models, with inference acceleration and quantization.\n",
 "\n",
 "---\n",
 "\n",
 "### **4. Strengths and limitations**\n",
 "\n",
 "#### **Strengths**\n",
 "1. Significantly lower memory requirements, suitable for training very large models.\n",
 "2. Multiple distributed modes with good scalability.\n",
 "3. Seamless integration with PyTorch and Hugging Face.\n",
 "4. Inference optimizations that lower deployment cost.\n",
 "\n",
 "#### **Limitations**\n",
 "1. Configuration and tuning can be fairly complex.\n",
 "2. The benefit for small models or small datasets is limited.\n",
 "\n",
 "---\n",
 "\n",
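 "To make the ZeRO memory savings concrete, the snippet below gives a rough per-GPU estimate for each stage. It is a back-of-the-envelope sketch under the ZeRO paper's mixed-precision Adam assumptions (2 bytes/parameter for FP16 weights, 2 bytes/parameter for FP16 gradients, about 12 bytes/parameter for optimizer states), ignoring activations and temporary buffers; the 1.5B-parameter model and 8-GPU setup are illustrative assumptions.\n",
 "\n",
 "```python\n",
 "# Rough per-GPU memory estimate for the ZeRO stages (model/optimizer state only).\n",
 "# Assumes mixed-precision Adam: 2 B fp16 params + 2 B fp16 grads + ~12 B optimizer\n",
 "# states (fp32 master params, momentum, variance) per parameter; activations excluded.\n",
 "def zero_memory_gib(num_params, num_gpus):\n",
 "    p, g, o = 2 * num_params, 2 * num_params, 12 * num_params  # bytes\n",
 "    gib = 1024 ** 3\n",
 "    return {\n",
 "        \"baseline (no ZeRO)\": (p + g + o) / gib,\n",
 "        \"stage 1 (shard optimizer states)\": (p + g + o / num_gpus) / gib,\n",
 "        \"stage 2 (+ shard gradients)\": (p + (g + o) / num_gpus) / gib,\n",
 "        \"stage 3 (+ shard parameters)\": ((p + g + o) / num_gpus) / gib,\n",
 "    }\n",
 "\n",
 "\n",
 "# Example: a GPT-2 XL sized model (~1.5B parameters) on 8 GPUs.\n",
 "for name, mem in zero_memory_gib(1.5e9, 8).items():\n",
 "    print(f\"{name}: {mem:.1f} GiB per GPU\")\n",
 "```\n",
 "\n",
 "---\n",
 "\n",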
 "### **5. Installation and basic usage**\n",
 "\n",
 "#### **Installation**\n",
 "```bash\n",
 "pip install deepspeed\n",
 "```\n",
 "\n",
 "#### **Basic usage**\n",
 "DeepSpeed features are enabled through a configuration file, for example the ZeRO optimizer. Note that DeepSpeed requires `train_batch_size` to equal the per-GPU micro-batch size times `gradient_accumulation_steps` times the number of GPUs; the values below (4 x 8 x 2 = 64) therefore assume two GPUs.\n",
 "```python\n",
 "import json\n",
 "\n",
 "from transformers import GPT2LMHeadModel, Trainer, TrainingArguments\n",
 "\n",
 "# DeepSpeed configuration: fp16 plus ZeRO Stage 2\n",
 "deepspeed_config = {\n",
 "    \"train_batch_size\": 64,             # = 4 (micro-batch) x 8 (grad. accum.) x 2 (GPUs)\n",
 "    \"gradient_accumulation_steps\": 8,\n",
 "    \"fp16\": {\n",
 "        \"enabled\": True\n",
 "    },\n",
 "    \"zero_optimization\": {\n",
 "        \"stage\": 2,\n",
 "        \"overlap_comm\": True\n",
 "    }\n",
 "}\n",
 "\n",
 "# Write the configuration file\n",
 "with open(\"deepspeed_config.json\", \"w\") as f:\n",
 "    json.dump(deepspeed_config, f)\n",
 "\n",
 "# Integrate with the Hugging Face Trainer\n",
 "training_args = TrainingArguments(\n",
 "    output_dir=\"./results\",\n",
 "    per_device_train_batch_size=4,\n",
 "    gradient_accumulation_steps=8,\n",
 "    num_train_epochs=3,\n",
 "    learning_rate=5e-5,\n",
 "    fp16=True,\n",
 "    deepspeed=\"./deepspeed_config.json\"  # path to the DeepSpeed configuration file\n",
 ")\n",
 "\n",
 "trainer = Trainer(\n",
 "    model=GPT2LMHeadModel.from_pretrained(\"gpt2\"),\n",
 "    args=training_args,\n",
 "    train_dataset=train_dataset,   # assumed to be a pre-tokenized dataset prepared elsewhere\n",
 "    eval_dataset=eval_dataset\n",
 ")\n",
 "\n",
 "trainer.train()\n",
 "```\n",
 "\n",
 "---\n",
 "\n",
 "### **6. Summary**\n",
 "\n",
 "DeepSpeed is a powerful tool for large-model training, especially in multi-GPU environments, where its memory optimizations and distributed-training techniques significantly improve training efficiency. It is a good fit for:\n",
 "- Training and fine-tuning very large models.\n",
 "- Distributed training on multi-node, multi-GPU clusters.\n",
 "- Efficient inference deployment.\n",
 "\n",
 "Whenever training or deployment performance needs to be pushed further, DeepSpeed is well worth trying." ] }, { "cell_type": "code", "execution_count": null, "id": "a5372798-ced3-420c-b853-badd3ff05dc1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "cd848439-bac8-46b2-9a0f-59ae7c343954", "metadata": {}, "source": [
 "## Configuring DeepSpeed\n",
 "\n",
 "DeepSpeed supports several parallelization strategies, including **data parallelism**, **model parallelism** and **tensor parallelism**. These modes are configured through its configuration file and, for some of them, a small amount of code.\n",
 "\n",
 "---\n",
 "\n",
 "### **1. Data parallelism**\n",
 "\n",
 "#### **Principle**\n",
 "With data parallelism, DeepSpeed splits each batch across multiple GPUs. Every GPU holds a complete model replica and computes its own gradients; the gradients are then synchronized with an `AllReduce` operation and the parameters are updated.\n",
 "\n",
 "#### **How to configure it**\n",
 "DeepSpeed uses data parallelism by default. Enabling `zero_optimization` combines it with ZeRO sharding:\n",
 "```json\n",
 "{\n",
 "    \"train_batch_size\": 64,\n",
 "    \"gradient_accumulation_steps\": 8,\n",
 "    \"fp16\": {\n",
 "        \"enabled\": true\n",
 "    },\n",
 "    \"zero_optimization\": {\n",
 "        \"stage\": 1\n",
 "    }\n",
 "}\n",
 "```\n",
 "\n",
 "---\n",
 "\n",
 "### **2. Model parallelism**\n",
 "\n",
 "#### **Principle**\n",
 "Model parallelism distributes different parts of the model (such as Transformer layers or weight tensors) across multiple GPUs. DeepSpeed does not implement generic model parallelism itself, but it ships pipeline parallelism and can be combined with model-parallel frameworks such as NVIDIA Megatron-LM.\n",
 "\n",
 "#### **How to configure it**\n",
 "If the model is partitioned by layers:\n",
 "1. Use DeepSpeed pipeline parallelism. The number of pipeline stages is not a plain configuration key; it is set in code by wrapping the model's layers in `deepspeed.pipe.PipelineModule`:\n",
 " ```python\n",
 " from deepspeed.pipe import PipelineModule\n",
 "\n",
 " # layer_list is the model expressed as a list of nn.Module layers;\n",
 " # DeepSpeed splits it into 2 pipeline stages and assigns the stages to GPUs.\n",
 " model = PipelineModule(layers=layer_list, num_stages=2)\n",
 " ```\n",
 "\n",
 "2. Integrate with NVIDIA Megatron-LM. Build the model with Megatron-LM's model-parallel support and then hand its parallel-state module to DeepSpeed (exact import paths vary between Megatron-LM versions):\n",
 " ```python\n",
 " import deepspeed\n",
 " from megatron import mpu   # Megatron's model-parallel utilities\n",
 "\n",
 " model = MyModel(...)        # a Megatron-style model-parallel model\n",
 " model_engine, optimizer, _, _ = deepspeed.initialize(\n",
 "     model=model,\n",
 "     model_parameters=model.parameters(),\n",
 "     mpu=mpu,                # lets DeepSpeed respect Megatron's model-parallel groups\n",
 "     config=\"./deepspeed_config.json\"\n",
 " )\n",
 " ```\n",
 "\n",
 "---\n",
 "\n",
 "### **3. Tensor parallelism**\n",
 "\n",
 "#### **Principle**\n",
 "Tensor parallelism shards individual parameter tensors (such as weight matrices) across GPUs, which then cooperate through communication to complete each computation. ZeRO Stage 3 achieves a similar memory effect by sharding the parameters (the computation itself stays data-parallel), while true tensor parallelism is obtained by integrating Megatron-LM. The single-process sketch below illustrates the core idea of splitting a weight matrix by columns.\n",
 "\n",
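 "The following sketch simulates a column-parallel linear layer in a single process: the weight matrix is split into column shards, each shard computes a partial output, and the concatenated result matches the unsharded matmul. It is an illustration of the idea only; in Megatron-style tensor parallelism each shard lives on its own GPU and the concatenation is an all-gather.\n",
 "\n",
 "```python\n",
 "# Column-parallel linear layer, simulated in a single process (illustration only).\n",
 "import torch\n",
 "\n",
 "torch.manual_seed(0)\n",
 "x = torch.randn(8, 512)      # a batch of activations\n",
 "w = torch.randn(512, 2048)   # the full weight matrix of a linear layer\n",
 "\n",
 "# \"Shard\" the weight by columns, as if each shard lived on its own GPU.\n",
 "w_shards = torch.chunk(w, chunks=2, dim=1)    # two shards of shape (512, 1024)\n",
 "\n",
 "# Each \"GPU\" computes a partial output with its own shard.\n",
 "partial_outputs = [x @ w_i for w_i in w_shards]\n",
 "\n",
 "# In real tensor parallelism this concatenation is an all-gather across GPUs.\n",
 "y_parallel = torch.cat(partial_outputs, dim=1)\n",
 "\n",
 "# The sharded computation matches the unsharded one.\n",
 "assert torch.allclose(y_parallel, x @ w, atol=1e-5)\n",
 "print(\"column-parallel result matches:\", y_parallel.shape)\n",
 "```\n",
 "\n",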
 "#### **How to configure it**\n",
 "1. **Use ZeRO Stage 3**:\n",
 " ZeRO Stage 3 shards the model parameters, gradients and optimizer states, which gives a memory footprint similar to tensor parallelism (and can additionally offload state to CPU memory):\n",
 " ```json\n",
 " {\n",
 "     \"train_batch_size\": 64,\n",
 "     \"gradient_accumulation_steps\": 8,\n",
 "     \"fp16\": {\n",
 "         \"enabled\": true\n",
 "     },\n",
 "     \"zero_optimization\": {\n",
 "         \"stage\": 3,\n",
 "         \"offload_optimizer\": {\n",
 "             \"device\": \"cpu\",\n",
 "             \"pin_memory\": true\n",
 "         },\n",
 "         \"offload_param\": {\n",
 "             \"device\": \"cpu\",\n",
 "             \"pin_memory\": true\n",
 "         }\n",
 "     }\n",
 " }\n",
 " ```\n",
 "\n",
 "2. **Integrate Megatron-LM**:\n",
 " For a true tensor-parallel scheme (e.g., splitting weight matrices across GPUs), build the model with Megatron-LM and combine it with DeepSpeed as shown in the model-parallelism section above.\n",
 "\n",
 "---\n",
 "\n",
 "### **4. Hybrid parallelism**\n",
 "\n",
 "#### **Principle**\n",
 "Hybrid parallelism combines data parallelism, model/pipeline parallelism and tensor parallelism. DeepSpeed integrates these modes and lets them be configured together.\n",
 "\n",
 "#### **How to configure it**\n",
 "Combining data parallelism with pipeline parallelism: enable ZeRO in the configuration file for the data-parallel dimension, and define the pipeline dimension in code with `PipelineModule` (see section 2). Gradient and parameter sharding conflict with pipeline parallelism, so ZeRO Stage 1 is the stage normally used alongside it:\n",
 "```json\n",
 "{\n",
 "    \"train_batch_size\": 64,\n",
 "    \"gradient_accumulation_steps\": 8,\n",
 "    \"fp16\": {\n",
 "        \"enabled\": true\n",
 "    },\n",
 "    \"zero_optimization\": {\n",
 "        \"stage\": 1\n",
 "    }\n",
 "}\n",
 "```\n",
 "\n",
 "Combining with tensor parallelism:\n",
 "1. Configure tensor parallelism in code through Megatron-LM and pass its parallel-state module to DeepSpeed, exactly as in the Megatron-LM example above:\n",
 " ```python\n",
 " import deepspeed\n",
 " from megatron import mpu   # import path varies by Megatron-LM version\n",
 "\n",
 " model = MyModel(...)        # tensor-parallel model built with Megatron-LM\n",
 " model_engine, optimizer, _, _ = deepspeed.initialize(\n",
 "     model=model,\n",
 "     model_parameters=model.parameters(),\n",
 "     mpu=mpu,\n",
 "     config=\"./deepspeed_config.json\"\n",
 " )\n",
 " ```\n",
 "\n",
 "2. Enable ZeRO in the DeepSpeed configuration file (Stage 1 when combining with tensor or pipeline parallelism; Stage 3 when ZeRO itself is meant to provide the parameter sharding).\n",
 "\n",
 "---\n",
 "\n",
 "### **5. Choosing a parallel strategy**\n",
 "\n",
 "| Parallel mode | **How DeepSpeed supports it** | **Typical scenario** |\n",
 "|---------------|------------------------------------------|-----------------------------------------|\n",
 "| Data parallelism | Default, combined with the ZeRO optimizer | Moderate model size, manageable memory pressure |\n",
 "| Model parallelism | Pipeline parallelism or Megatron-LM integration | Models too large for a single GPU |\n",
 "| Tensor parallelism | ZeRO Stage 3 (similar effect) or Megatron-LM integration | Extremely large weight matrices that must be sharded |\n",
 "| Hybrid parallelism | Combination of data, model and tensor parallelism | Training extremely large models (such as GPT-3) |\n",
 "\n",
 "---\n",
 "\n",
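 "Before the Trainer-based example in the next section, it is worth seeing DeepSpeed's own training loop without the Hugging Face wrapper. The sketch below is a minimal illustration of the `deepspeed.initialize` API; the tiny model, random data, config values and the script name are placeholder assumptions, and it would be launched with the DeepSpeed launcher, e.g. `deepspeed --num_gpus=2 train_sketch.py`.\n",
 "\n",
 "```python\n",
 "# Minimal DeepSpeed training loop without the Hugging Face Trainer (illustrative sketch).\n",
 "import torch\n",
 "\n",
 "import deepspeed\n",
 "\n",
 "ds_config = {\n",
 "    \"train_batch_size\": 64,                 # = 8 (micro-batch) x 4 (grad. accum.) x 2 (GPUs)\n",
 "    \"train_micro_batch_size_per_gpu\": 8,\n",
 "    \"gradient_accumulation_steps\": 4,\n",
 "    \"fp16\": {\"enabled\": True},\n",
 "    \"zero_optimization\": {\"stage\": 2},\n",
 "    \"optimizer\": {\"type\": \"AdamW\", \"params\": {\"lr\": 5e-5}},\n",
 "}\n",
 "\n",
 "# A placeholder model standing in for a real network.\n",
 "model = torch.nn.Sequential(\n",
 "    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)\n",
 ")\n",
 "\n",
 "# deepspeed.initialize wraps the model in an engine that owns the optimizer,\n",
 "# gradient accumulation, mixed precision and ZeRO sharding.\n",
 "model_engine, optimizer, _, _ = deepspeed.initialize(\n",
 "    model=model,\n",
 "    model_parameters=model.parameters(),\n",
 "    config=ds_config,\n",
 ")\n",
 "\n",
 "for step in range(100):\n",
 "    x = torch.randn(8, 1024, device=model_engine.device, dtype=torch.half)\n",
 "    loss = model_engine(x).pow(2).mean()   # placeholder loss\n",
 "    model_engine.backward(loss)            # replaces loss.backward()\n",
 "    model_engine.step()                    # replaces optimizer.step(); steps on accumulation boundaries\n",
 "```\n",
 "\n",
 "---\n",
 "\n",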
 "### **6. Example code**\n",
 "\n",
 "The following is a complete example of fine-tuning GPT-2 with ZeRO through the Hugging Face `Trainer`. (The Trainer integration supports ZeRO; pipeline parallelism is not available through it and requires the code-level `PipelineModule` approach shown earlier.)\n",
 "```python\n",
 "import json\n",
 "\n",
 "from datasets import load_dataset\n",
 "from transformers import (\n",
 "    DataCollatorForLanguageModeling,\n",
 "    GPT2LMHeadModel,\n",
 "    GPT2TokenizerFast,\n",
 "    Trainer,\n",
 "    TrainingArguments,\n",
 ")\n",
 "\n",
 "# Load and tokenize the data (GPT-2 has no pad token, so the EOS token is reused).\n",
 "tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")\n",
 "tokenizer.pad_token = tokenizer.eos_token\n",
 "dataset = load_dataset(\"wikitext\", \"wikitext-2-raw-v1\", split=\"train\")\n",
 "dataset = dataset.map(\n",
 "    lambda batch: tokenizer(batch[\"text\"], truncation=True, max_length=512),\n",
 "    batched=True,\n",
 "    remove_columns=[\"text\"],\n",
 ")\n",
 "data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)\n",
 "\n",
 "# Load the model\n",
 "model = GPT2LMHeadModel.from_pretrained(\"gpt2\")\n",
 "\n",
 "# DeepSpeed configuration: ZeRO Stage 2 with fp16.\n",
 "# \"auto\" lets the Hugging Face integration fill in values from TrainingArguments.\n",
 "deepspeed_config = {\n",
 "    \"train_batch_size\": \"auto\",\n",
 "    \"train_micro_batch_size_per_gpu\": \"auto\",\n",
 "    \"gradient_accumulation_steps\": \"auto\",\n",
 "    \"fp16\": {\"enabled\": \"auto\"},\n",
 "    \"zero_optimization\": {\"stage\": 2},\n",
 "}\n",
 "\n",
 "# Write the configuration file\n",
 "with open(\"deepspeed_config.json\", \"w\") as f:\n",
 "    json.dump(deepspeed_config, f)\n",
 "\n",
 "# Training arguments\n",
 "training_args = TrainingArguments(\n",
 "    output_dir=\"./results\",\n",
 "    per_device_train_batch_size=4,\n",
 "    gradient_accumulation_steps=8,\n",
 "    num_train_epochs=3,\n",
 "    fp16=True,\n",
 "    deepspeed=\"./deepspeed_config.json\",  # path to the DeepSpeed configuration file\n",
 ")\n",
 "\n",
 "# Initialize the Trainer\n",
 "trainer = Trainer(\n",
 "    model=model,\n",
 "    args=training_args,\n",
 "    train_dataset=dataset,\n",
 "    data_collator=data_collator,\n",
 ")\n",
 "\n",
 "# Start training\n",
 "trainer.train()\n",
 "```\n",
 "\n",
 "---\n",
 "\n",
 "### **Summary**\n",
 "\n",
 "- **Data parallelism**: supported by default, optimized with ZeRO.\n",
 "- **Model parallelism**: via pipeline parallelism or Megatron-LM integration.\n",
 "- **Tensor parallelism**: via ZeRO Stage 3 (similar memory effect) or Megatron-LM.\n",
 "- **Hybrid parallelism**: a flexible combination of the above for the very largest models.\n",
 "\n",
 "DeepSpeed's configuration is highly flexible: choose the parallel strategy that matches the model size, the memory budget and the available hardware." ] }, { "cell_type": "code", "execution_count": null, "id": "a8e6de4c-adc1-4a1b-840a-c8542b4ed783", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "3383c2d7-91a9-4940-b3b2-698fb7d9dbb7", "metadata": {}, "source": [ "## Training GPT-2 with DeepSpeed" ] }, { "cell_type": "markdown", "id": "ab2812bc-f743-4f18-b49c-972781484dc6", "metadata": {}, "source": [
 "## Training GPT-2\n",
 "\n",
 "```\n",
 "# Train GPT-2 the ordinary way (single process)\n",
 "python pretrain_gpt2.py\n",
 "\n",
 "# Train GPT-2 with DeepSpeed -- the training script needs only one extra line of code.\n",
 "# (torchrun starts one process per GPU; the `deepspeed` launcher can be used instead.)\n",
 "torchrun --nproc_per_node=6 deepspeed_pretrain_gpt2.py\n",
 "```" ] }, { "cell_type": "code", "execution_count": null, "id": "9cb60dc2-4cec-492d-836b-67694829acf2", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 5 }