Commit 77d4d73
Parent(s): fc863b3
Upload 14 files

Files changed:
- README.md +127 -132
- Research License.docx +0 -0
- added_tokens.json +40 -0
- config.json +4 -4
- configuration_mixformer_sequential.py +53 -0
- merges.txt +0 -0
- modeling_mixformer_sequential.py +222 -303
- pytorch_model.bin +2 -2
- special_tokens_map.json +5 -0
- tokenizer.json +0 -0
- tokenizer_config.json +9 -0
- vocab.json +0 -0
README.md
CHANGED
@@ -1,148 +1,143 @@

Removed (previous model card, salesGPT_v2):

---
license: other
tags:
- generated_from_trainer
- sales
model-index:
- name: salesGPT_v2
  results: []
datasets:
- goendalf666/sales-conversations-2
- goendalf666/sales-conversations-instruction-ext
- goendalf666/sales-conversations-instruction-base
- goendalf666/sales-textbook_for_convincing_and_selling
language:
- en
pipeline_tag: text-generation
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You should probably proofread and complete it, then remove this comment. -->

# salesGPT_v2

**Model Card for salesGPT_v2**

### Model Description
salesGPT_v2, derived from microsoft/phi-1_5, is specialized in simulating sales conversations: it elicits customer requirements, manages objections, and suggests suitable products or services. It was fine-tuned on a variety of sales-related datasets and is proficient at initiating conversations, asking pertinent questions, and sustaining interactive dialogues with users.

### Related Resources

GitHub: https://github.com/tom813/salesGPT_foundation
salesGPT_v1: https://huggingface.co/goendalf666/salesGPT_v1

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63797fcb2cb50dda39d8aec6/re7MmsaYNzTYVH2jEXDDu.png)

### Intended Uses & Limitations
**Intended Uses:**
- Simulating sales conversations for training or evaluation purposes.
- Providing guidelines or suggested dialogues for sales representatives.

**Limitations:**
- The model might repetitively ask questions in certain scenarios.
- May struggle with customers who lack specific preferences or knowledge about products.
- Objection handling tends to rely on objective criteria rather than convincing techniques.
- Challenges in providing appropriate suggestions for customers without specific needs.
- Limited effectiveness in handling financial and budgetary conversations or sensitivities.

### Training and Evaluation Data
**Training Data:**
1. **Textbook v1 Dataset**
   - URL: [Dataset](https://huggingface.co/datasets/goendalf666/sales-textbook_for_convincing_and_selling)
   - Content: Textbook content for sales, derived from structural points and detailed subpoints created through API calls.

2. **Sales Conversation Dataset**
   - URL: [Dataset](https://huggingface.co/datasets/goendalf666/sales-conversations)
   - Content: Sales conversations, generated based on the chapters of the textbook.

3. **Sales Conversations Instruction Base Dataset**
   - URL: [Dataset](https://huggingface.co/datasets/goendalf666/sales-conversations-instruction-base)
   - Content: Extended sales conversations with structured dialogues.

4. **Sales Conversations Instruction Extension Dataset**
   - URL: [Dataset](https://huggingface.co/datasets/goendalf666/sales-conversations-instruction-ext)
   - Content: Updates based on real conversations with the model to improve its proficiency in unconvincing cases.

**Evaluation Data:**
- More information is needed regarding how and where the model was evaluated. If it was assessed on a separate test set, providing access and details to that dataset would be crucial.

### Training Procedure
Fine-tuning of salesGPT_v2 was executed in three phases using the LoRA approach with rank 64:
1. Training on a textbook for 20k steps.
2. Training on sales conversations for 40k steps, resulting in salesGPT_v1.
3. Training on sales conversations instruction for 40k steps, evolving into salesGPT_v2.

Hyperparameters used during training:
- Learning rate: 0.0002
- Train batch size: 2
- Eval batch size: 8
- Seed: 42
- Optimizer: Adam
- LR scheduler type: Cosine
- Training steps: 40,000
- [Additional parameters...](https://huggingface.co/goendalf666/salesGPT_v2)

Model size details:
- Trainable params: 18,874,368
- All params: 1,437,145,088
- Trainable%: 1.313323766514519

### Results
While salesGPT_v2 showcases notable capabilities such as initiating a structured dialogue and asking relevant questions, there are some key areas for improvement and observations:
- The model tends to keep asking questions if a user provides vague information.
- In some cases, it is not assertive enough to suggest a product when a customer lacks preferences.
- It tends to leverage objective criteria over convincing strategies during objections.
- Needs better handling of conversations regarding budgetary or financial concerns.

### Example Usage & Conversations
#### Example 1:
**Customer**: I am interested in buying a new laptop.
**salesGPT_v2**: I understand. Could you please tell me more about your specific requirements and budget?

#### Example 2:
**Customer**: Oh I don't know how I should explain this to my wife. I said we can't go to vacation because we don't have any money.
**salesGPT_v2**: I see. Can you please provide more details about your financial situation and the specific expenses you are concerned about?

These examples underline the model's capability to engage in sales dialogues and also hint at its limitations, such as possibly prying too much into a customer's personal financial situation.

### Future Directions & Improvement
- Enhancement in handling objections by focusing more on persuasive techniques and emotional intelligence.
- Improving suggestion-making capability, especially when customers are indecisive.
- Better navigation through conversations that involve budgetary and financial aspects, without seeming insensitive or intrusive.
- Striking a balance between being informative and being too technical in product descriptions.
- Possible implementation of more ethical and privacy-guided conversation guidelines, especially in discussing customers' financial capacities.

### Ethical Considerations
The model's tendency to repeatedly ask for specific information, especially related to personal financial details, raises ethical concerns regarding privacy and data sensitivity. Care must be taken to ensure the model respects user privacy and does not persistently probe for personal or sensitive information.

### Conclusion
salesGPT_v2 offers a foundation for simulating sales conversations, with potential for future refinement in handling objections, making product suggestions, and managing conversations delicately around financial discussions. Future versions might seek to refine its balance between being convincingly persuasive and remaining ethically and emotionally intelligent within dialogues.

### Inference

```
[several lines of this removed snippet are not rendered in this view]
cuda = "cuda:0" if torch.cuda.is_available() else ""
model = AutoModelForCausalLM.from_pretrained("goendalf666/salesGPT_v2", trust_remote_code=True, torch_dtype=torch.float32, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, device_map={"":0})

inputs.to(cuda)
[…]
```
Or
[the alternative snippet is not rendered in this view]
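Several lines of the removed "Inference" snippet above are not rendered in this diff view. A minimal sketch of how such a snippet is typically completed, assuming standard `transformers` generation usage — the prompt text, generation length, and the CPU fallback below are illustrative additions, not the original removed lines:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the fine-tuned model and the base phi-1_5 tokenizer, as in the removed snippet.
model = AutoModelForCausalLM.from_pretrained(
    "goendalf666/salesGPT_v2", trust_remote_code=True, torch_dtype=torch.float32
).to(device)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)

# Illustrative prompt; the prompt used in the removed snippet is not visible here.
prompt = "Customer: I am interested in buying a new laptop.\nsalesGPT_v2:"
inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False).to(device)

outputs = model.generate(**inputs, max_length=256)
print(tokenizer.batch_decode(outputs)[0])
```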
Added (new model card, phi-1.5):

---
license: other
license_name: microsoft-research-license
license_link: https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx
language:
- en
pipeline_tag: text-generation
---
## Model Summary

The language model phi-1.5 is a Transformer with **1.3 billion** parameters. It was trained using the same data sources as [phi-1](https://huggingface.co/microsoft/phi-1), augmented with a new data source that consists of various NLP synthetic texts. When assessed against benchmarks testing common sense, language understanding, and logical reasoning, phi-1.5 demonstrates a nearly state-of-the-art performance among models with less than 10 billion parameters.

We **did not** fine-tune phi-1.5 either for **instruction following or through reinforcement learning from human feedback**. The intention behind crafting this open-source model is to provide the research community with a non-restricted small model to explore vital safety challenges, such as reducing toxicity, understanding societal biases, enhancing controllability, and more.

For a safer model release, we exclude generic web-crawl data sources such as common-crawl from the training. This strategy prevents direct exposure to potentially harmful online content, enhancing the model's safety without RLHF. However, the model is still vulnerable to generating harmful content. We hope the model can help the research community to further study the safety of language models.

phi-1.5 can write poems, draft emails, create stories, summarize texts, write Python code (such as downloading a Hugging Face transformer model), etc.

## Intended Uses
Given the nature of the training data, phi-1.5 is best suited for prompts using the QA format, the chat format, and the code format. Note that phi-1.5, being a base model, often produces irrelevant text following the main answer. In the following example, we've truncated the answer for illustrative purposes only.

#### QA format:

```markdown
Write a detailed analogy between mathematics and a lighthouse.

Answer: Mathematics is like a lighthouse, guiding us through the vast ocean of numbers and calculations. Just as a lighthouse illuminates the darkness, mathematics provides us with a clear path to navigate through complex problems. It helps us make sense of the world around us, just like a lighthouse helps ships find their way home.
```
where the model generates the text after "Answer:".

#### Chat format:

```markdown
Alice: I don't know why, I'm struggling to maintain focus while studying. Any suggestions?

Bob: Have you tried using a timer? It can help you stay on track and avoid distractions.

Alice: That's a good idea. I'll give it a try.

Charlie: Another thing that can help is to break up your study sessions into smaller chunks. It's easier to concentrate on one thing at a time.

Alice: That makes sense. I'll try that too.

Bob: And don't forget to take breaks! It's important to give your brain a rest so you can come back to your studies with a fresh perspective.

Alice: Thanks for the advice, guys. I feel more motivated now.

Charlie: No problem, Alice. We're all in this together.

Bob: Yeah, and remember that it's okay to ask for help if you need it. We're here to support each other.
```
where the model generates the text after the first "Bob:".

#### Code format:
```python
def print_prime(n):
    """
    Print all primes between 1 and n
    """
    primes = []
    for num in range(2, n+1):
        is_prime = True
        for i in range(2, int(math.sqrt(num))+1):
            if num % i == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(num)
    print(primes)
```
where the model generates the text after the comments.

**Notes**
* phi-1.5 is intended for research purposes. The model-generated text/code should be treated as a starting point rather than a definitive solution for potential use cases. Users should be cautious when employing these models in their applications.
* Direct adoption for production tasks is out of the scope of this research project. As a result, phi-1.5 has not been tested to ensure that it performs adequately for any production-level application. Please refer to the limitation sections of this document for more details.

## Limitations of phi-1.5

* Generate Inaccurate Code and Facts: The model often produces incorrect code snippets and statements. Users should treat these outputs as suggestions or starting points, not as definitive or accurate solutions.
* Limited Scope for code: If the model generates Python scripts that utilize uncommon packages or scripts in other languages, we strongly recommend users manually verify all API uses.
* Unreliable Responses to Instruction: The model has not undergone instruction fine-tuning. As a result, it may struggle or fail to adhere to intricate or nuanced instructions provided by users.
* Language Limitations: The model is primarily designed to understand standard English. Informal English, slang, or any other language outside of English might pose challenges to its comprehension, leading to potential misinterpretations or errors in response.
* Potential Societal Biases: Regardless of the safe data used for its training, the model is not entirely free from societal biases. There's a possibility it may generate content that mirrors these societal biases, particularly if prompted or instructed to do so. We urge users to be aware of this and to exercise caution and critical thinking when interpreting model outputs.
* Toxicity: Despite the fact that the model is trained with carefully selected data, the model can still produce harmful content if explicitly prompted or instructed to do so. We chose to release the model for research purposes only -- we hope to help the open-source community develop the most effective ways to reduce the toxicity of a model directly after pretraining.

## Training

### Model
* Architecture: a Transformer-based model with next-word prediction objective
* Dataset size: 30B tokens
* Training tokens: 150B tokens
* Precision: fp16
* GPUs: 32xA100-40G
* Training time: 8 days

### Software
* [PyTorch](https://github.com/pytorch/pytorch)
* [DeepSpeed](https://github.com/microsoft/DeepSpeed)
* [flash-attention](https://github.com/HazyResearch/flash-attention)

### License
The model is licensed under the [Research License](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx).

### Sample Code
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device("cuda")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
inputs = tokenizer('''```python
def print_prime(n):
   """
   Print all primes between 1 and n
   """''', return_tensors="pt", return_attention_mask=False)

outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)
```

If you need to use the model in a lower precision (e.g., FP16), please wrap the model's forward pass with `torch.autocast()`, as follows:
```python
with torch.autocast(model.device.type, dtype=torch.float16, enabled=True):
    outputs = model.generate(**inputs, max_length=200)
```

**Remark.** In the generation function, our model currently does not support beam search (`num_beams` > 1).
Furthermore, in the forward pass of the model, we currently do not support attention mask during training, outputting hidden states or attention values, or using custom input embeddings (instead of the model's).

### Citation

You can find the paper at https://arxiv.org/abs/2309.05463

```bib
@article{textbooks2,
  title={Textbooks Are All You Need II: \textbf{phi-1.5} technical report},
  author={Li, Yuanzhi and Bubeck, S{\'e}bastien and Eldan, Ronen and Del Giorno, Allie and Gunasekar, Suriya and Lee, Yin Tat},
  journal={arXiv preprint arXiv:2309.05463},
  year={2023}
}
```
Research License.docx
ADDED
Binary file (38.9 kB).
added_tokens.json
ADDED
@@ -0,0 +1,40 @@
{
  "\t\t": 50294,
  "\t\t\t": 50293,
  "\t\t\t\t": 50292,
  "\t\t\t\t\t": 50291,
  "\t\t\t\t\t\t": 50290,
  "\t\t\t\t\t\t\t": 50289,
  "\t\t\t\t\t\t\t\t": 50288,
  "\t\t\t\t\t\t\t\t\t": 50287,
  "  ": 50286,
  "   ": 50285,
  "    ": 50284,
  "     ": 50283,
  "      ": 50282,
  "       ": 50281,
  "        ": 50280,
  "         ": 50279,
  "          ": 50278,
  "           ": 50277,
  "            ": 50276,
  "             ": 50275,
  "              ": 50274,
  "               ": 50273,
  "                ": 50272,
  "                 ": 50271,
  "                  ": 50270,
  "                   ": 50269,
  "                    ": 50268,
  "                     ": 50267,
  "                      ": 50266,
  "                       ": 50265,
  "                        ": 50264,
  "                         ": 50263,
  "                          ": 50262,
  "                           ": 50261,
  "                            ": 50260,
  "                             ": 50259,
  "                              ": 50258,
  "                               ": 50257
}
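The 38 entries above map runs of tabs and spaces onto IDs 50257–50294, the whitespace tokens the CodeGen-style tokenizer adds on top of the base GPT-2 vocabulary. A quick check of how they behave once the tokenizer files from this commit are loaded — a sketch, with the repo id assumed from the rest of this commit:

```python
from transformers import AutoTokenizer

# Repo id assumed; adjust if the files live elsewhere.
tok = AutoTokenizer.from_pretrained("goendalf666/salesGPT_v2")

print(tok.convert_tokens_to_ids("\t\t"))        # 50294, per added_tokens.json
print(repr(tok.convert_ids_to_tokens(50257)))   # the longest whitespace run
print(len(tok))                                 # base vocab plus the 38 added tokens
```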
config.json
CHANGED
@@ -1,12 +1,12 @@
 {
-  "_name_or_path": " […]
+  "_name_or_path": "phi-1.5-half",
   "activation_function": "gelu_new",
   "architectures": [
     "MixFormerSequentialForCausalLM"
   ],
   "auto_map": {
-    "AutoConfig": " […]
-    "AutoModelForCausalLM": " […]
+    "AutoConfig": "configuration_mixformer_sequential.MixFormerSequentialConfig",
+    "AutoModelForCausalLM": "modeling_mixformer_sequential.MixFormerSequentialForCausalLM"
   },
   "embd_pdrop": 0.0,
   "initializer_range": 0.02,
@@ -20,7 +20,7 @@
   "resid_pdrop": 0.0,
   "rotary_dim": 32,
   "tie_word_embeddings": false,
-  "torch_dtype": " […]
+  "torch_dtype": "float16",
   "transformers_version": "4.32.1",
   "vocab_size": 51200
 }
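The `auto_map` block now points at the two Python files added in this commit, which is what allows loading with `trust_remote_code=True` to resolve the custom classes instead of a built-in architecture. A small sketch of that resolution (repo id assumed; the printed names follow from the config above):

```python
from transformers import AutoConfig, AutoModelForCausalLM

repo = "goendalf666/salesGPT_v2"  # assumed repo id for this commit

# auto_map tells transformers to import configuration_mixformer_sequential.py
# and modeling_mixformer_sequential.py directly from the repository.
config = AutoConfig.from_pretrained(repo, trust_remote_code=True)
print(type(config).__name__)   # MixFormerSequentialConfig
print(config.torch_dtype)      # float16 after this commit

model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
print(type(model).__name__)    # MixFormerSequentialForCausalLM
```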
configuration_mixformer_sequential.py
ADDED
@@ -0,0 +1,53 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

import math
from typing import Any, Dict, List, Optional, Union

from transformers import PretrainedConfig


class MixFormerSequentialConfig(PretrainedConfig):
    """MixFormer (sequential for DeepSpeed) configuration."""

    model_type = "mixformer-sequential"

    attribute_map = {
        "max_position_embeddings": "n_positions",
        "hidden_size": "n_embd",
        "num_attention_heads": "n_head",
        "num_hidden_layers": "n_layer",
    }

    def __init__(
        self,
        vocab_size: Optional[int] = 50304,
        n_positions: Optional[int] = 2048,
        n_embd: Optional[int] = 1024,
        n_layer: Optional[int] = 20,
        n_inner: Optional[int] = None,
        n_head: Optional[int] = 16,
        rotary_dim: Optional[int] = 32,
        activation_function: Optional[str] = "gelu_new",
        embd_pdrop: Optional[float] = 0.0,
        resid_pdrop: Optional[float] = 0.0,
        layer_norm_epsilon: Optional[float] = 1e-5,
        initializer_range: Optional[float] = 0.02,
        tie_word_embeddings: Optional[bool] = False,
        pad_vocab_size_multiple: Optional[int] = 64,
        **kwargs
    ) -> None:
        self.vocab_size = int(math.ceil(vocab_size / pad_vocab_size_multiple) * pad_vocab_size_multiple)
        self.n_positions = n_positions
        self.n_embd = n_embd
        self.n_layer = n_layer
        self.n_inner = n_inner
        self.n_head = n_head
        self.rotary_dim = min(rotary_dim, n_embd // n_head)
        self.activation_function = activation_function
        self.embd_pdrop = embd_pdrop
        self.resid_pdrop = resid_pdrop
        self.layer_norm_epsilon = layer_norm_epsilon
        self.initializer_range = initializer_range

        super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
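Two details of the constructor above are easy to miss: `vocab_size` is rounded up to the next multiple of `pad_vocab_size_multiple`, and `rotary_dim` is capped at the per-head dimension `n_embd // n_head`. A small sketch with the class defaults, meant to be run next to the file added above (the 50295 value is a hypothetical input, not a value from this repository):

```python
from configuration_mixformer_sequential import MixFormerSequentialConfig

# Defaults: 50304 is already a multiple of 64; rotary_dim = min(32, 1024 // 16) = 32.
cfg = MixFormerSequentialConfig()
print(cfg.vocab_size, cfg.rotary_dim)  # 50304 32

# A vocab size that is not a multiple of 64 is padded up: ceil(50295 / 64) * 64 = 50304.
cfg = MixFormerSequentialConfig(vocab_size=50295)
print(cfg.vocab_size)  # 50304
```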
merges.txt
ADDED
The diff for this file is too large to render.
modeling_mixformer_sequential.py
CHANGED
Old side of the diff (removed lines prefixed with "-", unchanged context without a prefix; content this view does not render is shown as […]):

@@ -34,28 +34,20 @@
 from __future__ import annotations

 import math
 from typing import Any, Dict, Optional, Tuple, Union
 from dataclasses import dataclass, field

 import torch
 import torch.nn as nn

-from einops import rearrange
 from transformers.activations import ACT2FN
 from transformers import PretrainedConfig, PreTrainedModel
 from transformers.modeling_outputs import CausalLMOutputWithPast

 from .configuration_mixformer_sequential import MixFormerSequentialConfig

-
-try:
-    from flash_attn.layers.rotary import RotaryEmbedding as FlashRotaryEmbedding
-    from flash_attn.ops.fused_dense import FusedDense
-except:
-    FlashRotaryEmbedding = None
-    FusedDense = None
-
-
 @dataclass
 class InferenceParams:
     """Inference parameters passed to model to efficiently calculate
@@ -65,20 +57,21 @@ class InferenceParams:
     https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/utils/generation.py.

     Args:
-        […]
         max_batch_size: Maximum batch size.
-        […]
         batch_size_offset: Batch size offset.
         key_value_memory_dict: Key value memory dictionary.
         lengths_per_sample: Lengths per sample.

     """

-    […]
     max_batch_size: int = field(metadata={"help": "Maximum batch size."})

-    […]
     batch_size_offset: int = field(default=0, metadata={"help": "Batch size offset."})

@@ -86,6 +79,8 @@ class InferenceParams:
         default_factory=dict, metadata={"help": "Key value memory dictionary."}
     )

     lengths_per_sample: torch.Tensor = field(default=None, metadata={"help": "Lengths per sample."})

@@ -108,112 +103,12 @@ class Embedding(nn.Module):
         return hidden_states


-def _apply_rotary_emb(
-    x: torch.FloatTensor,
-    cos: torch.FloatTensor,
-    sin: torch.FloatTensor,
-) -> torch.FloatTensor:
-    _, seqlen, _, head_dim = x.shape
-    rotary_seqlen, rotary_dim = cos.shape
-    rotary_dim *= 2
-
-    assert rotary_dim <= head_dim
-    assert seqlen <= rotary_seqlen
-    assert cos.shape == sin.shape == (rotary_seqlen, rotary_dim // 2)
-
-    x_rot = x[:, :, :, :rotary_dim]
-    x_pass = x[:, :, :, rotary_dim:]
-
-    x1, x2 = x_rot.chunk(2, dim=-1)
-    c, s = rearrange(cos[:seqlen], "s d -> s 1 d"), rearrange(sin[:seqlen], "s d -> s 1 d")
-    x1, x2, c, s = [t.to(dtype=torch.float32) for t in [x1, x2, c, s]]
-
-    x_rot = torch.cat([x1 * c - x2 * s, x1 * s + x2 * c], axis=-1).to(x.dtype)
-
-    return torch.cat([x_rot, x_pass], axis=-1)
-
-
-def _apply_rotary_emb_kv(
-    kv: torch.FloatTensor,
-    cos: torch.FloatTensor,
-    sin: torch.FloatTensor,
-    cos_k: Optional[torch.FloatTensor] = None,
-    sin_k: Optional[torch.FloatTensor] = None,
-) -> torch.FloatTensor:
-    _, seqlen, two, _, head_dim = kv.shape
-    assert two == 2
-
-    rotary_seqlen, rotary_dim = cos.shape
-    rotary_dim *= 2
-    assert rotary_dim <= head_dim
-    assert seqlen <= rotary_seqlen
-    assert cos.shape == sin.shape == (rotary_seqlen, rotary_dim // 2)
-
-    k_rot = kv[:, :, 0, :, :rotary_dim]
-    k_pass = kv[:, :, 0, :, rotary_dim:]
-
-    k1, k2 = k_rot.chunk(2, dim=-1)
-    c, s = rearrange(cos[:seqlen], "s d -> s 1 d"), rearrange(sin[:seqlen], "s d -> s 1 d")
-    k1, k2, c, s = [t.to(dtype=torch.float32) for t in [k1, k2, c, s]]
-
-    k_rot = torch.cat([k1 * c - k2 * s, k1 * s + k2 * c], axis=-1).to(kv.dtype)
-
-    return torch.cat(
-        [
-            torch.cat([k_rot, k_pass], axis=-1).unsqueeze(2),
-            kv[:, :, 1:2, :, :],
-        ],
-        axis=2,
-    )
-
-
-def _apply_rotary_emb_qkv(
-    qkv: torch.FloatTensor,
-    cos: torch.FloatTensor,
-    sin: torch.FloatTensor,
-    cos_k: Optional[torch.FloatTensor] = None,
-    sin_k: Optional[torch.FloatTensor] = None,
-) -> torch.FloatTensor:
-    _, seqlen, three, _, head_dim = qkv.shape
-    assert three == 3
-
-    rotary_seqlen, rotary_dim = cos.shape
-    rotary_dim *= 2
-    assert rotary_dim <= head_dim
-    assert seqlen <= rotary_seqlen
-    assert cos.shape == sin.shape == (rotary_seqlen, rotary_dim // 2)
-
-    q_rot = qkv[:, :, 0, :, :rotary_dim]
-    q_pass = qkv[:, :, 0, :, rotary_dim:]
-
-    k_rot = qkv[:, :, 1, :, :rotary_dim]
-    k_pass = qkv[:, :, 1, :, rotary_dim:]
-
-    q1, q2 = q_rot.chunk(2, dim=-1)
-    k1, k2 = k_rot.chunk(2, dim=-1)
-    c, s = rearrange(cos[:seqlen], "s d -> s 1 d"), rearrange(sin[:seqlen], "s d -> s 1 d")
-    q1, q2, k1, k2, c, s = [t.to(dtype=torch.float32) for t in [q1, q2, k1, k2, c, s]]
-
-    q_rot = torch.cat([q1 * c - q2 * s, q1 * s + q2 * c], axis=-1).to(qkv.dtype)
-    k_rot = torch.cat([k1 * c - k2 * s, k1 * s + k2 * c], axis=-1).to(qkv.dtype)
-
-    return torch.cat(
-        [
-            torch.cat([q_rot, q_pass], axis=-1).unsqueeze(2),
-            torch.cat([k_rot, k_pass], axis=-1).unsqueeze(2),
-            qkv[:, :, 2:3, :, :],
-        ],
-        axis=2,
-    )
-
-
 class RotaryEmbedding(nn.Module):
-    """Rotary […]

     Reference:
-        […]
     """

     def __init__(
@@ -221,7 +116,6 @@ class RotaryEmbedding(nn.Module):
         dim: int,
         base: int = 10000,
         scale_base: Optional[float] = None,
-        pos_idx_in_fp32: bool = True,
         device: Optional[str] = None,
         **kwargs,
     ) -> None:
@@ -230,23 +124,21 @@ class RotaryEmbedding(nn.Module):
         if scale_base is not None:
             raise NotImplementedError

         self.dim = dim
-        self.base = […]
         self.scale_base = scale_base
-        self.pos_idx_in_fp32 = pos_idx_in_fp32
         self.device = device

-        […]
-        self.register_buffer("inv_freq", inv_freq, persistent=False)

-        # Generate and save the scale buffer (non-trainable)
         scale = (
             (torch.arange(0, dim, 2, device=device, dtype=torch.float32) + 0.4 * dim) / (1.4 * dim)
             if scale_base is not None
             else None
         )
-        self.register_buffer("scale", scale […]

         self._seq_len_cached = 0
         self._cos_cached = None
@@ -254,73 +146,91 @@ class RotaryEmbedding(nn.Module):
         self._cos_k_cached = None
         self._sin_k_cached = None

-    def […]
-    […]
-            seqlen > self._seq_len_cached
-            or self._cos_cached is None
-            or self._cos_cached.device != device
-            or self._cos_cached.dtype != dtype
-            or (self.training and self._cos_cached.is_inference())
-        ):
-            self._seq_len_cached = seqlen
-            […]
-                t = torch.arange(seqlen, device=device, dtype=torch.float32)
-                if self.inv_freq.dtype != torch.float32:
-                    inv_freq = self._compute_inv_freq(device=device)
-                else:
-                    inv_freq = self.inv_freq
-            else:
-                t = torch.arange(seqlen, device=device, dtype=self.inv_freq.dtype)
-                inv_freq = self.inv_freq

-            # […]
-            freqs = torch. […]
             if self.scale is None:
-                self._cos_cached = torch.cos(freqs).to(dtype)
-                self._sin_cached = torch.sin(freqs).to(dtype)
             else:
                 power = (
                     torch.arange(seqlen, dtype=self.scale.dtype, device=self.scale.device) - seqlen // 2
                 ) / self.scale_base
                 scale = self.scale.to(device=power.device) ** rearrange(power, "s -> s 1")

-                # […]
-                self._cos_cached = (torch.cos(freqs) * scale).to(dtype)
-                self._sin_cached = (torch.sin(freqs) * scale).to(dtype)
-                self._cos_k_cached = (torch.cos(freqs) / scale).to(dtype)
-                self._sin_k_cached = (torch.sin(freqs) / scale).to(dtype)

-    def […]
         self,
-        qkv: torch. […]
-        […]
-        else […]
-        […]


 class MLP(nn.Module):
@@ -380,22 +290,21 @@ class SelfAttention(nn.Module):
         attention_mask: Optional[torch.BoolTensor] = None,
         **kwargs,
     ) -> torch.FloatTensor:
-        […]
         q, k, v = qkv.unbind(dim=2)

-        causal = self.causal if causal is None else causal
         softmax_scale = self.softmax_scale or 1.0 / math.sqrt(q.shape[-1])
-        […]
         scores = torch.einsum("bthd,bshd->bhts", q, k * softmax_scale)

         if attention_mask is not None:
-            padding_mask = torch.full((batch_size, […]
             padding_mask.masked_fill_(attention_mask, 0.0)

             scores = scores + rearrange(padding_mask, "b s -> b 1 1 s")

         if causal:
-            causal_mask = torch.triu(torch.full(( […]
             scores = scores + causal_mask.to(dtype=scores.dtype)

         attention = torch.softmax(scores, dim=-1, dtype=v.dtype)
@@ -434,31 +343,25 @@ class CrossAttention(nn.Module):
         attention_mask: Optional[torch.BoolTensor] = None,
         **kwargs,
     ) -> torch.FloatTensor:
-        […]
-        assert kv.shape[0] == batch_size and kv.shape[4] == q.shape[3]

-        […]
-        kv = repeat(kv, "... hkv d -> ... (hkv g) d", g=q.shape[2] // kv.shape[3])
         k, v = kv.unbind(dim=2)

-        causal = self.causal if causal is None else causal
         softmax_scale = self.softmax_scale or 1.0 / math.sqrt(q.shape[-1])
-        […]
         scores = torch.einsum("bthd,bshd->bhts", q, k * softmax_scale)

         if attention_mask is not None:
-            padding_mask = torch.full((batch_size, […]
             padding_mask.masked_fill_(attention_mask, 0.0)

             scores = scores + rearrange(padding_mask, "b s -> b 1 1 s")

         if causal:
-            […]
-            causal_mask = cols > rows + seqlen_k - seqlen_q
-            […]
-            scores = scores.masked_fill(causal_mask, -10000.0)

         attention = torch.softmax(scores, dim=-1, dtype=v.dtype)
         attention = self.drop(attention)
@@ -468,12 +371,21 @@ class CrossAttention(nn.Module):
         return output


-def […]
-    config: PretrainedConfig,
-    n_head: Optional[int] = None,
-    n_head_kv: Optional[int] = None,
-    head_dim: Optional[int] = None,
 ) -> Tuple[int, int]:
     assert all(
         hasattr(config, attr) for attr in ["n_embd", "n_head"]
     ), "`config` must have `n_embd` and `n_head` attributes."
@@ -489,20 +401,31 @@ def _find_mha_dims(
     elif n_head is None or head_dim is None:
         raise ValueError("`n_head` and `head_dim` must be both specified or `None`.")

-    […]
-    n_head_kv = getattr(config, "n_head_kv", None) or n_head
-    assert n_head % n_head_kv == 0, "`n_head` must be divisible by `n_head_kv`."

-    return n_head, n_head_kv, head_dim


-def _update_kv_cache(kv: torch.FloatTensor, inference_params: InferenceParams, layer_idx: int) -> torch.FloatTensor:
     num_heads, head_dim = kv.shape[-2:]

     if layer_idx not in inference_params.key_value_memory_dict:
         kv_cache = torch.empty(
             inference_params.max_batch_size,
-            inference_params. […]
             2,
             num_heads,
             head_dim,
@@ -511,19 +434,43 @@ def _update_kv_cache(kv: torch.FloatTensor, inference_params: InferenceParams, layer_idx: int) -> torch.FloatTensor:
         )
         inference_params.key_value_memory_dict[layer_idx] = kv_cache
     else:
-        […]

     batch_start = inference_params.batch_size_offset
     batch_end = batch_start + kv.shape[0]
-    assert batch_end <= kv_cache.shape[0]

-    sequence_start = inference_params. […]
     sequence_end = sequence_start + kv.shape[1]
-    assert sequence_end <= kv_cache.shape[1]

-    […]

     return kv

@@ -539,11 +486,11 @@ class MHA(nn.Module):
         rotary_dim: Optional[int] = None,
         rotary_emb_scale_base: Optional[float] = None,
         n_head: Optional[int] = None,
-        n_head_kv: Optional[int] = None,
         head_dim: Optional[int] = None,
         bias: bool = True,
         causal: bool = True,
         softmax_scale: Optional[float] = None,
         layer_idx: Optional[int] = None,
         return_residual: bool = False,
         checkpointing: bool = False,
@@ -556,101 +503,58 @@ class MHA(nn.Module):
             rotary_kwargs = {"device": device}
             if rotary_emb_scale_base is not None and rotary_emb_scale_base > 0.0:
                 rotary_kwargs["scale_base"] = rotary_emb_scale_base
-
-            rotary_cls = FlashRotaryEmbedding if config.flash_rotary else RotaryEmbedding
-            if rotary_cls is None:
-                rotary_cls = RotaryEmbedding
-            self.rotary_emb = rotary_cls(self.rotary_emb_dim, **rotary_kwargs)

         # MLP
-        self.n_head, self. […]
-        op_size = self. […]
         hidden_size = config.n_embd

-        […]
-        linear_cls = nn.Linear
-        […]
-        self.Wqkv = linear_cls(hidden_size, op_size, bias=bias, device=device, dtype=dtype)
-        self.out_proj = linear_cls(hidden_size, hidden_size, bias=bias, device=device, dtype=dtype)

         # Attention
-        self.inner_attn = SelfAttention(causal=causal, softmax_scale=softmax_scale, attention_dropout= […]
-        self.inner_cross_attn = CrossAttention(causal=causal, softmax_scale=softmax_scale, attention_dropout= […]

         self.layer_idx = layer_idx
         self.return_residual = return_residual
         self.checkpointing = checkpointing

-    def […]
-        self, x: torch.FloatTensor, attention_mask: Optional[torch.BoolTensor]
-    ) -> torch.FloatTensor:
-        qkv = self.Wqkv(x)
-        qkv = rearrange(qkv, "... (three h d) -> ... three h d", three=3, d=self.head_dim)
-
-        if self.rotary_emb_dim > 0:
-            qkv = self.rotary_emb(qkv)
-
-        if self.checkpointing:
-            return torch.utils.checkpoint.checkpoint(self.inner_attn, qkv, attention_mask=attention_mask)
-
-        return self.inner_attn(qkv, attention_mask=attention_mask)
-
-    def _forward_cross_attn(
         self,
         x: torch.FloatTensor,
-        past_key_values: Optional[InferenceParams],
-        attention_mask: Optional[torch.BoolTensor],
-        […]
         qkv = self.Wqkv(x)

-        […]
-        q = rearrange(q, "... (h d) -> ... h d", d=self.head_dim)
-        […]
-        kv = qkv[..., self.n_head * self.head_dim :]
-        kv = rearrange(kv, "... (two hkv d) -> ... two hkv d", two=2, d=self.head_dim)
-
-        seqlen_offset = past_key_values.seqlen_offset if past_key_values is not None else 0
-        causal = None if seqlen_offset == 0 else False
         if self.rotary_emb_dim > 0:
-            […]

         if past_key_values is not None:
-            kv = […]

-        if […]
-            )

-        […]

-        past_key_values: Optional[InferenceParams] = None,
-        attention_mask: Optional[Union[torch.LongTensor, torch.BoolTensor]] = None,
-        **kwargs,
-    ) -> Tuple[torch.FloatTensor, torch.FloatTensor]:
-        if attention_mask is not None and torch.any(~attention_mask.bool()):
-            attention_mask = attention_mask.bool()
-        else:
-            attention_mask = None
-
-        # MHA
-        if self.n_head == self.n_head_kv:
-            if past_key_values is None:
-                # If `past_key_values` are not supplied, we run self-attention
-                attn_output = self._forward_self_attn(x, attention_mask)
             else:
-                # […] could take advantage of cross-attention
-                attn_output = self._forward_cross_attn(x, past_key_values, attention_mask)
-        # MQA / GQA
         else:
-            […]
-            attn_output = self. […]

         output = rearrange(attn_output, "... h d -> ... (h d)")
         output = self.out_proj(output)
@@ -768,29 +672,38 @@ class MixFormerSequentialPreTrainedModel(PreTrainedModel):
             if module.padding_idx is not None:
                 module.weight.data[module.padding_idx].zero_()
         elif isinstance(module, nn.LayerNorm):
-            […]
-            module.bias.data.zero_()
             module.weight.data.fill_(1.0)

     def prepare_inputs_for_generation(
         self,
         input_ids: torch.LongTensor,
         past_key_values: Optional[Union[torch.FloatTensor, InferenceParams]] = None,
-        attention_mask: Optional[ […]
         **kwargs,
     ) -> Dict[str, Any]:
         if past_key_values is None or not (isinstance(past_key_values, InferenceParams)):
             past_key_values = InferenceParams(
-                max_seqlen=self.config.n_positions,
                 max_batch_size=input_ids.shape[0],
-                […]
                 batch_size_offset=0,
                 key_value_memory_dict={},
-                lengths_per_sample=None,
             )
         else:
             # Assume that `past_key_values` has cached all tokens up to the last token in `input_ids`
-            past_key_values. […]
             input_ids = input_ids[:, -1].unsqueeze(-1)

         return {
@@ -799,9 +712,9 @@ class MixFormerSequentialPreTrainedModel(PreTrainedModel):
             "attention_mask": attention_mask,
         }

-    def _set_gradient_checkpointing(self, module […]
-        […]


 class MixFormerSequentialForCausalLM(MixFormerSequentialPreTrainedModel):
@@ -843,13 +756,19 @@ class MixFormerSequentialForCausalLM(MixFormerSequentialPreTrainedModel):
         labels: Optional[torch.LongTensor] = None,
         **kwargs,
     ) -> CausalLMOutputWithPast:
-        […]

         loss = None
         if labels is not None:
             loss = self.loss(lm_logits, labels)

-        return CausalLMOutputWithPast(loss=loss, logits=lm_logits, past_key_values=past_key_values)
New side of the diff (added lines prefixed with "+", unchanged context without a prefix; gaps between hunks marked "..."):

 from __future__ import annotations

 import math
+import copy
 from typing import Any, Dict, Optional, Tuple, Union
 from dataclasses import dataclass, field

 import torch
 import torch.nn as nn

+from einops import rearrange
 from transformers.activations import ACT2FN
 from transformers import PretrainedConfig, PreTrainedModel
 from transformers.modeling_outputs import CausalLMOutputWithPast

 from .configuration_mixformer_sequential import MixFormerSequentialConfig

 @dataclass
 class InferenceParams:
     """Inference parameters passed to model to efficiently calculate
...
     https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/utils/generation.py.

     Args:
+        max_sequence_len: Maximum sequence length.
         max_batch_size: Maximum batch size.
+        sequence_len_offset: Sequence length offset.
         batch_size_offset: Batch size offset.
         key_value_memory_dict: Key value memory dictionary.
+        fused_ft_kernel: Whether to use fused kernel for fast inference.
         lengths_per_sample: Lengths per sample.

     """

+    max_sequence_len: int = field(metadata={"help": "Maximum sequence length."})

     max_batch_size: int = field(metadata={"help": "Maximum batch size."})

+    sequence_len_offset: int = field(default=0, metadata={"help": "Sequence length offset."})

     batch_size_offset: int = field(default=0, metadata={"help": "Batch size offset."})

...
         default_factory=dict, metadata={"help": "Key value memory dictionary."}
     )

+    fused_ft_kernel: bool = field(default=False, metadata={"help": "Whether to use fused kernel for fast inference."})
+
     lengths_per_sample: torch.Tensor = field(default=None, metadata={"help": "Lengths per sample."})

...
         return hidden_states


 class RotaryEmbedding(nn.Module):
+    """Rotary embeddings.

     Reference:
+        https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/layers/rotary.py.
+
     """

     def __init__(
...
         dim: int,
         base: int = 10000,
         scale_base: Optional[float] = None,
         device: Optional[str] = None,
         **kwargs,
     ) -> None:
...
         if scale_base is not None:
             raise NotImplementedError

+        # Generate and save the inverse frequency buffer (non-trainable)
         self.dim = dim
+        self.base = base
         self.scale_base = scale_base
         self.device = device

+        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
+        self.register_buffer("inv_freq", inv_freq)

         scale = (
             (torch.arange(0, dim, 2, device=device, dtype=torch.float32) + 0.4 * dim) / (1.4 * dim)
             if scale_base is not None
             else None
         )
+        self.register_buffer("scale", scale)

         self._seq_len_cached = 0
         self._cos_cached = None
...
         self._cos_k_cached = None
         self._sin_k_cached = None

+    def _update_cos_sin_cache(self, x: torch.FloatTensor, seqlen_offset: int = 0) -> None:
+        # Reset the tables if the sequence length has changed,
+        # or if we're on a new device (possibly due to tracing for instance)
+        seqlen = x.shape[1] + seqlen_offset

+        # Re-generate the inverse frequency buffer if it's not fp32
+        # (for instance if model.half() was called)
+        if self.inv_freq.dtype != "torch.float32":
+            self.inv_freq = 1.0 / (
+                self.base ** (torch.arange(0, self.dim, 2, device=self.device, dtype=torch.float32) / self.dim)
+            )

+        if seqlen > self._seq_len_cached or self._cos_cached.device != x.device or self._cos_cached.dtype != x.dtype:
+            self._seq_len_cached = seqlen
+            t = torch.arange(seqlen, device=x.device, dtype=torch.float32)

+            # Don't do einsum, it converts fp32 to fp16
+            # freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+            freqs = torch.outer(t, self.inv_freq.to(device=t.device, dtype=torch.float32))
             if self.scale is None:
+                self._cos_cached = torch.cos(freqs).to(x.dtype)
+                self._sin_cached = torch.sin(freqs).to(x.dtype)
             else:
                 power = (
                     torch.arange(seqlen, dtype=self.scale.dtype, device=self.scale.device) - seqlen // 2
                 ) / self.scale_base
                 scale = self.scale.to(device=power.device) ** rearrange(power, "s -> s 1")

+                # We want the multiplication by scale to happen in fp32
+                self._cos_cached = (torch.cos(freqs) * scale).to(x.dtype)
+                self._sin_cached = (torch.sin(freqs) * scale).to(x.dtype)
+                self._cos_k_cached = (torch.cos(freqs) / scale).to(x.dtype)
+                self._sin_k_cached = (torch.sin(freqs) / scale).to(x.dtype)

+    def _apply_rotary_emb_qkv(
         self,
+        qkv: torch.FloatTensor,
+        sin: torch.FloatTensor,
+        cos: torch.FloatTensor,
+        sin_k: Optional[torch.FloatTensor] = None,
+        cos_k: Optional[torch.FloatTensor] = None,
+    ) -> torch.FloatTensor:
+        _, seqlen, three, _, headdim = qkv.shape
+        assert three == 3
+
+        rotary_seqlen, rotary_dim = cos.shape
+        rotary_dim *= 2
+        assert rotary_dim <= headdim
+        assert seqlen <= rotary_seqlen
+
+        cos_k = cos if cos_k is None else cos_k
+        sin_k = sin if sin_k is None else sin_k
+        assert sin.shape == cos_k.shape == sin_k.shape == (rotary_seqlen, rotary_dim // 2)
+
+        q_rot = qkv[:, :, 0, :, :rotary_dim]
+        q_pass = qkv[:, :, 0, :, rotary_dim:]
+
+        k_rot = qkv[:, :, 1, :, :rotary_dim]
+        k_pass = qkv[:, :, 1, :, rotary_dim:]
+
+        # Splits the queries and keys in half
+        q1, q2 = q_rot.chunk(2, dim=-1)
+        k1, k2 = k_rot.chunk(2, dim=-1)
+        c, s = rearrange(cos[:seqlen], "s d -> s 1 d"), rearrange(sin[:seqlen], "s d -> s 1 d")
+
+        # Casts to fp32 are necessary to prevent fp16 overflow issues
+        q1, q2, k1, k2, c, s = [t.to(dtype=torch.float32) for t in [q1, q2, k1, k2, c, s]]
+
+        # Computes the new keys and queries, recasting to original dtype
+        q_rot = torch.cat([q1 * c - q2 * s, q1 * s + q2 * c], axis=-1).to(qkv.dtype)
+        k_rot = torch.cat([k1 * c - k2 * s, k1 * s + k2 * c], axis=-1).to(qkv.dtype)
+
+        return torch.cat(
+            [
+                torch.cat([q_rot, q_pass], axis=-1).unsqueeze(2),
+                torch.cat([k_rot, k_pass], axis=-1).unsqueeze(2),
+                qkv[:, :, 2:3, :, :],
+            ],
+            axis=2,
+        )

+    def forward(self, qkv: torch.Tensor, seqlen_offset: int = 0) -> Tuple[torch.Tensor, torch.Tensor]:
+        # `qkv` is of shape (batch, seqlen, 3, nheads, headdim)
+        self._update_cos_sin_cache(qkv, seqlen_offset)
+        return self._apply_rotary_emb_qkv(qkv, self._sin_cached[seqlen_offset:], self._cos_cached[seqlen_offset:])


 class MLP(nn.Module):
...
         attention_mask: Optional[torch.BoolTensor] = None,
         **kwargs,
     ) -> torch.FloatTensor:
+        causal = self.causal if causal is None else causal
+        batch_size, seq_len = qkv.shape[0], qkv.shape[1]
         q, k, v = qkv.unbind(dim=2)

         softmax_scale = self.softmax_scale or 1.0 / math.sqrt(q.shape[-1])
         scores = torch.einsum("bthd,bshd->bhts", q, k * softmax_scale)

         if attention_mask is not None:
+            padding_mask = torch.full((batch_size, seq_len), -10000.0, dtype=scores.dtype, device=scores.device)
             padding_mask.masked_fill_(attention_mask, 0.0)

             scores = scores + rearrange(padding_mask, "b s -> b 1 1 s")

         if causal:
+            causal_mask = torch.triu(torch.full((seq_len, seq_len), -10000.0, device=scores.device), 1)
             scores = scores + causal_mask.to(dtype=scores.dtype)

         attention = torch.softmax(scores, dim=-1, dtype=v.dtype)
...
         attention_mask: Optional[torch.BoolTensor] = None,
         **kwargs,
     ) -> torch.FloatTensor:
+        causal = self.causal if causal is None else causal
+        batch_size, seq_len_q = q.shape[0], q.shape[1]
+        assert kv.shape[0] == batch_size and kv.shape[3] == q.shape[2] and kv.shape[4] == q.shape[3]

+        seq_len_k = kv.shape[1]
         k, v = kv.unbind(dim=2)

         softmax_scale = self.softmax_scale or 1.0 / math.sqrt(q.shape[-1])
         scores = torch.einsum("bthd,bshd->bhts", q, k * softmax_scale)

         if attention_mask is not None:
+            padding_mask = torch.full((batch_size, seq_len_k), -10000.0, dtype=scores.dtype, device=scores.device)
             padding_mask.masked_fill_(attention_mask, 0.0)

             scores = scores + rearrange(padding_mask, "b s -> b 1 1 s")

         if causal:
+            causal_mask = torch.triu(torch.full((seq_len_q, seq_len_k), -10000.0, device=scores.device), 1)
+            scores = scores + causal_mask.to(dtype=scores.dtype)

         attention = torch.softmax(scores, dim=-1, dtype=v.dtype)
         attention = self.drop(attention)
...
         return output


+def find_mha_dims(
+    config: PretrainedConfig, n_head: Optional[int] = None, head_dim: Optional[int] = None
 ) -> Tuple[int, int]:
+    """Validate and return the number of heads and head dimension for multi-head attention.
+
+    Args:
+        config: Model configuration.
+        n_head: Number of heads.
+        head_dim: Head dimension.
+
+    Returns:
+        Number of heads and head dimension.
+
+    """
+
     assert all(
         hasattr(config, attr) for attr in ["n_embd", "n_head"]
     ), "`config` must have `n_embd` and `n_head` attributes."
...
     elif n_head is None or head_dim is None:
         raise ValueError("`n_head` and `head_dim` must be both specified or `None`.")

+    return n_head, head_dim


+def update_kv_cache(kv: torch.FloatTensor, inference_params: InferenceParams, layer_idx: int) -> torch.FloatTensor:
+    """Update the key-value cache for inference.
+
+    Reference:
+        https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/modules/mha.py.
+
+    Args:
+        kv: Key-value tensor.
+        inference_params: Inference parameters.
+        layer_idx: Layer index.
+
+    Returns:
+        Updated key-value tensor.
+
+    """

     num_heads, head_dim = kv.shape[-2:]

     if layer_idx not in inference_params.key_value_memory_dict:
         kv_cache = torch.empty(
             inference_params.max_batch_size,
+            inference_params.max_sequence_len,
             2,
             num_heads,
             head_dim,
...
         )
         inference_params.key_value_memory_dict[layer_idx] = kv_cache
     else:
+        if not inference_params.fused_ft_kernel:
+            kv_cache = inference_params.key_value_memory_dict[layer_idx]
+        else:
+            k_cache, v_cache = inference_params.key_value_memory_dict[layer_idx]
+            kv_cache = None

     batch_start = inference_params.batch_size_offset
     batch_end = batch_start + kv.shape[0]
+    assert batch_end <= (kv_cache.shape[0] if kv_cache is not None else v_cache.shape[0])

+    sequence_start = inference_params.sequence_len_offset
     sequence_end = sequence_start + kv.shape[1]
+    assert sequence_end <= (kv_cache.shape[1] if kv_cache is not None else v_cache.shape[2])

+    if not inference_params.fused_ft_kernel:
+        assert kv_cache is not None
+
+        kv_cache[batch_start:batch_end, sequence_start:sequence_end, ...] = kv
+        kv = kv_cache[batch_start:batch_end, :sequence_end, ...]
+
+        return kv
+
+    assert inference_params.sequence_len_offset == 0
+    assert kv.dtype in [torch.float16, torch.bfloat16, torch.float32]
+
+    packsize = 4 if kv.dtype == torch.float32 else 8
+
+    if kv_cache is not None:
+        kv_cache[batch_start:batch_end, sequence_start:sequence_end, ...] = kv
+        k_cache = rearrange(kv_cache[:, :, 0], "b s h (d packsize) -> b h d s packsize", packsize=packsize).contiguous()
+        v_cache = rearrange(kv_cache[:, :, 1], "b s h d -> b h s d").contiguous()
+        inference_params.key_value_memory_dict[layer_idx] = (k_cache, v_cache)
+    else:
+        k_cache[batch_start:batch_end, :, :, :sequence_end, :] = rearrange(
+            kv[:, :, 0], "b s h (d packsize) -> b h d s packsize", packsize=packsize
+        )
+        v_cache[batch_start:batch_end, :, :sequence_end, :] = rearrange(kv[:, :, 1], "b s h d -> b h s d")

     return kv

...
         rotary_dim: Optional[int] = None,
         rotary_emb_scale_base: Optional[float] = None,
         n_head: Optional[int] = None,
         head_dim: Optional[int] = None,
         bias: bool = True,
         causal: bool = True,
         softmax_scale: Optional[float] = None,
+        dropout: float = 0.0,
         layer_idx: Optional[int] = None,
         return_residual: bool = False,
         checkpointing: bool = False,
...
             rotary_kwargs = {"device": device}
             if rotary_emb_scale_base is not None and rotary_emb_scale_base > 0.0:
                 rotary_kwargs["scale_base"] = rotary_emb_scale_base
+            self.rotary_emb = RotaryEmbedding(self.rotary_emb_dim, **rotary_kwargs)

         # MLP
+        self.n_head, self.head_dim = find_mha_dims(config, n_head, head_dim)
+        op_size = self.n_head * self.head_dim
         hidden_size = config.n_embd

+        self.Wqkv = nn.Linear(hidden_size, 3 * op_size, bias=bias, device=device, dtype=dtype)
+        self.out_proj = nn.Linear(op_size, hidden_size, bias=bias, device=device, dtype=dtype)

         # Attention
+        self.inner_attn = SelfAttention(causal=causal, softmax_scale=softmax_scale, attention_dropout=dropout)
+        self.inner_cross_attn = CrossAttention(causal=causal, softmax_scale=softmax_scale, attention_dropout=dropout)

         self.layer_idx = layer_idx
         self.return_residual = return_residual
         self.checkpointing = checkpointing

+    def forward(
         self,
         x: torch.FloatTensor,
+        past_key_values: Optional[InferenceParams] = None,
+        attention_mask: Optional[torch.BoolTensor] = None,
+        cu_seqlens: Optional[torch.LongTensor] = None,
+        max_seqlen: Optional[int] = None,
+        **kwargs,
+    ) -> Tuple[torch.FloatTensor, torch.FloatTensor]:
         qkv = self.Wqkv(x)
+        qkv = rearrange(qkv, "... (three h d) -> ... three h d", three=3, d=self.head_dim)

+        seqlen_offset = past_key_values.sequence_len_offset if past_key_values is not None else 0
         if self.rotary_emb_dim > 0:
+            qkv = self.rotary_emb(qkv, seqlen_offset=seqlen_offset)

         if past_key_values is not None:
+            kv = update_kv_cache(qkv[:, :, 1:], past_key_values, self.layer_idx)

+        if attention_mask is not None:
+            attention_mask = attention_mask[0] if isinstance(attention_mask, tuple) else attention_mask
+            attention_mask = attention_mask.bool().to(qkv.device)

+        attention_kwargs = {"attention_mask": attention_mask}

+        if past_key_values is None or seqlen_offset == 0:
+            if self.checkpointing:
+                attn_output = torch.utils.checkpoint.checkpoint(self.inner_attn, qkv, **attention_kwargs)
             else:
+                attn_output = self.inner_attn(qkv, **attention_kwargs)
         else:
+            q = qkv[:, :, 0]
+            causal = None if past_key_values.sequence_len_offset == 0 else False
+            attn_output = self.inner_cross_attn(q, kv, causal=causal, **attention_kwargs)

         output = rearrange(attn_output, "... h d -> ... (h d)")
         output = self.out_proj(output)
...
             if module.padding_idx is not None:
                 module.weight.data[module.padding_idx].zero_()
         elif isinstance(module, nn.LayerNorm):
+            module.bias.data.zero_()
             module.weight.data.fill_(1.0)

     def prepare_inputs_for_generation(
         self,
         input_ids: torch.LongTensor,
         past_key_values: Optional[Union[torch.FloatTensor, InferenceParams]] = None,
+        attention_mask: Optional[torch.BoolTensor] = None,
         **kwargs,
     ) -> Dict[str, Any]:
+        if attention_mask is not None and torch.any(~attention_mask.bool()):
+            total_seq_len = torch.sum(attention_mask, dim=1)
+            max_seq_len = torch.max(total_seq_len)
+
+            total_seq_len = torch.cat((torch.tensor([0], device=attention_mask.device), total_seq_len)).unsqueeze(1)
+            cumulative_seq_len = torch.cumsum(total_seq_len, dim=0).squeeze(1).to(torch.int32)
+            attention_mask = (attention_mask.bool(), cumulative_seq_len, max_seq_len.item())
+        else:
+            attention_mask = None
+
         if past_key_values is None or not (isinstance(past_key_values, InferenceParams)):
             past_key_values = InferenceParams(
                 max_batch_size=input_ids.shape[0],
+                max_sequence_len=self.config.n_positions,
+                sequence_len_offset=0,
                 batch_size_offset=0,
+                fused_ft_kernel=False,
                 key_value_memory_dict={},
             )
         else:
             # Assume that `past_key_values` has cached all tokens up to the last token in `input_ids`
+            past_key_values.sequence_len_offset = len(input_ids[0]) - 1
             input_ids = input_ids[:, -1].unsqueeze(-1)

         return {
...
             "attention_mask": attention_mask,
         }

+    def _set_gradient_checkpointing(self, module, value=False):
+        if isinstance(module, MixFormerSequentialPreTrainedModel):
+            module.gradient_checkpointing = value


 class MixFormerSequentialForCausalLM(MixFormerSequentialPreTrainedModel):
...
         labels: Optional[torch.LongTensor] = None,
         **kwargs,
     ) -> CausalLMOutputWithPast:
+        if attention_mask is not None and self.training:
+            print("`attention_mask` is not supported during training. Using it might lead to unexpected results.")
+
+        if past_key_values is None and attention_mask is None:
+            lm_logits = self.layers(input_ids)
+        else:
+            hidden_layer = self.layers[0](input_ids)
+            for module in self.layers[1:-1]:
+                hidden_layer = module(hidden_layer, past_key_values=past_key_values, attention_mask=attention_mask)
+            lm_logits = self.layers[-1](hidden_layer)

         loss = None
         if labels is not None:
             loss = self.loss(lm_logits, labels)

+        return CausalLMOutputWithPast(loss=loss, logits=lm_logits, past_key_values=past_key_values)
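The core of the rotary update in `_apply_rotary_emb_qkv` above is the pairwise rotation `x1*cos - x2*sin`, `x1*sin + x2*cos`, computed in fp32 and cast back. A standalone sketch of that update on a dummy query tensor — shapes and values are illustrative only, and it mirrors rather than imports the module above:

```python
import torch

batch, seqlen, n_head, head_dim = 1, 4, 2, 8
rotary_dim = head_dim  # rotate all channels in this toy example
q = torch.randn(batch, seqlen, n_head, head_dim)

# Angle tables as in _update_cos_sin_cache: freqs = outer(positions, inv_freq).
inv_freq = 1.0 / (10000 ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim))
t = torch.arange(seqlen, dtype=torch.float32)
freqs = torch.outer(t, inv_freq)           # (seqlen, rotary_dim // 2)
cos, sin = torch.cos(freqs), torch.sin(freqs)

# Split the rotary channels in half and rotate, as the model code does.
q1, q2 = q[..., :rotary_dim].chunk(2, dim=-1)
c = cos[:, None, :]                        # broadcast over heads
s = sin[:, None, :]
q_rot = torch.cat([q1 * c - q2 * s, q1 * s + q2 * c], dim=-1)
print(q_rot.shape)                         # torch.Size([1, 4, 2, 8])
```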
pytorch_model.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256: […]
-size […]
+oid sha256:eab6a12a9a2b78cac8f8975aea9f3a5e89ddadcb9e0dad27e40965e57e235a4a
+size 2836623617
special_tokens_map.json
ADDED
@@ -0,0 +1,5 @@
{
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "unk_token": "<|endoftext|>"
}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,9 @@
{
  "add_prefix_space": false,
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "model_max_length": 2048,
  "tokenizer_class": "CodeGenTokenizer",
  "unk_token": "<|endoftext|>"
}
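Taken together, `special_tokens_map.json`, `tokenizer_config.json`, and `added_tokens.json` describe a CodeGen-style tokenizer whose BOS/EOS/UNK tokens are all `<|endoftext|>` and whose context length is capped at 2048 tokens. A quick consistency check — a sketch, with the repo id assumed as elsewhere in this commit:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("goendalf666/salesGPT_v2")

print(type(tok).__name__)                           # a CodeGen tokenizer class
print(tok.model_max_length)                         # 2048, from tokenizer_config.json
print(tok.bos_token, tok.eos_token, tok.unk_token)  # all "<|endoftext|>"
```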
vocab.json
ADDED
The diff for this file is too large to render.