attn_impl

#27
by GaaraOtheSand - opened

I'm sure that this has been discussed a little bit but what am I really supposed to be doing when it comes to this variable, 'attn_impl'? In the documentation on the site it seems to suggest using triton but when I run the program it says to use flash. Personally I'd prefer to use torch but it kills the program for torch and triton so Idk. This is what I've got written at the beginning of my program:

  import torch
  import transformers

  config = transformers.AutoConfig.from_pretrained(
      'mosaicml/mpt-7b',
      trust_remote_code=True
  )

  config.attn_config['attn_impl'] = 'torch'

  config.update({"max_seq_len": 8192})

  model = transformers.AutoModelForCausalLM.from_pretrained(
      'mosaicml/mpt-7b',
      config=config,
      torch_dtype=torch.bfloat16,
      trust_remote_code=True
  )

Alright, so this is what I get when running the code above with attn_impl set to torch (triton gets killed the same way):

      /root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/d8304854d4877849c3c0a78f3469512a84419e84/attention.py:148: UserWarning: Using `attn_impl: torch`. If your model does not use 
      `alibi` or `prefix_lm` we recommend using `attn_impl: flash` otherwise we recommend using `attn_impl: triton`.
      warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
      Loading checkpoint shards:   0%|                                                                  | 0/2 [00:00<?, ?it/s]Killed

and this is the error I receive when trying to run the program with flash as it suggests:

    line 39, in <module>
       model = transformers.AutoModelForCausalLM.from_pretrained(
    File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 462, in from_pretrained
       return model_class.from_pretrained(
    File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2611, in from_pretrained
      model = cls(config, *model_args, **model_kwargs)
    File "/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/d8304854d4877849c3c0a78f3469512a84419e84/modeling_mpt.py", line 205, in __init__
     self.transformer = MPTModel(config)
    File "/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/d8304854d4877849c3c0a78f3469512a84419e84/modeling_mpt.py", line 30, in __init__
     config._validate_config()
    File "/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/d8304854d4877849c3c0a78f3469512a84419e84/configuration_mpt.py", line 108, in _validate_config
      raise NotImplementedError('alibi only implemented with torch and triton attention.')
    NotImplementedError: alibi only implemented with torch and triton attention.

So no matter which attention implementation I pick, the program won't run, which seems kinda silly?

Hi @GaaraOtheSand , the default is attn_impl: torch and it should work on CPU or GPU. The error you are seeing in

 Loading checkpoint shards:   0%|                                                                  | 0/2 [00:00<?, ?it/s]Killed

appears to be an Out-of-Memory issue. Do you have enough CPU RAM to hold the model? (>=16GB)
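One way to answer the RAM question from inside the same environment that runs the script is to read /proc/meminfo. The sketch below assumes a Linux system (WSL included); parse_meminfo and has_enough_ram are hypothetical helper names, not part of any library. The memory visible inside a VM or container may be lower than what the host machine has, so it is worth checking from where the model actually loads.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in GiB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        # Memory lines look like "MemTotal:       32787252 kB"
        if len(fields) == 2 and fields[1] == "kB" and fields[0].isdigit():
            values[key.strip()] = int(fields[0]) / 1024**2  # kB -> GiB
    return values


def has_enough_ram(meminfo, needed_gib):
    """Return True if MemAvailable covers the requested amount."""
    return meminfo.get("MemAvailable", 0.0) >= needed_gib


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print(f"MemTotal:     {info.get('MemTotal', 0.0):.1f} GiB")
    print(f"MemAvailable: {info.get('MemAvailable', 0.0):.1f} GiB")
    print("Enough for ~16 GB of weights:", has_enough_ram(info, 16))
```

If MemTotal here is well below the physical RAM of the machine, the limit is probably set by the VM or container rather than by the hardware.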

The UserWarnings should not break anything, they are just there for guidance. You can use either torch or triton with this model since it uses ALiBi. We will remove/update the comment referring to flash as we don't recommend that path much anymore.
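The rule behind the NotImplementedError quoted earlier can be sketched as a small check. This is only an illustration of the behaviour reported in this thread, not the actual code in configuration_mpt.py; check_attn_impl is a hypothetical helper, and the set of accepted values is assumed from the names mentioned in the warning.

```python
def check_attn_impl(attn_impl, alibi):
    """Hypothetical helper mirroring the rule reported in this thread.

    ALiBi models can use 'torch' or 'triton'; 'flash' is rejected.
    """
    known = {"torch", "triton", "flash"}  # assumed from the warning text
    if attn_impl not in known:
        raise ValueError(f"Unknown attn_impl: {attn_impl!r}")
    if alibi and attn_impl not in {"torch", "triton"}:
        raise NotImplementedError(
            "alibi only implemented with torch and triton attention."
        )
    return attn_impl


if __name__ == "__main__":
    # mpt-7b uses ALiBi, so these two choices pass the check.
    print(check_attn_impl("torch", alibi=True))
    print(check_attn_impl("triton", alibi=True))
```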

I have 32 GB of RAM, and my CUDA, cuDNN, and PyTorch are working correctly as far as I've tested. TensorFlow pops up a few different errors, but I don't think they're anything to be concerned about, because the CPU and GPU tests both show me what they're supposed to. When you say the default, do you mean it'll automatically use attn_impl: torch, or that I should write that into my code? The docs on Hugging Face show this: config.attn_config['attn_impl'] = 'triton'. I tried running it with and without the explicit config.attn_config['attn_impl'] = 'torch', and it doesn't even show the loading checkpoint shards bar until I hit Ctrl+C, because it just stays stalled. According to Task Manager it's only using 22 of the 32 GB. This is what it gives me when I run it using torch:

     /root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/d8304854d4877849c3c0a78f3469512a84419e84/attention.py:148: UserWarning: Using `attn_impl: torch`. If your model does not use 
     `alibi` or `prefix_lm` we recommend using `attn_impl: flash` otherwise we recommend using `attn_impl: triton`.
     warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
     Loading checkpoint shards:   0%|                                                                  | 0/2 [00:00<?, ?it/s]Killed

Hi @GaaraOtheSand , thank you for the info. A couple comments and followup questions:

  1. If you run model = transformers.AutoModel.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True), this will use attn_impl: torch by default, as specified in the config.json. But it will load the weights in FP32 onto your CPU (by PyTorch and HF convention, weights load as FP32 tensors unless you pass torch_dtype, regardless of the dtype the checkpoint was saved in), which should take up about 6.7 * 4 ~= 26.8GB of RAM. If this is failing, there is likely something wrong with your environment, and I would recommend trying out our Docker image mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04.

  2. Similar to above, if you run model = transformers.AutoModel.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True, torch_dtype=torch.bfloat16), it will do everything as above but load into BF16 weights on your CPU, which should take about 6.7 * 2 ~= 13.4GB of RAM.
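The arithmetic in the two points above can be written out as a short sketch. weight_memory_gb is a hypothetical helper; it counts only the weights themselves (parameters times bytes per parameter) and ignores any temporary copies made while loading, so actual peak usage may be higher.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(n_params_billion, dtype):
    """Approximate weight memory in GB: billions of params * bytes per param."""
    return n_params_billion * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    # 6.7 is the parameter count (in billions) used in the post above.
    for dtype in ("float32", "bfloat16"):
        print(f"{dtype}: ~{weight_memory_gb(6.7, dtype):.1f} GB")
```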

I've tried 1) on a Linux box with the Docker image and 2) on the Linux box as well as my Macbook M2 (which has just 16GB RAM).

Could you try these code snippets and if they still fail, could you report your torch and transformers installed versions? Thank you!
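A quick way to collect the versions asked for above, using only the standard library. installed_versions is a hypothetical helper name; it reads package metadata, so it does not need to import torch or transformers to report them.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["torch", "transformers"]).items():
        print(f"{name}=={version}")
```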

Well, I'm currently running with bfloat16, and my transformers version is 4.29.1, and my torch version is 2.0.1, and my Ubuntu is 22.04. I'm trying to run it through wsl because it said that it needs jax and jaxlib which I can't get on windows, but there really shouldn't be anything wrong with my Ubuntu or WSL, because when I test whether torch can detect the GPU it shows the correct information. I guess I could try Docker; it's just that I'm really unfamiliar with it, and I've found Docker to not be very user friendly in the past, imo. I'm building a program that uses an LLM, in this instance MPT as its base model, but I wouldn't even know where to start to try and incorporate Docker into that project.

> I'm trying to run it through wsl because it said that it needs jax and jaxlib which I can't get on windows
Hi @GaaraOtheSand , could you elaborate on this? Nothing about our models or stack should require JAX, was there some documentation or message you saw that suggested this?

> and my transformers version is 4.29.1, and my torch version is 2.0.1, and my Ubuntu is 22.04,
This looks fine, but just to be extra safe, you could try using torch==1.13.1, that is the version we used for building all these models.

Could you report the error you are seeing when you run the 1) and 2) commands I shared in the previous message?

For clarification on the JAX thing, I believe that has to do with transformers and Hugging Face and not specifically MPT. OK, I'll try to make sure that this is easy-ish to read, so I'll begin with the imports and the model-loading code:

   from PyQt5.QtWidgets import QApplication, QMainWindow, QWidget, QTextEdit, QVBoxLayout, QHBoxLayout, QPushButton
   from PyQt5.QtCore import QObject, QThread, pyqtSignal
   import torch
   import os
   import sys
   import requests
   from bs4 import BeautifulSoup
   import datetime
   import subprocess
   import pdb
   import random
   import numpy as np
   import re
   from sklearn.metrics.pairwise import cosine_similarity
   from sklearn.feature_extraction.text import CountVectorizer
   import clang.cindex
   import schedule
   import torch.nn.functional as F
   import scipy
   import spacy
   from scipy.spatial.distance import cosine
   import keras
   from keras.utils import pad_sequences
   import transformers
   from transformers import pipeline

   # Load a pre-trained word embedding model
   nlp = spacy.load("en_core_web_md")

   config = transformers.AutoConfig.from_pretrained(
       'mosaicml/mpt-7b',
       trust_remote_code=True
   )

   config.attn_config['attn_impl'] = 'torch'

   config.update({"max_seq_len": 8192})

   model = transformers.AutoModelForCausalLM.from_pretrained(
       'mosaicml/mpt-7b',
       config=config,
       #torch_dtype=torch.bfloat16,
       trust_remote_code=True
   )

This is the error I receive when I run the model with that code structure:

     2023-05-25 11:37:51.449071: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in 
    performance-critical operations:  AVX2 FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2023-05-25 11:37:52.338366: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
    2023-05-25 11:37:53.858440: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or 
    directory
    2023-05-25 11:37:53.858588: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object 
   file: No such file or directory
   2023-05-25 11:37:53.858623: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the 
   missing libraries mentioned above are installed properly.
   2023-05-25 11:37:56.356776: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:966] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
   Your kernel may have been built without NUMA support.
  2023-05-25 11:37:56.366126: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:966] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
   Your kernel may have been built without NUMA support.
    2023-05-25 11:37:56.366194: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:966] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
    Your kernel may have been built without NUMA support.

   A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
   - configuration_mpt.py
  . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
  A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
  - hf_prefixlm_converter.py
 . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
 A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
  - norm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
 - meta_init_context.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
   A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
    - param_init_fns.py
  . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.

    Downloading (…)ve/main/attention.py: 100%|█████████████████████████████████████████| 16.8k/16.8k [00:00<00:00, 23.2MB/s]
    Downloading (…)flash_attn_triton.py: 100%|█████████████████████████████████████████| 28.2k/28.2k [00:00<00:00, 4.82MB/s]
    A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
   - flash_attn_triton.py
     . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
     A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
     - attention.py
    - flash_attn_triton.py
    . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
    A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
    - blocks.py
    - attention.py
   . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
    A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
   - adapt_tokenizer.py
   . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
  A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
    - modeling_mpt.py
    - hf_prefixlm_converter.py
   - norm.py
    - meta_init_context.py
    - param_init_fns.py
    - blocks.py
   - adapt_tokenizer.py
   . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
   Downloading shards: 100%|█████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  9.21it/s]
   /root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/14958374ab073ba1030c0caef4ae8380045bae45/attention.py:157: UserWarning: Using `attn_impl: torch`. If your model does not use 
   `alibi` or `prefix_lm` we recommend using `attn_impl: flash` otherwise we recommend using `attn_impl: triton`.
     warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
   Killed

Now when I run the code with the bfloat16 dtype added:

  model = transformers.AutoModelForCausalLM.from_pretrained(
      'mosaicml/mpt-7b',
      config=config,
      torch_dtype=torch.bfloat16,
      trust_remote_code=True
  )

I receive pretty much the same error, but without the file-download messages, so I'll just share what I receive after the NUMA support stuff:

 /root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/14958374ab073ba1030c0caef4ae8380045bae45/attention.py:157: UserWarning: Using `attn_impl: torch`. If your model does not use 
 `alibi` or `prefix_lm` we recommend using `attn_impl: flash` otherwise we recommend using `attn_impl: triton`.
 warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
Loading checkpoint shards:   0%|                                                                  | 0/2 [00:00<?, ?it/s]
Killed

So ya Idk there's certainly a chance that I messed up somewhere but I have no clue where or how.

I'm sorry, I'm sure you have better things to do with your time. I think I've eliminated the possibility of it being the program I'm trying to build, and now that I've fixed the issues with my TF I doubt that it's a problem with that. My PyTorch and CUDA are working the way they're supposed to, and just like before I've tried toying with attn_impl: not setting it at all, stating torch explicitly, stating bfloat16, and doing the same with triton. I made two new scripts, one to solely test the model and one to test another model from Hugging Face, and both of them got killed after some time. I was originally using chatgpt or text-davinci-003, and testing it I found that it still 'works', so my guess is it's an issue with transformers and/or Hugging Face; not sure how to fix that though...
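A bare "Killed" with no Python traceback usually means the process received SIGKILL from outside, which is what the kernel's out-of-memory handling does. One way to confirm that from a test script is to run the load in a child process and inspect its exit status. This is a sketch assuming a POSIX system such as Linux or WSL; run_and_report is a hypothetical helper name.

```python
import signal
import subprocess
import sys


def run_and_report(code):
    """Run Python source in a child process and describe how it ended."""
    result = subprocess.run([sys.executable, "-c", code])
    # On POSIX, a negative return code -N means "terminated by signal N".
    if result.returncode == -signal.SIGKILL:
        return "killed"
    if result.returncode == 0:
        return "ok"
    return "failed"


if __name__ == "__main__":
    # Harmless demonstration: a child that exits normally.
    print(run_and_report("print('child ran fine')"))
```

Passing the from_pretrained call as the code string would report "killed" if the loader is being terminated by SIGKILL, while an ordinary Python exception would show up as "failed" along with its traceback.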

@abhi-mosaic I also can't get triton to work. What are the dependencies?

I've tried each of:
!pip install triton
!pip install triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python

and am getting these errors:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<string> in _fwd_kernel(Q, K, V, Bias, Out, Lse, TMP, softmax_scale, stride_qb, stride_qh, stride_qm, stride_kb, stride_kh, stride_kn, stride_vb, stride_vh, stride_vn, stride_bb, stride_bh, stride_bm, stride_ob, stride_oh, stride_om, nheads, seqlen_q, seqlen_k, seqlen_q_rounded, headdim, CACHE_KEY_SEQLEN_Q, CACHE_KEY_SEQLEN_K, BIAS_TYPE, IS_CAUSAL, BLOCK_HEADDIM, EVEN_M, EVEN_N, EVEN_HEADDIM, BLOCK_M, BLOCK_N, grid, num_warps, num_stages, extern_libs, stream, warmup)

KeyError: ('2-.-0-.-0-83ca8b715a9dc5f32dc1110973485f64-394352f6a8351feaac334fbb8cc63fa4-46c7c5d46afed8316facd72e7e581bec-ee7112c0f04b05ca1104709529fc7c00-39e3c68a052760cc345a9147b0d68f7d-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-4ac47e74762ba6a774cceea0e1e75ae6-13b7ffc189bd9fba7696034bbcfee151', (torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.float32, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('vector', True, 128, False, False, True, 128, 128), (True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (False, False), (True, False), (True, False), (False, True), (False, True)))

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-39-706a4f84ff25> in <cell line: 1>()
----> 1 stream()

21 frames
/usr/local/lib/python3.10/dist-packages/triton_pre_mlir/runtime/autotuner.py in run(self, *args, **kwargs)
    198         for v, heur in self.values.items():
    199             kwargs[v] = heur({**dict(zip(self.arg_names, args)), **kwargs})
--> 200         return self.fn.run(*args, **kwargs)
    201 
    202 

<string> in _fwd_kernel(Q, K, V, Bias, Out, Lse, TMP, softmax_scale, stride_qb, stride_qh, stride_qm, stride_kb, stride_kh, stride_kn, stride_vb, stride_vh, stride_vn, stride_bb, stride_bh, stride_bm, stride_ob, stride_oh, stride_om, nheads, seqlen_q, seqlen_k, seqlen_q_rounded, headdim, CACHE_KEY_SEQLEN_Q, CACHE_KEY_SEQLEN_K, BIAS_TYPE, IS_CAUSAL, BLOCK_HEADDIM, EVEN_M, EVEN_N, EVEN_HEADDIM, BLOCK_M, BLOCK_N, grid, num_warps, num_stages, extern_libs, stream, warmup)

RuntimeError: Triton Error [CUDA]: invalid argument

and here is my code:

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Trelis/mpt-7b-8k-chat-sharded-bf16"
cache_dir = None  # placeholder: not shown in the original post; None uses the default cache

# Prepare the 4-bit quantization configuration (bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton' # running with triton as recommended.
config.init_device = 'cuda:0' # For fast initialization directly on GPU!
config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096

tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    quantization_config=bnb_config,
    device_map={"": 0},  # place the whole model on GPU 0
    torch_dtype=torch.bfloat16, # Load model weights in bfloat16
    trust_remote_code=True,
    cache_dir=cache_dir
)
