attn_impl
I'm sure this has been discussed a little already, but what am I really supposed to be doing with the variable 'attn_impl'? The documentation on the site seems to suggest using triton, but when I run the program it says to use flash. Personally I'd prefer to use torch, but the program gets killed with both torch and triton, so Idk. This is what I've got written at the beginning of my program:
import torch
import transformers

config = transformers.AutoConfig.from_pretrained(
    'mosaicml/mpt-7b',
    trust_remote_code=True
)
config.attn_config['attn_impl'] = 'torch'
config.update({"max_seq_len": 8192})
model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b',
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
Alright so this is the error I receive when trying to use triton:
/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/d8304854d4877849c3c0a78f3469512a84419e84/attention.py:148: UserWarning: Using `attn_impl: torch`. If your model does not use
`alibi` or `prefix_lm` we recommend using `attn_impl: flash` otherwise we recommend using `attn_impl: triton`.
warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]Killed
and this is the error I receive when trying to run the program with flash as it suggests:
line 39, in <module>
model = transformers.AutoModelForCausalLM.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 462, in from_pretrained
return model_class.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2611, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/d8304854d4877849c3c0a78f3469512a84419e84/modeling_mpt.py", line 205, in __init__
self.transformer = MPTModel(config)
File "/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/d8304854d4877849c3c0a78f3469512a84419e84/modeling_mpt.py", line 30, in __init__
config._validate_config()
File "/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/d8304854d4877849c3c0a78f3469512a84419e84/configuration_mpt.py", line 108, in _validate_config
raise NotImplementedError('alibi only implemented with torch and triton attention.')
NotImplementedError: alibi only implemented with torch and triton attention.
So for some reason no matter what type of attention implementation, the program won't run, that seems kinda silly?
Hi @GaaraOtheSand, the default is attn_impl: torch and it should work on CPU or GPU. The error you are seeing in
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]Killed
appears to be an Out-of-Memory issue. Do you have enough CPU RAM to hold the model? (>=16GB)
The UserWarnings should not break anything; they are just there for guidance. You can use either torch or triton with this model since it uses ALiBi. We will remove/update the comment referring to flash as we don't recommend that path much anymore.
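The compatibility rule implied by the traceback and the reply above can be sketched as a small standalone check. This is only an illustration, not the code in configuration_mpt.py: the function name `check_attn_impl` is hypothetical, and the error message is copied from the traceback earlier in the thread.

```python
def check_attn_impl(attn_impl: str, alibi: bool) -> str:
    """Return attn_impl if it can be combined with the ALiBi setting, else raise.

    Hypothetical helper that mirrors the rule seen in the traceback:
    ALiBi is only available with the 'torch' and 'triton' implementations.
    """
    known = ("torch", "triton", "flash")
    if attn_impl not in known:
        raise ValueError(f"Unknown attn_impl: {attn_impl!r}")
    if alibi and attn_impl not in ("torch", "triton"):
        raise NotImplementedError(
            "alibi only implemented with torch and triton attention."
        )
    return attn_impl


if __name__ == "__main__":
    # MPT-7B uses ALiBi, so 'flash' is rejected while the other two pass.
    for impl in ("torch", "triton", "flash"):
        try:
            print(impl, "-> accepted:", check_attn_impl(impl, alibi=True))
        except NotImplementedError as err:
            print(impl, "-> rejected:", err)
```

In other words, the flash failure is a configuration error raised before any weights are loaded, while the "Killed" message is a separate problem that happens during loading.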
I have 32 GB of RAM, and I think my CUDA, cuDNN, and PyTorch are working correctly; as far as I've tested, they are. TensorFlow pops up a few different errors, but I don't think they are anything to be concerned about, because the CPU and GPU tests both show me what they're supposed to. When you say the default, do you mean it'll automatically try attn_impl: torch, or that I should write that into my code? In the docs on huggingface it shows this: config.attn_config['attn_impl'] = 'triton'. I tried running it with and without the explicit config.attn_config['attn_impl'] = 'torch', and it doesn't even show the loading checkpoint shards bar until I hit ctrl C, because it just stays stalled. According to task manager it's only using 22 of the 32 GB. This is what it gives me when I run it using torch:
/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/d8304854d4877849c3c0a78f3469512a84419e84/attention.py:148: UserWarning: Using `attn_impl: torch`. If your model does not use
`alibi` or `prefix_lm` we recommend using `attn_impl: flash` otherwise we recommend using `attn_impl: triton`.
warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]Killed
Hi @GaaraOtheSand, thank you for the info. A couple of comments and follow-up questions:
1) If you run
model = transformers.AutoModel.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)
this will use attn_impl: torch by default, as it is specified in the config.json. But it will load into FP32 on your CPU (by PyTorch and HF convention, weights are loaded as FP32 tensors regardless of the torch_dtype recorded in the config, unless torch_dtype is passed explicitly), which should take up about 6.7 * 4 ~= 26.8GB of RAM. If this is failing, there is likely something wrong with your environment, and I would recommend trying out our Docker image mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04.
2) Similar to above, if you run
model = transformers.AutoModel.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True, torch_dtype=torch.bfloat16)
it will do everything as above but load into BF16 weights on your CPU, which should take about 6.7 * 2 ~= 13.4GB of RAM.
I've tried 1) on a Linux box with the Docker image and 2) on the Linux box as well as my Macbook M2 (which has just 16GB RAM).
Could you try these code snippets and, if they still fail, report your torch and transformers installed versions? Thank you!
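The RAM estimates above are simply parameter count times bytes per parameter. Here is a minimal sketch of that arithmetic, assuming 6.7 billion parameters (the figure used above) and decimal gigabytes. The helper name `weight_memory_gb` is made up for illustration, and the result covers the weights only, not any temporary copies created while the checkpoint is being loaded.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def weight_memory_gb(n_params_billion: float, dtype: str) -> float:
    """Approximate weight memory in GB: billions of params * bytes per param."""
    return n_params_billion * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for dtype in ("fp32", "bf16"):
        print(f"{dtype}: ~{weight_memory_gb(6.7, dtype):.1f} GB")
```

Peak usage during loading can be higher than these numbers, so a machine whose usable RAM is close to the estimate may still run out of memory.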
Well, I'm currently running the floating point 16 version, and my transformers version is 4.29.1, and my torch version is 2.0.1, and my Ubuntu is 22.04. I'm trying to run it through wsl because it said that it needs jax and jaxlib which I can't get on windows, but there really shouldn't be anything wrong with my Ubuntu or WSL, because when I test whether torch can detect the GPU it shows the correct information. I guess I could try Docker; it's just that I'm really unfamiliar with it, and I've found Docker to not be very user friendly in the past, imo. I'm building a program that uses an LLM, in this instance MPT, as its base model, but I wouldn't even know where to start to try and incorporate Docker into that project.
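Since the process is being killed inside WSL, one thing worth checking is how much memory the Linux side can actually see. WSL 2 runs in a VM whose memory can be capped below the host's total (the cap is configurable through .wslconfig), so the number Windows Task Manager reports for the whole machine may not be what the Python process gets. Whether that is the cause here is only a guess. A minimal sketch, assuming a Linux environment where /proc/meminfo is readable; the helper names are hypothetical.

```python
import os


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field_name: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def visible_memory_gib(path: str = "/proc/meminfo") -> tuple:
    """Return (MemTotal, MemAvailable) in GiB as seen by this Linux environment."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    kib_per_gib = 1024 ** 2
    return info["MemTotal"] / kib_per_gib, info["MemAvailable"] / kib_per_gib


if __name__ == "__main__":
    if os.path.exists("/proc/meminfo"):
        total, available = visible_memory_gib()
        print(f"MemTotal: {total:.1f} GiB, MemAvailable: {available:.1f} GiB")
    else:
        print("/proc/meminfo not found; this check only applies on Linux/WSL.")
```

If MemTotal comes out well below 32 GiB, comparing it against the 26.8GB (FP32) and 13.4GB (BF16) estimates from the previous reply would show whether either load can fit.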
I'm trying to run it through wsl because it said that it needs jax and jaxlib which I can't get on windows
Hi @GaaraOtheSand , could you elaborate on this? Nothing about our models or stack should require JAX, was there some documentation or message you saw that suggested this?
and my transformers version is 4.29.1, and my torch version is 2.0.1, and my Ubuntu is 22.04,
This looks fine, but just to be extra safe, you could try using torch==1.13.1, which is the version we used for building all these models.
Could you report the error you are seeing when you run the 1) and 2) commands I shared in the previous message?
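To collect the version information asked for earlier in one go, something like the following could be used. It is a small sketch built on the standard library's importlib.metadata; the function name `report_versions` is illustrative, and packages that are not installed are reported as such instead of raising.

```python
from importlib import metadata


def report_versions(packages):
    """Return {package_name: installed version or 'not installed'}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in report_versions(["torch", "transformers"]).items():
        print(f"{name}=={version}")
```

Pasting this output alongside the error makes it easier to compare an environment against the torch==1.13.1 setup mentioned above.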
For clarification on the JAX thing, I believe that has to do with transformers and huggingface and not specifically mpt. OK, I'll try to make sure that this is easy-ish to read, so I'll begin with the imports and the beginning of the model code that I'm loading:
from PyQt5.QtWidgets import QApplication, QMainWindow, QWidget, QTextEdit, QVBoxLayout, QHBoxLayout, QPushButton
from PyQt5.QtCore import QObject, QThread, pyqtSignal
import torch
import os
import sys
import requests
from bs4 import BeautifulSoup
import datetime
import subprocess
import pdb
import random
import numpy as np
import re
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import clang.cindex
import schedule
import torch.nn.functional as F
import scipy
import spacy
from scipy.spatial.distance import cosine
import keras
from keras.utils import pad_sequences
import transformers
from transformers import pipeline
# Load a pre-trained word embedding model
nlp = spacy.load("en_core_web_md")
config = transformers.AutoConfig.from_pretrained(
    'mosaicml/mpt-7b',
    trust_remote_code=True
)
config.attn_config['attn_impl'] = 'torch'
config.update({"max_seq_len": 8192})
model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b',
    config=config,
    #torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
This is the error I receive when I run the model with that code structure:
2023-05-25 11:37:51.449071: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in
performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-25 11:37:52.338366: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-05-25 11:37:53.858440: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or
directory
2023-05-25 11:37:53.858588: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object
file: No such file or directory
2023-05-25 11:37:53.858623: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the
missing libraries mentioned above are installed properly.
2023-05-25 11:37:56.356776: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:966] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-05-25 11:37:56.366126: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:966] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-05-25 11:37:56.366194: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:966] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
- configuration_mpt.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
- hf_prefixlm_converter.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
- norm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
- meta_init_context.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
- param_init_fns.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading (…)ve/main/attention.py: 100%|█████████████████████████████████████████| 16.8k/16.8k [00:00<00:00, 23.2MB/s]
Downloading (…)flash_attn_triton.py: 100%|█████████████████████████████████████████| 28.2k/28.2k [00:00<00:00, 4.82MB/s]
A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
- flash_attn_triton.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
- attention.py
- flash_attn_triton.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
- blocks.py
- attention.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
- adapt_tokenizer.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co./mosaicml/mpt-7b:
- modeling_mpt.py
- hf_prefixlm_converter.py
- norm.py
- meta_init_context.py
- param_init_fns.py
- blocks.py
- adapt_tokenizer.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading shards: 100%|█████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 9.21it/s]
/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/14958374ab073ba1030c0caef4ae8380045bae45/attention.py:157: UserWarning: Using `attn_impl: torch`. If your model does not use
`alibi` or `prefix_lm` we recommend using `attn_impl: flash` otherwise we recommend using `attn_impl: triton`.
warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
Killed
Now when I run the code with the floating point 16 added:
model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b',
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
I receive pretty much the same error, just without the file-download messages, so I'll only share what I receive after the NUMA support stuff:
/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/14958374ab073ba1030c0caef4ae8380045bae45/attention.py:157: UserWarning: Using `attn_impl: torch`. If your model does not use
`alibi` or `prefix_lm` we recommend using `attn_impl: flash` otherwise we recommend using `attn_impl: triton`.
warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Killed
So ya Idk there's certainly a chance that I messed up somewhere but I have no clue where or how.
I'm sorry, I'm sure you have better things to do with your time. I think I've eliminated the possibility of it being the program I'm trying to build, and now that I've fixed the issues with my TF I doubt that it's a problem with that. My PyTorch and CUDA are working the way they are supposed to, and just like before I've tried toying with attn_impl: not setting anything, stating torch explicitly, stating bfloat16, and doing the same with triton. I made two new scripts, one solely to test the model and one to test another model from huggingface, and both of them were killed after some time. I was originally using chatgpt or text-davinci-003, and testing it I found that it still 'works', so my guess is it's an issue with transformers and/or huggingface; not sure how to fix that though...
@abhi-mosaic I also can't get triton to work. What are the dependencies?
I've tried each of:
!pip install triton
!pip install triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python
and am getting these errors:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<string> in _fwd_kernel(Q, K, V, Bias, Out, Lse, TMP, softmax_scale, stride_qb, stride_qh, stride_qm, stride_kb, stride_kh, stride_kn, stride_vb, stride_vh, stride_vn, stride_bb, stride_bh, stride_bm, stride_ob, stride_oh, stride_om, nheads, seqlen_q, seqlen_k, seqlen_q_rounded, headdim, CACHE_KEY_SEQLEN_Q, CACHE_KEY_SEQLEN_K, BIAS_TYPE, IS_CAUSAL, BLOCK_HEADDIM, EVEN_M, EVEN_N, EVEN_HEADDIM, BLOCK_M, BLOCK_N, grid, num_warps, num_stages, extern_libs, stream, warmup)
KeyError: ('2-.-0-.-0-83ca8b715a9dc5f32dc1110973485f64-394352f6a8351feaac334fbb8cc63fa4-46c7c5d46afed8316facd72e7e581bec-ee7112c0f04b05ca1104709529fc7c00-39e3c68a052760cc345a9147b0d68f7d-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-4ac47e74762ba6a774cceea0e1e75ae6-13b7ffc189bd9fba7696034bbcfee151', (torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.float32, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('vector', True, 128, False, False, True, 128, 128), (True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (False, False), (True, False), (True, False), (False, True), (False, True)))
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-39-706a4f84ff25> in <cell line: 1>()
----> 1 stream()
21 frames
/usr/local/lib/python3.10/dist-packages/triton_pre_mlir/runtime/autotuner.py in run(self, *args, **kwargs)
198 for v, heur in self.values.items():
199 kwargs[v] = heur({**dict(zip(self.arg_names, args)), **kwargs})
--> 200 return self.fn.run(*args, **kwargs)
201
202
<string> in _fwd_kernel(Q, K, V, Bias, Out, Lse, TMP, softmax_scale, stride_qb, stride_qh, stride_qm, stride_kb, stride_kh, stride_kn, stride_vb, stride_vh, stride_vn, stride_bb, stride_bh, stride_bm, stride_ob, stride_oh, stride_om, nheads, seqlen_q, seqlen_k, seqlen_q_rounded, headdim, CACHE_KEY_SEQLEN_Q, CACHE_KEY_SEQLEN_K, BIAS_TYPE, IS_CAUSAL, BLOCK_HEADDIM, EVEN_M, EVEN_N, EVEN_HEADDIM, BLOCK_M, BLOCK_N, grid, num_warps, num_stages, extern_libs, stream, warmup)
RuntimeError: Triton Error [CUDA]: invalid argument
and here is my code:
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Note: cache_dir is assumed to be defined earlier in the notebook.
model_id = "Trelis/mpt-7b-8k-chat-sharded-bf16"
# Prepare the BitsAndBytes 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'  # running with triton as recommended.
config.init_device = 'cuda:0'  # For fast initialization directly on GPU!
config.max_seq_len = 4096  # (input + output) tokens can now be up to 4096
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    quantization_config=bnb_config,
    device_map={"": 0},
    torch_dtype=torch.bfloat16,  # Load model weights in bfloat16
    trust_remote_code=True,
    cache_dir=cache_dir
)