Running this on CPU requires flash_attn, but flash_attn can't be installed on a CPU-only machine
I just tried to run the code provided in the repo, but it throws this error:
Traceback (most recent call last):
File "test_server.py", line 82, in <module>
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\....\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\models\auto\auto_factory.py", line 550, in from_pretrained
model_class = get_class_from_dynamic_module(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\....\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\dynamic_module_utils.py", line 501, in get_class_from_dynamic_module
final_module = get_cached_module_file(
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\....\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\dynamic_module_utils.py", line 326, in get_cached_module_file
modules_needed = check_imports(resolved_module_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\....\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\dynamic_module_utils.py", line 181, in check_imports
raise ImportError(
ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run `pip install flash_attn`
I ended up installing an older version to get it working on my Windows machine.
pip install flash-attn===1.0.4 --no-build-isolation
Hope that works
Nope, still the same thing! can't install flash_attn
see, here are the logs:
pip install flash-attn===1.0.4 --no-build-isolation
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting flash-attn===1.0.4
Downloading flash_attn-1.0.4.tar.gz (2.0 MB)
ββββββββββββββββββββββββββββββββββββββββ 2.0/2.0 MB 4.6 MB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [22 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "C:\Users\vandu\AppData\Local\Temp\pip-install-4472dd_e\flash-attn_53d08ad4d8ec487babe7ba8ed3131e5a\setup.py", line 106, in <module>
raise_if_cuda_home_none("flash_attn")
File "C:\Users\vandu\AppData\Local\Temp\pip-install-4472dd_e\flash-attn_53d08ad4d8ec487babe7ba8ed3131e5a\setup.py", line 53, in raise_if_cuda_home_none
raise RuntimeError(
RuntimeError: flash_attn was requested, but nvcc was not found. Are you sure your environment has nvcc available? If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'devel' will provide nvcc.
Warning: Torch did not find available GPUs on this system.
If your intention is to cross-compile, this is not an error.
By default, Apex will cross-compile for Pascal (compute capabilities 6.0, 6.1, 6.2),
Volta (compute capability 7.0), Turing (compute capability 7.5),
and, if the CUDA version is >= 11.0, Ampere (compute capability 8.0).
If you wish to cross-compile for a single specific architecture,
export TORCH_CUDA_ARCH_LIST="compute capability" before running setup.py.
torch.__version__ = 2.3.1+cpu
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
I don't know much either, but I think it only works if you have the CUDA build of PyTorch (which needs an NVIDIA GPU).
If you have one, install the CUDA Toolkit (https://developer.nvidia.com/cuda-downloads), then uninstall PyTorch with
pip uninstall torch
and finally install the torch build that matches your CUDA version (I have 12.5, but the newest PyTorch CUDA build is 12.4; it works fine):
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
(installation command from https://pytorch.org/)
Then you should be able to run pip install flash-attn. (Update: you will need to run pip install --upgrade pip setuptools wheel before the flash-attn installation command.)
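Before retrying the flash-attn install, it may help to confirm the things the advice above depends on: whether nvcc is on the PATH, whether torch is a CUDA build, and whether a GPU is visible. Here is a minimal sketch; the function name describe_build_env is made up for illustration, and it only reports facts, it doesn't fix anything.

```python
import importlib.util
import shutil


def describe_build_env() -> dict:
    """Collect the facts that decide whether flash-attn can be built here."""
    info = {
        "nvcc_on_path": shutil.which("nvcc") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "torch_cuda_build": None,
        "gpu_visible": False,
    }
    if info["torch_installed"]:
        import torch  # imported lazily so the check also runs without torch

        info["torch_version"] = torch.__version__      # e.g. "2.3.1+cpu"
        info["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
        info["gpu_visible"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_build_env().items():
        print(f"{key}: {value}")
```

If nvcc_on_path is False or torch_cuda_build is None, the flash-attn build will most likely stop with the same "nvcc was not found" error shown in the log above.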
My problem is not with flash_attn
It's with the model not running on cpu
This is caused by the get_imports function in transformers' dynamic_module_utils mistakenly listing flash_attn as a requirement, even when it's not actually used or even loaded.
Exact same issue as discussed here: https://huggingface.co./microsoft/phi-1_5/discussions/72
The same workaround works for Florence2 as well:
# Workaround for the unnecessary flash_attn requirement
import os
from unittest.mock import patch

from transformers import AutoModelForCausalLM
from transformers.dynamic_module_utils import get_imports

def fixed_get_imports(filename: str | os.PathLike) -> list[str]:
    # Only touch the Florence-2 modeling file; leave every other module alone.
    imports = get_imports(filename)
    if str(filename).endswith("modeling_florence2.py") and "flash_attn" in imports:
        imports.remove("flash_attn")
    return imports

# model_path and dtype must be defined before this point.
with patch("transformers.dynamic_module_utils.get_imports", fixed_get_imports):
    model = AutoModelForCausalLM.from_pretrained(model_path, attn_implementation="sdpa", torch_dtype=dtype, trust_remote_code=True)
I'm using this with my ComfyUI node and it's running fine without flash_attn even installed. I don't notice any performance difference either.
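To illustrate why a requirement check can trip on a package that is never used: as far as I understand it, the check reads the modeling file as text and collects import lines, so an import guarded by an if still counts. The sketch below is a simplified stand-in written for this thread, not the actual transformers code; scan_imports, drop_optional, and SAMPLE are hypothetical names.

```python
import re


def scan_imports(source: str) -> list[str]:
    """Simplified text-based import scan (an approximation, not the library's code)."""
    # Drop try/except blocks, since imports inside them are treated as optional.
    source = re.sub(r"\s*try\s*:.*?except.*?:", "", source, flags=re.DOTALL)
    # Collect package names from `import x` and `from x import y` lines.
    found = re.findall(r"^\s*import\s+(\S+)\s*$", source, flags=re.MULTILINE)
    found += re.findall(r"^\s*from\s+(\S+)\s+import", source, flags=re.MULTILINE)
    # Keep only top-level package names and skip relative imports.
    return sorted({name.split(".")[0] for name in found if not name.startswith(".")})


def drop_optional(imports: list[str], optional=("flash_attn",)) -> list[str]:
    """Filter out packages the model can run without."""
    return [name for name in imports if name not in optional]


# Hypothetical snippet shaped like a modeling file with a guarded import.
SAMPLE = '''
import torch
if is_flash_attn_2_available():
    from flash_attn import flash_attn_func
'''

if __name__ == "__main__":
    print(scan_imports(SAMPLE))
    print(drop_optional(scan_imports(SAMPLE)))
```

The scan never evaluates the if condition, so flash_attn lands in the list even on a machine where that branch would never run. Filtering the list, as the workaround above does, sidesteps the check without changing what the model imports at runtime.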
Oh, thanks buddy! By the way, I was going to build a node for Comfy, but now that you've made it I don't have to. Thanks!
Is there a way to run Florence-2 on GPU without flash_attn? I want to fine-tune this model.
I used the same method to run the model on a CPU, and it works, but as you mentioned, I didn't notice any performance difference. I am running this model on Kaggle, and it takes more than 30 seconds to give a response. Now I am trying to quantize the model to reduce the inference time, but I don't know how to quantize it for the CPU. Can anyone suggest something?
For CPU, GGUF is better. You can use GGUFmyrepo to turn any model into GGUF.
Maybe try other quantization techniques.
Can we do it manually? Also, what about ONNX? I have already seen the Transformers.js ONNX model, but it is for JavaScript. I want to run it in Python without using JavaScript.
ONNX is not quantization. It is an interchange format that lets a model written in one framework be read in another, for example a PyTorch model in TensorFlow. https://pytorch.org/tutorials//beginner/onnx/export_simple_model_to_onnx_tutorial.html
You can apply quantization techniques such as GPTQ or bitsandbytes and push the result to HF. These still require a GPU, though less memory is needed to load the model. https://huggingface.co./blog/merve/quantization
I have already quantized my model to qint8 using PyTorch, but its inference time ranges from a minimum of 20 seconds to a maximum of 30 seconds. I want to reduce this time to 5 seconds or less. Can anyone suggest how I can convert this model to ONNX?
Original model
Qint8 Quantized model
I think ONNX will help me reduce inference time on CPU.
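When comparing the original and qint8 models, it may be worth timing both the same way, with a warm-up run and several repeats, so a single slow call doesn't skew the numbers. A minimal sketch using only the standard library; time_call is a hypothetical helper, and the workload in the example is a placeholder rather than a real model call.

```python
import statistics
import time
from typing import Callable


def time_call(fn: Callable[[], object], warmup: int = 1, repeats: int = 5) -> dict:
    """Run fn several times and report latency in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded (lazy initialization, caches)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Placeholder workload; swap in a lambda that calls your model's generate().
    print(time_call(lambda: sum(range(100_000))))
```

Running it once per model variant with the same input gives min/median/max figures that can be compared directly against the 5-second target.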
Why is your Florence model ginormous?
It's only 1.5 GB for large and ~500 MB for base.
Anyways,
you can convert it to ONNX first and then quantize it to q4 to further reduce latency, and get it to work with ctransformers by converting it to GGML/GGUF.
Maybe
Yes, I'm using the large model and printing its size after loading so I can compare the sizes of the different models. However, the provided code snippet uses bitsandbytes quantization, which is for GPUs, while I'd like to use CPU quantization.
I tried to export the model to ONNX format, but I'm facing an error. You can see it in this notebook: https://www.kaggle.com/code/vishalkatheriya/onnx-florence
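On the size question raised above: one possible explanation (a guess, not something confirmed in this thread) is the dtype the weights are held in after loading. The quoted 1.5 GB and ~500 MB figures are roughly what 2 bytes per parameter would give, while float32 doubles that. A rough arithmetic sketch, with parameter counts assumed and rounded:

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def estimate_weights_gb(n_params: int, dtype: str) -> float:
    """Approximate weight storage in GB (10**9 bytes), ignoring buffers and overhead."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumed, rounded parameter counts; check the model cards for exact figures.
ASSUMED_PARAMS = {"base": 230_000_000, "large": 770_000_000}

if __name__ == "__main__":
    for name, n_params in ASSUMED_PARAMS.items():
        for dtype in BYTES_PER_PARAM:
            size = estimate_weights_gb(n_params, dtype)
            print(f"{name:5s} {dtype:8s} ~{size:.2f} GB")
```

Under these assumptions the large model comes out near 1.54 GB in float16 and about twice that in float32, so a size printed after loading can legitimately differ from the checkpoint size on disk.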