Can we run inference without flash attention

#9
by VitoVikram - opened

Is there any way to run inference on the model without having to install the flash-attention package? I get the error below:

ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co./docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.

Installing flash_attn also seems to run forever.


On the model card it says to set attn_implementation='eager', but this did not work out for me...

I am not able to use the model for inference at all because of this issue.
Are you able to use it?

  1. Change `"_attn_implementation"` to `"eager"` in `config.json`.
  2. Remove `attn_implementation='flash_attention_2'` from the inference Python code (see the sketch below).

Then you don't have to use flash attention.
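
A minimal sketch of step 2, assuming a standard Transformers loading path; `your-org/your-model` is a placeholder for the actual repo id, and the dtype / device settings are illustrative, not taken from the model card:

```python
# Sketch: load the model without flash_attn by requesting eager attention.
# "your-org/your-model" is a placeholder -- substitute the actual repo id.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "your-org/your-model"  # placeholder

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,     # illustrative dtype
    trust_remote_code=True,
    attn_implementation="eager",    # instead of "flash_attention_2"
    device_map="auto",
)
```

Passing `attn_implementation="eager"` at load time should take precedence over `config.json`, so editing the file (step 1) and overriding the kwarg both end up with the same result.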

There is an OOM issue if you use a large image as input, because the preprocessor uses `"dynamic_hd": 36` in `preprocessor_config.json` and will send up to 36 image patches to the language model. Lower this value if you also hit the issue (a sketch follows).
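
A minimal sketch of that change, assuming the model files have been downloaded to a local directory (`local-model-dir` is a placeholder path, and 12 is an arbitrary example value, not a recommendation):

```python
# Sketch: lower "dynamic_hd" in a local copy of preprocessor_config.json so
# large images are split into fewer patches and use less VRAM.
import json
from pathlib import Path

cfg_path = Path("local-model-dir") / "preprocessor_config.json"  # placeholder path
cfg = json.loads(cfg_path.read_text())

cfg["dynamic_hd"] = 12  # default is 36; a smaller value means fewer patches sent to the LM
cfg_path.write_text(json.dumps(cfg, indent=2))
```

Then load the model and processor from `local-model-dir` instead of the Hub repo so the edited config is picked up.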

I have tested this on my AMD RX 7900 XT in WSL2, but VQA in Chinese does not seem to work well.
