Can we run inference without flash attention

#9
by VitoVikram - opened

Is there any way to run inference on the model without having to install the flash-attention package? I get the error below:

ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co./docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.

Installing flash_attn also seems to run forever.


On the model card it says to set attn_implementation='eager', but this did not work out for me...

I am not able to use the model for inference at all because of this issue.
Are you able to use it?

  1. Change `"_attn_implementation"` to `"eager"` in `config.json`.
  2. Remove `attn_implementation='flash_attention_2'` from the inference Python code (see the sketch below).

Then you don't have to use flash attention.
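
A minimal sketch of step 2, assuming a standard Transformers loading path; `your-org/your-model` is a placeholder for the actual repo id, and the dtype / device settings are illustrative, not taken from the model card:

```python
# Sketch: load the model without flash_attn by requesting eager attention.
# "your-org/your-model" is a placeholder -- substitute the actual repo id.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "your-org/your-model"  # placeholder

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,     # illustrative dtype
    trust_remote_code=True,
    attn_implementation="eager",    # instead of "flash_attention_2"
    device_map="auto",
)
```

Passing `attn_implementation="eager"` at load time should take precedence over `config.json`, so editing the file (step 1) and overriding the kwarg both end up with the same result.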

There is an OOM issue if you use a large image as input, because the preprocessor uses `"dynamic_hd": 36` in `preprocessor_config.json` and will send up to 36 image patches to the language model. Lower this value if you also hit the issue (a sketch follows).
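
A minimal sketch of that change, assuming the model files have been downloaded to a local directory (`local-model-dir` is a placeholder path, and 12 is an arbitrary example value, not a recommendation):

```python
# Sketch: lower "dynamic_hd" in a local copy of preprocessor_config.json so
# large images are split into fewer patches and use less VRAM.
import json
from pathlib import Path

cfg_path = Path("local-model-dir") / "preprocessor_config.json"  # placeholder path
cfg = json.loads(cfg_path.read_text())

cfg["dynamic_hd"] = 12  # default is 36; a smaller value means fewer patches sent to the LM
cfg_path.write_text(json.dumps(cfg, indent=2))
```

Then load the model and processor from `local-model-dir` instead of the Hub repo so the edited config is picked up.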

I have tested this on my AMD RX 7900 XT in WSL2, but VQA in Chinese does not seem to work well.
