Speed-up methods
Replit is an amazing model, as it can generate valid results even though it is only 2.7B parameters.
However, I have trouble accelerating Replit inference. These are the methods I have tried, all of which failed:
- torch 2.0: easy to set up, but no speed-up over torch 1.13 (see the first sketch after this list)
- flash_attn: does not support ALiBi, so no compatible model weights are available
- triton: I hit the same error described in https://huggingface.co./mosaicml/mpt-7b-storywriter/discussions/10
- deepspeed: got it working, but no speed-up over torch 1.13, and it only supports torch.float16 rather than torch.bfloat16, which is slower on A100 (see the second sketch after this list)
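For reference, a minimal sketch of the torch 2.0 attempt (assuming the replit/replit-code-v1-3b checkpoint; the prompt and generation parameters are illustrative, not a confirmed recipe):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The Replit checkpoint ships custom modeling code, so trust_remote_code
# is required to load it.
model_id = "replit/replit-code-v1-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda").eval()

# torch 2.0: wrap the model with torch.compile. In my tests this gave
# no measurable speed-up over torch 1.13 for generation.
model = torch.compile(model)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The deepspeed attempt replaces the compile step with DeepSpeed-Inference kernel injection (whether injection fully covers the Replit architecture is an assumption on my part; this path only accepts torch.float16, not torch.bfloat16):

```python
import deepspeed

# DeepSpeed-Inference: pass the uncompiled model here. fp16 only in this
# path, which is why it ran slower than bf16 on A100.
ds_model = deepspeed.init_inference(
    model, dtype=torch.float16, replace_with_kernel_inject=True
)
# Generation then proceeds through ds_model as above.
```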
I am going to try FasterTransformer, although it is hard to set up.
Could you please give me some advice on accelerating inference? Will you release a project to accelerate Replit inference?
I used an NVIDIA P100 GPU, which is based on the Pascal architecture, so it does not support this type of acceleration. Even after switching to an A30, the speed is still slow. I do not know how to accelerate Replit inference. Can the official team give some advice?
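A quick way to check what the GPU actually supports (a rough heuristic; exact requirements vary by kernel, but most fused-attention and bf16 paths want Ampere, i.e. compute capability 8.0+, while the P100 is Pascal at 6.0):

```python
import torch

# Compute capability: Pascal (P100) is 6.0, Ampere (A30/A100) is 8.0.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")

# bfloat16 needs Ampere or newer; on older cards fall back to float16.
print(f"bfloat16 supported: {torch.cuda.is_bf16_supported()}")
```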
Closing this issue, as the OP has already found a solution and posted it on the mpt-7b-storywriter discussion thread: https://huggingface.co./mosaicml/mpt-7b-storywriter/discussions/10#646833123a7c8dda230f87ab