mpt-7b taking several minutes on mac m1?
#88, opened by rmiller3
Can anyone help with this performance problem? Running the pipeline below with max_new_tokens=2 takes 2 to 3 minutes (172 s in the profile below), and each additional token adds about 1 minute. Is this expected on a Mac M1 (CPU, not Metal)? I'm a beginner at this, but other models (gpt2, distilbert-base-cased-distilled-squad) run in a few seconds with similar code.
System: macOS 12.7.1, M1 Pro chip, 16 GB RAM
Included below:
- Source code. It is the example from the model card (https://huggingface.co/mosaicml/mpt-7b) with the CUDA parts removed. I've started reading about Apple Metal, which might be useful, but I'm not sure whether it's required. Example: https://www.mathworks.com/matlabcentral/answers/1744115-cuda-for-m1-macbook-pro
- Warnings
- Profile (cProfile)
- Some of the dependencies (the most relevant is probably torch @ https://download.pytorch.org/whl/cpu/torch-2.1.0-cp311-none-macosx_11_0_arm64.whl). The list is truncated to the more interesting ones to save space.
I've also tried:
- Wrapping the pipeline call in with torch.autocast('cpu', dtype=torch.float32).
- Passing torch_dtype=torch.float32 when loading the model.
- Toggling do_sample and use_cache (I'm still learning what all these options do, and ML pipelines in general).
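For context on the torch_dtype=torch.float32 attempt, here is a rough back-of-envelope estimate of how much memory the weights alone need at different precisions. The parameter count (about 6.7 billion) is an assumption taken from the mpt-7b model card, and the estimate ignores activations, the KV cache, and everything else running on the machine, so treat it as a sketch rather than a measurement.

```python
# Rough weight-memory estimate for a ~7B-parameter model.
# Assumption: ~6.7e9 parameters (figure from the mpt-7b model card).
N_PARAMS = 6.7e9
RAM_GIB = 16  # from the "System" line in this post

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weights_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


for dtype, nbytes in BYTES_PER_PARAM.items():
    gib = weights_gib(N_PARAMS, nbytes)
    verdict = "fits in" if gib < RAM_GIB else "exceeds"
    print(f"{dtype:>8}: {gib:5.1f} GiB of weights, {verdict} {RAM_GIB} GiB RAM")
```

If the float32 figure exceeds physical RAM, the operating system would have to page weights to and from disk on every forward pass, which is one possible explanation for minute-long tokens. That is a hypothesis, not something this post confirms.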
code:
import cProfile
import time
from datetime import datetime
from unittest import IsolatedAsyncioTestCase

import transformers


class Unittest(IsolatedAsyncioTestCase):
    # print_duration is a timing helper defined elsewhere in this file (not shown).
    async def test_demo_mpt_7b_performance(self):
        # Load the model on CPU with default settings (no device or dtype arguments).
        model = transformers.AutoModelForCausalLM.from_pretrained(
            "mosaicml/mpt-7b",
            trust_remote_code=True)
        tokenizer = transformers.AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
        pipe = transformers.pipeline('text-generation', model=model, tokenizer=tokenizer)
        print(f"starting pipe__at__{datetime.now().time()}")
        # Profile only the generation call, not model loading.
        with cProfile.Profile() as pr:
            res = await self.print_duration(pipe,
                                            "Here is a recipe for vegan banana bread",
                                            max_new_tokens=2,
                                            do_sample=False,
                                            use_cache=True)
            pr.print_stats("cumulative")
        print(res)
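The print_duration helper called above is not included in the post. For anyone trying to reproduce the test, a minimal stand-in might look like the sketch below. It assumes the helper times a blocking call with both wall-clock and process (CPU) time and prints a line shaped like the "Took ..." output in the profile. The name matches the post, but the body is a guess, and it is written as a free function rather than a method.

```python
import asyncio
import time
from datetime import datetime


async def print_duration(fn, *args, **kwargs):
    """Hypothetical stand-in for the post's helper; not the author's code."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process
    result = fn(*args, **kwargs)       # blocking call, as pipe(...) would be
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    print(f"Took {wall:.2f}s, with {cpu:.2f} s of process time"
          f"__at__{datetime.now().time()}")
    return result


# Usage with a cheap stand-in for the pipeline call.
total = asyncio.run(print_duration(sum, [1, 2, 3]))
print(total)
```

Reporting both timers is useful here because a large gap between wall time and process time indicates the process was waiting rather than computing.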
Warnings:
.../mosaicml/mpt-7b/ada218f9a93b5f1c6dce48a4cc9ff01fcba431e7/configuration_mpt.py:90: DeprecationWarning: verbose argument for MPTConfig is now ignored and will be removed. Use python_log_level instead.
.../mosaicml/mpt-7b/ada218f9a93b5f1c6dce48a4cc9ff01fcba431e7/configuration_mpt.py:97: UserWarning: alibi is turned on, setting `learned_pos_emb` to `False.`
...einops/_torch_specific.py:108: ImportWarning: allow_ops_in_compiled_graph failed to import torch: ensure pytorch >=2.0
warnings.warn("allow_ops_in_compiled_graph failed to import torch: ensure pytorch >=2.0", ImportWarning)
...Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
...utils.py:1518: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
Profile:
Took 172.56s, with 43.58 s of process time__at__15:45:26.780838
28817 function calls (26003 primitive calls) in 172.566 seconds
Ordered by: cumulative time
 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.002    0.002  172.567  172.567 hugging_face_forum_performance.py:33(print_duration)
      1    0.000    0.000  172.560  172.560 text_generation.py:167(__call__)
      1    0.001    0.001  172.559  172.559 base.py:1077(__call__)
      1    0.001    0.001  172.559  172.559 base.py:1145(run_single)
      1    0.002    0.002  172.510  172.510 base.py:1037(forward)
      1    0.001    0.001  172.504  172.504 text_generation.py:240(_forward)
    3/1    0.003    0.001  172.502  172.502 _contextlib.py:112(decorate_context)
      1    0.008    0.008  172.498  172.498 utils.py:1395(generate)
      1    0.006    0.006  172.487  172.487 utils.py:2411(greedy_search)
  780/2    0.003    0.000  172.432   86.216 module.py:1514(_wrapped_call_impl)
  780/2    0.016    0.000  172.432   86.216 module.py:1520(_call_impl)
      2    0.001    0.000  172.432   86.216 modeling_mpt.py:269(forward)
    258  172.160    0.667  172.160    0.667 {built-in method torch._C._nn.linear}
      2    0.010    0.005  167.512   83.756 modeling_mpt.py:146(forward)
     64    0.012    0.000  167.471    2.617 blocks.py:32(forward)
    256    0.002    0.000  167.244    0.653 linear.py:113(forward)
     64    0.002    0.000  112.702    1.761 ffn.py:23(forward)
     64    0.004    0.000   54.681    0.854 attention.py:263(forward)
      4    0.000    0.000    4.923    1.231 custom_embedding.py:7(forward)
     64    0.009    0.000    0.099    0.002 attention.py:48(scaled_multihead_dot_product_attention)
    130    0.003    0.000    0.058    0.000 norm.py:20(forward)
     68    0.052    0.001    0.052    0.001 {built-in method torch.cat}
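The profile can be summarized with a little arithmetic. The sketch below uses only numbers copied from the output above. The variable names are illustrative, and the reading of "process time" as CPU time for the whole process is an assumption about what print_duration reports.

```python
# Figures copied from the cProfile output above.
total_s = 172.566        # total profiled wall time
linear_s = 172.160       # tottime of {built-in method torch._C._nn.linear}
linear_calls = 258       # ncalls of torch._C._nn.linear
forward_passes = 2       # ncalls of modeling_mpt.py:269(forward)
process_s = 43.58        # "process time" from the "Took ..." line

linear_share = linear_s / total_s          # fraction of time in matrix multiplies
per_linear_call_s = linear_s / linear_calls
per_forward_s = total_s / forward_passes   # roughly one generated token each
cpu_share = process_s / total_s            # CPU time relative to wall time

print(f"time inside linear:     {linear_share:.1%}")
print(f"seconds per linear op:  {per_linear_call_s:.3f}")
print(f"seconds per forward:    {per_forward_s:.1f}")
print(f"process time / wall:    {cpu_share:.1%}")
```

Almost all of the wall time is spent inside the linear layers, while the Python overhead is negligible. If the process-time figure really is CPU time, the process was busy for only about a quarter of the wall time, which would be consistent with waiting on memory paging rather than with slow arithmetic. The profile alone does not prove that.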
Dependencies:
accelerate==0.25.0
einops==0.7.0
numpy==1.26.2
pandas==2.1.4
pydantic==2.5.2
pydantic_core==2.14.5
python-dateutil==2.8.2
pytz==2023.3.post1
safetensors==0.4.1
tokenizers==0.15.0
torch @ https://download.pytorch.org/whl/cpu/torch-2.1.0-cp311-none-macosx_11_0_arm64.whl
tqdm==4.66.1
transformers==4.36.2