Could we run an XOR-converted model using docker + huggingface/text-generation-inference?
I wanted to use a docker command inspired by the one here, like
```
docker run --gpus "device=0" -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:sha-7a1ba58 --model-id OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
```
but there is no `OpenAssistant/oasst-sft-7-llama-30b` repository on the Hub, so we have to use the cached model instead. Is it possible to simply symlink the model into the data folder and use a tag that would map to the folder name?
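One caveat with the symlink idea, as a rough sketch rather than anything from the text-generation-inference docs: a bind mount like `-v $PWD/data:/data` only exposes what physically lives under `data`, so a symlink whose target sits elsewhere on the host will be dangling inside the container. The helper below (`visible_in_container` is a hypothetical name, and the check is a conservative heuristic) can be run on the host before starting docker:

```python
import os
from pathlib import Path


def visible_in_container(entry: Path, mount_root: Path) -> bool:
    """Rough host-side check: will `entry` still resolve under a bind mount?

    Hypothetical helper for illustration; not part of Docker or
    text-generation-inference.
    """
    root = mount_root.resolve()
    target = entry.resolve()  # follow symlinks on the host
    inside = target == root or root in target.parents
    if entry.is_symlink() and os.path.isabs(os.readlink(entry)):
        # An absolute host path will usually not exist at the same
        # location inside the container, so treat it as not visible.
        return False
    return inside
```

If this returns False for the model folder, copying or moving the real folder into `data`, or mounting the model folder directly with its own `-v`, is likely the simpler route.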
@olivierdehaene
Have you tried anything like this?
I'm literally one step behind you at this very moment, was just reading the details of huggingface/text-generation-inference and thinking about what I needed to do to run it on the MPS device rather than CUDA.
I guess you saw the end of this page:
https://huggingface.co/spaces/huggingchat/chat-ui/blob/main/README.md where it talks about running local inference
I think it is possible to do this in a simpler way, without adapters or additional dependencies; I have managed to do so with existing pythia models just fine. Looking into it further, I tried
docker run --gpus "device=0" -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:sha-7a1ba58 --model-id OpenAssistant/oasst-sft-7-llama-30b
and got
Repository Not Found for url: https://huggingface.co/OpenAssistant/oasst-sft-7-llama-30b/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
In particular, I took the folder name from another part of the same error message:
OSError: OpenAssistant/oasst-sft-7-llama-30b is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
which told me that the text inference server would expect an `OpenAssistant/oasst-sft-7-llama-30b` folder or symlink in its data directory. The data directory itself comes from
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
in https://github.com/huggingface/text-generation-inference#docker.
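The error message reads like a two-way decision: the `--model-id` value is either an existing local folder or it gets treated as a Hub repository id. Here is a minimal sketch of that decision, assuming it boils down to a plain directory check (`classify_model_id` and `working_dir` are hypothetical names, and this is not the library's actual code):

```python
from pathlib import Path


def classify_model_id(model_id: str, working_dir: Path) -> str:
    """Return 'local' if model_id names an existing folder, else 'hub'.

    Illustrative only: `working_dir` stands in for the server process's
    current directory, which is an assumption of this sketch.
    """
    candidate = Path(model_id).expanduser()
    if not candidate.is_absolute():
        # A relative name is looked up from the working directory,
        # not from the mounted data directory.
        candidate = working_dir / candidate
    return "local" if candidate.is_dir() else "hub"
```

If that assumption holds, a relative `OpenAssistant/oasst-sft-7-llama-30b` would only count as local when the process happens to run from the directory that contains it, so passing an absolute path under the mount point (something like `/data/<folder>`) may be the safer thing to try.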
Alas, this is still not enough, and perhaps we need a different image hash or more tweaks.
If anyone has an idea, I would love to hear it. One issue I noticed is that oasst-sft-7-llama-30b and text-generation-inference require different versions of the transformers package. Most notably, text-generation-inference requires a transformers build that includes the bloom `parallel_layers` module. When I ran inference with the version required by oasst, the error was:
ModuleNotFoundError: No module named 'transformers.models.bloom.parallel_layers'
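A quick way to tell which transformers build is active in a given environment, sketched with only the standard library (`module_available` is a hypothetical helper name):

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if `dotted_name` can be located in this environment."""
    try:
        # find_spec imports parent packages but not the module itself.
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. transformers) is missing.
        return False
```

Calling `module_available("transformers.models.bloom.parallel_layers")` inside the virtual environment or the container should show whether the installed transformers has the module that the error above complains about.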
And here's the error log from when I installed text-generation-inference locally, in a virtual environment rather than docker. I built and installed it with `make install`, then ran the following:
text-generation-launcher --model-id ~/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b
It was looking good: it found the model and converted it to safetensors, but then...
2023-05-04T04:44:17.786682Z INFO text_generation_launcher: Args { model_id: "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b", revision: None, sharded: None, num_shard: None, quantize: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }
2023-05-04T04:44:17.786799Z INFO text_generation_launcher: Starting download process.
2023-05-04T04:44:19.031428Z WARN download: text_generation_launcher: No safetensors weights found for model /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b at revision None. Converting PyTorch weights to safetensors.
2023-05-04T04:44:19.031609Z INFO download: text_generation_launcher: Convert /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/pytorch_model-00007-of-00007.bin to /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/model-00007-of-00007.safetensors.
2023-05-04T04:44:19.031778Z INFO download: text_generation_launcher: Convert /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/pytorch_model-00006-of-00007.bin to /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/model-00006-of-00007.safetensors.
2023-05-04T04:44:19.031871Z INFO download: text_generation_launcher: Convert /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/pytorch_model-00001-of-00007.bin to /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/model-00001-of-00007.safetensors.
2023-05-04T04:44:19.031959Z INFO download: text_generation_launcher: Convert /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/pytorch_model-00003-of-00007.bin to /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/model-00003-of-00007.safetensors.
2023-05-04T04:44:19.032117Z INFO download: text_generation_launcher: Convert /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/pytorch_model-00004-of-00007.bin to /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/model-00004-of-00007.safetensors.
2023-05-04T04:45:14.801473Z INFO download: text_generation_launcher: Convert /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/pytorch_model-00005-of-00007.bin to /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/model-00005-of-00007.safetensors.
2023-05-04T04:45:14.801596Z INFO download: text_generation_launcher: Convert: [1/7] -- ETA: 0:05:30
2023-05-04T04:46:09.529107Z INFO download: text_generation_launcher: Convert /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/pytorch_model-00002-of-00007.bin to /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/model-00002-of-00007.safetensors.
2023-05-04T04:46:09.529708Z INFO download: text_generation_launcher: Convert: [2/7] -- ETA: 0:04:35
2023-05-04T04:46:09.706104Z INFO download: text_generation_launcher: Convert: [3/7] -- ETA: 0:02:26.666668
2023-05-04T04:46:11.292773Z INFO download: text_generation_launcher: Convert: [4/7] -- ETA: 0:01:24
2023-05-04T04:46:11.419401Z INFO download: text_generation_launcher: Convert: [5/7] -- ETA: 0:00:44.800000
2023-05-04T04:46:11.762799Z INFO download: text_generation_launcher: Convert: [6/7] -- ETA: 0:00:18.666667
2023-05-04T04:46:19.904766Z INFO download: text_generation_launcher: Convert: [7/7] -- ETA: 0
2023-05-04T04:46:20.296744Z INFO text_generation_launcher: Successfully downloaded weights.
2023-05-04T04:46:20.297360Z INFO text_generation_launcher: Starting shard 0
2023-05-04T04:46:30.316717Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:46:40.352174Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:46:50.377670Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:47:00.402327Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:47:10.440521Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:47:20.505151Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:47:30.597273Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:47:40.630386Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:47:50.674882Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:48:00.677545Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:48:10.684952Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:48:20.794592Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:48:30.798020Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:48:40.848794Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:48:50.913512Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:49:00.971170Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:49:11.051961Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:49:21.139972Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:49:31.189893Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:49:41.243413Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:49:51.329588Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:50:01.396801Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:50:04.841977Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/Users/kronosprime/Workspace/LLM/text-generation-inference/server/text_generation_server/cli.py", line 58, in serve
server.serve(model_id, revision, sharded, quantize, uds_path)
File "/Users/kronosprime/Workspace/LLM/text-generation-inference/server/text_generation_server/server.py", line 155, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize))
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py", line 629, in run_until_complete
self.run_forever()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py", line 596, in run_forever
self._run_once()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py", line 1890, in _run_once
handle._run()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/Users/kronosprime/Workspace/LLM/text-generation-inference/server/text_generation_server/server.py", line 124, in serve_inner
model = get_model(model_id, revision, sharded, quantize)
File "/Users/kronosprime/Workspace/LLM/text-generation-inference/server/text_generation_server/models/__init__.py", line 137, in get_model
return llama_cls(model_id, revision, quantize=quantize)
File "/Users/kronosprime/Workspace/LLM/text-generation-inference/server/text_generation_server/models/causal_lm.py", line 479, in __init__
super(CausalLM, self).__init__(
File "/Users/kronosprime/Workspace/LLM/text-generation-inference/server/text_generation_server/models/model.py", line 26, in __init__
self.all_special_ids = set(tokenizer.all_special_ids)
File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/transformers-4.29.0.dev0-py3.9.egg/transformers/tokenization_utils_base.py", line 1299, in all_special_ids
all_ids = self.convert_tokens_to_ids(all_toks)
File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/transformers-4.29.0.dev0-py3.9.egg/transformers/tokenization_utils_fast.py", line 254, in convert_tokens_to_ids
ids.append(self._convert_token_to_id_with_added_voc(token))
File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/transformers-4.29.0.dev0-py3.9.egg/transformers/tokenization_utils_fast.py", line 260, in _convert_token_to_id_with_added_voc
return self.unk_token_id
File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/transformers-4.29.0.dev0-py3.9.egg/transformers/tokenization_utils_base.py", line 1142, in unk_token_id
return self.convert_tokens_to_ids(self.unk_token)
File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/transformers-4.29.0.dev0-py3.9.egg/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids