Could we run an XOR-converted model using docker + huggingface/text-generation-inference?

#5
by pevogam - opened

I wanted to use a docker command inspired from here, like

```
docker run --gpus "device=0" -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:sha-7a1ba58 --model-id OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
```

but there is no `OpenAssistant/oasst-sft-7-llama-30b` repository on the Hub (only the XOR weights), so we have to use the locally converted model instead. Is it possible to simply symlink the model into the data folder and use a tag that would map to the folder name?
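
One idea: since `--model-id` appears to also accept a plain filesystem path, we could bind-mount the locally XOR-decoded weights into the container and point `--model-id` at the in-container path. A minimal sketch, assuming the decoded weights live at `~/models/oasst-sft-7-llama-30b` (a hypothetical path):

```
# Assumed layout: XOR-decoded weights already at ~/models/oasst-sft-7-llama-30b.
# A symlink under $PWD/data that points outside the mounted volume would be
# dangling inside the container, so bind-mount (or copy) instead of symlinking.
docker run --gpus "device=0" -p 8080:80 \
    -v ~/models/oasst-sft-7-llama-30b:/data/oasst-sft-7-llama-30b \
    ghcr.io/huggingface/text-generation-inference:sha-7a1ba58 \
    --model-id /data/oasst-sft-7-llama-30b
```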



@olivierdehaene Have you tried anything like this?

I'm literally one step behind you at this very moment; I was just reading the details of huggingface/text-generation-inference and thinking about what I'd need to do to run it on the MPS device rather than CUDA.

I guess you saw the end of this page, where it talks about running local inference: https://huggingface.co./spaces/huggingchat/chat-ui/blob/main/README.md

I think it is possible to do this in a simpler way, without adapters and additional dependencies, and I have managed to do so with existing pythia models just fine. Upon further inspection of the situation here, I tried

```
docker run --gpus "device=0" -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:sha-7a1ba58 --model-id OpenAssistant/oasst-sft-7-llama-30b
```

and got

```
Repository Not Found for url: https://huggingface.co./OpenAssistant/oasst-sft-7-llama-30b/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
```
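
The authentication hint at the end can be ruled out here: a private or gated repo would indeed need a Hub token passed into the container (the text-generation-inference README documents a `HUGGING_FACE_HUB_TOKEN` environment variable for that, if I read it right), but the decoded repo simply does not exist on the Hub, so no token will help:

```
# only relevant for private/gated repos, not for a repo that does not exist;
# <token> is a placeholder for a Hub READ token
docker run --gpus "device=0" -p 8080:80 -v $PWD/data:/data \
    -e HUGGING_FACE_HUB_TOKEN=<token> \
    ghcr.io/huggingface/text-generation-inference:sha-7a1ba58 \
    --model-id OpenAssistant/oasst-sft-7-llama-30b
```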

In particular, I chose the folder name from another part of the same error message:

```
OSError: OpenAssistant/oasst-sft-7-llama-30b is not a local folder and is not a valid model identifier listed on 'https://huggingface.co./models'
```

which told me that the text-generation-inference server expects an `OpenAssistant/oasst-sft-7-llama-30b` folder (or a symlink to one) in its data directory, the latter taken from

```
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
```

in https://github.com/huggingface/text-generation-inference#docker.
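
One caveat with the symlink idea: inside that volume the Hub cache does not store repos under their plain names. huggingface_hub uses a hashed layout, so a bare `OpenAssistant/oasst-sft-7-llama-30b` folder is unlikely to be picked up as a cached repo. From one of my earlier pythia runs, the volume looks roughly like this (the snapshot directory name is a commit hash, shown here as a placeholder):

```
$ ls $PWD/data
models--OpenAssistant--oasst-sft-4-pythia-12b-epoch-3.5
$ ls $PWD/data/models--OpenAssistant--oasst-sft-4-pythia-12b-epoch-3.5
blobs  refs  snapshots
$ ls $PWD/data/models--OpenAssistant--oasst-sft-4-pythia-12b-epoch-3.5/snapshots
<commit-hash>
```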

Alas, this is still not enough and perhaps we need a different reference hash or more tweaks.

If anyone has an idea, I would love to hear it. One issue I noticed is that oasst-sft-7-llama-30b and text-generation-inference require different versions of the transformers package. Most notably, text-generation-inference expects the transformers library to provide the bloom `parallel_layers` module. When I ran the inference with the version required by oasst, the error was:

```
ModuleNotFoundError: No module named 'transformers.models.bloom.parallel_layers'
```
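
Before launching, a quick way to check whether the currently installed transformers build provides the module the server tries to import, and which version is installed:

```
# check for the module text-generation-inference imports, then show the version
python -c "import transformers.models.bloom.parallel_layers" \
    && echo "parallel_layers present" \
    || echo "parallel_layers missing"
pip show transformers | grep -i ^version
```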

And here's the error log from when I installed text-generation-inference locally, in a virtual environment rather than docker. I built and installed it with `make install`, and ran the following:

```
text-generation-launcher --model-id ~/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b
```

It was looking good: it found the model and converted it to safetensors, but then...

```
2023-05-04T04:44:17.786682Z  INFO text_generation_launcher: Args { model_id: "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b", revision: None, sharded: None, num_shard: None, quantize: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }
2023-05-04T04:44:17.786799Z  INFO text_generation_launcher: Starting download process.
2023-05-04T04:44:19.031428Z  WARN download: text_generation_launcher: No safetensors weights found for model /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b at revision None. Converting PyTorch weights to safetensors.
2023-05-04T04:44:19.031609Z  INFO download: text_generation_launcher: Convert /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/pytorch_model-00007-of-00007.bin to /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/model-00007-of-00007.safetensors.
2023-05-04T04:44:19.031778Z  INFO download: text_generation_launcher: Convert /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/pytorch_model-00006-of-00007.bin to /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/model-00006-of-00007.safetensors.
2023-05-04T04:44:19.031871Z  INFO download: text_generation_launcher: Convert /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/pytorch_model-00001-of-00007.bin to /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/model-00001-of-00007.safetensors.
2023-05-04T04:44:19.031959Z  INFO download: text_generation_launcher: Convert /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/pytorch_model-00003-of-00007.bin to /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/model-00003-of-00007.safetensors.
2023-05-04T04:44:19.032117Z  INFO download: text_generation_launcher: Convert /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/pytorch_model-00004-of-00007.bin to /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/model-00004-of-00007.safetensors.
2023-05-04T04:45:14.801473Z  INFO download: text_generation_launcher: Convert /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/pytorch_model-00005-of-00007.bin to /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/model-00005-of-00007.safetensors.
2023-05-04T04:45:14.801596Z  INFO download: text_generation_launcher: Convert: [1/7] -- ETA: 0:05:30
2023-05-04T04:46:09.529107Z  INFO download: text_generation_launcher: Convert /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/pytorch_model-00002-of-00007.bin to /Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/oasst-sft-7-llama-30b/model-00002-of-00007.safetensors.
2023-05-04T04:46:09.529708Z  INFO download: text_generation_launcher: Convert: [2/7] -- ETA: 0:04:35
2023-05-04T04:46:09.706104Z  INFO download: text_generation_launcher: Convert: [3/7] -- ETA: 0:02:26.666668
2023-05-04T04:46:11.292773Z  INFO download: text_generation_launcher: Convert: [4/7] -- ETA: 0:01:24
2023-05-04T04:46:11.419401Z  INFO download: text_generation_launcher: Convert: [5/7] -- ETA: 0:00:44.800000
2023-05-04T04:46:11.762799Z  INFO download: text_generation_launcher: Convert: [6/7] -- ETA: 0:00:18.666667
2023-05-04T04:46:19.904766Z  INFO download: text_generation_launcher: Convert: [7/7] -- ETA: 0
2023-05-04T04:46:20.296744Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-05-04T04:46:20.297360Z  INFO text_generation_launcher: Starting shard 0
2023-05-04T04:46:30.316717Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:46:40.352174Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:46:50.377670Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:47:00.402327Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:47:10.440521Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:47:20.505151Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:47:30.597273Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:47:40.630386Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:47:50.674882Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:48:00.677545Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:48:10.684952Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:48:20.794592Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:48:30.798020Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:48:40.848794Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:48:50.913512Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:49:00.971170Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:49:11.051961Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:49:21.139972Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:49:31.189893Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:49:41.243413Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:49:51.329588Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:50:01.396801Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-05-04T04:50:04.841977Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Users/kronosprime/Workspace/LLM/text-generation-inference/server/text_generation_server/cli.py", line 58, in serve
    server.serve(model_id, revision, sharded, quantize, uds_path)
  File "/Users/kronosprime/Workspace/LLM/text-generation-inference/server/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize))
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py", line 629, in run_until_complete
    self.run_forever()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py", line 596, in run_forever
    self._run_once()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py", line 1890, in _run_once
    handle._run()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/Users/kronosprime/Workspace/LLM/text-generation-inference/server/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize)
  File "/Users/kronosprime/Workspace/LLM/text-generation-inference/server/text_generation_server/models/__init__.py", line 137, in get_model
    return llama_cls(model_id, revision, quantize=quantize)
  File "/Users/kronosprime/Workspace/LLM/text-generation-inference/server/text_generation_server/models/causal_lm.py", line 479, in __init__
    super(CausalLM, self).__init__(
  File "/Users/kronosprime/Workspace/LLM/text-generation-inference/server/text_generation_server/models/model.py", line 26, in __init__
    self.all_special_ids = set(tokenizer.all_special_ids)
  File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/transformers-4.29.0.dev0-py3.9.egg/transformers/tokenization_utils_base.py", line 1299, in all_special_ids
    all_ids = self.convert_tokens_to_ids(all_toks)
  File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/transformers-4.29.0.dev0-py3.9.egg/transformers/tokenization_utils_fast.py", line 254, in convert_tokens_to_ids
    ids.append(self._convert_token_to_id_with_added_voc(token))
  File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/transformers-4.29.0.dev0-py3.9.egg/transformers/tokenization_utils_fast.py", line 260, in _convert_token_to_id_with_added_voc
    return self.unk_token_id
  File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/transformers-4.29.0.dev0-py3.9.egg/transformers/tokenization_utils_base.py", line 1142, in unk_token_id
    return self.convert_tokens_to_ids(self.unk_token)
  File "/Users/kronosprime/Workspace/oasst-sft-7-llama-30b-xor/xor_venv/lib/python3.9/site-packages/transformers-4.29.0.dev0-py3.9.egg/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids
```
