SorawitChok committed (verified)
Commit 8094257 · 1 Parent(s): 95eec68

Update README.md

Files changed (1):
  1. README.md +1 -73
README.md CHANGED
@@ -46,78 +46,6 @@ SeaLLMs is tailored for handling a wide range of languages spoken in the SEA reg
  This page introduces the SeaLLMs-v3-7B-Chat model, specifically fine-tuned to follow human instructions effectively for task completion, making it directly applicable to your applications.
 
 
- ### Get started with `Transformers`
-
- To quickly try the model, we show how to conduct inference with `transformers` below. Make sure you have installed the latest transformers version (>4.40).
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- device = "cuda" # the device to load the model onto
-
- model = AutoModelForCausalLM.from_pretrained(
-     "SeaLLMs/SeaLLM3-7B-chat",
-     torch_dtype=torch.bfloat16,
-     device_map=device
- )
- tokenizer = AutoTokenizer.from_pretrained("SeaLLMs/SeaLLM3-7B-chat")
-
- # prepare messages to model
- prompt = "Hiii How are you?"
- messages = [
-     {"role": "system", "content": "You are a helpful assistant."},
-     {"role": "user", "content": prompt}
- ]
-
- text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- model_inputs = tokenizer([text], return_tensors="pt").to(device)
- print(f"Formatted text:\n {text}")
- print(f"Model input:\n {model_inputs}")
-
- generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True, eos_token_id=tokenizer.eos_token_id)
- generated_ids = [
-     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
- ]
- response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
-
- print(f"Response:\n {response[0]}")
- ```
-
- You can also utilize the following code snippet, which uses the streamer `TextStreamer` to enable the model to continue conversing with you:
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- from transformers import TextStreamer
-
- device = "cuda" # the device to load the model onto
-
- model = AutoModelForCausalLM.from_pretrained(
-     "SeaLLMs/SeaLLM3-7B-chat",
-     torch_dtype=torch.bfloat16,
-     device_map=device
- )
- tokenizer = AutoTokenizer.from_pretrained("SeaLLMs/SeaLLM3-7B-chat")
-
- # prepare messages to model
- messages = [
-     {"role": "system", "content": "You are a helpful assistant."},
- ]
-
- while True:
-     prompt = input("User:")
-     messages.append({"role": "user", "content": prompt})
-     text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-     model_inputs = tokenizer([text], return_tensors="pt").to(device)
-
-     streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
-     generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, streamer=streamer)
-     generated_ids = [
-         output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
-     ]
-     response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
-     messages.append({"role": "assistant", "content": response})
- ```
-
  ### Inference with `vllm`
 
  You can also conduct inference with [vllm](https://docs.vllm.ai/en/stable/index.html), which is a fast and easy-to-use library for LLM inference and serving. To use vllm, first install the latest version via `pip install vllm`.
@@ -130,7 +58,7 @@ prompts = [
      "Can you speak Indonesian?"
  ]
 
- llm = LLM(ckpt_path, dtype="bfloat16")
+ llm = LLM("SorawitChok/SeaLLM3-7B-Chat-AWQ", quantization="AWQ")
  sparams = SamplingParams(temperature=0.1, max_tokens=512)
  outputs = llm.generate(prompts, sparams)
 
 
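For context, the diff only touches the line that loads the model; a minimal end-to-end sketch of the updated `vllm` example, assuming the rest of the snippet follows the README, might look like this (the prompt strings are illustrative):

```python
# Minimal sketch of the updated vllm usage. Only the model ID,
# quantization setting, and sampling parameters come from the diff;
# the prompts below are illustrative placeholders.
from vllm import LLM, SamplingParams

prompts = [
    "Can you speak Indonesian?",  # prompt shown in the diff context
]

# Load the AWQ-quantized checkpoint that this commit points the README at.
llm = LLM("SorawitChok/SeaLLM3-7B-Chat-AWQ", quantization="AWQ")
sparams = SamplingParams(temperature=0.1, max_tokens=512)

outputs = llm.generate(prompts, sparams)
for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}")
```

Pointing the example at the AWQ-quantized weights typically gives a much smaller GPU-memory footprint than the original bfloat16 checkpoint, at the cost of a small accuracy drop.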