Safetensors · mistral
h-j-han committed
Commit 191b5d9 · 1 Parent(s): ed2e3e6

Fix new line issue & Match vocab type to base model

README.md CHANGED
@@ -14,7 +14,6 @@ base_model:
 VocADT is a solution for vocabulary adaptation using adapter modules that are trained to learn the optimal linear combination of existing embeddings while keeping the model’s weights fixed.
 VocADT offers a flexible and scalable solution without requiring external resources or language constraints.
 
-
 ## New Vocabulary Adapted Models
 Only the input/output embeddings are replaced, while all other original weights of the base model remain fixed.
 These are the merged versions: after training the adapters, we merge the original embeddings with the adapter to generate the new embeddings.
@@ -29,10 +28,10 @@ These are the merged versions: after training the adapters, we merge the original
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-# model_name = "mistralai/Mistral-7B-v0.1 # Base Model
+# model_name = "mistralai/Mistral-7B-v0.1" # Base Model
 model_name = "h-j-han/Mistral-7B-VocADT-50k-Latin" # Vocabulary Adapted Model
 tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForCausalLM.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
 
 prefix = "\nEnglish: Hello!\nSwahili: Habari!\nEnglish: What's your name?\nSwahili: Jina lako ni nani?\nEnglish: "
 line = "My name is Amani."
@@ -40,6 +39,8 @@ suffix = f"\nSwahili:"
 prompt = prefix + line + suffix
 
 inputs = tokenizer(prompt, return_tensors="pt")
+for item in inputs:
+    inputs[item] = inputs[item].cuda()
 outputs = model.generate(**inputs, max_new_tokens=5)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 
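The merge the README describes is easy to picture in code. Below is a minimal sketch of the linear-combination idea, not the repo's actual training code: the adapter matrix `A` and its random initialization here are hypothetical, and the dimensions are toy-sized so the sketch runs anywhere.

```python
import torch

# Toy dimensions so the sketch runs anywhere; in the real model the base
# vocabulary is 32000, the adapted vocabulary 50000, and hidden size 4096.
old_vocab, new_vocab, hidden = 320, 500, 64

E_old = torch.randn(old_vocab, hidden)  # frozen base-model embeddings
A = torch.randn(new_vocab, old_vocab)   # hypothetical trained adapter weights

# Each new embedding is a linear combination of the original ones. Merging
# the adapter bakes this into an ordinary embedding matrix, so the released
# checkpoint loads like any other causal LM, with no adapter code needed.
E_new = A @ E_old
assert E_new.shape == (new_vocab, hidden)
```

On the usage snippet itself: the added loop that calls `.cuda()` on each tensor keeps the inputs on the same device as the model; on a single GPU, calling `inputs.to(model.device)` on the tokenizer's `BatchEncoding` would have the same effect.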
 
config.json CHANGED
@@ -21,5 +21,5 @@
   "torch_dtype": "bfloat16",
   "transformers_version": "4.43.0.dev0",
   "use_cache": true,
-  "vocab_size": 50302
+  "vocab_size": 50000
 }
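The `vocab_size` change above (50302 to 50000) aligns the config with the 50k adapted vocabulary. A quick sanity check, sketched here against the published checkpoint, is to compare the tokenizer's length with the config:

```python
from transformers import AutoConfig, AutoTokenizer

model_name = "h-j-han/Mistral-7B-VocADT-50k-Latin"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# After this commit both values should report the 50k adapted vocabulary;
# a mismatch would desync token ids from embedding rows.
print(config.vocab_size)  # 50000
print(len(tokenizer))     # expected to match config.vocab_size
```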
model-00001-of-00003.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:56cd7b2917c67dbe374fffd75e4fdc234d4f0aacf3a7901bb34f06443ab09bd3
-size 4975651696
+oid sha256:98489382fe32a3163ae7d60e2b6d6705ed9854a563b78ed9f97289923b1b0f6b
+size 4973177712
model-00003-of-00003.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e09d26e0e83a6c87af03adb7faffd91cd4ae25337f9cac34ef3ea40838ae46d4
-size 4891790120
+oid sha256:94a489b7f407e9aabeb6cfddce9b002fea96ee02d5263d26237322b33d210997
+size 4889316136
model.safetensors.index.json CHANGED
@@ -1,6 +1,6 @@
 {
   "metadata": {
-    "total_size": 14783324160
+    "total_size": 14778376192
   },
   "weight_map": {
     "lm_head.weight": "model-00003-of-00003.safetensors",
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff