---
datasets:
- homebrewltd/instruction-speech-whispervq-v2
language:
- en
license: apache-2.0
tags:
- sound language model

---

![](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)

# QuantFactory/llama3.1-s-instruct-v0.2-GGUF
This is a quantized version of [homebrewltd/llama3.1-s-instruct-v0.2](https://huggingface.co/homebrewltd/llama3.1-s-instruct-v0.2), created using llama.cpp.
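
A minimal sketch of loading one of the GGUF files locally, assuming the `llama-cpp-python` bindings and a downloaded quantization (the file name below is a placeholder; use whichever quant you downloaded from this repo):

```python
# Hypothetical local usage of a GGUF quant via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="llama3.1-s-instruct-v0.2.Q4_K_M.gguf",  # placeholder file name
    n_ctx=2048,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```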

# Original Model Card

## Model Details

We have developed and released the [llama3s](https://huggingface.co/collections/homebrew-research/llama3-s-669df2139f0576abc6eb7405) model family, which natively understands both audio and text input.

We extend the semantic-token experiment, using WhisperVQ as the tokenizer for audio files, by continually training [homebrewltd/llama3.1-s-base-v0.2](https://huggingface.co/homebrewltd/llama3.1-s-base-v0.2) on nearly 1B tokens from the [Instruction Speech WhisperVQ v2](https://huggingface.co/datasets/homebrewltd/instruction-speech-whispervq-v2) dataset.

**Model developers** Homebrew Research.

**Input** Text and sound.

**Output** Text.

**Model Architecture** Llama-3.

**Language(s):** English.

## Intended Use

**Intended Use Cases** This family is primarily intended for research applications. This version aims to further improve the LLM's sound-understanding capabilities.

**Out-of-scope** The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.

## How to Get Started with the Model

Try this model using the [Google Colab Notebook](https://colab.research.google.com/drive/18IiwN0AzBZaox5o0iidXqWD1xKq11XbZ?usp=sharing).

First, we need to convert the audio file to sound tokens:

```python
import os

import torch
import torchaudio
from huggingface_hub import hf_hub_download
from whisperspeech.vq_stoks import RQBottleneckTransformer  # WhisperVQ quantizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the WhisperVQ checkpoint once and load the quantizer.
if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
    hf_hub_download(
        repo_id="jan-hq/WhisperVQ",
        filename="whisper-vq-stoks-medium-en+pl-fixed.model",
        local_dir=".",
    )
vq_model = RQBottleneckTransformer.load_model(
    "whisper-vq-stoks-medium-en+pl-fixed.model"
).to(device)

def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
    vq_model.ensure_whisper(device)

    # Load the audio and resample to the 16 kHz rate WhisperVQ expects.
    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    with torch.no_grad():
        codes = vq_model.encode_audio(wav.to(device))
        codes = codes[0].cpu().tolist()

    # Render the discrete codes as the model's <|sound_xxxx|> tokens.
    result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
    return f'<|sound_start|>{result}<|sound_end|>'

def audio_to_sound_tokens_transcript(audio_path, target_bandwidth=1.5, device=device):
    # Variant used for the transcript-style prompt (note the extra special-token prefix).
    vq_model.ensure_whisper(device)

    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    with torch.no_grad():
        codes = vq_model.encode_audio(wav.to(device))
        codes = codes[0].cpu().tolist()

    result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
    return f'<|reserved_special_token_69|><|sound_start|>{result}<|sound_end|>'
```
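
For example, to tokenize a local recording (the file name below is just a placeholder):

```python
# Hypothetical usage: convert a question recorded as a WAV file into sound tokens.
sound_tokens = audio_to_sound_tokens("sample_question.wav")
print(sound_tokens[:80])
```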

Then, we can run inference with the model in the same way as any other LLM.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    model_kwargs = {"device_map": "auto"}

    # Optionally quantize with bitsandbytes to fit the model on smaller GPUs.
    if use_4bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    elif use_8bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
            bnb_8bit_compute_dtype=torch.bfloat16,
            bnb_8bit_use_double_quant=True,
        )
    else:
        model_kwargs["torch_dtype"] = torch.bfloat16

    model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)

    return pipeline("text-generation", model=model, tokenizer=tokenizer)

def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
    generation_args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "temperature": temperature,
        "do_sample": do_sample,
    }

    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

# Usage
llm_path = "homebrewltd/llama3.1-s-instruct-v0.2"
pipe = setup_pipeline(llm_path, use_8bit=True)
```
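
A minimal end-to-end sketch, combining the sound-token helper above with the pipeline (the audio path is a placeholder):

```python
# Hypothetical usage: turn a spoken question into sound tokens, then let the
# instruct model answer it through the text-generation pipeline.
sound_tokens = audio_to_sound_tokens("sample_question.wav")  # placeholder path

messages = [{"role": "user", "content": sound_tokens}]

print(generate_text(pipe, messages, max_new_tokens=128))
```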

## Training process
**Training Metrics Image**: Below is a snapshot of the training loss curve.

![training_](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/pQ8y9GoSvtv42MgkKRDt0.png)

### Hardware

**GPU Configuration**: Cluster of 8x NVIDIA H100-SXM-80GB.
**GPU Usage**:
- **Continual Training**: 6 hours.

### Training Arguments

We use the [torchtune](https://github.com/pytorch/torchtune) library, which provides the latest FSDP2 training implementation; a rough optimizer/scheduler sketch corresponding to these settings follows the table below.

| Parameter | Continual Training |
|----------------------------|-------------------------|
| **Epoch** | 1 |
| **Global batch size** | 128 |
| **Learning Rate** | 0.5e-4 |
| **Learning Scheduler** | Cosine with warmup |
| **Optimizer** | Adam torch fused |
| **Warmup Ratio** | 0.01 |
| **Weight Decay** | 0.005 |
| **Max Sequence Length** | 512 |
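
As a point of reference only (the actual run uses a torchtune FSDP2 recipe, not this code), the hyperparameters above map roughly onto a standard PyTorch optimizer and scheduler as sketched here; the model and total step count are placeholders:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # placeholder for the actual Llama-3 model

# "Adam torch fused", LR 0.5e-4, weight decay 0.005 (fused kernels need CUDA).
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.5e-4,
    weight_decay=0.005,
    fused=torch.cuda.is_available(),
)

# Cosine schedule with a 0.01 warmup ratio; the total step count is illustrative.
num_training_steps = 1_000
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.01 * num_training_steps),
    num_training_steps=num_training_steps,
)
```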

## Examples

1. Good example:

<details>
<summary>Click to toggle Example 1</summary>

```

```
</details>

<details>
<summary>Click to toggle Example 2</summary>

```

```
</details>

2. Misunderstanding example:

<details>
<summary>Click to toggle Example 3</summary>

```

```
</details>

3. Off-track example:

<details>
<summary>Click to toggle Example 4</summary>

```

```
</details>

## Citation Information

**BibTeX:**

```
@article{llama3s2024,
  title={Llama3-S: Sound Instruction Language Model},
  author={Homebrew Research},
  year={2024},
  month={August},
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-20}
}
```

## Acknowledgement

- **[WhisperSpeech](https://github.com/collabora/WhisperSpeech)**

- **[Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)**