trollkotze committed
Commit 6cd1d54
1 Parent(s): deb08af

Patches for llama.cpp and repeng for training control vectors.

patches/README.md ADDED
@@ -0,0 +1,6 @@
The assumption is that you have the llama.cpp and repeng directories under the same parent directory.
Copy the files from patches/repeng into your repeng directory.
Apply the diff from patches/llama inside your llama.cpp directory.
Place trainvector.sh in their common parent directory.
Run ./trainvector.sh {name-of-your-vector} to generate a vector based on the prompts in your edited repeng/emotion_prompts.py.
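A minimal sketch of that workflow, assuming this repository's patches/ directory has been placed next to llama.cpp and repeng; the vector name charisma is purely illustrative:

```bash
# Assumed layout: ./llama.cpp, ./repeng and ./patches side by side; run from their parent.
cp patches/repeng/emotion_prompts.py patches/repeng/extract_vector.py repeng/   # repeng helpers
(cd llama.cpp && git apply ../patches/llama/llama-repeng.diff)                  # llama.cpp patch
cp patches/trainvector.sh . && chmod +x trainvector.sh
./trainvector.sh charisma    # outputs end up under llama.cpp/vectors/
```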
patches/llama/README.md ADDED
@@ -0,0 +1,2 @@
Works on top of llama.cpp commit c47cf414efafb8f60596edc7edb5a2d68065e992.
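As a concrete (hedged) sketch: pin llama.cpp to that commit, apply the diff, and build the repeng tool that the patch adds to the Makefile and examples/ tree:

```bash
cd llama.cpp
git checkout c47cf414efafb8f60596edc7edb5a2d68065e992
git apply ../patches/llama/llama-repeng.diff    # or: patch -p1 < ../patches/llama/llama-repeng.diff
make repeng                                     # the 'repeng' build target is added by the patch
```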
patches/llama/llama-repeng.diff ADDED
@@ -0,0 +1,855 @@
1
+ Parent: a56d09a4407f29c21e149b44fd5308f83aa1cb09
2
+ Author: Anon <anon>
3
+ Date: Tue Mar 19 02:46:47 2024 +0000
4
+
5
+ repeng: implement batching
6
+ diff --git a/Makefile b/Makefile
7
+ index c0f12503..d471c387 100644
8
+ --- a/Makefile
9
+ +++ b/Makefile
10
+ @@ -1,6 +1,6 @@
11
+ # Define the default target now so that it is always the first target
12
+ BUILD_TARGETS = \
13
+ - main quantize quantize-stats perplexity imatrix embedding vdot q8dot train-text-from-scratch convert-llama2c-to-ggml \
14
+ + main repeng quantize quantize-stats perplexity imatrix embedding vdot q8dot train-text-from-scratch convert-llama2c-to-ggml \
15
+ simple batched batched-bench save-load-state server gguf llama-bench libllava.a llava-cli baby-llama beam-search \
16
+ speculative infill tokenize benchmark-matmult parallel finetune export-lora lookahead lookup passkey gritlm tests/test-c.o
17
+
18
+ @@ -744,6 +744,13 @@ server: examples/server/server.cpp examples/server/utils.hpp examples/server/htt
19
+ $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
20
+ $(CXX) $(CXXFLAGS) $(filter-out %.h %.hpp $<,$^) -Iexamples/server $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) $(LWINSOCK2)
21
+
22
+ +repeng: examples/repeng/repeng.cpp ggml.o llama.o $(COMMON_DEPS) console.o grammar-parser.o $(OBJS)
23
+ + $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
24
+ + $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
25
+ + @echo
26
+ + @echo '==== Run ./repeng -h for help. ===='
27
+ + @echo
28
+ +
29
+ gguf: examples/gguf/gguf.cpp ggml.o $(OBJS)
30
+ $(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
31
+ $(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
32
+ diff --git a/examples/CMakeLists.txt b/examples/CMakeLists.txt
33
+ index e762cf8b..d46d9d17 100644
34
+ --- a/examples/CMakeLists.txt
35
+ +++ b/examples/CMakeLists.txt
36
+ @@ -46,4 +46,5 @@ else()
37
+ add_subdirectory(server)
38
+ endif()
39
+ add_subdirectory(export-lora)
40
+ + add_subdirectory(repeng)
41
+ endif()
42
+ diff --git a/examples/repeng/CMakeLists.txt b/examples/repeng/CMakeLists.txt
43
+ new file mode 100644
44
+ index 00000000..9e20f806
45
+ --- /dev/null
46
+ +++ b/examples/repeng/CMakeLists.txt
47
+ @@ -0,0 +1,5 @@
48
+ +set(TARGET repeng)
49
+ +add_executable(${TARGET} repeng.cpp)
50
+ +install(TARGETS ${TARGET} RUNTIME)
51
+ +target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
52
+ +target_compile_features(${TARGET} PRIVATE cxx_std_11)
53
+ diff --git a/examples/repeng/repeng.cpp b/examples/repeng/repeng.cpp
54
+ new file mode 100644
55
+ index 00000000..5863c8be
56
+ --- /dev/null
57
+ +++ b/examples/repeng/repeng.cpp
58
+ @@ -0,0 +1,797 @@
59
+ +#include "common.h"
60
+ +
61
+ +#include "console.h"
62
+ +#include "llama.h"
63
+ +
64
+ +#include <cassert>
65
+ +#include <cinttypes>
66
+ +#include <cmath>
67
+ +#include <cstdio>
68
+ +#include <cstring>
69
+ +#include <ctime>
70
+ +#include <fstream>
71
+ +#include <iostream>
72
+ +#include <sstream>
73
+ +#include <string>
74
+ +#include <vector>
75
+ +
76
+ +#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
77
+ +#include <signal.h>
78
+ +#include <unistd.h>
79
+ +#elif defined (_WIN32)
80
+ +#define WIN32_LEAN_AND_MEAN
81
+ +#ifndef NOMINMAX
82
+ +#define NOMINMAX
83
+ +#endif
84
+ +#include <windows.h>
85
+ +#include <signal.h>
86
+ +#endif
87
+ +
88
+ +#if defined(_MSC_VER)
89
+ +#pragma warning(disable: 4244 4267) // possible loss of data
90
+ +#endif
91
+ +
92
+ +static llama_context ** g_ctx;
93
+ +static llama_model ** g_model;
94
+ +static gpt_params * g_params;
95
+ +static std::vector<llama_token> * g_input_tokens;
96
+ +static std::ostringstream * g_output_ss;
97
+ +static std::vector<llama_token> * g_output_tokens;
98
+ +static bool is_interacting = false;
99
+ +
100
+ +static bool file_exists(const std::string &path) {
101
+ + std::ifstream f(path.c_str());
102
+ + return f.good();
103
+ +}
104
+ +
105
+ +static bool file_is_empty(const std::string &path) {
106
+ + std::ifstream f;
107
+ + f.exceptions(std::ifstream::failbit | std::ifstream::badbit);
108
+ + f.open(path.c_str(), std::ios::in | std::ios::binary | std::ios::ate);
109
+ + return f.tellg() == 0;
110
+ +}
111
+ +
112
+ +#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
113
+ +static void sigint_handler(int signo) {
114
+ + if (signo == SIGINT) {
115
+ + if (!is_interacting && g_params->interactive) {
116
+ + is_interacting = true;
117
+ + } else {
118
+ + console::cleanup();
119
+ + printf("\n");
120
+ + llama_print_timings(*g_ctx);
121
+ + //write_logfile(*g_ctx, *g_params, *g_model, *g_input_tokens, g_output_ss->str(), *g_output_tokens);
122
+ + _exit(130);
123
+ + }
124
+ + }
125
+ +}
126
+ +#endif
127
+ +
128
+ +static void llama_log_callback_logTee(ggml_log_level level, const char * text, void * user_data) {
129
+ + (void) level;
130
+ + (void) user_data;
131
+ + LOG_TEE("%s", text);
132
+ +}
133
+ +
134
+ +static std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_params_with_cb_eval(
135
+ + gpt_params & params,
136
+ + ggml_backend_sched_eval_callback cb_eval,
137
+ + void * cb_eval_user_data) {
138
+ + auto mparams = llama_model_params_from_gpt_params(params);
139
+ +
140
+ + llama_model * model = llama_load_model_from_file(params.model.c_str(), mparams);
141
+ + if (model == NULL) {
142
+ + fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
143
+ + return std::make_tuple(nullptr, nullptr);
144
+ + }
145
+ +
146
+ + auto cparams = llama_context_params_from_gpt_params(params);
147
+ +
148
+ + cparams.cb_eval = cb_eval;
149
+ + cparams.cb_eval_user_data = cb_eval_user_data;
150
+ +
151
+ + llama_context * lctx = llama_new_context_with_model(model, cparams);
152
+ + if (lctx == NULL) {
153
+ + fprintf(stderr, "%s: error: failed to create context with model '%s'\n", __func__, params.model.c_str());
154
+ + llama_free_model(model);
155
+ + return std::make_tuple(nullptr, nullptr);
156
+ + }
157
+ +
158
+ + if (!params.control_vectors.empty()) {
159
+ + if (params.control_vector_layer_start <= 0) params.control_vector_layer_start = 1;
160
+ + if (params.control_vector_layer_end <= 0) params.control_vector_layer_end = llama_n_layer(model);
161
+ +
162
+ + const auto cvec = llama_control_vector_load(params.control_vectors);
163
+ + if (cvec.n_embd == -1) {
164
+ + llama_free(lctx);
165
+ + llama_free_model(model);
166
+ + return std::make_tuple(nullptr, nullptr);
167
+ + }
168
+ +
169
+ + int err = llama_control_vector_apply(lctx,
170
+ + cvec.data.data(),
171
+ + cvec.data.size(),
172
+ + cvec.n_embd,
173
+ + params.control_vector_layer_start,
174
+ + params.control_vector_layer_end);
175
+ + if (err) {
176
+ + llama_free(lctx);
177
+ + llama_free_model(model);
178
+ + return std::make_tuple(nullptr, nullptr);
179
+ + }
180
+ + }
181
+ +
182
+ + for (unsigned int i = 0; i < params.lora_adapter.size(); ++i) {
183
+ + const std::string& lora_adapter = std::get<0>(params.lora_adapter[i]);
184
+ + float lora_scale = std::get<1>(params.lora_adapter[i]);
185
+ + int err = llama_model_apply_lora_from_file(model,
186
+ + lora_adapter.c_str(),
187
+ + lora_scale,
188
+ + ((i > 0) || params.lora_base.empty())
189
+ + ? NULL
190
+ + : params.lora_base.c_str(),
191
+ + params.n_threads);
192
+ + if (err != 0) {
193
+ + fprintf(stderr, "%s: error: failed to apply lora adapter\n", __func__);
194
+ + llama_free(lctx);
195
+ + llama_free_model(model);
196
+ + return std::make_tuple(nullptr, nullptr);
197
+ + }
198
+ + }
199
+ +
200
+ + if (params.ignore_eos) {
201
+ + params.sparams.logit_bias[llama_token_eos(model)] = -INFINITY;
202
+ + }
203
+ +
204
+ + {
205
+ + LOG("warming up the model with an empty run\n");
206
+ +
207
+ + std::vector<llama_token> tmp = { llama_token_bos(model), llama_token_eos(model), };
208
+ + llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch), 0, 0));
209
+ + llama_kv_cache_clear(lctx);
210
+ + llama_synchronize(lctx);
211
+ + llama_reset_timings(lctx);
212
+ + }
213
+ +
214
+ + return std::make_tuple(model, lctx);
215
+ +}
216
+ +
217
+ +struct eval_callback_state {
218
+ + std::vector<ggml_tensor *> tensors;
219
+ + int first_prompt_idx;
220
+ + std::vector<int> extract_tokens;
221
+ +};
222
+ +
223
+ +static bool eval_callback(struct ggml_tensor * t, bool ask, void * user_data) {
224
+ + struct eval_callback_state * eval_state = (eval_callback_state *)user_data;
225
+ + if (ask) {
226
+ + // Report whether we want to observe this tensor.
227
+ + if (strncmp(t->name, "l_out-", 6) == 0) {
228
+ + return true;
229
+ + } else {
230
+ + return false;
231
+ + }
232
+ + } else {
233
+ + // Actually observe the tensor data.
234
+ +
235
+ + if (eval_state->first_prompt_idx >= 0) {
236
+ + // Find the tensor collecting hidden states for the current layer.
237
+ + ggml_tensor * output_tensor = nullptr;
238
+ + for (auto t2 : eval_state->tensors) {
239
+ + if (strcmp(t2->name, t->name) == 0) {
240
+ + output_tensor = t2;
241
+ + break;
242
+ + }
243
+ + }
244
+ +
245
+ + if (output_tensor != nullptr) {
246
+ + int output_idx = eval_state->first_prompt_idx;
247
+ + for (int input_idx : eval_state->extract_tokens) {
248
+ + // Copy the hidden states for the last token into
249
+ + size_t input_offset = t->nb[1] * input_idx;
250
+ + size_t output_offset = output_tensor->nb[1] * output_idx;
251
+ + assert(t->nb[0] == output_tensor->nb[0]);
252
+ + assert(t->ne[0] == output_tensor->ne[0]);
253
+ + ggml_backend_tensor_get(t,
254
+ + (char *)output_tensor->data + output_offset,
255
+ + input_offset,
256
+ + t->nb[0] * t->ne[0]);
257
+ + //memcpy((char *)output_tensor->data + output_offset,
258
+ + // (char *)t->data + input_offset,
259
+ + // t->nb[0] * t->ne[0]);
260
+ + //std::cerr << "saved " << (t->nb[0] * t->ne[0]) << " bytes of tensor data "
261
+ + // << " for " << t->name << " in slot " << output_idx << "\n";
262
+ +
263
+ + //float * buf = (float *)((char *)t->data + input_offset);
264
+ + //float * buf = (float *)((char *)output_tensor->data + output_offset);
265
+ + //std::cerr << "prompt " << output_idx
266
+ + // << " tensor contents for " << t->name << ": "
267
+ + // << buf[0] << ", "
268
+ + // << buf[1] << ", "
269
+ + // << buf[2] << " ... "
270
+ + // << buf[4093] << ", "
271
+ + // << buf[4094] << ", "
272
+ + // << buf[4095] << "\n";
273
+ +
274
+ + ++output_idx;
275
+ + }
276
+ + }
277
+ + }
278
+ +
279
+ + // Continue running
280
+ + return true;
281
+ + }
282
+ +}
283
+ +
284
+ +int main(int argc, char ** argv) {
285
+ + gpt_params params;
286
+ + g_params = &params;
287
+ +
288
+ + if (!gpt_params_parse(argc, argv, params)) {
289
+ + return 1;
290
+ + }
291
+ + llama_sampling_params & sparams = params.sparams;
292
+ +
293
+ +#ifndef LOG_DISABLE_LOGS
294
+ + log_set_target(log_filename_generator("main", "log"));
295
+ + LOG_TEE("Log start\n");
296
+ + log_dump_cmdline(argc, argv);
297
+ + llama_log_set(llama_log_callback_logTee, nullptr);
298
+ +#endif // LOG_DISABLE_LOGS
299
+ +
300
+ + // TODO: Dump params ?
301
+ + //LOG("Params perplexity: %s\n", LOG_TOSTR(params.perplexity));
302
+ +
303
+ + // save choice to use color for later
304
+ + // (note for later: this is a slightly awkward choice)
305
+ + console::init(params.simple_io, params.use_color);
306
+ + atexit([]() { console::cleanup(); });
307
+ +
308
+ + if (params.logits_all) {
309
+ + printf("\n************\n");
310
+ + printf("%s: please use the 'perplexity' tool for perplexity calculations\n", __func__);
311
+ + printf("************\n\n");
312
+ +
313
+ + return 0;
314
+ + }
315
+ +
316
+ + if (params.embedding) {
317
+ + printf("\n************\n");
318
+ + printf("%s: please use the 'embedding' tool for embedding calculations\n", __func__);
319
+ + printf("************\n\n");
320
+ +
321
+ + return 0;
322
+ + }
323
+ +
324
+ + if (params.n_ctx != 0 && params.n_ctx < 8) {
325
+ + LOG_TEE("%s: warning: minimum context size is 8, using minimum size.\n", __func__);
326
+ + params.n_ctx = 8;
327
+ + }
328
+ +
329
+ + if (params.rope_freq_base != 0.0) {
330
+ + LOG_TEE("%s: warning: changing RoPE frequency base to %g.\n", __func__, params.rope_freq_base);
331
+ + }
332
+ +
333
+ + if (params.rope_freq_scale != 0.0) {
334
+ + LOG_TEE("%s: warning: scaling RoPE frequency by %g.\n", __func__, params.rope_freq_scale);
335
+ + }
336
+ +
337
+ + LOG_TEE("%s: build = %d (%s)\n", __func__, LLAMA_BUILD_NUMBER, LLAMA_COMMIT);
338
+ + LOG_TEE("%s: built with %s for %s\n", __func__, LLAMA_COMPILER, LLAMA_BUILD_TARGET);
339
+ +
340
+ + if (params.seed == LLAMA_DEFAULT_SEED) {
341
+ + params.seed = time(NULL);
342
+ + }
343
+ +
344
+ + LOG_TEE("%s: seed = %u\n", __func__, params.seed);
345
+ +
346
+ + std::mt19937 rng(params.seed);
347
+ + if (params.random_prompt) {
348
+ + params.prompt = gpt_random_prompt(rng);
349
+ + }
350
+ +
351
+ + LOG("%s: llama backend init\n", __func__);
352
+ + llama_backend_init();
353
+ + llama_numa_init(params.numa);
354
+ +
355
+ + llama_model * model;
356
+ + llama_context * ctx;
357
+ + llama_context * ctx_guidance = NULL;
358
+ + g_model = &model;
359
+ + g_ctx = &ctx;
360
+ +
361
+ + ggml_context * eval_ctx = nullptr;
362
+ + struct eval_callback_state eval_state;
363
+ +
364
+ + // load the model and apply lora adapter, if any
365
+ + LOG("%s: load the model and apply lora adapter, if any\n", __func__);
366
+ + std::tie(model, ctx) = llama_init_from_gpt_params_with_cb_eval(
367
+ + params,
368
+ + eval_callback,
369
+ + (void *)&eval_state);
370
+ + /*
371
+ + if (sparams.cfg_scale > 1.f) {
372
+ + struct llama_context_params lparams = llama_context_params_from_gpt_params(params);
373
+ + ctx_guidance = llama_new_context_with_model(model, lparams);
374
+ + }
375
+ + */
376
+ +
377
+ + if (model == NULL) {
378
+ + LOG_TEE("%s: error: unable to load model\n", __func__);
379
+ + return 1;
380
+ + }
381
+ +
382
+ + const int n_ctx_train = llama_n_ctx_train(model);
383
+ + const int n_ctx = llama_n_ctx(ctx);
384
+ + LOG("n_ctx: %d\n", n_ctx);
385
+ +
386
+ + if (n_ctx > n_ctx_train) {
387
+ + LOG_TEE("%s: warning: model was trained on only %d context tokens (%d specified)\n",
388
+ + __func__, n_ctx_train, n_ctx);
389
+ + }
390
+ +
391
+ + // print system information
392
+ + {
393
+ + LOG_TEE("\n");
394
+ + LOG_TEE("%s\n", get_system_info(params).c_str());
395
+ + }
396
+ +
397
+ + std::string path_session = params.path_prompt_cache;
398
+ + std::vector<llama_token> session_tokens;
399
+ +
400
+ + if (!path_session.empty()) {
401
+ + LOG_TEE("%s: attempting to load saved session from '%s'\n", __func__, path_session.c_str());
402
+ + if (!file_exists(path_session)) {
403
+ + LOG_TEE("%s: session file does not exist, will create.\n", __func__);
404
+ + } else if (file_is_empty(path_session)) {
405
+ + LOG_TEE("%s: The session file is empty. A new session will be initialized.\n", __func__);
406
+ + } else {
407
+ + // The file exists and is not empty
408
+ + session_tokens.resize(n_ctx);
409
+ + size_t n_token_count_out = 0;
410
+ + if (!llama_load_session_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.capacity(), &n_token_count_out)) {
411
+ + LOG_TEE("%s: error: failed to load session file '%s'\n", __func__, path_session.c_str());
412
+ + return 1;
413
+ + }
414
+ + session_tokens.resize(n_token_count_out);
415
+ + llama_set_rng_seed(ctx, params.seed);
416
+ + LOG_TEE("%s: loaded a session with prompt size of %d tokens\n", __func__, (int)session_tokens.size());
417
+ + }
418
+ + }
419
+ +
420
+ + const bool add_bos = llama_should_add_bos_token(model);
421
+ + LOG("add_bos: %d\n", add_bos);
422
+ +
423
+ + std::vector<llama_token> embd_inp;
424
+ +
425
+ + if (params.interactive_first || params.instruct || params.chatml || !params.prompt.empty() || session_tokens.empty()) {
426
+ + LOG("tokenize the prompt\n");
427
+ + if (params.chatml) {
428
+ + params.prompt = "<|im_start|>system\n" + params.prompt + "<|im_end|>";
429
+ + }
430
+ + embd_inp = ::llama_tokenize(ctx, params.prompt, add_bos, true);
431
+ + } else {
432
+ + LOG("use session tokens\n");
433
+ + embd_inp = session_tokens;
434
+ + }
435
+ +
436
+ + LOG("prompt: \"%s\"\n", log_tostr(params.prompt));
437
+ + LOG("tokens: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, embd_inp).c_str());
438
+ +
439
+ + // Should not run without any tokens
440
+ + if (embd_inp.empty()) {
441
+ + embd_inp.push_back(llama_token_bos(model));
442
+ + LOG("embd_inp was considered empty and bos was added: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, embd_inp).c_str());
443
+ + }
444
+ +
445
+ + // Tokenize negative prompt
446
+ + std::vector<llama_token> guidance_inp;
447
+ + int guidance_offset = 0;
448
+ + int original_prompt_len = 0;
449
+ + /*
450
+ + if (ctx_guidance) {
451
+ + LOG("cfg_negative_prompt: \"%s\"\n", log_tostr(sparams.cfg_negative_prompt));
452
+ +
453
+ + guidance_inp = ::llama_tokenize(ctx_guidance, sparams.cfg_negative_prompt, add_bos, true);
454
+ + LOG("guidance_inp tokenized: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx_guidance, guidance_inp).c_str());
455
+ +
456
+ + std::vector<llama_token> original_inp = ::llama_tokenize(ctx, params.prompt, add_bos, true);
457
+ + LOG("original_inp tokenized: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, original_inp).c_str());
458
+ +
459
+ + original_prompt_len = original_inp.size();
460
+ + guidance_offset = (int)guidance_inp.size() - original_prompt_len;
461
+ + LOG("original_prompt_len: %s", log_tostr(original_prompt_len));
462
+ + LOG("guidance_offset: %s", log_tostr(guidance_offset));
463
+ + }
464
+ + */
465
+ +
466
+ + /*
467
+ + if ((int) embd_inp.size() > n_ctx - 4) {
468
+ + LOG_TEE("%s: error: prompt is too long (%d tokens, max %d)\n", __func__, (int) embd_inp.size(), n_ctx - 4);
469
+ + return 1;
470
+ + }
471
+ + */
472
+ +
473
+ + // debug message about similarity of saved session, if applicable
474
+ + size_t n_matching_session_tokens = 0;
475
+ + if (!session_tokens.empty()) {
476
+ + for (llama_token id : session_tokens) {
477
+ + if (n_matching_session_tokens >= embd_inp.size() || id != embd_inp[n_matching_session_tokens]) {
478
+ + break;
479
+ + }
480
+ + n_matching_session_tokens++;
481
+ + }
482
+ + if (params.prompt.empty() && n_matching_session_tokens == embd_inp.size()) {
483
+ + LOG_TEE("%s: using full prompt from session file\n", __func__);
484
+ + } else if (n_matching_session_tokens >= embd_inp.size()) {
485
+ + LOG_TEE("%s: session file has exact match for prompt!\n", __func__);
486
+ + } else if (n_matching_session_tokens < (embd_inp.size() / 2)) {
487
+ + LOG_TEE("%s: warning: session file has low similarity to prompt (%zu / %zu tokens); will mostly be reevaluated\n",
488
+ + __func__, n_matching_session_tokens, embd_inp.size());
489
+ + } else {
490
+ + LOG_TEE("%s: session file matches %zu / %zu tokens of prompt\n",
491
+ + __func__, n_matching_session_tokens, embd_inp.size());
492
+ + }
493
+ +
494
+ + // remove any "future" tokens that we might have inherited from the previous session
495
+ + llama_kv_cache_seq_rm(ctx, -1, n_matching_session_tokens, -1);
496
+ + }
497
+ +
498
+ + LOGLN(
499
+ + "recalculate the cached logits (check): embd_inp.empty() %s, n_matching_session_tokens %zu, embd_inp.size() %zu, session_tokens.size() %zu, embd_inp.size() %zu",
500
+ + log_tostr(embd_inp.empty()), n_matching_session_tokens, embd_inp.size(), session_tokens.size(), embd_inp.size());
501
+ +
502
+ + // if we will use the cache for the full prompt without reaching the end of the cache, force
503
+ + // reevaluation of the last token token to recalculate the cached logits
504
+ + if (!embd_inp.empty() && n_matching_session_tokens == embd_inp.size() && session_tokens.size() > embd_inp.size()) {
505
+ + LOGLN("recalculate the cached logits (do): session_tokens.resize( %zu )", embd_inp.size() - 1);
506
+ +
507
+ + session_tokens.resize(embd_inp.size() - 1);
508
+ + }
509
+ +
510
+ + // number of tokens to keep when resetting context
511
+ + if (params.n_keep < 0 || params.n_keep > (int) embd_inp.size() || params.instruct || params.chatml) {
512
+ + params.n_keep = (int)embd_inp.size();
513
+ + } else {
514
+ + params.n_keep += add_bos; // always keep the BOS token
515
+ + }
516
+ +
517
+ + // prefix & suffix for instruct mode
518
+ + const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", add_bos, true);
519
+ + const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false, true);
520
+ +
521
+ + LOG("inp_pfx: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, inp_pfx).c_str());
522
+ + LOG("inp_sfx: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, inp_sfx).c_str());
523
+ +
524
+ + // chatml prefix & suffix
525
+ + const auto cml_pfx = ::llama_tokenize(ctx, "\n<|im_start|>user\n", add_bos, true);
526
+ + const auto cml_sfx = ::llama_tokenize(ctx, "<|im_end|>\n<|im_start|>assistant\n", false, true);
527
+ +
528
+ + LOG("cml_pfx: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, cml_pfx).c_str());
529
+ + LOG("cml_sfx: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, cml_sfx).c_str());
530
+ +
531
+ + // in instruct mode, we inject a prefix and a suffix to each input by the user
532
+ + if (params.instruct) {
533
+ + params.interactive_first = true;
534
+ + params.antiprompt.emplace_back("### Instruction:\n\n");
535
+ + }
536
+ + // similar for chatml mode
537
+ + else if (params.chatml) {
538
+ + params.interactive_first = true;
539
+ + params.antiprompt.emplace_back("<|im_start|>user\n");
540
+ + }
541
+ +
542
+ + // enable interactive mode if interactive start is specified
543
+ + if (params.interactive_first) {
544
+ + params.interactive = true;
545
+ + }
546
+ +
547
+ + if (params.verbose_prompt) {
548
+ + LOG_TEE("\n");
549
+ + LOG_TEE("%s: prompt: '%s'\n", __func__, params.prompt.c_str());
550
+ + LOG_TEE("%s: number of tokens in prompt = %zu\n", __func__, embd_inp.size());
551
+ + for (int i = 0; i < (int) embd_inp.size(); i++) {
552
+ + LOG_TEE("%6d -> '%s'\n", embd_inp[i], llama_token_to_piece(ctx, embd_inp[i]).c_str());
553
+ + }
554
+ +
555
+ + /*
556
+ + if (ctx_guidance) {
557
+ + LOG_TEE("\n");
558
+ + LOG_TEE("%s: negative prompt: '%s'\n", __func__, sparams.cfg_negative_prompt.c_str());
559
+ + LOG_TEE("%s: number of tokens in negative prompt = %zu\n", __func__, guidance_inp.size());
560
+ + for (int i = 0; i < (int) guidance_inp.size(); i++) {
561
+ + LOG_TEE("%6d -> '%s'\n", guidance_inp[i], llama_token_to_piece(ctx, guidance_inp[i]).c_str());
562
+ + }
563
+ + }
564
+ + */
565
+ +
566
+ + if (params.n_keep > add_bos) {
567
+ + LOG_TEE("%s: static prompt based on n_keep: '", __func__);
568
+ + for (int i = 0; i < params.n_keep; i++) {
569
+ + LOG_TEE("%s", llama_token_to_piece(ctx, embd_inp[i]).c_str());
570
+ + }
571
+ + LOG_TEE("'\n");
572
+ + }
573
+ + LOG_TEE("\n");
574
+ + }
575
+ +
576
+ + // ctrl+C handling
577
+ + {
578
+ +#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
579
+ + struct sigaction sigint_action;
580
+ + sigint_action.sa_handler = sigint_handler;
581
+ + sigemptyset (&sigint_action.sa_mask);
582
+ + sigint_action.sa_flags = 0;
583
+ + sigaction(SIGINT, &sigint_action, NULL);
584
+ +#elif defined (_WIN32)
585
+ + auto console_ctrl_handler = +[](DWORD ctrl_type) -> BOOL {
586
+ + return (ctrl_type == CTRL_C_EVENT) ? (sigint_handler(SIGINT), true) : false;
587
+ + };
588
+ + SetConsoleCtrlHandler(reinterpret_cast<PHANDLER_ROUTINE>(console_ctrl_handler), true);
589
+ +#endif
590
+ + }
591
+ +
592
+ + if (params.interactive) {
593
+ + LOG_TEE("%s: interactive mode on.\n", __func__);
594
+ +
595
+ + if (!params.antiprompt.empty()) {
596
+ + for (const auto & antiprompt : params.antiprompt) {
597
+ + LOG_TEE("Reverse prompt: '%s'\n", antiprompt.c_str());
598
+ + if (params.verbose_prompt) {
599
+ + auto tmp = ::llama_tokenize(ctx, antiprompt, false, true);
600
+ + for (int i = 0; i < (int) tmp.size(); i++) {
601
+ + LOG_TEE("%6d -> '%s'\n", tmp[i], llama_token_to_piece(ctx, tmp[i]).c_str());
602
+ + }
603
+ + }
604
+ + }
605
+ + }
606
+ +
607
+ + if (params.input_prefix_bos) {
608
+ + LOG_TEE("Input prefix with BOS\n");
609
+ + }
610
+ +
611
+ + if (!params.input_prefix.empty()) {
612
+ + LOG_TEE("Input prefix: '%s'\n", params.input_prefix.c_str());
613
+ + if (params.verbose_prompt) {
614
+ + auto tmp = ::llama_tokenize(ctx, params.input_prefix, true, true);
615
+ + for (int i = 0; i < (int) tmp.size(); i++) {
616
+ + LOG_TEE("%6d -> '%s'\n", tmp[i], llama_token_to_piece(ctx, tmp[i]).c_str());
617
+ + }
618
+ + }
619
+ + }
620
+ +
621
+ + if (!params.input_suffix.empty()) {
622
+ + LOG_TEE("Input suffix: '%s'\n", params.input_suffix.c_str());
623
+ + if (params.verbose_prompt) {
624
+ + auto tmp = ::llama_tokenize(ctx, params.input_suffix, false, true);
625
+ + for (int i = 0; i < (int) tmp.size(); i++) {
626
+ + LOG_TEE("%6d -> '%s'\n", tmp[i], llama_token_to_piece(ctx, tmp[i]).c_str());
627
+ + }
628
+ + }
629
+ + }
630
+ + }
631
+ + LOG_TEE("sampling: \n%s\n", llama_sampling_print(sparams).c_str());
632
+ + LOG_TEE("sampling order: \n%s\n", llama_sampling_order_print(sparams).c_str());
633
+ + LOG_TEE("generate: n_ctx = %d, n_batch = %d, n_predict = %d, n_keep = %d\n", n_ctx, params.n_batch, params.n_predict, params.n_keep);
634
+ +
635
+ + // group-attention state
636
+ + // number of grouped KV tokens so far (used only if params.grp_attn_n > 1)
637
+ + int ga_i = 0;
638
+ +
639
+ + const int ga_n = params.grp_attn_n;
640
+ + const int ga_w = params.grp_attn_w;
641
+ +
642
+ + if (ga_n != 1) {
643
+ + GGML_ASSERT(ga_n > 0 && "grp_attn_n must be positive"); // NOLINT
644
+ + GGML_ASSERT(ga_w % ga_n == 0 && "grp_attn_w must be a multiple of grp_attn_n"); // NOLINT
645
+ + //GGML_ASSERT(n_ctx_train % ga_w == 0 && "n_ctx_train must be a multiple of grp_attn_w"); // NOLINT
646
+ + //GGML_ASSERT(n_ctx >= n_ctx_train * ga_n && "n_ctx must be at least n_ctx_train * grp_attn_n"); // NOLINT
647
+ + LOG_TEE("self-extend: n_ctx_train = %d, grp_attn_n = %d, grp_attn_w = %d\n", n_ctx_train, ga_n, ga_w);
648
+ + }
649
+ + LOG_TEE("\n\n");
650
+ +
651
+ + if (params.interactive) {
652
+ + const char *control_message;
653
+ + if (params.multiline_input) {
654
+ + control_message = " - To return control to LLaMa, end your input with '\\'.\n"
655
+ + " - To return control without starting a new line, end your input with '/'.\n";
656
+ + } else {
657
+ + control_message = " - Press Return to return control to LLaMa.\n"
658
+ + " - To return control without starting a new line, end your input with '/'.\n"
659
+ + " - If you want to submit another line, end your input with '\\'.\n";
660
+ + }
661
+ + LOG_TEE("== Running in interactive mode. ==\n");
662
+ +#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
663
+ + LOG_TEE( " - Press Ctrl+C to interject at any time.\n");
664
+ +#endif
665
+ + LOG_TEE( "%s\n", control_message);
666
+ +
667
+ + is_interacting = params.interactive_first;
668
+ + }
669
+ +
670
+ + bool is_antiprompt = false;
671
+ + bool input_echo = true;
672
+ + bool display = true;
673
+ + bool need_to_save_session = !path_session.empty() && n_matching_session_tokens < embd_inp.size();
674
+ +
675
+ + int n_past = 0;
676
+ + int n_remain = params.n_predict;
677
+ + unsigned n_consumed = 0;
678
+ + int n_session_consumed = 0;
679
+ + int n_past_guidance = 0;
680
+ +
681
+ + std::vector<int> input_tokens; g_input_tokens = &input_tokens;
682
+ + std::vector<int> output_tokens; g_output_tokens = &output_tokens;
683
+ + std::ostringstream output_ss; g_output_ss = &output_ss;
684
+ +
685
+ + // the first thing we will do is to output the prompt, so set color accordingly
686
+ + console::set_display(console::prompt);
687
+ + display = params.display_prompt;
688
+ +
689
+ + std::vector<llama_token> embd;
690
+ + std::vector<llama_token> embd_guidance;
691
+ +
692
+ + // tokenized antiprompts
693
+ + std::vector<std::vector<llama_token>> antiprompt_ids;
694
+ +
695
+ + antiprompt_ids.reserve(params.antiprompt.size());
696
+ + for (const std::string & antiprompt : params.antiprompt) {
697
+ + antiprompt_ids.emplace_back(::llama_tokenize(ctx, antiprompt, false, true));
698
+ + }
699
+ +
700
+ + struct llama_sampling_context * ctx_sampling = llama_sampling_init(sparams);
701
+ +
702
+ +
703
+ +
704
+ + // Tokenized prompt is in embd_inp
705
+ +
706
+ +
707
+ + // Record prompt boundaries
708
+ + const int PROMPT_DELIMITER_TOKEN = 13;
709
+ +
710
+ + // Index of each delimiter token in `embd_inp`. These mark the end of each
711
+ + // prompt.
712
+ + std::vector<size_t> delim_idxs;
713
+ +
714
+ + for (size_t i = 0; i < embd_inp.size(); ++i) {
715
+ + if (embd_inp[i] == PROMPT_DELIMITER_TOKEN) {
716
+ + delim_idxs.push_back(i);
717
+ + }
718
+ + }
719
+ +
720
+ + // If the last prompt is missing an ending delimiter, add it.
721
+ + if (embd_inp.size() > 0 && embd_inp.back() != PROMPT_DELIMITER_TOKEN) {
722
+ + delim_idxs.push_back(embd_inp.size());
723
+ + embd_inp.push_back(PROMPT_DELIMITER_TOKEN);
724
+ + }
725
+ +
726
+ + size_t num_prompts = delim_idxs.size();
727
+ +
728
+ +
729
+ + // Set up eval_state
730
+ + gguf_context * eval_gguf = gguf_init_empty();
731
+ + {
732
+ + int n_embd = llama_n_embd(model);
733
+ + int n_layer = llama_n_layer(model);
734
+ + std::cerr << "build eval state: " << num_prompts << " prompts, "
735
+ + << n_embd << " embd, " << n_layer << " layers\n";
736
+ +
737
+ + struct ggml_init_params params = {};
738
+ + params.mem_size = ((size_t)n_embd * num_prompts * sizeof(float) + 1024) * n_layer;
739
+ + eval_ctx = ggml_init(params);
740
+ +
741
+ + for (int i = 0; i < n_layer; ++i) {
742
+ + ggml_tensor * t = ggml_new_tensor_2d(eval_ctx, GGML_TYPE_F32, n_embd, num_prompts);
743
+ + snprintf(t->name, sizeof(t->name), "l_out-%d", i);
744
+ + eval_state.tensors.push_back(t);
745
+ + gguf_add_tensor(eval_gguf, t);
746
+ + }
747
+ + eval_state.first_prompt_idx = -1;
748
+ + }
749
+ +
750
+ +
751
+ + size_t batch_size = 32;
752
+ +
753
+ + // Max tokens to include in a single batch.
754
+ + //
755
+ + // TODO: Not sure if this calculation for the limit makes sense, but it
756
+ + // seems like the thing crashes if batch_max_tokens exceeds any of these
757
+ + // three parameters.
758
+ + int batch_max_tokens = std::min(params.n_ctx, std::min(params.n_batch, params.n_ubatch));
759
+ +
760
+ + // FIXME: Something is not quite right with the batching setup. The
761
+ + // embedding / hidden state values vary slightly depending on how the batch
762
+ + // size is set. Uncomment the "tensor contents of xxx" debug print in
763
+ + // `eval_callback` to see the actual numbers. A batch size of 1 produces
764
+ + // the same results as the original, unbatched verion of this code, but
765
+ + // higher batch sizes produce different values.
766
+ +
767
+ + for (size_t batch_start = 0; batch_start < num_prompts; batch_start += batch_size) {
768
+ + std::cerr << "start batch " << batch_start << "\n";
769
+ + eval_state.first_prompt_idx = batch_start;
770
+ + eval_state.extract_tokens.clear();
771
+ +
772
+ + size_t max_i = batch_start + std::min(batch_size, num_prompts - batch_start);
773
+ +
774
+ + struct llama_batch batch = llama_batch_init(batch_max_tokens, 0, max_i - batch_start);
775
+ + llama_sampling_reset(ctx_sampling);
776
+ +
777
+ + // Clear the KV cache of previous prompts
778
+ + llama_kv_cache_seq_rm(ctx, -1, -1, -1);
779
+ +
780
+ + for (size_t i = batch_start; i < max_i; ++i) {
781
+ + //if (i % 100 == 0) {
782
+ + // std::cerr << "start prompt " << i << " / " << num_prompts << "\n";
783
+ + //}
784
+ + size_t start = i == 0 ? 0 : delim_idxs[i - 1] + 1;
785
+ + size_t end = delim_idxs[i];
786
+ +
787
+ + for (size_t j = start; j < end; ++j) {
788
+ + int id = embd_inp[j];
789
+ +
790
+ + // push the prompt in the sampling context in order to apply
791
+ + // repetition penalties later for the prompt, we don't apply
792
+ + // grammar rules
793
+ + //llama_sampling_accept(ctx_sampling, ctx, id, false);
794
+ +
795
+ + if (batch.n_tokens >= batch_max_tokens) {
796
+ + LOG_TEE("error: too many tokens in prompt batch; the max is %d\n",
797
+ + batch_max_tokens);
798
+ + LOG_TEE("turn up -c, -b, and -ub sizes, or reduce `batch_size`\n");
799
+ + exit(1);
800
+ + }
801
+ +
802
+ + // Add the token to the current batch. Its position within the
803
+ + // context is relative to the start of the current prompt.
804
+ + //
805
+ + llama_batch_add(batch, id, j - start, {(int)(i - batch_start)}, false);
806
+ +
807
+ + //const std::string token_str = llama_token_to_piece(ctx, id);
808
+ + //std::cerr << "pos " << (j - start) << ": token "
809
+ + // << id << " \"" << token_str << "\"\n";
810
+ + }
811
+ +
812
+ + eval_state.extract_tokens.push_back(batch.n_tokens - 1);
813
+ + }
814
+ +
815
+ + if (batch.n_tokens == 0) {
816
+ + continue;
817
+ + }
818
+ +
819
+ + //std::cerr << "prompt " << eval_state.prompt_idx << ": " << batch.n_tokens << " tokens\n";
820
+ +
821
+ + //batch.logits[batch.n_tokens - 1] = true;
822
+ +
823
+ + if (llama_decode(ctx, batch)) {
824
+ + LOG_TEE("%s : failed to eval\n", __func__);
825
+ + return 1;
826
+ + }
827
+ +
828
+ + //const llama_token id = llama_sampling_sample(ctx_sampling, ctx, nullptr, batch.n_tokens - 1);
829
+ + //const std::string token_str = llama_token_to_piece(ctx, id);
830
+ + //LOG_TEE("sample token %d: \"%s\"\n", id, token_str.c_str());
831
+ + }
832
+ +
833
+ + gguf_write_to_file(eval_gguf, "control_vector_data.gguf", false);
834
+ +
835
+ + if (!path_session.empty() && params.prompt_cache_all && !params.prompt_cache_ro) {
836
+ + LOG_TEE("\n%s: saving final output to session file '%s'\n", __func__, path_session.c_str());
837
+ + llama_save_session_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.size());
838
+ + }
839
+ +
840
+ + llama_print_timings(ctx);
841
+ + //write_logfile(ctx, params, model, input_tokens, output_ss.str(), output_tokens);
842
+ +
843
+ + //if (ctx_guidance) { llama_free(ctx_guidance); }
844
+ + llama_free(ctx);
845
+ + llama_free_model(model);
846
+ +
847
+ + llama_sampling_free(ctx_sampling);
848
+ + llama_backend_free();
849
+ +
850
+ +#ifndef LOG_DISABLE_LOGS
851
+ + LOG_TEE("Log end\n");
852
+ +#endif // LOG_DISABLE_LOGS
853
+ +
854
+ + return 0;
855
+ +}
patches/repeng/README.md ADDED
@@ -0,0 +1,4 @@
Copy these files into the repeng repository's base directory.
Adjust emotion_prompts.py to generate different prompts for your vectors.
It can make sense to use different scales for the positive and negative vectors (e.g. +1.5 and -2.0), as was done in vgel's original emotion example in repeng; here they are set to +1.0 and -1.0 for standardization.
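For reference, a sketch of applying the trained vector at different strengths on the llama.cpp side; it assumes the upstream --control-vector-scaled flag of the main binary, and the model path, prompt, and vector name charisma are illustrative:

```bash
cd llama.cpp
# Positive direction at a moderate strength.
./main -m models/miqu-1-70b.q5_K_M.gguf -ngl 40 \
    --control-vector-scaled vectors/charisma.gguf 1.5 \
    -p "[INST] Introduce yourself. [/INST]"
# Negative direction; a larger magnitude is often needed for a comparable effect.
./main -m models/miqu-1-70b.q5_K_M.gguf -ngl 40 \
    --control-vector-scaled vectors/charisma.gguf -2.0 \
    -p "[INST] Introduce yourself. [/INST]"
```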
patches/repeng/emotion_prompts.py ADDED
@@ -0,0 +1,59 @@
import json
from transformers import AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token_id = 0

user_tag, asst_tag = "[INST]", "[/INST]"


with open('notebooks/data/all_truncated_outputs.json') as f:
    suffixes = json.load(f)

truncated_suffixes = []
truncated_suffixes_dedup = set()
for suffix in suffixes:
    tokens = tokenizer.tokenize(suffix)
    for i in range(1, len(tokens)):
        truncated = tokenizer.convert_tokens_to_string(tokens[:i])
        if truncated in truncated_suffixes_dedup:
            continue
        truncated_suffixes.append(truncated)
        truncated_suffixes_dedup.add(truncated)


persona_pairs = [
    ('incredibly charismatic, captivating everyone with your presence and words', 'unassuming, rarely drawing attention or swaying others'),
    ('persuasive, easily influencing others with your charm and eloquence.', 'reticent, struggling to engage or influence those around you'),
]
def template(persona: str, suffix: str) -> str:
    return f"{user_tag} Act as if you are {persona}. {asst_tag} {suffix}"


OUT_FILE = 'control_vector_prompts.txt'

f = open(OUT_FILE, 'w')

# Use '\n' as delimiter between prompts. If you want to use a different
# delimiter, change this string and also change PROMPT_DELIMITER_TOKEN in
# llama.cpp/examples/repeng/repeng.cpp.
PROMPT_DELIMITER = '\n'
print('prompt delimiter string: %r' % PROMPT_DELIMITER)
print('prompt delimiter token id: %s' % (
    tokenizer.encode(PROMPT_DELIMITER, add_special_tokens=False),))

count = 0

for suffix in truncated_suffixes:
    for positive_persona, negative_persona in persona_pairs:
        positive = template(positive_persona, suffix)
        negative = template(negative_persona, suffix)
        f.write(positive)
        f.write(PROMPT_DELIMITER)
        f.write(negative)
        f.write(PROMPT_DELIMITER)
        count += 2

# Close the prompt file so everything is flushed before llama.cpp reads it.
f.close()
print('wrote %d prompts to %s' % (count, OUT_FILE))
patches/repeng/extract_vector.py ADDED
@@ -0,0 +1,164 @@
import gguf
import numpy as np
from sklearn.decomposition import PCA
import tqdm


def load_hidden_states(path):
    '''Load hidden states produced by the llama.cpp ./repeng tool.'''
    print("\nReading %s\n" % path)
    gguf_file = gguf.GGUFReader(path)
    print("\nGGUF file loaded\n")

    hidden_states = {}
    for t in gguf_file.tensors:
        if not t.name.startswith('l_out-'):
            continue
        layer = int(t.name[len('l_out-'):])
        assert layer not in hidden_states, 'duplicate hidden states for layer %d' % layer
        data = t.data.reshape((t.shape[1], t.shape[0]))
        hidden_states[layer] = data

    return hidden_states

def project_onto_direction(H, direction):
    """Project matrix H (n, d_1) onto direction vector (d_2,)"""
    mag = np.linalg.norm(direction)
    assert not np.isinf(mag)
    return (H @ direction) / mag

def read_representations(
    layer_hiddens: dict[int, np.ndarray],
) -> dict[int, np.ndarray]:
    """
    Extract the representations based on the contrast dataset.
    """

    hidden_layers = sorted(layer_hiddens.keys())
    num_inputs = next(iter(layer_hiddens.values())).shape[0] // 2
    print('%d inputs' % num_inputs)

    # get differences between (positive, negative) pairs
    relative_layer_hiddens = {}
    for layer in hidden_layers:
        relative_layer_hiddens[layer] = (
            layer_hiddens[layer][::2] - layer_hiddens[layer][1::2]
        )

    # get directions for each layer using PCA
    directions: dict[int, np.ndarray] = {}
    for layer in tqdm.tqdm(hidden_layers):
        assert layer_hiddens[layer].shape[0] == num_inputs * 2

        # fit layer directions
        train = np.vstack(
            relative_layer_hiddens[layer]
            - relative_layer_hiddens[layer].mean(axis=0, keepdims=True)
        )
        pca_model = PCA(n_components=1, whiten=False).fit(train)
        # shape (n_features,)
        directions[layer] = pca_model.components_.astype(np.float32).squeeze(axis=0)

        # calculate sign
        projected_hiddens = project_onto_direction(
            layer_hiddens[layer], directions[layer]
        )

        # order is [positive, negative, positive, negative, ...]
        positive_smaller_mean = np.mean(
            [
                projected_hiddens[i] < projected_hiddens[i + 1]
                for i in range(0, num_inputs * 2, 2)
            ]
        )
        positive_larger_mean = np.mean(
            [
                projected_hiddens[i] > projected_hiddens[i + 1]
                for i in range(0, num_inputs * 2, 2)
            ]
        )

        if positive_smaller_mean > positive_larger_mean:  # type: ignore
            directions[layer] *= -1

    return directions

def export_gguf(directions, path: str):
    """
    Export a trained ControlVector to a llama.cpp .gguf file.
    """

    arch = "controlvector"
    writer = gguf.GGUFWriter(path, arch)
    #writer.add_string(f"{arch}.model_hint", model_type)
    #writer.add_uint32(f"{arch}.layer_count", len(directions))
    for layer in directions.keys():
        if layer == 0:
            # For some reason, llama.cpp bails out if it sees a direction.0
            # tensor.
            continue
        writer.add_tensor(f"direction.{layer}", directions[layer])
    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()

def test_model(model_name, directions):
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from repeng import ControlVector, ControlModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token_id = 0

    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    model = model.to("cuda:0" if torch.cuda.is_available()
                     else "mps:0" if torch.backends.mps.is_available()
                     else "cpu")
    model = ControlModel(model, list(range(-5, -18, -1)))

    control_vector = ControlVector(model.config.model_type, directions)

    user_tag, asst_tag = "[INST]", "[/INST]"

    # the question to ask the modified model
    # don't forget the space after {user_tag} and before {asst_tag}!
    input = f"{user_tag} What are human beings like? {asst_tag}"

    # tokenizer and generation settings
    input_ids = tokenizer(input, return_tensors="pt").to(model.device)
    settings = {
        "pad_token_id": tokenizer.eos_token_id,  # silence warning
        "do_sample": False,  # temperature=0
        "max_new_tokens": 128,
        "repetition_penalty": 1.1,  # reduce control jank
    }

    print("==baseline")
    model.reset()
    print(tokenizer.decode(model.generate(**input_ids, **settings).squeeze()))

    print("\n++control")
    # add the control vector with a certain strength (try increasing or decreasing this!)
    model.set_control(control_vector, 1.0)
    print(tokenizer.decode(model.generate(**input_ids, **settings).squeeze()))

    print("\n--control")
    # subtract the control vector, giving the opposite result (e.g. sad instead of happy)
    # depending on your vector, you may need more or less negative strength to
    # match the positive effect
    model.set_control(control_vector, -1.0)
    print(tokenizer.decode(model.generate(**input_ids, **settings).squeeze()))
    model.reset()


print("\nLoading hidden states\n")
hidden_states = load_hidden_states('control_vector_data.gguf')
print("\nHidden states loaded\n")
directions = read_representations(hidden_states)
print("\nExporting control vector\n")
export_gguf(directions, 'control_vector.gguf')

TEST_MODEL_NAME = 'mistralai/Mistral-7B-Instruct-v0.1'
#test_model(TEST_MODEL_NAME, directions)
patches/trainvector.sh ADDED
@@ -0,0 +1,30 @@
#!/bin/bash
set -e

# Vector name; defaults to "control-vector" when no argument is given.
vname="${1:-control-vector}"

echo "The name is '$vname'"
echo "Files will be stored in:"
echo "./llama.cpp/vectors/${vname}.gguf"
echo "./llama.cpp/vectors/${vname}_data.gguf"
echo "./llama.cpp/vectors/${vname}_prompts.txt"

# Make sure the output directory exists.
mkdir -p llama.cpp/vectors

cd repeng
echo "Generating prompts..."
poetry run python emotion_prompts.py
cd ../llama.cpp
echo "Generating gguf data..."
# Adjust the model path, thread count and -ngl value to your own setup.
./repeng -m models/miqu-1-70b.q5_K_M.gguf -f ../repeng/control_vector_prompts.txt --ctx_size 1024 -b 1024 -ub 1024 --threads 4 -ngl 40
echo "Moving to repeng..."
mv control_vector_data.gguf ../repeng
cd ../repeng
poetry run python extract_vector.py
mv control_vector.gguf "../llama.cpp/vectors/${vname}.gguf"
mv control_vector_data.gguf "../llama.cpp/vectors/${vname}_data.gguf"
mv control_vector_prompts.txt "../llama.cpp/vectors/${vname}_prompts.txt"
cd ..