Issues with Quantization and Functionality of the Model

#2
by MatrixC7 - opened

Greetings!

First of all, thank you for developing and supporting Chinese language capabilities in the Goliath model. However, we've encountered several issues with quantization and functionality that we hope can be addressed.

  1. Issues with Git LFS files during repository cloning:

We attempted to clone the repository with the git clone command but ran into issues with the Git LFS files: each file was only 1 KB in size and contained a Git LFS pointer rather than the expected data.

PS E:\> git clone https://huggingface.co./hongyin/chat-goliath-120b-80k
Cloning into 'chat-goliath-120b-80k'...
remote: Enumerating objects: 65, done.
remote: Counting objects: 100% (62/62), done.
remote: Compressing objects: 100% (61/61), done.
remote: Total 65 (delta 14), reused 0 (delta 0), pack-reused 3
Unpacking objects: 100% (65/65), 662.65 KiB | 215.00 KiB/s, done.
PS E:\> cd .\chat-goliath-120b-80k\
PS E:\chat-goliath-120b-80k> cat .\pytorch_model-00001-of-00024.bin
version https://git-lfs.github.com/spec/v1
oid sha256:34aaef9a13c63f0fd5c14747ecab157bae1a9c948aaa60a543e27d083492f2c1
size 9666282843

We then directly downloaded the files, ensuring their integrity through SHA256 checksums.
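
For anyone hitting the same pointer-file issue, the payloads can also be fetched programmatically; below is a minimal Python sketch (assuming the huggingface_hub package is installed) that downloads the actual shards and streams SHA256 checksums for verification:

# Minimal sketch: fetch the real LFS payloads and checksum them locally.
import hashlib
from pathlib import Path

from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="hongyin/chat-goliath-120b-80k")

def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    # Stream the file so multi-GB shards need not fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

for shard in sorted(Path(local_dir).glob("pytorch_model-*.bin")):
    print(shard.name, sha256sum(shard))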

  2. Strange Metadata and Incoherent Responses in Kobold.cpp:

We used llama.cpp (commit b3a7c20) to quantize the model to Q2_K.

(llamacpp) PS F:\llama.cpp\build\bin\Release> .\quantize.exe F:\chat-goliath-120b-80k-f16.gguf F:\chat-goliath-120b-80k-f16-q2_k.gguf Q2_K
main: build = 1768 (b3a7c20)
main: built with MSVC 19.38.33133.0 for x64
main: quantizing 'F:\chat-goliath-120b-80k-f16.gguf' to 'F:\chat-goliath-120b-80k-f16-q2_k.gguf' as Q2_K
llama_model_loader: loaded meta data with 21 key-value pairs and 1236 tensors from F:\chat-goliath-120b-80k-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = F:\
llama_model_loader: - kv   2:                       llama.context_length u32              = 81920
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 137
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 1
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,49300]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,49300]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,49300]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,71829]   = ["摘 要", "▁ 《", "中 国", "1 ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - type  f32:  275 tensors
llama_model_loader: - type  f16:  961 tensors
llama_model_quantize_internal: meta size = 2310176 bytes
...
llama_model_quantize_internal: model size  = 225133.22 MB
llama_model_quantize_internal: quant size  = 47484.73 MB

Upon loading the files with Kobold.cpp, we observed oddities in the metadata: the Q2_K quantization reported 3.37 BPW, whereas Q2_K usually comes out at around 2.625 BPW (a worked check of this figure follows the log below).

llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 49300
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 81920
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 137
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 81920
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 118.03 B
llm_load_print_meta: model size       = 46.37 GiB (3.37 BPW)
llm_load_print_meta: general.name     = F:\
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.47 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used  =  126.85 MiB
llm_load_tensors: VRAM used           = 47358.35 MiB
llm_load_tensors: offloading 137 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 138/138 layers to GPU
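
As a sanity check on that figure: BPW is simply total bits divided by parameter count, and the size and parameter numbers in the log above reproduce it exactly, so the anomaly lies in the file size itself rather than in the reporting. The arithmetic:

# BPW check using the values printed by llm_load_print_meta above.
size_gib = 46.37                   # model size
params = 118.03e9                  # model params
bits = size_gib * (1024 ** 3) * 8  # GiB -> bytes -> bits
print(f"{bits / params:.2f} BPW")  # prints 3.37, matching the log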

The model's responses were also inconsistent and interspersed with a mix of Chinese and English, lacking coherence and relevance.
kobldcpp gguf response 1.png
kobldcpp gguf response 2.png
kobldcpp gguf response 3.png

  3. Quantization Attempt in exl2:

Moving on, we attempted to quantize the model with exl2. We first converted the model to .safetensors; the SHA256 checksums of the .safetensors files matched the expected values.

PS F:\chat-goliath-120b-80k> Get-ChildItem -File | Get-FileHash -Algorithm SHA256

Algorithm       Hash                                                                   Path
---------       ----                                                                   ----
SHA256          D606246B713E31B1B43DFE913554DD47819089B9E44B9EBAA294EDA2589A6B4C       F:\chat-goliath-120b-80k\added_tokens.json
SHA256          EF58A0874EB14E647EF9DBF1B5A104701375FEE34F2D01BFD9795C984A07FDC0       F:\chat-goliath-120b-80k\config.json
SHA256          E9ADFCFDC532213AE59497E875478D95451D063640FB51755BC63C1070B9701D       F:\chat-goliath-120b-80k\generation_config.json
SHA256          3C9006082484A4A04AD7932BCAF0FAE22948CA063B4E1EF16A74C15D6E7F72CC       F:\chat-goliath-120b-80k\model-00001-of-00024.safetensors
SHA256          60B3F61167C18CFE8968A3FC377D3DF7D14051595CD3044E48549580387B1A42       F:\chat-goliath-120b-80k\model-00002-of-00024.safetensors
SHA256          358A58059687D8DEF130B2E4102DD596825255DA51F18DEAE277921EE6DA28B2       F:\chat-goliath-120b-80k\model-00003-of-00024.safetensors
SHA256          7D7E35B80E4AC0B59E2BAB33F0F8E6FE5F8D2DD36983EF36BCA1E2628C20264E       F:\chat-goliath-120b-80k\model-00004-of-00024.safetensors
SHA256          9CCB7212C979E15229DDBEFF6992E4A30C3D3AD1F1FFE571D14F8EB4721398EE       F:\chat-goliath-120b-80k\model-00005-of-00024.safetensors
SHA256          6753B8281079AC0927C0FB4E4AD9877F57E1BE40E4B5840ABB36F95BE8DB35D7       F:\chat-goliath-120b-80k\model-00006-of-00024.safetensors
SHA256          32BCF2B2BBF3E3BD052239158D14D0693142B6A3D5B014EA0C626C686C42AFA7       F:\chat-goliath-120b-80k\model-00007-of-00024.safetensors
SHA256          051B77A0CD28F54B4AE6D2087257CF4D2C7198C245D9031312582B2AEEEF143C       F:\chat-goliath-120b-80k\model-00008-of-00024.safetensors
SHA256          EFC8528F9FE25EAB6D72F7E84D9CADCD13CA998F179A8AFD6D7521A7F8BCAE1B       F:\chat-goliath-120b-80k\model-00009-of-00024.safetensors
SHA256          109643CE4559EC438CF0D1C8326CE5B32E9DFF4A35AC61EAF149CA1C0B27731C       F:\chat-goliath-120b-80k\model-00010-of-00024.safetensors
SHA256          EE9EB1BEEBC5FBA71DCF3124F82A41EB66F9B29583C4D9426A21D4367D1EF7C2       F:\chat-goliath-120b-80k\model-00011-of-00024.safetensors
SHA256          EEE031C8611D3E7E4E65C22CAA18F4E27319E359A5EB215C932828CB647BB81D       F:\chat-goliath-120b-80k\model-00012-of-00024.safetensors
SHA256          24B9486120B14560703E08D298A281FDD9035DE8DF3CBB528573C6C76E49B2EE       F:\chat-goliath-120b-80k\model-00013-of-00024.safetensors
SHA256          A54D145D4BB95E10F0594F8FE7FE4EBC66DCC5CA3A238EDE7A4D8E33B4E16082       F:\chat-goliath-120b-80k\model-00014-of-00024.safetensors
SHA256          0CAB89F83642A1553BEEB7965111C511BC05F59027C7D19DC9C853A6B422F85A       F:\chat-goliath-120b-80k\model-00015-of-00024.safetensors
SHA256          CCBBDAD6FA6E95720D4B70801E38A2A04154E8787BB5D4A6024E59824EC2191D       F:\chat-goliath-120b-80k\model-00016-of-00024.safetensors
SHA256          08137CCDA1B37C1EFF769EE894BB929FFD43EB7EE735386E626E88B1F187827C       F:\chat-goliath-120b-80k\model-00017-of-00024.safetensors
SHA256          6ED11E748D85699980A3990658D0D3B07BD89E0B55DE755C884555BEE2C4A620       F:\chat-goliath-120b-80k\model-00018-of-00024.safetensors
SHA256          39D879163BF2BF5B1A14906B707159A04C7E8381759A4E2837E517366F4B5D04       F:\chat-goliath-120b-80k\model-00019-of-00024.safetensors
SHA256          4DF0A12B117B57D6D882B0C18894973474E0D7DCD46A1841E9C87E61FC94F4F4       F:\chat-goliath-120b-80k\model-00020-of-00024.safetensors
SHA256          E5CAAA8E602F230CB752F0091E41B2FFCD2550B8C425325D3A3ABA2F6DC27209       F:\chat-goliath-120b-80k\model-00021-of-00024.safetensors
SHA256          D41A1AF612EFC80D051A6487812327EC6D312B5BEA97310986B9AD97B70A3B2A       F:\chat-goliath-120b-80k\model-00022-of-00024.safetensors
SHA256          5F9B8F2040BA5E448110BD9B4924CFAE3970309EAE32811FFB3F3641FE133A7F       F:\chat-goliath-120b-80k\model-00023-of-00024.safetensors
SHA256          6BDDB8D1C02E4F406F5157BC9709D1B23AC094448C712F6EB3797DEF7ED42B9B       F:\chat-goliath-120b-80k\model-00024-of-00024.safetensors
SHA256          E92E69756A90A7F7D90774BFA0A4B09D893E669BDFCFB810B5F3BC598DC52E7C       F:\chat-goliath-120b-80k\model.safetensors.index.json
SHA256          8E16855E765EFB8FF6A8D05DCC1FC4E1C62C633E9503550EA2C462B336A96FC7       F:\chat-goliath-120b-80k\README.md
SHA256          9921EDFD5322DD354EC44F825329E1C6C6B89B357BF3CE49377BE325B23EAA34       F:\chat-goliath-120b-80k\special_tokens_map.json
SHA256          79F2F33C690C49458F122090C1DFF223481CCF28162289407E8FA788074A330D       F:\chat-goliath-120b-80k\tokenizer_config.json
SHA256          4E28ED6CBB25942DCCBDBC000F414BC6075F580CC91E74C1AEC04D811CA60EAA       F:\chat-goliath-120b-80k\tokenizer.json
SHA256          D16FB023FB71C56B63E09C51631A3BC56297CAF04E60CEB1812614569DE3DBDD       F:\chat-goliath-120b-80k\tokenizer.model
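
The .bin-to-.safetensors conversion can be done with plain transformers; here is a minimal sketch of one such route (paths are illustrative, taken from this thread, and loading assumes enough RAM to hold the fp16 weights):

# Minimal sketch of a .bin -> .safetensors conversion via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = r"F:\llm-models-raw\chat-goliath-120b-80k"  # original .bin shards
dst = r"F:\chat-goliath-120b-80k"                 # .safetensors output

model = AutoModelForCausalLM.from_pretrained(
    src, torch_dtype=torch.float16, low_cpu_mem_usage=True
)
model.save_pretrained(dst, safe_serialization=True)  # writes .safetensors shards

tokenizer = AutoTokenizer.from_pretrained(src)
tokenizer.save_pretrained(dst)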

However, the quantized model likewise failed to yield coherent replies, showing the same issues as in the previous steps.

ooba webui exl2 response.png

We believe we need your assistance here. Thank you for your time and consideration!

Kind regards,
Fangru Shao

Owner

Dear Dr Shao,

Thank you for your detailed test report; you did a great job!

  1. Regarding the files downloaded by git clone being very small: I uploaded the files directly through a web browser over a VPN, because my network environment is too poor to push them with git.
  2. The 46.37 GiB (3.37 BPW) after quantization is correct. I expanded the Chinese vocabulary, which enlarged the word embedding matrix, so the model has more parameters than the original Goliath model (a sketch of this mechanism follows the list).
  3. Regarding the mixture of Chinese and English in the generated responses, I suspect the main reason is that I did not fully train on a large amount of data after expanding the Chinese vocabulary; I used some low-quality data for instruction tuning before uploading the model.
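
To illustrate what item 2 describes: in transformers, vocabulary expansion grows the embedding (and output) matrix rows to match the new vocabulary size. A minimal sketch, where the model path and added tokens are illustrative, not the actual training code:

# Illustrative sketch of vocabulary expansion growing the parameter count.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "goliath-120b"  # hypothetical path to the base checkpoint
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

print("rows before:", model.get_input_embeddings().weight.shape[0])
tokenizer.add_tokens(["摘要", "中国"])         # example Chinese tokens
model.resize_token_embeddings(len(tokenizer))  # grows embedding and lm_head rows
print("rows after:", model.get_input_embeddings().weight.shape[0])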

In the future, I may conduct training based on earlier checkpoints.

Best regards,
Hongyin Zhu

Hi Hongyin,

Thanks for the information! We will wait for your progress and hope you can successfully train a good model.

Kind regards,
Fangru Shao

Greetings!

I have noticed that you have re-uploaded the model files, but it seems that only some of the parts have been updated. Is the upload incomplete, or is this a network issue?
Anyway, take your time; we will be waiting for your model.
Wish you a nice day!

Kind regards,
Fangru Shao

Owner

Hi!

I suggest that you only replace the two files pytorch_model-00001-of-00024.bin and pytorch_model-00024-of-00024.bin; you do not need to download the others again. Since I only fine-tuned some of the model parameters, most of them stayed fixed. When I was uploading the model to Hugging Face, the website automatically verified the hash values and skipped files that had not changed.
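
If it helps, those two shards can be fetched individually with the huggingface_hub package instead of re-downloading everything; a minimal sketch (filenames from the message above):

# Minimal sketch: re-download only the two replaced shards.
from huggingface_hub import hf_hub_download

repo = "hongyin/chat-goliath-120b-80k"
for name in ("pytorch_model-00001-of-00024.bin",
             "pytorch_model-00024-of-00024.bin"):
    path = hf_hub_download(repo_id=repo, filename=name)
    print("fetched", path)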

Best regards,
Hongyin Zhu

Sadly it might still not be working properly. I am using the Q2_K quantization of the model.
Snipaste_2024-01-17_03-00-47.png
Here are the checksums of my files.

Algorithm       Hash                                                                   Path
---------       ----                                                                   ----
SHA256          EB45E5C5BF988775AF2CEC6524040BD88EBC002A83891DBDA052AB59CBF3F628       F:\llm-models-raw\chat-goliath-120b-80k\added_tokens.json
SHA256          DA1F80396552FD1AF2F8FF019F3D17FA4E2903C5127F501249FC9C9DBD4DDE64       F:\llm-models-raw\chat-goliath-120b-80k\config.json
SHA256          B77BE1183605A19D3B3A115236EBEB8B15B04B784CCC386EC0B2B782D597A216       F:\llm-models-raw\chat-goliath-120b-80k\generation_config.json
SHA256          07FB40C64E2F645EBBB4450A5FB9882EDA477EC3B9DE629B422BD16557112CC8       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00001-of-00024.bin
SHA256          C6395064F3EBBF2DC4A9C5231D8FAE64E8441DEBF1BA69B5B387261AC6DB6B3F       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00002-of-00024.bin
SHA256          945CF5F3608DE8F01E0B31F0E9CB1A1FD0A1809F371B88B24877132E75635D33       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00003-of-00024.bin
SHA256          800A35D317B6AA9F4DE98CF0B0264FFE8D563750B0055555456F4EAFF2F9C5C3       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00004-of-00024.bin
SHA256          2125CB4B9EA54182E6ADA059022C5FC6BFAF06F25AEBE198F6CCA1913F5C918D       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00005-of-00024.bin
SHA256          420E6788E175CB6EB89CB57110A11A2A501EDD5F02B9FBA138B59AAB3DB51A8A       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00006-of-00024.bin
SHA256          B3322275DFFF7840BF38CAD7C26B94BB4DC32E0F308D85FED4BCC3DD74C03141       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00007-of-00024.bin
SHA256          4C0C938DF6A934B7A08467019FBE75EBAAAA4369E9A27614A1A7B84735B171A5       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00008-of-00024.bin
SHA256          DFEBA74B2278133351B9FC210A4CFB6A13AB8C518EA8B32113B0592BA11DE533       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00009-of-00024.bin
SHA256          3E62B8FDC48E4A0F72FB3A5F92B74E4ADCBA2F2F65F18B332EA19EAB305CC770       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00010-of-00024.bin
SHA256          274F1B52C8DB4B2A182FDE1AF9840D94DC5C18CA30CC1DE8E2F47D15EF777A88       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00011-of-00024.bin
SHA256          51678AF98D98DD4712DB8F14F6D9C55CC601F97A985958FFF1C4A6C77E6E38C7       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00012-of-00024.bin
SHA256          55C2365A54055CDB2C9432A4861DDA8F8EA1B8057D098FD53297976150E482C9       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00013-of-00024.bin
SHA256          7E9DEE41C3CDC3F7FCE9D23B5BD2D15AB94CE603CAC360E4A5E8BDCFEDCF2453       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00014-of-00024.bin
SHA256          431BA6777E57BF184E07FBF19B19C2FF11A63F524C0F3B43B7F9A6C6AD2B7FBE       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00015-of-00024.bin
SHA256          CB2AC86E13CD3352B4299C78DC87CA7AA9C09A8EA82F2D96660BB39272A248F5       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00016-of-00024.bin
SHA256          ABC02772BC996E04BF4ED576F679A54B2E26904BC6881A74ED82A54D68799AC1       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00017-of-00024.bin
SHA256          22EB0D7A4B0E1D430E9B208DA108521AED410C2B87EFE81F8699516DDE372716       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00018-of-00024.bin
SHA256          9FCE38CE560E92EEECE946B60188F46B476574C1B5183513C483908739E6B7D6       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00019-of-00024.bin
SHA256          327DEB0B6547A4B1B281188608086AC2FE5CFE5D157DA27C77A0D7EE1846647D       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00020-of-00024.bin
SHA256          E94589111171D1A634DEFD4462FD2B2B3DF6EF51C9F4D7313D0E13486D087ABE       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00021-of-00024.bin
SHA256          8BE9A89C7FC63596C251121EF799F35F60CEBA22CBDCA8D68B0C283FE321D803       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00022-of-00024.bin
SHA256          E952F146D9BCDED86E09F935290636993A307BD4FB0F75425DC080BEAF74EC22       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00023-of-00024.bin
SHA256          C867E8E7D6C2CB8D52E55C7A82A02C2D97E48CA965C986B51813F34AA7C797E0       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model-00024-of-00024.bin
SHA256          71C47CECFF8AF76F8247098000694A47000FBB0268C02CF3FA6AD8BC97A35F6A       F:\llm-models-raw\chat-goliath-120b-80k\pytorch_model.bin.index.json
SHA256          44D3B2B486728D1946EF5330D909EF36E36250BB7ED70164B91C9B004106DBBC       F:\llm-models-raw\chat-goliath-120b-80k\README.md
SHA256          42E10E2B1078436869C03BE50600A60ACD2C018B48224169E4F504E8382B77E1       F:\llm-models-raw\chat-goliath-120b-80k\special_tokens_map.json
SHA256          40F7BE569328E60645C00D732B7CE81E2DA455CC35145C6C1D2C94E395FF556C       F:\llm-models-raw\chat-goliath-120b-80k\tokenizer_config.json
SHA256          7BBB48CE49C00135753F4F8D093893650C3D249352B0672ED4A1B68DE236473F       F:\llm-models-raw\chat-goliath-120b-80k\tokenizer.json
SHA256          D16FB023FB71C56B63E09C51631A3BC56297CAF04E60CEB1812614569DE3DBDD       F:\llm-models-raw\chat-goliath-120b-80k\tokenizer.model

Kind regards,
Fangru Shao

Owner

I suggest you input the following template directly.

input_str = "Human: How is an earthquake is measured? \nAssistant:"

"Human:" and "Assistant:" are special symbols.

Still no luck :(
To confirm, am I doing it correctly?
Below is the full log; you can also see a warning from llm_load_vocab: mismatch in special tokens definition ( 1151/49300 vs 259/49300 ).

PowerShell 7.4.1
PS F:\llama.cpp\build\bin\Release> .\main.exe -m F:\chat-goliath-120b-80k-q2_k.gguf --prompt "Human: How is an earthquake is measured? \nAssistant: "
Log start
main: build = 1879 (3e5ca79)
main: built with MSVC 19.38.33134.0 for x64
main: seed  = 1705473913
llama_model_loader: loaded meta data with 21 key-value pairs and 1236 tensors from F:\chat-goliath-120b-80k-q2_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = F:\llm-models-raw
llama_model_loader: - kv   2:                       llama.context_length u32              = 81920
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 137
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 10
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,49300]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,49300]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,49300]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  275 tensors
llama_model_loader: - type q2_K:  549 tensors
llama_model_loader: - type q3_K:  411 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 1151/49300 vs 259/49300 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 49300
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 81920
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 137
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 81920
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 118.03 B
llm_load_print_meta: model size       = 40.28 GiB (2.93 BPW)
llm_load_print_meta: general.name     = F:\llm-models-raw
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.47 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/138 layers to GPU
llm_load_tensors:        CPU buffer size = 41251.23 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   274.00 MiB
llama_new_context_with_model: KV self size  =  274.00 MiB, K (f16):  137.00 MiB, V (f16):  137.00 MiB
llama_new_context_with_model: graph splits (measure): 1
llama_new_context_with_model:        CPU compute buffer size =   145.00 MiB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 Human: How is an earthquake is measured? \nAssistant: 演唱歌曲肠漳 make Buyer东南城市 Buyer顶 town顶 town Mill have town Mill details Mill have town details Mill have town details Mill have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town details have town

Kind regards,
Fangru Shao

Owner

It doesn't seem to be using my tokenizer. I don't know whether this main.exe is loading the tokenizer correctly; usually this is specified through the AutoTokenizer class of the transformers toolkit.

F:\llama.cpp\build\bin\Release> .\main.exe -m F:\chat-goliath-120b-80k-q2_k.gguf
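
One way to test this hypothesis is to compare the reference tokenizer's output with llama.cpp's tokenization of the same string; a minimal sketch using AutoTokenizer (local path as in the logs above):

# Minimal sketch: inspect the reference tokenizer's view of the prompt.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(r"F:\llm-models-raw\chat-goliath-120b-80k")
prompt = "Human: How is an earthquake measured? \nAssistant:"
ids = tokenizer.encode(prompt)
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))

llama.cpp's main prints its own view of the prompt tokens when run with --verbose-prompt, so the two lists can be compared directly.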

Owner

If you want to learn about large language models, you can add me on WeChat: shuilaizhujiu2022
