Tested with GGUF Quantization and not receiving expected output

#11
by ReXommendation - opened

I was testing this model with https://huggingface.co./MaziyarPanahi/solar-pro-preview-instruct-GGUF and the output isn't as expected. It is full of typographical errors, and the model insists it is using and spelling the words correctly even after the errors are explicitly pointed out.

I forgot to mention I was using the Q5_K_M file.

Same here. Frequent spelling errors, made up words, reduced coherency... There appears to be something wrong with the tokenizer.

upstage org

Could you please check with ollama? https://x.com/hunkims/status/1836550727881388381

@ReXommendation did you install the latest version of Ollama? https://ollama.com/download – 0.3.11 is required for Solar Pro Preview

There's also a PR open for llama.cpp: https://github.com/ggerganov/llama.cpp/pull/9541 that will fix this for other GGUF files
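A minimal sketch of checking an installed version against the 0.3.11 minimum mentioned above. The helper names are hypothetical, and the version string is assumed to be plain dotted numbers that you have already obtained from your Ollama install; suffixes such as release-candidate tags are not handled.

```python
# Hypothetical helpers, not part of Ollama: compare a dotted version
# string against the minimum version quoted in this thread.
REQUIRED = (0, 3, 11)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '0.3.11' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_new_enough(installed: str, required: tuple[int, ...] = REQUIRED) -> bool:
    """Return True when the installed version is at least the required one."""
    return parse_version(installed) >= required


if __name__ == "__main__":
    for candidate in ("0.3.9", "0.3.11", "0.4.0"):
        print(candidate, is_new_enough(candidate))
```

Comparing tuples of integers avoids the string-comparison trap where "0.3.9" would sort after "0.3.11".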

It's still full of typographical errors with that patch rebased onto the latest version (a rebase was needed after Granite MoE was merged). It hallucinates and makes typos. This is with an IQ4_XS GGUF quant.

bash-5.1$ lm give a summary of the ollama LLM system
Ollama is a large language model developed by Upstage, a leading AI company. It'utes a powerful AI system that can understand and generate human-like text, making it capable of various tasks such as conversation, content creation, and more.

The old Mistral-based Solar is pure hallucination:

bash-5.1$ lm give a summary of the ollama LLM system
The LLM (Linguistic Language Model) system, also known as the OLLMA (Optimized Language Model for Machine Translation and Associated Systems), is a statistical machine translation framework that uses probabilistic models to improve the accuracy and fluency of translations between languages. Developed by the National Institute for Research in Computer Science and Automation (INRIA) and the University of Edinburgh, the LLM system predicts the likelihood of a sequence of words in a target language based on a training corpus of parallel texts. It employs a hierarchical language model that combines a phrase-based model with a lexical transfer model, allowing it to handle long-distance dependencies and improve translation quality. LLM has been widely used in various machine translation systems and has contributed significantly to the advancement of statistical machine translation research.

gemma 2 9b:

bash-5.1$ lm give a summary of the ollama LLM system
Ollama is an open-source, modular, and customizable large language model (LLM) system developed by the Ollama team.

Here's a breakdown of its key features:

Open-Source and Modular:

  • Accessible: Ollama's code, weights, and documentation are freely available to the public, fostering transparency and community development.
  • Modular Design: The system is built with a modular architecture, allowing users to easily swap out components like the tokenizer, model architecture, and inference engine. This flexibility enables customization and experimentation.

Focus on Efficiency:

  • Quantization: Ollama supports quantization techniques, which reduce the model's size and memory footprint without significant performance loss. This makes it more accessible for deployment on resource-constrained devices.
  • Inference Optimization: The system incorporates various inference optimization strategies to improve speed and efficiency.

Customization and Extensibility:

  • Fine-Tuning: Users can fine-tune Ollama on their own datasets to adapt it to specific tasks or domains.
  • Plugin System: Ollama features a plugin system that allows developers to extend its functionality by adding new capabilities.

Community-Driven Development:

  • Active Community: Ollama benefits from an active community of developers and researchers who contribute to its development, provide support, and share resources.

Use Cases:

Ollama's open-source nature and modularity make it suitable for a wide range of applications, including:

  • Text Generation: Writing stories, articles, summaries, and more.
  • Chatbots and Conversational AI: Creating interactive chatbots for customer service, education, or entertainment.
  • Code Generation: Assisting developers in writing and understanding code.
  • Data Analysis and Summarization: Extracting insights and summarizing large amounts of text data.

Overall, Ollama is a promising open-source LLM system that offers a balance of performance, efficiency, and customization options, making it a valuable tool for developers, researchers, and anyone interested in exploring the potential of large language models.

@steampunque Hi, have you tried the model downloaded from Ollama? The Ollama team has supported the creation of GGUF files with block skip connections, which have been newly added to the Solar model. This is the Q4_K_M quant. link

Thanks for your response. I do not use Ollama, and they do not appear to provide a link to download the GGUF outside of the Ollama system. I'll investigate the block skip connections further. It seems it should not be too much trouble to merge this function into the llama.cpp convert script if it's already done in the Ollama system.

Update. Block skips were already integrated into the solar patch:

+@Model.register("SolarForCausalLM")
+class SolarModel(LlamaModel):
+    model_arch = gguf.MODEL_ARCH.SOLAR
+
+    def set_gguf_parameters(self):
+        super().set_gguf_parameters()
+
+        for i, bskcn in enumerate(self.hparams[k] for k in self.hparams.keys()
+            # store the skip connections as a layer index where a non-zero val
+            # this approach simplifies lookup at inference time
+            self.gguf_writer.add_block_skip_connection(i, [1 if n in bskcn els
+
+    def prepare_tensors(self):
+        if bskcn_tv := self.find_hparam(['bskcn_tv'], optional=True):
+          # use bskcn_tv[1] for inference since bskcn_tv[0] is for training
+          self.gguf_writer.add_tensor(self.format_tensor_name(gguf.MODEL_TENSO
+
+        super().prepare_tensors()
+
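The patch lines above are cut off, so the exact logic cannot be read from them. As a rough illustration of the idea the comments describe (storing skip connections as a per-layer list with a non-zero value at the participating layer indices), here is a sketch. The function name, the layer count, and the example indices are all assumptions for illustration, not values from the patch or the model config.

```python
# Illustrative only: encode a set of layer indices as a 0/1 list so that
# a lookup at inference time is a plain index into the list.
def skip_mask(skip_layers: list[int], n_layers: int) -> list[int]:
    """Return a list of length n_layers with 1 at each index in skip_layers."""
    wanted = set(skip_layers)
    return [1 if n in wanted else 0 for n in range(n_layers)]


if __name__ == "__main__":
    # Made-up indices and layer count, chosen only to show the shape.
    mask = skip_mask([1, 3], n_layers=6)
    print(mask)
    # At inference time, checking a layer is then a single index lookup.
    print(bool(mask[3]), bool(mask[4]))
```

A dense 0/1 list costs a little space but turns the per-layer check into a constant-time index instead of a search through the configured indices.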

I have rebased the patch twice to keep it current with the latest llama.cpp release. There is now activity on the Solar Pro PR, so when it gets officially merged I will try again.
