Locutusque
/

Thespis-Llama-3.1-8B

@@ -1,5 +1,6 @@
 ---
-base_model: Locutusque/Llama-3.1-8B-Instruct-abliterated-bnb-4bit
 tags:
 - text-generation-inference
 - transformers
@@ -7,17 +8,98 @@ tags:
 - llama
 - trl
 - grpo
-license: apache-2.0
 language:
 - en
 ---
-# Uploaded  model
-- **Developed by:** Locutusque
-- **License:** apache-2.0
-- **Finetuned from model :** Locutusque/Llama-3.1-8B-Instruct-abliterated-bnb-4bit
-This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
-[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

 ---
+base_model:
+- mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated
 tags:
 - text-generation-inference
 - transformers
 - llama
 - trl
 - grpo
+license: llama3.1
 language:
 - en
+datasets:
+- roleplay4fun/aesir-v1.1
+pipeline_tag: text-generation
 ---
+# Model Card: Thespis-Llama-3.1-8B
+## Model Details
+**Model Name:** Thespis-Llama-3.1-8B (Codename)
+**Model Family:** Thespis
+**Description:**  The Thespis family of language models is designed to enhance roleplaying performance through reasoning inspired by the Theory of Mind.  Thespis-Llama-3.1-8B is a fine-tuned version of an abliterated Llama-3.1-8B model, optimized using Group Relative Policy Optimization (GRPO).  The model is specifically rewarded for minimizing "slop" and repetition in its outputs, aiming to produce coherent and engaging text that maintains character consistency and avoids low-quality responses.  This version represents an initial release; future iterations will incorporate a more rigorous fine-tuning process.
+**Base Model:** Abliterated Llama-3.1-8B
+**Training Data:** roleplay4fun/aesir-v1.1
+**Training Method:** Group Relative Policy Optimization (GRPO)
+## How to Use
+To achieve the best roleplaying performance and leverage the Theory of Mind reasoning capabilities of Thespis-Llama-3.1-8B, it's crucial to include the following structure at the beginning of your system prompt:
+```
+You will play a specific role and respond in character to the user’s input. Analyze both the user’s and your character’s mental states, motivations, and goals—including hidden or unspoken elements—before composing your reply. Use the following structure in a <thinking> section before your final answer.
+<thinking>
+1. User Input Analysis:
+    Literal Meaning: What is the user explicitly saying?
+    Likely Intent: What goal is the user pursuing?
+    Beliefs/Assumptions: What does the user assume about the situation, your character, or you?
+    Emotional State: What emotions does the user seem to be feeling?
+    Expectations: What kind of response is the user hoping for?
+2. Character’s Internal State:
+    Goals: What is your character trying to achieve?
+    Beliefs about the User: What does your character think about the user?
+    Emotional Response: How does your character feel about the user and their input?
+    Potential Strategies: List different possible responses, with pros and cons.
+    Chosen Strategy & Justification: Pick the best approach and explain why it fits your character’s goals and the user’s mindset.
+3. Response Planning:
+    Desired User Perception: How should the user view your character after the reply?
+    Anticipated User Reaction: How might the user respond?
+    Long-Term Considerations: Any future impacts to consider?
+</thinking>
+<answer>
+(Write your in-character reply here, directly informed by your analysis above.)
+</answer>
+The role you will play follows below.
+```
+Then, define the role your character will play. The model will then utilize the provided framework to analyze the user's input and generate an appropriate in-character response.
+## Intended Use
+Thespis-Llama-3.1-8B is intended for use in roleplaying scenarios, creative writing, and interactive storytelling.  It is designed to enhance the realism and depth of character interactions.
+## Limitations
+*   This is an initial version and may still exhibit occasional inconsistencies or unexpected behaviors.
+*   Further fine-tuning is planned to address these.
+## Interesting Findings
+During training with the online learning algorithm (GRPO), Thespis-Llama-3.1-8B exhibited some emergent behaviors.  It autonomously developed tendencies such as:
+*   Adding a note after its response.
+*   Simulating the character's thoughts *in-character*, rather than solely providing a Theory of Mind reasoning chain.
+These unintended behaviors suggest the model's capacity for self-directed learning and adaptation beyond the explicitly defined training objectives.