philipp-zettl committed
Commit 6cc2b88
Parent(s): 62e2378
Upload README.md with huggingface_hub

README.md CHANGED
---
language: multilingual
license: mit
library_name: torch
tags: []
base_model: BAAI/bge-m3
datasets: philipp-zettl/GGU-xx
metrics:
- accuracy
- f1
- recall
model_name: GGU-CLF
pipeline_tag: text-classification
widget:
- name: test1
  text: hello world
---

# Model Card for GGU-CLF

<!-- Provide a quick summary of what the model is/does. -->

<!-- Provide a longer summary of what this model is. -->

This is a simple classification model trained on a custom dataset.

Please note that, although this model is implemented in the `transformers` library, it is not a usual transformer.
It combines the underlying embedding model with the required tokenizer into a simple-to-use pipeline for sequence classification.

It is used to classify user text into the following classes:

- 0: Greeting
- 1: Gratitude
- 2: Unknown

**Note**: To use this model, please keep the following in mind:

1. The model is an XLMRoberta model based on [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3).
2. The required tokenizer is baked into the classifier implementation.

- **Developed by:** [philipp-zettl](https://huggingface.co/philipp-zettl/)
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** multilingual
- **License:** mit
- **Finetuned from model [optional]:** BAAI/bge-m3

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** [philipp-zettl/GGU-CLF](https://huggingface.co/philipp-zettl/GGU-CLF)
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

Use this model to classify messages from natural language chats.

### Downstream Use [optional]

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

The model was not trained on multi-sentence samples. **You should avoid those.** If you must handle longer messages, split them into sentences first, as sketched below.

Officially tested and supported languages are **English and German**; any other language is considered out of scope.

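One way to respect the single-sentence constraint is to split longer messages and classify each sentence on its own. A minimal sketch; the naive regex splitter below is an illustration, not part of the model:

```python
import re
import torch
from transformers import AutoModel

def split_sentences(text: str) -> list[str]:
    # Naive punctuation-based splitting; use a proper sentence
    # segmenter for anything beyond a quick experiment.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

model = AutoModel.from_pretrained("philipp-zettl/GGU-CLF").to(torch.float16).to('cuda')
text = "Hi, wie geht's? Danke für deine Hilfe!"
predictions = model(split_sentences(text)).argmax(dim=1)  # one class index per sentence
```
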
## Bias, Risks, and Limitations

Use the code below to get started with the model.

```python
import torch
from transformers import AutoModel

# The tokenizer is baked in, so the model accepts raw strings directly.
model = AutoModel.from_pretrained("philipp-zettl/GGU-CLF").to(torch.float16).to('cuda')

model([
    'Hi wie gehts?',
    'Dannke dir mein freund!',
    'Merci freundchen, send mir mal ein paar Machine Learning jobs.',
    'Works as expected, cheers!',
    'How you doin my boy',
    'send me immediately some matching jobs, thanks',
    "wer's eigentlich tom selleck?",
    'sprichst du deutsch?',
    'sprechen sie deutsch sie hurensohn?',
    'vergeltsgott',
    'heidenei dank dir recht herzlich',
    'grazie mille bambino, come estas'
]).argmax(dim=1)
```

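The returned tensor holds one class index per input. A small sketch, assuming the label order documented above, to map indices back to readable names:

```python
labels = ['greeting', 'gratitude', 'unknown']  # class indices 0, 1, 2 as documented above
preds = model(['Hi wie gehts?', 'Danke dir!']).argmax(dim=1)
print([labels[i] for i in preds.tolist()])  # e.g. ['greeting', 'gratitude']
```
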

## Training Details

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

This model was trained on the [philipp-zettl/GGU-xx](https://huggingface.co/datasets/philipp-zettl/GGU-xx) dataset.

You can find its performance metrics under [Evaluation Results](#evaluation-results).

### Training Procedure

#### Preprocessing [optional]

The following code was used to load the dataset and split it into training and validation sets.

```python
from datasets import load_dataset
from sklearn.model_selection import train_test_split


class Dataset:
    """Lightweight container pairing samples with their labels."""
    def __init__(self, dataset, target_names=None):
        self.data = list(map(lambda x: x[0], dataset))
        self.target = list(map(lambda x: x[1], dataset))
        self.target_names = target_names


ds = load_dataset('philipp-zettl/GGU-xx')
data = Dataset([[e['sample'], e['label']] for e in ds['train']], ['greeting', 'gratitude', 'unknown'])
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
```

#### Training Hyperparameters

<!-- Relevant interpretability work for the model goes here -->

You can find the initial implementation of the classification model here:

```python
from transformers import PreTrainedModel, PretrainedConfig, AutoModel, AutoTokenizer
import torch
import torch.nn as nn


class EmbeddingClassifierConfig(PretrainedConfig):
    model_type = 'xlm-roberta'

    def __init__(self, num_classes=3, base_model='BAAI/bge-m3', tokenizer='BAAI/bge-m3', dropout=0.0, l2_reg=0.01, torch_dtype=torch.float16, **kwargs):
        self.num_classes = num_classes
        self.base_model = base_model
        self.tokenizer = tokenizer
        self.dropout = dropout
        self.l2_reg = l2_reg
        self.torch_dtype = torch_dtype
        super().__init__(**kwargs)


class EmbeddingClassifier(PreTrainedModel):
    config_class = EmbeddingClassifierConfig

    def __init__(self, config):
        super().__init__(config)
        base_model = config.base_model
        tokenizer = config.tokenizer

        # Resolve model and tokenizer names into instances.
        if base_model is None or isinstance(base_model, str):
            base_model = AutoModel.from_pretrained(base_model)
        if tokenizer is None or isinstance(tokenizer, str):
            tokenizer = AutoTokenizer.from_pretrained(tokenizer)

        self.tokenizer = tokenizer
        self.base = base_model
        self.fc = nn.Linear(base_model.config.hidden_size, config.num_classes)
        self.do = nn.Dropout(config.dropout)
        self.l2_reg = config.l2_reg

        self.to(config.torch_dtype)

    def forward(self, X):
        # The tokenizer is part of the model: raw strings in, logits out.
        encoding = self.tokenizer(
            X, return_tensors='pt',
            padding=True, truncation=True
        ).to(self.device)
        input_ids = encoding['input_ids']
        attention_mask = encoding['attention_mask']
        # Use the [CLS] position of the last hidden state as the sentence embedding.
        emb = self.base(
            input_ids,
            attention_mask=attention_mask,
            return_dict=True,
            output_hidden_states=True
        ).last_hidden_state[:, 0, :]
        return self.fc(self.do(emb))

    def train(self, set_val=True):
        # Keep the embedding backbone frozen; only the classification head trains.
        self.base.train(False)
        for param in self.base.parameters():
            param.requires_grad = False
        for param in self.fc.parameters():
            param.requires_grad = set_val

    def get_l2_loss(self):
        # L2 regularization over the trainable parameters only.
        l2_loss = torch.tensor(0.).to(self.device)
        for param in self.parameters():
            if param.requires_grad:
                l2_loss += torch.norm(param, 2)
        return self.l2_reg * l2_loss
```

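For reference, a minimal sketch of how the classification head could be fine-tuned with this class, reusing `X_train`/`y_train` from the preprocessing snippet above. The optimizer, learning rate, batch size, and epoch count are illustrative assumptions, not the original training configuration:

```python
from torch.utils.data import DataLoader

# Assumed fine-tuning setup; relies on the train() and get_l2_loss()
# helpers defined in the class above.
model = EmbeddingClassifier(EmbeddingClassifierConfig()).to('cuda')
model.train(True)  # freezes the backbone, unfreezes the classification head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)  # assumed settings
criterion = nn.CrossEntropyLoss()

loader = DataLoader(list(zip(X_train, y_train)), batch_size=16, shuffle=True)
for epoch in range(3):  # illustrative epoch count
    for texts, labels in loader:
        optimizer.zero_grad()
        logits = model(list(texts))
        loss = criterion(logits, labels.to('cuda')) + model.get_l2_loss()
        loss.backward()
        optimizer.step()
```
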

## Environmental Impact