- terms of services
- bert
---

# TOSBert

**TOSBert** is a fine-tuned BERT model for multi-label sequence classification. It is trained to classify clauses in Terms of Service (ToS) documents into categories related to legal and privacy terms.

## Model Details

- **Model Name**: TOSBert
- **Model Architecture**: BERT
- **Framework**: [Hugging Face Transformers](https://huggingface.co/transformers/)
- **Model Type**: Sequence Classification (Multi-label Classification)

## Dataset

The model is trained on the [online_terms_of_service](https://huggingface.co/datasets/joelniklaus/online_terms_of_service) dataset hosted on Hugging Face. This dataset consists of text sequences extracted from various online Terms of Service documents; each sequence is labeled with multiple categories related to legal and privacy terms.
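
To inspect the data yourself, a minimal sketch using the `datasets` library (this library is an assumption and not required by the model itself; if the dataset requires a configuration name, `load_dataset` will list the available ones, and column names may differ, so check a sample before building the label matrix):

```python
# Sketch: load and inspect the training data (requires `pip install datasets`).
from datasets import load_dataset

dataset = load_dataset("joelniklaus/online_terms_of_service")

print(dataset)              # available splits and columns
print(dataset["train"][0])  # one labeled clause
```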

## Training

The model was fine-tuned with the following parameters:

- **Number of Epochs**: 3
- **Batch Size**: 16 (for both training and evaluation)
- **Warmup Steps**: 500
- **Weight Decay**: 0.01
- **Learning Rate**: left at the `Trainer` default (5e-5, linear decay after warmup)

## Usage

### Installation

To use this model, install the `transformers` library from Hugging Face (the examples below also rely on PyTorch):

```bash
pip install transformers torch
```

### Loading the Model

You can load the model using the following code:

```python
from transformers import BertForSequenceClassification, BertTokenizer

model_name = "CodeHima/TOSBert"
model = BertForSequenceClassification.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)
```
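
If you prefer not to use a pipeline, the loaded model and tokenizer can be called directly. A minimal sketch, applying a sigmoid because each label is scored independently in a multi-label setup (label names are read from `model.config.id2label`):

```python
import torch

text = "Your input text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Sigmoid rather than softmax: each label gets its own independent probability.
probs = torch.sigmoid(logits)[0]
for idx, p in enumerate(probs):
    print(model.config.id2label[idx], round(float(p), 3))
```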

### Inference

Here is an example of how to use the model for inference:

```python
from transformers import pipeline

# top_k=None returns the score for every label
# (it replaces the deprecated return_all_scores=True).
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, top_k=None)

text = "Your input text here"
predictions = classifier(text)

print(predictions)
```
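
Because this is a multi-label model, you typically keep every label whose score clears a threshold rather than just the top one. A short post-processing sketch (the 0.5 cut-off is an assumption; tune it on validation data):

```python
threshold = 0.5

# For a single string the pipeline may nest the per-label scores one level deep,
# so unwrap defensively before filtering.
scores = predictions[0] if isinstance(predictions[0], list) else predictions
selected = [p["label"] for p in scores if p["score"] >= threshold]
print(selected)
```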

### Training Script

Below is an example of the script used for training the model:

```python
from transformers import Trainer, TrainingArguments, BertForSequenceClassification, BertTokenizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np
import torch

# Define the model. problem_type selects BCEWithLogitsLoss, which multi-label training needs.
model_name = "bert-base-uncased"
model = BertForSequenceClassification.from_pretrained(
    model_name, num_labels=3, problem_type="multi_label_classification"
)

# Define the tokenizer
tokenizer = BertTokenizer.from_pretrained(model_name)

# Load your dataset
# train_dataset and eval_dataset should be instances of torch.utils.data.Dataset
# Example: train_dataset = YourDataset(train_data)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    eval_strategy="epoch"  # called evaluation_strategy on older transformers releases
)

# Custom data collator to convert labels to floats (BCEWithLogitsLoss expects float targets)
def data_collator(features):
    batch = {}
    first = features[0]
    if 'label' in first and first['label'] is not None:
        batch['labels'] = torch.tensor([f['label'] for f in features], dtype=torch.float32)
    for k, v in first.items():
        if k != 'label' and v is not None and not isinstance(v, str):
            batch[k] = torch.stack([f[k] for f in features])
    return batch

# Define the compute metrics function
def compute_metrics(pred):
    labels = pred.label_ids
    # The predictions are raw logits, so apply a sigmoid before thresholding.
    probs = 1 / (1 + np.exp(-pred.predictions))
    preds = (probs > 0.5).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='micro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

# Train the model
trainer.train()
```
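
The script above references a placeholder `YourDataset`. A minimal sketch of what such a multi-label dataset wrapper might look like (the class name, fields, and `max_length` are assumptions, not the original implementation):

```python
import torch
from torch.utils.data import Dataset

class ToSClauseDataset(Dataset):
    """Illustrative wrapper: `texts` is a list of clause strings and `labels`
    a list of 0/1 vectors, one entry per category."""

    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.encodings = tokenizer(
            texts, truncation=True, padding="max_length",
            max_length=max_length, return_tensors="pt"
        )
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        # Left as a plain list here; the data_collator above converts it to float32.
        item["label"] = self.labels[idx]
        return item
```

With a wrapper like this, `train_dataset = ToSClauseDataset(train_texts, train_labels, tokenizer)` slots directly into the `Trainer` call above.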

## Evaluation

To evaluate the model on the validation set, you can use the following code:

```python
results = trainer.evaluate()
print(results)
```
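
For a per-label breakdown rather than the aggregated metrics, one option (a sketch assuming the `Trainer` and the 0.5 threshold from above) is to run `trainer.predict` and feed the binarized outputs to scikit-learn:

```python
import numpy as np
from sklearn.metrics import classification_report

pred = trainer.predict(eval_dataset)
probs = 1 / (1 + np.exp(-pred.predictions))  # sigmoid over the raw logits
preds = (probs > 0.5).astype(int)

print(classification_report(pred.label_ids, preds, zero_division=0))
```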

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for more details.

## Citation

If you use this model in your research, please cite it as follows:

```bibtex
@misc{TOSBert,
  author       = {Himanshu Mohanty},
  title        = {TOSBert: Fine-tuned BERT model for multi-label classification},
  year         = {2024},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/CodeHima/TOSBert}}
}
```

## Acknowledgements

This project uses the [Hugging Face Transformers](https://huggingface.co/transformers/) library. Special thanks to the developers and contributors of this library.