AnnaWegmann committed
Commit bd054e7
1 Parent(s): 38b9330

Update README.md

Files changed (1)
  1. README.md +143 -1
README.md CHANGED
base_model: microsoft/deberta-v3-large
---

The model was created as described in https://arxiv.org/abs/2404.06670; it is the best `DeBERTa AGGREGATED` model from the paper. See also the accompanying [GitHub repository](https://github.com/nlpsoc/Paraphrases-in-News-Interviews).
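
The usage example further below looks up the positive class through the generic label name `LABEL_1` and uppercases a word when its probability for that label is at least 0.5. If you want to confirm the label mapping first, a minimal check (only the checkpoint name from this card is assumed) is:

```python
from transformers import AutoModelForTokenClassification

# Minimal sanity check of the label mapping; the example below assumes
# LABEL_1 marks tokens that are part of a paraphrase.
model = AutoModelForTokenClassification.from_pretrained(
    "AnnaWegmann/Highlight-Paraphrases-in-Dialog"
)
print(model.config.id2label)  # e.g. {0: 'LABEL_0', 1: 'LABEL_1'}
print(model.config.label2id)  # e.g. {'LABEL_0': 0, 'LABEL_1': 1}
```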

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

class ParaphraseHighlighter:
    def __init__(self, model_name="AnnaWegmann/Highlight-Paraphrases-in-Dialog"):
        # Load the tokenizer and model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)

        # Get the label id for 'LABEL_1'
        self.label2id = self.model.config.label2id
        self.label_id = self.label2id['LABEL_1']

    def highlight_paraphrase(self, text1, text2):
        # Tokenize the two texts as a pair; the tokenizer inserts the [SEP] token automatically
        encoding = self.tokenizer(text1, text2, return_tensors="pt", padding=True, truncation=True)

        with torch.no_grad():
            outputs = self.model(**encoding)
        logits = outputs.logits  # Shape: (batch_size, sequence_length, num_labels)
        # Apply softmax to get per-token label probabilities
        probs = torch.nn.functional.softmax(logits, dim=-1)  # Shape: (batch_size, sequence_length, num_labels)

        # Convert token IDs back to tokens
        tokens = self.tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
        # Get word IDs to map tokens to words
        word_ids = encoding.word_ids(batch_index=0)
        # Get sequence IDs to know which text the token belongs to
        sequence_ids = encoding.sequence_ids(batch_index=0)

        # Collect words and probabilities for each text
        words_text1 = []
        words_text2 = []
        probs_text1 = []
        probs_text2 = []

        previous_word_idx = None

        # For determining whether there are high-probability words in both texts
        has_high_prob_text1 = False
        has_high_prob_text2 = False

        for idx, (word_idx, seq_id) in enumerate(zip(word_ids, sequence_ids)):
            if word_idx is None:
                # Skip special tokens like [CLS], [SEP], [PAD]
                continue

            if word_idx != previous_word_idx:
                # Start of a new word
                word_tokens = [tokens[idx]]

                # Get the probability of LABEL_1 for the first token of the word
                prob_LABEL_1 = probs[0][idx][self.label_id].item()

                # Collect subsequent tokens belonging to the same word
                j = idx + 1
                while j < len(word_ids) and word_ids[j] == word_idx:
                    word_tokens.append(tokens[j])
                    j += 1

                # Reconstruct the word
                word = self.tokenizer.convert_tokens_to_string(word_tokens).strip()

                # Uppercase the word if its probability is >= 0.5
                if prob_LABEL_1 >= 0.5:
                    word_display = word.upper()
                    if seq_id == 0:
                        has_high_prob_text1 = True
                    elif seq_id == 1:
                        has_high_prob_text2 = True
                else:
                    word_display = word

                # Append the word and probability to the appropriate list
                if seq_id == 0:
                    words_text1.append(word_display)
                    probs_text1.append(prob_LABEL_1)
                elif seq_id == 1:
                    words_text2.append(word_display)
                    probs_text2.append(prob_LABEL_1)

            previous_word_idx = word_idx

        # The pair counts as a paraphrase if both texts contain words with prob >= 0.5
        if has_high_prob_text1 and has_high_prob_text2:
            print("is a paraphrase")
        else:
            print("is not a paraphrase")

        # Helper to format and align words and probabilities
        def print_aligned(words, probs):
            # Determine the maximum word length for formatting
            max_word_length = max(len(word) for word in words)
            # Create format string for alignment
            format_str = f'{{:<{max_word_length}}}'
            # Print words
            for word in words:
                print(format_str.format(word), end=' ')
            print()
            # Print probabilities aligned below the words
            for prob in probs:
                prob_str = f"{prob:.2f}"
                print(format_str.format(prob_str), end=' ')
            print('\n')

        # Print text1's words and probabilities aligned
        print("\nSpeaker 1:")
        print_aligned(words_text1, probs_text1)

        # Print text2's words and probabilities aligned
        print("Speaker 2:")
        print_aligned(words_text2, probs_text2)

# Example usage
highlighter = ParaphraseHighlighter()
text1 = "And it will be my 20th time in doing it as a television commentator from Rome so."
text2 = "Yes, you've been doing this for a while now."
highlighter.highlight_paraphrase(text1, text2)
```

Running this example should print:

```
is a paraphrase

Speaker 1:
And IT will BE MY 20TH TIME IN DOING IT as a TELEVISION COMMENTATOR from Rome so.
0.15 0.54 0.49 0.56 0.74 0.83 0.77 0.75 0.78 0.76 0.44 0.45 0.52 0.52 0.30 0.37 0.21

Speaker 2:
Yes, YOU'VE BEEN DOING THIS FOR A WHILE NOW.
0.12 0.79 0.78 0.82 0.82 0.69 0.70 0.72 0.66
```
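
If you only need a pair-level decision rather than the aligned printout, the same thresholding rule can be wrapped in a small helper. The sketch below is not part of the original example (the `is_paraphrase` name is made up here); it works on token-level scores, which is slightly simpler than the word-level aggregation in the class above:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

def is_paraphrase(text1, text2,
                  model_name="AnnaWegmann/Highlight-Paraphrases-in-Dialog",
                  threshold=0.5):
    """Hypothetical helper: True if both texts contain at least one token
    scored >= threshold for LABEL_1 (the highlighting rule used above)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    model = AutoModelForTokenClassification.from_pretrained(model_name)
    label_id = model.config.label2id["LABEL_1"]

    encoding = tokenizer(text1, text2, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # LABEL_1 probability for every token in the pair
        probs = torch.softmax(model(**encoding).logits, dim=-1)[0, :, label_id]

    sequence_ids = encoding.sequence_ids(0)  # 0 = text1, 1 = text2, None = special tokens
    max_text1 = max(p for p, s in zip(probs.tolist(), sequence_ids) if s == 0)
    max_text2 = max(p for p, s in zip(probs.tolist(), sequence_ids) if s == 1)
    return max_text1 >= threshold and max_text2 >= threshold

print(is_paraphrase(
    "And it will be my 20th time in doing it as a television commentator from Rome so.",
    "Yes, you've been doing this for a while now.",
))  # expected: True, matching the output above
```

Note that this sketch reloads the tokenizer and model on every call; for repeated use, load them once and reuse them, as the class above does.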

For comments or questions, reach out to Anna (a.m.wegmann @ uu.nl) or raise an issue on GitHub.

If you find this model helpful, consider citing our paper:
```
@article{wegmann2024,
  title={What's Mine becomes Yours: Defining, Annotating and Detecting Context-Dependent Paraphrases in News Interview Dialogs},