thesephist committed on
Commit 02295d2 · 1 Parent(s): 3761eee

Update README.md

  1. README.md +116 -0
README.md CHANGED
@@ -1,3 +1,119 @@
  ---
  license: mit
  ---
  ---
  license: mit
+ datasets:
+ - wikipedia
+ language:
+ - en
  ---
+
+ # Bottleneck T5 ⏳
+
+ The Bottleneck T5 model powers many of my experiments and demos exploring interfaces for inspecting and editing text in latent space. This model is an autoencoder for text; it's able to encode text up to 512 tokens into an embedding, then reconstruct the original text from the embedding. The structure of the embedding space produced by this model also allows for semantic edits to text through vector arithmetic in latent space.
+
+ ## Model Details
+
+ Using embeddings produced by this model, we can semantically interpolate between pieces of text and edit sentences using their latent attributes like length, tone, structure, or topic.
+
+ All Bottleneck T5 models are trained on a filtered subset of the English Wikipedia and perform best at encoding and decoding encyclopedic and other similar kinds of text. Text that's heavily technical, conversational, or otherwise unconventional may be out of distribution for the model, and the model may not perform as well on such inputs.
+
+ Bottleneck T5 embeddings are always normalized to unit length (norm 1): the encoder produces unit-length embeddings, and any input to the decoder is normalized to unit length as well.
+
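To make the normalization concrete, here is a minimal, dependency-free sketch. It assumes "length 1" means unit Euclidean (L2) norm, and it uses plain Python lists in place of the tensors the model actually returns. The `normalize` helper is hypothetical and not part of the model's code.

```python
import math

def normalize(vector):
    """Scale a vector to unit L2 norm (assumed meaning of "length 1")."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

latent = normalize([3.0, 4.0])
print(latent)  # [0.6, 0.8]
```

One consequence is that only the direction of a vector matters to the decoder: scaling a latent by any positive constant should not change what is decoded.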
+ - **Developed by:** [Linus Lee](https://thesephist.com/)
+ - **Model type:** T5-style encoder-decoder transformer with an attention pooled bottleneck and gated cross-attention
+ - **Language(s) (NLP):** English
+ - **License:** MIT
+ - **Finetuned from model:** LM-adapted T5 v1.1
+
+ ## Using the model
+
+ The model is currently in a prototype state implemented on top of the T5 language model, so we need a small wrapper class around it to use it for embedding and generating text:
+
+ ```py
+ import os
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ from tqdm import tqdm
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ class BottleneckT5Autoencoder:
+     def __init__(self, model_path: str, device='cpu'):
+         self.device = device
+         self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
+         self.model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(self.device)
+         self.model.eval()
+
+     @torch.no_grad()
+     def embed(self, text: str) -> torch.FloatTensor:
+         inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
+         decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
+         return self.model(
+             **inputs,
+             decoder_input_ids=decoder_inputs['input_ids'],
+             encode_only=True,
+         )[0]
+
+     @torch.no_grad()
+     def generate_from_latent(self, latent: torch.FloatTensor, max_length=512, temperature=1.0) -> str:
+         dummy_text = '.'
+         dummy = self.embed(dummy_text)
+         perturb_vector = latent - dummy
+         self.model.perturb_vector = perturb_vector
+         input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids
+         output = self.model.generate(
+             input_ids=input_ids,
+             max_length=max_length,
+             do_sample=True,
+             temperature=temperature,
+             top_p=0.9,
+             num_return_sequences=1,
+         )
+         return self.tokenizer.decode(output[0], skip_special_tokens=True)
+ ```
+
+ Then we can initialize the autoencoder from a model checkpoint.
+
+ ```py
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+ autoencoder = BottleneckT5Autoencoder(model_path='thesephist/contra-bottleneck-t5-large-wikipedia', device=device)
+ ```
+
+ Embed and un-embed text with `.embed(text: str)` and `.generate_from_latent(embedding: torch.FloatTensor)`.
+
+ ```py
+ texts = [
+     'The quick brown fox jumps over the lazy dog',
+     'Hi there! My name is Linus, and I spend a lot of my time thinking about latent spaces of neural network models.',
+     'Notion is a single space where you can think, write, and plan. Capture thoughts, manage projects, or even run an entire company — and do it exactly the way you want.',
+ ]
+
+ for t in texts:
+     embedding = autoencoder.embed(t)
+     reconstruction = autoencoder.generate_from_latent(embedding)
+     print(reconstruction)
+ ```
+
+ This produces text like the following (decoding is sampled, so exact outputs vary between runs):
+
+ ```
+ The quick brown fox jumps over the lazy dog
+ I'm named after Linus, and I spend a lot of my time thinking about neural networks of latent space models.
+ Notion is a single place where you can think, plan, and spend time. Capture ideas, manage projects, and even do your own writing — or plan it exactly the way you want.
+ ```
+
+ For more examples of how to use the model to compute interpolations and semantic edits with Contra, see [this Google Colab notebook](https://linus.zone/contra-colab).
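As a rough illustration of what interpolations and semantic edits mean as vector arithmetic, here is a small sketch on plain Python lists. It assumes linear blending followed by renormalization, which is one simple choice and not necessarily what the notebook does. The helpers `interpolate` and `edit`, and the idea of an attribute `direction` vector, are hypothetical names introduced for this example.

```python
import math

def normalize(vector):
    """Scale a vector to unit L2 norm, matching the model's unit-length latents."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

def interpolate(a, b, t):
    """Blend two latents: t=0 returns a, t=1 returns b (both renormalized)."""
    mixed = [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    return normalize(mixed)

def edit(latent, direction, strength):
    """Shift a latent along an attribute direction, then renormalize."""
    moved = [x + strength * d for x, d in zip(latent, direction)]
    return normalize(moved)

a = [1.0, 0.0]
b = [0.0, 1.0]
midpoint = interpolate(a, b, 0.5)  # roughly [0.707, 0.707]
shifted = edit(a, direction=b, strength=0.25)
print(midpoint, shifted)
```

With the wrapper class above, the same arithmetic would be applied to the tensors returned by `autoencoder.embed(...)`, and the result passed to `autoencoder.generate_from_latent(...)`, assuming those embeddings behave like ordinary torch tensors.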
+
+ ## Training Details
+
+ Contra was initialized from the [language modeling-adapted T5 v1.1 checkpoint](https://huggingface.co/models?other=t5-lm-adapt) and trained for a single epoch on a length-filtered subset of the English [Wikipedia](https://huggingface.co/datasets/wikipedia) dataset. It was trained as a denoising autoencoder with 30% of tokens randomly masked, using the Adafactor optimizer.
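The denoising setup can be pictured with a toy corruption function: the encoder sees a masked copy of the text, and the training target is the original. This sketch assumes each token is masked independently with probability 0.3 and replaced by a placeholder `<mask>` string. The real training code may use a different masking scheme or different mask tokens, and it operates on tokenizer IDs rather than whitespace-split words.

```python
import random

def mask_tokens(tokens, mask_rate=0.3, mask_token="<mask>", seed=None):
    """Replace each token with mask_token with probability mask_rate (assumed scheme)."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < mask_rate else tok for tok in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted = mask_tokens(tokens, seed=0)
print(corrupted)  # input to the encoder; the training target is still `tokens`
```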
+
+ ### Model family and checkpoints
+
+ I recommend experimenting first with `thesephist/contra-bottleneck-t5-large-wikipedia`, which strikes a good balance between model size and output quality, but I've trained four variants ranging from 60M to 3B parameters:
+
+ - [thesephist/contra-bottleneck-t5-small-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-small-wikipedia): 60M params, 512 embedding dimensions
+ - [thesephist/contra-bottleneck-t5-base-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-base-wikipedia): 220M params, 768 embedding dimensions
+ - [thesephist/contra-bottleneck-t5-large-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-large-wikipedia): 770M params, 1024 embedding dimensions
+ - [thesephist/contra-bottleneck-t5-xl-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-xl-wikipedia): 3B params, 2048 embedding dimensions
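Because each checkpoint uses a different embedding size, latents from one variant cannot be fed to another variant's decoder. A small lookup like the sketch below can catch that mistake early. The dimensions are copied from the list above, and the names `EMBEDDING_DIMS` and `check_latent_size` are hypothetical helpers, not part of the released code.

```python
# Embedding dimensions per checkpoint, as listed in this README.
EMBEDDING_DIMS = {
    "thesephist/contra-bottleneck-t5-small-wikipedia": 512,
    "thesephist/contra-bottleneck-t5-base-wikipedia": 768,
    "thesephist/contra-bottleneck-t5-large-wikipedia": 1024,
    "thesephist/contra-bottleneck-t5-xl-wikipedia": 2048,
}

def check_latent_size(model_path, latent):
    """Raise if a latent vector does not match the checkpoint's embedding size."""
    expected = EMBEDDING_DIMS[model_path]
    if len(latent) != expected:
        raise ValueError(
            f"{model_path} expects {expected} dimensions, got {len(latent)}"
        )
    return expected

check_latent_size("thesephist/contra-bottleneck-t5-large-wikipedia", [0.0] * 1024)
```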