---
tags:
- antibody language model
- antibody
- protein language model
base_model: Exscientia/IgT5_unpaired
license: mit
---
# IgT5 model
Model pretrained on protein and antibody sequences using a masked language modeling (MLM) objective. It was introduced in the paper *Large scale paired antibody language models*. The model is fine-tuned from IgT5-unpaired using paired antibody sequences from the Observed Antibody Space.
## Use
The encoder part of the model and the tokeniser can be loaded using the `transformers` library:
```python
from transformers import T5EncoderModel, T5Tokenizer

tokeniser = T5Tokenizer.from_pretrained("Exscientia/IgT5", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Exscientia/IgT5")
```
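If the model is only used to extract embeddings (an assumption about the typical workflow, not a requirement of this card), it is common to put the encoder into evaluation mode:

```python
import torch

model.eval()  # disable dropout so embeddings are deterministic
```

The forward passes below can additionally be wrapped in `torch.no_grad()` to avoid building the autograd graph when gradients are not needed.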
The tokeniser is used to prepare batch inputs
```python
# heavy chain sequences
sequences_heavy = [
    "VQLAQSGSELRKPGASVKVSCDTSGHSFTSNAIHWVRQAPGQGLEWMGWINTDTGTPTYAQGFTGRFVFSLDTSARTAYLQISSLKADDTAVFYCARERDYSDYFFDYWGQGTLVTVSS",
    "QVQLVESGGGVVQPGRSLRLSCAASGFTFSNYAMYWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRFTISRDNSKNTLYLQMNSLRTEDTAVYYCASGSDYGDYLLVYWGQGTLVTVSS"
]

# light chain sequences
sequences_light = [
    "EVVMTQSPASLSVSPGERATLSCRARASLGISTDLAWYQQRPGQAPRLLIYGASTRATGIPARFSGSGSGTEFTLTISSLQSEDSAVYYCQQYSNWPLTFGGGTKVEIK",
    "ALTQPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSKRPSGVSNRFSGSKSGNTASLTISGLQSEDEADYYCNSLTSISTWVFGGGTKLTVL"
]

# the tokeniser expects input of the form ["V Q ... S S </s> E V ... I K", ...],
# i.e. space-separated residues with heavy and light chains joined by </s>
paired_sequences = []
for sequence_heavy, sequence_light in zip(sequences_heavy, sequences_light):
    paired_sequences.append(' '.join(sequence_heavy) + ' </s> ' + ' '.join(sequence_light))

tokens = tokeniser.batch_encode_plus(
    paired_sequences,
    add_special_tokens=True,
    padding=True,  # pad to the longest sequence in the batch
    return_tensors="pt",
    return_special_tokens_mask=True
)
```
Note that the tokeniser adds a `</s>` token at the end of each paired sequence and pads using the `<pad>` token. For example, a batch containing the sequences `V Q L </s> E V V` and `Q V </s> A L` will be tokenised to `V Q L </s> E V V </s>` and `Q V </s> A L </s> <pad> <pad>`.
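As an optional sanity check (not part of the original example), the batch can be decoded back to text to inspect the appended `</s>` and `<pad>` tokens:

```python
# decode the padded batch back to strings to see the added special tokens
print(tokeniser.batch_decode(tokens["input_ids"]))
```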
Sequence embeddings are generated by feeding tokens through the model
```python
output = model(
    input_ids=tokens['input_ids'],
    attention_mask=tokens['attention_mask']
)

residue_embeddings = output.last_hidden_state
```
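`residue_embeddings` holds one vector per token, including the special tokens, so the tensor has shape `(batch_size, sequence_length, hidden_size)`; a quick check:

```python
# (batch_size, sequence_length, hidden_size)
print(residue_embeddings.shape)
```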
To obtain a single representation per sequence, the residue embeddings can be averaged, excluding the special tokens:
```python
import torch

# mask special token embeddings before summing over the sequence dimension
residue_embeddings[tokens["special_tokens_mask"] == 1] = 0
sequence_embeddings_sum = residue_embeddings.sum(1)

# average embeddings by dividing the sum by the number of non-special tokens
sequence_lengths = torch.sum(tokens["special_tokens_mask"] == 0, dim=1)
sequence_embeddings = sequence_embeddings_sum / sequence_lengths.unsqueeze(1)
```
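The resulting embeddings can then be compared directly, for example with a cosine similarity between the two paired sequences in the batch (an illustrative usage example, not part of the original card):

```python
import torch.nn.functional as F

# cosine similarity between the embeddings of the two paired sequences
similarity = F.cosine_similarity(sequence_embeddings[0], sequence_embeddings[1], dim=0)
print(similarity.item())
```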