---
license: apache-2.0
datasets:
- nferruz/UR50_2021_04
tags:
- chemistry
- biology
---

### Model Description

This model card describes the distilled version of [ProtGPT2](https://huggingface.co./nferruz/ProtGPT2), referred to as `protgpt2-distilled-tiny`. The model was obtained by knowledge distillation from the larger teacher model into a smaller, more efficient student. The training objective combines a soft loss (knowledge-distillation loss against the teacher's temperature-scaled outputs) with a hard loss (cross-entropy against the ground-truth tokens), so the student not only generalizes like its teacher but also retains practical prediction capability on its own.

### Technical Details

**Distillation Parameters:**

- **Temperature (T):** 10
- **Alpha (α):** 0.1
- **Model Architecture:**
  - **Number of Layers:** 4
  - **Number of Attention Heads:** 4
  - **Embedding Size:** 512

**Dataset Used:**

- The model was distilled using a subset of the evaluation dataset provided by [nferruz/UR50_2021_04](https://huggingface.co./datasets/nferruz/UR50_2021_04).

**Loss Formulation:**

The two losses are combined in the standard form of Hinton et al. (2015):

$$
\mathcal{L} = \alpha \, \mathcal{L}_{\text{CE}}\left(y, \sigma(z_s)\right) + (1 - \alpha) \, T^{2} \, \mathrm{KL}\left(\sigma(z_t / T) \,\|\, \sigma(z_s / T)\right)
$$

where \(z_s\) and \(z_t\) are the student and teacher logits, \(\sigma\) is the softmax, \(T\) is the distillation temperature, and \(\alpha\) weights the hard cross-entropy term against the \(T^2\)-scaled soft distillation term.

### Performance

The distilled model, `protgpt2-distilled-tiny`, demonstrates a substantial increase in inference speed, running up to 6 times faster than the pretrained version. This assessment is based on \(n=5\) evaluation runs, which show that while the speed is significantly enhanced, the model still maintains perplexities comparable to the original.

![Evals](https://images.mobilism.org/?di=PYFQ1N5V)

### Usage

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextGenerationPipeline

# Load the model and tokenizer
model_name = "littleworth/protgpt2-distilled-tiny"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Initialize the pipeline (set device=-1 for CPU, or a GPU index such as 0)
text_generator = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=0)

# Generate sequences
generated_sequences = text_generator(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=10,
    pad_token_id=tokenizer.eos_token_id,  # set pad_token_id to eos_token_id
    eos_token_id=0,
    truncation=True,
)


def clean_sequence(text):
    # Remove the "<|endoftext|>" token
    text = text.replace("<|endoftext|>", "")
    # Drop newlines and any other non-alphabetical characters
    text = "".join(char for char in text if char.isalpha())
    return text


# Print the generated sequences in FASTA-like format
for i, seq in enumerate(generated_sequences):
    cleaned_text = clean_sequence(seq["generated_text"])
    print(f">Seq_{i}")
    print(cleaned_text)
```
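Since the performance claim above is stated in terms of perplexity, generated (or natural) sequences can be sanity-checked by scoring them with the model's own language-modeling loss. The snippet below is a minimal sketch; the `perplexity` helper and the example sequence are illustrative and not part of any released code.

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

model_name = "littleworth/protgpt2-distilled-tiny"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()


def perplexity(sequence: str) -> float:
    # Score the sequence with the LM head; the returned loss is the mean
    # negative log-likelihood per token, so exp(loss) is the perplexity.
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()


# Hypothetical example sequence, formatted the same way as the training data
seq = "<|endoftext|>MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYG"
print(f"Perplexity: {perplexity(seq):.2f}")
```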
### Use Cases

1. **High-Throughput Screening in Drug Discovery:** The distilled ProtGPT2 is well suited to rapid screening of mutation effects in protein sequences within pharmaceutical research. For example, it can quickly predict the stability of protein variants in large datasets, speeding up the identification of viable drug targets.
2. **Portable Diagnostics in Healthcare:** The model is small enough to run on handheld diagnostic devices that perform real-time protein analysis in clinical settings. For instance, it could be used to analyze blood samples for disease markers, providing immediate results to healthcare providers in remote areas.
3. **Interactive Learning Tools in Academia:** The distilled model can be integrated into educational software that lets biology students simulate and study the impact of genetic mutations on protein structures. This hands-on learning helps students understand protein dynamics without the need for high-end computational facilities.

### References

- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531.
- Ferruz, N., Schmidt, S., & Höcker, B. (2022). ProtGPT2 is a deep unsupervised language model for protein design. [Link to paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9329459/)