Commit ebeb946
Noelia Ferruz committed
1 Parent(s): 559ef80

Fixed typo endoftag -> endoftext
Files changed (1): README.md +1 -1
README.md CHANGED
@@ -35,7 +35,7 @@ Example 1: Generating de novo proteins in a zero-shot fashion. We recommend the
  {'generated_text': 'M\nRRAVGNADLGMEAARYEPSGAYQASEGDGAHGKPHSLPFVALERWQQLGPEERTLAEAVR\nAVLASGQYLLGEAVRRFETAVAAWLGVPFALGVASGTAALTLALRAYGVGPGDEVIVPAI\nTFIATSNAITAAGARPVLVDIDPSTWNMSVASLAARLTPKTKAILAVHLWGQPVDMHPLL\nDIAAQANLAVIEDCAQALGASIAGTKVGTFGDAAAFSFYPTKNMTTGEGGMLVTNARDLA\nQAARMLRSHGQDPPTAYMHSQVGFN'}
  ```
 
- Example 2: Finetuning on a set of user-defined sequences. This example finetunes on user-defined training and validation files that contain a set of sequences of interest. To create the training and validation files, it is necessary to (1) substitute the FASTA headers for each sequence with the tag "<|endoftag|>" and (2) split the originating dataset into training and validation files (this is often done with a ratio of 90/10, 80/20, or 95/5). Here we show a learning rate of 1e-06, but it should ideally be optimized in separate runs. After training, the finetuned model will be stored in the ./output folder. This model can be used as in the example above to generate tailored sequences.
+ Example 2: Finetuning on a set of user-defined sequences. This example finetunes on user-defined training and validation files that contain a set of sequences of interest. To create the training and validation files, it is necessary to (1) substitute the FASTA headers for each sequence with the tag "<|endoftext|>" and (2) split the originating dataset into training and validation files (this is often done with a ratio of 90/10, 80/20, or 95/5). Here we show a learning rate of 1e-06, but it should ideally be optimized in separate runs. After training, the finetuned model will be stored in the ./output folder. This model can be used as in the example above to generate tailored sequences.
 
  The HuggingFace script can be found here: https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py
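
Steps (1) and (2) of the finetuning recipe in Example 2 above can be scripted in a few lines. A minimal sketch, assuming an input file named sequences.fasta and a 90/10 split (the filenames and the ratio are illustrative, not fixed by the README):

```
import random

# Read a FASTA file and return the sequences with headers dropped.
# The original line wrapping inside each record is preserved, since
# the generated_text shown above also contains newlines in sequences.
def read_fasta(path):
    sequences, current = [], []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):        # FASTA header: start a new record
                if current:
                    sequences.append("\n".join(current))
                    current = []
            elif line:
                current.append(line)
        if current:
            sequences.append("\n".join(current))
    return sequences

sequences = read_fasta("sequences.fasta")   # hypothetical input file
random.shuffle(sequences)

split = int(0.9 * len(sequences))           # 90/10 split, as suggested above

# Step (1): each FASTA header is replaced by the "<|endoftext|>" tag.
def write_split(path, records):
    with open(path, "w") as out:
        for seq in records:
            out.write("<|endoftext|>\n" + seq + "\n")

# Step (2): split into training and validation files.
write_split("training.txt", sequences[:split])
write_split("validation.txt", sequences[split:])
```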
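The training run itself goes through the run_clm.py script linked above. A hedged sketch of one way to launch it from Python, assuming the files produced by the previous snippet; only the 1e-06 learning rate and the ./output folder come from the text above, while the nferruz/ProtGPT2 starting checkpoint, the epoch count, and the other flag values are assumptions to be tuned per dataset:

```
import subprocess

# Launch the HuggingFace causal-LM finetuning script (run_clm.py, linked above).
subprocess.run([
    "python", "run_clm.py",
    "--model_name_or_path", "nferruz/ProtGPT2",  # assumed starting checkpoint
    "--train_file", "training.txt",
    "--validation_file", "validation.txt",
    "--do_train",
    "--do_eval",
    "--learning_rate", "1e-06",                  # value shown in the README
    "--num_train_epochs", "4",                   # illustrative; tune in separate runs
    "--output_dir", "./output",                  # finetuned model lands here
], check=True)
```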
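Once run_clm.py finishes, the checkpoint in ./output can be loaded the same way as the pretrained model in Example 1. A minimal sketch; the "M" prompt and the sampling parameters here are illustrative assumptions, not values from this README:

```
from transformers import pipeline

# Load the finetuned checkpoint written by run_clm.py.
protgpt2 = pipeline("text-generation", model="./output")

# Sample tailored sequences; "M" seeds generation, matching the
# methionine start of the generated_text shown in Example 1.
sequences = protgpt2("M", max_length=100, do_sample=True, num_return_sequences=5)
for seq in sequences:
    print(seq["generated_text"])
```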