---
tags:
- biology
- ibm
- mammal
- pytorch
- transformers
library_name: mammal
license: apache-2.0
---

## Model Summary

**MAMMAL (Molecular Aligned Multi-Modal Architecture and Language)** is a versatile multi-task foundation model that learns from large-scale biological datasets (over 2 billion samples) across diverse modalities, including proteins, small molecules, and genes. We introduce a query syntax that supports a wide range of tasks, such as classification, regression, and generation, by combining different modalities and entity types as inputs and/or outputs.

- **Developers:** IBM Research
- **GitHub Repository:** [TBD](TBD)
- **Paper:** [TBD](https://arxiv.org/abs/TBD)
- **Release Date:** Oct ?th, 2024
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Usage

Using `MAMMAL` requires [TBD](https://github.com/TBD):

```
pip install TBD
```

A simple example:

```python
import torch
from fuse.data.tokenizers.modular_tokenizer.op import ModularTokenizerOp
from mammal.model import Mammal
from mammal.keys import (
    CLS_PRED,
    ENCODER_INPUTS_ATTENTION_MASK,
    ENCODER_INPUTS_STR,
    ENCODER_INPUTS_TOKENS,
)

# Load the pretrained model
model = Mammal.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m")

# Load the matching tokenizer
tokenizer_op = ModularTokenizerOp.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m")

# Prepare the input sequences (amino-acid strings)
protein_calmodulin = "MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMISELDQDGFIDKEDLHDGDGKISFEEFLNLVNKEMTADVDGDGQVNYEEFVTMMTSK"
protein_calcineurin = "MSSKLLLAGLDIERVLAEKNFYKEWDTWIIEAMNVGDEEVDRIKEFKEDEIFEEAKTLGTAEMQEYKKQKLEEAIEGAFDIFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIRQMWDQNGDWDRIKELKFGEIKKLSAKDTRGTIFIKVFENLGTGVDSEYEDVSKYMLKHQ"

# Create the sample and format the prompt to match the pre-training syntax
sample_dict = dict()
sample_dict[ENCODER_INPUTS_STR] = f"<@TOKENIZER-TYPE=AA>{protein_calmodulin}{protein_calcineurin}"

# Tokenize: writes token IDs and the attention mask back into sample_dict
tokenizer_op(
    sample_dict=sample_dict,
    key_in=ENCODER_INPUTS_STR,
    key_out_tokens_ids=ENCODER_INPUTS_TOKENS,
    key_out_attention_mask=ENCODER_INPUTS_ATTENTION_MASK,
)

# Convert the tokenizer outputs to tensors
sample_dict[ENCODER_INPUTS_TOKENS] = torch.tensor(sample_dict[ENCODER_INPUTS_TOKENS])
sample_dict[ENCODER_INPUTS_ATTENTION_MASK] = torch.tensor(sample_dict[ENCODER_INPUTS_ATTENTION_MASK])

# Generate a prediction for a batch containing the single sample
batch_dict = model.generate(
    [sample_dict],
    output_scores=True,
    return_dict_in_generate=True,
    max_new_tokens=5,
)

# Decode the generated token IDs back into text
generated_output = tokenizer_op._tokenizer.decode(batch_dict[CLS_PRED][0])
print(f"{generated_output=}")
```

For more advanced usage, see our detailed example at:

## Citation

If you find our work useful, please consider starring the repository and citing our paper:

```
@article{TBD,
  title={TBD},
  author={IBM Research Team},
  journal={arXiv preprint arXiv:TBD},
  year={2024}
}
```
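
## Prompt Formatting Sketch

The usage example builds its encoder input by prefixing the concatenated protein sequences with the `<@TOKENIZER-TYPE=AA>` tag. As a minimal sketch of that one step, the helper below wraps it in plain Python so it can be checked without loading the model. The name `build_aa_prompt` and its input validation are assumptions made for illustration and are not part of the `mammal` package. Only the tag and the concatenation order are taken from the example above.

```python
# Hypothetical helper (not part of the mammal package): reproduces the
# prompt formatting used in the usage example above.

AA_TOKENIZER_TAG = "<@TOKENIZER-TYPE=AA>"  # tag taken from the usage example


def build_aa_prompt(*sequences: str) -> str:
    """Concatenate amino-acid sequences behind a single AA tokenizer tag."""
    if not sequences:
        raise ValueError("at least one sequence is required")

    cleaned = []
    for seq in sequences:
        seq = seq.strip()  # drop stray whitespace, e.g. from FASTA-style input
        # Assumption: sequences are plain alphabetic one-letter codes.
        if not seq or not seq.isalpha():
            raise ValueError(f"not a plain amino-acid string: {seq!r}")
        cleaned.append(seq)

    return AA_TOKENIZER_TAG + "".join(cleaned)


print(build_aa_prompt("MADQ", "MSSK"))  # prints <@TOKENIZER-TYPE=AA>MADQMSSK
```

The returned string can be assigned to `sample_dict[ENCODER_INPUTS_STR]` in place of the inline f-string. Rejecting non-alphabetic input early is a design choice of this sketch and may be stricter than what the tokenizer itself accepts.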