
	---
	license: mit
	tags:
	- generated_from_trainer
	datasets:
	- sagawa/pubchem-10m-canonicalized
	metrics:
	- accuracy
	model-index:
	- name: PubChem-10m-deberta
	  results:
	  - task:
	      name: Masked Language Modeling
	      type: fill-mask
	    dataset:
	      name: sagawa/pubchem-10m-canonicalized
	      type: sagawa/pubchem-10m-canonicalized
	    metrics:
	    - name: Accuracy
	      type: accuracy
	      value: 0.9741235263046233
	---


	# PubChem10m-deberta-base-output

	This model is a fine-tuned version of [microsoft/deberta-base](https://huggingface.co/microsoft/deberta-base) on the sagawa/pubchem-10m-canonicalized dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.0698
	- Accuracy: 0.9741

	## Model description

	We trained deberta-base on SMILES strings from PubChem with a masked language modeling (MLM) objective. The tokenizer is character-level and was trained on PubChem.
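
To make the objective concrete, here is a minimal sketch of masked language modeling on a character-level view of a SMILES string. It is illustrative only: the `[MASK]` symbol, the masking rate, and the one-character-per-token split are assumptions rather than details confirmed by this card, and the actual tokenizer and data collator may behave differently (for example, standard MLM collators also leave some selected tokens unchanged or swap in random tokens).

```python
import random

MASK = "[MASK]"  # assumed mask symbol; check the model's tokenizer for the real one


def mask_smiles(smiles, mask_prob=0.15, seed=0):
    """Split a SMILES string into characters and mask a random subset.

    Returns (tokens, labels): tokens holds the masked input, labels holds the
    original character at masked positions and None everywhere else.
    """
    rng = random.Random(seed)
    tokens, labels = [], []
    for ch in smiles:
        if rng.random() < mask_prob:
            tokens.append(MASK)
            labels.append(ch)  # the model is trained to recover this character
        else:
            tokens.append(ch)
            labels.append(None)  # unmasked positions do not contribute to the loss
    return tokens, labels


# Aspirin, masked at an exaggerated rate so the effect is easy to see.
tokens, labels = mask_smiles("CC(=O)Oc1ccccc1C(=O)O", mask_prob=0.3, seed=1)
print(" ".join(tokens))
```

The training loss is then the cross-entropy between the model's predictions at the masked positions and the characters stored in `labels`.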

	## Intended uses & limitations

	This model can be fine-tuned to predict molecular properties, reactions, or interactions with proteins, depending on how the fine-tuning task is set up.
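
As one possible starting point for property prediction, the sketch below loads the checkpoint with a freshly initialized regression head. The repository ID `sagawa/PubChem-10m-deberta` and the choice of a single-output regression head are assumptions for illustration; the helper name `load_for_property_regression` is hypothetical, and this is not necessarily how the authors fine-tune the model.

```python
MODEL_ID = "sagawa/PubChem-10m-deberta"  # assumed Hub repository ID for this checkpoint


def load_for_property_regression(model_id=MODEL_ID):
    """Load the pretrained encoder with a new, untrained regression head."""
    # Imported lazily so this module can be read without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # num_labels=1 gives the sequence-classification head a single output,
    # which transformers treats as a regression target.
    model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
    return tokenizer, model
```

Calling `load_for_property_regression()` downloads the weights from the Hub. The returned model still has to be fine-tuned on labeled data before its outputs mean anything, since the head starts from random weights.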

	## Training and evaluation data

	We downloaded [PubChem data](https://drive.google.com/file/d/1ygYs8dy1-vxD1Vx6Ux7ftrXwZctFjpV3/view), canonicalized the SMILES using RDKit, and dropped duplicates. The resulting dataset contains 9,999,960 entries, which were randomly split into training and validation sets at a ratio of 10:1.
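
The preparation steps described above can be sketched roughly as follows. This is an approximation, not the authors' script: the function names are hypothetical, the shuffle seed is an assumption (the card only reports a training seed), and unparseable SMILES are simply skipped here.

```python
import random


def canonicalize(smiles):
    """Return RDKit's canonical SMILES, or None if the string cannot be parsed."""
    from rdkit import Chem  # imported lazily; requires RDKit to be installed

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def dedupe_and_split(smiles_list, canonicalize_fn=canonicalize, ratio=10, seed=42):
    """Canonicalize, drop duplicates, and split train:validation = ratio:1."""
    unique = []
    seen = set()
    for smi in smiles_list:
        canon = canonicalize_fn(smi)
        if canon is not None and canon not in seen:
            seen.add(canon)
            unique.append(canon)
    # Shuffle with a fixed seed so the random split is reproducible.
    random.Random(seed).shuffle(unique)
    n_valid = len(unique) // (ratio + 1)
    return unique[n_valid:], unique[:n_valid]
```

For example, `train, valid = dedupe_and_split(["CCO", "OCC", "c1ccccc1"])` treats `CCO` and `OCC` as the same molecule after canonicalization. Applied to 9,999,960 entries, an exact 10:1 split would put 909,087 in validation and 9,090,873 in training; the actual counts in the published dataset may differ slightly.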

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-05
	- train_batch_size: 30
	- eval_batch_size: 48
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 10.0
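
For anyone trying to reproduce the run, the values above map onto `transformers.TrainingArguments` keyword names roughly as shown below. The `output_dir` value is a guess based on this card's title, and mapping `train_batch_size` to the per-device setting assumes a single device, which the card does not state.

```python
# Keyword arguments mirroring the hyperparameters listed above.
# Key names follow transformers.TrainingArguments.
training_kwargs = {
    "output_dir": "PubChem10m-deberta-base-output",  # assumed, taken from the card title
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 30,  # assumes training on a single device
    "per_device_eval_batch_size": 48,   # assumes evaluation on a single device
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 10.0,
}

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

With Transformers installed, these can be passed as `TrainingArguments(**training_kwargs)`.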

	### Training results

	| Training Loss | Epoch | Step   | Validation Loss | Accuracy |
	|:-------------:|:-----:|:------:|:---------------:|:--------:|
	| 0.0855        | 3.68  | 100000 | 0.0801          | 0.9708   |
	| 0.0733        | 7.37  | 200000 | 0.0702          | 0.9740   |
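
The validation losses can also be read as perplexities over the masked tokens. The short sketch below does the conversion, assuming the reported loss is a mean cross-entropy in natural-log units (the PyTorch default); the helper name `perplexity` is introduced here for illustration.

```python
import math

# Validation losses from the table above, plus the final evaluation loss
# reported at the top of this card.
eval_losses = {"step 100000": 0.0801, "step 200000": 0.0702, "final": 0.0698}


def perplexity(loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


for label, loss in eval_losses.items():
    print(f"{label}: loss={loss:.4f} perplexity={perplexity(loss):.4f}")
```

A perplexity close to 1 means the model is nearly certain about most masked characters, which is consistent with the reported accuracy of about 97%.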


	### Framework versions

	- Transformers 4.22.0.dev0
	- Pytorch 1.12.0
	- Datasets 2.4.1.dev0
	- Tokenizers 0.11.6