oeg/software_benchmark_multidomain

# Software Benchmark SCIBERT model

This model is a fine-tuned version of the SCIBERT model, trained on a dataset built from the SoMESCi and Softcite corpora.

The objective of this model is to extract software mentions from scientific texts in the BIO domain.

The training code can be found on GitHub.
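A minimal usage sketch, assuming the model can be loaded from the Hugging Face Hub under the identifier shown at the top of this card with the `transformers` token-classification pipeline. The helper names `load_tagger` and `extract_software_mentions` are illustrative and are not part of the released code.

```python
def load_tagger(model_id="oeg/software_benchmark_multidomain"):
    """Load the model as a token-classification pipeline (downloads the weights)."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entity spans.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def extract_software_mentions(text, tagger):
    """Return the surface strings of the entity spans predicted for `text`."""
    # Each span is a dict with character offsets under "start" and "end".
    return [text[span["start"]:span["end"]] for span in tagger(text)]
```

Calling `extract_software_mentions("We analysed the data with SPSS.", load_tagger())` would return the spans the model tags as software; the exact output depends on the model.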

Corpus

The corpus has been built from the following sources of software mentions:

SoMESCi [1]. We have used the corpus uploaded to GitHub, more specifically the version organised by sentences.
Softcite [2]. This project has published another corpus of software mentions, which is also available on GitHub. We have used the annotations from the biomedical and economics domains.
Papers with Code. We have downloaded a list of publications from the Papers with Code site, which links publications and software from the machine learning domain. To build this part of the corpus, we have selected texts that mention the software related to the publication. DOI: 10.5281/zenodo.10033751

To build this corpus, we have removed the annotations of other entities, such as version and URL, as well as those describing the relation of the entity to the text. In this case, we only use the label Application_Mention.

To reconcile the source corpora, we have mapped their labels onto each other. We have also taken some annotation decisions. For example, in the case of Microsoft Excel, we annotate only Excel as the software mention, not the whole text.
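The filtering and label mapping described above could look roughly like the sketch below. The source label names in `LABEL_MAP` and the annotation format (dicts with `text` and `label` keys) are assumptions made for illustration; the real corpora may use different names and structures.

```python
# Hypothetical source labels mapped to the single label kept in the benchmark.
LABEL_MAP = {
    "Application_Mention": "Application_Mention",  # assumed SoMESCi-style label
    "software": "Application_Mention",             # assumed Softcite-style label
}


def harmonise(annotations):
    """Keep only software mentions and rename them to the shared label."""
    kept = []
    for ann in annotations:
        target = LABEL_MAP.get(ann["label"])
        # Labels without a mapping (version, url, relations, ...) are dropped.
        if target is not None:
            kept.append({**ann, "label": target})
    return kept
```

Dropping unmapped labels by default means any new entity type in a source corpus is excluded unless it is added to the mapping explicitly.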

Training

The corpus has been split 70/30 into training and test sets.
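A 70/30 split can be sketched as follows. This card does not say whether the split was random or seeded, so the shuffling, the `seed` value, and the function name `split_corpus` are assumptions.

```python
import random


def split_corpus(sentences, train_fraction=0.7, seed=42):
    """Shuffle a copy of the sentences and cut it at the training fraction."""
    shuffled = list(sentences)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed makes it possible to regenerate the same test set when comparing against other models.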

The training code can be found on GitHub.

Evaluation Results

These are the hyperparameters used to train the model:

evaluation_strategy="epoch"
save_strategy="no"
per_device_train_batch_size=16
per_device_eval_batch_size=16
num_train_epochs=3
weight_decay=1e-5
learning_rate=1e-4

The evaluation results are:

Precision: 0.8928176795580111
Recall: 0.8568398727465536
F1-score: 0.8744588744588745
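As a consistency check, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two values above. The function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision = 0.8928176795580111
recall = 0.8568398727465536
f1 = f1_score(precision, recall)  # approximately 0.8745, as reported above
```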

This model has been compared with generative models such as Llama 2 and Hermes, using the test part of the benchmark. The results below are for partial matches, that is, predictions that are included in the mentions annotated in the corpus.

Llama2 (7B)

Precision: 0.6342857142857142
Recall: 0.7161290322580646
F1-score: 0.67

Hermes (13B)

Precision: 0.4666666666666667
Recall: 0.509090909090909
F1-score: 0.4869565217391304
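One way to read the partial-match criterion is sketched below, assuming that a predicted mention counts as correct when it is contained (case-insensitively) in a gold mention of the same text. This is an interpretation for illustration, not the evaluation script used to obtain the figures above, and the name `partial_match_scores` is hypothetical.

```python
def partial_match_scores(predicted, gold):
    """Precision, recall and F1 where a prediction matches if a gold mention contains it."""
    pred = [p.lower() for p in predicted if p.strip()]
    ref = [g.lower() for g in gold]
    # A prediction is correct if some gold mention contains it.
    matched_pred = sum(any(p in g for g in ref) for p in pred)
    # A gold mention is found if some prediction is contained in it.
    matched_gold = sum(any(p in g for p in pred) for g in ref)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Under this reading, a prediction of "Excel" would match a gold mention of "Microsoft Excel", whereas an exact-match evaluation would count it as an error.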

Acknowledgements

This work was made possible thanks to the efforts of other projects:

Softcite
SoMESCi
SCIBERT

Authors

Esteban González Guardia
Daniel Garijo Verdejo

Contributors

References

[1] Schindler, D., Bensmann, F., Dietze, S., & Krüger, F. (2021, October). Somesci-A 5 star open data gold standard knowledge graph of software mentions in scientific articles. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (pp. 4574-4583).
[2] Du, C., Cohoon, J., Lopez, P., & Howison, J. (2021). Softcite dataset: A dataset of software mentions in biomedical and economic research publications. Journal of the Association for Information Science and Technology, 72(7), 870-884.
