---
license: cc-by-4.0
language:
- en
tags:
- business
- finance
- industry-classification
pipeline_tag: fill-mask
widget:
- text: "Sanofi is in the [MASK] industry."
- text: "The current ratio measures [MASK]."
---
# BusinessBERT
An industry-sensitive language model for business applications, pretrained on business communication corpora. In addition to masked language modeling (MLM), the model uses industry classification (IC) as a pretraining objective.
It was introduced in
[this paper](https://www.sciencedirect.com/science/article/pii/S0377221724000444) and released in
[this repository](https://github.com/pnborchert/BusinessBERT).
## Model description
We introduce BusinessBERT, an industry-sensitive language model for business applications. Its main advantage is a training approach that incorporates industry information relevant to business-related natural language processing (NLP) tasks.
We compile three large-scale textual corpora consisting of annual disclosures, company website content, and scientific literature, which together represent business communication. In total, the corpora comprise 2.23 billion tokens.
BusinessBERT builds upon the Bidirectional Encoder Representations from Transformers (BERT) architecture and embeds industry information during pretraining in two ways: (1) the business communication corpora contain a variety of industry-specific terminology; (2) we employ industry classification (IC) as an additional pretraining objective for text documents originating from companies.
## Intended uses & limitations
The model is intended to be fine-tuned on business-related NLP tasks, e.g. sequence classification, named entity recognition, sentiment analysis, or question answering.
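Before fine-tuning, the pretrained checkpoint can be probed through its masked language modeling head. The following is a minimal sketch, not an official usage recipe: it assumes the checkpoint is available on the Hugging Face Hub under the id `pborchert/BusinessBERT` (an assumption inferred from the author's namespace in the dataset link) and that `transformers` is installed with a PyTorch backend. The helper name `top_predictions` is hypothetical; the prompts are the widget examples from this card.

```python
# Assumed Hub id; adjust if the checkpoint is published under a different name.
MODEL_ID = "pborchert/BusinessBERT"

# Widget prompts from this model card; each contains exactly one [MASK] token.
PROMPTS = [
    "Sanofi is in the [MASK] industry.",
    "The current ratio measures [MASK].",
]


def top_predictions(prompts, model_id=MODEL_ID, top_k=3):
    """Return {prompt: [(token, score), ...]} using a fill-mask pipeline."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    results = {}
    for prompt in prompts:
        # For a single-mask prompt the pipeline returns a list of dicts
        # with the keys "score", "token", "token_str" and "sequence".
        predictions = fill_mask(prompt, top_k=top_k)
        results[prompt] = [(p["token_str"], p["score"]) for p in predictions]
    return results
```

Calling `top_predictions(PROMPTS)` downloads the checkpoint on first use and returns the highest-scoring fillers for each prompt. For downstream tasks, the same model id would be passed to a task-specific class such as `AutoModelForSequenceClassification` and fine-tuned on labeled data.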
## Training data
- [CompanyWeb](https://huggingface.co/datasets/pborchert/CompanyWeb): 0.77 billion tokens, 3.5 GB raw text file
- [MD&A Disclosures](https://data.caltech.edu/records/1249): 1.06 billion tokens, 5.1 GB raw text file
- [Semantic Scholar Open Research Corpus](https://api.semanticscholar.org/corpus): 0.40 billion tokens, 1.9 GB raw text file
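As a quick consistency check, the per-corpus figures listed above can be summed and compared with the 2.23 billion total stated in the model description. The snippet below is only an arithmetic sketch; the dictionary name `CORPORA` is illustrative and the numbers are copied from this card.

```python
import math

# (tokens in billions, raw text size in GB), copied from the list above.
CORPORA = {
    "CompanyWeb": (0.77, 3.5),
    "MD&A Disclosures": (1.06, 5.1),
    "Semantic Scholar Open Research Corpus": (0.40, 1.9),
}

total_tokens = sum(tokens for tokens, _ in CORPORA.values())
total_gb = sum(size for _, size in CORPORA.values())

# The summed token counts should match the total reported in the description.
assert math.isclose(total_tokens, 2.23, abs_tol=1e-9)

print(f"{total_tokens:.2f} billion tokens, {total_gb:.1f} GB raw text")
# prints "2.23 billion tokens, 10.5 GB raw text"
```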
## Evaluation results
Classification Tasks:
| Task | Financial Risk (F1/Acc) | News Headline Topic (F1/Acc) |
|:----:|:-----------:|:----:|
| | 85.89/87.02 | 75.06/67.71 |
Named Entity Recognition:
| Task | SEC Filings (F1/Prec/Rec) |
|:----:|:-----------:|
| | 79.82/77.45/83.38 |
Sentiment Analysis:
| Task | FiQA (MSE/MAE) | Financial Phrasebank (F1/Acc) | StockTweets (F1/Acc) |
|:----:|:-----------:|:----:|:----:|
| | 0.0758/0.202 | 75.06/67.71 | 69.14/69.54 |
Question Answering:
| Task | FinQA (Exe Acc/Prog Acc) |
|:----:|:-----------:|
| | 60.07/57.19 |
### BibTeX entry and citation info
```bibtex
@article{BORCHERT2024,
title = {Industry-sensitive language modeling for business},
journal = {European Journal of Operational Research},
year = {2024},
issn = {0377-2217},
doi = {https://doi.org/10.1016/j.ejor.2024.01.023},
url = {https://www.sciencedirect.com/science/article/pii/S0377221724000444},
author = {Philipp Borchert and Kristof Coussement and Jochen {De Weerdt} and Arno {De Caigny}},
}
```