|
--- |
|
license: cc-by-4.0 |
|
language: |
|
- en |
|
tags: |
|
- business |
|
- finance |
|
- industry-classification |
|
pipeline_tag: fill-mask |
|
widget: |
|
- text: "Sanofi is in the [MASK] industry." |
|
- text: "The current ratio measures [MASK]." |
|
--- |
|
|
|
# BusinessBERT |
|
|
|
BusinessBERT is an industry-sensitive language model for business applications, pretrained on business communication corpora. In addition to masked language modeling (MLM), the model incorporates industry classification (IC) as a pretraining objective.
|
|
|
It was introduced in |
|
[this paper](https://www.sciencedirect.com/science/article/pii/S0377221724000444) and released in |
|
[this repository](https://github.com/pnborchert/BusinessBERT). |
|
|
|
## Model description |
|
|
|
We introduce BusinessBERT, an industry-sensitive language model for business applications. Its key advantage is a training approach that incorporates industry information relevant to business-related natural language processing (NLP) tasks.

We compile three large-scale textual corpora representing business communication, consisting of annual disclosures, company website content, and scientific literature. In total, the corpora comprise 2.23 billion tokens.

BusinessBERT builds on the bidirectional encoder representations from transformers (BERT) architecture and embeds industry information during pretraining in two ways: (1) the business communication corpora contain a wide variety of industry-specific terminology; (2) we employ industry classification (IC) as an additional pretraining objective for text documents originating from companies.
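Conceptually, the IC objective adds a document-level classification loss on top of the token-level MLM loss. The following is a minimal, illustrative sketch of such a joint objective in PyTorch; it is not the authors' implementation, and the hidden size, vocabulary size, number of industry classes, and equal loss weighting are assumptions.

```python
import torch.nn as nn

# Illustrative sketch of a joint MLM + industry classification (IC) objective.
# Not the authors' implementation: head sizes, number of industry classes, and
# the equal loss weighting are assumptions. Per the paper, the IC loss applies
# only to documents originating from companies.
class JointPretrainingHeads(nn.Module):
    def __init__(self, hidden_size=768, vocab_size=30522, num_industries=10):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_size, vocab_size)     # token-level logits
        self.ic_head = nn.Linear(hidden_size, num_industries)  # document-level logits
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 = unmasked token

    def forward(self, sequence_output, pooled_output, mlm_labels, ic_labels):
        mlm_logits = self.mlm_head(sequence_output)   # (batch, seq_len, vocab)
        ic_logits = self.ic_head(pooled_output)       # (batch, num_industries)
        mlm_loss = self.loss_fn(mlm_logits.view(-1, mlm_logits.size(-1)),
                                mlm_labels.view(-1))
        ic_loss = self.loss_fn(ic_logits, ic_labels)
        return mlm_loss + ic_loss                     # equal weighting assumed
```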
|
|
|
## Intended uses & limitations |
|
|
|
The model is intended to be fine-tuned on business-related NLP tasks, e.g., sequence classification, named entity recognition, sentiment analysis, or question answering.
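Because the checkpoint was pretrained with MLM, it can also be queried out of the box with a fill-mask pipeline. A minimal usage sketch is shown below; the model id `pborchert/BusinessBERT` is an assumption inferred from the linked repository.

```python
from transformers import pipeline

# Minimal fill-mask sketch; the model id below is an assumption.
unmasker = pipeline("fill-mask", model="pborchert/BusinessBERT")
print(unmasker("Sanofi is in the [MASK] industry."))
```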
|
|
|
## Training data |
|
|
|
- [CompanyWeb](https://huggingface.co./datasets/pborchert/CompanyWeb): 0.77 billion tokens, 3.5 GB raw text file

- [MD&A Disclosures](https://data.caltech.edu/records/1249): 1.06 billion tokens, 5.1 GB raw text file

- [Semantic Scholar Open Research Corpus](https://api.semanticscholar.org/corpus): 0.40 billion tokens, 1.9 GB raw text file
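For reference, the CompanyWeb corpus is available on the Hugging Face Hub and can be loaded with the `datasets` library. This is a minimal sketch; the default configuration and the `train` split are assumptions.

```python
from datasets import load_dataset

# Load the CompanyWeb pretraining corpus; default config and "train" split
# are assumptions.
companyweb = load_dataset("pborchert/CompanyWeb", split="train")
print(companyweb[0])
```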
|
|
|
## Evaluation results |
|
|
|
Classification Tasks: |
|
|
|
| Model | Financial Risk (F1/Acc) | News Headline Topic (F1/Acc) |
|:------------:|:-----------:|:-----------:|
| BusinessBERT | 85.89/87.02 | 75.06/67.71 |
|
|
|
Named Entity Recognition: |
|
|
|
| Model | SEC Filings (F1/Prec/Rec) |
|:------------:|:-----------:|
| BusinessBERT | 79.82/77.45/83.38 |
|
|
|
Sentiment Analysis: |
|
|
|
| Model | FiQA (MSE/MAE) | Financial Phrasebank (F1/Acc) | StockTweets (F1/Acc) |
|:------------:|:-----------:|:-----------:|:-----------:|
| BusinessBERT | 0.0758/0.202 | 75.06/67.71 | 69.14/69.54 |
|
|
|
Question Answering: |
|
|
|
| Model | FinQA (Exe Acc/Prog Acc) |
|:------------:|:-----------:|
| BusinessBERT | 60.07/57.19 |
|
|
|
|
|
### BibTeX entry and citation info |
|
|
|
```bibtex |
|
@article{BORCHERT2024, |
|
title = {Industry-sensitive language modeling for business}, |
|
journal = {European Journal of Operational Research}, |
|
year = {2024}, |
|
issn = {0377-2217}, |
|
doi = {10.1016/j.ejor.2024.01.023},
|
url = {https://www.sciencedirect.com/science/article/pii/S0377221724000444}, |
|
author = {Philipp Borchert and Kristof Coussement and Jochen {De Weerdt} and Arno {De Caigny}}, |
|
} |
|
``` |