---

license: apache-2.0
language: en
datasets:
- wikipedia
- bookcorpus
tags:
- bert
- exbert
- linkbert
- feature-extraction
- fill-mask
- question-answering
- text-classification
- token-classification
---


## LinkBERT-large

This is the LinkBERT-large model, pretrained on English Wikipedia articles along with hyperlink information. It was introduced in the paper [LinkBERT: Pretraining Language Models with Document Links (ACL 2022)](https://arxiv.org/abs/2203.15827). The code and data are available in [this repository](https://github.com/michiyasunaga/LinkBERT).


## Model description

LinkBERT is a transformer encoder (BERT-like) model pretrained on a large corpus of documents. It improves on BERT by additionally capturing **document links**, such as hyperlinks and citation links, to incorporate knowledge that spans multiple documents. Specifically, it was pretrained by feeding linked documents into the same language model context, in addition to single documents.
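
As a rough illustration (not the authors' pretraining code), packing an anchor passage and a passage it hyperlinks to into one context can be done with the standard BERT-style segment pair; the passage texts below are made up:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/LinkBERT-large')

# Hypothetical anchor passage and a passage it hyperlinks to.
anchor_doc = "Tidal forces are gravitational effects that stretch a body along the line toward another body."
linked_doc = "Gravity is the attraction between objects with mass, described by Newton and later by Einstein."

# Encoding the two passages as a segment pair places them in the same
# context window, separated by [SEP], similar in spirit to how LinkBERT
# pretraining puts linked documents into a single input.
inputs = tokenizer(anchor_doc, linked_doc, return_tensors="pt")
print(tokenizer.decode(inputs["input_ids"][0]))
```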

LinkBERT can be used as a drop-in replacement for BERT. It achieves better performance for general language understanding tasks (e.g. text classification), and is also particularly effective for **knowledge-intensive** tasks (e.g. question answering) and **cross-document** tasks (e.g. reading comprehension, document retrieval).


## Intended uses & limitations

The model can be used by fine-tuning it on a downstream task, such as question answering, sequence classification, or token classification.
You can also use the raw model for feature extraction (i.e., obtaining embeddings for input text).


### How to use

To use the model to get the features of a given text in PyTorch:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/LinkBERT-large')
model = AutoModel.from_pretrained('michiyasunaga/LinkBERT-large')

# Tokenize the input text and run it through the encoder.
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

# Token-level embeddings of shape (batch_size, sequence_length, hidden_size).
last_hidden_states = outputs.last_hidden_state
```
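
The tags above also list fill-mask. Assuming the released checkpoint includes the standard BERT masked-language-modeling head, a prediction for a masked token can be obtained as follows (a sketch, not part of the original card):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/LinkBERT-large')
model = AutoModelForMaskedLM.from_pretrained('michiyasunaga/LinkBERT-large')

# Mask one token and ask the model to fill it in.
text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the most likely token at the masked position.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```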

For fine-tuning, you can use [this repository](https://github.com/michiyasunaga/LinkBERT) or follow any other BERT fine-tuning codebases.
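
For example, a minimal fine-tuning sketch with the Hugging Face `Trainer` on a text-classification dataset might look like the following; the dataset choice and hyperparameters are illustrative, not the settings used in the paper:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/LinkBERT-large')
model = AutoModelForSequenceClassification.from_pretrained(
    'michiyasunaga/LinkBERT-large', num_labels=2)

# Illustrative dataset: SST-2 sentiment classification from GLUE.
dataset = load_dataset("glue", "sst2")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="linkbert-large-sst2",   # illustrative output directory
    learning_rate=2e-5,                 # illustrative hyperparameters
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```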


## Evaluation results

When fine-tuned on downstream tasks, LinkBERT achieves the following results.

**General benchmarks ([MRQA](https://github.com/mrqa/MRQA-Shared-Task-2019) and [GLUE](https://gluebenchmark.com/)):**

| Model                   | HotpotQA (F1) | TriviaQA (F1) | SearchQA (F1) | NaturalQ (F1) | NewsQA (F1) | SQuAD (F1) | GLUE (avg. score) |
| ----------------------  | ------------- | ------------- | ------------- | ------------- | ----------- | ---------- | ----------------- |
| BERT-base               | 76.0          | 70.3          | 74.2          | 76.5          | 65.7        | 88.7       | 79.2              |
| **LinkBERT-base**       | **78.2**      | **73.9**      | **76.8**      | **78.3**      | **69.3**    | **90.1**   | **79.6**          |
| BERT-large              | 78.1          | 73.7          | 78.3          | 79.0          | 70.9        | 91.1       | 80.7              |
| **LinkBERT-large**      | **80.8**      | **78.2**      | **80.5**      | **81.0**      | **72.6**    | **92.7**   | **81.1**          |


## Citation

If you find LinkBERT useful in your project, please cite the following:

```bibtex
@InProceedings{yasunaga2022linkbert,
  author    = {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
  title     = {LinkBERT: Pretraining Language Models with Document Links},
  year      = {2022},
  booktitle = {Association for Computational Linguistics (ACL)},
}
```