# MarkupLM
**Multimodal (text +markup language) pre-training for [Document AI](https://www.microsoft.com/en-us/research/project/document-ai/)**
## Introduction
MarkupLM is a simple but effective multi-modal pre-training method of text and markup language for visually-rich document understanding and information extraction tasks, such as webpage QA and webpage information extraction. MarkupLM achieves the SOTA results on multiple datasets. For more details, please refer to our paper:
[MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) Junlong Li, Yiheng Xu, Lei Cui, Furu Wei, [ACL 2022](https://github.com/microsoft/unilm/tree/master/markuplm#)
The overview of our framework is as follows:
And the core XPath Embedding Layer is as follows:
## Release Notes
******* New Nov 22th, 2021: Initial release of pre-trained models and fine-tuning code for MarkupLM *******
## Pre-trained Models
We pre-train MarkupLM on a subset of the CommonCrawl dataset.
| Name | HuggingFace |
| - | - |
| MarkupLM-Base | [microsoft/markuplm-base](https://huggingface.co./microsoft/markuplm-base) |
| MarkupLM-Large | [microsoft/markuplm-large](https://huggingface.co./microsoft/markuplm-large) |
An example might be ``model = markuplm.from_pretrained("microsoft/markuplm-base")``.
## Installation
### Command
```
conda create -n markuplmft python=3.7
conda activate markuplmft
git clone https://github.com/microsoft/unilm.git
cd unilm
cd markuplm
pip install -r requirements.txt
pip install -e .
```
## Finetuning
### WebSRC
#### Prepare data
Download the dataset from the [official website](https://x-lance.github.io/WebSRC/).
Extract **release.zip** to **/Path/To/WebSRC**.
Download **dataset_split.json** from [this link](https://github.com/X-LANCE/WebSRC-Baseline/blob/master/data/dataset_split.json) and put it into **/Path/To/WebSRC**.
#### Generate dataset
```
cd ./examples/fine_tuning/run_websrc
python dataset_generation.py --root_dir /Path/To/WebSRC --version websrc1.0
```
#### Run
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run.py \
--train_file /Path/To/WebSRC/websrc1.0_train_.json \
--predict_file /Path/To/WebSRC/websrc1.0_dev_.json \
--root_dir /Path/To/WebSRC \
--model_name_or_path microsoft/markuplm-large \
--output_dir /Your/Output/Path \
--do_train \
--do_eval \
--eval_all_checkpoints \
--per_gpu_train_batch_size 8 \
--warmup_ratio 0.1 \
--num_train_epochs 5
```
### SWDE
#### Prepare data
Download the dataset from the [official website](https://archive.codeplex.com/?p=swde).
Update: the above website is down, please use this [backup](http://web.archive.org/web/20210630013015/https://codeplexarchive.blob.core.windows.net/archive/projects/swde/swde.zip).
Unzip **swde.zip**, and extract everything in **/sourceCode**, make sure we have folders like **auto / book / camera** ... under this directory, and we name this path as **/Path/To/SWDE**.
#### Generate dataset
```
cd ./examples/fine_tuning/run_swde
python pack_data.py \
--input_swde_path /Path/To/SWDE \
--output_pack_path /Path/To/SWDE/swde.pickle
python prepare_data.py \
--input_groundtruth_path /Path/To/SWDE/groundtruth \
--input_pickle_path /Path/To/SWDE/swde.pickle \
--output_data_path /Path/To/Processed_SWDE
```
And the needed data is in **/Path/To/Processed_SWDE**.
#### Run
Take **seed=1, vertical=nbaplayer** as example.
```
CUDA_VISIBLE_DEVICES=0,1 python run.py \
--root_dir /Path/To/Processed_SWDE \
--vertical nbaplayer \
--n_seed 1 \
--n_pages 2000 \
--prev_nodes_into_account 4 \
--model_name_or_path microsoft/markuplm-base \
--output_dir /Your/Output/Path \
--do_train \
--do_eval \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--num_train_epochs 10 \
--learning_rate 2e-5 \
--save_steps 1000000 \
--warmup_ratio 0.1 \
--overwrite_output_dir \
```
### Results
#### WebSRC (dev set)
Some of the baseline results are from [Chen et al., 2021](https://aclanthology.org/2021.emnlp-main.343/).
| Model | EM | F1 | POS |
| - | - | -| -|
|H-PLM (RoBERTa-Large) | 69.57| 74.13 | 85.93|
|H-PLM (ELECTRA-Large) |70.12 |74.14 | 86.33|
|V-PLM (ELECTRA-Large) |73.22 | 76.16|87.06 |
|**MarkupLM-Large** |**74.43** |**80.54** | **90.15**|
#### SWDE
The metric is **page-level F1**.
|Model \\ #Seed Sites| 1 | 2 | 3 | 4 | 5 |
|-|-|-|-|-|-|
|[Render-Full (Hao et al., 2011)](https://dl.acm.org/doi/10.1145/2009916.2010020)|84.30|86.00|86.80|88.40|88.60|
|[FreeDOM-Full (Lin et al., 2020)](https://dl.acm.org/doi/10.1145/3394486.3403153)|82.32|86.36|90.49|91.29|92.56|
|[SimpDOM (Zhou et al., 2021)](https://arxiv.org/abs/2101.02415)|83.06|88.96|91.63|92.84|93.75|
|**MarkupLM-Large**|**85.71**|**93.57**|**96.12**|**96.71**|**97.37**|
## Citation
If you find markupLM useful in your research, please cite the following paper:
@article{li2021markuplm,
title={MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding},
author={Junlong Li and Yiheng Xu and Lei Cui and Furu Wei},
year={2021},
eprint={2110.08518},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the [transformers](https://github.com/huggingface/transformers) project. [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
### Contact Information
For help or issues using MarkupLM, please submit a GitHub issue.
For other communications related to MarkupLM, please contact Lei Cui (`lecu@microsoft.com`), Furu Wei (`fuwei@microsoft.com`).