MarkupLM

Multimodal (text +markup language) pre-training for Document AI

Introduction

MarkupLM is a simple but effective multi-modal pre-training method of text and markup language for visually-rich document understanding and information extraction tasks, such as webpage QA and webpage information extraction. MarkupLM achieves the SOTA results on multiple datasets. For more details, please refer to our paper:

MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding Junlong Li, Yiheng Xu, Lei Cui, Furu Wei, ACL 2022

The overview of our framework is as follows:

And the core XPath Embedding Layer is as follows:

Release Notes

******* New Nov 22th, 2021: Initial release of pre-trained models and fine-tuning code for MarkupLM *******

Pre-trained Models

We pre-train MarkupLM on a subset of the CommonCrawl dataset.

Name	HuggingFace
MarkupLM-Base	microsoft/markuplm-base
MarkupLM-Large	microsoft/markuplm-large

An example might be model = markuplm.from_pretrained("microsoft/markuplm-base").

Installation

Command

conda create -n markuplmft python=3.7
conda activate markuplmft
git clone https://github.com/microsoft/unilm.git
cd unilm
cd markuplm
pip install -r requirements.txt
pip install -e .

Finetuning

WebSRC

Prepare data

Download the dataset from the official website.

Extract release.zip to /Path/To/WebSRC.

Download dataset_split.json from this link and put it into /Path/To/WebSRC.

Generate dataset

cd ./examples/fine_tuning/run_websrc
python dataset_generation.py --root_dir /Path/To/WebSRC --version websrc1.0

Run

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run.py \
    --train_file /Path/To/WebSRC/websrc1.0_train_.json \
    --predict_file /Path/To/WebSRC/websrc1.0_dev_.json \
    --root_dir /Path/To/WebSRC \
    --model_name_or_path microsoft/markuplm-large \
    --output_dir /Your/Output/Path \
    --do_train \
    --do_eval \
    --eval_all_checkpoints \
    --per_gpu_train_batch_size 8 \
    --warmup_ratio 0.1 \
    --num_train_epochs 5

SWDE

Prepare data

Download the dataset from the official website.

Update: the above website is down, please use this backup.

Unzip swde.zip, and extract everything in /sourceCode, make sure we have folders like auto / book / camera ... under this directory, and we name this path as /Path/To/SWDE.

Generate dataset

cd ./examples/fine_tuning/run_swde

python pack_data.py \
    --input_swde_path /Path/To/SWDE \
    --output_pack_path /Path/To/SWDE/swde.pickle

python prepare_data.py \
    --input_groundtruth_path /Path/To/SWDE/groundtruth \
    --input_pickle_path /Path/To/SWDE/swde.pickle \
    --output_data_path /Path/To/Processed_SWDE

And the needed data is in /Path/To/Processed_SWDE.

Run

Take seed=1, vertical=nbaplayer as example.

CUDA_VISIBLE_DEVICES=0,1 python run.py \
    --root_dir /Path/To/Processed_SWDE \
    --vertical nbaplayer \
    --n_seed 1 \
    --n_pages 2000 \
    --prev_nodes_into_account 4 \
    --model_name_or_path microsoft/markuplm-base \
    --output_dir /Your/Output/Path \
    --do_train \
    --do_eval \
    --per_gpu_train_batch_size 32 \
    --per_gpu_eval_batch_size 32 \
    --num_train_epochs 10 \
    --learning_rate 2e-5 \
    --save_steps 1000000 \
    --warmup_ratio 0.1 \
    --overwrite_output_dir \

Results

WebSRC (dev set)

Some of the baseline results are from Chen et al., 2021.

Model	EM	F1	POS
H-PLM (RoBERTa-Large)	69.57	74.13	85.93
H-PLM (ELECTRA-Large)	70.12	74.14	86.33
V-PLM (ELECTRA-Large)	73.22	76.16	87.06
MarkupLM-Large	74.43	80.54	90.15

SWDE

The metric is page-level F1.

Model \ #Seed Sites	1	2	3	4	5
Render-Full (Hao et al., 2011)	84.30	86.00	86.80	88.40	88.60
FreeDOM-Full (Lin et al., 2020)	82.32	86.36	90.49	91.29	92.56
SimpDOM (Zhou et al., 2021)	83.06	88.96	91.63	92.84	93.75
MarkupLM-Large	85.71	93.57	96.12	96.71	97.37

Citation

If you find markupLM useful in your research, please cite the following paper:

@article{li2021markuplm,
      title={MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding}, 
      author={Junlong Li and Yiheng Xu and Lei Cui and Furu Wei},
      year={2021},
      eprint={2110.08518},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers project. Microsoft Open Source Code of Conduct

Contact Information

For help or issues using MarkupLM, please submit a GitHub issue.

For other communications related to MarkupLM, please contact Lei Cui ([email protected]), Furu Wei ([email protected]).