MarkupLM
Multimodal (text + markup language) pre-training for Document AI
Introduction
MarkupLM is a simple but effective multimodal pre-training method for text and markup language, aimed at visually-rich document understanding and information extraction tasks such as webpage QA and webpage information extraction. MarkupLM achieves state-of-the-art (SOTA) results on multiple datasets. For more details, please refer to our paper:
MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding. Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. ACL 2022.
[Figure: overview of the MarkupLM framework]
[Figure: the core XPath Embedding Layer]
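As a rough illustration of the input such a layer works with, the sketch below splits an XPath expression into per-depth (tag, subscript) units, following the description in the paper. The function name `xpath_to_units` is hypothetical and not part of this repository; using 0 for a missing subscript follows the paper's description, and the released code may handle edge cases differently.

```python
import re

# One XPath step: a tag name, optionally followed by a numeric subscript like [2].
_STEP = re.compile(r"([^\[\]]+)(?:\[(\d+)\])?")

def xpath_to_units(xpath):
    """Split '/html/body/div[2]/span' into [(tag, subscript), ...], one pair per depth."""
    units = []
    for step in xpath.strip("/").split("/"):
        match = _STEP.fullmatch(step)
        if match is None:
            raise ValueError(f"cannot parse XPath step: {step!r}")
        tag, subscript = match.group(1), match.group(2)
        # A step without an explicit subscript is given subscript 0.
        units.append((tag, int(subscript) if subscript else 0))
    return units

print(xpath_to_units("/html/body/div[2]/span"))
# [('html', 0), ('body', 0), ('div', 2), ('span', 0)]
```

Each pair could then be looked up in a tag embedding table and a subscript embedding table at its depth, which is the role the XPath Embedding Layer plays.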
Release Notes
******* New Nov 22nd, 2021: Initial release of pre-trained models and fine-tuning code for MarkupLM *******
Pre-trained Models
We pre-train MarkupLM on a subset of the CommonCrawl dataset.
Name | HuggingFace |
---|---|
MarkupLM-Base | microsoft/markuplm-base |
MarkupLM-Large | microsoft/markuplm-large |
An example might be model = markuplm.from_pretrained("microsoft/markuplm-base").
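For readers who prefer the Hugging Face `transformers` port over the package installed from this repository, a minimal loading sketch might look like the following. It assumes a `transformers` version that ships `MarkupLMModel` and `MarkupLMProcessor` (the processor also needs `beautifulsoup4`); the helper `load_markuplm` and the `"base"`/`"large"` keys are hypothetical names introduced here for illustration.

```python
# Checkpoint names from the table above; the short keys are our own shorthand.
MODEL_IDS = {
    "base": "microsoft/markuplm-base",
    "large": "microsoft/markuplm-large",
}

def load_markuplm(size="base"):
    """Return (processor, model) for the chosen checkpoint.

    Assumes the Hugging Face `transformers` port of MarkupLM, not the
    `markuplmft` package installed from this repository.
    """
    if size not in MODEL_IDS:
        raise ValueError(f"unknown size {size!r}; expected one of {sorted(MODEL_IDS)}")
    # Imported lazily so the name check above works without transformers installed.
    from transformers import MarkupLMModel, MarkupLMProcessor

    name = MODEL_IDS[size]
    processor = MarkupLMProcessor.from_pretrained(name)
    model = MarkupLMModel.from_pretrained(name)
    return processor, model
```

Calling `processor, model = load_markuplm("base")` downloads the weights on first use. `processor(html_string, return_tensors="pt")` then produces inputs that can be passed to the model as `model(**inputs)`.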
Installation
Command
conda create -n markuplmft python=3.7
conda activate markuplmft
git clone https://github.com/microsoft/unilm.git
cd unilm
cd markuplm
pip install -r requirements.txt
pip install -e .
Finetuning
WebSRC
Prepare data
Download the dataset from the official website.
Extract release.zip to /Path/To/WebSRC.
Download dataset_split.json from this link and put it into /Path/To/WebSRC.
Generate dataset
cd ./examples/fine_tuning/run_websrc
python dataset_generation.py --root_dir /Path/To/WebSRC --version websrc1.0
Run
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run.py \
--train_file /Path/To/WebSRC/websrc1.0_train_.json \
--predict_file /Path/To/WebSRC/websrc1.0_dev_.json \
--root_dir /Path/To/WebSRC \
--model_name_or_path microsoft/markuplm-large \
--output_dir /Your/Output/Path \
--do_train \
--do_eval \
--eval_all_checkpoints \
--per_gpu_train_batch_size 8 \
--warmup_ratio 0.1 \
--num_train_epochs 5
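To get a feel for what these flags imply, the sketch below works through the arithmetic under two assumptions that are not confirmed by this README: that the per-GPU batch size is multiplied across all visible GPUs with no gradient accumulation, and that `--warmup_ratio` is a fraction of the total number of optimizer steps. The dataset size of 10,000 examples is a made-up number for illustration only.

```python
import math

def schedule(n_examples, n_gpus=8, per_gpu_batch=8, epochs=5, warmup_ratio=0.1):
    """Return (effective_batch, total_steps, warmup_steps) under the stated assumptions."""
    effective_batch = n_gpus * per_gpu_batch            # 8 GPUs x 8 examples each
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = int(warmup_ratio * total_steps)      # truncated to whole steps
    return effective_batch, total_steps, warmup_steps

# Hypothetical dataset size, not the real size of WebSRC.
print(schedule(10_000))
# (64, 785, 78)
```

If fewer GPUs are available, the effective batch size shrinks accordingly, so the learning rate or batch size may need adjusting to reproduce the reported numbers.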
SWDE
Prepare data
Download the dataset from the official website.
Update: the above website is down; please use this backup.
Unzip swde.zip and extract everything in /sourceCode. Make sure folders such as auto / book / camera ... exist under this directory; we refer to this path as /Path/To/SWDE.
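Before packing the data, it can help to confirm that the extraction produced the expected layout. The helper below is a hypothetical convenience, not part of this repository; by default it checks only the three vertical folders named above, and further names can be passed in explicitly.

```python
import os

def missing_verticals(swde_root, expected=("auto", "book", "camera")):
    """Return the expected vertical folders that do not exist under swde_root."""
    return [
        name
        for name in expected
        if not os.path.isdir(os.path.join(swde_root, name))
    ]

# Example: report anything missing under the (placeholder) SWDE path.
missing = missing_verticals("/Path/To/SWDE")
if missing:
    print("Missing vertical folders:", ", ".join(missing))
```

An empty result means every listed folder was found, so `pack_data.py` can be pointed at the same path.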
Generate dataset
cd ./examples/fine_tuning/run_swde
python pack_data.py \
--input_swde_path /Path/To/SWDE \
--output_pack_path /Path/To/SWDE/swde.pickle
python prepare_data.py \
--input_groundtruth_path /Path/To/SWDE/groundtruth \
--input_pickle_path /Path/To/SWDE/swde.pickle \
--output_data_path /Path/To/Processed_SWDE
The processed data is then available in /Path/To/Processed_SWDE.
Run
Take seed=1, vertical=nbaplayer as an example.
CUDA_VISIBLE_DEVICES=0,1 python run.py \
--root_dir /Path/To/Processed_SWDE \
--vertical nbaplayer \
--n_seed 1 \
--n_pages 2000 \
--prev_nodes_into_account 4 \
--model_name_or_path microsoft/markuplm-base \
--output_dir /Your/Output/Path \
--do_train \
--do_eval \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--num_train_epochs 10 \
--learning_rate 2e-5 \
--save_steps 1000000 \
--warmup_ratio 0.1 \
--overwrite_output_dir
Results
WebSRC (dev set)
Some of the baseline results are from Chen et al., 2021. EM is exact match, F1 is token-level F1, and POS is path overlap score.
Model | EM | F1 | POS |
---|---|---|---|
H-PLM (RoBERTa-Large) | 69.57 | 74.13 | 85.93 |
H-PLM (ELECTRA-Large) | 70.12 | 74.14 | 86.33 |
V-PLM (ELECTRA-Large) | 73.22 | 76.16 | 87.06 |
MarkupLM-Large | 74.43 | 80.54 | 90.15 |
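For intuition about the EM and F1 columns, here is a minimal sketch of exact match and token-level F1 between a predicted and a gold answer string. It assumes simple lowercasing and whitespace tokenization; the official WebSRC evaluation script may normalize answers differently, and POS is not covered here because it depends on the DOM tree.

```python
from collections import Counter

def exact_match(pred, gold):
    """1.0 if the answers are identical after trimming and lowercasing, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall between two answer strings."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers count as a match; one empty answer does not.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Red Car ", "red car"))   # 1.0
print(token_f1("the red car", "red car"))   # precision 2/3, recall 1
```

Dataset-level scores are then averages of these per-question values, reported as percentages.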
SWDE
The metric is page-level F1.
Model \ #Seed Sites | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
Render-Full (Hao et al., 2011) | 84.30 | 86.00 | 86.80 | 88.40 | 88.60 |
FreeDOM-Full (Lin et al., 2020) | 82.32 | 86.36 | 90.49 | 91.29 | 92.56 |
SimpDOM (Zhou et al., 2021) | 83.06 | 88.96 | 91.63 | 92.84 | 93.75 |
MarkupLM-Large | 85.71 | 93.57 | 96.12 | 96.71 | 97.37 |
Citation
If you find MarkupLM useful in your research, please cite the following paper:
@article{li2021markuplm,
  title={MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding},
  author={Junlong Li and Yiheng Xu and Lei Cui and Furu Wei},
  year={2021},
  eprint={2110.08518},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers project.
Microsoft Open Source Code of Conduct
Contact Information
For help or issues using MarkupLM, please submit a GitHub issue.
For other communications related to MarkupLM, please contact Lei Cui ([email protected]), Furu Wei ([email protected]).