# Incremental BPE builder Modified, simplified version of text_encoder_build_subword.py and its dependencies included in [tensor2tensor library](https://github.com/tensorflow/tensor2tensor), making its output fits to [google research's open-sourced BERT project](https://github.com/google-research/bert). ## Requirement The environment I made this project in consists of : - python 3.6 - tensorflow 1.11 ## Basic usage #### Build domain-specific vocabulary automatically If you want to build a proper size of vocabulary for specific domain using the incremental algorithm in our paper, you can do as following: ```bash python vocab_extend.py \ --corpus {file for the domain corpus} \ --raw_vocab {bert_raw_vocab_file} \ --output_file {he output file of the final vocabulary} \ --interval {vocab size interval} \ --threshold {threshold for P(D)} # Example using sample data python vocab_extend.py --corpus test_data/chem.txt \ --raw_vocab test_data/vocab.txt \ --output_file test_data/chem.vocab \ --interval 1000 --threshold 1 ``` If you simply want to get a specific size of vocab, you can run the following ``` python subword_builder.py \ --corpus_filepattern {corpus_for_vocab} \ --raw_vocab {bert_raw_vocab_file} \ --output_filename {name_of_vocab} \ --vocab_size {final vocab size} \ --do_lower_case ```