Dataset Processing
Our Benchmark (processed OIE2016)
First, download our benchmark tailored for compact extractions, provided here, and put it under data/OIE2016(processed).
Second, split out the train, development, and test sets for the constituent extraction model by running:
cd OIE2016(processed)/constituent_model
python process_constituent_data.py
Finally, split out the train, development, and test sets for the constituent linking model by running:
cd OIE2016(processed)/relation_model
python process_linking_data.py
Note that the data folders for training each model are set to the ones mentioned above.
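The two scripts above partition the benchmark into splits for their respective models. As a minimal sketch of such a partition (hypothetical ratios and data shape, not the scripts' actual logic):

```python
import random

def split_dataset(examples, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle examples and partition them into train/dev/test splits.
    Ratios and seed are illustrative assumptions, not the repo's values."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_dev = int(n * dev_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

# Example: 100 sentences split 80/10/10.
data = [{"sentence": f"sent {i}"} for i in range(100)]
train, dev, test = split_dataset(data)
```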
Evaluation Benchmarks
Three evaluation benchmarks (BenchIE, CaRB, and WiRe57) are used to evaluate CompactIE's performance. Since these datasets do not target compact triples, we exclude triples that have at least one clause within a constituent. To produce the final data (JSON format) for these benchmarks, run:
./process_test_data.sh
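The exclusion step above can be approximated as follows. The clause test here is a hypothetical heuristic (flagging constituents that contain a relative pronoun or subordinating conjunction), not the actual filter used to build the benchmarks:

```python
# Hypothetical marker list for detecting clausal constituents.
SUBORDINATORS = {"that", "which", "who", "because", "although", "while", "when"}

def has_clause(constituent: str) -> bool:
    """Heuristic: treat a constituent as clausal if it contains a
    relative pronoun or subordinating conjunction."""
    tokens = constituent.lower().split()
    return any(tok in SUBORDINATORS for tok in tokens)

def filter_compact(triples):
    """Keep only triples whose constituents are all clause-free."""
    return [t for t in triples if not any(has_clause(c) for c in t)]

triples = [
    ("John", "bought", "a car"),
    ("John", "said", "that he was tired"),  # clausal argument: dropped
]
compact = filter_compact(triples)
```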
Other files
Since the schema design of the table filling model does not support conjunctions inside constituents, we use the conjunction module developed by OpenIE6
to break sentences into smaller conjunction-free sentences before passing them to the system.
Therefore, for a new test file (source_file.txt), first produce the conjunction file (conjunctions.txt) and then run:
python process.py --source_file source_file.txt --target_file output.json --conjunctions_file conjunctions.txt
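As a toy illustration of what conjunction-free splitting produces (a naive split on coordinating conjunctions between clauses; OpenIE6's module is a learned model, so this is only illustrative):

```python
import re

def naive_conjunction_split(sentence: str):
    """Naively break a coordinated sentence into conjunction-free
    sub-sentences by splitting on ' and ' / ' or '. Only a sketch of
    the idea, not OpenIE6's actual conjunction module."""
    parts = re.split(r"\s+(?:and|or)\s+", sentence)
    return [p.strip().rstrip(",") for p in parts if p.strip()]

subsentences = naive_conjunction_split("Alice sings and Bob dances")
```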
Compactness measurement
To measure the compactness metrics mentioned in the paper (AL, NCC, RPA), set the INPUT_FILE variable inside the following script to the test file path and run:
python compactness_measurements.py
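Of these metrics, only AL is sketched here, under the assumption that it is the average length of an extraction in words; NCC and RPA depend on clause detection and argument alignment as defined in the paper, so they are omitted:

```python
def average_length(triples):
    """Assumed AL: mean number of words per extraction, where an
    extraction's length is the total word count across its constituents.
    This definition is an assumption based on the metric's name."""
    if not triples:
        return 0.0
    lengths = [sum(len(part.split()) for part in t) for t in triples]
    return sum(lengths) / len(lengths)

triples = [
    ("John", "bought", "a new car"),  # 5 words
    ("Mary", "lives in", "Paris"),    # 4 words
]
al = average_length(triples)  # → 4.5
```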