---
title: camelot-pg
app_file: src/app/run.py
sdk: gradio
sdk_version: 4.32.2
---

# PDF Table Parser

This script extracts tables from PDF files and saves them as CSV files. It provides a command-line interface (CLI) for batch processing and an optional web UI for interactive processing.

## Features

- Multi-page PDF support
- Progress display per row, per page, and per file
- CSV output encoded as UTF-8 with BOM
- Customizable edge and row tolerances for table detection
- Optional web UI for interactive processing using Gradio

## Installation

1. Clone the repository or download the script.
2. Install the required dependencies:

```bash
pip install rich camelot-py polars gradio gradio_pdf
```

## Usage

### Command-Line Interface (CLI)

To run the script via the CLI, use the following command:

```bash
python src/app/run.py input1.pdf input2.pdf output1.csv output2.csv
```

#### Arguments:

- `input_files`: List of input PDF files
- `output_files`: List of output CSV files (must match the number of input files)

#### Optional Arguments:

- `--delimiter`: Output file delimiter (default: `,`)
- `--edge_tol`: Tolerance parameter used to specify the distance between text and table edges (default: `50`)
- `--row_tol`: Tolerance parameter used to specify the distance between table rows (default: `10`)
- `--webui`: Launch the web UI

### Web UI

To run the script with the web UI, use the following command:

```bash
python src/app/run.py data/demo.pdf data/output.csv --webui
```

This launches a Gradio-based web application where you can upload PDFs and view the extracted tables interactively.

## Example

### CLI Example

```bash
python src/app/run.py data/demo.pdf data/output.csv --delimiter ";" --edge_tol 60 --row_tol 40
```

### Web UI Example

```bash
python src/app/run.py data/demo.pdf data/output.csv --webui
```

## Handling Interruptions

The script handles the `SIGINT` and `SIGTERM` signals gracefully, so processing can be interrupted safely.
## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Acknowledgements

This script uses the following libraries:

- [Rich](https://github.com/willmcgugan/rich) for console output and progress bars
- [Camelot](https://github.com/camelot-dev/camelot) for PDF table extraction
- [Polars](https://github.com/pola-rs/polars) for efficient DataFrame operations
- [Gradio](https://github.com/gradio-app/gradio) for the web UI
- [gradio_pdf](https://github.com/gradio-app/gradio) for PDF handling in Gradio