morisono committed
Commit 3ee750e
1 Parent(s): 22eec52

Upload folder using huggingface_hub

Files changed (7)
  1. .gitattributes +1 -0
  2. README.md +85 -8
  3. data/demo.pdf +3 -0
  4. data/readme.txt +2 -0
  5. data/success.csv +0 -0
  6. requirements.txt +5 -0
  7. src/app/run.py +162 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ data/demo.pdf filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,12 +1,89 @@
  ---
- title: Camelot Pg
- emoji: 📚
- colorFrom: blue
- colorTo: indigo
+ title: camelot-pg
+ app_file: src/app/run.py
  sdk: gradio
- sdk_version: 4.36.1
- app_file: app.py
- pinned: false
+ sdk_version: 4.32.2
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # PDF Table Parser
+
+ This script extracts tables from PDF files and saves them as CSV files. It provides a command-line interface (CLI) for batch processing and an optional web UI for interactive processing.
+
+ ## Features
+
+ - Multi-page PDF support
+ - Progress display per line/row, per page, and per file
+ - CSV output in UTF-8 with BOM encoding
+ - Customizable edge and row tolerances for table detection
+ - Optional web UI for interactive processing using Gradio
+
+ ## Installation
+
+ 1. Clone the repository or download the script.
+ 2. Install the required dependencies:
+ ```bash
+ pip install rich camelot-py polars gradio gradio_pdf
+ ```
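+
+ Alternatively, install the pinned versions from this repository's `requirements.txt`:
+
+ ```bash
+ pip install -r requirements.txt
+ ```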
+
+ ## Usage
+
+ ### Command-Line Interface (CLI)
+
+ To run the script via CLI, use the following command:
+
+ ```bash
+ python src/app/run.py input1.pdf input2.pdf output1.csv output2.csv
+ ```
+
+ #### Arguments:
+
+ - `input_files`: List of input PDF files
+ - `output_files`: List of output CSV files (must match the number of input files)
+
+ #### Optional Arguments:
+
+ - `--delimiter`: Output file delimiter (default: `,`)
+ - `--edge_tol`: Tolerance parameter used to specify the distance between text and table edges (default: `50`)
+ - `--row_tol`: Tolerance parameter used to specify the distance between table rows (default: `50`)
+ - `--pages`: Pages to process, e.g. `1`, `1,3`, `3-end`, or `all` (default: `all`)
+ - `--webui`: Launch the web UI
+
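+ These tolerances are passed straight through to Camelot's `stream` flavor. As a rough sketch of what `src/app/run.py` does internally for a single input/output pair (file paths below are illustrative):
+
+ ```python
+ import camelot
+ import polars as pl
+
+ # Detect tables on page 1 using text-based ("stream") detection.
+ tables = camelot.read_pdf(
+     "data/demo.pdf",
+     flavor="stream",
+     pages="1",     # same syntax as --pages: '1', '1,3', '3-end', 'all'
+     edge_tol=50,   # --edge_tol: distance between text and the table edge
+     row_tol=50,    # --row_tol: how close text rows must be to merge
+ )
+
+ # Each table.df is a pandas DataFrame; stack them and write a single CSV.
+ df = pl.concat([pl.DataFrame(t.df) for t in tables])
+ df.write_csv("data/output.csv", separator=",")
+ ```
+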
+ ### Web UI
+
+ To run the script with the web UI, use the following command:
+
+ ```bash
+ python src/app/run.py data/demo.pdf data/output.csv --webui
+ ```
+
+ This will launch a Gradio-based web application where you can upload PDFs and view the extracted tables interactively.
+
+ ## Example
+
+ ### CLI Example
+
+ ```bash
+ python src/app/run.py data/demo.pdf data/output.csv --delimiter ";" --edge_tol 60 --row_tol 40
+ ```
+
+ ### Web UI Example
+
+ ```bash
+ python src/app/run.py data/demo.pdf data/output.csv --webui
+ ```
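+
+ ### Programmatic Example
+
+ `PDFTableParser` in `src/app/run.py` can also be used directly from Python. A minimal sketch (the plain `from run import ...` assumes `src/app` is on `PYTHONPATH`):
+
+ ```python
+ from run import PDFTableParser  # assumes src/app is on PYTHONPATH
+
+ parser = PDFTableParser(
+     input_files=["data/demo.pdf"],
+     output_files=["data/output.csv"],
+     delimiter=",",
+     edge_tol=50,
+     row_tol=50,
+     pages="all",
+ )
+ parser.process_files()  # extracts tables from each PDF and writes one CSV per input
+ ```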
+
+ ## Handling Interruptions
+
+ The script handles `SIGINT` and `SIGTERM` signals gracefully, ensuring that processing can be interrupted safely.
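+
+ A minimal sketch of the same pattern (mirroring `handle_signal` in `src/app/run.py`):
+
+ ```python
+ import signal
+ import sys
+
+ def handle_signal(signum, frame):
+     # Report the interruption and exit with a non-zero status.
+     print("Process interrupted.")
+     sys.exit(1)
+
+ signal.signal(signal.SIGINT, handle_signal)   # Ctrl-C
+ signal.signal(signal.SIGTERM, handle_signal)  # termination request
+ ```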
+
+ ## License
+
+ This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
+
+ ## Acknowledgements
+
+ This script uses the following libraries:
+ - [Rich](https://github.com/willmcgugan/rich) for console output and progress bars
+ - [Camelot](https://github.com/camelot-dev/camelot) for PDF table extraction
+ - [Polars](https://github.com/pola-rs/polars) for efficient DataFrame operations
+ - [Gradio](https://github.com/gradio-app/gradio) for the web UI
+ - [gradio_pdf](https://github.com/gradio-app/gradio) for PDF handling in Gradio
data/demo.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6cdbced74a67ec85b5891d3ce931edc3345e9251512d7de314b867d2a716211b
+ size 1437530
data/readme.txt ADDED
@@ -0,0 +1,2 @@
+ Demo file from:
+ - https://www.npo-homepage.go.jp/npoportal/certification
data/success.csv ADDED
The diff for this file is too large to render. See raw diff
 
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ camelot_py==0.11.0
+ gradio==4.36.1
+ gradio_pdf==0.0.11
+ polars==0.20.31
+ rich==13.7.1
src/app/run.py ADDED
@@ -0,0 +1,162 @@
+ import argparse
+ import os
+ import signal
+ import sys
+ import time
+ import tempfile
+ import zipfile
+ from rich.console import Console
+ from rich.progress import track
+ import camelot
+ import polars as pl
+ import gradio as gr
+ from gradio_pdf import PDF
+
+ console = Console()
+
+ class Interface:
+     """Small helpers for temporary directories and zip archives."""
+
+     @staticmethod
+     def get_tempdir():
+         timestamp = int(time.time())
+         temp_dir = tempfile.mkdtemp()
+         return timestamp, temp_dir
+
+     @staticmethod
+     def create_zip(file_list, zip_path, password=None):
+         with zipfile.ZipFile(zip_path, "w", allowZip64=True) as zipf:
+             if password:
+                 # Note: ZipFile.setpassword only affects extraction; entries are not encrypted on write.
+                 zipf.setpassword(bytes(password, 'utf-8'))
+             for item in file_list:
+                 if os.path.isdir(item):
+                     for root, _, files in os.walk(item):
+                         for file in files:
+                             file_path = os.path.join(root, file)
+                             arcname = os.path.relpath(file_path, item)
+                             zipf.write(file_path, arcname)
+                 else:
+                     arcname = os.path.basename(item)
+                     zipf.write(item, arcname)
+
+ class PDFTableParser:
+     def __init__(self, input_files, output_files, delimiter, edge_tol, row_tol, pages):
+         self.input_files = input_files
+         self.output_files = output_files
+         self.delimiter = delimiter
+         self.edge_tol = edge_tol
+         self.row_tol = row_tol
+         self.pages = pages
+
+     def read_tables(self, file_name):
+         try:
+             console.print(f"Reading tables from {file_name}...")
+             tables = camelot.read_pdf(file_name, flavor='stream', edge_tol=self.edge_tol, row_tol=self.row_tol, pages=self.pages)
+             console.print(f"Found {len(tables)} tables in {file_name}.")
+             return tables
+         except Exception as e:
+             console.print(f"[red]Error reading {file_name}: {e}[/red]")
+             return None
+
+     def save_tables_as_csv(self, tables, output_file):
+         try:
+             console.print(f"Saving tables to {output_file}...")
+             df = pl.concat([pl.DataFrame(table.df) for table in tables])
+             df.write_csv(output_file, separator=self.delimiter)
+             console.print(f"Saved tables to {output_file}.")
+         except Exception as e:
+             console.print(f"[red]Error saving to {output_file}: {e}[/red]")
+
+     def estimate_processing_time(self, file_name):
+         # Rough heuristic based on the raw (decoded) file content; not an exact prediction.
+         try:
+             with open(file_name, 'rb') as f:
+                 content = f.read().decode('utf-8', errors='ignore')
+             lines = content.count('\n')
+             words = len(content.split())
+             chars = len(content)
+             estimated_time = (lines / 1000) + (words / 1000) + (chars / 1000)
+             console.print(f"Estimated processing time for {file_name}: {estimated_time:.2f} seconds.")
+             return estimated_time
+         except Exception as e:
+             console.print(f"[red]Error estimating processing time for {file_name}: {e}[/red]")
+             return 0
+
+     def process_files(self):
+         for input_file, output_file in track(zip(self.input_files, self.output_files), description="Processing files"):
+             self.estimate_processing_time(input_file)
+             tables = self.read_tables(input_file)
+             if tables:
+                 self.save_tables_as_csv(tables, output_file)
+
+ class WebUI:
+     def __init__(self):
+         pass
+
+     @staticmethod
+     def process_pdf(pdf_file, output_path, edge_tol, row_tol, pages):
+         ts, tempd = Interface.get_tempdir()
+         tempf = os.path.join(tempd, output_path)
+
+         parser = PDFTableParser([pdf_file], [tempf], ',', edge_tol, row_tol, pages)
+         tables = parser.read_tables(pdf_file)
+         if tables:
+             parser.save_tables_as_csv(tables, tempf)
+             df = pl.concat([pl.DataFrame(table.df) for table in tables])
+
+             return df, [tempf], {"status": "success", "message": f"Processed PDF and saved as {tempf}"}
+         return None, None, {"status": "error", "message": "Failed to process PDF"}
+
+     def run(self):
+         with gr.Blocks(title="PDF Table Parser", css="body { font-family: Arial, sans-serif; } footer { visibility: hidden; }") as app:
+             gr.Markdown("# PDF Table Parser")
+             description = "Upload a PDF file to extract tables"
+             gr.Markdown(f"### {description}")
+             with gr.Row():
+                 with gr.Column():
+                     pdf_in = PDF(label="Document")
+                     with gr.Row():
+                         edge_tol = gr.Number(50, label="Edge tol")
+                         row_tol = gr.Number(50, label="Row tol")
+                     pages = gr.Textbox('1', label="Pages", info="You can pass 'all', '3-end', etc.")
+                     output_path = gr.Textbox("output.csv", label="Output Path")
+                 with gr.Column():
+                     status_msg = gr.JSON(label="Status Message")
+                     output_files = gr.Files(label="Output Files")
+
+             with gr.Row():
+                 output_df = gr.Dataframe(label="Extracted Table")
+             examples = gr.Examples([["data/demo.pdf"]], inputs=pdf_in)
+             pdf_in.change(WebUI.process_pdf,
+                           inputs=[pdf_in, output_path, edge_tol, row_tol, pages],
+                           outputs=[output_df, output_files, status_msg])
+
+         app.launch()
+
+ def handle_signal(signum, frame):
+     console.print("\n[red]Process interrupted.[/red]")
+     sys.exit(1)
+
+ def main(args):
+     parser = PDFTableParser(args.input_files, args.output_files, args.delimiter, args.edge_tol, args.row_tol, args.pages)
+     parser.process_files()
+
+ if __name__ == "__main__":
+     signal.signal(signal.SIGINT, handle_signal)
+     signal.signal(signal.SIGTERM, handle_signal)
+
+     parser = argparse.ArgumentParser(description="PDF Table Parser")
+     parser.add_argument("input_files", nargs='+', help="List of input PDF files")
+     parser.add_argument("output_files", nargs='+', help="List of output CSV files")
+     parser.add_argument("--delimiter", default=',', help="Output file delimiter (default: ,)")
+     parser.add_argument("--edge_tol", type=int, default=50, help="Tolerance parameter used to specify the distance between text and table edges (default: 50)")
+     parser.add_argument("--row_tol", type=int, default=50, help="Tolerance parameter used to specify the distance between table rows (default: 50)")
+     parser.add_argument("--pages", type=str, default='all', help="Pages to process, e.g. '1', '1,3', '3-end', or 'all' (default: all)")
+     parser.add_argument("--webui", action='store_true', help="Launch the web UI")
+
+     args = parser.parse_args()
+
+     if len(args.input_files) != len(args.output_files):
+         console.print("[red]The number of input files and output files must match.[/red]")
+         sys.exit(1)
+
+     if args.webui:
+         webui = WebUI()
+         webui.run()
+     else:
+         main(args)