Christopher Capobianco committed on
Commit
b1eea1f
1 Parent(s): a2d3475

Try embedding document classifier inside main block

Files changed (1)
  1. projects/01_Document_Classifier.py +32 -31
projects/01_Document_Classifier.py CHANGED
@@ -72,34 +72,35 @@ def autoclassifier(images):
  # Delete image file
  os.remove(image.name)
 
- st.header('Document Classifier', divider='green')
-
- st.warning("Work in Progress")
-
- st.markdown("#### What is OCR?")
- st.markdown("OCR stands for Optical Character Recognition, and the technology for it has been around for over 30 years.")
- st.markdown("In this project, we leverage the extraction of the text from an image to classify the document. I am using EasyOCR as the OCR Engine, and I do some pre-processing of the raw OCR text to improve the quality of the words used to classify the documents.")
- st.markdown("After an investigation I settled on a Random Forest classifier for this project, since it had the best classification accuracy of the different models I investigated.")
- st.markdown("This project makes use of the [Real World Documents Collections](https://www.kaggle.com/datasets/shaz13/real-world-documents-collections) found at `Kaggle`")
- st.markdown("*This project is based off the tutorial by Animesh Giri [Intelligent Document Classification](https://www.kaggle.com/code/animeshgiri/intelligent-document-classification)*")
- st.markdown("*N.B. I created a similar document classifier in my first ML project, but that relied on IBM's Datacap for the OCR Engine. I also used a Support Vector Machine (SVM) classifier library (libsvm) at the time, but it was slow to train. I tried to re-create that document classifier again, using open source tools and modern techniques outlined in the referenced tutorial.*")
- st.divider()
-
- # Load the Spacy tokenizer
- nlp = load_nlp()
-
- # Initialze the OCR Engine
- ocr_engine = load_ocr_engine()
-
- # Load the Model
- stopwords, punctuations, model_pipe = load_model()
-
- # Fetch uploaded images
- images = st.file_uploader(
-     "Choose an image to classify",
-     type=['png','jpg','jpeg'],
-     accept_multiple_files=True
- )
-
- # Process and predict document classification
- autoclassifier(images)
+ if __name__ == "__main__":
+     st.header('Document Classifier', divider='green')
+
+     st.warning("Work in Progress")
+
+     st.markdown("#### What is OCR?")
+     st.markdown("OCR stands for Optical Character Recognition, and the technology for it has been around for over 30 years.")
+     st.markdown("In this project, we leverage the extraction of the text from an image to classify the document. I am using EasyOCR as the OCR Engine, and I do some pre-processing of the raw OCR text to improve the quality of the words used to classify the documents.")
+     st.markdown("After an investigation I settled on a Random Forest classifier for this project, since it had the best classification accuracy of the different models I investigated.")
+     st.markdown("This project makes use of the [Real World Documents Collections](https://www.kaggle.com/datasets/shaz13/real-world-documents-collections) found at `Kaggle`")
+     st.markdown("*This project is based off the tutorial by Animesh Giri [Intelligent Document Classification](https://www.kaggle.com/code/animeshgiri/intelligent-document-classification)*")
+     st.markdown("*N.B. I created a similar document classifier in my first ML project, but that relied on IBM's Datacap for the OCR Engine. I also used a Support Vector Machine (SVM) classifier library (libsvm) at the time, but it was slow to train. I tried to re-create that document classifier again, using open source tools and modern techniques outlined in the referenced tutorial.*")
+     st.divider()
+
+     # Load the Spacy tokenizer
+     nlp = load_nlp()
+
+     # Initialze the OCR Engine
+     ocr_engine = load_ocr_engine()
+
+     # Load the Model
+     stopwords, punctuations, model_pipe = load_model()
+
+     # Fetch uploaded images
+     images = st.file_uploader(
+         "Choose an image to classify",
+         type=['png','jpg','jpeg'],
+         accept_multiple_files=True
+     )
+
+     # Process and predict document classification
+     autoclassifier(images)
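
The effect of the `if __name__ == "__main__":` guard introduced by this commit can be sketched with the standard library alone. The snippet below is a minimal illustration, not the project's code: `PAGE_SOURCE`, `run_page`, and the `events` list are hypothetical stand-ins, and the placeholder `autoclassifier` only returns labels instead of running OCR and a classifier. Whether the app's runner actually executes this page with `__name__` set to `"__main__"` is not shown in the diff, which is presumably why the commit message says "Try".

```python
import runpy
import tempfile
import textwrap
from pathlib import Path

# Hypothetical stand-in for the page script; the real file makes Streamlit calls.
PAGE_SOURCE = textwrap.dedent('''
    events = []

    def autoclassifier(images):
        # Placeholder: the real function runs OCR and a classifier.
        return [f"classified:{name}" for name in images]

    if __name__ == "__main__":
        # Only runs when the file is executed as the main script.
        events.append("ui rendered")
        events.extend(autoclassifier(["invoice.png"]))
''')


def run_page(run_name):
    """Execute the stand-in page with __name__ set to run_name; return its events."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "page.py"
        path.write_text(PAGE_SOURCE)
        namespace = runpy.run_path(str(path), run_name=run_name)
    return namespace["events"]


# __name__ == "page": comparable to the file being imported by another module.
imported_events = run_page("page")

# __name__ == "__main__": comparable to running the file directly.
script_events = run_page("__main__")

print(imported_events)
print(script_events)
```

With the guard in place, functions such as `autoclassifier` remain importable without triggering the page body, which only runs when the file is executed as the main script.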