---
datasets:
- ganchengguang/resume_seven_class
language:
- en
base_model:
- spacy/en_core_web_md
pipeline_tag: text-classification
---

# Model Card for en_textcat_resume_sections

This model is designed to classify sections within English-language resumes, including labels such as Skills, Education, Experience, and others.

## Model Details

### Model Description

This model utilizes spaCy's text classification component to categorize sections of resumes into predefined labels. It is trained on the `ganchengguang/resume_seven_class` dataset, which contains examples of various resume sections.

- **Model type:** Text Classification
- **Language(s) (NLP):** English
- **Finetuned from model:** spacy/en_core_web_md

## Uses

### Direct Use

This model can be used to automatically classify sections within English-language resumes, facilitating the extraction of structured information from unstructured resume text. It can only classify Skills, Education, Experience, Profile and Summary successfully for now.

### Downstream Use

This model can serve as a component in larger systems for resume parsing, candidate screening, or any application requiring the identification of specific sections within resumes.

### Out-of-Scope Use

This model is not designed for tasks outside of resume section classification, such as general text classification or Named Entity Recognition (NER) in non-resume texts.

## Bias, Risks, and Limitations

The model's performance is dependent on the quality and diversity of the training data. It may not perform well on resumes that differ significantly from the training examples. Additionally, the model may have biases based on the dataset it was trained on.

### Recommendations

Users should be aware of the model's limitations and biases. It is recommended to evaluate the model's performance on a diverse set of resumes before deploying it in production environments.
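
The recommendation to evaluate on a diverse set of resumes could be followed with a small harness like the one below. This is a minimal sketch, not the author's evaluation code: `accuracy_by_label`, `top_label`, `keyword_classifier`, and the sample texts are hypothetical names and data introduced for illustration, and picking the highest-scoring label assumes one label per section.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple


def top_label(scores: Dict[str, float]) -> str:
    """Return the label with the highest score (assumes one label per section)."""
    return max(scores, key=scores.get)


def accuracy_by_label(
    classify: Callable[[str], Dict[str, float]],
    examples: Iterable[Tuple[str, str]],
) -> Dict[str, float]:
    """Fraction of correctly classified examples for each gold label."""
    correct: Counter = Counter()
    total: Counter = Counter()
    for text, gold in examples:
        total[gold] += 1
        if top_label(classify(text)) == gold:
            correct[gold] += 1
    return {label: correct[label] / total[label] for label in total}


def keyword_classifier(text: str) -> Dict[str, float]:
    """Hypothetical stand-in so the sketch runs without the model. NOT the real classifier."""
    lowered = text.lower()
    return {
        "Skills": 1.0 if "python" in lowered else 0.0,
        "Education": 1.0 if "university" in lowered else 0.0,
        "Experience": 1.0 if "worked" in lowered else 0.0,
    }


# Hypothetical hand-labeled sections; replace with sections from your own resumes.
samples = [
    ("Python, SQL, Docker", "Skills"),
    ("BSc, University of Leeds", "Education"),
    ("Worked as a data analyst", "Experience"),
    ("Led a team of five engineers", "Experience"),
]

result = accuracy_by_label(keyword_classifier, samples)
print(result)
```

To run this against the actual model, `keyword_classifier` would be replaced by a function that returns the `cats` dictionary of a spaCy `Doc`, for example `lambda text: nlp(text).cats` after loading the pipeline with `spacy.load(...)`. The installed package name and the exact label strings are not stated in this card, so check them (for instance via the `labels` attribute of the pipeline's text categorizer) before comparing against your gold labels.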
## How to Get Started with the Model

- https://github.com/ssobii2/Wozify-CV-Parser
- Check spaCy's website

## Training Details

### Training Data

The model was trained on the `ganchengguang/resume_seven_class` dataset, which contains examples of various resume sections.

### Training Procedure

The model was fine-tuned using spaCy's text classification component. The training involved the following steps:

1. Data preprocessing: Tokenization and vectorization of resume text.
2. Model training: Fine-tuning the `spacy/en_core_web_md` model on the preprocessed data.
3. Evaluation: Assessing the model's performance on a validation set.

#### Preprocessing

The text data was cleaned by removing special characters, normalizing whitespace, and converting text to lowercase. Tokenization was performed using spaCy's tokenizer.

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model was evaluated on a separate test set from the `ganchengguang/resume_seven_class` dataset, containing examples of resume sections not seen during training.

#### Factors

The evaluation considered factors such as resume length, formatting, and the presence of uncommon sections.

#### Metrics

The model's performance was measured using accuracy, precision, recall, and F1-score.
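
For readers less familiar with these metrics, the sketch below shows how per-label precision, recall, and F1 are conventionally computed from gold and predicted labels. It is an illustration of the standard definitions, not the scoring code used for this model; `per_label_prf` and the toy label lists are hypothetical.

```python
from typing import Dict, List


def per_label_prf(gold: List[str], pred: List[str]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, and F1 for every label seen in gold or pred."""
    report: Dict[str, Dict[str, float]] = {}
    for label in sorted(set(gold) | set(pred)):
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)  # true positives
        fp = sum(g != label and p == label for g, p in pairs)  # false positives
        fn = sum(g == label and p != label for g, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy data for illustration only; these are not the model's predictions.
gold = ["Skills", "Skills", "Education", "Experience", "Experience", "Education"]
pred = ["Skills", "Experience", "Education", "Experience", "Skills", "Education"]

report = per_label_prf(gold, pred)
for label, scores in report.items():
    print(label, scores)
```

Precision answers "of the sections predicted as this label, how many were right", while recall answers "of the sections that truly carry this label, how many were found"; F1 is their harmonic mean.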
### Results

The model achieved the following results on the test set:

## Text Categorization Model Performance Metrics

### Summary Section

- **Precision:** 88.4%
- **Recall:** 89.8%
- **F1-score:** 89.1%

### Profile Section

- **Precision:** 95.2%
- **Recall:** 88.3%
- **F1-score:** 91.6%

### Education Section

- **Precision:** 93.2%
- **Recall:** 90.5%
- **F1-score:** 91.9%

### Experience Section

- **Precision:** 78.8%
- **Recall:** 82.5%
- **F1-score:** 80.6%

### Skills Section

- **Precision:** 88.5%
- **Recall:** 88.5%
- **F1-score:** 88.5%

### Overall Model Performance

- **Micro Precision:** 88.3%
- **Micro Recall:** 87.7%
- **Micro F1-score:** 88.0%
- **Macro Precision:** 88.8%
- **Macro Recall:** 87.9%
- **Macro F1-score:** 88.3%
- **Macro AUC:** 97.8%

#### Summary

The model performs best on Education and Profile sections, while the Experience section has relatively lower performance metrics. The Skills section shows balanced precision and recall.

## Technical Specifications

### Model Architecture and Objective

The model is based on spaCy's text classification component, utilizing the `spacy/en_core_web_md` base model. The objective is to classify resume sections into predefined categories.

### Compute Infrastructure

The model was trained on my personal gaming laptop. The config file can be found inside the model folder.

#### Hardware

- Intel Core i7-13620H
- 16 GB RAM
- RTX 4070 Laptop GPU (8 GB VRAM)

#### Software

- **Operating System:** Windows 11
- **Libraries:** spaCy
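
As a cross-check on the Results section above, the macro-averaged figures appear to be the unweighted means of the five per-section scores. The sketch below recomputes them from the numbers reported in this card; the `reported` dictionary and the `macro` helper are illustrative names, and treating macro F1 as the mean of the per-section F1 values is an assumption that happens to match the reported figure.

```python
from typing import Dict, List, Tuple

# Per-section (precision, recall, F1) in percent, copied from the Results section.
reported: Dict[str, Tuple[float, float, float]] = {
    "Summary": (88.4, 89.8, 89.1),
    "Profile": (95.2, 88.3, 91.6),
    "Education": (93.2, 90.5, 91.9),
    "Experience": (78.8, 82.5, 80.6),
    "Skills": (88.5, 88.5, 88.5),
}


def macro(values: List[float]) -> float:
    """Unweighted mean across labels, rounded to one decimal place."""
    return round(sum(values) / len(values), 1)


macro_precision = macro([p for p, _, _ in reported.values()])
macro_recall = macro([r for _, r, _ in reported.values()])
macro_f1 = macro([f for _, _, f in reported.values()])

print(macro_precision, macro_recall, macro_f1)  # 88.8 87.9 88.3
```

The recomputed values agree with the reported Macro Precision (88.8%), Macro Recall (87.9%), and Macro F1-score (88.3%). The micro-averaged figures cannot be reproduced this way because they depend on per-label example counts, which this card does not list.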