diegovelilla committed on
Commit
786c2e2
1 Parent(s): ce1611d
Files changed (1)
  1. README.md +133 -4
README.md CHANGED
@@ -1,13 +1,142 @@
1
  ---
2
  title: EssAI App
3
- emoji: 🌍
4
- colorFrom: yellow
5
  colorTo: blue
6
  sdk: gradio
7
  sdk_version: 4.41.0
8
  app_file: app.py
9
- pinned: false
10
  license: apache-2.0
 
11
  ---
 
12
 
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
  title: EssAI App
3
+ emoji: 📚
4
+ colorFrom: green
5
  colorTo: blue
6
  sdk: gradio
7
  sdk_version: 4.41.0
8
  app_file: app.py
9
+ pinned: true
10
  license: apache-2.0
11
+ short_description: 'AI powered app for AI-generated essay detection'
12
  ---
13
+ # EssAI: AI-generated essay detector
14
 
15
+ ## Table of Contents
16
+
17
+ 1. [Overview](#overview)
18
+ 2. [Features](#features)
19
+ 3. [Files](#files)
20
+ 4. [Installation](#installation)
21
+ 5. [Usage](#usage)
22
+ 6. [Model Details](#model-details)
23
+ 7. [Dataset](#dataset)
24
+ 8. [Fine-tuning](#fine-tuning)
25
+ 9. [Results](#results)
26
+ 10. [Additional Resources](#additional-resources)
27
+ 11. [License](#license)
28
+ 12. [Contact](#contact)
29
+
30
+ ## Overview
31
+
32
+ This project fine-tunes a Large Language Model (LLM) to detect AI-generated essays. The model aims to help educators, researchers, and other individuals identify text that has been generated by AI and verify the authenticity of written content. The full model is hosted on [Hugging Face](https://huggingface.co/diegovelilla/EssAI), since I ran into GitHub's LFS restrictions when trying to upload it here.
33
+
34
+ ## Features
35
+
36
+ - Detects AI-generated essays with very high accuracy (over 95%).
37
+ - Fine-tuned on a sample drawn from a large dataset of ~500K human-written and AI-generated essays.
38
+
39
+ ## Files
40
+
41
+ ### `requirements.txt`
42
+ This file lists all the Python packages required to run the project. It ensures that all necessary dependencies are installed for the project to function correctly.
43
+
44
+ ### `essai_user_input.py`
45
+ This script handles user input. Paste your essay into it and run it to get a prediction.
46
+
47
+ ### `training.py`
48
+ This script handles the training process. It includes code for loading the dataset, fine-tuning the model, and saving the trained model.
49
+
50
+ ### `testing.py`
51
+ This script is used to evaluate the performance of the trained model. It loads the test dataset, performs predictions, and calculates performance metrics such as accuracy and F1-score.
52
+
53
+ ### `data_insights.py`
54
+ This script generates insights and visualizations from the data used in this project. It includes functions for analyzing dataset statistics, plotting graphs, and summarizing key data points to help understand the dataset better.
55
+
56
+ ## Installation
57
+
58
+ To install the required dependencies, clone the repository and install the Python packages listed in the **requirements.txt** file:
59
+
60
+ ```bash
61
+ git clone https://github.com/diegovelilla/EssAI
62
+ cd EssAI
63
+ pip install -r requirements.txt
64
+ ```
65
+
66
+ ## Usage
67
+
68
+ You can use the model to check your own essays by running the **essai_user_input.py** file after copying the text into the input section right after the imports:
69
+
70
+ ```python
71
+ # --- INPUT ---
72
+
73
+ input_list = [""" WRITE HERE YOUR FIRST ESSAY """,
74
+ """ WRITE HERE YOUR SECOND ESSAY """]
75
+
76
+ # -------------
77
+ ```
78
+ As you can see, you can check more than one essay at a time. The model was trained on essays of 350-400 words, so keep that in mind when checking texts of a very different length. Learn more about the data used in the [data_insights](https://github.com/diegovelilla/EssAI/blob/main/essai_data_insights.ipynb) notebook.
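The length note above can be checked before running a prediction. The sketch below is a minimal illustration, not code from the repository: `length_report` is a hypothetical helper that counts whitespace-separated words in each essay and flags those outside the 350-400 word range. The `classify` function shows one way the hosted model might be queried through the `transformers` `pipeline` API; it assumes `transformers` and a backend such as PyTorch are installed and that the `diegovelilla/EssAI` checkpoint loads as a text-classification model, which the README does not spell out.

```python
from typing import Dict, List

TRAIN_MIN_WORDS = 350  # lower bound of the essay length used in training
TRAIN_MAX_WORDS = 400  # upper bound of the essay length used in training


def length_report(essays: List[str]) -> List[Dict[str, object]]:
    """Count words per essay and flag lengths outside the training range."""
    report = []
    for essay in essays:
        n_words = len(essay.split())
        report.append({
            "words": n_words,
            "in_training_range": TRAIN_MIN_WORDS <= n_words <= TRAIN_MAX_WORDS,
        })
    return report


def classify(essays: List[str], model_id: str = "diegovelilla/EssAI"):
    """Assumed usage: run the hosted checkpoint as a text classifier."""
    # Imported lazily so the length check works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    # truncation=True keeps each input within BERT's 512-token limit.
    return classifier(essays, truncation=True)


if __name__ == "__main__":
    input_list = ["word " * 380, "too short"]
    for row in length_report(input_list):
        print(row)
```

Running the length check first makes it easier to interpret a surprising prediction on a text far shorter or longer than the training essays.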
79
+
80
+ ## Model Details
81
+ The base model selected for this project is [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) from Hugging Face. BERT (Bidirectional Encoder Representations from Transformers) is a transformer model developed and published in 2018 by Google's AI research team. It is an open-source model with 110M parameters, pretrained on a large corpus of English text with two objectives:
82
+
83
+ - Predicting missing words in a sentence.
84
+ - Guessing if two sentences were next to each other in the original text.
85
+
86
+ These objectives make it a competent text classification model and a strong candidate for this project.
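To make the first objective concrete, the sketch below hides a random subset of words behind a `[MASK]` token, which is the kind of input BERT learns to fill in during pretraining. It is a simplified illustration: it works on whitespace-separated words rather than WordPiece tokens, it always substitutes `[MASK]` (the BERT paper also sometimes keeps the word or swaps in a random one), and the function name `mask_words` is hypothetical. The 15% masking rate is the one reported in the BERT paper.

```python
import random
from typing import Dict, List, Tuple


def mask_words(
    sentence: str, rate: float = 0.15, seed: int = 0
) -> Tuple[List[str], Dict[int, str]]:
    """Replace roughly `rate` of the words with [MASK].

    Returns the masked word list and a mapping from masked position
    to the original word the model would have to predict.
    """
    words = sentence.split()
    if not words:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one word, but never more words than the sentence has.
    n_to_mask = min(len(words), max(1, round(len(words) * rate)))
    positions = rng.sample(range(len(words)), n_to_mask)
    targets = {pos: words[pos] for pos in positions}
    masked = ["[MASK]" if i in targets else w for i, w in enumerate(words)]
    return masked, targets


if __name__ == "__main__":
    masked, targets = mask_words("the cat sat on the mat and looked at the door")
    print(" ".join(masked))
    print(targets)
```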
87
+
88
+ ## Dataset
89
+ The dataset was taken from Kaggle and can be found [here](https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text). It contains about 500K essays, of which around 60% are human-written and the remaining 40% are AI-generated. For more information on the data, see the [data_insights](https://github.com/diegovelilla/EssAI/blob/main/essai_data_insights.ipynb) notebook. The [training](https://github.com/diegovelilla/EssAI/blob/main/essai_training.ipynb) and [testing](https://github.com/diegovelilla/EssAI/blob/main/essai_testing.ipynb) notebooks show how the model was fine-tuned and how to check its performance (instructions inside).
90
+
91
+ ## Fine-tuning
92
+ Because of resource constraints, and since this was intended as a learning project, only 1% of the full 500K dataset was used. That still amounts to a training set of 4,000 essays and a test set of 1,000 essays.
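As a sanity check on those numbers: 1% of 500,000 essays is 5,000, and an 80/20 split of that sample gives 4,000 training and 1,000 testing essays. The sketch below reproduces that arithmetic with a hypothetical `sample_and_split` helper. The 80/20 ratio is inferred from the figures above, and the random sampling and seed are assumptions rather than a description of what the training notebook does.

```python
import random
from typing import List, Sequence, Tuple


def sample_and_split(
    items: Sequence,
    fraction: float = 0.01,
    test_share: float = 0.2,
    seed: int = 42,
) -> Tuple[List, List]:
    """Draw a random `fraction` of items, then split it into train and test."""
    rng = random.Random(seed)
    n_sample = round(len(items) * fraction)
    sample = rng.sample(items, n_sample)  # sampling without replacement
    n_test = round(n_sample * test_share)
    # The first n_test sampled items form the test set, the rest the train set.
    return sample[n_test:], sample[:n_test]


if __name__ == "__main__":
    essay_ids = range(500_000)  # stand-in for the ~500K essays
    train, test = sample_and_split(essay_ids)
    print(len(train), len(test))  # prints: 4000 1000
```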
93
+
94
+ I encourage anyone reading this to train the model further on more data using the [training](https://github.com/diegovelilla/EssAI/blob/main/essai_training.ipynb) notebook.
95
+
96
+ ## Results
97
+ In the initial test on a sample of 1,000 essays, the model reached 98% accuracy. In a second, larger test on 20,000 essays, it reached 97%.
98
+
99
+ For more detailed evaluation and further testing, see the [testing](https://github.com/diegovelilla/EssAI/blob/main/essai_testing.ipynb) notebook.
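For reference, accuracy and F1-score, the two metrics the testing script is described as reporting, can be computed from true and predicted labels as sketched below. The label convention (1 for AI-generated, 0 for human-written) and the function name `accuracy_and_f1` are assumptions made for this illustration; the notebook may rely on a library implementation instead.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels, treating 1 as the positive class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denominator = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives.
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


if __name__ == "__main__":
    truth = [1, 1, 0, 0, 1]
    predicted = [1, 0, 0, 1, 1]
    print(accuracy_and_f1(truth, predicted))
```

Because the dataset is roughly 60% human-written, reporting F1 alongside accuracy gives a view of performance that is less dependent on the class balance.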
103
+
104
+ ## Additional Resources
105
+
106
+ Throughout development I found several resources very useful. I would like to share them here, along with others related to the project.
107
+
108
+ ### Tutorials and Documentation
109
+
110
+ - **[Hugging Face NLP Course](https://huggingface.co/learn/nlp-course/)**: Comprehensive tutorials and documentation on what NLP is and how to use Hugging Face's libraries.
111
+ - **[Hugging Face Transformers Documentation](https://huggingface.co/transformers/)**: The official documentation for the Transformers library.
112
+
113
+ ### Articles and Papers
114
+
115
+ - **[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)**: The original research paper on BERT, which provided insights into the architecture and capabilities of the model.
116
+ - **[A Comprehensive Guide to Fine-Tuning BERT](https://towardsdatascience.com/a-comprehensive-guide-to-fine-tuning-bert-for-nlp-tasks-39ef4a51c7d3)**: An article that outlines various techniques for fine-tuning BERT models for specific tasks.
117
+
118
+ ### Tools and Libraries
119
+
120
+ - **[Kaggle Datasets](https://www.kaggle.com/datasets)**: Platform used to source the dataset for this project.
121
+ - **[Git Large File Storage (LFS)](https://git-lfs.github.com/)**: Tool used for managing large files in the Git repository. Very useful for moving big files like the ones that form the model.
122
+
123
+ ### YouTube channels
124
+
125
+ - **[Andrej Karpathy](https://www.youtube.com/@AndrejKarpathy)**: One of my favourite ML/DL YouTube channels, with amazing videos. I can't stress enough how much I have learned from this man.
126
+ - **[DotCSV](https://www.youtube.com/@DotCSV)**: The first AI-related YouTube channel I ever followed. A great Spanish-language channel for keeping up with AI news.
127
+
128
+ These resources provided valuable information and tools throughout the project's development. If you’re working on similar projects, they might be helpful to you as well.
129
+
130
+ ## License
131
+ This project is licensed under the **Apache 2.0 License**. See the [LICENSE](https://github.com/diegovelilla/EssAI/blob/main/LICENSE) file for more details.
132
+
133
+ ## Contact
134
+
135
+ For any questions or feedback, please reach out via:
136
+
137
+ - **Email**: [[email protected]](mailto:[email protected])
138
+ - **GitHub Profile**: [diegovelilla](https://github.com/diegovelilla)
139
+ - **Hugging Face Profile**: [diegovelilla](https://huggingface.co/diegovelilla)
140
+ - **LinkedIn**: [Diego Velilla Recio](https://www.linkedin.com/in/diego-velilla-recio/)
141
+
142
+ Feel free to open an issue on GitHub or contact me in any way if you have any queries or suggestions.