Spaces:
Sleeping
Sleeping
diegovelilla
commited on
Commit
•
786c2e2
1
Parent(s):
ce1611d
Readme
Browse files
README.md
CHANGED
@@ -1,13 +1,142 @@
|
|
1 |
---
|
2 |
title: EssAI App
|
3 |
-
emoji:
|
4 |
-
colorFrom:
|
5 |
colorTo: blue
|
6 |
sdk: gradio
|
7 |
sdk_version: 4.41.0
|
8 |
app_file: app.py
|
9 |
-
pinned:
|
10 |
license: apache-2.0
|
|
|
11 |
---
|
|
|
12 |
|
13 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
---
|
2 |
title: EssAI App
|
3 |
+
emoji: 📚
|
4 |
+
colorFrom: green
|
5 |
colorTo: blue
|
6 |
sdk: gradio
|
7 |
sdk_version: 4.41.0
|
8 |
app_file: app.py
|
9 |
+
pinned: true
|
10 |
license: apache-2.0
|
11 |
+
short_description: 'AI powered app for AI-generated essay detection'
|
12 |
---
|
13 |
+
# EssAI: AI-generated essays detector
|
14 |
|
15 |
+
## Table of Contents
|
16 |
+
|
17 |
+
1. [Overview](#overview)
|
18 |
+
2. [Features](#features)
|
19 |
+
3. [Files](#files)
|
20 |
+
4. [Installation](#installation)
|
21 |
+
5. [Usage](#usage)
|
22 |
+
6. [Model Details](#model-details)
|
23 |
+
7. [Dataset](#dataset)
|
24 |
+
8. [Fine-tuning](#fine-tuning)
|
25 |
+
9. [Results](#results)
|
26 |
+
10. [Additional Resources](#additional-resources)
|
27 |
+
11. [License](#license)
|
28 |
+
12. [Contact](#contact)
|
29 |
+
|
30 |
+
## Overview
|
31 |
+
|
32 |
+
This project fine-tunes a Large Language Model (LLM) in order to detect AI-generated essays. The model aims to help educators, researchers or individuals identify text that has been generated by AI, ensuring the authenticity of written content. You can find the full model uploaded to [Hugging Face](https://huggingface.co/diegovelilla/EssAI) since I ran into some problems with Github's LFS restrictions trying to upload it here.
|
33 |
+
|
34 |
+
## Features
|
35 |
+
|
36 |
+
- Detects AI-generated essays with very high accuracy (over 95%).
|
37 |
+
- Fine-tuned on massive dataset combining ~500K human-written and AI-generated essays.
|
38 |
+
|
39 |
+
## Files
|
40 |
+
|
41 |
+
### `requirements.txt`
|
42 |
+
This file lists all the Python packages required to run the project. It ensures that all necessary dependencies are installed for the project to function correctly.
|
43 |
+
|
44 |
+
### `essai_user_input.py`
|
45 |
+
This script is responsible for handling user inputs. Just copy in your essay and run it to get the prediction.
|
46 |
+
|
47 |
+
### `training.py`
|
48 |
+
This script has handled the training process of the model. It includes code for loading the dataset, fine-tuning it and saving the trained model.
|
49 |
+
|
50 |
+
### `testing.py`
|
51 |
+
This script is used to evaluate the performance of the trained model. It loads the test dataset, performs predictions, and calculates performance metrics such as accuracy and F1-score.
|
52 |
+
|
53 |
+
### `data_insights.py`
|
54 |
+
This script generates insights and visualizations from the data used in this project. It includes functions for analyzing dataset statistics, plotting graphs, and summarizing key data points to help understand the dataset better.
|
55 |
+
|
56 |
+
## Installation
|
57 |
+
|
58 |
+
To install the required dependencies, clone the repository and install the necessary Python packages in the **requirements.txt** file:
|
59 |
+
|
60 |
+
```bash
|
61 |
+
git clone https://github.com/diegovelilla/EssAI
|
62 |
+
cd EssAI
|
63 |
+
pip install -r requirements.txt
|
64 |
+
```
|
65 |
+
|
66 |
+
## Usage
|
67 |
+
|
68 |
+
You can use the model to check your own essays by running the **essai_user_input.py** file and coping the text into the input part right after the imports:
|
69 |
+
|
70 |
+
```python
|
71 |
+
# --- INPUT ---
|
72 |
+
|
73 |
+
input_list = [""" WRITE HERE YOUR FIRST ESSAY """,
|
74 |
+
""" WRITE HERE YOUR SECOND ESSAY """]
|
75 |
+
|
76 |
+
# -------------
|
77 |
+
```
|
78 |
+
As you can see, you can check more than one essay at a time. This model has been trained with 350-400 word long essays, so just keep that in mind when using it. Learn more about the data used in the [data_insights](https://github.com/diegovelilla/EssAI/blob/main/essai_data_insights.ipynb) notebook.
|
79 |
+
|
80 |
+
## Model details
|
81 |
+
The base model selected for this project was the [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) model from hugging face. BERT (Bidirectional Encoder Representations from Transformers) is a transformer model developed and published in 2018 by Google's AI Reasearch Team. This is an open-source model with 110M parameters pretrained on a large corpus of English written data with the objectives of:
|
82 |
+
|
83 |
+
- Predicting missing words in a sentence.
|
84 |
+
- Guessing if two sentences were next to each other in the original text.
|
85 |
+
|
86 |
+
Which makes it a really competent text classification model and a great candidate for our project.
|
87 |
+
|
88 |
+
## Dataset
|
89 |
+
The dataset used was taken from Kaggle and can be found [here](https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text). It contains about 500K different essays with around 60% being human written and the 40% left AI-generated. For further data info, check out the [data_insights](https://github.com/diegovelilla/EssAI/blob/main/essai_data_insights.ipynb) notebook. Also check out the [training](https://github.com/diegovelilla/EssAI/blob/main/essai_training.ipynb) and [testing](https://github.com/diegovelilla/EssAI/blob/main/essai_testing.ipynb) notebooks if interested in how the model was fine-tuned or want to check the model's performance (instructions inside).
|
90 |
+
|
91 |
+
## Fine-tuning
|
92 |
+
For resource issues and since this was intended as a learning project, only 1% from de full 500K dataset has been used which would still mean a training dataset of 4.000 essays and a testing dataset of 1.000 essays.
|
93 |
+
|
94 |
+
I encourage anyone reading this to try to further train this model increasing the data used with the [training](https://github.com/diegovelilla/EssAI/blob/main/essai_training.ipynb) notebook.
|
95 |
+
|
96 |
+
## Results
|
97 |
+
For the first 1.000 datasets tested, the model showed a 98% accuracy. For the second one, and with a testing sample of 20.000 essays, the accuracy shown was 97%.
|
98 |
+
Further testing can be done using the [testing](https://github.com/diegovelilla/EssAI/blob/main/essai_testing.ipynb) notebook
|
99 |
+
|
100 |
+
In the initial testing phase with a sample of 1.000 essays, the model demonstrated an impressive accuracy of 98%. In a subsequent, more extensive test involving 20.000 essays, the model maintained a high accuracy of 97%.
|
101 |
+
|
102 |
+
For more detailed evaluation and further testing, please refer to the [testing](https://github.com/diegovelilla/EssAI/blob/main/essai_testing.ipynb) notebook.
|
103 |
+
|
104 |
+
## Additional Resources
|
105 |
+
|
106 |
+
Throughout the development, I've found some resources very useful that I would like to share apart from others related to the project.
|
107 |
+
|
108 |
+
### Tutorials and Documentation
|
109 |
+
|
110 |
+
- **[Hugging Face NLP Course](huggingface.co/learn/nlp-course/)**: Comprehensive tutorials and documentation on what is NLP and how to use Hugging Face's libraries.
|
111 |
+
- **[Hugging Face Transformers Documentation](https://huggingface.co/transformers/)**: The official documentation for the Transformers library.
|
112 |
+
|
113 |
+
### Articles and Papers
|
114 |
+
|
115 |
+
- **[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)**: The original research paper on BERT, which provided insights into the architecture and capabilities of the model.
|
116 |
+
- **[A Comprehensive Guide to Fine-Tuning BERT](https://towardsdatascience.com/a-comprehensive-guide-to-fine-tuning-bert-for-nlp-tasks-39ef4a51c7d3)**: An article that outlines various techniques for fine-tuning BERT models for specific tasks.
|
117 |
+
|
118 |
+
### Tools and Libraries
|
119 |
+
|
120 |
+
- **[Kaggle Datasets](https://www.kaggle.com/datasets)**: Platform used to source the dataset for this project.
|
121 |
+
- **[Git Large File Storage (LFS)](https://git-lfs.github.com/)**: Tool used for managing large files in the Git repository. Very useful for moving big files like the ones that form the model.
|
122 |
+
|
123 |
+
### YouTube channels
|
124 |
+
|
125 |
+
- **[Andrej Karpathy](https://www.youtube.com/@AndrejKarpathy)**: One of my favourite ML/DL YouTube channels with amazing videos. Can't stress enough how much I have learned from this man.
|
126 |
+
- **[DotCSV](https://www.youtube.com/@DotCSV)**: The first AI related YouTube channel I did ever follow. Great spanish speaking channel to keep up with AI news.
|
127 |
+
|
128 |
+
These resources provided valuable information and tools throughout the project's development. If you’re working on similar projects, they might be helpful to you as well.
|
129 |
+
|
130 |
+
## License
|
131 |
+
This project is licensed under the **Apache 2.0 License**. See the [LICENSE](https://github.com/diegovelilla/EssAI/blob/main/LICENSE) file for more details.
|
132 |
+
|
133 |
+
## Contact
|
134 |
+
|
135 |
+
For any questions or feedback please reach out to:
|
136 |
+
|
137 |
+
- **Email**: [[email protected]](mailto:[email protected])
|
138 |
+
- **GitHub Profile**: [diegovelilla](https://github.com/diegovelilla)
|
139 |
+
- **Hugging Face Profile**: [diegovelilla](https://huggingface.co/diegovelilla)
|
140 |
+
- **LinkedIn**: [Diego Velilla Recio](https://www.linkedin.com/in/diego-velilla-recio/)
|
141 |
+
|
142 |
+
Feel free to open an issue on GitHub or contact me in any way if you have any queries or suggestions.
|