--- license: apache-2.0 datasets: - Abirate/french_book_reviews pipeline_tag: text-classification --- ## Model and approach 🤗 #### As I am limited by my personal computer, the training was done on the distilbert-base-multilingual-cased model. This model is 60% faster than the classic BERT model and preserves 95% of the original model's accuracy. #### The dataset provided contains book titles, authors, reviews, and a score for each book. These columns were concatenated to form large context blocks and were used as the input text. The labels, (0, 1, and -1) were normalized to 0, 1, and 2, and finally to NEUTRAL, POSITIVE, and NEGATIVE to help with legibility of the predictions. #### As this exercise is simply to show my capacities to train a model, the model has been trained using 3000 training entries and 300 test entries for 2 epochs. ## Notes on the three classes and the model's bias 📝 #### The distribution of these classes is not equal in the ensemble of this dataset. Although it is shuffled, positive reviews are the most present, and therefore most-often predicted category. In addition, the decision to keep the review score in the text block did have an impact on the biases of the model. **The model can make a prediction based on score alone, a number between 1 and 5.** ### Positive reviews: 2081 ### Negative reviews: 224 ### Neutral reviews: 695