---
language: en
tags:
- license
- sentence-classification
- scancode
- license-compliance
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- scancode-rules
version: 1.0
---

# `lic-class-scancode-bert-base-cased-L32-1`

## Intended Use

This model is a sentence classifier used for results analysis in
[`scancode-results-analyzer`](https://github.com/nexB/scancode-results-analyzer).

`scancode-results-analyzer` helps detect faulty scans in [`scancode-toolkit`](https://github.com/nexB/scancode-toolkit) by using statistics and NLP modeling, among other tools, to make ScanCode better.

## How to Use

Refer to the [quickstart](https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine) section of the `scancode-results-analyzer` documentation for installation and getting started.

- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

In the `NLPModelsPredict` class, the function `predict_basic_lic_class` uses this classifier to predict sentences as either valid license tags or false positives.

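A minimal standalone sketch of such a prediction step is shown below. The checkpoint name, label strings, and label order here are assumptions for illustration; the authoritative implementation is `NLPModelsPredict` in the linked code, and the fine-tuned checkpoint for this card should be substituted for the base model.

```python
# Illustrative sketch only: label names/order and the checkpoint are
# assumptions, not taken from scancode-results-analyzer.
from typing import List

# The four classes described in this model card
LABELS: List[str] = [
    "license-text",
    "license-notice",
    "license-tag",
    "license-reference",
]

def id_to_label(class_id: int) -> str:
    """Map a predicted class index to its human-readable label."""
    return LABELS[class_id]

def classify(sentences: List[str]) -> List[str]:
    """Classify sentences with a BERT sequence classifier (downloads weights).

    NOTE: replace "bert-base-cased" with the fine-tuned checkpoint; loading
    the base model here would give a randomly initialized classification head.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=len(LABELS)
    )
    # Sentence length 32, matching the fine-tuning setup described below
    enc = tokenizer(
        sentences, truncation=True, max_length=32,
        padding=True, return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**enc).logits
    return [id_to_label(int(i)) for i in logits.argmax(dim=-1)]
```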
## Limitations and Bias

As this model is a fine-tuned version of the [`bert-base-cased`](https://huggingface.co/bert-base-cased) model, it inherits that model's biases. However, since it is fine-tuned for a very narrow task (classifying license text/notice/tag/reference sentences), those biases are unlikely to have much effect here.

## Training and Fine-Tuning Data

The BERT model was pretrained on BookCorpus, a dataset of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables, and headers).

This `bert-base-cased` model was then fine-tuned on ScanCode rule texts for sentence classification, where the four classes are:

- License Text
- License Notice
- License Tag
- License Reference

## Training Procedure

For the fine-tuning procedure and training, refer to the `scancode-results-analyzer` code.

- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

In the `NLPModelsTrain` class, the function `prepare_input_data_false_positive` prepares the training data.

In the `NLPModelsTrain` class, the function `train_basic_false_positive_classifier` fine-tunes this classifier.

1. Model - [BertBaseCased](https://huggingface.co/bert-base-cased) (0.5 GB of weights)
2. Sentence length - 32
3. Labels - 4 (License Text/Notice/Tag/Reference)
4. Fine-tuned for 4 epochs with a learning rate of 2e-5 (about 60 seconds per epoch on an RTX 2060)

Note: The classes aren't balanced.
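Since the classes aren't balanced, one common mitigation (shown as an illustration; not necessarily what `scancode-results-analyzer` does) is to weight the training loss by inverse class frequency:

```python
# Illustrative class-weighting helper; the function name is hypothetical.
from collections import Counter
from typing import Dict, List

def inverse_frequency_weights(labels: List[str]) -> Dict[str, float]:
    """Per-class weights inversely proportional to class frequency,
    scaled so the average weight across all samples is 1.0."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}
```

The resulting weights could then be passed, for example, to `torch.nn.CrossEntropyLoss(weight=...)` so that errors on rare classes count more.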
## Eval Results

- Accuracy on the training data (90% of the dataset): 0.98 (± 0.01)
- Accuracy on the validation data (10% of the dataset): 0.84 (± 0.01)

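A 90/10 split and accuracy metric like the one reported above can be sketched generically; the function names here are illustrative and not from the `scancode-results-analyzer` codebase:

```python
# Generic split/accuracy helpers for reference; names are hypothetical.
import random
from typing import List, Sequence, Tuple

def train_val_split(data: Sequence, val_fraction: float = 0.1,
                    seed: int = 0) -> Tuple[List, List]:
    """Shuffle deterministically, then hold out val_fraction for validation."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_fraction)
    return items[n_val:], items[:n_val]

def accuracy(predictions: Sequence, gold: Sequence) -> float:
    """Fraction of predictions that match the gold labels."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```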
## Further Work

1. Applying splitting/aggregation strategies
2. Data augmentation according to validation errors
3. Bigger/better-suited models