---
language: en
tags:
- license
- sentence-classification
- scancode
- license-compliance
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- scancode-rules
version: 1.0
---

# `lic-class-scancode-bert-base-cased-L32-1`

## Intended Use

This model is a sentence classifier used for results analysis in
[`scancode-results-analyzer`](https://github.com/nexB/scancode-results-analyzer).

`scancode-results-analyzer` helps detect faulty scans in [`scancode-toolkit`](https://github.com/nexB/scancode-toolkit) by using statistics and NLP modeling, among other tools, to make ScanCode better.

## How to Use

Refer to the [quickstart](https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine) section of the `scancode-results-analyzer` documentation for installation and getting started.

- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

In the `NLPModelsPredict` class, the function `predict_basic_lic_class` uses this classifier to predict sentences as either valid license tags or false positives.

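A minimal standalone sketch of such a prediction step is shown below. The checkpoint name, label strings, and label order here are assumptions for illustration; the authoritative implementation is `NLPModelsPredict` in the linked code, and the fine-tuned checkpoint for this card should be substituted for the base model.

```python
# Illustrative sketch only: label names/order and the checkpoint are
# assumptions, not taken from scancode-results-analyzer.
from typing import List

# The four classes described in this model card
LABELS: List[str] = [
    "license-text",
    "license-notice",
    "license-tag",
    "license-reference",
]

def id_to_label(class_id: int) -> str:
    """Map a predicted class index to its human-readable label."""
    return LABELS[class_id]

def classify(sentences: List[str]) -> List[str]:
    """Classify sentences with a BERT sequence classifier (downloads weights).

    NOTE: replace "bert-base-cased" with the fine-tuned checkpoint; loading
    the base model here would give a randomly initialized classification head.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=len(LABELS)
    )
    # Sentence length 32, matching the fine-tuning setup described below
    enc = tokenizer(
        sentences, truncation=True, max_length=32,
        padding=True, return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**enc).logits
    return [id_to_label(int(i)) for i in logits.argmax(dim=-1)]
```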
## Limitations and Bias

As this model is a fine-tuned version of the [`bert-base-cased`](https://huggingface.co/bert-base-cased) model, it inherits that model's biases. However, since it is fine-tuned for a very narrow task (classifying license text/notice/tag/reference sentences), those biases are unlikely to have much effect here.

## Training and Fine-Tuning Data

The BERT model was pretrained on BookCorpus, a dataset of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables, and headers).

This `bert-base-cased` model was then fine-tuned on ScanCode rule texts for sentence classification, where the four classes are:

- License Text
- License Notice
- License Tag
- License Reference

## Training Procedure

For the fine-tuning procedure and training, refer to the `scancode-results-analyzer` code.

- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

In the `NLPModelsTrain` class, the function `prepare_input_data_false_positive` prepares the training data.

In the `NLPModelsTrain` class, the function `train_basic_false_positive_classifier` fine-tunes this classifier.

1. Model - [BertBaseCased](https://huggingface.co/bert-base-cased) (0.5 GB of weights)
2. Sentence length - 32
3. Labels - 4 (License Text/Notice/Tag/Reference)
4. Fine-tuned for 4 epochs with a learning rate of 2e-5 (about 60 seconds per epoch on an RTX 2060)

Note: The classes aren't balanced.
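Since the classes aren't balanced, one common mitigation (shown as an illustration; not necessarily what `scancode-results-analyzer` does) is to weight the training loss by inverse class frequency:

```python
# Illustrative class-weighting helper; the function name is hypothetical.
from collections import Counter
from typing import Dict, List

def inverse_frequency_weights(labels: List[str]) -> Dict[str, float]:
    """Per-class weights inversely proportional to class frequency,
    scaled so the average weight across all samples is 1.0."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}
```

The resulting weights could then be passed, for example, to `torch.nn.CrossEntropyLoss(weight=...)` so that errors on rare classes count more.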
## Eval Results

- Accuracy on the training data (90% of the dataset): 0.98 (± 0.01)
- Accuracy on the validation data (10% of the dataset): 0.84 (± 0.01)

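A 90/10 split and accuracy metric like the one reported above can be sketched generically; the function names here are illustrative and not from the `scancode-results-analyzer` codebase:

```python
# Generic split/accuracy helpers for reference; names are hypothetical.
import random
from typing import List, Sequence, Tuple

def train_val_split(data: Sequence, val_fraction: float = 0.1,
                    seed: int = 0) -> Tuple[List, List]:
    """Shuffle deterministically, then hold out val_fraction for validation."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_fraction)
    return items[n_val:], items[:n_val]

def accuracy(predictions: Sequence, gold: Sequence) -> float:
    """Fraction of predictions that match the gold labels."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```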
## Further Work

1. Applying splitting/aggregation strategies
2. Data augmentation according to validation errors
3. Bigger/better-suited models