About me • Harry.vc • X.com • Papers
Model Card for King-Harry/NinjaMasker-PII-Redaction
This model is designed for the redaction and masking of Personally Identifiable Information (PII) in complex text scenarios like call transcripts.
News
- [2023/10/06] Building a new, significantly improved dataset and fixing stop tokens.
- [2023/10/05] NinjaMasker-PII-Redaction version 1 was released.
Model Details
Model Description
This model aims to handle complex and difficult instances of PII redaction that traditional classification models struggle with.
- Developed by: Harry Roy McLaughlin
- Model type: Fine-tuned Language Model
- Language(s) (NLP): English
- License: TBD
- Finetuned from model: NousResearch/Llama-2-7b-chat-hf
Model Sources
- Repository: Hosted on Hugging Face
- Demo: Coming soon
Test the model
Log in to Hugging Face (if not already logged in)
# Install the transformers library (huggingface_hub comes with it as a dependency)
!pip install transformers

# Authenticate with Hugging Face from the notebook
from huggingface_hub import notebook_login
notebook_login()
Load Model
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, logging
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)
# Load the model and tokenizer (the access token from notebook_login is picked up automatically if needed)
model_name = "King-Harry/NinjaMasker-PII-Redaction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
Generate Text
# Generate text
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=100)
prompt = "My name is Harry and I live in Winnipeg. My phone number is ummm 204 no 203, ahh 4344, no 4355"
result = pipe(f"<s>[INST] {prompt} [/INST]")
# Print the generated text
print(result[0]['generated_text'])
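Note that generated_text contains the prompt as well as the completion. A minimal post-processing sketch (assuming the [/INST] tag appears exactly once in the output) to keep only the redacted text:

# Keep only the model's completion, i.e. the text after the [/INST] tag
redacted = result[0]['generated_text'].split("[/INST]", 1)[1].strip()
print(redacted)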
Uses
Direct Use
The model is specifically designed for direct redaction and masking of PII in complex text inputs such as call transcripts.
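As an illustrative sketch only (the redact helper and the sample transcript below are assumptions, and the snippet reuses the pipe object from the Test-the-model section above), per-utterance redaction of a call transcript might look like this:

# Hypothetical helper: redact one utterance at a time using the pipeline defined above.
def redact(text):
    output = pipe(f"<s>[INST] {text} [/INST]")[0]["generated_text"]
    # Keep only the completion, i.e. the text after the [/INST] tag
    return output.split("[/INST]", 1)[1].strip()

transcript = [
    "Agent: Can I take your name and a contact number?",
    "Caller: Sure, it's Harry, and you can reach me on 204 555 4355.",
]
masked = [redact(line) for line in transcript]
print("\n".join(masked))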
Downstream Use
The model has potential for numerous downstream applications, though specific use-cases are yet to be fully explored.
Out-of-Scope Use
The model is under development; use in critical systems requiring 100% accuracy is not recommended at this stage.
Bias, Risks, and Limitations
The model is trained only on English text, which may limit its applicability in multilingual or non-English settings.
Recommendations
Users should be aware of the model's language-specific training and should exercise caution when using it in critical systems.
Training Details
Training Data
The model was trained on a dataset of 43,000 question/answer pairs containing various forms of PII, covering the 63 labels the model learns to mask.
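For illustration only, a training pair might look like the sketch below; the prompt format and the label names ([FIRST_NAME], [PHONE_NUMBER]) are assumptions, not the model's actual 63-label tag set.

# Hypothetical instruction/response pair; the mask label names are assumed, not taken from the dataset.
example_pair = {
    "instruction": "Hi, I'm Harry and my number is 204 555 4355.",
    "response": "Hi, I'm [FIRST_NAME] and my number is [PHONE_NUMBER].",
}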
Training Hyperparameters
- Training regime: FP16 mixed precision (see the sketch below)
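The training script is not published here; as a rough sketch (every value except fp16=True is an assumption), an FP16 run with the transformers Trainer might be configured like this:

from transformers import TrainingArguments

# Illustrative FP16 configuration; all values other than fp16=True are assumptions.
training_args = TrainingArguments(
    output_dir="./ninjamasker-finetune",   # hypothetical output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    fp16=True,                             # mixed-precision training regime stated above
    logging_steps=25,
)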
Speeds, Sizes, Times
- Hardware: T4 GPU
- Cloud Provider: Google Colab Pro (for the extra RAM)
- Training Duration: ~4 hours
Evaluation
Evaluation is pending.
Environmental Impact
The exact carbon footprint has not yet been calculated; a rough estimate based on the hardware and runtime below is sketched after the list.
- Hardware Type: T4 GPU
- Hours used: ~4
- Cloud Provider: Google Colab Pro
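A back-of-the-envelope estimate from the figures above, assuming a ~70 W average T4 draw, a datacenter PUE of ~1.1, and a grid intensity of ~0.4 kg CO2e/kWh (all three are assumptions):

# Rough CO2 estimate; power draw, PUE and grid intensity are assumed values.
gpu_power_kw = 0.070        # assumed average draw of a T4 (70 W TDP)
hours = 4                   # training duration reported above
pue = 1.1                   # assumed datacenter power usage effectiveness
grid_kgco2_per_kwh = 0.4    # assumed grid carbon intensity

energy_kwh = gpu_power_kw * hours * pue
emissions_kg = energy_kwh * grid_kgco2_per_kwh
print(f"~{energy_kwh:.2f} kWh, ~{emissions_kg:.2f} kg CO2e")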
Technical Specifications
Model Architecture and Objective
The model is a fine-tuned version of Llama 2 7B, tailored for PII redaction tasks.
Hardware
- Training Hardware: T4 GPU (with extra RAM)
Software
- Environment: Google Colab Pro
Disclaimer
This model is in its first generation and will be updated rapidly.
Model Card Authors
- Harry Roy McLaughlin
Model Card Contact