|
# GPT-NYC-nontoxic |
|
|
|
## About |
|
|
|
GPT2 (small version on HF) fine-tuned on questions and responses from https://reddit.com/r/asknyc |
|
|
|
I filtered comments to ones with scores >= 3, and responding directly |
|
to the original post ( = ignoring responses to other commenters). |
|
I also added many tokens which were common on /r/AskNYC but missing from |
|
GPT2. |
|
|
|
Additional <Toxic> and <NonToxic> tokens control following output. |
|
Toxic comments (about 5.5% of input data) are those which were flagged |
|
by [Perspective API](https://developers.perspectiveapi.com) with toxicity > 0.7, |
|
or by [English DeHateBERT](https://huggingface.co./Hate-speech-CNERG/dehatebert-mono-english), |
|
with <NonToxic> tagging for all comments related to LGBTQ identity |
|
to avoid false positives / more aggressive censorship from these classifiers. |
|
|
|
Try prompting with ```question? - additional info %% <Toxic> ``` |
|
Or ```question? - additional info %% <NonToxic>``` |
|
|
|
## Other options |
|
|
|
The [gpt-nyc-small](https://huggingface.co./monsoon-nlp/gpt-nyc-small) repo is based |
|
on GPT2 [small] but without the <Toxic> and <NonToxic> tags. It is the most |
|
directly comparable model to this one. |
|
|
|
The main [gpt-nyc](https://huggingface.co./monsoon-nlp/gpt-nyc) repo is based |
|
on GPT2-Medium and comes off more accurate. It does not have Toxic/NonToxic tagging. |
|
|
|
## Blog |
|
|
|
Initial model: https://mapmeld.medium.com/gpt-nyc-part-1-9cb698b2e3d |
|
|
|
## Notebooks |
|
|
|
### Data processing / new tokens |
|
|
|
https://colab.research.google.com/drive/13BOw0uekoAYB4jjQtaXTn6J_VHatiRLu |
|
|
|
### Fine-tuning GPT2 (small) |
|
|
|
https://colab.research.google.com/drive/1FnXcAh4H-k8dAzixkV5ieygV96ePh3lR |
|
|
|
### Predictive text and probabilities |
|
|
|
Scroll to end of |
|
|
|
https://colab.research.google.com/drive/1FnXcAh4H-k8dAzixkV5ieygV96ePh3lR |
|
|
|
to see how to install git-lfs and trick ecco into loading this. |
|
|