Privacy-preserving mimic models for clinical named entity recognition in French
In this paper, we propose a Privacy-Preserving Mimic Models architecture that enables the generation of shareable models using the mimic learning approach. The idea of mimic learning is to annotate unlabeled public data with a private teacher model trained on the original sensitive data. The newly labeled public dataset is then used to train student models. The resulting student models can be shared without releasing the sensitive data itself or exposing the private teacher model that was trained directly on it.
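The teacher-student loop described above can be sketched with toy stand-ins. The keyword "models", corpus, and function names below are illustrative only, not the paper's actual CamemBERT-based NER models:

```python
# Toy sketch of mimic learning: a private "teacher" annotates public text,
# and a shareable "student" is trained only on those silver labels.
# `teacher_predict`, `train_student`, and the corpora are hypothetical.

def teacher_predict(text):
    """Private teacher: pretend NER model that tags known drug names."""
    drugs = {"aspirine", "paracétamol"}
    return [(tok, "DRUG" if tok.lower() in drugs else "O")
            for tok in text.split()]

def train_student(silver_corpus):
    """Student: learns a lexicon from the silver (teacher-produced) labels."""
    lexicon = {tok.lower() for sent in silver_corpus
               for tok, label in sent if label == "DRUG"}
    def student_predict(text):
        return [(tok, "DRUG" if tok.lower() in lexicon else "O")
                for tok in text.split()]
    return student_predict

# 1. Annotate unlabeled *public* data with the private teacher.
public_corpus = ["aspirine 500 mg au besoin", "pas de traitement"]
silver = [teacher_predict(text) for text in public_corpus]

# 2. Train a student on the silver labels; only the student is shared.
student = train_student(silver)
print(student("aspirine prescrite"))
```

Only `student` would ever leave the secure environment; the teacher and its training data stay private.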
CAS Privacy-Preserving Named Entity Recognition (NER) Mimic Model
To generate the CAS Privacy-Preserving Mimic Model, we used a private teacher model to annotate the unlabeled CAS clinical French corpus. The private teacher model is an NER model trained on the MERLOT clinical corpus and cannot be shared. Using the produced silver annotations, we train the CAS student model, namely the CAS Privacy-Preserving NER Mimic Model. This process can be viewed as knowledge transfer between the teacher and the student model in a privacy-preserving manner.
We share only the weights of the CAS student model, which is trained on silver-labeled, publicly released data. We argue that no potential attack could recover information about the sensitive private data from the silver annotations, since these were generated by the private teacher model on publicly available, non-sensitive data.
Our model is built on top of the CamemBERT model using the Natural Language Structuring (NLstruct) library, which implements NER models that handle nested entities.
- Paper: Privacy-preserving mimic models for clinical named entity recognition in French
- Produced gold and silver annotations for the DEFT and CAS French clinical corpora: https://zenodo.org/records/6451361
- Developed by: Nesrine Bannour, Perceval Wajsbürt, Bastien Rance, Xavier Tannier and Aurélie Névéol
- Language: French
- License: cc-by-sa-4.0
Download the CAS Privacy-Preserving NER Mimic Model
import urllib.request
from huggingface_hub import hf_hub_url

# Download the fastText embeddings file
fasttext_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model_fasttext.txt")
urllib.request.urlretrieve(fasttext_url, fasttext_url.split('/')[-1])
# Download the model checkpoint
model_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model.ckpt")
urllib.request.urlretrieve(model_url, "path/to/your/folder/" + model_url.split('/')[-1])
path_checkpoint = "path/to/your/folder/" + model_url.split('/')[-1]
1. Load and use the model using only NLstruct
NLstruct is the Python library we used to build the CAS privacy-preserving NER mimic model; it handles nested entities.
Install the NLstruct library
pip install nlstruct==0.1.0
Use the model
from nlstruct import load_pretrained
from nlstruct.datasets import load_from_brat, export_to_brat
ner_model = load_pretrained(path_checkpoint)
test_data = load_from_brat("path/to/brat/test")
test_predictions = ner_model.predict(test_data)
# Export the predictions into the BRAT standoff format
export_to_brat(test_predictions, filename_prefix="path/to/exported_brat")
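For reference, `export_to_brat` writes predictions as BRAT standoff `.ann` files alongside the `.txt` documents. A minimal, hypothetical parser for a contiguous entity line (the label and offsets below are made up for illustration; real CAS annotations may also contain discontinuous, `;`-separated fragments) might look like:

```python
# Minimal sketch of reading one BRAT standoff entity ("T") line.
# Handles only the simple contiguous-span case.
def parse_brat_entity(line):
    ann_id, info, text = line.rstrip("\n").split("\t")
    label, begin, end = info.split(" ")
    return {"id": ann_id, "label": label,
            "begin": int(begin), "end": int(end), "text": text}

entity = parse_brat_entity("T1\tsubstance 10 18\taspirine")
print(entity)
```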
2. Load the model using NLstruct and use it with the Medkit library
Medkit is a Python library for facilitating the extraction of features from various modalities of patient data, including textual data.
Install the Medkit library
python -m pip install 'medkit-lib'
Use the model
Our model can be implemented as a Medkit operation module as follows:
import os
import urllib.request

from huggingface_hub import hf_hub_url
from medkit.core import Attribute
from medkit.core.text import Entity, NEROperation, Segment, span_utils
from nlstruct import load_pretrained


class CAS_matcher(NEROperation):
    def __init__(self):
        # Download the fastText embeddings file if it is not already present
        fasttext_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model_fasttext.txt")
        if not os.path.exists("CAS-privacy-preserving-model_fasttext.txt"):
            urllib.request.urlretrieve(fasttext_url, fasttext_url.split('/')[-1])
        # Download the model checkpoint if it is not already present
        model_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model.ckpt")
        os.makedirs("ner_model", exist_ok=True)
        if not os.path.exists("ner_model/CAS-privacy-preserving-model.ckpt"):
            urllib.request.urlretrieve(model_url, "ner_model/" + model_url.split('/')[-1])
        path_checkpoint = "ner_model/" + model_url.split('/')[-1]
        self.model = load_pretrained(path_checkpoint)
        self.model.eval()

    def run(self, segments):
        """Return entities for each match in `segments`.

        Parameters
        ----------
        segments:
            List of segments in which to look for matches.

        Returns
        -------
        List[Entity]
            Entities found in `segments`.
        """
        entities = []
        for segment in segments:
            matches = self.model.predict({"doc_id": segment.uid, "text": segment.text})
            entities.extend(self._matches_to_entities(matches, segment))
        return entities

    def _matches_to_entities(self, matches, segment: Segment):
        for match in matches["entities"]:
            text_all, spans_all = [], []
            # An entity may consist of several (possibly discontinuous) fragments
            for fragment in match["fragments"]:
                text, spans = span_utils.extract(
                    segment.text, segment.spans, [(fragment["begin"], fragment["end"])]
                )
                text_all.append(text)
                spans_all.extend(spans)
            entity = Entity(
                label=match["label"],
                text="".join(text_all),
                spans=spans_all,
            )
            # Attach the model confidence score as an attribute
            score_attr = Attribute(
                label="confidence",
                value=float(match["confidence"]),
            )
            entity.attrs.add(score_attr)
            yield entity
brat_converter = BratInputConverter()
docs = brat_converter.load("path/to/brat/test")
matcher = CAS_matcher()
for doc in docs:
    entities = matcher.run([doc.raw_segment])
    for ent in entities:
        doc.anns.add(ent)
brat_output_converter = BratOutputConverter(attrs=[])
# Keep the same document names in the output folder
doc_names = [os.path.splitext(os.path.basename(doc.metadata["path_to_text"]))[0] for doc in docs]
brat_output_converter.save(docs, dir_path="path/to/exported_brat", doc_names=doc_names)
Environmental Impact
Carbon emissions were estimated using the Carbontracker tool. The version used at the time of our experiments computed its estimates using the 2017 average carbon intensity of the European Union (294.21 gCO2eq/kWh) instead of the value for France (85 gCO2eq/kWh). Our reported carbon footprint for training both the private teacher model that generated the silver annotations and the CAS student model is therefore overestimated.
- Hardware Type: GPU NVIDIA GTX 1080 Ti
- Compute Region: Gif-sur-Yvette, Île-de-France, France
- Carbon Emitted: 292 gCO2eq
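Since the figure above was computed with the EU-average intensity, a rough back-of-the-envelope rescaling to the French grid value gives an idea of the corrected footprint (assuming the underlying energy estimate itself is accurate):

```python
# Rescale the reported footprint from the EU-average carbon intensity
# used by Carbontracker to the French grid intensity cited above.
reported_gco2eq = 292.0
eu_intensity = 294.21   # gCO2eq/kWh, EU average (2017)
fr_intensity = 85.0     # gCO2eq/kWh, France

energy_kwh = reported_gco2eq / eu_intensity   # implied energy consumption
adjusted_gco2eq = energy_kwh * fr_intensity   # footprint at French intensity
print(round(adjusted_gco2eq))
```

This suggests the true footprint on the French grid would be roughly 84 gCO2eq, about 3.5 times lower than reported.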
Acknowledgements
We thank the institutions and colleagues who made it possible to use the datasets described in this study: the Biomedical Informatics Department at the Rouen University Hospital provided access to the LERUDI corpus, and Dr. Grabar (Université de Lille, CNRS, STL) granted permission to use the DEFT/CAS corpus. We also thank ITMO Cancer Aviesan for funding our research, and the HeKA research team for integrating our model into their Medkit library.
Citation
If you use this model in your research, please make sure to cite our paper:
@article{BANNOUR2022104073,
  title = {Privacy-preserving mimic models for clinical named entity recognition in French},
  author = {Nesrine Bannour and Perceval Wajsbürt and Bastien Rance and Xavier Tannier and Aurélie Névéol},
  journal = {Journal of Biomedical Informatics},
  volume = {130},
  pages = {104073},
  year = {2022},
  issn = {1532-0464},
  doi = {10.1016/j.jbi.2022.104073},
  url = {https://www.sciencedirect.com/science/article/pii/S1532046422000892}
}