---
language: en
license: apache-2.0
library_name: transformers
tags:
- bert
- text-classification
- privacy-policy
- gdpr
- torchscript
datasets:
- MAPP-116
metrics:
- f1
model-index:
- name: PARENT BERT
  results:
  - task:
      type: text-classification
    dataset:
      name: MAPP-116
      type: text
    metrics:
    - name: f1
      type: score
      value: 0.80
---

# PARENT BERT Models for Privacy Policy Analysis

This repository contains **TorchScript versions of 15 fine-tuned BERT models** used in the PARENT project to analyse mobile app privacy policies. These models identify **what data is collected, why it is collected, and how it is processed**, helping assess GDPR compliance.

They are part of a hybrid framework designed for non-technical users, particularly parents concerned about children’s privacy.

---
## Model Purpose

- Segment privacy policies to detect:
  - Data collection types (e.g., contact info, location)
  - Purpose of data collection
  - How data is processed
- Support GDPR compliance evaluation
- Detect potential third-party sharing (in combination with a logistic regression model)
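
The models score one piece of text at a time, so a full policy has to be broken into segments first. This model card does not describe how PARENT segments policies; the sketch below is only one plausible approach, under the assumption that paragraphs are separated by blank lines. The helper name `split_into_segments` and the `max_words` cap are hypothetical and are not part of this repository.

```python
import re


def split_into_segments(policy_text: str, max_words: int = 200) -> list[str]:
    """Split a policy on blank lines, then cap each segment at max_words words.

    Assumption: paragraphs are separated by blank lines. The word cap is a
    rough stand-in for the tokenizer's length limit, not an exact token count.
    """
    segments = []
    for paragraph in re.split(r"\n\s*\n", policy_text):
        words = paragraph.split()
        # Empty paragraphs yield no words and therefore no segments.
        for start in range(0, len(words), max_words):
            segments.append(" ".join(words[start:start + max_words]))
    return segments


policy = "We collect your email address.\n\nWe share data with partners."
for segment in split_into_segments(policy):
    print(segment)
```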

---
## References

- **MAPP Dataset:** Arora, S., Hosseini, H., Utz, C., Bannihatti Kumar, V., Dhellemmes, T., Ravichander, A., Story, P., Mangat, J., Chen, R., Degeling, M., Norton, T.B., Hupperich, T., Wilson, S., & Sadeh, N.M. (2022). *A tale of two regulatory regimes: Creation and analysis of a bilingual privacy policy corpus*. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2022). [PDF](https://aclanthology.org/2022.lrec-1.585.pdf) (accessed 12 July 2025).

---

## Usage

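The example in this section builds a model's file name from its label by replacing spaces and slashes with underscores. A minimal sketch of that mapping as a reusable function follows. The helper name `label_to_filename` is hypothetical, and it is an assumption that all 15 model files follow the same naming pattern; only the contact-information label is confirmed by this card.

```python
def label_to_filename(label_name: str) -> str:
    """Map a label name to its TorchScript file name.

    Mirrors the sanitisation used in the usage example: spaces and
    slashes become underscores. Assumes every model file follows
    this pattern.
    """
    safe_label = label_name.replace(" ", "_").replace("/", "_")
    return f"torchscript_{safe_label}.pt"


print(label_to_filename("Information Type_Contact information"))
# prints "torchscript_Information_Type_Contact_information.pt"
```

The full example loads one such file and runs inference on a sample sentence.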
```python
import torch
from transformers import BertTokenizerFast
from huggingface_hub import hf_hub_download

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
REPO_ID = "Bnaad/PARENT_bert"

# Load tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Load one TorchScript model from Hugging Face.
# The file name is derived from the label: spaces and slashes become underscores.
label_name = "Information Type_Contact information"
safe_label = label_name.replace(" ", "_").replace("/", "_")
filename = f"torchscript_{safe_label}.pt"
model_path = hf_hub_download(repo_id=REPO_ID, filename=filename)
model = torch.jit.load(model_path, map_location=device)
model.to(device)
model.eval()

# Example inference
sample_text = """For any questions about your account or our services, please contact our customer support team by emailing support@example.com, calling +1-800-555-1234, or visiting our office at 123 Main Street, Springfield, IL, 62701 during business hours"""
inputs = tokenizer(
    sample_text,
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=512,
).to(device)

# Run the model without tracking gradients
with torch.no_grad():
    outputs = model(inputs["input_ids"], inputs["attention_mask"])

print("Logits:", outputs)

# Apply a sigmoid to turn the logit into a probability for this label
prob = torch.sigmoid(outputs.squeeze())
print(prob)
```