Cultural Heritage Metadata Accuracy

A BERT-based classifier that scores Italian cultural-heritage metadata descriptions as high quality or low quality — i.e. whether a description follows the ICCD (Istituto Centrale per il Catalogo e la Documentazione) cataloguing guidelines.

Trained on the biglam/cultural_heritage_metadata_accuracy dataset (~100K Italian descriptions from Cultura Italia, the Italian national cultural aggregator).

The dataset labels each description as HIGH quality if both the object and the subject of the item are described according to ICCD guidelines, and LOW quality otherwise. Most of the dataset was manually annotated; ~30K descriptions were automatically labeled LOW quality, either because they were very short (fewer than 3 tokens) or because they came from old (pre-2012), non-curated collections.
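The automatic LOW rule described above can be sketched roughly as follows. This is only an illustration, not the dataset authors' code: whitespace splitting stands in for whatever tokenizer was actually used, and the `year` and `curated` arguments are hypothetical fields assumed to be available for each record.

```python
from typing import Optional


def auto_low_quality(
    description: str,
    year: Optional[int],
    curated: bool,
    min_tokens: int = 3,
    cutoff_year: int = 2012,
) -> bool:
    """Return True if a record would be labelled LOW without manual annotation.

    Assumptions (not from the dataset documentation): tokens are
    whitespace-separated words, and `year` / `curated` are hypothetical
    per-record fields describing the source collection.
    """
    # Rule 1: the description is shorter than `min_tokens` tokens.
    too_short = len(description.split()) < min_tokens
    # Rule 2: the record comes from an old, non-curated collection.
    old_uncurated = year is not None and year < cutoff_year and not curated
    return too_short or old_uncurated
```

Because these records were labelled by rule rather than by an annotator, the model may partly learn the rule itself (very short text implies LOW), which is worth keeping in mind when reading its predictions.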

Intended use

Useful for surfacing Italian metadata records that may benefit from additional human review. Before deploying, validate:

  • How it performs on your specific data.
  • Whether you agree with the original dataset's quality definitions.
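One way to run the first check is to label a small sample of your own records by hand and compare those labels with the model's predictions. The sketch below computes accuracy and macro F1 in plain Python so the numbers are comparable to the validation metrics reported on this card. The label strings `"HIGH"` and `"LOW"` in the example are placeholders; use whatever label names the model actually returns.

```python
from collections import Counter


def accuracy_and_macro_f1(gold, predicted):
    """Compare hand-assigned labels with predicted labels.

    Returns (accuracy, macro_f1). Macro F1 is the unweighted mean of the
    per-label F1 scores, so both classes count equally.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[g] += 1  # missed a true g

    labels = sorted(set(gold) | set(predicted))
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)

    accuracy = sum(tp.values()) / len(gold)
    macro_f1 = sum(f1_scores) / len(labels)
    return accuracy, macro_f1


# Placeholder labels for a four-record sample.
gold = ["HIGH", "HIGH", "LOW", "LOW"]
predicted = ["HIGH", "LOW", "LOW", "LOW"]
acc, f1 = accuracy_and_macro_f1(gold, predicted)
```

A sample this small is only for illustration; a few hundred hand-labelled records give a much more stable estimate.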

Best used in a human-in-the-loop pipeline — flag low-quality records for catalogue review rather than making automatic accept/reject decisions.
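A minimal sketch of that flagging step is shown below. It assumes predictions in the shape the `transformers` text-classification pipeline returns for a list of inputs (one dict with `label` and `score` per record). The function name, the `low_label` value and the `min_score` threshold are illustrative assumptions; check the label names the model actually emits before relying on them.

```python
def flag_for_review(records, predictions, low_label="LOW", min_score=0.5):
    """Return (record, score) pairs predicted as low quality.

    `low_label` is an assumed label name and `min_score` an arbitrary
    threshold; both should be adapted to the model's real output.
    Results are sorted so the most confident flags come first.
    """
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")

    flagged = [
        (record, pred["score"])
        for record, pred in zip(records, predictions)
        if pred["label"] == low_label and pred["score"] >= min_score
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

With the pipeline from the Usage section, `predictions` would come from `pipe(records)`. Nothing is accepted or rejected here: the function only builds a queue for a cataloguer to review.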

Usage

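from transformers import pipeline

# Load the classifier from the Hugging Face Hub.
pipe = pipeline(
    "text-classification",
    model="small-models-for-glam/cultural_heritage_metadata_accuracy",
)
# Returns a list with one dict per input, each holding a predicted label and a score.
pipe("Elemento di decorazione architettonica a rilievo")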

Validation metrics

Metric    Value
Accuracy  0.972
Macro F1  0.972
Loss      0.085

Trained via AutoTrain (binary classification). CO2 emissions: 7.17 g.

Limitations

  • Italian only. Trained on Italian metadata; will not generalise to other languages without further fine-tuning.
  • Domain-bound. The training data consists of Cultura Italia records; performance on other Italian cataloguing traditions (e.g. archives, or museums using different schemas) is unverified.
  • ICCD-defined notion of "quality". This is what the model learned; whether ICCD-compliance matches your definition of quality is a separate question.

Part of the small-models-for-glam collection — task- and domain-specific models for libraries, archives, and museums.

Model size: 0.1B params (Safetensors; tensor types I64 and F32)