Museum collection systems live or die on the shape of the record before it ever reaches a screen. A digital asset arrives as a donor spreadsheet row, a nested archival XML fragment, or an OCR payload, and somewhere between that raw ingress and a public catalogue page it has to become a single, canonical, standards-conformant object record. This page defines the data layer where that transformation happens: a normalized architecture that decouples raw ingestion from public presentation, enforces a canonical schema at the point of entry, and treats every controlled term as a stable URI rather than a free-text string. It is the persistence and modelling core that everything else in the workflow depends on.

Three institutional roles meet at this layer. The collections manager needs the guarantee that every committed record carries the same field definitions, cardinality rules, and vocabulary constraints, so a search for a material or a place returns everything and only what it should. The Python automation engineer needs an executable contract — a schema that fails loudly on drift rather than silently coercing a malformed export into the database. The DAMS administrator needs an auditable chain of custody from source payload to indexed field, with intellectual-property state attached to the record before anything is exposed. This architecture is fed by the automated record ingestion and sync workflows that deliver validated payloads to its door, and it hands its normalized output to rights metadata mapping and licensing automation before a single asset reaches a public endpoint.

Architecture Overview

The core architecture is a deterministic, stateless sequence: raw data enters a staging buffer, passes a schema-validation gate, is normalized against controlled vocabularies, is persisted under a canonical model, and is only then eligible for standards-based delivery. Each transition between stages is an explicit gate, not a best-effort mapping. Raw records arriving as CSV, XML, or API payloads land in a staging buffer that isolates parsing failure from the primary write path. A Pydantic model rejects anything that violates cardinality or typing, routing failures to a quarantine queue with a structured error payload. Valid records proceed to normalization (ISO 8601 date coercion, whitespace trimming, and vocabulary resolution) before an idempotent upsert commits them to the primary repository. Delivery endpoints read from a replica and serialize to IIIF Presentation API 3.0 with RightsStatements.org URIs attached.
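As a rough illustration of the date-coercion step in the normalization stage, the sketch below accepts only a complete calendar date and refuses to guess at anything else. The names normalize_date and AmbiguousDate are hypothetical, and the policy of rejecting year-only or slash-separated dates is an assumption about how an institution might define "ambiguous".

```python
from __future__ import annotations

import re
from datetime import date

# Assumption: only a full YYYY-MM-DD calendar date counts as unambiguous.
_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


class AmbiguousDate(ValueError):
    """Raised so the caller can quarantine the record instead of guessing."""


def normalize_date(raw: str) -> str:
    """Trim whitespace and return a validated ISO 8601 calendar date."""
    text = raw.strip()
    if not _ISO_DATE.fullmatch(text):
        raise AmbiguousDate(f"cannot coerce {raw!r} without guessing")
    try:
        # date.fromisoformat rejects impossible dates such as 1890-02-30.
        return date.fromisoformat(text).isoformat()
    except ValueError as exc:
        raise AmbiguousDate(str(exc)) from exc
```

A caller would catch AmbiguousDate and route the record to quarantine, so that a value such as "05/06/1890" is never silently read as either May or June.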

The discipline that separates a resilient institutional pipeline from a fragile one-off script is that no stage trusts the stage before it implicitly. The staging buffer absorbs burst pressure and quarantines unparseable input; the validation gate enforces the canonical schema; the normalization stage guarantees every controlled term resolves to an authority URI; the delivery boundary redacts restricted fields by rights tier. The sections that follow specify the standards each gate honours, the canonical Python that models the record, and the deeper workflows that implement each stage.
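The gate discipline described above can be sketched with the standard library alone. This is a minimal illustration, not the production pipeline: run_gates, PipelineResult, and the two example gates are hypothetical names, and a real validation gate would be the Pydantic contract rather than a hand-written check.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable

Record = dict[str, object]
Gate = Callable[[Record], Record]  # returns the record or raises ValueError


@dataclass
class PipelineResult:
    committed: list[Record] = field(default_factory=list)
    quarantined: list[dict[str, object]] = field(default_factory=list)


def run_gates(payloads: list[Record], gates: list[tuple[str, Gate]]) -> PipelineResult:
    """Pass each payload through every gate in order; quarantine on first failure."""
    result = PipelineResult()
    for payload in payloads:
        record = dict(payload)  # work on a copy so the staged original is untouched
        stage = ""
        try:
            for stage, gate in gates:
                record = gate(record)
        except ValueError as exc:
            result.quarantined.append({
                "payload": payload,  # original input, kept for replay
                "stage": stage,      # which gate refused the record
                "error": str(exc),
                "quarantined_at": datetime.now(timezone.utc).isoformat(),
            })
        else:
            result.committed.append(record)
    return result


# Hypothetical gates, for illustration only.
def require_title(record: Record) -> Record:
    if not record.get("title"):
        raise ValueError("title is required")
    return record


def trim_title(record: Record) -> Record:
    return {**record, "title": str(record["title"]).strip()}
```

Because every gate either returns a record or raises, a failure at any stage leaves the primary repository untouched and produces a quarantine entry that names the stage responsible.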

Standards Landscape

The taxonomy layer is not neutral plumbing. Every record must arrive and depart conformant to the heritage standards that downstream discovery, aggregation, and rights enforcement depend on, and enforcing conformance at the modelling boundary is far cheaper than reconciling drift after thousands of records are committed. The table below maps each standard to the pipeline stage where this architecture enforces it.

Standard	Version	Enforced at stage	Use in this architecture
LIDO	1.1	Interchange → transformation	Harvestable interchange envelope; the nested source schema flattened into canonical relational fields, retained as an archival blob for provenance
CIDOC CRM	7.1.3	Transformation	Event and provenance property mapping when flattening nested LIDO into indexed fields
Getty AAT / TGN	Linked Open Data (current)	Normalization	Resolving free-text materials, subjects, styles, and place names to stable authority URIs
Dublin Core / DCTERMS	DCMI 2020	Transformation	Baseline crosswalk fields for cross-aggregator interoperability
RightsStatements.org	1.0	Validation (rights block)	Machine-readable rights URI asserted on every record before it is eligible to publish
IIIF Presentation API	3.0	Delivery	Manifest structure, canvas/service endpoints, and rights block for image-derived records

Version pinning is load-bearing. LIDO 1.1 changed element nesting enough that a mapper written against an earlier draft will silently drop provenance events, so the transformation layer must target the exact schema version your harvest partners publish — the canonical definition lives at lido-schema.org. Getty vocabularies are consumed as Linked Open Data rather than a versioned file, which means the normalization stage must cache resolved URIs and periodically reconcile deprecated terms against the published authority files at getty.edu. IIIF 3.0 restructured the manifest enough that a validator tuned for Presentation API 2.1 will accept payloads that later fail in a compliant viewer, so the delivery boundary must target the exact major version your viewer serves — see the IIIF Presentation API 3.0 specification.
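The caching and reconciliation requirement for Getty terms might look like the following sketch. VocabularyCache and CachedTerm are hypothetical names, the resolver is any callable you supply (no Getty client API is assumed), and the set of deprecated URIs is assumed to come from whatever process reads the published authority files.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CachedTerm:
    uri: str
    label: str


class VocabularyCache:
    """In-memory cache keyed on normalized free text (a sketch, not a persistent store)."""

    def __init__(self, resolver: Callable[[str], CachedTerm]) -> None:
        self._resolver = resolver  # hypothetical: performs the real authority lookup
        self._terms: dict[str, CachedTerm] = {}

    def resolve(self, free_text: str) -> CachedTerm:
        key = free_text.strip().casefold()
        if key not in self._terms:
            self._terms[key] = self._resolver(key)
        return self._terms[key]

    def reconcile(self, deprecated_uris: set[str]) -> list[str]:
        """Evict entries whose URI is deprecated and return their keys for review."""
        stale = sorted(k for k, t in self._terms.items() if t.uri in deprecated_uris)
        for key in stale:
            del self._terms[key]
        return stale
```

Evicting a deprecated entry forces the next lookup back to the authority, while the returned keys give a curator a list of terms whose committed records may need re-resolution.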

Core Implementation Pattern

The canonical task for this layer is the object record itself: a strict Pydantic v2 model that functions as an executable contract at the validation gate. It enforces cardinality and typing, coerces dates to ISO 8601, requires a machine-readable rights URI, and stores every controlled term as a resolved authority URI paired with a human-readable label. Explicit field definitions are preferred over flexible JSON blobs so that every field stays typed, indexable, and interpretable for long-term preservation. The model below uses Python 3.10+ syntax (PEP 604 | unions, which Pydantic evaluates at runtime, so the from __future__ import alone does not make them work on 3.9) and the Pydantic v2 API (model_config, field_validator, model_dump).

```python
from __future__ import annotations

import re
from datetime import date
from enum import Enum

from pydantic import BaseModel, ConfigDict, Field, field_validator

# Shape checks only: neither pattern proves that the URI dereferences.
GETTY_URI = re.compile(r"^http://vocab\.getty\.edu/(aat|tgn|ulan)/\d+$")
# [\w-]+ admits hyphenated statement identifiers such as InC-EDU and NoC-US.
RIGHTS_URI = re.compile(r"^https?://rightsstatements\.org/vocab/[\w-]+/1\.0/$")


class RightsTier(str, Enum):
    PUBLIC_DOMAIN = "public_domain"
    IN_COPYRIGHT = "in_copyright"
    RESTRICTED = "restricted"


class ControlledTerm(BaseModel):
    """A vocabulary term stored as a stable URI plus its human-readable label."""

    model_config = ConfigDict(strict=True, extra="forbid", frozen=True)

    uri: str
    label: str = Field(min_length=1)

    @field_validator("uri")
    @classmethod
    def _must_be_getty_uri(cls, v: str) -> str:
        if not GETTY_URI.match(v):
            raise ValueError(f"not a Getty authority URI: {v!r}")
        return v


class CanonicalObjectRecord(BaseModel):
    """The institutional object contract enforced at the validation gate."""

    model_config = ConfigDict(strict=True, extra="forbid")

    accession_number: str = Field(min_length=1)
    title: str = Field(min_length=1)
    object_type: ControlledTerm            # AAT concept, required
    materials: list[ControlledTerm] = Field(min_length=1)
    place_of_origin: ControlledTerm | None = None  # TGN concept
    # Strict mode rejects "1890-05-01" and "restricted" when validating a Python
    # dict, so these two fields opt out; every other field stays strict.
    date_created: date | None = Field(default=None, strict=False)  # ISO 8601 string or date
    rights_tier: RightsTier = Field(strict=False)
    rights_uri: str
    lido_archival: str | None = None       # original LIDO XML, retained verbatim

    @field_validator("rights_uri")
    @classmethod
    def _must_be_rights_statement(cls, v: str) -> str:
        if not RIGHTS_URI.match(v):
            raise ValueError(f"not a RightsStatements.org URI: {v!r}")
        return v

    @field_validator("accession_number")
    @classmethod
    def _normalize_accession(cls, v: str) -> str:
        return v.strip()


def validate_record(payload: dict) -> CanonicalObjectRecord:
    """Validate one raw payload against the canonical contract.

    Raises pydantic.ValidationError on cardinality, typing, or URI
    violations; the caller routes those failures to the quarantine queue.
    """
    return CanonicalObjectRecord.model_validate(payload)
```

Two design decisions carry most of the weight. First, strict=True with extra="forbid" means a drifted export (an added column, a coerced integer where a string is required, a missing material) fails at the gate instead of writing a partial record. Second, controlled terms are modelled as a frozen ControlledTerm rather than a bare string, which makes it structurally impossible to persist a free-text subject or place name; the URI validator rejects anything that does not have the shape of a Getty concept URI, while confirming that the URI actually resolves is left to the normalization stage. The lido_archival field preserves the original interchange payload verbatim, so no provenance is lost when the nested structure is flattened into indexed fields.

Explore the Collection Taxonomy Workflows

This architecture is implemented across the workflows below, each owning one gate in the flow above.

Modelling the record. Designing museum object schemas is the deepest treatment of the validation gate itself: how to structure the Pydantic contract so it enforces routing fields, rejects schema drift, separates structural requirements from semantic enrichment, and generates JSON Schema for cross-system reuse. Every other workflow in this section validates its output against the model defined there.

Resolving controlled vocabularies. Implementing Getty AAT and TGN covers the normalization stage: asynchronous resolution of free-text materials, subjects, and place names to Art & Architecture Thesaurus and Thesaurus of Geographic Names URIs, with a local cache to minimize external HTTP traffic, exponential backoff on rate limits, and nightly reconciliation of deprecated terms.
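The backoff behaviour mentioned above can be sketched with asyncio alone. This is a minimal illustration under stated assumptions: RateLimited is a hypothetical exception that your HTTP layer would raise on a rate-limit response, and the default delays are placeholders rather than values recommended by any vocabulary service.

```python
from __future__ import annotations

import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical signal that the vocabulary endpoint asked us to slow down."""


async def with_backoff(
    call: Callable[[], Awaitable[T]],
    *,
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
) -> T:
    """Retry `call` on RateLimited, doubling the delay ceiling after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await call()
        except RateLimited:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller quarantine or reschedule
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries so parallel workers do not resynchronize.
            await asyncio.sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Wrapping only the network call, and consulting the local cache before it, keeps the retry budget for terms that genuinely need an external lookup.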

Flattening the interchange envelope. Mapping LIDO to internal databases is the transformation layer: memory-efficient streaming extraction of LIDO XML with iterparse, field-level translation into relational tables or document fields, namespace handling, and idempotent async upsert with rights-aware routing — all while retaining the original structure in an archival column.
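A streaming extraction along these lines could use xml.etree.ElementTree.iterparse from the standard library. The sketch below is an assumption about what a minimal flattener might pull out: it reads only lidoRecID and the first title appellationValue, and the output keys are hypothetical. Note that ET.tostring re-serializes the element, so a byte-identical archival copy would need to be captured from the source bytes instead.

```python
from __future__ import annotations

import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator

LIDO_NS = "http://www.lido-schema.org"
NS = {"lido": LIDO_NS}


def iter_lido_records(source: BinaryIO) -> Iterator[dict[str, str | None]]:
    """Yield one flattened dict per <lido:lido> element without loading the whole file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != f"{{{LIDO_NS}}}lido":
            continue
        yield {
            "record_id": elem.findtext("lido:lidoRecID", namespaces=NS),
            "title": elem.findtext(
                ".//lido:titleSet/lido:appellationValue", namespaces=NS
            ),
            # Re-serialized fragment for the archival column (not byte-identical).
            "lido_archival": ET.tostring(elem, encoding="unicode"),
        }
        elem.clear()  # release the subtree once it has been flattened
```

Each yielded dict would then pass through the validation gate before the upsert, so a record with a missing title or identifier is rejected there rather than in the parser.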

Securing the delivery boundary. Security boundaries for collection APIs defines the enforcement layer between the repository and the outside world: a stateless gate that authenticates consumers, resolves rights tiers, redacts restricted provenance fields before serialization, and routes payloads through IIIF-compliant delivery so aggregators never touch raw tables.
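Redaction by rights tier could be sketched as a pure function applied just before serialization. Everything in this example is an assumption made for illustration: the consumer tiers (public, partner, staff), the clearance table, and the restricted field names are hypothetical; only the record tier values come from the RightsTier enum defined earlier.

```python
from __future__ import annotations

# Hypothetical field names that must not reach an uncleared consumer.
RESTRICTED_FIELDS = frozenset({"provenance", "donor_contact"})

# Hypothetical policy: record tiers each consumer tier may see unredacted.
CLEARANCE: dict[str, frozenset[str]] = {
    "public": frozenset({"public_domain"}),
    "partner": frozenset({"public_domain", "in_copyright"}),
    "staff": frozenset({"public_domain", "in_copyright", "restricted"}),
}


def redact(record: dict[str, object], consumer_tier: str) -> dict[str, object]:
    """Return a copy of the record with restricted fields removed unless cleared."""
    # Unknown consumer tiers get an empty clearance set, so the gate fails closed.
    allowed = CLEARANCE.get(consumer_tier, frozenset())
    if record["rights_tier"] in allowed:
        return dict(record)
    return {k: v for k, v in record.items() if k not in RESTRICTED_FIELDS}
```

Returning a new dict rather than mutating the input keeps the stored record intact, and the fail-closed default means a misconfigured consumer sees less, never more.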

Choosing among schema standards. Before you can flatten LIDO or design a canonical model, you have to decide which standard governs it. Comparing collection schema standards weighs LIDO, CIDOC CRM, Dublin Core, and the TMS and CollectiveAccess data models against one another, and shows how to pick a canonical internal schema and define the crosswalks that feed the validation gate above.

Integration Boundaries

This layer is the hinge between ingestion and publication, and its contracts are what keep the two halves of the pipeline decoupled. On the inbound side, the record ingestion and sync workflows deliver payloads to the staging buffer; this architecture owns everything from the validation gate onward. The handoff contract is the CanonicalObjectRecord model: ingestion is responsible for delivery and deduplication, and this layer is responsible for every committed record being well-formed. Keeping that boundary sharp means a change to the broker or the CSV adapter never forces a change to the schema, and vice versa.

On the outbound side, a committed record is not automatically publishable. The rights_tier and rights_uri fields set by this layer are the input to rights metadata mapping and licensing automation, which decides embargo routing, access tier, and the rights block that the IIIF image delivery and manifest generation layer writes into every published manifest. The delivery endpoint reads from a replica and serializes through the security boundary, so the intersection of a consumer’s rights tier and a record’s rights tier determines which fields are redacted before the manifest is generated. The result is a single directional flow — validated record → normalized record → rights-routed record → IIIF manifest — where each stage owns exactly one transformation and hands a well-defined contract to the next.
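The last step of that flow, writing the rights assertion into a manifest, might look like the sketch below. It builds only the skeleton of a IIIF Presentation API 3.0 manifest; build_manifest and the example.org identifier are hypothetical, and canvases are omitted even though a publishable manifest needs at least one item.

```python
from __future__ import annotations

import json


def build_manifest(manifest_id: str, title: str, rights_uri: str) -> dict[str, object]:
    """Return a skeletal IIIF Presentation 3.0 manifest carrying the rights URI."""
    if not rights_uri:
        # Mirrors the rule that no record publishes without a rights assertion.
        raise ValueError("rights_uri is required before a manifest is generated")
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": manifest_id,
        "type": "Manifest",
        "label": {"en": [title]},  # language map, as Presentation 3.0 expects
        "rights": rights_uri,
        "items": [],  # canvases omitted in this sketch
    }


if __name__ == "__main__":
    manifest = build_manifest(
        "https://example.org/iiif/A.1/manifest",  # hypothetical identifier
        "Vase",
        "http://rightsstatements.org/vocab/InC/1.0/",
    )
    print(json.dumps(manifest, indent=2))
```

Generating the manifest only from a record that has already passed redaction keeps the rights block and the visible fields consistent with each other.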

Operational Checklist

Before promoting this data layer to production, confirm every gate below. These are the checks that keep the canonical schema from silently accumulating drift.

The CanonicalObjectRecord model runs with strict=True and extra="forbid" so an added or coerced source column fails at the gate instead of writing a partial record.
Every controlled term persists as a resolvable authority URI plus a human-readable label; no free-text subject, material, or place name reaches the primary repository.
A dedicated vocabulary mapping table stores resolved Getty URIs alongside labels, and a nightly job flags deprecated terms against the published authority files.
The rights_uri field is required and pattern-validated against RightsStatements.org; no record commits without a machine-readable rights assertion.
Dates are coerced to ISO 8601 at normalization; ambiguous or partial dates are quarantined rather than guessed.
The original LIDO payload is retained verbatim in an archival column so flattening into indexed fields never loses provenance.
Failed records route to a quarantine queue with the original payload, a structured error list, and a UTC timestamp — never dropped.
Public reads are served from a read-only replica; administrative mutations require mutual TLS and scoped OAuth2 tokens.
The delivery boundary redacts restricted fields by the intersection of consumer and record rights tier before IIIF serialization.
Every metadata mutation and access event is written to an immutable audit log for provenance and rollback.
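The archival-column check, combined with the idempotent upsert described in the architecture overview, can be illustrated with sqlite3 from the standard library. This is a sketch under assumptions: the table name, the column set, and the choice of accession_number as the conflict key are hypothetical, and a production repository would likely use a different database with the same ON CONFLICT pattern.

```python
from __future__ import annotations

import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS object_record (
    accession_number TEXT PRIMARY KEY,
    title            TEXT NOT NULL,
    rights_uri       TEXT NOT NULL,
    lido_archival    TEXT
)
"""

# Replaying the same record updates the existing row instead of duplicating it.
UPSERT = """
INSERT INTO object_record (accession_number, title, rights_uri, lido_archival)
VALUES (:accession_number, :title, :rights_uri, :lido_archival)
ON CONFLICT(accession_number) DO UPDATE SET
    title = excluded.title,
    rights_uri = excluded.rights_uri,
    lido_archival = excluded.lido_archival
"""


def upsert(conn: sqlite3.Connection, record: dict[str, object]) -> None:
    """Commit one validated record; safe to call repeatedly with the same input."""
    with conn:  # commits on success, rolls back if the statement raises
        conn.execute(UPSERT, record)
```

Because the statement is keyed on the accession number, a retried sync batch leaves the row count unchanged, which is what makes redelivery from the ingestion side harmless.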

Failure Modes

The failures below recur across institutions running this architecture. Each has a deterministic remediation and a workflow that treats it in depth.

Failure pattern	Root cause	Remediation
Free-text terms that never aggregate	Vocabulary resolution skipped; labels stored as raw strings	Resolve to authority URIs at normalization; see implementing Getty AAT and TGN
Silent partial records	Loose model coercion accepted a drifted export	Set `strict=True` / `extra="forbid"`; see designing museum object schemas
Lost provenance events after harvest	Nested LIDO flattened without retaining the source	Keep an archival LIDO column and target LIDO 1.1; see mapping LIDO to internal databases
Crosswalk fields that fail aggregator validation	Dublin Core mapping drifted from the source schema	Validate the crosswalk explicitly; see validating Dublin Core against CollectionBase
Restricted provenance leaked to a public consumer	Delivery boundary serialized before rights-tier redaction	Redact by tier intersection at the gate; see security boundaries for collection APIs
Linked-data consumers cannot parse object records	JSON-LD emitted with an inconsistent or missing context	Structure JSON-LD against a fixed context; see how to structure JSON-LD for museum objects

Related

Designing Museum Object Schemas: executable Pydantic contracts
Implementing Getty AAT & TGN: controlled vocabulary URIs
Mapping LIDO to Internal Databases: interchange flattening layer
Security Boundaries for Collection APIs: tiered delivery enforcement
Comparing Collection Schema Standards: choosing a canonical schema
Automated Record Ingestion & Sync Workflows: upstream validated ingress
