Designing Museum Object Schemas for Automated Ingestion and Rights Routing

Workflow Context

Museum digital asset pipelines fail when object schemas function as static documentation rather than executable contracts. A schema written in a wiki page or a spreadsheet cannot reject a malformed accession number, cannot coerce a legacy date string, and cannot route a restricted object away from a public delivery tier. Production-grade schemas must do all three at the moment of ingestion: enforce structural consistency, normalize controlled vocabularies, and route assets based on intellectual property constraints before a single record reaches the collection management system.

This page specifies that contract for the Core Architecture & Collection Taxonomy pipeline stage. It sits upstream of every other transformation: when curators submit batch records or vendors push CSV exports, the model described here validates, transforms, and routes without manual intervention. The design separates structural requirements — the fields that must be present and correctly typed for a record to exist at all — from semantic enrichment, the vocabulary resolution and identifier lookups that can be deferred to downstream stages. That separation is what makes high-throughput ingestion possible while preserving provenance and rights metadata. The output of this schema is consumed by the LIDO-to-database mapping layer, enriched against Getty authorities in Implementing Getty AAT & TGN, and serialized for the web following How to Structure JSON-LD for Museum Objects. The role that owns this code is the Python automation engineer; the role that consumes its rejection reports is the collections manager reconciling a drifted export.

Prerequisites

Before building the ingestion contract, confirm the following are in place:

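Python 3.9+ for the native generic type hints (list[dict]) and asyncio.to_thread used throughout. The PEP 604 union syntax (str | None) shown in the schema table needs Python 3.10+; the code below uses Optional[...] so it also runs on 3.9.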
Pydantic v2 (pip install "pydantic>=2.6"). The v1 @validator and class Config APIs are not compatible with the field_validator / model_config patterns here; consult the Pydantic v2 documentation for the migration surface.
The LIDO 1.1 element vocabulary you validate against, so field names and cardinality match the interchange envelope. See the LIDO schema for element definitions.
A RightsStatements.org / controlled rights list to constrain the rights_category field to a closed enum rather than free text.
An asyncio event loop and a thread pool if you validate in batch — CPU-bound validation must not stall the ingestion loop.
A quarantine table or dead-letter store with columns for the raw payload, the structured error list, and a UTC timestamp.
A JSON Schema consumer (frontend form validator, CI contract test, or an API gateway) that will read the model’s exported schema.
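
As a rough illustration of the quarantine store listed above, the sketch below creates a dead-letter table in SQLite with the three columns named in the prerequisites. The table name quarantine, the column names, and the helpers open_quarantine and quarantine_record are assumptions made for this example; any store with equivalent columns would serve.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Any

# Hypothetical table layout: one row per rejected record.
DDL = """
CREATE TABLE IF NOT EXISTS quarantine (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    raw_payload TEXT NOT NULL,     -- the record as received, JSON-encoded
    errors TEXT NOT NULL,          -- structured error list, JSON-encoded
    quarantined_at TEXT NOT NULL   -- UTC timestamp, ISO 8601
)
"""


def open_quarantine(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the quarantine database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(DDL)
    return conn


def quarantine_record(
    conn: sqlite3.Connection,
    raw: dict[str, Any],
    errors: list[dict[str, Any]],
) -> int:
    """Store one rejected record and return its row id."""
    cur = conn.execute(
        "INSERT INTO quarantine (raw_payload, errors, quarantined_at) "
        "VALUES (?, ?, ?)",
        (
            # default=str keeps non-JSON values (dates, exceptions) storable.
            json.dumps(raw, default=str),
            json.dumps(errors, default=str),
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()
    return cur.lastrowid
```

Storing the payload and the error list as JSON text keeps the table schema stable even when the object model gains or loses fields.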

Schema Reference

The model mirrors the institutional metadata contract one field at a time. Each field carries an explicit type, a constraint, and — where routing depends on it — a note on how the value steers downstream access tiers. The table below is the specification the code enforces.

Field	Type	Constraint	Source / role
`accession_number`	`str`	regex `^[A-Za-z]{2,4}-\d{4}-\d{1,6}$`, upper-cased	Institutional accession scheme; primary key
`object_name`	`str`	`min_length=2`, `max_length=255`	LIDO `objectWorkType` / title
`creation_date`	`date | None`	optional, ISO 8601 coerced	LIDO `eventDate`; lenient parse
`medium`	`str | None`	optional	LIDO `termMaterialsTech`; later resolved to a Getty AAT URI
`rights_category`	`RightsCategory`	closed enum	Drives access-tier routing
`embargo_date`	`date | None`	optional, future-dated allowed	Holds an object out of public tiers until it passes
`source_system`	`Literal[...]`	one of `tms`, `collectiveaccess`, `manual_entry`	Provenance of the record

Schema Architecture and Type Enforcement

Define the ingestion contract using Pydantic v2 for native JSON Schema generation and runtime validation. The model enforces mandatory routing fields while deferring optional enrichment to downstream normalization. Strict mode prevents silent coercion of malformed data, and extra-field rejection ensures schema drift does not propagate to production databases.

python

import logging
from datetime import date
from typing import Optional, Literal
from enum import Enum
from pydantic import BaseModel, Field, field_validator, ConfigDict

logger = logging.getLogger("museum_pipeline.schema")


class RightsCategory(str, Enum):
    PUBLIC_DOMAIN = "public_domain"
    IN_COPYRIGHT = "in_copyright"
    ORPHAN_WORK = "orphan_work"
    RESTRICTED = "restricted"


class MuseumObject(BaseModel):
    model_config = ConfigDict(
        strict=True,
        extra="forbid",
        json_schema_extra={"lido_mapping": "recordWrap"},
    )

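    # The pattern is checked against the raw input, before normalize_accession runs.
    accession_number: str = Field(pattern=r"^[A-Za-z]{2,4}-\d{4}-\d{1,6}$")
    object_name: str = Field(min_length=2, max_length=255)
    # Dates opt out of strict mode so ISO 8601 strings coerce to date objects.
    creation_date: Optional[date] = Field(default=None, strict=False)
    medium: Optional[str] = None
    # Strict mode accepts only enum instances from Python input, so opt out to
    # let string values from CSV rows and dicts resolve to RightsCategory.
    rights_category: RightsCategory = Field(strict=False)
    embargo_date: Optional[date] = Field(default=None, strict=False)
    source_system: Literal["tms", "collectiveaccess", "manual_entry"]

    @field_validator("accession_number")
    @classmethod
    def normalize_accession(cls, v: str) -> str:
        # Canonicalize to upper case so the primary key has one form.
        return v.strip().upper()

This configuration aligns with the LIDO-to-database mapping layer by embedding explicit mapping metadata via json_schema_extra. Strict mode rejects silent coercion on the string fields, while the date fields and rights_category opt out with strict=False: ISO 8601 strings from CSV and JSON exports parse into date objects, and string values such as "public_domain" resolve to RightsCategory members, which strict mode would otherwise reject when validating Python dicts. Because Pydantic evaluates Field(pattern=...) before field validators in their default "after" mode, the accession pattern is deliberately case-insensitive ([A-Za-z]); the normalize_accession validator then canonicalizes the value to uppercase before persistence, so bos-2019-4 and BOS-2019-4 resolve to the same primary key. The same ordering means the pattern sees the raw value, so input padded with whitespace is rejected before strip() can trim it; trim in the source adapter if padded values should be accepted.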

Step-by-Step Implementation

1. Validate and route a single record

Wrap validation in a function that never raises to the caller. A failed record must be captured, not lost, so the batch can continue and the quarantine store gets a structured error.

python

from typing import Any
from pydantic import ValidationError


def validate_and_route(record: dict[str, Any]) -> dict[str, Any]:
    """Validate one raw record; return a result dict instead of raising."""
    try:
        validated = MuseumObject.model_validate(record)
        # mode="json" serializes dates and enums to JSON-safe values.
        return {"status": "valid", "data": validated.model_dump(mode="json")}
    except ValidationError as e:
        logger.warning(
            "Validation failed for %s: %d error(s)",
            record.get("accession_number"),
            e.error_count(),
        )
        return {"status": "failed", "errors": e.errors(), "input": record}

Catching ValidationError specifically — rather than a bare Exception — means genuine programming errors still surface, while model_dump(mode="json") produces a JSON-safe payload with date objects already serialized to ISO strings, ready for the next stage.

2. Adapt the three input formats to one shape

Validation is only correct if every source presents the same keys. Normalize CSV, LIDO XML, and API JSON into plain dicts before they reach the model. This is the edge-case surface most pipelines get wrong.

python

import csv
from xml.etree import ElementTree as ET

LIDO_NS = {"lido": "http://www.lido-schema.org/"}


def rows_from_csv(path: str) -> list[dict[str, Any]]:
    # CSV: every value is a string. Blank cells become None so optional
    # fields validate; the lenient date fields coerce the ISO strings.
    with open(path, newline="", encoding="utf-8") as fh:
        return [
            {key: (value or None) for key, value in row.items()}
            for row in csv.DictReader(fh)
        ]


def rows_from_lido(path: str) -> list[dict[str, Any]]:
    # XML: pull only the fields the schema owns. Missing text becomes an
    # empty string, which the model then rejects as a failed record.
    tree = ET.parse(path)
    records = []
    for wrap in tree.iterfind(".//lido:lido", LIDO_NS):
        acc = wrap.findtext(".//lido:workID", namespaces=LIDO_NS)
        name = wrap.findtext(".//lido:objectWorkType/lido:term", namespaces=LIDO_NS)
        records.append({
            "accession_number": (acc or "").strip(),
            "object_name": (name or "").strip(),
            # Fixed values for this feed: a conservative rights default and
            # the system the export came from. Adjust both per source.
            "rights_category": "in_copyright",
            "source_system": "collectiveaccess",
        })
    return records

For API payloads, the JSON already arrives as dicts, but nested envelopes must be flattened to the flat field names the model expects, and None must be preserved rather than replaced with empty strings: the date parser rejects "" even on the lenient date fields, whereas None passes.
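
The flattening step can be sketched as a path lookup per field. The envelope layout below (object, rights, and meta blocks) and the names FIELD_PATHS and flatten_envelope are hypothetical, chosen only to illustrate the idea; a real API will nest its fields differently.

```python
from typing import Any

# Hypothetical mapping from a nested API envelope to the model's flat names.
FIELD_PATHS: dict[str, tuple[str, ...]] = {
    "accession_number": ("object", "identifiers", "accession_number"),
    "object_name": ("object", "title"),
    "creation_date": ("object", "dates", "created"),
    "medium": ("object", "medium"),
    "rights_category": ("rights", "category"),
    "embargo_date": ("rights", "embargo_date"),
    "source_system": ("meta", "source_system"),
}


def _dig(payload: dict[str, Any], path: tuple[str, ...]) -> Any:
    """Walk a key path, returning None as soon as a level is missing."""
    node: Any = payload
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def flatten_envelope(payload: dict[str, Any]) -> dict[str, Any]:
    """Produce a flat dict keyed by the model's field names."""
    flat: dict[str, Any] = {}
    for field, path in FIELD_PATHS.items():
        value = _dig(payload, path)
        # Blank strings mean "no value"; normalize them to None.
        if isinstance(value, str) and not value.strip():
            value = None
        flat[field] = value
    return flat
```

A missing required field comes out as None and is then rejected by the model, so the failure is recorded at validation rather than hidden in the adapter.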

3. Process the batch concurrently with failure isolation

High-volume ingestion requires non-blocking execution so external vocabulary lookups or database writes do not serialize the whole run. The processor chunks incoming records, validates each chunk concurrently, and isolates failures without halting the batch.

python

import asyncio


async def process_batch(records: list[dict], batch_size: int = 50) -> dict:
    valid_objects: list[dict] = []
    failed_records: list[dict] = []

    for i in range(0, len(records), batch_size):
        chunk = records[i:i + batch_size]
        tasks = [asyncio.to_thread(validate_and_route, r) for r in chunk]
        results = await asyncio.gather(*tasks, return_exceptions=True)

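        # gather preserves order, so each result pairs with its input record.
        for record, result in zip(chunk, results):
            if isinstance(result, Exception):
                # Unexpected (non-validation) error: keep the input for review.
                failed_records.append(
                    {"status": "error", "errors": [{"msg": str(result)}], "input": record}
                )
            elif result["status"] == "valid":
                valid_objects.append(result["data"])
            else:
                failed_records.append(result)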

    return {
        "valid_count": len(valid_objects),
        "failed_count": len(failed_records),
        "payloads": valid_objects,
        "quarantine": failed_records,
    }

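Pydantic validation is synchronous, CPU-bound work, so calling it inline would block the event loop for every record. asyncio.to_thread hands each call to the default thread pool, which keeps the loop free for I/O. Under the GIL those threads do not validate in parallel, so CPU throughput scales by distributing chunks across worker processes or nodes, which works because each chunk is independent. Failed records route to a quarantine queue for manual review rather than aborting the run.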
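
The quarantine list can be condensed into the kind of rejection report a collections manager would read. The sketch below assumes the result dict returned by process_batch and the "loc" and "type" keys that Pydantic error dicts carry; the function name summarize_quarantine and the report layout are assumptions for illustration.

```python
from collections import Counter
from typing import Any


def summarize_quarantine(batch_result: dict[str, Any]) -> dict[str, Any]:
    """Condense a batch result into a failure ratio and the commonest errors."""
    total = batch_result["valid_count"] + batch_result["failed_count"]
    by_error: Counter = Counter()
    for entry in batch_result["quarantine"]:
        for err in entry.get("errors", []):
            # "loc" is a path of field names; join it into a dotted label.
            field = ".".join(str(part) for part in err.get("loc", ()))
            label = f"{field or '<record>'}:{err.get('type', 'unknown')}"
            by_error[label] += 1
    return {
        "total": total,
        "failure_ratio": batch_result["failed_count"] / total if total else 0.0,
        "top_errors": by_error.most_common(5),
    }
```

A report dominated by one field and one error type usually points to a drifted export column rather than to bad individual records.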

4. Export the JSON Schema for contract testing

The same model is the source of truth for frontend forms and CI contract tests. Export it once and check the artifact into the repository so a breaking field change fails review, not production.

python

import json

if __name__ == "__main__":
    schema = MuseumObject.model_json_schema()
    print(json.dumps(schema, indent=2))
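
A CI contract test could then compare the committed artifact with a freshly exported schema. The following is a minimal sketch, assuming both are plain dicts with the top-level "properties" and "required" keys that JSON Schema uses; the function name breaking_changes and its notion of what counts as breaking are assumptions, not part of Pydantic.

```python
from typing import Any


def breaking_changes(
    committed: dict[str, Any],
    current: dict[str, Any],
) -> list[str]:
    """List differences that would break consumers of the committed schema."""
    problems: list[str] = []
    old_props = committed.get("properties", {})
    new_props = current.get("properties", {})

    for name in sorted(old_props):
        if name not in new_props:
            problems.append(f"field removed: {name}")
        elif old_props[name] != new_props[name]:
            problems.append(f"field changed: {name}")

    # A field that becomes required rejects payloads that used to pass.
    old_required = set(committed.get("required", []))
    new_required = set(current.get("required", []))
    for name in sorted(new_required - old_required):
        problems.append(f"field newly required: {name}")

    return problems
```

In CI, committed would be loaded with json.load from the checked-in file and current would come from MuseumObject.model_json_schema(); a non-empty list fails the build.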

Rights and Access Routing

Intellectual property constraints dictate downstream visibility and IIIF manifest generation. The schema routes each object by rights_category and embargo_date, so an access decision is made at the structural boundary rather than in application code far downstream. Restricted assets never reach a public delivery tier; public-domain records bypass review and proceed directly to delivery.

python

def determine_routing_path(obj: MuseumObject) -> str:
    if obj.rights_category == RightsCategory.RESTRICTED:
        return "secure_storage"
    if obj.embargo_date and obj.embargo_date > date.today():
        return "embargo_queue"
    if obj.rights_category == RightsCategory.PUBLIC_DOMAIN:
        return "public_api"
    return "rights_review"

The order of the checks is the policy: a restricted object routes to secure storage even if its embargo has lapsed, and an embargo holds an otherwise-public object out of the delivery tier until its date passes. Anything that is neither restricted, embargoed, nor public domain falls through to rights_review. This deterministic routing enforces the same access tiers used by Security Boundaries for Collection APIs, and the resolved path becomes an input to the rights block in How to Structure JSON-LD for Museum Objects. The function is synchronous and makes no network or database call, so running it immediately after validation adds no I/O latency to the routing decision.

Vocabulary Normalization and Downstream Mapping

Controlled vocabulary resolution occurs after structural validation. The medium and object-classification fields require mapping to Getty AAT and TGN identifiers, but those lookups hit an external API and must not block ingestion. The pipeline therefore queues the raw string for asynchronous enrichment and retains the original curator input in a separate audit field, so a later re-resolution can be diffed against what the curator actually typed.

Normalized terms integrate directly with Implementing Getty AAT & TGN workflows: the enriched identifier populates an aat_id field on successful resolution, while an unresolved term stays flagged for review without ever blocking the structural write. This two-stage split — validate now, enrich later — is what keeps throughput deterministic under load.
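
The validate-now, enrich-later split might look like the sketch below. The resolve callable stands in for whatever Getty lookup the enrichment stage provides, and the medium_raw and needs_vocab_review keys are hypothetical names for the audit field and the review flag; only aat_id comes from the text above.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional

# Hypothetical resolver signature: term in, AAT identifier (or None) out.
Resolver = Callable[[str], Awaitable[Optional[str]]]


async def enrich_medium(
    records: list[dict[str, Any]],
    resolve: Resolver,
    workers: int = 4,
) -> None:
    """Resolve each record's medium in the background, updating it in place."""
    queue: asyncio.Queue = asyncio.Queue()
    for record in records:
        if record.get("medium"):
            queue.put_nowait(record)

    async def worker() -> None:
        while True:
            try:
                record = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            # Keep the curator's original wording for later diffing.
            record["medium_raw"] = record["medium"]
            try:
                record["aat_id"] = await resolve(record["medium"])
            except Exception:
                # A failed lookup must not block the structural write.
                record["aat_id"] = None
            record["needs_vocab_review"] = record["aat_id"] is None

    await asyncio.gather(*(worker() for _ in range(workers)))
```

Records without a medium are skipped entirely, and a lookup error is treated the same as an unresolved term: the record is flagged for review instead of failing.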

Verification and Testing

Confirm the contract behaves before wiring it into a pipeline. The following assert-based test exercises normalization, lenient date coercion, extra-field rejection, and the routing policy in one pass.

python

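from datetime import date, timedelta

from pydantic import ValidationError

# MuseumObject and determine_routing_path are the definitions from the
# sections above; import them from wherever that module lives in your project.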


def test_schema_contract() -> None:
    obj = MuseumObject.model_validate({
        "accession_number": "bos-2019-4",       # lower-case input
        "object_name": "Study for a Portrait",
        "creation_date": "1889-06-01",           # ISO string from CSV
        "rights_category": "public_domain",
        "source_system": "tms",
    })
    assert obj.accession_number == "BOS-2019-4"   # normalized to upper
    assert obj.creation_date == date(1889, 6, 1)  # coerced to date
    assert determine_routing_path(obj) == "public_api"

    # Extra fields are rejected under extra="forbid".
    try:
        MuseumObject.model_validate({
            "accession_number": "BOS-2019-5",
            "object_name": "Untitled",
            "rights_category": "restricted",
            "source_system": "tms",
            "unexpected_column": "drift",
        })
        raise AssertionError("schema drift was not rejected")
    except ValidationError:
        pass

    # A future embargo holds a public object out of the delivery tier.
    embargoed = MuseumObject.model_validate({
        "accession_number": "BOS-2020-1",
        "object_name": "Loaned Manuscript",
        "rights_category": "public_domain",
        "embargo_date": (date.today() + timedelta(days=30)).isoformat(),
        "source_system": "manual_entry",
    })
    assert determine_routing_path(embargoed) == "embargo_queue"

    print("schema contract OK")


if __name__ == "__main__":
    test_schema_contract()

Run it directly with python -m pytest schema_test.py -q or as a plain script; a green run means the field constraints, the normalization validator, and the routing policy all hold together.

Production Deployment Checklist

Export MuseumObject.model_json_schema() and commit the artifact so contract changes fail code review, not production.
Enforce strict mode and extra="forbid" in CI so a new source column raises a ValidationError instead of silently dropping data.
Monitor validation failure rates through structured logging and alert on a rising quarantine ratio.
Declare per-field aliases with Field(alias=...) (and populate_by_name=True) to bridge deprecated legacy column names during a migration.
Re-run the embargo routing rules on a schedule so lapsed embargoes release into their correct tier.
Verify final serialized output conforms to the JSON-LD structure spec before publishing.
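
For the scheduled embargo re-check in the list above, a sweep over the embargo queue can be sketched as follows. It assumes held records are the JSON-mode payloads produced earlier, so embargo_date is an ISO 8601 string; the function name release_lapsed_embargoes and the injectable today parameter are assumptions for illustration.

```python
from datetime import date
from typing import Any, Optional


def release_lapsed_embargoes(
    held: list[dict[str, Any]],
    today: Optional[date] = None,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split embargo-queue records into (released, still_held)."""
    today = today or date.today()
    released: list[dict[str, Any]] = []
    still_held: list[dict[str, Any]] = []
    for record in held:
        embargo = date.fromisoformat(record["embargo_date"])
        # Same comparison as the routing policy: held only while in the future.
        if embargo > today:
            still_held.append(record)
        else:
            released.append(record)
    return released, still_held
```

Released records should be passed back through the routing function rather than sent straight to the public tier, since their rights category still decides where they land.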

Explore This Topic Further

How to Structure JSON-LD for Museum Objects — serialize the validated payload into W3C-compliant JSON-LD with a canonical @context, resolved namespace collisions, and memory-safe graph expansion for triplestore aggregation.

FAQ

Why does strict mode reject my ISO date strings from a CSV?

Strict mode blocks type coercion globally, but each date field is declared with Field(default=None, strict=False), which re-enables lenient parsing for that field only. If you copied the model without the per-field strict=False, an ISO string like "1889-06-01" is rejected because strict mode will not coerce a str into a date. Keep the mandatory fields strict and opt individual date fields out.

My accession-number regex passes lowercase input — is that a bug?

No, it is deliberate. The pattern uses [A-Za-z] so both bos-2019-4 and BOS-2019-4 validate, then the normalize_accession field validator upper-cases the value. Pydantic runs Field(pattern=...) before field validators, so the regex must accept both cases; canonicalization to a single primary-key form happens in the validator immediately after.

Extra columns from a legacy export are being rejected — how do I bridge them?

extra="forbid" rejects any key the model does not declare, which is what catches schema drift. For a known renamed column, do not loosen the config — declare an alias with Field(alias="OLD_COLUMN_NAME") and set populate_by_name=True in ConfigDict so both the legacy and canonical names resolve to one field. Truly unexpected columns should still be quarantined for review.

Why validate with `asyncio.to_thread` instead of calling the model directly in the coroutine?

Pydantic validation is synchronous and CPU-bound. Calling it inline inside a coroutine blocks the event loop for the duration of every record, starving concurrent I/O such as database writes or authority-file lookups. asyncio.to_thread offloads each validation to the thread pool, keeping the loop responsive. Under the GIL the threads do not validate in parallel, so for more CPU throughput split the batch across worker processes.

How do I stop a restricted object from ever reaching the public API?

The determine_routing_path function checks RESTRICTED first, before the embargo and public-domain branches, so a restricted object routes to secure_storage regardless of any other field. As long as every validated record passes through this function before it is written anywhere, the decision is made at the ingestion boundary rather than in downstream application code, and a restricted record is never handed to a public delivery tier.

External Standards Reference

Core Architecture & Collection Taxonomy — parent pipeline stage
Mapping LIDO to Internal Databases — consumes validated records
Implementing Getty AAT & TGN — vocabulary enrichment stage
Security Boundaries for Collection APIs — access-tier enforcement
How to Structure JSON-LD for Museum Objects — serialize the output
