Mapping LIDO to Internal Databases

Workflow Context

LIDO (Lightweight Information Describing Objects) is a harvest format, not a storage schema. A single OAI-PMH harvest can carry tens of thousands of <lido:lido> records, each a deeply nested event-centric tree, and none of that hierarchy maps cleanly onto the flat, indexed rows an internal collection management system expects. This page defines the sub-problem inside the Core Architecture & Collection Taxonomy data layer that turns a raw LIDO stream into committed database records: a deterministic, state-aware pipeline that decouples XML parsing from persistence so a malformed record can never trigger a partial write.

The pipeline enforces three isolated stages — extraction, transformation, and routing — and treats the LIDO payload as an immutable input stream. Extraction performs memory-efficient XML traversal so a multi-gigabyte harvest never inflates the heap. Transformation flattens hierarchical paths into columns and resolves controlled vocabularies against authority files. Routing evaluates intellectual-property state and queues validated records for batch execution. Keeping these stages separate is what guarantees auditability across large digitization backlogs: every committed row traces back to exactly one source record, and every rejection is logged with its original XML context rather than silently dropped.

Prerequisites

Python 3.10+: the code below uses PEP 604 union types (str | None) in dataclass fields, which are evaluated when the class is defined and raise a TypeError on 3.9.
lxml 4.9+: lxml.etree.iterparse is the only supported parser here; the standard library's xml.etree.ElementTree has no tag filter on iterparse and no getparent()/getprevious(), which the cleanup pattern below depends on.
An async database driver: asyncpg (PostgreSQL) or equivalent, with a connection pool and ON CONFLICT upsert support.
LIDO schema v1.0: the code below hard-codes http://www.lido-schema.org/ as the namespace. Namespace matching is exact, and the published LIDO 1.0 schema declares http://www.lido-schema.org without the trailing slash, so confirm the URI and version your harvest source actually emits before running.
Getty AAT/TGN resolution: a reachable authority endpoint or local cache for Getty vocabulary URIs, used during transformation. See Implementing Getty AAT & TGN for the resolver.
A canonical target schema: the collection_objects table must exist with a unique constraint on object_id; its shape follows Designing Museum Object Schemas.
RightsStatements.org / access-tier policy: the routing stage reads <lido:rightsType> and needs a mapping from those terms to your internal access tiers.

Spec Reference: LIDO Paths to Internal Columns

The transformation contract is a fixed map from LIDO XPath expressions to internal column definitions. Every path below is namespace-qualified against the lido: prefix, and every column carries an explicit coercion rule so the parser never relies on implicit string handling.

| Internal column | LIDO source path | Type | Coercion / fallback |
| --- | --- | --- | --- |
| `object_id` | `.//lido:recordID[@type='local']` | text (PK) | required; `UNKNOWN` sentinel routes to quarantine |
| `title` | `.//lido:titleSet/lido:appellationValue` | text | first non-empty; `Untitled` fallback |
| `creation_date` | `.//lido:eventDate/lido:date/lido:earliestDate` | date | ISO 8601 coercion; `NULL` if absent |
| `rights_status` | `.//lido:rightsType/lido:term` | enum | `unknown` fallback; drives routing |
| `media_count` | `count(.//lido:resourceWrap/lido:resourceSet)` | integer | derived from digital surrogates only |
| `subject_uris` | `.//lido:subjectConcept/lido:conceptID` | text[] | resolved against Getty AAT/TGN |

Two constraints matter most. First, object_id is the idempotency key for the entire pipeline — it must be extracted with the @type='local' predicate so a record’s institutional identifier is never confused with an aggregator-assigned ID. Second, media_count counts <lido:resourceSet> nodes under <lido:resourceWrap> — the digital surrogates — and deliberately excludes related-work references, because it later drives IIIF presentation routing.
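To make the two constraints concrete, here is a minimal sketch that applies the object_id and media_count paths from the table to a hand-written fixture. It uses the standard library's xml.etree.ElementTree only so the snippet runs without dependencies (the pipeline itself uses lxml). The fixture, the "aggregator" type value, and the identifiers in it are illustrative assumptions, not taken from a real harvest.

```python
import xml.etree.ElementTree as ET

# Same namespace constant the extraction code on this page uses.
LIDO_NS = "http://www.lido-schema.org/"
NS = {"lido": LIDO_NS}

# Hypothetical fixture: two recordIDs, one related work, one digital surrogate.
FIXTURE = f"""<lido:lido xmlns:lido="{LIDO_NS}">
  <lido:descriptiveMetadata>
    <lido:objectRelationWrap><lido:relatedWorksWrap>
      <lido:relatedWorkSet/>
    </lido:relatedWorksWrap></lido:objectRelationWrap>
  </lido:descriptiveMetadata>
  <lido:administrativeMetadata>
    <lido:recordWrap>
      <lido:recordID type="aggregator">AGG-9001</lido:recordID>
      <lido:recordID type="local">OBJ-42</lido:recordID>
    </lido:recordWrap>
    <lido:resourceWrap><lido:resourceSet/></lido:resourceWrap>
  </lido:administrativeMetadata>
</lido:lido>"""

record = ET.fromstring(FIXTURE)

# Without the predicate, document order decides and the aggregator ID wins.
first_id = record.findtext(".//lido:recordID", namespaces=NS)
# With the predicate, only the institutional identifier matches.
local_id = record.findtext(".//lido:recordID[@type='local']", namespaces=NS)
# Only resourceSet nodes under resourceWrap count; the related work does not.
media_count = len(record.findall(".//lido:resourceWrap/lido:resourceSet", namespaces=NS))

print(first_id, local_id, media_count)
```

Dropping the predicate is what lets the aggregator-assigned ID slip in as the idempotency key, which is why the table keeps it on the path itself.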

Step-by-Step Implementation

1. Stream records with `iterparse`

Full DOM loading is prohibited for production LIDO ingestion: memory consumption scales linearly with file size and destabilizes bulk harvests. The extraction engine uses lxml.etree.iterparse to process each <lido:lido> element as its closing tag is seen, then frees it immediately. This holds the memory footprint constant regardless of payload volume.

python

import io
import logging
from dataclasses import dataclass, field
from typing import Any, BinaryIO, Iterator

from lxml import etree

logger = logging.getLogger("lido_pipeline")

@dataclass
class LidoRecord:
    object_id: str
    title: str
    creation_date: str | None = None
    rights_status: str = "unknown"
    media_count: int = 0
    raw_xml: str = field(default="", repr=False)

LIDO_NS = "http://www.lido-schema.org/"

def extract_lido_records(source: bytes | BinaryIO) -> Iterator[dict[str, Any]]:
    ns = {"lido": LIDO_NS}
    # Accept raw bytes or an open binary file; passing a file lets a large
    # harvest stream from disk instead of being held in memory first.
    stream = io.BytesIO(source) if isinstance(source, bytes) else source
    context = etree.iterparse(
        stream,
        events=("end",),
        tag=f"{{{LIDO_NS}}}lido"
    )
    for _, elem in context:
        obj_id = elem.findtext(".//lido:recordID[@type='local']", namespaces=ns)
        title = elem.findtext(".//lido:titleSet/lido:appellationValue", namespaces=ns)
        rights = elem.findtext(".//lido:rightsType/lido:term", namespaces=ns) or "unknown"
        creation_date = elem.findtext(
            ".//lido:eventDate/lido:date/lido:earliestDate", namespaces=ns
        )
        media = len(elem.findall(".//lido:resourceWrap/lido:resourceSet", namespaces=ns))
        yield {
            "object_id": obj_id or "UNKNOWN",
            "title": title or "Untitled",
            "creation_date": creation_date,
            "rights_status": rights,
            "media_count": media,
            "raw_xml": etree.tostring(elem, encoding="unicode")
        }
        # Free the processed element, then drop earlier siblings that have
        # already been handled, so memory stays flat across large harvests.
        # The parent check covers a <lido:lido> root element that is preceded
        # by a comment or processing instruction (it has no parent to trim).
        elem.clear()
        parent = elem.getparent()
        while parent is not None and elem.getprevious() is not None:
            del parent[0]

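The generator yields one flat dictionary per record, then clears each element, along with its already-processed siblings, from the parser context. That elem.clear() plus sibling-deletion pattern is the specific technique that keeps memory flat; omit it and lxml’s internal tree accumulates every processed element until the file is exhausted. Input variants: for OAI-PMH responses, the <lido:lido> records sit inside <record><metadata> wrappers, but tag-filtered iterparse skips straight to them, so the same code path handles both bare LIDO files and harvest envelopes. One caveat for envelopes: each record is typically the only child of its <metadata> wrapper, so sibling deletion has nothing to remove there and the emptied <record> shells stay in the tree until parsing ends. They are small, but memory then grows slowly with record count instead of staying strictly flat.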

2. Transform and normalize vocabularies

LIDO nodes rarely align with a relational schema, so the transformation phase maps hierarchical paths to flat columns and coerces types deterministically. Missing dates become SQL NULL, not placeholder strings; unresolved vocabulary terms route to a quarantine table for curator review rather than corrupting the record silently.

Controlled-vocabulary resolution also happens here. <lido:subjectConcept> and <lido:place> elements carry raw strings or external URIs that must be resolved against institutional authority files. Cross-reference them with Implementing Getty AAT & TGN to map legacy free-text terms onto stable AAT/TGN identifiers before persistence.
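The resolution step might look roughly like the sketch below. It is not the resolver described in Implementing Getty AAT & TGN: the `resolve_subjects` function, the `VocabularyResult` container, and the `authority_cache` lookup are hypothetical names introduced for illustration, the URIs are placeholders rather than real AAT identifiers, and passing existing URIs through unverified is an assumption a real deployment may not want.

```python
from dataclasses import dataclass, field

@dataclass
class VocabularyResult:
    resolved: list[str] = field(default_factory=list)     # authority URIs
    quarantined: list[str] = field(default_factory=list)  # terms needing curator review

def resolve_subjects(terms: list[str], authority_cache: dict[str, str]) -> VocabularyResult:
    """Map raw subject strings or URIs onto authority URIs.

    `authority_cache` is a hypothetical local lookup (case-folded label -> URI).
    """
    result = VocabularyResult()
    for term in terms:
        cleaned = term.strip()
        if not cleaned:
            continue  # ignore empty elements
        if cleaned.startswith(("http://", "https://")):
            result.resolved.append(cleaned)  # assumption: existing URIs pass through
            continue
        uri = authority_cache.get(cleaned.casefold())
        if uri is None:
            result.quarantined.append(cleaned)  # never guess a match
        else:
            result.resolved.append(uri)
    return result

# Placeholder URIs, not real AAT identifiers.
cache = {"ewer": "https://example.org/authority/ewer"}
outcome = resolve_subjects(
    ["Ewer", " bronze vessel ", "https://example.org/authority/bronze"],
    cache,
)
print(outcome.resolved)
print(outcome.quarantined)
```

Keeping the unresolved terms in a separate list, rather than raising, is what lets the rest of the record proceed while the term itself waits for review.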

python

def transform_record(raw: dict[str, Any]) -> LidoRecord:
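    # Keep the harvested date only if it starts with a four-digit year;
    # anything else ("circa 1650", "n.d.", an empty element) becomes NULL,
    # never a placeholder string. This is a prefix check, not full ISO 8601
    # validation.
    date = raw.get("creation_date")
    if not date or not date[:4].isdigit():
        date = None  # absent or non-conforming date -> NULL, curator reviews source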
    return LidoRecord(
        object_id=raw["object_id"],
        title=raw["title"].strip(),
        creation_date=date,
        rights_status=raw["rights_status"].strip().lower(),
        media_count=raw["media_count"],
        raw_xml=raw["raw_xml"],
    )

Edge cases: multilingual <lido:appellationValue> sets need a language fallback (prefer xml:lang="en", then the first value); CSV or JSON API variants of the same source skip iterparse entirely and feed transform_record directly, which is why the transform stage takes a plain dict rather than an lxml element.
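The language fallback described above could be sketched as follows. `pick_title` is a hypothetical helper, not part of the pipeline code on this page, and the snippet uses the standard library's ElementTree so it runs without dependencies; the attribute key for xml:lang is the same under lxml.

```python
import xml.etree.ElementTree as ET

LIDO_NS = "http://www.lido-schema.org/"  # same constant as the extraction code
NS = {"lido": LIDO_NS}
# Parsers expose xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def pick_title(record: ET.Element, preferred: str = "en") -> str:
    """Prefer the title in `preferred`, then the first non-empty one."""
    values = record.findall(".//lido:titleSet/lido:appellationValue", NS)
    non_empty = [v for v in values if v.text and v.text.strip()]
    for value in non_empty:
        if value.get(XML_LANG) == preferred:
            return value.text.strip()
    return non_empty[0].text.strip() if non_empty else "Untitled"

# Hypothetical fixture with a German and an English title.
record = ET.fromstring(f"""<lido:lido xmlns:lido="{LIDO_NS}">
  <lido:titleSet>
    <lido:appellationValue xml:lang="de">Bronzekanne</lido:appellationValue>
    <lido:appellationValue xml:lang="en">Bronze Ewer</lido:appellationValue>
  </lido:titleSet>
</lido:lido>""")
title = pick_title(record)
print(title)
```

The sketch matches the language tag exactly; a source that emits regional tags such as en-GB would need a prefix comparison instead.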

3. Route on rights, then upsert idempotently

Validated records enter the routing layer, which evaluates <lido:rightsType> against institutional access policy: public-domain records proceed to immediate indexing, restricted materials trigger embargo workflows and audit logging. Persistence runs inside an explicit transaction, and the ON CONFLICT clause makes re-ingestion idempotent — a re-harvest updates the existing row instead of duplicating it.

python

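import datetime

async def process_batch(records: list[LidoRecord], db_pool: Any) -> None:
    # Refresh every harvested column on conflict so a changed date or rights
    # term propagates on re-harvest.
    query = """
        INSERT INTO collection_objects
        (object_id, title, creation_date, rights_status, media_count)
        VALUES ($1, $2, $3, $4, $5)
        ON CONFLICT (object_id) DO UPDATE SET
            title = EXCLUDED.title,
            creation_date = EXCLUDED.creation_date,
            rights_status = EXCLUDED.rights_status,
            media_count = EXCLUDED.media_count,
            updated_at = NOW()
    """
    # asyncpg binds DATE parameters from datetime.date, not str. A value that
    # is not a full ISO date raises ValueError here, before any SQL runs.
    params = [
        (
            r.object_id,
            r.title,
            datetime.date.fromisoformat(r.creation_date) if r.creation_date else None,
            r.rights_status,
            r.media_count,
        )
        for r in records
    ]
    async with db_pool.acquire() as conn:
        # One transaction per batch: every row commits or none does.
        async with conn.transaction():
            await conn.executemany(query, params)
    # Log only after the transaction block has exited, i.e. after the commit.
    logger.info("Batch committed: %d records", len(records))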

Database operations execute within explicit transaction boundaries, connection pooling prevents resource exhaustion under concurrent load, and the upsert key (object_id) is the same institutional identifier extracted in step 1. This aligns with Designing Museum Object Schemas for consistent primary-key management. Post-insert, normalized records are compared against baseline Dublin Core profiles and discrepancies generate reconciliation tickets — the full validation workflow is detailed in Validating Dublin Core Against CollectionBase.

Pipeline Flow

The three stages compose into a single stream: raw XML enters extraction, flows through transformation, and diverges at the rights-routing gate before every path converges on the same idempotent upsert.
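One way to keep that composition streaming end to end is to chunk the transformed records lazily before handing them to the batch writer. The `batched` helper below is a generic sketch under that assumption, not code from the pipeline itself.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of at most `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls the next `size` items; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch

batches = list(batched(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

In the pipeline the input would be the generator of transformed records, and each yielded list would go to process_batch; the batch size is a tuning choice this page does not specify.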

Rights and Access Routing

The routing gate is where intellectual-property metadata determines an object’s downstream fate. The rights_status column, normalized from <lido:rightsType>/<lido:term>, is not decorative — it selects the access tier at which the record is committed. Public-domain records are eligible for immediate indexing and public IIIF delivery; anything carrying a restricted or in-copyright term is committed at a non-public tier and diverted into an embargo workflow that logs the decision for audit.

Because LIDO rights terms are free text at the source, they must be mapped onto a controlled set before routing — the same discipline the rest of this data layer applies to subjects and places. Unrecognized terms default to unknown, which is treated as restricted (fail-closed), never as public. IIIF manifest generation reads directly from media_count: each <lido:resourceSet> must resolve to a persistent identifier, and the IIIF Presentation API 3.0 specification governs how those references become viewer payloads with their rights block attached.
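A fail-closed gate of this kind can be sketched in a few lines. The tier names and the contents of `RIGHTS_TO_TIER` are hypothetical; the real mapping comes from institutional policy and the RightsStatements.org terms the source actually emits.

```python
from enum import Enum

class AccessTier(Enum):
    PUBLIC = "public"
    RESTRICTED = "restricted"

# Hypothetical mapping from normalized rights terms to access tiers.
RIGHTS_TO_TIER = {
    "public domain": AccessTier.PUBLIC,
    "in copyright": AccessTier.RESTRICTED,
    "unknown": AccessTier.RESTRICTED,
}

def route(rights_status: str) -> AccessTier:
    """Return the access tier; anything unrecognized fails closed."""
    key = rights_status.strip().lower()
    # The default is RESTRICTED, so a typo or new term can never grant access.
    return RIGHTS_TO_TIER.get(key, AccessTier.RESTRICTED)

print(route("Public Domain"))      # AccessTier.PUBLIC
print(route("no known copyrite"))  # AccessTier.RESTRICTED
```

Putting the restriction in the lookup default, rather than in a separate branch, means there is no code path on which an unmapped term reaches the public tier.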

Verification and Testing

Confirm the stages with one assert-based test: it runs extraction and transformation on a minimal fixture, then asserts the rights fallback that routing depends on. It runs as a plain script or under pytest.

python

SAMPLE = b"""<lido:lido xmlns:lido="http://www.lido-schema.org/">
  <lido:descriptiveMetadata>
    <lido:objectIdentificationWrap><lido:titleWrap><lido:titleSet>
      <lido:appellationValue>Bronze Ewer</lido:appellationValue>
    </lido:titleSet></lido:titleWrap></lido:objectIdentificationWrap>
  </lido:descriptiveMetadata>
  <lido:administrativeMetadata>
    <lido:recordWrap><lido:recordID type="local">OBJ-42</lido:recordID></lido:recordWrap>
  </lido:administrativeMetadata>
</lido:lido>"""

def test_extract_and_route():
    raw = next(extract_lido_records(SAMPLE))
    rec = transform_record(raw)
    assert rec.object_id == "OBJ-42"          # local recordID, not aggregator ID
    assert rec.title == "Bronze Ewer"
    assert rec.creation_date is None           # absent date -> NULL, not ""
    assert rec.rights_status == "unknown"      # missing rights fail closed
    assert rec.media_count == 0                # no resourceSet nodes

if __name__ == "__main__":
    test_extract_and_route()
    print("ok")

A green run proves the idempotency key resolves, the date coercion produces None rather than an empty string, and the rights fallback is unknown (restricted) rather than an accidental public grant. For a memory check, run harvests of increasing size through extract_lido_records under /usr/bin/time -v (GNU time) and confirm that maximum resident set size stays roughly flat as file size grows; that is evidence the clear() pattern is doing its job.
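Where wrapping the process in an external timer is inconvenient, the same figure can be read from inside Python. The `peak_rss_mib` helper below is a hypothetical convenience built on the standard library's resource module, which is available on Unix-like systems only.

```python
import resource
import sys

def peak_rss_mib() -> float:
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor

print(f"peak RSS: {peak_rss_mib():.1f} MiB")
```

Because the value is a high-water mark, compare separate runs on files of different sizes rather than repeated readings within one process.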

Deeper Reference

Validating Dublin Core Against CollectionBase — the pre-commit validation gate that reconciles flattened LIDO output against the target relational schema and eliminates downstream 422 rejections.

FAQ

Why does memory keep climbing during a large LIDO harvest even with iterparse?

iterparse alone does not free memory: it builds the tree incrementally but keeps every element it has parsed. Without the elem.clear() call plus the loop that deletes already-processed siblings after each yield, lxml retains every processed element and its siblings in the parser’s internal tree until the file ends. Add both and resident memory stays flat regardless of harvest size.

My records import but `object_id` is `UNKNOWN` for everything. What went wrong?

The recordID lookup uses the [@type='local'] predicate. If your source tags its institutional identifier differently (for example type='museum', a namespace-qualified lido:type attribute, or no type at all), findtext returns nothing and the sentinel fires. Inspect one raw record, confirm the actual attribute name and value, and adjust the predicate. Do not drop it, or you risk keying on an aggregator-assigned ID.

Why are my dates being stored as NULL instead of the harvest value?

The transform coerces any value whose first four characters are not digits to NULL rather than persisting a non-conforming string. LIDO earliestDate fields frequently contain values like circa 1650 or n.d.; those legitimately become NULL and route to curator review. If a genuine ISO date is being nulled, check for leading whitespace or a BOM in the source element.
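If the prefix check turns out to be too loose or too sensitive to stray characters, a stricter coercion could look like the sketch below. `coerce_iso_date` is a hypothetical alternative, not the transform used on this page: it tolerates surrounding whitespace and a byte-order mark and returns None for anything that is not a complete calendar date, including a bare year.

```python
from datetime import date
from typing import Optional

def coerce_iso_date(value: Optional[str]) -> Optional[date]:
    """Return a date for a full ISO 8601 calendar date, else None."""
    if value is None:
        return None
    # str.strip() alone leaves a BOM in place, so name it explicitly.
    cleaned = value.strip("\ufeff \t\r\n")
    try:
        return date.fromisoformat(cleaned)
    except ValueError:
        return None  # "circa 1650", "n.d.", "1650" all land here

print(coerce_iso_date(" 1650-03-01\n"))
print(coerce_iso_date("circa 1650"))
```

Whether a bare year should become NULL or be widened to the first of January is a cataloguing decision; the sketch takes the conservative option and leaves it for review.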

A single malformed record aborts the whole batch. How do I isolate it?

Validate and transform records before they reach process_batch, and divert failures to a quarantine table instead of letting them enter the batch list. The transaction in process_batch is all-or-nothing by design — that protects against partial writes — so failure isolation must happen upstream, at the transformation gate, not inside the database call.
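An upstream gate of that kind might be sketched like this. `partition_records` and the shape of the quarantine entries are assumptions made for illustration; the transform is passed in as a callable so the snippet stays self-contained, and the toy transform in the usage lines stands in for transform_record.

```python
from typing import Any, Callable

def partition_records(
    raws: list[dict[str, Any]],
    transform: Callable[[dict[str, Any]], Any],
) -> tuple[list[Any], list[dict[str, Any]]]:
    """Split raw records into transformed rows and quarantine entries."""
    accepted: list[Any] = []
    quarantined: list[dict[str, Any]] = []
    for raw in raws:
        # Records without a usable idempotency key never reach the batch.
        if raw.get("object_id") in (None, "", "UNKNOWN"):
            quarantined.append({"raw": raw, "reason": "missing object_id"})
            continue
        try:
            accepted.append(transform(raw))
        except Exception as exc:  # any transform failure is diverted, not raised
            quarantined.append({"raw": raw, "reason": repr(exc)})
    return accepted, quarantined

raws = [
    {"object_id": "OBJ-42", "title": " Bronze Ewer "},
    {"object_id": "UNKNOWN", "title": "Orphan"},
    {"object_id": "OBJ-43", "title": None},  # .strip() on None raises
]
accepted, quarantined = partition_records(raws, lambda raw: raw["title"].strip())
print(len(accepted), len(quarantined))
```

Only the accepted list is handed to the batch writer, so the all-or-nothing transaction sees records that have already passed the gate.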

Re-running the harvest duplicates rows. Isn’t the upsert supposed to prevent that?

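With this query, a missing constraint cannot produce duplicates: ON CONFLICT (object_id) requires a unique or primary-key index on object_id, and without one PostgreSQL rejects the statement with an error instead of inserting. Duplicate rows therefore mean the object_id values themselves differ between runs (stray whitespace, changed case, or a different identifier being extracted), or the rows were written by a path that bypasses the upsert. Confirm the constraint exists on the collection_objects table, then compare the keys of a duplicated pair.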
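The idempotent behaviour is easy to see in a throwaway database. The sketch below uses the standard library's sqlite3 as a stand-in for PostgreSQL, purely so it runs without a server; SQLite accepts the same ON CONFLICT ... DO UPDATE form but uses ? placeholders, and the table here is a cut-down assumption rather than the real collection_objects schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collection_objects ("
    " object_id TEXT PRIMARY KEY,"  # the unique key that ON CONFLICT targets
    " title TEXT NOT NULL,"
    " media_count INTEGER NOT NULL)"
)

UPSERT = """
    INSERT INTO collection_objects (object_id, title, media_count)
    VALUES (?, ?, ?)
    ON CONFLICT (object_id) DO UPDATE SET
        title = excluded.title,
        media_count = excluded.media_count
"""

def ingest(rows: list[tuple[str, str, int]]) -> None:
    with conn:  # commits on success, rolls back if any row fails
        conn.executemany(UPSERT, rows)

ingest([("OBJ-42", "Bronze Ewer", 0)])
ingest([("OBJ-42", "Bronze Ewer (recatalogued)", 2)])  # simulated re-harvest

rows = conn.execute(
    "SELECT object_id, title, media_count FROM collection_objects"
).fetchall()
print(rows)
```

The second ingest updates the existing row instead of adding one, so the table still holds a single OBJ-42 record after the simulated re-harvest.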

Related

Core Architecture & Collection Taxonomy: parent data-layer overview
Validating Dublin Core Against CollectionBase: pre-commit validation gate
Implementing Getty AAT & TGN: vocabulary URI resolution
Designing Museum Object Schemas: canonical target schema
Security Boundaries for Collection APIs: access-tier enforcement
