Workflow Context

Museum DAMS pipelines ingest heterogeneous rights statements from legacy CMS exports, donor contracts, and digitization logs. Routing Creative Commons licenses requires deterministic normalization of free-text fields into machine-actionable URIs. The process must keep CC 3.0 and 4.0 licenses distinct, because each version is a separate legal text with its own URI. Non-compliant records must be isolated before publication. This subsystem operates within broader Rights Metadata Mapping & Licensing Automation architectures. High-throughput queues demand idempotent processing and explicit fallback routing.

Architecture and Data Flow

The routing engine decouples ingestion, normalization, validation, and distribution into discrete async stages. A semaphore-controlled worker pool prevents upstream CMS overload during batch operations. Metadata flows through a strict validation gate before reaching IIIF manifest generators. LIDO export routines consume the same normalized payload. Concurrent processing requires a dead-letter queue for malformed inputs that is safe for concurrent writers: a plain list is sufficient while every worker runs on a single asyncio event loop, but not once threads are involved. Automating Copyright Status Checks provides upstream signals that feed this routing stage.

flowchart LR
    St["Raw rights statement"] --> Nm["Normalize<br/>lowercase · 'cc'"]
    Nm --> Rs{"Resolve license<br/>longest match first"}
    Rs -->|matched| Mp["Map to canonical CC URI"]
    Rs -->|none| DLQ["Dead-letter queue"]
    Mp --> Rt{"Allows reuse<br/>(contains 'by')?"}
    Rt -->|yes| IIIF["IIIF manifest"]
    Rt -->|no| Arch["Internal archive"]
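As a rough illustration of the semaphore-bounded ingestion stage and dead-letter queue described above, the sketch below caps concurrent upstream calls and collects failures in an asyncio.Queue. The function fetch_rights_statement is a hypothetical stand-in for a CMS call, and the asset IDs are invented. Note that asyncio.Queue is not thread-safe; it is meant for coroutines on one event loop, so a design that crosses threads would need queue.Queue or similar.

```python
import asyncio


async def fetch_rights_statement(asset_id: str) -> str:
    """Hypothetical stand-in for a CMS lookup; real code would do network I/O."""
    await asyncio.sleep(0)  # yield to the event loop like a real request would
    if asset_id.endswith("-BAD"):
        raise ValueError("malformed export row")
    return "cc by 4.0"


async def ingest(asset_ids: list[str], max_concurrency: int = 2):
    semaphore = asyncio.Semaphore(max_concurrency)
    dead_letters: asyncio.Queue = asyncio.Queue()
    in_flight = 0
    peak = 0  # highest number of simultaneous CMS calls observed

    async def worker(asset_id: str):
        nonlocal in_flight, peak
        async with semaphore:  # at most max_concurrency workers pass this point
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return asset_id, await fetch_rights_statement(asset_id)
            except ValueError as exc:
                await dead_letters.put({"asset_id": asset_id, "error": str(exc)})
                return None
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(worker(a) for a in asset_ids))
    fetched = [r for r in results if r is not None]
    failed = []
    while not dead_letters.empty():
        failed.append(dead_letters.get_nowait())
    return fetched, failed, peak


fetched, failed, peak = asyncio.run(
    ingest(["OBJ-1", "OBJ-2", "OBJ-3-BAD", "OBJ-4"])
)
```

The peak counter exists only to show that the semaphore holds; a production stage would drop it or report it as a metric.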

Core Routing Implementation

Production implementations rely on asyncio for concurrency and pydantic for schema enforcement. The following engine normalizes raw statements, maps them to canonical URIs, and routes assets to designated endpoints. The code targets Python 3.10+ (for the X | None union syntax) and pydantic v2 (for field_validator).

python
import asyncio
import hashlib
import logging
from enum import Enum
from typing import Any

from pydantic import BaseModel, Field, field_validator

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("cc_license_router")

class CCLicense(str, Enum):
    CC0 = "https://creativecommons.org/publicdomain/zero/1.0/"
    CC_BY = "https://creativecommons.org/licenses/by/4.0/"
    CC_BY_SA = "https://creativecommons.org/licenses/by-sa/4.0/"
    CC_BY_NC = "https://creativecommons.org/licenses/by-nc/4.0/"
    CC_BY_ND = "https://creativecommons.org/licenses/by-nd/4.0/"
    CC_BY_NC_SA = "https://creativecommons.org/licenses/by-nc-sa/4.0/"
    CC_BY_NC_ND = "https://creativecommons.org/licenses/by-nc-nd/4.0/"

class AssetRightsPayload(BaseModel):
    asset_id: str = Field(..., min_length=3, pattern=r"^[A-Z0-9\-]+$")
    raw_rights_statement: str
    rights_source: str = "legacy_cms"
    embargo_until: str | None = None

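    # Run after pydantic's own str validation so this method always
    # receives a str and can call .strip() safely.
    @field_validator("raw_rights_statement", mode="after")
    @classmethod
    def normalize_statement(cls, v: str) -> str:
        return v.strip().lower().replace("creative commons", "cc")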

class RoutedAsset(BaseModel):
    asset_id: str
    license_uri: str
    routing_destination: str
    validation_status: str = "passed"
    metadata_hash: str | None = None

class LicenseRouter:
    def __init__(self, max_concurrency: int = 10):
        self.semaphore = asyncio.Semaphore(max_concurrency)
        self.dead_letter_queue: list[dict[str, Any]] = []
        self._mapping = {
            "cc0": CCLicense.CC0,
            "cc by": CCLicense.CC_BY,
            "cc by-sa": CCLicense.CC_BY_SA,
            "cc by-nc": CCLicense.CC_BY_NC,
            "cc by-nd": CCLicense.CC_BY_ND,
            "cc by-nc-sa": CCLicense.CC_BY_NC_SA,
            "cc by-nc-nd": CCLicense.CC_BY_NC_ND,
        }

    def _resolve_license(self, statement: str) -> CCLicense | None:
        # Match the most specific key first: "cc by" is a substring of
        # "cc by-nc-nd", so test longest keys before shortest to avoid
        # mis-tagging every by-* variant as plain CC BY.
        for key in sorted(self._mapping, key=len, reverse=True):
            if key in statement:
                return self._mapping[key]
        return None

    async def process_asset(self, payload: AssetRightsPayload) -> RoutedAsset | None:
        async with self.semaphore:
            try:
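                license_uri = self._resolve_license(payload.raw_rights_statement)
                if not license_uri:
                    raise ValueError("Unrecognized CC license pattern")

                # Use .value explicitly: formatting a (str, Enum) member in an
                # f-string yields the member name, not the URI, on Python 3.11+,
                # which would make the hash depend on the interpreter version.
                uri = license_uri.value
                # Decide from the resolved URI, not the free text, which may
                # contain the word "by" for unrelated reasons.
                destination = "iiif_manifest" if "/licenses/by" in uri else "internal_archive"
                metadata_hash = hashlib.sha256(f"{payload.asset_id}:{uri}".encode()).hexdigest()

                return RoutedAsset(
                    asset_id=payload.asset_id,
                    license_uri=uri,
                    routing_destination=destination,
                    metadata_hash=metadata_hash
                )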
            except Exception as exc:
                logger.warning("Routing failed for %s: %s", payload.asset_id, exc)
                self.dead_letter_queue.append({
                    "asset_id": payload.asset_id,
                    "error": str(exc),
                    "raw_statement": payload.raw_rights_statement
                })
                return None

    async def run_batch(self, payloads: list[AssetRightsPayload]) -> list[RoutedAsset]:
        tasks = [self.process_asset(p) for p in payloads]
        # process_asset handles its own errors, so return_exceptions=True is a
        # safety net: anything unexpected comes back as a result instead of
        # propagating out of gather(). Keep only successfully routed assets.
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return [r for r in results if isinstance(r, RoutedAsset)]

IIIF and LIDO Schema Alignment

Normalized URIs must map directly to presentation layer specifications. IIIF Presentation API 3.0 expects the rights property to hold a single URI string. LIDO records rights on the work in <rightsWorkSet> blocks (inside <rightsWorkWrap>) that carry <rightsType> and <rightsHolder> elements. The router output feeds directly into these XML/JSON structures. Implementing Embargo Workflows handles temporal restrictions before IIIF manifest generation.
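The sketch below shows one way a resolved license URI might be written into both targets. The function names, the example manifest URL, and the rights holder are hypothetical. The LIDO element layout is an assumption based on LIDO 1.0 and should be validated against the schema. The IIIF output is only a fragment, since a complete manifest also needs a label and items. The examples in the IIIF specification use the http:// form of CC URIs, so it is worth checking which scheme your validator accepts.

```python
import json
import xml.etree.ElementTree as ET

LIDO_NS = "http://www.lido-schema.org"
ET.register_namespace("lido", LIDO_NS)


def q(tag: str) -> str:
    """Qualify a tag or attribute name with the LIDO namespace."""
    return f"{{{LIDO_NS}}}{tag}"


def iiif_rights_fragment(manifest_id: str, license_uri: str) -> dict:
    # Fragment only: "rights" takes a single URI string.
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": manifest_id,
        "type": "Manifest",
        "rights": license_uri,
    }


def lido_rights_work_set(license_uri: str, label: str, holder: str) -> ET.Element:
    # Assumed layout: rightsType holds a conceptID plus a readable term;
    # rightsHolder holds a legalBodyName with an appellationValue.
    rights_set = ET.Element(q("rightsWorkSet"))
    rights_type = ET.SubElement(rights_set, q("rightsType"))
    concept = ET.SubElement(rights_type, q("conceptID"), {q("type"): "URI"})
    concept.text = license_uri
    ET.SubElement(rights_type, q("term")).text = label
    holder_el = ET.SubElement(rights_set, q("rightsHolder"))
    name = ET.SubElement(holder_el, q("legalBodyName"))
    ET.SubElement(name, q("appellationValue")).text = holder
    return rights_set


uri = "https://creativecommons.org/licenses/by/4.0/"
fragment = iiif_rights_fragment("https://example.org/iiif/OBJ-1/manifest", uri)
rights_set = lido_rights_work_set(uri, "CC BY 4.0", "Example Museum")
print(json.dumps(fragment, indent=2))
print(ET.tostring(rights_set, encoding="unicode"))
```

Both outputs take the same URI string as input, which keeps the IIIF and LIDO exports from drifting apart.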

Fallback Chains and Telemetry

Ambiguous rights statements require deterministic fallback routing. The engine routes unrecognized patterns to a staging endpoint for curator review. Structured telemetry captures routing decisions, validation failures, and hash collisions. Audit logs must comply with institutional retention policies. Automating CC-BY-NC-ND Tagging in Python extends this pipeline to assets released under non-commercial, no-derivatives terms.
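A fallback chain with structured telemetry could look roughly like the following sketch. The two resolvers, the destination names license_router and curator_review, and the logger name are all hypothetical. The point is only the shape: try resolvers in a fixed order, record which one succeeded, and send everything else to review.

```python
import json
import logging
import re
from typing import Callable, Optional

logger = logging.getLogger("cc_license_router.telemetry")

# Hypothetical lookup table for statements that are already clean labels.
_EXACT = {
    "cc by 4.0": "https://creativecommons.org/licenses/by/4.0/",
    "cc0 1.0": "https://creativecommons.org/publicdomain/zero/1.0/",
}
_URL = re.compile(
    r"https?://creativecommons\.org/(?:licenses|publicdomain)/[a-z\-]+/\d\.\d/"
)


def by_exact_label(statement: str) -> Optional[str]:
    return _EXACT.get(statement.strip().lower())


def by_embedded_url(statement: str) -> Optional[str]:
    match = _URL.search(statement.lower())
    return match.group(0) if match else None


# Order matters: the same input always walks the same chain.
FALLBACK_CHAIN: list[tuple[str, Callable[[str], Optional[str]]]] = [
    ("exact_label", by_exact_label),
    ("embedded_url", by_embedded_url),
]


def route_with_fallback(asset_id: str, statement: str) -> dict:
    record = {
        "asset_id": asset_id,
        "resolver": None,
        "license_uri": None,
        "destination": "curator_review",  # default when nothing matches
    }
    for name, resolver in FALLBACK_CHAIN:
        uri = resolver(statement)
        if uri is not None:
            record.update(resolver=name, license_uri=uri, destination="license_router")
            break
    # One JSON object per decision keeps the audit log machine-readable.
    logger.info(json.dumps(record, sort_keys=True))
    return record
```

Logging the resolver name alongside the result lets curators see later which rule produced a given license assignment.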

External validation against the IIIF Presentation API 3.0 specification ensures manifest compatibility. LIDO schema definitions are maintained at lido-schema.org. Canonical CC RDF vocabulary provides authoritative URI resolution paths.

Conclusion

The license resolver’s correctness depends on matching longest keys first. Without that sort, "cc by" matches before "cc by-nc-nd" and every derivative of CC BY gets mis-tagged as plain attribution. The sorted(self._mapping, key=len, reverse=True) pattern is the specific fix. Unrecognized statements route to the dead-letter queue for curator review rather than defaulting to any license — including CC0, which would inadvertently surrender rights.
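The effect of the sort can be seen in a two-key reduction of the mapping. The names MAPPING, resolve_naive, and resolve_longest_first are illustrative only, and the values are shortened license codes rather than full URIs.

```python
# Insertion order puts the short key first, as in the router's mapping.
MAPPING = {
    "cc by": "by",
    "cc by-nc-nd": "by-nc-nd",
}


def resolve_naive(statement: str):
    # Iterates in insertion order, so "cc by" wins for every by-* variant.
    for key in MAPPING:
        if key in statement:
            return MAPPING[key]
    return None


def resolve_longest_first(statement: str):
    # Longest keys are tested first, so the most specific license wins.
    for key in sorted(MAPPING, key=len, reverse=True):
        if key in statement:
            return MAPPING[key]
    return None


statement = "cc by-nc-nd 4.0"
print(resolve_naive(statement))          # by
print(resolve_longest_first(statement))  # by-nc-nd
```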